MapReduce-Based Distributed Clustering Method Using CF<sup>+</sup> Tree
Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its exte...
Main Authors: | Hyeong-Cheol Ryu, Sungwon Jung |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
Subjects: | Clustering; BIRCH; CF⁺ tree; range query; very large data sets; MapReduce |
Online Access: | https://ieeexplore.ieee.org/document/9104976/ |
_version_ | 1819275679426936832 |
---|---|
author | Hyeong-Cheol Ryu; Sungwon Jung |
collection | DOAJ |
description | Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its extension CF<sup>+</sup>-ERC. CF<sup>+</sup>-ERC can reduce the clustering time of large data sets by utilizing the structure of a CF<sup>+</sup> tree. However, CF<sup>+</sup>-ERC is a sequential clustering method, so it cannot be used with multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF<sup>+</sup>-ERC on MapReduce (CF<sup>+</sup>ERC_MR). It builds a CF<sup>+</sup> tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through not only its theoretical analysis but also in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods. |
first_indexed | 2024-12-23T23:28:09Z |
format | Article |
id | doaj.art-d25cc50492214137a8c472da6e935c8f |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-23T23:28:09Z |
publishDate | 2020-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
doi | 10.1109/ACCESS.2020.2999085 |
volume | 8 |
pages | 104232-104246 |
orcid | Hyeong-Cheol Ryu: https://orcid.org/0000-0002-4283-7242; Sungwon Jung: https://orcid.org/0000-0002-5332-5947 |
affiliation | Department of Computer Science and Engineering, Sogang University, Seoul, South Korea (both authors) |
title | MapReduce-Based Distributed Clustering Method Using CF<sup>+</sup> Tree |
topic | Clustering; BIRCH; CF⁺ tree; range query; very large data sets; MapReduce |
url | https://ieeexplore.ieee.org/document/9104976/ |
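
The description in this record mentions summary-based clustering (BIRCH) combined with MapReduce. As a rough illustration of why the two fit together, the sketch below uses the standard BIRCH clustering feature, CF = (N, LS, SS), whose components add when two summaries are merged. It is not the authors' CF<sup>+</sup>-ERC or CF<sup>+</sup>ERC_MR algorithm, and it does not build a CF<sup>+</sup> tree; the names `CF`, `map_partition`, and `reduce_summaries` are hypothetical, and the "map" and "reduce" steps are plain Python functions standing in for a real MapReduce job.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class CF:
    """Standard BIRCH clustering feature (illustrative, not the CF+ tree entry)."""
    n: int        # number of points summarized
    ls: tuple     # per-dimension linear sum of the points
    ss: float     # sum of squared norms of the points

    @staticmethod
    def of_point(p):
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CF additivity: merging two summaries is componentwise addition.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root-mean-square distance of the points from the centroid.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def map_partition(points):
    """'Map' step (assumed): summarize one non-empty data partition as a CF."""
    return reduce(CF.merge, (CF.of_point(p) for p in points))


def reduce_summaries(cfs):
    """'Reduce' step (assumed): combine the per-partition CFs into one."""
    return reduce(CF.merge, cfs)


# Two partitions, as if held by two machines (made-up toy data).
partitions = [[(0.0, 0.0), (2.0, 0.0)], [(0.0, 2.0), (2.0, 2.0)]]
total = reduce_summaries(map_partition(part) for part in partitions)
print(total.n, total.centroid(), total.radius())
```

Because merging is associative and commutative, the per-partition summaries can be combined in any order, which is the property that makes CF-style summaries a natural fit for a distributed framework.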