Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree

Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its exte...


Bibliographic Details
Main Authors: Hyeong-Cheol Ryu, Sungwon Jung
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Clustering, BIRCH, CF⁺ tree, range query, very large data sets, MapReduce
Online Access: https://ieeexplore.ieee.org/document/9104976/
_version_ 1819275679426936832
author Hyeong-Cheol Ryu
Sungwon Jung
author_facet Hyeong-Cheol Ryu
Sungwon Jung
author_sort Hyeong-Cheol Ryu
collection DOAJ
description Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its extension CF⁺-ERC. CF⁺-ERC can reduce the clustering time of large data sets by utilizing the structure of a CF⁺ tree. However, CF⁺-ERC is a sequential clustering method, so it cannot be used with multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF⁺-ERC on MapReduce (CF⁺ERC_MR). It builds a CF⁺ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through not only its theoretical analysis but also in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods.
first_indexed 2024-12-23T23:28:09Z
format Article
id doaj.art-d25cc50492214137a8c472da6e935c8f
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-23T23:28:09Z
publishDate 2020-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-d25cc50492214137a8c472da6e935c8f
2022-12-21T17:26:09Z
eng
IEEE
IEEE Access
2169-3536
2020-01-01
8
104232
104246
10.1109/ACCESS.2020.2999085
9104976
Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
Hyeong-Cheol Ryu 0 https://orcid.org/0000-0002-4283-7242
Sungwon Jung 1 https://orcid.org/0000-0002-5332-5947
Department of Computer Science and Engineering, Sogang University, Seoul, South Korea
Department of Computer Science and Engineering, Sogang University, Seoul, South Korea
Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its extension CF⁺-ERC. CF⁺-ERC can reduce the clustering time of large data sets by utilizing the structure of a CF⁺ tree. However, CF⁺-ERC is a sequential clustering method, so it cannot be used with multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF⁺-ERC on MapReduce (CF⁺ERC_MR). It builds a CF⁺ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through not only its theoretical analysis but also in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods.
https://ieeexplore.ieee.org/document/9104976/
Clustering
BIRCH
CF⁺ tree
range query
very large data sets
MapReduce
spellingShingle Hyeong-Cheol Ryu
Sungwon Jung
Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
IEEE Access
Clustering
BIRCH
CF⁺ tree
range query
very large data sets
MapReduce
title Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
title_full Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
title_fullStr Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
title_full_unstemmed Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
title_short Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
title_sort mapreduce based distributed clustering method using cf sup sup tree
topic Clustering
BIRCH
CF⁺ tree
range query
very large data sets
MapReduce
url https://ieeexplore.ieee.org/document/9104976/
work_keys_str_mv AT hyeongcheolryu mapreducebaseddistributedclusteringmethodusingcfsupsuptree
AT sungwonjung mapreducebaseddistributedclusteringmethodusingcfsupsuptree