Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree

Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its exte...

Bibliographic Details
Main Authors:	Hyeong-Cheol Ryu, Sungwon Jung
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Clustering; BIRCH; CF⁺ tree; range query; very large data sets; MapReduce
Online Access:	https://ieeexplore.ieee.org/document/9104976/

author	Hyeong-Cheol Ryu, Sungwon Jung
collection	DOAJ
description	Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its extension CF⁺-ERC. CF⁺-ERC can reduce the clustering time of large data sets by utilizing the structure of a CF⁺ tree. However, CF⁺-ERC is a sequential clustering method, so it cannot be used with multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF⁺-ERC on MapReduce (CF⁺ERC_MR). It builds a CF⁺ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through not only its theoretical analysis but also in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods.
first_indexed	2024-12-23T23:28:09Z
format	Article
id	doaj.art-d25cc50492214137a8c472da6e935c8f
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-23T23:28:09Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-d25cc50492214137a8c472da6e935c8f | 2022-12-21T17:26:09Z | eng | IEEE | IEEE Access | 2169-3536 | 2020-01-01 | 8 | 104232 | 104246 | 10.1109/ACCESS.2020.2999085 | 9104976 | Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree | Hyeong-Cheol Ryu | 0 | https://orcid.org/0000-0002-4283-7242 | Sungwon Jung | 1 | https://orcid.org/0000-0002-5332-5947 | Department of Computer Science and Engineering, Sogang University, Seoul, South Korea | Department of Computer Science and Engineering, Sogang University, Seoul, South Korea | Clustering exceptionally large data sets is becoming a major challenge in data analytics with the continuous increase in their size. Summary-based clustering methods and distributed computing frameworks such as MapReduce can efficiently handle this challenge. These methods include BIRCH and its extension CF⁺-ERC. CF⁺-ERC can reduce the clustering time of large data sets by utilizing the structure of a CF⁺ tree. However, CF⁺-ERC is a sequential clustering method, so it cannot be used with multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF⁺-ERC on MapReduce (CF⁺ERC_MR). It builds a CF⁺ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through not only its theoretical analysis but also in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods. | https://ieeexplore.ieee.org/document/9104976/ | Clustering | BIRCH | CF⁺ tree | range query | very large data sets | MapReduce
title	Mapreduce-Based Distributed Clustering Method Using CF⁺ Tree
topic	Clustering; BIRCH; CF⁺ tree; range query; very large data sets; MapReduce
url	https://ieeexplore.ieee.org/document/9104976/
