Distributed Clustering of Text Collections


Bibliographic Details
Main Authors: Juan Zamora, Hector Allende-Cid, Marcelo Mendoza
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Distributed algorithms; distributed text clustering; high dimensional data
Online Access: https://ieeexplore.ieee.org/document/8882328/
author Juan Zamora
Hector Allende-Cid
Marcelo Mendoza
collection DOAJ
description Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists of distributing the data across several computers, each of which solves part of the problem. These partial answers are then integrated to obtain a final solution. We introduce distributed shared nearest neighbors (D-SNN), a novel clustering algorithm that works with disjoint partitions of the data. Our algorithm produces a global clustering solution whose performance is competitive with that of centralized approaches. The algorithm works effectively with high-dimensional data, which makes it suitable for document clustering tasks. Experimental results on five data sets show that our proposal is competitive in terms of quality performance measures when compared to state-of-the-art methods.
first_indexed 2024-12-14T00:01:08Z
format Article
id doaj.art-aaf0d0b3e1a1492486eb339f7905b6de
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-14T00:01:08Z
publishDate 2019-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-aaf0d0b3e1a1492486eb339f7905b6de; 2022-12-21T23:26:19Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2019-01-01; vol. 7, pp. 155671-155685; DOI 10.1109/ACCESS.2019.2949455; article 8882328
Distributed Clustering of Text Collections
Juan Zamora (https://orcid.org/0000-0003-0003-182X), Instituto de Estadística, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
Hector Allende-Cid, Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
Marcelo Mendoza, Centro Científico y Tecnológico de Valparaíso, Universidad Técnica Federico Santa María, Valparaíso, Chile
https://ieeexplore.ieee.org/document/8882328/
Distributed algorithms; distributed text clustering; high dimensional data
title Distributed Clustering of Text Collections
topic Distributed algorithms
distributed text clustering
high dimensional data
url https://ieeexplore.ieee.org/document/8882328/