DENCAST: distributed density-based clustering for multi-target regression
Abstract Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENC...
Main Authors: | Roberto Corizzo, Gianvito Pio, Michelangelo Ceci, Donato Malerba |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen 2019-06-01 |
Series: | Journal of Big Data |
Subjects: | Distributed clustering, Multi-target regression, Apache Spark |
Online Access: | http://link.springer.com/article/10.1186/s40537-019-0207-2 |
_version_ | 1818937641516662784 |
---|---|
author | Roberto Corizzo Gianvito Pio Michelangelo Ceci Donato Malerba |
author_facet | Roberto Corizzo Gianvito Pio Michelangelo Ceci Donato Malerba |
author_sort | Roberto Corizzo |
collection | DOAJ |
description | Abstract Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value < 0.05) state-of-the-art distributed regression methods, in both single and multi-target settings. |
first_indexed | 2024-12-20T05:55:11Z |
format | Article |
id | doaj.art-f58cd3d407cb4a97887a4025b1a5fe95 |
institution | Directory Open Access Journal |
issn | 2196-1115 |
language | English |
last_indexed | 2024-12-20T05:55:11Z |
publishDate | 2019-06-01 |
publisher | SpringerOpen |
record_format | Article |
series | Journal of Big Data |
spelling | doaj.art-f58cd3d407cb4a97887a4025b1a5fe952022-12-21T19:51:05ZengSpringerOpenJournal of Big Data2196-11152019-06-016112710.1186/s40537-019-0207-2DENCAST: distributed density-based clustering for multi-target regressionRoberto Corizzo0Gianvito Pio1Michelangelo Ceci2Donato Malerba3Department of Computer Science, University of Bari Aldo MoroDepartment of Computer Science, University of Bari Aldo MoroDepartment of Computer Science, University of Bari Aldo MoroDepartment of Computer Science, University of Bari Aldo MoroAbstract Recent developments in sensor networks and mobile computing led to a huge increase in data generated that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus, solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: It is able to significantly outperform (p-value < 0.05) state-of-the-art distributed regression methods, in both single and multi-target settings.http://link.springer.com/article/10.1186/s40537-019-0207-2Distributed clusteringMulti-target regressionApache Spark |
spellingShingle | Roberto Corizzo Gianvito Pio Michelangelo Ceci Donato Malerba DENCAST: distributed density-based clustering for multi-target regression Journal of Big Data Distributed clustering Multi-target regression Apache Spark |
title | DENCAST: distributed density-based clustering for multi-target regression |
title_full | DENCAST: distributed density-based clustering for multi-target regression |
title_fullStr | DENCAST: distributed density-based clustering for multi-target regression |
title_full_unstemmed | DENCAST: distributed density-based clustering for multi-target regression |
title_short | DENCAST: distributed density-based clustering for multi-target regression |
title_sort | dencast distributed density based clustering for multi target regression |
topic | Distributed clustering Multi-target regression Apache Spark |
url | http://link.springer.com/article/10.1186/s40537-019-0207-2 |
work_keys_str_mv | AT robertocorizzo dencastdistributeddensitybasedclusteringformultitargetregression AT gianvitopio dencastdistributeddensitybasedclusteringformultitargetregression AT michelangeloceci dencastdistributeddensitybasedclusteringformultitargetregression AT donatomalerba dencastdistributeddensitybasedclusteringformultitargetregression |
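
Illustrative sketch (not part of the catalog record, and not the authors' implementation): the abstract describes density-based clustering that relies on locality sensitive hashing and then uses the clusters for multi-target regression. The single-machine Python below only illustrates that general idea under loud assumptions. The function names (`lsh_buckets`, `cluster`, `cluster_means`), the choice of a random-projection hash for Euclidean distance, the DBSCAN-style labelling, the parameters `eps`, `min_pts`, `n_hashes`, `width`, and the use of a per-cluster mean as the prediction are all hypothetical choices made for this example. DENCAST itself runs on Apache Spark and may differ in every one of these details.

```python
import math
import random
from collections import defaultdict


def lsh_buckets(points, n_hashes=4, width=2.0, seed=0):
    """Group point indices by a random-projection (Euclidean) LSH signature."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Each hash is floor((a . x + b) / width) with Gaussian a and uniform b.
    hashes = [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
              for _ in range(n_hashes)]
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        key = tuple(math.floor((sum(a * x for a, x in zip(vec, p)) + b) / width)
                    for vec, b in hashes)
        buckets[key].append(i)
    return buckets


def cluster(points, eps=0.5, min_pts=3, **lsh_args):
    """DBSCAN-style labels; neighbours are searched only inside LSH buckets."""
    neighbours = defaultdict(set)
    for members in lsh_buckets(points, **lsh_args).values():
        for i in members:
            for j in members:
                if i != j and math.dist(points[i], points[j]) <= eps:
                    neighbours[i].add(j)
    # A core point has at least min_pts points (itself included) within eps.
    core = {i for i in range(len(points)) if len(neighbours[i]) + 1 >= min_pts}
    labels = [-1] * len(points)  # -1 marks noise
    next_label = 0
    for start in sorted(core):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # expand the cluster through core points only
            i = stack.pop()
            for j in neighbours[i]:
                if labels[j] == -1:
                    labels[j] = next_label
                    if j in core:
                        stack.append(j)
        next_label += 1
    return labels


def cluster_means(labels, targets):
    """Average every target column over the members of each cluster."""
    sums, counts = {}, defaultdict(int)
    for lab, t in zip(labels, targets):
        if lab == -1:
            continue
        acc = sums.setdefault(lab, [0.0] * len(t))
        for k, v in enumerate(t):
            acc[k] += v
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
```

Because neighbours are compared only within a bucket, two nearby points that hash to different buckets are missed. That is the usual approximation trade-off of LSH, and it is why the result here is approximate rather than an exact density-based clustering.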