DENCAST: distributed density-based clustering for multi-target regression

Abstract: Recent developments in sensor networks and mobile computing have led to a huge increase in generated data that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENC...

Bibliographic Details
Main Authors: Roberto Corizzo, Gianvito Pio, Michelangelo Ceci, Donato Malerba
Format: Article
Language: English
Published: SpringerOpen 2019-06-01
Series: Journal of Big Data
Subjects: Distributed clustering, Multi-target regression, Apache Spark
Online Access: http://link.springer.com/article/10.1186/s40537-019-0207-2
collection DOAJ
description Abstract: Recent developments in sensor networks and mobile computing have led to a huge increase in generated data that need to be processed and analyzed efficiently. In this context, many distributed data mining algorithms have recently been proposed. Following this line of research, we propose the DENCAST system, a novel distributed algorithm implemented in Apache Spark, which performs density-based clustering and exploits the identified clusters to solve both single- and multi-target regression tasks (and thus solves complex tasks such as time series prediction). Contrary to existing distributed methods, DENCAST does not require a final merging step (usually performed on a single machine) and is able to handle large-scale, high-dimensional data by taking advantage of locality-sensitive hashing. Experiments show that DENCAST performs clustering more efficiently than a state-of-the-art distributed clustering algorithm, especially when the number of objects increases significantly. The quality of the extracted clusters is confirmed by the predictive capabilities of DENCAST on several datasets: it significantly outperforms (p-value $<0.05$) state-of-the-art distributed regression methods in both single- and multi-target settings.
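The pipeline outlined in the abstract (hashing to find candidate neighbors, density-based clustering, then cluster-based prediction of several targets at once) can be illustrated on a single machine. The following is a minimal Python sketch under stated assumptions, not the authors' Spark implementation: every function name is hypothetical, and the choice of projection-based Euclidean hashing, DBSCAN-style label propagation, and per-cluster mean prediction is an assumption made for illustration only.

```python
import math
import random
from collections import defaultdict


def make_hashes(dim, n_tables, width, seed=0):
    """Draw random (direction, offset) pairs for projection-based Euclidean LSH."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
            for _ in range(n_tables)]


def lsh_neighbors(points, hashes, width, eps):
    """Approximate eps-neighborhoods: only points sharing a bucket are compared."""
    neighbors = defaultdict(set)
    for direction, offset in hashes:
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            proj = sum(d * x for d, x in zip(direction, p))
            buckets[math.floor((proj + offset) / width)].append(i)
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    i, j = members[a], members[b]
                    # Candidate pairs are verified with the exact distance.
                    if math.dist(points[i], points[j]) <= eps:
                        neighbors[i].add(j)
                        neighbors[j].add(i)
    return neighbors


def density_clusters(points, neighbors, min_pts):
    """Propagate cluster ids outward from core points; label -1 marks noise."""
    core = {i for i in range(len(points)) if len(neighbors[i]) + 1 >= min_pts}
    labels = [-1] * len(points)
    next_id = 0
    for start in range(len(points)):
        if start not in core or labels[start] != -1:
            continue
        labels[start] = next_id
        frontier = [start]
        while frontier:
            i = frontier.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = next_id
                    if j in core:  # only core points keep expanding the cluster
                        frontier.append(j)
        next_id += 1
    return labels


def predict_targets(labels, targets):
    """Predict each point without targets as the mean target vector of its cluster."""
    sums, counts = {}, defaultdict(int)
    for label, t in zip(labels, targets):
        if label == -1 or t is None:
            continue
        prev = sums.get(label, [0.0] * len(t))
        sums[label] = [s + v for s, v in zip(prev, t)]
        counts[label] += 1
    means = {c: tuple(s / counts[c] for s in sums[c]) for c in sums}
    # Noise points have no cluster mean, so their prediction is None.
    return {i: means.get(label)
            for i, (label, t) in enumerate(zip(labels, targets)) if t is None}


# Toy data: two dense groups plus one isolated point; None marks unknown targets.
points = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
          (5.1, 5.1), (5.2, 5.1), (5.1, 5.2),
          (9.5, 0.5)]
targets = [(1.0, 10.0), (3.0, 30.0), None, (5.0, 50.0), None, None, None]
hashes = [((1.0, 0.0), 0.0)]  # one fixed hash so the demo is reproducible
neighbors = lsh_neighbors(points, hashes, width=1.0, eps=0.5)
labels = density_clusters(points, neighbors, min_pts=3)
predictions = predict_targets(labels, targets)
```

In practice one would draw several random hash tables with `make_hashes` rather than fixing a single one, since any single hash can place two true neighbors in different buckets. The sketch predicts both target dimensions from the same cluster assignment, which is the sense in which clustering is reused here for a multi-target task.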
id doaj.art-f58cd3d407cb4a97887a4025b1a5fe95
institution Directory Open Access Journal
issn 2196-1115
spelling doaj.art-f58cd3d407cb4a97887a4025b1a5fe95; 2022-12-21T19:51:05Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2019-06-01; 61127; 10.1186/s40537-019-0207-2; DENCAST: distributed density-based clustering for multi-target regression; Roberto Corizzo [0], Gianvito Pio [1], Michelangelo Ceci [2], Donato Malerba [3]; [0]-[3] Department of Computer Science, University of Bari Aldo Moro; http://link.springer.com/article/10.1186/s40537-019-0207-2; Distributed clustering; Multi-target regression; Apache Spark
topic Distributed clustering
Multi-target regression
Apache Spark