Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification
Abstract: The goal of this paper is to reduce the classification (inference) complexity of tree ensembles by choosing a single representative model out of an ensemble of multiple decision-tree models. We compute the similarity between the different models in the ensemble and choose the model that is most similar to the others as the best representative of the entire dataset. The similarity-based approach is implemented with three different similarity metrics: a syntactic one, a semantic one, and a linear combination of the two. We compare this tree selection methodology to a popular ensemble algorithm (majority voting) and to the baseline of randomly choosing one of the local models. In addition, we evaluate two alternative tree selection strategies: choosing the tree with the highest validation accuracy and reducing the original ensemble to the five most representative trees. The comparative evaluation experiments are performed on six big datasets using two popular decision-tree algorithms (J48 and CART), splitting each dataset horizontally into six different numbers of equal-size slices (from 32 to 1024). In most experiments, the syntactic similarity approach, named SySM (Syntactic Similarity Method), provides a significantly higher testing accuracy than the semantic and combined ones. The mean accuracy of SySM over all datasets is $0.835 \pm 0.065$ for CART and $0.769 \pm 0.066$ for J48. On the other hand, we find no statistically significant difference between the testing accuracy of the trees selected by SySM and the trees with the highest validation accuracy. Compared to ensemble algorithms, the representative models selected by the proposed methods provide a higher speed for big data classification, along with being more compact and interpretable.
Main Authors: | Abraham Itzhak Weinberg, Mark Last (Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev) |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2019-02-01 |
Series: | Journal of Big Data |
ISSN: | 2196-1115 |
Subjects: | Big data; Ensemble learning; Lazy ensemble evaluation; Decision trees; Editing distance; Tree similarity |
Online Access: | http://link.springer.com/article/10.1186/s40537-019-0186-3 |
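The abstract describes selecting the single tree that is, on average, most similar to the other trees in the ensemble. Below is a minimal sketch of the semantic (prediction-agreement) variant of that selection step, assuming the ensemble is a list of fitted scikit-learn DecisionTreeClassifier models (a CART-style implementation) and a held-out validation set; the paper's syntactic metric (SySM, based on tree editing distance) and its J48/CART experimental setup are not reproduced here.

```python
# Hypothetical sketch of representative-tree selection by semantic
# (prediction-agreement) similarity, as outlined in the abstract.
# Assumes fitted scikit-learn DecisionTreeClassifier models; the paper's
# syntactic (tree-edit-distance) metric is not implemented here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # CART-style trees


def semantic_similarity(tree_a, tree_b, X_val):
    """Fraction of validation instances on which the two trees agree."""
    return float(np.mean(tree_a.predict(X_val) == tree_b.predict(X_val)))


def select_representative_tree(trees, X_val):
    """Return the tree with the highest mean similarity to all other trees."""
    n = len(trees)
    mean_sim = np.zeros(n)
    for i in range(n):
        sims = [semantic_similarity(trees[i], trees[j], X_val)
                for j in range(n) if j != i]
        mean_sim[i] = np.mean(sims)
    return trees[int(np.argmax(mean_sim))]


# Illustrative usage (hypothetical variable names): train one tree per
# horizontal slice of the training data, then pick the representative tree.
# slices = np.array_split(np.arange(len(X_train)), 32)  # paper uses 32-1024 slices
# trees = [DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
#          for idx in slices]
# best_tree = select_representative_tree(trees, X_val)
```

The selected tree can then be used on its own at inference time, which is what gives the single-model approach its speed and interpretability advantage over evaluating the full ensemble.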