Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification


Bibliographic Details
Main Authors: Abraham Itzhak Weinberg, Mark Last
Format: Article
Language: English
Published: SpringerOpen, 2019-02-01
Series: Journal of Big Data
Subjects: Big data; Ensemble learning; Lazy ensemble evaluation; Decision trees; Editing distance; Tree similarity
Online Access: http://link.springer.com/article/10.1186/s40537-019-0186-3
Description: The goal of this paper is to reduce the classification (inference) complexity of tree ensembles by choosing a single representative model out of an ensemble of multiple decision-tree models. We compute the similarity between the different models in the ensemble and choose the model that is most similar to the others as the best representative of the entire dataset. The similarity-based approach is implemented with three different similarity metrics: a syntactic metric, a semantic metric, and a linear combination of the two. We compare this tree selection methodology to a popular ensemble algorithm (majority voting) and to the baseline of randomly choosing one of the local models. In addition, we evaluate two alternative tree selection strategies: choosing the tree with the highest validation accuracy and reducing the original ensemble to the five most representative trees. The comparative evaluation experiments are performed on six big datasets using two popular decision-tree algorithms (J48 and CART), splitting each dataset horizontally into six different numbers of equal-size slices (from 32 to 1024). In most experiments, the syntactic similarity approach, named SySM (Syntactic Similarity Method), provides significantly higher testing accuracy than the semantic and combined ones. The mean accuracy of SySM over all datasets is $0.835 \pm 0.065$ for CART and $0.769 \pm 0.066$ for J48. On the other hand, we find no statistically significant difference between the testing accuracy of the trees selected by SySM and the trees with the highest validation accuracy. Compared to ensemble algorithms, the representative models selected by the proposed methods provide faster big data classification while being more compact and interpretable.
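The selection idea in the abstract can be sketched in a few lines: train one local tree per horizontal data slice, compute all pairwise similarities, and keep the medoid-like tree with the highest mean similarity to the others. The paper's SySM metric is a syntactic (tree-edit-distance) similarity; as a simpler stand-in, this sketch measures semantic similarity as prediction agreement on a shared validation set. The dataset, slice count, and metric here are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a big dataset, split into a shared validation
# set plus 8 horizontal slices (the paper uses 32 to 1024 slices).
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_val, y_val = X[:500], y[:500]
slices = np.array_split(np.arange(500, 3000), 8)

# One local decision-tree model per slice.
trees = [DecisionTreeClassifier(random_state=0).fit(X[s], y[s]) for s in slices]

# Semantic similarity: fraction of validation points on which two trees agree.
preds = [t.predict(X_val) for t in trees]
n = len(trees)
sim = np.array([[np.mean(preds[i] == preds[j]) for j in range(n)]
                for i in range(n)])

# Representative tree = the one most similar on average to all the others
# (self-similarity excluded).
mean_sim = (sim.sum(axis=1) - 1.0) / (n - 1)
best = int(np.argmax(mean_sim))
print(f"representative tree: {best}, "
      f"validation accuracy: {trees[best].score(X_val, y_val):.3f}")
```

At inference time only `trees[best]` is evaluated, which is what yields the speed and interpretability gains the abstract claims over majority voting across the full ensemble.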
ISSN: 2196-1115