Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification


Bibliographic Details
Main Authors: Abraham Itzhak Weinberg, Mark Last
Format: Article
Language: English
Published: SpringerOpen, 2019-02-01
Series: Journal of Big Data
Subjects: Big data; Ensemble learning; Lazy ensemble evaluation; Decision trees; Editing distance; Tree similarity
Online Access: http://link.springer.com/article/10.1186/s40537-019-0186-3
Description: The goal of this paper is to reduce the classification (inference) complexity of tree ensembles by choosing a single representative model out of an ensemble of multiple decision-tree models. We compute the similarity between the different models in the ensemble and choose the model that is most similar to the others as the best representative of the entire dataset. The similarity-based approach is implemented with three different similarity metrics: a syntactic metric, a semantic metric, and a linear combination of the two. We compare this tree selection methodology to a popular ensemble algorithm (majority voting) and to the baseline of randomly choosing one of the local models. In addition, we evaluate two alternative tree selection strategies: choosing the tree with the highest validation accuracy and reducing the original ensemble to the five most representative trees. The comparative evaluation experiments are performed on six big datasets using two popular decision-tree algorithms (J48 and CART), splitting each dataset horizontally into six different numbers of equal-size slices (from 32 to 1024). In most experiments, the syntactic similarity approach, named SySM (Syntactic Similarity Method), provides significantly higher testing accuracy than the semantic and combined ones. The mean accuracy of SySM over all datasets is $0.835 \pm 0.065$ for CART and $0.769 \pm 0.066$ for J48. On the other hand, we find no statistically significant difference between the testing accuracy of the trees selected by SySM and the trees with the highest validation accuracy. Compared to ensemble algorithms, the representative models selected by the proposed methods provide faster big data classification while being more compact and interpretable.
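The selection idea in the abstract can be sketched in a few lines: train one local tree per horizontal data slice, compute all pairwise similarities, and keep the medoid-like tree with the highest mean similarity to the others. The paper's SySM metric is a syntactic (tree-edit-distance) similarity; as a simpler stand-in, this sketch measures semantic similarity as prediction agreement on a shared validation set. The dataset, slice count, and metric here are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a big dataset, split into a shared validation
# set plus 8 horizontal slices (the paper uses 32 to 1024 slices).
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_val, y_val = X[:500], y[:500]
slices = np.array_split(np.arange(500, 3000), 8)

# One local decision-tree model per slice.
trees = [DecisionTreeClassifier(random_state=0).fit(X[s], y[s]) for s in slices]

# Semantic similarity: fraction of validation points on which two trees agree.
preds = [t.predict(X_val) for t in trees]
n = len(trees)
sim = np.array([[np.mean(preds[i] == preds[j]) for j in range(n)]
                for i in range(n)])

# Representative tree = the one most similar on average to all the others
# (self-similarity excluded).
mean_sim = (sim.sum(axis=1) - 1.0) / (n - 1)
best = int(np.argmax(mean_sim))
print(f"representative tree: {best}, "
      f"validation accuracy: {trees[best].score(X_val, y_val):.3f}")
```

At inference time only `trees[best]` is evaluated, which is what yields the speed and interpretability gains the abstract claims over majority voting across the full ensemble.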
ISSN: 2196-1115