A Fast Parallel Random Forest Algorithm Based on Spark

To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classificat...

Bibliographic Details
Main Authors: Linzi Yin, Ken Chen, Zhaohui Jiang, Xuemei Xu
Format: Article
Language: English
Published: MDPI AG 2023-05-01
Series: Applied Sciences
Subjects: Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index
Online Access: https://www.mdpi.com/2076-3417/13/10/6121
author Linzi Yin
Ken Chen
Zhaohui Jiang
Xuemei Xu
collection DOAJ
description To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, the forest sampling index (FSI) table is defined to speed up the parallel training process of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while ensuring classification accuracy, and is superior to Spark-MLRF in terms of performance and scalability.
first_indexed 2024-03-11T03:58:28Z
format Article
id doaj.art-35d3ebe5d17b4ed6ac69da130e9c028e
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-11T03:58:28Z
publishDate 2023-05-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-35d3ebe5d17b4ed6ac69da130e9c028e
2023-11-18T00:20:54Z
eng
MDPI AG
Applied Sciences
2076-3417
2023-05-01
13
10
6121
10.3390/app13106121
A Fast Parallel Random Forest Algorithm Based on Spark
Linzi Yin, School of Physics and Electronics, Central South University, Changsha 410012, China
Ken Chen, School of Physics and Electronics, Central South University, Changsha 410012, China
Zhaohui Jiang, School of Automation, Central South University, Changsha 410012, China
Xuemei Xu, School of Physics and Electronics, Central South University, Changsha 410012, China
To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, the forest sampling index (FSI) table is defined to speed up the parallel training process of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while ensuring classification accuracy, and is superior to Spark-MLRF in terms of performance and scalability.
https://www.mdpi.com/2076-3417/13/10/6121
Apache Spark
approximate equal-frequency binning
Gini coefficient
forest sampling index
title A Fast Parallel Random Forest Algorithm Based on Spark
topic Apache Spark
approximate equal-frequency binning
Gini coefficient
forest sampling index
url https://www.mdpi.com/2076-3417/13/10/6121