A Fast Parallel Random Forest Algorithm Based on Spark

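To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, the forest sampling index (FSI) table is defined to speed up the parallel training process of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while ensuring classification accuracy, and is superior to Spark-MLRF in terms of performance and scalability.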
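
The idea of limiting candidate split points for a continuous feature can be illustrated with a minimal sketch. It is not the authors' method: it assumes the standard Gini impurity (not the redefined coefficient from the article), uses plain exact equal-frequency binning on a sorted list (not the article's approximate variant), and runs in a single process without Spark. The function names `gini`, `equal_frequency_cuts`, and `best_split` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Standard Gini impurity, 1 - sum(p_k^2); assumed here, not the article's variant."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def equal_frequency_cuts(values: Sequence[float], n_bins: int) -> List[float]:
    """Return thresholds at the boundaries of bins holding roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    cuts: List[float] = []
    for b in range(1, n_bins):
        idx = (b * n) // n_bins          # number of samples in the first b bins
        if idx == 0:
            continue
        t = ordered[idx - 1]             # largest value inside bin b
        # Skip duplicates and the maximum (splitting there leaves the right side empty).
        if t < ordered[-1] and (not cuts or t > cuts[-1]):
            cuts.append(t)
    return cuts


def best_split(
    values: Sequence[float], labels: Sequence[int], n_bins: int
) -> Optional[Tuple[float, float]]:
    """Evaluate only the bin boundaries; return (threshold, weighted Gini) or None."""
    n = len(values)
    best: Optional[Tuple[float, float]] = None
    for t in equal_frequency_cuts(values, n_bins):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

With eight distinct feature values and `n_bins=4`, this sketch scores three thresholds instead of the seven that an exhaustive search over every gap between sorted values would consider.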

Bibliographic Details
Main Authors:	Linzi Yin, Ken Chen, Zhaohui Jiang, Xuemei Xu
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Applied Sciences
Subjects:	Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index
Online Access:	https://www.mdpi.com/2076-3417/13/10/6121

author	Linzi Yin, Ken Chen, Zhaohui Jiang, Xuemei Xu
author_facet	Linzi Yin, Ken Chen, Zhaohui Jiang, Xuemei Xu
author_sort	Linzi Yin
collection	DOAJ
description	To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, the forest sampling index (FSI) table is defined to speed up the parallel training process of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while ensuring classification accuracy, and is superior to Spark-MLRF in terms of performance and scalability.
first_indexed	2024-03-11T03:58:28Z
format	Article
id	doaj.art-35d3ebe5d17b4ed6ac69da130e9c028e
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-03-11T03:58:28Z
publishDate	2023-05-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-35d3ebe5d17b4ed6ac69da130e9c028e | 2023-11-18T00:20:54Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2023-05-01 | 13(10), 6121 | 10.3390/app13106121 | A Fast Parallel Random Forest Algorithm Based on Spark | Linzi Yin (School of Physics and Electronics, Central South University, Changsha 410012, China); Ken Chen (School of Physics and Electronics, Central South University, Changsha 410012, China); Zhaohui Jiang (School of Automation, Central South University, Changsha 410012, China); Xuemei Xu (School of Physics and Electronics, Central South University, Changsha 410012, China) | https://www.mdpi.com/2076-3417/13/10/6121 | Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index
title	A Fast Parallel Random Forest Algorithm Based on Spark
topic	Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index
url	https://www.mdpi.com/2076-3417/13/10/6121
