A Fast Parallel Random Forest Algorithm Based on Spark
To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classificat...
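The binning step named in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation. It assumes the standard Gini impurity rather than the paper's redefined Gini coefficient, uses exact quantiles from a full sort rather than the paper's approximate method, and does not involve Spark. The function names `equal_frequency_candidates`, `gini`, and `best_split` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def equal_frequency_candidates(values: Sequence[float], n_bins: int) -> List[float]:
    """Return at most n_bins - 1 thresholds taken at equal-frequency bin boundaries."""
    ordered = sorted(values)
    n = len(ordered)
    candidates: List[float] = []
    for k in range(1, n_bins):
        # Index of the last sample that falls into the k-th bin.
        idx = (k * n) // n_bins - 1
        if 0 <= idx < n - 1:
            threshold = ordered[idx]
            # Skip repeated thresholds caused by tied feature values.
            if not candidates or threshold > candidates[-1]:
                candidates.append(threshold)
    return candidates


def gini(labels: Sequence[int]) -> float:
    """Standard Gini impurity (assumption: not the paper's modified coefficient)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(
    values: Sequence[float], labels: Sequence[int], n_bins: int = 4
) -> Tuple[Optional[float], float]:
    """Evaluate only the bin boundaries and return (threshold, weighted impurity)."""
    n = len(values)
    best_threshold: Optional[float] = None
    best_score = float("inf")
    for t in equal_frequency_candidates(values, n_bins):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # A split that leaves one side empty is not useful.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score
```

With eight distinct feature values and four bins, this sketch scores three thresholds instead of the seven boundaries between consecutive sorted values, which is the kind of reduction in Gini calculations the abstract describes.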
Main Authors: | Linzi Yin, Ken Chen, Zhaohui Jiang, Xuemei Xu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-05-01 |
Series: | Applied Sciences |
Subjects: | Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index |
Online Access: | https://www.mdpi.com/2076-3417/13/10/6121 |
_version_ | 1797601162133766144 |
---|---|
author | Linzi Yin; Ken Chen; Zhaohui Jiang; Xuemei Xu |
author_facet | Linzi Yin; Ken Chen; Zhaohui Jiang; Xuemei Xu |
author_sort | Linzi Yin |
collection | DOAJ |
description | To improve computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy and thereby raise classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, the forest sampling index (FSI) table is defined to speed up the parallel training of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while maintaining classification accuracy, and that it is superior to Spark-MLRF in terms of performance and scalability. |
first_indexed | 2024-03-11T03:58:28Z |
format | Article |
id | doaj.art-35d3ebe5d17b4ed6ac69da130e9c028e |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T03:58:28Z |
publishDate | 2023-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-35d3ebe5d17b4ed6ac69da130e9c028e; 2023-11-18T00:20:54Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-05-01; vol. 13, no. 10, art. 6121; doi:10.3390/app13106121; A Fast Parallel Random Forest Algorithm Based on Spark; Linzi Yin (School of Physics and Electronics, Central South University, Changsha 410012, China); Ken Chen (School of Physics and Electronics, Central South University, Changsha 410012, China); Zhaohui Jiang (School of Automation, Central South University, Changsha 410012, China); Xuemei Xu (School of Physics and Electronics, Central South University, Changsha 410012, China); https://www.mdpi.com/2076-3417/13/10/6121; Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index |
spellingShingle | Linzi Yin; Ken Chen; Zhaohui Jiang; Xuemei Xu; A Fast Parallel Random Forest Algorithm Based on Spark; Applied Sciences; Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index |
title | A Fast Parallel Random Forest Algorithm Based on Spark |
title_sort | fast parallel random forest algorithm based on spark |
topic | Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index |
url | https://www.mdpi.com/2076-3417/13/10/6121 |