A Fast Parallel Random Forest Algorithm Based on Spark

To improve the computational efficiency and classification accuracy in the context of big data, an optimized parallel random forest algorithm is proposed based on the Spark computing framework. First, a new Gini coefficient is defined to reduce the impact of feature redundancy for higher classificat...

Bibliographic Details
Main Authors:	Linzi Yin, Ken Chen, Zhaohui Jiang, Xuemei Xu
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Applied Sciences
Subjects:	Apache Spark; approximate equal-frequency binning; Gini coefficient; forest sampling index
Online Access:	https://www.mdpi.com/2076-3417/13/10/6121

Similar Items

Apache Spark ile Makine Öğrenmesi Destekli Diyabet Rahatsızlığı Tahmini [Machine Learning-Assisted Diabetes Prediction with Apache Spark]
by: Emre Yıldırım, et al.
Published: (2022-07-01)

Framing Apache Spark in life sciences
by: Andrea Manconi, et al.
Published: (2023-02-01)

A Parallel Multiobjective PSO Weighted Average Clustering Algorithm Based on Apache Spark
by: Huidong Ling, et al.
Published: (2023-01-01)

QoS-Aware Approximate Query Processing for Smart Cities Spatial Data Streams
by: Isam Mashhour Al Jawarneh, et al.
Published: (2021-06-01)

Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark
by: Elham Azhir, et al.
Published: (2022-09-01)

Large-Scale Music Genre Analysis and Classification Using Machine Learning with Apache Spark
by: Mousumi Chaudhury, et al.
Published: (2022-08-01)

A Strategy of Parallel SLIC Superpixels for Handling Large-Scale Images over Apache Spark
by: Ning Wang, et al.
Published: (2022-03-01)

SparkEC: speeding up alignment-based DNA error correction tools
by: Roberto R. Expósito, et al.
Published: (2022-11-01)

A Novel Reinforcement Learning Approach for Spark Configuration Parameter Optimization
by: Xu Huang, et al.
Published: (2022-08-01)

Nodule Detection with Convolutional Neural Network Using Apache Spark and GPU Frameworks
by: Nikitha Johnsirani Venkatesan, et al.
Published: (2021-03-01)

Statistical analysis of the performance of four Apache Spark ML algorithms
by: Genaro Camele, et al.
Published: (2022-10-01)

An Efficient Distributed SPARQL Query Processing Scheme Considering Communication Costs in Spark Environments
by: Jongtae Lim, et al.
Published: (2021-12-01)

Comparative Analysis of Skew-Join Strategies for Large-Scale Datasets with MapReduce and Spark
by: Anh-Cang Phan, et al.
Published: (2022-06-01)

A Regularization-Based Big Data Framework for Winter Precipitation Forecasting on Streaming Data
by: Andreas Kanavos, et al.
Published: (2021-08-01)

Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming
by: Gerard Maas, et al.
Published: (2019)

Lightweight Computational Complexity Stepping Up the NTRU Post-Quantum Algorithm Using Parallel Computing
by: Ghada Farouk Elkabbany, et al.
Published: (2023-12-01)

PUC: parallel mining of high-utility itemsets with load balancing on spark
by: Brahmavar Anup Bhat, et al.
Published: (2022-05-01)

New distributed-topsis approach for multi-criteria decision-making problems in a big data context
by: Loubna Lamrini, et al.
Published: (2023-06-01)

JAMPI: Efficient Matrix Multiplication in Spark Using Barrier Execution Mode
by: Tamas Foldi, et al.
Published: (2020-11-01)

Using Apache Spark on genome assembly for scalable overlap-graph reduction
by: Alexander J. Paul, et al.
Published: (2019-10-01)

Distributed Fast Self-Organized Maps for Massive Spectrophotometric Data Analysis
by: Carlos Dafonte, et al.
Published: (2018-05-01)

A Novel True Real-Time Spatiotemporal Data Stream Processing Framework
by: Ature Angbera, et al.
Published: (2022-09-01)

Defining Semantically Close Words of Kazakh Language with Distributed System Apache Spark
by: Dauren Ayazbayev, et al.
Published: (2023-09-01)

A New Big Data Processing Framework for the Online Roadshow
by: Kang-Ren Leow, et al.
Published: (2023-06-01)

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster
by: Nasim Ahmed, et al.
Published: (2021-11-01)

Exploiting Machine Learning for Improving In-Memory Execution of Data-Intensive Workflows on Parallel Machines
by: Riccardo Cantini, et al.
Published: (2021-05-01)

An Estimation of Gini Coefficient in Iran
by: Mohsen Jalali
Published: (2008-09-01)

Privacy-Preserving Machine Learning on Apache Spark
by: Claudia V. Brito, et al.
Published: (2023-01-01)

Influence of interest rates on equality of income distribution
by: Paunović Miloš
Published: (2021-01-01)

A Study of the Distribution of Library Resources in Public Libraries of Markazi Province through Gini Coefficient
by: Mohsen Motiei
Published: (2014-03-01)

Big Data Analysis Using Apache Spark MLlib and Hadoop HDFS with Scala and Java
by: Hoger Khayrolla Omar, et al.
Published: (2019-05-01)

Research on high sampling frequency mine electric spark image recognition and anti-interference methods
by: LI Xiaowei, et al.
Published: (2023-08-01)

Evaluation of the Temporal Efficiency of Big Data Storage Formats in the Dynamics of Data Growth
by: Vladimir Belov, et al.
Published: (2021-12-01)

Parallel Ant Colony Optimization Algorithm for Finding the Shortest Path for Mountain Climbing
by: Esra'a Alhenawi, et al.
Published: (2023-01-01)

FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
by: María José Basgall, et al.
Published: (2021-07-01)

Lorenz Curves, Size Classification, and Dimensions of Bubble Size Distributions
by: Sonja Sauerbrei
Published: (2009-12-01)

A Distributed Parallel Algorithm Based on Low-Rank and Sparse Representation for Anomaly Detection in Hyperspectral Images
by: Yi Zhang, et al.
Published: (2018-10-01)

A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data
by: Fei Hu, et al.
Published: (2020-03-01)

Deconvolute individual genomes from metagenome sequences through short read clustering
by: Kexue Li, et al.
Published: (2020-04-01)

A Distributed Algorithm for Protein Identification from Tandem Mass Spectrometry Data
by: Katarzyna Orzechowska, et al.
Published: (2022-06-01)