Optimization of the Join between Large Tables in the Spark Distributed Framework

Join tasks between large tables in Spark take a long time to run and incur heavy disk I/O, network I/O, and disk usage during the Shuffle process. This paper proposes a lightweight distributed data filtering model that combines broadcast variables and accumulators using RoaringBitmap. When th...

Bibliographic Details
Main Authors:	Xiang Wu, Yueshun He
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Applied Sciences
Subjects:	Join; Spark; Shuffle; optimization method; RoaringBitmap
Online Access:	https://www.mdpi.com/2076-3417/13/10/6257

Similar Items