Execution Repair for Spark Programs by Active Maintenance of Partition Dependency

Spark programs typically codify to reuse some of their generated datasets, called partition instances, to make their subsequent computations complete in a reasonable time. At runtime, however, the underlying Spark platform may independently delete such instances or accidentally cause these instances...

Full description

Bibliographic Details
Main Authors: Xiupei Mei, Imran Ashraf, Xiaoxue Ma, Hao Zhang, Zhengyuan Wei, Haipeng Wang, W. K. Chan
Format: Article
Language:English
Published: IEEE 2021-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/9488244/
Description
Summary:Spark programs typically codify to reuse some of their generated datasets, called partition instances, to make their subsequent computations complete in a reasonable time. At runtime, however, the underlying Spark platform may independently delete such instances or accidentally cause these instances inaccessible to the program executions. Those instances will invalidate the computation assumption made in writing these programs that such depending instances are present, which leads performance bloat and even breaks the executions. In this paper, we present FAR, a novel and effective framework to handle such performance bloat and actively repair the executions by maintaining the instance dependencies in Spark program executions. FAR monitors the partition instance lifecycle activities at all levels, and determines from the execution plan of the current Spark action in the current program execution on whether a partition instance will have a dependency relation with a later one underlying the computation of that action. The experimental results showed that with the active execution repair mechanism of FAR, when some dependency partition instances were inaccessible, programs can achieve 7.3x to 67.0x speedup in re-generating them. The results also interestingly revealed that the program executions actively repaired by FAR can run to successful completion in environments with 1.7x-2.0x fewer available memory.
ISSN:2169-3536