RCM: A Remote Cache Management Framework for Spark

With the rapid growth of Internet data, the performance of big data processing platforms is attracting more and more attention. In Spark, cache data are replaced by the Least Recently Used (LRU) Algorithm. LRU cannot identify the cost of cache data, which leads to replacing some important cache data...

Bibliographic Details
Main Authors: Yixin Song, Junyang Yu, Bohan Li, Han Li, Xin He, Jinjiang Wang, Rui Zhai
Format: Article
Language: English
Published: MDPI AG 2022-11-01
Series: Applied Sciences
Subjects: cache weight generation; cache replacement; cache placement; cache management framework; Spark
Online Access: https://www.mdpi.com/2076-3417/12/22/11491
collection DOAJ
description With the rapid growth of Internet data, the performance of big data processing platforms is attracting increasing attention. In Spark, cached data are evicted by the Least Recently Used (LRU) algorithm. LRU cannot account for the cost of cached data, so it may evict important cached data. In addition, cached data are placed at random, with no measure for finding efficient cache servers. To address these problems, a remote cache management framework (RCM) for the Spark platform is proposed, comprising a cache weight generation module (CWG), a cache replacement module (CREP), and a cache placement module (CPL). CWG establishes initial weights from three main factors: the response time of database queries, the number of queries, and the data size. CWG then reduces the weight of old data through a time loss function. CREP uses a greedy strategy to maximize the sum of the cached data weights. CPL allocates the best cache server for each data item using the Kuhn-Munkres matching algorithm to improve cooperation efficiency. To verify its effectiveness, RCM is implemented on Redis and deployed on eight computing nodes and four cache servers. Three groups of benchmark jobs, PageRank, K-means and WordCount, were tested. The experimental results confirm that, compared with MCM, SACM and DMAOM, RCM reduces execution time by up to 42.1%.
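The weighting and replacement steps described above can be illustrated with a minimal sketch. The record does not give the paper's formulas, so everything here is an assumption made for illustration: the weight formula (response time times query count, divided by size), the half-life form of the time loss function, and the names `CacheEntry`, `initial_weight`, `decayed_weight` and `greedy_keep` are all hypothetical and are not taken from RCM.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    response_time_s: float  # time needed to re-query the database for this data
    query_count: int        # how many queries have used this data
    size_mb: float          # size of the cached data
    age_s: float            # seconds since the data was last used


def initial_weight(e: CacheEntry) -> float:
    # Assumed formula: data that is slow to re-fetch, queried often,
    # and small gets a higher weight.
    return e.response_time_s * e.query_count / e.size_mb


def decayed_weight(e: CacheEntry, half_life_s: float = 600.0) -> float:
    # Assumed time loss function: the weight halves every half_life_s seconds.
    return initial_weight(e) * 0.5 ** (e.age_s / half_life_s)


def greedy_keep(entries: list[CacheEntry], capacity_mb: float) -> list[CacheEntry]:
    # Greedy selection: visit entries from highest to lowest decayed weight
    # and keep each one that still fits; the rest are eviction candidates.
    kept, used = [], 0.0
    for e in sorted(entries, key=decayed_weight, reverse=True):
        if used + e.size_mb <= capacity_mb:
            kept.append(e)
            used += e.size_mb
    return kept


entries = [
    CacheEntry("a", response_time_s=2.0, query_count=10, size_mb=4.0, age_s=0.0),
    CacheEntry("b", response_time_s=1.0, query_count=4, size_mb=2.0, age_s=600.0),
    CacheEntry("c", response_time_s=0.5, query_count=6, size_mb=4.0, age_s=0.0),
]
print([e.key for e in greedy_keep(entries, capacity_mb=6.0)])
```

With a 6 MB budget, the sketch keeps the two highest-weight entries that fit and leaves the third as the eviction candidate. A greedy pass of this kind is a heuristic and does not guarantee the maximum weight sum in general; the placement step (Kuhn-Munkres matching) is not covered here.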
first_indexed 2024-03-09T18:30:31Z
format Article
id doaj.art-9e46333d02f84e4d8828581761292ddc
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-09T18:30:31Z
publishDate 2022-11-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-9e46333d02f84e4d8828581761292ddc 2023-11-24T07:36:11Z
eng
MDPI AG
Applied Sciences, 2076-3417, 2022-11-01, 12(22), 11491
10.3390/app122211491
RCM: A Remote Cache Management Framework for Spark
Yixin Song, Junyang Yu, Bohan Li, Han Li, Xin He, Jinjiang Wang, Rui Zhai (all: School of Software, Henan University, Kaifeng 475001, China)
https://www.mdpi.com/2076-3417/12/22/11491
cache weight generation; cache replacement; cache placement; cache management framework; Spark
topic cache weight generation
cache replacement
cache placement
cache management framework
Spark
url https://www.mdpi.com/2076-3417/12/22/11491