Performance Optimization System for Hadoop and Spark Frameworks


Bibliographic Details
Main Authors: Astsatryan Hrachya, Kocharyan Aram, Hagimont Daniel, Lalayan Arthur
Format: Article
Language: English
Published: Sciendo, 2020-12-01
Series: Cybernetics and Information Technologies
Subjects: hadoop; spark; data compression; cpu/io tradeoff; performance optimization
Online Access: https://doi.org/10.2478/cait-2020-0056
Collection: DOAJ
Description: The optimization of large-scale data sets depends on the technologies and methods used. The MapReduce model, implemented on Apache Hadoop or Spark, allows splitting large data sets into a set of blocks distributed on several machines. Data compression reduces data size and transfer time between disks and memory, but requires additional processing. Therefore, finding an optimal tradeoff is a challenge, as a high compression factor may underload Input/Output but overload the processor. The paper presents a system enabling the selection of compression tools and the tuning of the compression factor to reach the best performance in Apache Hadoop and Spark infrastructures, based on simulation analyses.
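The CPU/IO tradeoff the description refers to can be illustrated with a small simulation, in the spirit of the paper's simulation-based analysis. This is only a sketch, not the authors' system: the zlib codec, the candidate levels 1-9, and the `bandwidth` parameter are all assumptions chosen for illustration.

```python
# Illustrative sketch (not the paper's system): pick a compression factor by
# trading compression CPU time against estimated transfer time. A higher
# level shrinks the data (less I/O) but costs more processor time.
import time
import zlib


def best_compression_level(data: bytes, bandwidth: float) -> int:
    """Return the zlib level (1-9) minimizing compress time + transfer time.

    bandwidth is an assumed disk/network throughput in bytes per second.
    """
    best_level, best_cost = 1, float("inf")
    for level in range(1, 10):
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        cpu_time = time.perf_counter() - start       # processor cost
        transfer_time = len(compressed) / bandwidth  # I/O cost at this size
        cost = cpu_time + transfer_time
        if cost < best_cost:
            best_level, best_cost = level, cost
    return best_level


if __name__ == "__main__":
    sample = b"hadoop spark block " * 100_000  # compressible sample data
    # A slow link favors aggressive compression; a fast link favors light
    # compression, which is exactly the tradeoff the paper's system tunes.
    print(best_compression_level(sample, bandwidth=10e6))
```

Rerunning the sketch with a larger `bandwidth` value tends to select a lower level, since cheap transfer makes the extra CPU time the dominant cost.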
Record ID: doaj.art-12889a3dee384215a31e4c53b35f54b6
Institution: Directory of Open Access Journals
ISSN: 1314-4081
Citation: Cybernetics and Information Technologies, Vol. 20, No. 6 (2020-12-01), pp. 5-17. DOI: 10.2478/cait-2020-0056
Author Affiliations:
Astsatryan Hrachya: Institute for Informatics and Automation Problems of the National Academy of Sciences of the Republic of Armenia, Yerevan 0014, Armenia
Kocharyan Aram: Université Fédérale Toulouse Midi-Pyrénées, Toulouse Cedex 7, France
Hagimont Daniel: Université Fédérale Toulouse Midi-Pyrénées, Toulouse Cedex 7, France
Lalayan Arthur: National Polytechnic University of Armenia, Yerevan 0009, Armenia
Keywords: hadoop; spark; data compression; cpu/io tradeoff; performance optimization