Choosing a Data Storage Format in the Apache Hadoop System Based on Experimental Evaluation Using Apache Spark

One of the most important tasks of any platform for big data processing is storing the data received. Different systems have different requirements for the storage formats of big data, which raises the problem of choosing the optimal data storage format to solve the current problem. This paper descr...

Full description

Bibliographic Details
Main Authors: Vladimir Belov, Andrey Tatarintsev, Evgeny Nikulchev
Format: Article
Language:English
Published: MDPI AG 2021-01-01
Series:Symmetry
Subjects:
Online Access:https://www.mdpi.com/2073-8994/13/2/195
Description
Summary:One of the most important tasks of any platform for big data processing is storing the data received. Different systems have different requirements for the storage formats of big data, which raises the problem of choosing the optimal data storage format to solve the current problem. This paper describes the five most popular formats for storing big data, presents an experimental evaluation of these formats and a methodology for choosing the format. The following data storage formats will be considered: avro, CSV, JSON, ORC, parquet. At the first stage, a comparative analysis of the main characteristics of the studied formats was carried out; at the second stage, an experimental evaluation of these formats was prepared and carried out. For the experiment, an experimental stand was deployed with tools for processing big data installed on it. The aim of the experiment was to find out characteristics of data storage formats, such as the volume and processing speed for different operations using the Apache Spark framework. In addition, within the study, an algorithm for choosing the optimal format from the presented alternatives was developed using tropical optimization methods. The result of the study is presented in the form of a technique for obtaining a vector of ratings of data storage formats for the Apache Hadoop system, based on an experimental assessment using Apache Spark.
ISSN:2073-8994