Framing Apache Spark in life sciences

Advances in high-throughput and digital technologies have made the adoption of big data necessary for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing these data. In fact, this kind of tas...

Bibliographic Details
Main Authors: Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano
Format: Article
Language: English
Published: Elsevier 2023-02-01
Series: Heliyon
Subjects: Apache Spark; Big data; Parallel computing; HPC
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844023005753
author Andrea Manconi
Matteo Gnocchi
Luciano Milanesi
Osvaldo Marullo
Giuliano Armano
collection DOAJ
description Advances in high-throughput and digital technologies have made the adoption of big data necessary for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing these data. Tasks of this kind require distributed computing systems and algorithms that can ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data on on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a powerful HPC engine for large-scale data processing on clusters. Thanks in part to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers understand the features of Apache Spark and assess whether it can be used successfully in their research activities.
first_indexed 2024-04-10T06:20:14Z
format Article
id doaj.art-1724a17df5b24e1f9e38155892740db4
institution Directory Open Access Journal
issn 2405-8440
language English
last_indexed 2024-04-10T06:20:14Z
publishDate 2023-02-01
publisher Elsevier
record_format Article
series Heliyon
spelling doaj.art-1724a17df5b24e1f9e38155892740db4
2023-03-02T05:01:14Z
eng
Elsevier
Heliyon
2405-8440
2023-02-01
9(2), e13368
Framing Apache Spark in life sciences
Andrea Manconi (Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy; corresponding author)
Matteo Gnocchi (Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy)
Luciano Milanesi (Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy)
Osvaldo Marullo (Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy)
Giuliano Armano (Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy)
http://www.sciencedirect.com/science/article/pii/S2405844023005753
Apache Spark; Big data; Parallel computing; HPC
title Framing Apache Spark in life sciences
topic Apache Spark
Big data
Parallel computing
HPC
url http://www.sciencedirect.com/science/article/pii/S2405844023005753