Framing Apache Spark in life sciences

Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data led researchers to face technical and infrastructural challenges for storing, sharing, and analysing them. In fact, this kind of tas...

Bibliographic Details
Main Authors:	Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano
Format:	Article
Language:	English
Published:	Elsevier 2023-02-01
Series:	Heliyon
Subjects:	Apache Spark; Big data; Parallel computing; HPC
Online Access:	http://www.sciencedirect.com/science/article/pii/S2405844023005753

_version_	1811161753849954304
author	Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano
author_facet	Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano
author_sort	Andrea Manconi
collection	DOAJ
description	Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data led researchers to face technical and infrastructural challenges for storing, sharing, and analysing them. In fact, this kind of task requires distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms able to adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers to ascertain the features of Apache Spark and to assess whether it can be successfully used in their research activities.
first_indexed	2024-04-10T06:20:14Z
format	Article
id	doaj.art-1724a17df5b24e1f9e38155892740db4
institution	Directory Open Access Journal
issn	2405-8440
language	English
last_indexed	2024-04-10T06:20:14Z
publishDate	2023-02-01
publisher	Elsevier
record_format	Article
series	Heliyon
spelling	doaj.art-1724a17df5b24e1f9e38155892740db4 | 2023-03-02T05:01:14Z | eng | Elsevier | Heliyon | 2405-8440 | 2023-02-01 | 9 | 2 | e13368 | Framing Apache Spark in life sciences | Andrea Manconi (0), Matteo Gnocchi (1), Luciano Milanesi (2), Osvaldo Marullo (3), Giuliano Armano (4) | (0) Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy; Corresponding author. | (1) Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy | (2) Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy | (3) Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy | (4) Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy | http://www.sciencedirect.com/science/article/pii/S2405844023005753 | Apache Spark; Big data; Parallel computing; HPC
title	Framing Apache Spark in life sciences
topic	Apache Spark; Big data; Parallel computing; HPC
url	http://www.sciencedirect.com/science/article/pii/S2405844023005753

Similar Items