Framing Apache Spark in life sciences
Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing these data. In fact, this kind of tas...
Main Authors: | Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2023-02-01 |
Series: | Heliyon |
Subjects: | Apache Spark, Big data, Parallel computing, HPC |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2405844023005753 |
_version_ | 1811161753849954304 |
---|---|
author | Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano |
author_facet | Andrea Manconi, Matteo Gnocchi, Luciano Milanesi, Osvaldo Marullo, Giuliano Armano |
author_sort | Andrea Manconi |
collection | DOAJ |
description | Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing these data. Tasks of this kind require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data on on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be used successfully in their research activities. |
first_indexed | 2024-04-10T06:20:14Z |
format | Article |
id | doaj.art-1724a17df5b24e1f9e38155892740db4 |
institution | Directory Open Access Journal |
issn | 2405-8440 |
language | English |
last_indexed | 2024-04-10T06:20:14Z |
publishDate | 2023-02-01 |
publisher | Elsevier |
record_format | Article |
series | Heliyon |
spelling | doaj.art-1724a17df5b24e1f9e38155892740db4; 2023-03-02T05:01:14Z; eng; Elsevier; Heliyon; 2405-8440; 2023-02-01; 9; 2; e13368; Framing Apache Spark in life sciences; Andrea Manconi (Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy; corresponding author); Matteo Gnocchi (Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy); Luciano Milanesi (Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy); Osvaldo Marullo (Department of Mathematics and Computer Science - University of Cagliari, Cagliari, Italy); Giuliano Armano (Department of Mathematics and Computer Science - University of Cagliari, Cagliari, Italy); Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing these data. Tasks of this kind require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data on on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be used successfully in their research activities.; http://www.sciencedirect.com/science/article/pii/S2405844023005753; Apache Spark; Big data; Parallel computing; HPC |
spellingShingle | Andrea Manconi Matteo Gnocchi Luciano Milanesi Osvaldo Marullo Giuliano Armano Framing Apache Spark in life sciences Heliyon Apache Spark Big data Parallel computing HPC |
title | Framing Apache Spark in life sciences |
title_full | Framing Apache Spark in life sciences |
title_fullStr | Framing Apache Spark in life sciences |
title_full_unstemmed | Framing Apache Spark in life sciences |
title_short | Framing Apache Spark in life sciences |
title_sort | framing apache spark in life sciences |
topic | Apache Spark, Big data, Parallel computing, HPC |
url | http://www.sciencedirect.com/science/article/pii/S2405844023005753 |
work_keys_str_mv | AT andreamanconi framingapachesparkinlifesciences AT matteognocchi framingapachesparkinlifesciences AT lucianomilanesi framingapachesparkinlifesciences AT osvaldomarullo framingapachesparkinlifesciences AT giulianoarmano framingapachesparkinlifesciences |