Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey

Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic...

Bibliographic Details
Main Authors: Muhammad Usman Tariq, Muhammad Haseeb, Mohammed Aledhari, Rehma Razzak, Reza M. Parizi, Fahad Saeed
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Proteogenomics; proteomics; high-performance computing; workflow; genomics; big data
Online Access: https://ieeexplore.ieee.org/document/9309010/
author Muhammad Usman Tariq
Muhammad Haseeb
Mohammed Aledhari
Rehma Razzak
Reza M. Parizi
Fahad Saeed
collection DOAJ
description Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues inherent in proteogenomic data analysis, where the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and make recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make the case that high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
first_indexed 2024-12-14T10:36:39Z
format Article
id doaj.art-c0f8574a8ff64b949a36eef5a93f9b57
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-14T10:36:39Z
publishDate 2021-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-c0f8574a8ff64b949a36eef5a93f9b57; 2022-12-21T23:05:53Z; eng; IEEE; IEEE Access; 2169-3536; 2021-01-01; vol. 9, pp. 5497-5516; doi:10.1109/ACCESS.2020.3047588; 9309010
Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey
Muhammad Usman Tariq (School of Computing and Information Sciences, Florida International University, Miami, FL, USA)
Muhammad Haseeb, https://orcid.org/0000-0002-0697-6894 (School of Computing and Information Sciences, Florida International University, Miami, FL, USA)
Mohammed Aledhari, https://orcid.org/0000-0002-5380-6003 (College of Computing and Software Engineering, Kennesaw State University, Marietta, GA, USA)
Rehma Razzak, https://orcid.org/0000-0002-5301-8955 (College of Computing and Software Engineering, Kennesaw State University, Marietta, GA, USA)
Reza M. Parizi, https://orcid.org/0000-0002-0049-4296 (College of Computing and Software Engineering, Kennesaw State University, Marietta, GA, USA)
Fahad Saeed, https://orcid.org/0000-0002-3410-9552 (School of Computing and Information Sciences, Florida International University, Miami, FL, USA)
https://ieeexplore.ieee.org/document/9309010/
Proteogenomics; proteomics; high-performance computing; workflow; genomics; big data
title Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey
topic Proteogenomics
proteomics
high-performance computing
workflow
genomics
big data
url https://ieeexplore.ieee.org/document/9309010/