Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey

Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic...

Bibliographic Details
Main Authors:	Muhammad Usman Tariq, Muhammad Haseeb, Mohammed Aledhari, Rehma Razzak, Reza M. Parizi, Fahad Saeed
Format:	Article
Language:	English
Published:	IEEE 2021-01-01
Series:	IEEE Access
Subjects:	Proteogenomics; proteomics; high-performance computing; workflow; genomics; big data
Online Access:	https://ieeexplore.ieee.org/document/9309010/

collection	DOAJ
description	Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues inherent in proteogenomic data analysis, where the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and give recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
id	doaj.art-c0f8574a8ff64b949a36eef5a93f9b57
institution	Directory Open Access Journal
issn	2169-3536
spelling	doaj.art-c0f8574a8ff64b949a36eef5a93f9b57; 2022-12-21T23:05:53Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2021-01-01; vol. 9, pp. 5497-5516; DOI 10.1109/ACCESS.2020.3047588; IEEE document 9309010; Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey; Muhammad Usman Tariq (School of Computing and Information Sciences, Florida International University, Miami, FL, USA); Muhammad Haseeb (https://orcid.org/0000-0002-0697-6894; School of Computing and Information Sciences, Florida International University, Miami, FL, USA); Mohammed Aledhari (https://orcid.org/0000-0002-5380-6003; College of Computing and Software Engineering, Kennesaw State University, Marietta, GA, USA); Rehma Razzak (https://orcid.org/0000-0002-5301-8955; College of Computing and Software Engineering, Kennesaw State University, Marietta, GA, USA); Reza M. Parizi (https://orcid.org/0000-0002-0049-4296; College of Computing and Software Engineering, Kennesaw State University, Marietta, GA, USA); Fahad Saeed (https://orcid.org/0000-0002-3410-9552; School of Computing and Information Sciences, Florida International University, Miami, FL, USA); https://ieeexplore.ieee.org/document/9309010/; Proteogenomics; proteomics; high-performance computing; workflow; genomics; big data

Similar Items