Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors

Bibliographic Details
Main Authors:	Bikram Sahoo, Sarwan Ali, Pin-Yu Chen, Murray Patterson, Alexander Zelikovsky
Format:	Article
Language:	English
Published:	MDPI AG 2023-06-01
Series:	Biomolecules
Subjects:	sequencing error; third-generation single-molecule sequencing (TGS); long read; machine learning; embedding methods; classification
Online Access:	https://www.mdpi.com/2218-273X/13/6/934

_version_	1797595815547502592
author	Bikram Sahoo, Sarwan Ali, Pin-Yu Chen, Murray Patterson, Alexander Zelikovsky
author_sort	Bikram Sahoo
collection	DOAJ
description	The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.
first_indexed	2024-03-11T02:41:50Z
format	Article
id	doaj.art-7b29413c29bb4697a57fc154f70244d7
institution	Directory Open Access Journal
issn	2218-273X
language	English
last_indexed	2024-03-11T02:41:50Z
publishDate	2023-06-01
publisher	MDPI AG
record_format	Article
series	Biomolecules
spelling	doaj.art-7b29413c29bb4697a57fc154f70244d7 | 2023-11-18T09:30:52Z | eng | MDPI AG | Biomolecules | ISSN 2218-273X | 2023-06-01 | vol. 13, no. 6, art. 934 | doi:10.3390/biom13060934 | Bikram Sahoo (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA) | Sarwan Ali (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA) | Pin-Yu Chen (IBM Research, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA) | Murray Patterson (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA) | Alexander Zelikovsky (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA)
title	Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors
topic	sequencing error; third-generation single-molecule sequencing (TGS); long read; machine learning; embedding methods; classification
url	https://www.mdpi.com/2218-273X/13/6/934
