Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been...


Bibliographic Details
Main Authors: Bikram Sahoo, Sarwan Ali, Pin-Yu Chen, Murray Patterson, Alexander Zelikovsky
Format: Article
Language: English
Published: MDPI AG 2023-06-01
Series: Biomolecules
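Subjects: sequencing error; third-generation single-molecule sequencing (TGS); long read; machine learning; embedding methods; classification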
Online Access: https://www.mdpi.com/2218-273X/13/6/934
author Bikram Sahoo
Sarwan Ali
Pin-Yu Chen
Murray Patterson
Alexander Zelikovsky
collection DOAJ
description The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.
first_indexed 2024-03-11T02:41:50Z
format Article
id doaj.art-7b29413c29bb4697a57fc154f70244d7
institution Directory Open Access Journal
issn 2218-273X
language English
last_indexed 2024-03-11T02:41:50Z
publishDate 2023-06-01
publisher MDPI AG
record_format Article
series Biomolecules
spelling doaj.art-7b29413c29bb4697a57fc154f70244d7 (2023-11-18T09:30:52Z); eng; MDPI AG; Biomolecules 13(6), 934; ISSN 2218-273X; 2023-06-01; DOI 10.3390/biom13060934
Bikram Sahoo, Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Sarwan Ali, Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Pin-Yu Chen, IBM Research, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
Murray Patterson, Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Alexander Zelikovsky, Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
title Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors
topic sequencing error
third-generation single-molecule sequencing (TGS)
long read
machine learning
embedding methods
classification
url https://www.mdpi.com/2218-273X/13/6/934