Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors
The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been...
Main Authors: | Bikram Sahoo, Sarwan Ali, Pin-Yu Chen, Murray Patterson, Alexander Zelikovsky |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-06-01 |
Series: | Biomolecules |
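Subjects: | sequencing error; third-generation single-molecule sequencing (TGS); long read; machine learning; embedding methods; classification |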
Online Access: | https://www.mdpi.com/2218-273X/13/6/934 |
_version_ | 1797595815547502592 |
---|---|
author | Bikram Sahoo; Sarwan Ali; Pin-Yu Chen; Murray Patterson; Alexander Zelikovsky |
author_sort | Bikram Sahoo |
collection | DOAJ |
description | The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences. |
first_indexed | 2024-03-11T02:41:50Z |
format | Article |
id | doaj.art-7b29413c29bb4697a57fc154f70244d7 |
institution | Directory Open Access Journal |
issn | 2218-273X |
language | English |
last_indexed | 2024-03-11T02:41:50Z |
publishDate | 2023-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Biomolecules |
spelling | doaj.art-7b29413c29bb4697a57fc154f70244d7; 2023-11-18T09:30:52Z; eng; MDPI AG; Biomolecules; ISSN 2218-273X; 2023-06-01; vol. 13, no. 6, art. 934; DOI 10.3390/biom13060934; Bikram Sahoo (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA); Sarwan Ali (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA); Pin-Yu Chen (IBM Research, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA); Murray Patterson (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA); Alexander Zelikovsky (Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA) |
title | Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors |
topic | sequencing error; third-generation single-molecule sequencing (TGS); long read; machine learning; embedding methods; classification |
url | https://www.mdpi.com/2218-273X/13/6/934 |
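
The description above notes that the embedding methods produce fixed-length vectors and that one class of test sequences was made by introducing random errors. The following is a minimal illustrative sketch of those two ideas, not the authors' implementation: it uses plain contiguous k-mer counts (not the spaced or weighted k-mers variants evaluated in the article) and substitution-only random errors (not the long-read simulation tools). The function names `kmer_vector` and `add_random_errors`, the choice of k = 3, and the toy sequence are all assumptions made for this example.

```python
import itertools
import random

ALPHABET = "ACGT"


def kmer_vector(seq, k=3):
    """Count every length-k substring of seq.

    The result always has 4**k entries, so sequences of different
    lengths map to vectors of the same length.
    """
    index = {
        "".join(p): i
        for i, p in enumerate(itertools.product(ALPHABET, repeat=k))
    }
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers with ambiguous bases such as N
            vec[j] += 1
    return vec


def add_random_errors(seq, rate, rng):
    """Replace each A/C/G/T base by a different base with probability rate.

    Hypothetical error model for illustration: substitutions only,
    no insertions or deletions.
    """
    out = []
    for base in seq:
        if base in ALPHABET and rng.random() < rate:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)  # fixed seed so the run is repeatable
clean = "ACGTACGTTAGC"  # toy sequence, not real SARS-CoV-2 data
noisy = add_random_errors(clean, rate=0.1, rng=rng)
print(len(kmer_vector(clean)), len(kmer_vector(noisy)))  # 64 64
```

Because both vectors have 4^3 = 64 entries regardless of input length, the clean and error-incorporated versions can be passed to the same classifier, which is the property the description attributes to the fixed-length embeddings.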