MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads

Abstract Background The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different ap...

Bibliographic Details
Main Authors:	Amira Sami, Sara El-Metwally, M. Z. Rashad
Format:	Article
Language:	English
Published:	BMC 2024-02-01
Series:	BMC Bioinformatics
Subjects:	Next-generation sequencing; Error filtration; Machine learning; Classification; Feature extraction
Online Access:	https://doi.org/10.1186/s12859-024-05681-1

_version_	1827325942321643520
author	Amira Sami; Sara El-Metwally; M. Z. Rashad
collection	DOAJ
description	Abstract
Background: The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality, as errors become more prevalent. This creates a need for different approaches to detect and filter errors, and it moves data quality assurance from the hardware space to the software preprocessing stages.
Results: We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted by computing Term Frequency-Inverse Document Frequency (TF-IDF) values from various datasets: E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1, and Metriaclima zebra. Notably, Naive Bayes (NB) demonstrated robust performance across the datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%; BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also mapping reasonable numbers of reads to the reference genome.
Conclusions: This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
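The abstract describes the approach only at a high level: reads are turned into TF-IDF features and passed to a binary classifier, with Naive Bayes performing well. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that each read is tokenized into overlapping k-mers, that k = 3, that TF is the raw k-mer count weighted by a smoothed IDF, and that label 0 means "correct" and 1 means "erroneous"; the toy reads and all function names are hypothetical.

```python
import math
from collections import Counter

def kmers(read, k=3):
    """Split a read into overlapping k-mers (treated as the 'terms')."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def fit_idf(reads, k=3):
    """Smoothed inverse document frequency for each k-mer in the training reads."""
    n = len(reads)
    df = Counter()
    for r in reads:
        df.update(set(kmers(r, k)))  # count each k-mer once per read
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf(read, idf, k=3):
    """Sparse TF-IDF vector (raw count * IDF); k-mers unseen in training are dropped."""
    counts = Counter(kmers(read, k))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    prior, loglik = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior[c] = math.log(len(rows) / len(vectors))
        mass = Counter()
        for v in rows:
            mass.update(v)  # accumulate per-k-mer weight for this class
        denom = sum(mass.values()) + alpha * len(vocab)
        loglik[c] = {t: math.log((mass[t] + alpha) / denom) for t in vocab}
    return prior, loglik

def predict(vec, model):
    """Return the class with the highest log-posterior score."""
    prior, loglik = model
    scores = {c: prior[c] + sum(w * loglik[c][t] for t, w in vec.items())
              for c in prior}
    return max(scores, key=scores.get)

# Hypothetical toy training set: 0 = correct read, 1 = erroneous read.
train_reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT",
               "ACGTTTGTAC", "CGTAAAGTAC", "GTACCCGTAC"]
train_labels = [0, 0, 0, 1, 1, 1]

idf = fit_idf(train_reads)
model = train_nb([tfidf(r, idf) for r in train_reads], train_labels, sorted(idf))

for read in ["TACGTACGTA", "TACGTTTGTA"]:
    print(read, predict(tfidf(read, idf), model))
```

In this sketch, a filter would keep only the reads predicted as class 0. The k-mer size, the TF-IDF variant, and the smoothing constant are illustrative choices and are not taken from the record.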
first_indexed	2024-03-07T14:37:32Z
format	Article
id	doaj.art-18482c2efefe4070936b7505ec70b794
institution	Directory of Open Access Journals
issn	1471-2105
language	English
last_indexed	2024-03-07T14:37:32Z
publishDate	2024-02-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	Record doaj.art-18482c2efefe4070936b7505ec70b794 (2024-03-05T20:32:01Z); language: eng; publisher: BMC; journal: BMC Bioinformatics; ISSN: 1471-2105; published: 2024-02-01; volume 25, issue 1, pages 1-30; DOI: 10.1186/s12859-024-05681-1. Title: MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads. Authors: Amira Sami, Sara El-Metwally, M. Z. Rashad (all: Department of Computer Science, Faculty of Computers and Information, Mansoura University). The abstract, subjects, and DOI link repeat the description, topic, and url fields.
title	MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads
topic	Next-generation sequencing; Error filtration; Machine learning; Classification; Feature extraction
url	https://doi.org/10.1186/s12859-024-05681-1
