Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets

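For metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, there are difficulties with the analysis, error detection and variant calling, stemming from the challenges of discerning sequencing errors from biological variation. Confirming base candidates with high frequency of occurrence is no longer a reliable measure because of the natural variation and the presence of rare bases. The paper discusses an approach to the application of machine learning models to classify bases into erroneous and rare variations after preselecting potential error candidates with a weighted frequency measure, which aims to focus on unexpected variations by using the inter-sequence pairwise similarity. Different similarity measures are used to account for different types of datasets. Four machine learning models are implemented and tested.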
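The preselection step named in the abstract (a weighted frequency measure driven by inter-sequence pairwise similarity) can be illustrated with a small sketch. This record does not give the paper's formula, so everything below is an assumption made for illustration: the function names, the use of fractional identity over equal-length pre-aligned reads as the similarity, the normalisation, and the 0.2 threshold are all hypothetical and are not the authors' implementation.

```python
from typing import Callable, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions where two equal-length reads agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_frequencies(
    reads: List[str],
    similarity: Callable[[str, str], float] = identity,
) -> List[List[float]]:
    """For every base, the similarity-weighted share of other reads that agree.

    Assumed formula (illustrative only): the support for base j of read i is
    the sum of similarity(i, k) over other reads k carrying the same base at
    column j, divided by the sum of similarity(i, k) over all other reads.
    """
    n = len(reads)
    sim = [[similarity(reads[i], reads[k]) for k in range(n)] for i in range(n)]
    result = []
    for i, read in enumerate(reads):
        total = sum(sim[i][k] for k in range(n) if k != i)
        row = []
        for j, base in enumerate(read):
            support = sum(
                sim[i][k] for k in range(n) if k != i and reads[k][j] == base
            )
            row.append(support / total if total else 0.0)
        result.append(row)
    return result


def error_candidates(
    reads: List[str],
    threshold: float = 0.2,
    similarity: Callable[[str, str], float] = identity,
) -> List[Tuple[int, int]]:
    """(read index, column) pairs whose weighted frequency is below threshold."""
    freqs = weighted_frequencies(reads, similarity)
    return [
        (i, j)
        for i, row in enumerate(freqs)
        for j, f in enumerate(row)
        if f < threshold
    ]


# Toy data: five reads of one variant, two reads of a rarer variant,
# and one read with a single base that no other read supports.
major = "ACGTACGT"
minor = "TGCAACTG"
reads = [major] * 5 + [minor] * 2 + ["ACGCACGT"]

weighted = error_candidates(reads)
unweighted = error_candidates(reads, similarity=lambda a, b: 1.0)
```

On this toy input the similarity weighting keeps the rarer variant out of the candidate list because its bases are supported by the read most similar to it, while a plain unweighted frequency with the same threshold would also flag the rarer variant's distinguishing bases. Swapping the `similarity` argument is one way to mirror the abstract's remark that different similarity measures suit different types of datasets. Candidates selected this way would then be passed to a classifier, which is not sketched here.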

Bibliographic Details
Main Authors: Milko Krachunov, Maria Nisheva, Dimitar Vassilev
Format: Article
Language: English
Published: MDPI AG 2017-11-01
Series: Computers
Subjects: machine learning; error discovery; variant calling; metagenomics; polyploid genomes
Online Access: https://www.mdpi.com/2073-431X/6/4/29
author Milko Krachunov
Maria Nisheva
Dimitar Vassilev
collection DOAJ
description For metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, there are difficulties with the analysis, error detection and variant calling, stemming from the challenges of discerning sequencing errors from biological variation. Confirming base candidates with high frequency of occurrence is no longer a reliable measure because of the natural variation and the presence of rare bases. The paper discusses an approach to the application of machine learning models to classify bases into erroneous and rare variations after preselecting potential error candidates with a weighted frequency measure, which aims to focus on unexpected variations by using the inter-sequence pairwise similarity. Different similarity measures are used to account for different types of datasets. Four machine learning models are implemented and tested.
first_indexed 2024-04-11T12:10:44Z
format Article
id doaj.art-672bf857894e479887a6e0a1de09dcda
institution Directory Open Access Journal
issn 2073-431X
language English
last_indexed 2024-04-11T12:10:44Z
publishDate 2017-11-01
publisher MDPI AG
record_format Article
series Computers
spelling doaj.art-672bf857894e479887a6e0a1de09dcda
2022-12-22T04:24:37Z
eng
MDPI AG
Computers
2073-431X
2017-11-01
volume 6, issue 4, article 29
10.3390/computers6040029
computers6040029
Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets
Milko Krachunov, Maria Nisheva, Dimitar Vassilev
Faculty of Mathematics and Informatics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria (all three authors)
https://www.mdpi.com/2073-431X/6/4/29
machine learning; error discovery; variant calling; metagenomics; polyploid genomes
title Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets
topic machine learning
error discovery
variant calling
metagenomics
polyploid genomes
url https://www.mdpi.com/2073-431X/6/4/29