Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets
For metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, there are difficulties with the analysis, error detection and variant calling, stemming from the challenges of discerning sequencing errors from biological variation. Confirming base candida...
Main Authors: | Milko Krachunov, Maria Nisheva, Dimitar Vassilev |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2017-11-01 |
Series: | Computers |
Subjects: | machine learning, error discovery, variant calling, metagenomics, polyploid genomes |
Online Access: | https://www.mdpi.com/2073-431X/6/4/29 |
_version_ | 1798003613476323328 |
---|---|
author | Milko Krachunov, Maria Nisheva, Dimitar Vassilev |
author_sort | Milko Krachunov |
collection | DOAJ |
description | For metagenomics datasets, datasets of complex polyploid genomes, and other high-variation genomics datasets, there are difficulties with the analysis, error detection and variant calling, stemming from the challenges of discerning sequencing errors from biological variation. Confirming base candidates with high frequency of occurrence is no longer a reliable measure because of the natural variation and the presence of rare bases. The paper discusses an approach to the application of machine learning models to classify bases into erroneous and rare variations after preselecting potential error candidates with a weighted frequency measure, which aims to focus on unexpected variations by using the inter-sequence pairwise similarity. Different similarity measures are used to account for different types of datasets. Four machine learning models are implemented and tested. |
first_indexed | 2024-04-11T12:10:44Z |
format | Article |
id | doaj.art-672bf857894e479887a6e0a1de09dcda |
institution | Directory Open Access Journal |
issn | 2073-431X |
language | English |
last_indexed | 2024-04-11T12:10:44Z |
publishDate | 2017-11-01 |
publisher | MDPI AG |
record_format | Article |
series | Computers |
spelling | doaj.art-672bf857894e479887a6e0a1de09dcda; 2022-12-22T04:24:37Z; eng; MDPI AG; Computers; ISSN 2073-431X; 2017-11-01; volume 6, issue 4, article 29; DOI 10.3390/computers6040029; Milko Krachunov, Maria Nisheva, Dimitar Vassilev (all: Faculty of Mathematics and Informatics, Sofia University, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria) |
title | Application of Machine Learning Models in Error and Variant Detection in High-Variation Genomics Datasets |
topic | machine learning, error discovery, variant calling, metagenomics, polyploid genomes |
url | https://www.mdpi.com/2073-431X/6/4/29 |
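
The description in this record mentions preselecting potential error candidates with a weighted frequency measure based on inter-sequence pairwise similarity, but it does not give the formula. The Python sketch below is only an illustration of that general idea under stated assumptions, not the authors' method: sequences are assumed to be already aligned and of equal length, similarity is assumed to be simple fractional identity, and the function names (`identity`, `weighted_frequency`, `error_candidates`) and the 0.2 threshold are hypothetical.

```python
from typing import List, Tuple


def identity(a: str, b: str) -> float:
    """Assumed similarity measure: fraction of aligned positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def weighted_frequency(seqs: List[str], i: int, j: int) -> float:
    """Similarity-weighted share of the other sequences that carry the same
    base as seqs[i] at column j. Sequences more similar to seqs[i] count more."""
    supported = 0.0
    total = 0.0
    for k, other in enumerate(seqs):
        if k == i:
            continue  # a sequence does not support its own base
        weight = identity(seqs[i], other)
        total += weight
        if other[j] == seqs[i][j]:
            supported += weight
    # With no similar sequences there is no evidence either way; do not flag.
    return supported / total if total > 0 else 1.0


def error_candidates(seqs: List[str], threshold: float = 0.2) -> List[Tuple[int, int]]:
    """Return (sequence index, column) pairs whose base has a weighted
    frequency below the (hypothetical) threshold."""
    return [
        (i, j)
        for i, seq in enumerate(seqs)
        for j in range(len(seq))
        if weighted_frequency(seqs, i, j) < threshold
    ]


if __name__ == "__main__":
    reads = ["AAAA", "AAAA", "AAAT"]
    print(error_candidates(reads))  # [(2, 3)]
```

In this toy input only the final `T` of the third read is flagged, because no similar read supports it. In the approach the description outlines, such preselected candidates would then be passed to a machine learning classifier that separates errors from rare variants; that step is not sketched here.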