Application of machine learning in SNP discovery

Abstract Background Single nucleotide polymorphisms (SNP) constitute more than 90% of the genetic variation, and hence can account for most trait differences among individuals in a given species. Polymorphism detection software PolyBayes and PolyPhred g...

Bibliographic Details
Main Authors:	Cregan Perry B, Choi Ik-Young, Hyten David L, Grefenstette John J, Matukumalli Lakshmi K, Van Tassell Curtis P
Format:	Article
Language:	English
Published:	BMC 2006-01-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/7/4

collection	DOAJ
description	Abstract

Background: Single nucleotide polymorphisms (SNP) constitute more than 90% of the genetic variation, and hence can account for most trait differences among individuals in a given species. Polymorphism detection software PolyBayes and PolyPhred give high false positive SNP predictions even with stringent parameter values. We developed a machine learning (ML) method to augment PolyBayes to improve its prediction accuracy. ML methods have also been successfully applied to other bioinformatics problems in predicting genes, promoters, transcription factor binding sites and protein structures.

Results: The ML program C4.5 was applied to a set of features in order to build a SNP classifier from training data based on human expert decisions (True/False). The training data were 27,275 candidate SNP generated by sequencing 1973 STS (sequence tag sites) (12 Mb) in both directions from 6 diverse homozygous soybean cultivars and PolyBayes analysis. Test data of 18,390 candidate SNP were generated similarly from 1359 additional STS (8 Mb). SNP from both sets were classified by experts. After training the ML classifier, it agreed with the experts on 97.3% of test data compared with 7.8% agreement between PolyBayes and experts. The PolyBayes positive predictive values (PPV) (i.e., fraction of candidate SNP being real) were 7.8% for all predictions and 16.7% for those with 100% posterior probability of being real. Using ML improved the PPV to 84.8%, a 5- to 10-fold increase. While both ML and PolyBayes produced a similar number of true positives, the ML program generated only 249 false positives as compared to 16,955 for PolyBayes. The complexity of the soybean genome may have contributed to high false SNP predictions by PolyBayes and hence results may differ for other genomes.

Conclusion: A machine learning (ML) method was developed as a supplementary feature to the polymorphism detection software for improving prediction accuracies. The results from this study indicate that a trained ML classifier can significantly reduce human intervention and in this case achieved a 5–10 fold enhanced productivity. The optimized feature set and ML framework can also be applied to all polymorphism discovery software. ML support software is written in Perl and can be easily integrated into an existing SNP discovery pipeline.
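The positive predictive values quoted in the abstract can be cross-checked from the counts it gives. The sketch below is illustrative only and is not part of the catalogued article: it assumes that all 18,390 test candidates count as PolyBayes positive calls, and the ML true-positive count it prints is a back-calculated estimate, not a figure reported in the abstract. The function and variable names are hypothetical.

```python
def ppv(true_pos, false_pos):
    """Positive predictive value: fraction of predicted SNP that are real."""
    return true_pos / (true_pos + false_pos)

# Figures quoted in the abstract (test set).
candidates = 18_390        # candidate SNP in the test data
polybayes_fp = 16_955      # PolyBayes false positives
ml_fp = 249                # ML classifier false positives
ml_ppv_reported = 0.848    # PPV reported for the ML classifier

# Assumption: every test candidate is treated as a PolyBayes positive call,
# so the true positives are whatever remains after removing false positives.
polybayes_tp = candidates - polybayes_fp
polybayes_ppv = ppv(polybayes_tp, polybayes_fp)

# Estimate only (not stated in the abstract): solve PPV = TP / (TP + FP) for TP.
ml_tp_est = round(ml_ppv_reported / (1 - ml_ppv_reported) * ml_fp)

print(f"PolyBayes true positives: {polybayes_tp}")
print(f"PolyBayes PPV: {polybayes_ppv:.1%}")
print(f"Estimated ML true positives: {ml_tp_est}")
print(f"PPV ratio, ML vs PolyBayes (all predictions): {ml_ppv_reported / polybayes_ppv:.1f}")
```

Under these assumptions the PolyBayes PPV comes out at the 7.8% stated in the abstract, and the estimated ML true-positive count is close to the PolyBayes count, which agrees with the statement that both methods produced a similar number of true positives.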
id	doaj.art-9308a38f4b88488e870ee352ae038f76
institution	Directory Open Access Journal
issn	1471-2105
doi	10.1186/1471-2105-7-4
title	Application of machine learning in SNP discovery
