Novel methods for genotype imputation to whole-genome sequence and a simple linear model to predict imputation accuracy

Abstract. Background: Accurate imputation plays a major role in genomic studies of livestock industries, where the number of genotyped or sequenced animals is limited by costs. This study explored methods to create an ideal reference population for imputation to Next Generation Sequencing data in catt...

Bibliographic Details
Main Authors:	Steven G. Larmer, Mehdi Sargolzaei, Luiz F. Brito, Ricardo V. Ventura, Flávio S. Schenkel
Format:	Article
Language:	English
Published:	BMC 2017-12-01
Series:	BMC Genetics
Subjects:	Cattle genomics; Genomic clustering; Genotype imputation; Sequencing data
Online Access:	http://link.springer.com/article/10.1186/s12863-017-0588-1

collection	DOAJ
description	Abstract
Background: Accurate imputation plays a major role in genomic studies of livestock industries, where the number of genotyped or sequenced animals is limited by costs. This study explored methods to create an ideal reference population for imputation to Next Generation Sequencing data in cattle.
Methods: Methods for clustering of animals for imputation were explored, using 1000 Bull Genomes Project sequence data on 1146 animals from a variety of beef and dairy breeds. Imputation from 50 K to 777 K was first carried out to choose an ideal clustering method, using ADMIXTURE or PLINK clustering algorithms with either genotypes or reconstructed haplotypes.
Results: Due to efficiency, accuracy and ease of use, clustering with PLINK using haplotypes as quasi-genotypes was chosen as the most advantageous grouping method. Using a clustered population slightly decreased computing time while maintaining accuracy across the population. Although overall accuracy remained the same, a slight increase in accuracy was observed for groups of animals in some breeds (primarily purebred beef cattle from breeds with fewer sequenced animals), and a slight decrease was observed for other groups, primarily crossbred animals. However, some animals in each breed were poorly imputed across all methods. When imputed sequences were included in the reference population to aid imputation of poorly imputed animals, a small increase in overall accuracy was observed for nearly every individual in the population. Two models were created to predict imputation accuracy: a complete model using all information available, including Euclidean distances from genotypes and haplotypes, pedigree information, and clustering groups; and a simple model using only breed and a Euclidean distance matrix as predictors. Both models were successful in predicting imputation accuracy, with correlations between predicted and true imputation accuracy (measured as concordance rate) of 0.87 and 0.83, respectively.
Conclusions: A clustering methodology can be very useful to subgroup cattle for efficient genotype imputation. In addition, accuracy of genotype imputation from medium to high-density Single Nucleotide Polymorphism (SNP) chip panels to whole-genome sequence can be predicted well using a simple linear model defined in this study.
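The abstract names two quantities that are easy to illustrate: the concordance rate used to measure imputation accuracy, and a simple linear model with breed and a Euclidean distance as predictors. The following is only a minimal sketch under stated assumptions, not the authors' implementation. The function names, the 0/1/2 genotype coding, the use of a mean distance to the reference animals, and the toy numbers are all hypothetical; the record does not say how the distance matrix enters the model.

```python
import numpy as np

def concordance_rate(true_genotypes, imputed_genotypes):
    """Fraction of genotype calls (assumed coded 0/1/2) that match."""
    t = np.asarray(true_genotypes)
    i = np.asarray(imputed_genotypes)
    return float(np.mean(t == i))

def mean_euclidean_distance(target, reference):
    """Mean Euclidean distance from one animal to each reference animal.

    Assumption: one summary distance per animal; the study itself
    refers to a Euclidean distance matrix.
    """
    target = np.asarray(target, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.linalg.norm(reference - target, axis=1)))

def fit_simple_model(breed_codes, distances, accuracies):
    """Least-squares fit of accuracy ~ breed (one-hot, no intercept) + distance."""
    breeds = sorted(set(breed_codes))
    X = np.array([
        [1.0 if b == k else 0.0 for k in breeds] + [d]
        for b, d in zip(breed_codes, distances)
    ])
    coef, *_ = np.linalg.lstsq(X, np.asarray(accuracies, dtype=float), rcond=None)
    return breeds, coef

def predict_accuracy(breeds, coef, breed, distance):
    """Predicted concordance rate for an animal of a known breed."""
    x = np.array([1.0 if breed == k else 0.0 for k in breeds] + [distance])
    return float(x @ coef)

# Hypothetical toy data: two breeds, accuracy falling with distance.
breeds, coef = fit_simple_model(
    ["A", "A", "B", "B"], [1.0, 3.0, 2.0, 4.0], [0.94, 0.92, 0.88, 0.86]
)
predicted = predict_accuracy(breeds, coef, "A", 2.0)
```

In this sketch each breed gets its own intercept and all breeds share one distance slope, which is one plausible reading of "breed and a Euclidean distance matrix as predictors".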
first_indexed	2024-12-12T15:16:06Z
id	doaj.art-e195df0d2c114b6bab65aa9df87b34c4
institution	Directory of Open Access Journals
issn	1471-2156
last_indexed	2024-12-12T15:16:06Z
spelling	doaj.art-e195df0d2c114b6bab65aa9df87b34c4 | 2022-12-22T00:20:29Z | eng | BMC | BMC Genetics | 1471-2156 | 2017-12-01 | 18 | 1 | 1 | 12 | 10.1186/s12863-017-0588-1 | Novel methods for genotype imputation to whole-genome sequence and a simple linear model to predict imputation accuracy | Steven G. Larmer, Mehdi Sargolzaei, Luiz F. Brito, Ricardo V. Ventura, Flávio S. Schenkel (all: Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph) | http://link.springer.com/article/10.1186/s12863-017-0588-1 | Cattle genomics; Genomic clustering; Genotype imputation; Sequencing data


Similar Items