Novel methods for genotype imputation to whole-genome sequence and a simple linear model to predict imputation accuracy

Abstract. Background: Accurate imputation plays a major role in genomic studies of livestock industries, where the number of genotyped or sequenced animals is limited by costs. This study explored methods to create an ideal reference population for imputation to Next Generation Sequencing data in catt...

Bibliographic Details
Main Authors: Steven G. Larmer, Mehdi Sargolzaei, Luiz F. Brito, Ricardo V. Ventura, Flávio S. Schenkel
Format: Article
Language: English
Published: BMC 2017-12-01
Series: BMC Genetics
Subjects: Cattle genomics; Genomic clustering; Genotype imputation; Sequencing data
Online Access: http://link.springer.com/article/10.1186/s12863-017-0588-1
author Steven G. Larmer
Mehdi Sargolzaei
Luiz F. Brito
Ricardo V. Ventura
Flávio S. Schenkel
collection DOAJ
description Abstract
Background: Accurate imputation plays a major role in genomic studies of livestock industries, where the number of genotyped or sequenced animals is limited by costs. This study explored methods to create an ideal reference population for imputation to Next Generation Sequencing data in cattle.
Methods: Methods for clustering animals for imputation were explored using 1000 Bull Genomes Project sequence data on 1146 animals from a variety of beef and dairy breeds. Imputation from 50 K to 777 K was first carried out to choose an ideal clustering method, using ADMIXTURE or PLINK clustering algorithms with either genotypes or reconstructed haplotypes.
Results: Owing to its efficiency, accuracy and ease of use, clustering with PLINK using haplotypes as quasi-genotypes was chosen as the most advantageous grouping method. Using a clustered population slightly decreased computing time while maintaining accuracy across the population. Although overall accuracy remained the same, accuracy increased slightly for groups of animals in some breeds (primarily purebred beef cattle from breeds with fewer sequenced animals) and decreased slightly for other groups, primarily crossbred animals. However, some animals in each breed were poorly imputed across all methods. When imputed sequences were included in the reference population to aid imputation of poorly imputed animals, a small increase in overall accuracy was observed for nearly every individual in the population. Two models were created to predict imputation accuracy: a complete model using all available information, including Euclidean distances from genotypes and haplotypes, pedigree information, and clustering groups; and a simple model using only breed and a Euclidean distance matrix as predictors. Both models were successful in predicting imputation accuracy, with correlations between predicted and true imputation accuracy, as measured by concordance rate, of 0.87 and 0.83, respectively.
Conclusions: A clustering methodology can be very useful to subgroup cattle for efficient genotype imputation. In addition, the accuracy of genotype imputation from medium to high-density single nucleotide polymorphism (SNP) chip panels to whole-genome sequence can be predicted well using a simple linear model defined in this study.
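The concordance rate and the simple prediction model mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the article: genotypes coded 0/1/2, the function names, the synthetic data and breed labels, and the choice to summarize the Euclidean distance matrix as each animal's mean distance to the reference animals. The abstract does not state how the distance matrix enters the model.

```python
import numpy as np


def concordance_rate(true_geno, imputed_geno):
    """Per-animal fraction of genotype calls (assumed coded 0/1/2) that match."""
    true_geno = np.asarray(true_geno)
    imputed_geno = np.asarray(imputed_geno)
    return (true_geno == imputed_geno).mean(axis=1)


def mean_distance_to_reference(target_geno, reference_geno):
    """Mean Euclidean distance from each target animal to all reference animals.

    Hypothetical summary of the distance matrix; one of several possible choices.
    """
    target = np.asarray(target_geno, dtype=float)
    reference = np.asarray(reference_geno, dtype=float)
    diff = target[:, None, :] - reference[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).mean(axis=1)


def design_matrix(breed, distance, levels):
    """One indicator column per breed level plus one distance column."""
    breed = np.asarray(breed)
    one_hot = (breed[:, None] == levels[None, :]).astype(float)
    return np.column_stack([one_hot, np.asarray(distance, dtype=float)])


def fit_simple_model(breed, distance, accuracy):
    """Ordinary least squares fit of accuracy on breed and distance."""
    levels = np.unique(breed)
    X = design_matrix(breed, distance, levels)
    coef, *_ = np.linalg.lstsq(X, np.asarray(accuracy, dtype=float), rcond=None)
    return coef, levels


def predict(coef, levels, breed, distance):
    """Predicted imputation accuracy for new animals."""
    return design_matrix(breed, distance, levels) @ coef


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.integers(0, 3, size=(20, 50))  # synthetic reference genotypes
    targets = rng.integers(0, 3, size=(8, 50))     # synthetic true genotypes
    breed = np.array(["Holstein", "Angus"] * 4)    # hypothetical breed labels

    # Stand-in for imputation output: true calls with about 10% random changes.
    imputed = targets.copy()
    mask = rng.random(targets.shape) < 0.1
    imputed[mask] = rng.integers(0, 3, size=int(mask.sum()))

    accuracy = concordance_rate(targets, imputed)
    distance = mean_distance_to_reference(targets, reference)
    coef, levels = fit_simple_model(breed, distance, accuracy)
    predicted = predict(coef, levels, breed, distance)
    print(np.corrcoef(predicted, accuracy)[0, 1])
```

The printed value is the correlation between predicted and observed concordance, the same kind of statistic the abstract reports (0.87 and 0.83). On random synthetic data it carries no meaning beyond showing the calculation.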
first_indexed 2024-12-12T15:16:06Z
format Article
id doaj.art-e195df0d2c114b6bab65aa9df87b34c4
institution Directory of Open Access Journals
issn 1471-2156
language English
last_indexed 2024-12-12T15:16:06Z
publishDate 2017-12-01
publisher BMC
record_format Article
series BMC Genetics
spelling doaj.art-e195df0d2c114b6bab65aa9df87b34c4; 2022-12-22T00:20:29Z; eng; BMC; BMC Genetics; 1471-2156; 2017-12-01; 18; 1; 1; 12; 10.1186/s12863-017-0588-1
Novel methods for genotype imputation to whole-genome sequence and a simple linear model to predict imputation accuracy
Steven G. Larmer (0); Mehdi Sargolzaei (1); Luiz F. Brito (2); Ricardo V. Ventura (3); Flávio S. Schenkel (4)
(0-4) Centre for Genetic Improvement of Livestock, Department of Animal Biosciences, University of Guelph
http://link.springer.com/article/10.1186/s12863-017-0588-1
Cattle genomics; Genomic clustering; Genotype imputation; Sequencing data
title Novel methods for genotype imputation to whole-genome sequence and a simple linear model to predict imputation accuracy
topic Cattle genomics
Genomic clustering
Genotype imputation
Sequencing data
url http://link.springer.com/article/10.1186/s12863-017-0588-1