Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing

Summary: Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of human references are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critica...

Bibliographic Details
Main Authors:	Xin Sheng, Lucy Xia, Jordan L. Cahoon, David V. Conti, Christopher A. Haiman, Linda Kachuri, Charleston W.K. Chiang
Format:	Article
Language:	English
Published:	Elsevier 2023-01-01
Series:	HGG Advances
Subjects:	genome build, bioinformatics, reference genome, genetic associations, imputation
Online Access:	http://www.sciencedirect.com/science/article/pii/S2666247722000768

author	Xin Sheng, Lucy Xia, Jordan L. Cahoon, David V. Conti, Christopher A. Haiman, Linda Kachuri, Charleston W.K. Chiang
collection	DOAJ
description	Summary: Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of the human reference are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critical for quality control, imputation, and association analysis. In the present study, we show that single-nucleotide variants (SNVs) in regions inverted between different builds of the reference genome are often mishandled bioinformatically. Depending on the array type, SNVs are found in approximately 2–5 Mb of the genome that are inverted between reference builds. Coordinate conversions of these variants are mishandled by both the TOPMed imputation server and routine in-house quality control pipelines, leading to underrecognized downstream analytical consequences. Specifically, we observe that undetected allelic conversion errors for palindromic (i.e., A/T or C/G) variants in these inverted regions would destabilize the local haplotype structure, leading to loss of imputation accuracy and power in association analyses. Though only a small proportion of the genome is affected, these regions include important disease susceptibility variants. For example, the p value of a known locus associated with prostate cancer on chromosome 10 (chr10) would weaken from 2.86 × 10⁻⁷ to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We devise a straightforward heuristic based on the popular tool liftOver that can easily detect and correct these variants in the inverted regions between genome builds to locally improve imputation accuracy.
first_indexed	2024-04-11T15:46:38Z
format	Article
id	doaj.art-2746dd1309aa4bb7aee44398da16eb7f
institution	Directory Open Access Journal
issn	2666-2477
language	English
last_indexed	2024-04-11T15:46:38Z
publishDate	2023-01-01
publisher	Elsevier
record_format	Article
series	HGG Advances
spelling	doaj.art-2746dd1309aa4bb7aee44398da16eb7f; 2022-12-22T04:15:31Z; eng; Elsevier; HGG Advances; 2666-2477; 2023-01-01; volume 4, issue 1, article 100159
	Xin Sheng (0): Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
	Lucy Xia (1): Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
	Jordan L. Cahoon (2): Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
	David V. Conti (3): Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
	Christopher A. Haiman (4): Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
	Linda Kachuri (5): Department of Epidemiology and Population Health, Stanford University, Stanford, CA 94305, USA
	Charleston W.K. Chiang (6): Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA; Corresponding author
title	Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing
topic	genome build, bioinformatics, reference genome, genetic associations, imputation
url	http://www.sciencedirect.com/science/article/pii/S2666247722000768
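
The summary in this record notes that allele conversion errors go undetected for palindromic (A/T or C/G) variants in regions inverted between builds. The following is a minimal illustrative sketch of why that can happen, not the authors' liftOver-based heuristic, which this record does not describe in detail. It assumes that a variant in an inverted region needs its alleles complemented after coordinate conversion; all function names are hypothetical.

```python
# Illustrative only: function names are hypothetical, not from the article or liftOver.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_alleles(a1, a2):
    """Return both alleles as read on the opposite strand."""
    return COMPLEMENT[a1], COMPLEMENT[a2]


def is_palindromic(a1, a2):
    """True for A/T and C/G SNVs, whose two alleles are complements of each other."""
    return COMPLEMENT[a1] == a2


def lift_alleles(a1, a2, region_inverted):
    """Assumption: in a region inverted between builds, alleles must be complemented."""
    return complement_alleles(a1, a2) if region_inverted else (a1, a2)


def missed_flip_is_detectable(a1, a2):
    """A forgotten complement shows up as an allele-set mismatch only if
    complementing actually changes the set of alleles."""
    return set(complement_alleles(a1, a2)) != {a1, a2}


for snv in [("A", "G"), ("A", "T"), ("C", "G")]:
    print(snv, "palindromic:", is_palindromic(*snv),
          "missed flip detectable:", missed_flip_is_detectable(*snv))
```

For a non-palindromic SNV such as A/G, a missed complement leaves alleles (A/G) that disagree with the expected T/C, so a simple allele comparison can flag it. For A/T or C/G, the allele set is unchanged by complementing, so the two alleles are silently swapped, which is consistent with the disrupted local haplotype structure the summary describes.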

Similar Items