Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing

Bibliographic Details
Main Authors: Xin Sheng, Lucy Xia, Jordan L. Cahoon, David V. Conti, Christopher A. Haiman, Linda Kachuri, Charleston W.K. Chiang
Format: Article
Language: English
Published: Elsevier 2023-01-01
Series: HGG Advances
Subjects: genome build; bioinformatics; reference genome; genetic associations; imputation
Online Access: http://www.sciencedirect.com/science/article/pii/S2666247722000768
author Xin Sheng
Lucy Xia
Jordan L. Cahoon
David V. Conti
Christopher A. Haiman
Linda Kachuri
Charleston W.K. Chiang
collection DOAJ
description Summary: Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of the human reference are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critical for quality control, imputation, and association analysis. In the present study, we show that single-nucleotide variants (SNVs) in regions inverted between different builds of the reference genome are often mishandled bioinformatically. Depending on the array type, SNVs are found in approximately 2–5 Mb of the genome that are inverted between reference builds. Coordinate conversions of these variants are mishandled by both the TOPMed imputation server and routine in-house quality control pipelines, leading to underrecognized downstream analytical consequences. Specifically, we observe that undetected allelic conversion errors for palindromic (i.e., A/T or C/G) variants in these inverted regions would destabilize the local haplotype structure, leading to loss of imputation accuracy and of power in association analyses. Though only a small proportion of the genome is affected, these regions include important disease susceptibility variants. For example, the association at a known prostate cancer locus on chromosome 10 (chr10) would weaken from a p value of 2.86 × 10⁻⁷ to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We devise a straightforward heuristic based on the popular tool liftOver that can easily detect and correct these variants in the inverted regions between genome builds to locally improve imputation accuracy.
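The summary's point about palindromic variants can be illustrated with a minimal sketch. This is not the authors' heuristic, whose details are not given in this record. It assumes that the strand of each lifted variant is already known (for example, from the strand column of a BED file passed through liftOver), and all function names here are hypothetical. It shows why an uncorrected strand flip is visible for an A/G variant but invisible for an A/T variant.

```python
# Hypothetical illustration only; names and inputs are assumptions,
# not the pipeline described in the article.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(a1: str, a2: str) -> bool:
    """True for A/T or C/G SNVs: the two alleles are each other's complement."""
    return COMPLEMENT[a1] == a2


def lift_alleles(a1: str, a2: str, lifted_strand: str) -> tuple[str, str]:
    """Return the alleles on the forward strand of the target build.

    lifted_strand is '+' or '-' (assumed input). A '-' means the variant
    sits in a region inverted between builds, so both alleles are complemented.
    """
    if lifted_strand == "-":
        return COMPLEMENT[a1], COMPLEMENT[a2]
    return a1, a2


def flip_is_detectable(a1: str, a2: str) -> bool:
    """Would an uncorrected flip show up as an allele mismatch?

    Compares the unordered allele set before and after complementing.
    For palindromic SNVs the sets are identical, so a simple allele
    check against a reference panel cannot reveal the error.
    """
    corrected = lift_alleles(a1, a2, "-")
    return {a1, a2} != set(corrected)


if __name__ == "__main__":
    for alleles in [("A", "G"), ("A", "T"), ("C", "G")]:
        print(alleles, "->", lift_alleles(*alleles, "-"),
              "detectable:", flip_is_detectable(*alleles))
```

In this sketch an A/G variant in an inverted region becomes T/C, which a reference panel check would flag if left uncorrected, whereas an A/T variant becomes T/A: the same two letters with swapped roles, which is the undetected allelic conversion error the summary describes.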
first_indexed 2024-04-11T15:46:38Z
format Article
id doaj.art-2746dd1309aa4bb7aee44398da16eb7f
institution Directory of Open Access Journals
issn 2666-2477
language English
last_indexed 2024-04-11T15:46:38Z
publishDate 2023-01-01
publisher Elsevier
record_format Article
series HGG Advances
spelling doaj.art-2746dd1309aa4bb7aee44398da16eb7f
2022-12-22T04:15:31Z
eng
Elsevier
HGG Advances
2666-2477
2023-01-01
Volume 4, Issue 1, Article 100159
Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing
Xin Sheng: Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
Lucy Xia: Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
Jordan L. Cahoon: Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
David V. Conti: Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
Christopher A. Haiman: Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
Linda Kachuri: Department of Epidemiology and Population Health, Stanford University, Stanford, CA 94305, USA
Charleston W.K. Chiang (corresponding author): Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
http://www.sciencedirect.com/science/article/pii/S2666247722000768
genome build; bioinformatics; reference genome; genetic associations; imputation
title Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing
topic genome build
bioinformatics
reference genome
genetic associations
imputation
url http://www.sciencedirect.com/science/article/pii/S2666247722000768