Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing
Summary: Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of human references are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critica...
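Main Authors: | Xin Sheng, Lucy Xia, Jordan L. Cahoon, David V. Conti, Christopher A. Haiman, Linda Kachuri, Charleston W.K. Chiang |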
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2023-01-01 |
Series: | HGG Advances |
Subjects: | genome build; bioinformatics; reference genome; genetic associations; imputation |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2666247722000768 |
_version_ | 1828127193796968448 |
---|---|
author | Xin Sheng; Lucy Xia; Jordan L. Cahoon; David V. Conti; Christopher A. Haiman; Linda Kachuri; Charleston W.K. Chiang |
author_sort | Xin Sheng |
collection | DOAJ |
description | Summary: Over the last two decades, the human reference genome has undergone multiple updates as we complete a linear representation of our genome. Two versions of the human reference are currently used in the biomedical literature, GRCh37/hg19 and GRCh38. Conversions between these versions are critical for quality control, imputation, and association analysis. In the present study, we show that single-nucleotide variants (SNVs) in regions inverted between different builds of the reference genome are often mishandled bioinformatically. Depending on the array type, SNVs are found in approximately 2–5 Mb of the genome that are inverted between reference builds. Coordinate conversions of these variants are mishandled by both the TOPMed imputation server and routine in-house quality control pipelines, leading to underrecognized downstream analytical consequences. Specifically, we observe that undetected allelic conversion errors for palindromic (i.e., A/T or C/G) variants in these inverted regions would destabilize the local haplotype structure, leading to loss of imputation accuracy and power in association analyses. Though only a small proportion of the genome is affected, these regions include important disease susceptibility variants. For example, the p value of a known locus associated with prostate cancer on chromosome 10 (chr10) would weaken from 2.86 × 10⁻⁷ to 0.0011 in a case-control analysis of 20,286 Africans and African Americans (10,643 cases and 9,643 controls). We devise a straightforward heuristic based on the popular tool liftOver that can easily detect and correct these variants in the inverted regions between genome builds to locally improve imputation accuracy. |
first_indexed | 2024-04-11T15:46:38Z |
format | Article |
id | doaj.art-2746dd1309aa4bb7aee44398da16eb7f |
institution | Directory Open Access Journal |
issn | 2666-2477 |
language | English |
last_indexed | 2024-04-11T15:46:38Z |
publishDate | 2023-01-01 |
publisher | Elsevier |
record_format | Article |
series | HGG Advances |
spelling | doaj.art-2746dd1309aa4bb7aee44398da16eb7f; 2022-12-22T04:15:31Z; eng; Elsevier; HGG Advances; ISSN 2666-2477; 2023-01-01; volume 4, issue 1, article 100159; Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing; Xin Sheng [0], Lucy Xia [1], Jordan L. Cahoon [2], David V. Conti [3], Christopher A. Haiman [4], Linda Kachuri [5], Charleston W.K. Chiang [6]; [0] Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; [1] Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA; [2] Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA, and Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA; [3] Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA, and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA; [4] Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA, and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA; [5] Department of Epidemiology and Population Health, Stanford University, Stanford, CA 94305, USA; [6] Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA, Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA, and Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA (corresponding author); http://www.sciencedirect.com/science/article/pii/S2666247722000768; genome build; bioinformatics; reference genome; genetic associations; imputation |
title | Inverted genomic regions between reference genome builds in humans impact imputation accuracy and decrease the power of association testing |
topic | genome build; bioinformatics; reference genome; genetic associations; imputation |
url | http://www.sciencedirect.com/science/article/pii/S2666247722000768 |
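
The summary's point about palindromic variants can be illustrated with a minimal Python sketch. It is not the authors' liftOver-based heuristic, which this record does not describe in detail. It only shows why a strand flip in an inverted region goes unnoticed for A/T and C/G variants: complementing both alleles yields the same allele pair, so comparing alleles against the reference cannot reveal the flip. The function names and the `inverted_regions` intervals are hypothetical and chosen for illustration.

```python
# Complement of each nucleotide on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(a1, a2):
    """Return both alleles as read from the opposite strand."""
    return COMPLEMENT[a1.upper()], COMPLEMENT[a2.upper()]


def is_palindromic(a1, a2):
    """True for A/T and C/G variants, whose alleles are complements of each other."""
    return COMPLEMENT[a1.upper()] == a2.upper()


def strand_flip_detectable(a1, a2):
    """A flip is visible from the alleles alone only if the allele set changes."""
    return set(flip_strand(a1, a2)) != {a1.upper(), a2.upper()}


def needs_review(chrom, pos, a1, a2, inverted_regions):
    """Flag palindromic variants that fall inside an inverted interval.

    inverted_regions is a hypothetical list of (chrom, start, end) tuples
    with inclusive coordinates; real intervals would have to come from a
    comparison of the two genome builds.
    """
    in_inverted = any(
        c == chrom and start <= pos <= end for c, start, end in inverted_regions
    )
    return in_inverted and is_palindromic(a1, a2)


if __name__ == "__main__":
    regions = [("chr10", 100, 200)]  # made-up interval, for illustration only
    print(strand_flip_detectable("A", "G"))                # non-palindromic variant
    print(strand_flip_detectable("A", "T"))                # palindromic variant
    print(needs_review("chr10", 150, "C", "G", regions))   # palindromic, inside the interval
```

For a non-palindromic variant such as A/G, the flipped alleles (T/C) no longer match the expected pair, so a pipeline can catch and correct the flip. For A/T or C/G the flipped pair looks valid, which is why such variants inside inverted regions need to be checked by position instead.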