High-accuracy HLA type inference from whole-genome sequencing data using population reference graphs

Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes...

Bibliographic Details
Main Authors:	Dilthey, A, Gourraud, P, Mentzer, A, Cereb, N, Iqbal, Z, McVean, G
Format:	Journal article
Published:	Public Library of Science 2016

author	Dilthey, A Gourraud, P Mentzer, A Cereb, N Iqbal, Z McVean, G
collection	OXFORD
description	Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100 bp, 11 samples with 2 x 250 bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30–250 CPU hours per sample) remain a significant challenge to practical application.
first_indexed	2024-03-07T05:13:10Z
format	Journal article
id	oxford-uuid:dc41e599-d128-4f87-b0bd-f4dce0eb1462
institution	University of Oxford
last_indexed	2024-03-07T05:13:10Z
publishDate	2016
publisher	Public Library of Science
record_format	dspace
title	High-accuracy HLA type inference from whole-genome sequencing data using population reference graphs

Similar Items