A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes

Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic...

Bibliographic Details
Main Authors:	Pierre Faux, Pierre Geurts, Tom Druet
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2019-06-01
Series:	Frontiers in Genetics
Subjects:	random forests; supervised classification; haplotype mosaic; imputation; extra-trees
Online Access:	https://www.frontiersin.org/article/10.3389/fgene.2019.00562/full

_version_	1818042744761745408
author	Pierre Faux, Pierre Geurts, Tom Druet
author_facet	Pierre Faux, Pierre Geurts, Tom Druet
author_sort	Pierre Faux
collection	DOAJ
description	Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was fed with 30 relevant features for local haplotype matching. Repeated cross-validations allowed ranking these features in regard to their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes and it yielded average reliabilities similar or slightly better than IMPUTE2. 
Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation.
first_indexed	2024-12-10T08:51:11Z
format	Article
id	doaj.art-efc85df9b89743f48f1bc60c02f65b0a
institution	Directory Open Access Journal
issn	1664-8021
language	English
last_indexed	2024-12-10T08:51:11Z
publishDate	2019-06-01
publisher	Frontiers Media S.A.
record_format	Article
series	Frontiers in Genetics
spelling	doaj.art-efc85df9b89743f48f1bc60c02f65b0a | 2022-12-22T01:55:36Z | eng | Frontiers Media S.A. | Frontiers in Genetics | 1664-8021 | 2019-06-01 | 10 | 10.3389/fgene.2019.00562 | 429639 | A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes | Pierre Faux [0] | Pierre Geurts [1] | Tom Druet [2] | [0] Unit of Animal Genomics, GIGA-R, Faculty of Veterinary Medicine, University of Liège, Liège, Belgium | [1] Department of Electrical Engineering and Computer Science, Montefiore Institute, University of Liège, Liège, Belgium | [2] Unit of Animal Genomics, GIGA-R, Faculty of Veterinary Medicine, University of Liège, Liège, Belgium | Many genomic data analyses such as phasing, genotype imputation, or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop an extremely randomized trees framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was fed with 30 relevant features for local haplotype matching. Repeated cross-validations allowed ranking these features in regard to their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes and it yielded average reliabilities similar or slightly better than IMPUTE2. Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation. | https://www.frontiersin.org/article/10.3389/fgene.2019.00562/full | random forests | supervised classification | haplotype mosaic | imputation | extra-trees
title	A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes
topic	random forests; supervised classification; haplotype mosaic; imputation; extra-trees
url	https://www.frontiersin.org/article/10.3389/fgene.2019.00562/full

Similar Items