Performance of random forest when SNPs are in linkage disequilibrium

Bibliographic Details
Main Authors: Cupples L Adrienne, Yu Yi, Meng Yan A, Farrer Lindsay A, Lunetta Kathryn L
Format: Article
Language: English
Published: BMC 2009-03-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/10/78
collection DOAJ
description Abstract
Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: changing the tree-building algorithm so that each tree in an RF is built only with SNPs in LE, modifying the importance measure (IM), and using haplotypes instead of SNPs to build an RF.
Results: We evaluated the performance of our alternative methods by simulating a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.
Conclusion: Our results suggest that by strategically revising the tree-building or importance measure calculation of the Random Forest method, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers the advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.
first_indexed 2024-04-13T03:11:42Z
id doaj.art-1ed2e1cca4924b3c84ca48db9d0c5a26
institution Directory of Open Access Journals
issn 1471-2105
last_indexed 2024-04-13T03:11:42Z
doi 10.1186/1471-2105-10-78