A machine learning-based SNP-set analysis approach for identifying disease-associated susceptibility loci

Abstract Identifying disease-associated susceptibility loci is one of the most pressing and crucial challenges in modeling complex diseases. Existing approaches to biomarker discovery are subject to several limitations including underpowered detection, neglect of variant interactions, and restricti...

Bibliographic Details
Main Authors: Princess P. Silva, Joverlyn D. Gaudillo, Julianne A. Vilela, Ranzivelle Marianne L. Roxas-Villanueva, Beatrice J. Tiangco, Mario R. Domingo, Jason R. Albia
Format: Article
Language: English
Published: Nature Portfolio 2022-09-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-022-19708-1
collection DOAJ
description Abstract Identifying disease-associated susceptibility loci is one of the most pressing and crucial challenges in modeling complex diseases. Existing approaches to biomarker discovery are subject to several limitations including underpowered detection, neglect of variant interactions, and restrictive dependence on prior biological knowledge. Addressing these challenges necessitates more ingenious ways of approaching the “missing heritability” problem. This study aims to discover disease-associated susceptibility loci by augmenting a previous genome-wide association study (GWAS) with an integration of random forest and cluster analysis. The proposed integrated framework is applied to hepatitis B virus surface antigen (HBsAg) seroclearance GWAS data. Multiple cluster analyses were performed on (1) single nucleotide polymorphisms (SNPs) considered significant by GWAS and (2) SNPs with the highest feature importance scores obtained using random forest. The resulting SNP-sets from the cluster analyses were subsequently tested for trait association. Three susceptibility loci possibly associated with HBsAg seroclearance were identified: (1) SNP rs2399971, (2) gene LINC00578, and (3) locus 11p15. SNP rs2399971 is a biomarker reported in the literature to be significantly associated with HBsAg seroclearance in patients who had received antiviral treatment. The latter two loci are linked with diseases influenced by the presence of hepatitis B virus infection. These findings demonstrate the potential of the proposed integrated framework in identifying disease-associated susceptibility loci. With further validation, the results herein could aid in better understanding complex disease etiologies and provide inputs for more advanced disease risk assessment for patients.
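The three stages named in the abstract (random forest feature importance, cluster analysis, SNP-set association testing) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic genotypes, the planted effect on SNPs 0 to 2, the use of scikit-learn's agglomerative clustering with three clusters, and the burden-score permutation test (the hypothetical helper `burden_permutation_pvalue`) are all assumptions chosen for illustration, since the abstract does not specify the clustering method or the association test.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for GWAS genotypes: 300 individuals x 50 SNPs, coded 0/1/2.
n_samples, n_snps = 300, 50
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)

# Hypothetical phenotype: SNPs 0-2 raise the odds of the trait (illustration only).
logit = 0.8 * (X[:, :3].sum(axis=1) - 3.0)
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Step 1: rank SNPs by random forest feature importance and keep the top 10.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_snps = np.argsort(forest.feature_importances_)[::-1][:10]

# Step 2: cluster the selected SNPs (columns) into SNP-sets.
# Agglomerative clustering with 3 clusters is an assumed choice.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X[:, top_snps].T)
snp_sets = {int(c): top_snps[labels == c] for c in np.unique(labels)}


def burden_permutation_pvalue(genotypes, phenotype, n_perm=500, seed=0):
    """Permutation p-value for the case/control difference in mean allele burden."""
    perm_rng = np.random.default_rng(seed)
    burden = genotypes.sum(axis=1)  # per-individual allele count over the SNP-set

    def stat(status):
        return abs(burden[status == 1].mean() - burden[status == 0].mean())

    observed = stat(phenotype)
    hits = sum(stat(perm_rng.permutation(phenotype)) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)


# Step 3: test each SNP-set for association with the trait.
p_values = {c: burden_permutation_pvalue(X[:, idx], y) for c, idx in snp_sets.items()}
for c, p in sorted(p_values.items()):
    print(f"SNP-set {c}: SNPs {sorted(snp_sets[c].tolist())}, permutation p = {p:.3f}")
```

Because this sketch selects SNPs and tests the resulting sets on the same samples, its p-values are optimistic; a held-out split or an independent cohort would be needed before reading them as evidence of association.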
first_indexed 2024-04-12T20:17:23Z
id doaj.art-8a21439e06184c13bc9f9e1e6903e85d
institution Directory Open Access Journal
issn 2045-2322
last_indexed 2024-04-12T20:17:23Z
spelling doaj.art-8a21439e06184c13bc9f9e1e6903e85d 2022-12-22T03:18:05Z eng
Nature Portfolio, Scientific Reports, ISSN 2045-2322, 2022-09-01, volume 12, issue 1, pages 1-10, DOI 10.1038/s41598-022-19708-1
Author affiliations:
Princess P. Silva: Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños
Joverlyn D. Gaudillo: Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños
Julianne A. Vilela: Philippine Genome Center Program for Agriculture, Office of the Vice Chancellor for Research and Extension, University of the Philippines Los Baños
Ranzivelle Marianne L. Roxas-Villanueva: Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños
Beatrice J. Tiangco: National Institute of Health, UP College of Medicine
Mario R. Domingo: Domingo AI Research Center (DARC Labs)
Jason R. Albia: Data-Driven Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños