Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Bibliographic Details
Main Authors: Maria del Mar Muñiz Moreno, Claire Gavériaux-Ruff, Yann Herault
Format: Article
Language: English
Published: BMC 2023-01-01
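Series: BMC Bioinformatics
Subjects: R package; Phenotypic data; Clinical data; Discrimination; Generalized linear models; Random forest
Online Access: https://doi.org/10.1186/s12859-022-05111-0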
author Maria del Mar Muñiz Moreno
Claire Gavériaux-Ruff
Yann Herault
collection DOAJ
description Abstract Background In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, responses to treatment, or sexual dimorphism. However, the data often suffer from a low number of samples, a high number of variables, or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test, so correlations should be assessed and a more complex statistical framework is necessary for the analysis. Packages that provide the required analysis tools already exist, but they are not available together, which makes choosing and implementing a method difficult for non-statisticians.
Results We present Gdaphen, a fast joint pipeline that identifies the most important qualitative and quantitative predictor variables for discriminating between genotypes, treatments, or sexes. Gdaphen takes behavioral or clinical data as input and uses Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or to anonymize genotype-based recordings. As optimized input, Gdaphen uses the non-correlated variables that show a correlation of 30% or higher on the MFA-Principal Component Analysis (PCA), which increases the discriminative power and the efficiency of the classifier's predictive model. Gdaphen can determine the strongest variables that predict gene dosage effects using Generalized Linear Model (GLM)-based classifiers, or determine the most discriminative non-linearly distributed variables using its Random Forest (RF) implementation. Moreover, Gdaphen reports the efficacy of each classifier and offers several visualization options that present the results as easily readable, publication-ready plots. We demonstrate Gdaphen's capabilities on several datasets and provide easy-to-follow vignettes.
Conclusions Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers by providing an integrated framework to perform: (1) pre-processing steps such as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state-of-the-art, publication-ready visualizations that support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen , together with vignettes, function documentation, and examples to guide users in their own implementation.
first_indexed 2024-04-10T19:39:39Z
format Article
id doaj.art-39bbe89150a94398b7a0067274ca96ff
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-04-10T19:39:39Z
publishDate 2023-01-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-39bbe89150a94398b7a0067274ca96ff 2023-01-29T12:23:08Z eng BMC BMC Bioinformatics 1471-2105 2023-01-01 24 1 1 18 10.1186/s12859-022-05111-0
Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data
Maria del Mar Muñiz Moreno (0), Claire Gavériaux-Ruff (1), Yann Herault (2)
(0, 1, 2) Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC)
https://doi.org/10.1186/s12859-022-05111-0
R package; Phenotypic data; Clinical data; Discrimination; Generalized linear models; Random forest
title Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data
topic R package
Phenotypic data
Clinical data
Discrimination
Generalized linear models
Random forest
url https://doi.org/10.1186/s12859-022-05111-0