Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Abstract Background In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, the response to treatments or sexual dimorphism. However, the data often s...


Bibliographic Details
Main Authors:	Maria del Mar Muñiz Moreno, Claire Gavériaux-Ruff, Yann Herault
Format:	Article
Language:	English
Published:	BMC 2023-01-01
Series:	BMC Bioinformatics
Subjects:	R package; Phenotypic data; Clinical data; Discrimination; Generalized linear models; Random forest
Online Access:	https://doi.org/10.1186/s12859-022-05111-0

author	Maria del Mar Muñiz Moreno; Claire Gavériaux-Ruff; Yann Herault
author_facet	Maria del Mar Muñiz Moreno; Claire Gavériaux-Ruff; Yann Herault
author_sort	Maria del Mar Muñiz Moreno
collection	DOAJ
description	Abstract Background: In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, responses to treatment, or sexual dimorphism. However, the data often suffer from a low number of samples, a high number of variables, or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test, so correlations should be assessed and a more complex statistical framework is necessary for the analysis. Packages that provide analysis tools already exist, but they are not available together, which makes choosing and implementing a method difficult for non-statisticians. Results: We present Gdaphen, a fast joint pipeline that identifies the most important qualitative and quantitative predictor variables for discriminating between genotypes, treatments, or sex. Gdaphen takes behavioral or clinical data as input and uses Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or to anonymize genotype-based recordings. As optimized input, Gdaphen uses the non-correlated variables with a correlation of 30% or higher on the MFA-Principal Component Analysis (PCA), which increases the discriminative power and the efficiency of the classifier's predictive model. Gdaphen can determine the strongest variables that predict gene dosage effects using Generalized Linear Model (GLM)-based classifiers, or determine the most discriminative non-linearly distributed variables using a Random Forest (RF) implementation. Moreover, Gdaphen reports the efficacy of each classifier and offers several visualization options that help to understand and support the results, as easily readable plots ready to be included in publications. We demonstrate Gdaphen's capabilities on several datasets and provide easy-to-follow vignettes.
Conclusions: Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers by providing an integrated framework to perform: (1) pre-processing steps such as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state-of-the-art visualizations ready for publication to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen, together with vignettes, documentation for the functions, and examples to guide you in your own implementation.
first_indexed	2024-04-10T19:39:39Z
format	Article
id	doaj.art-39bbe89150a94398b7a0067274ca96ff
institution	Directory of Open Access Journals
issn	1471-2105
language	English
last_indexed	2024-04-10T19:39:39Z
publishDate	2023-01-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-39bbe89150a94398b7a0067274ca96ff; 2023-01-29T12:23:08Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2023-01-01; 241118; 10.1186/s12859-022-05111-0; Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data; Maria del Mar Muñiz Moreno (0), Claire Gavériaux-Ruff (1), Yann Herault (2); (0, 1, 2) Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC); https://doi.org/10.1186/s12859-022-05111-0; R package; Phenotypic data; Clinical data; Discrimination; Generalized linear models; Random forest
title	Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data
topic	R package; Phenotypic data; Clinical data; Discrimination; Generalized linear models; Random forest
url	https://doi.org/10.1186/s12859-022-05111-0