Determining relative importance of variables in developing and validating predictive models

Abstract. Background: Multiple regression models are used in a wide range of scientific disciplines and automated model selection procedures are frequently used to identify independent predictors. However, determination of relative importance of potential...

Bibliographic Details
Main Authors:	Atenafu Eshetu G, Beyene Joseph, Hamid Jemila S, To Teresa, Sung Lillian
Format:	Article
Language:	English
Published:	BMC 2009-09-01
Series:	BMC Medical Research Methodology
Online Access:	http://www.biomedcentral.com/1471-2288/9/64

author	Atenafu Eshetu G, Beyene Joseph, Hamid Jemila S, To Teresa, Sung Lillian
collection	DOAJ
description	Abstract
Background: Multiple regression models are used in a wide range of scientific disciplines and automated model selection procedures are frequently used to identify independent predictors. However, determination of relative importance of potential predictors and validating the fitted models for their stability, predictive accuracy and generalizability are often overlooked or not done thoroughly.
Methods: Using a case study aimed at predicting children with acute lymphoblastic leukemia (ALL) who are at low risk of Tumor Lysis Syndrome (TLS), we propose and compare two strategies, bootstrapping and random split of data, for ordering potential predictors according to their relative importance with respect to model stability and generalizability. We also propose an approach based on relative increase in percentage of explained variation and area under the Receiver Operating Characteristic (ROC) curve for developing models where variables from our ordered list enter the model according to their importance. An additional data set aimed at identifying predictors of prostate cancer penetration is also used for illustrative purposes.
Results: Age is chosen to be the most important predictor of TLS. It is selected 100% of the time using the bootstrapping approach. Using the random split method, it is selected 99% of the time in the training data and is significant (at 5% level) 98% of the time in the validation data set. This indicates that age is a stable predictor of TLS with good generalizability. The second most important variable is white blood cell count (WBC). Our methods also identified an important predictor of TLS that was otherwise omitted if relying on any of the automated model selection procedures alone. A group at low risk of TLS consists of children younger than 10 years of age, without T-cell immunophenotype, whose baseline WBC is < 20 × 10^9/L and palpable spleen is < 2 cm. For the prostate cancer data set, the Gleason score and digital rectal exam are identified to be the most important indicators of whether tumor has penetrated the prostate capsule.
Conclusion: Our model selection procedures based on bootstrap re-sampling and repeated random split techniques can be used to assess the strength of evidence that a variable is truly an independent and reproducible predictor. Our methods, therefore, can be used for developing stable and reproducible models with good performances. Moreover, our methods can serve as a good tool for validating a predictive model. Previous biological and clinical studies support the findings based on our selection and validation strategies. However, extensive simulations may be required to assess the performance of our methods under different scenarios as well as check their sensitivity to a random fluctuation in the data.
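The two ordering strategies named in the abstract (bootstrapping and repeated random split) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the selection procedure, so the sketch substitutes a simple correlation-threshold rule as a stand-in, and it uses the same rule in place of the 5% significance check on the validation half. All function names, the threshold, the 50/50 split, and the synthetic data are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return abs(cov / (sx * sy))


def select(rows, outcome, predictors, threshold):
    """Hypothetical stand-in for an automated selection procedure:
    keep every predictor whose |correlation| with the outcome >= threshold."""
    ys = [r[outcome] for r in rows]
    return {p for p in predictors
            if abs_corr([r[p] for r in rows], ys) >= threshold}


def bootstrap_frequency(rows, outcome, predictors,
                        n_boot=200, threshold=0.3, seed=0):
    """Fraction of bootstrap resamples in which each predictor is selected."""
    rng = random.Random(seed)
    counts = dict.fromkeys(predictors, 0)
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))  # resample with replacement
        for p in select(sample, outcome, predictors, threshold):
            counts[p] += 1
    return {p: counts[p] / n_boot for p in predictors}


def split_frequency(rows, outcome, predictors,
                    n_split=200, threshold=0.3, seed=0):
    """For each predictor, return (fraction selected in the training half,
    fraction selected in training AND confirmed in the validation half)."""
    rng = random.Random(seed)
    half = len(rows) // 2  # assumed 50/50 split
    train_counts = dict.fromkeys(predictors, 0)
    both_counts = dict.fromkeys(predictors, 0)
    for _ in range(n_split):
        shuffled = rng.sample(rows, k=len(rows))
        train, valid = shuffled[:half], shuffled[half:]
        chosen = select(train, outcome, predictors, threshold)
        confirmed = chosen & select(valid, outcome, predictors, threshold)
        for p in chosen:
            train_counts[p] += 1
        for p in confirmed:
            both_counts[p] += 1
    return {p: (train_counts[p] / n_split, both_counts[p] / n_split)
            for p in predictors}


def make_demo_rows(n=200, seed=1):
    """Synthetic data only: x_signal drives the outcome, x_noise does not."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        rows.append({"x_signal": x,
                     "x_noise": rng.gauss(0, 1),
                     "outcome": 1 if x > 0.5 else 0})
    return rows
```

Calling bootstrap_frequency(make_demo_rows(), "outcome", ["x_signal", "x_noise"]) gives one selection frequency per predictor; sorting predictors by that frequency yields an ordered list of the kind the abstract describes. A real analysis would replace the select function with the model selection procedure actually in use.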
first_indexed	2024-04-12T20:49:22Z
format	Article
id	doaj.art-843c1b6a260741d0bd8099a9643cab31
institution	Directory Open Access Journal
issn	1471-2288
language	English
last_indexed	2024-04-12T20:49:22Z
publishDate	2009-09-01
publisher	BMC
record_format	Article
series	BMC Medical Research Methodology
title	Determining relative importance of variables in developing and validating predictive models
url	http://www.biomedcentral.com/1471-2288/9/64
