Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis
Background: In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. Objective: Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) ma...
Main Authors: | Sang Min Nam, Thomas A Peterson, Kyoung Yul Seo, Hyun Wook Han, Jee In Kang |
---|---|
Format: | Article |
Language: | English |
Published: | JMIR Publications 2021-06-01 |
Series: | Journal of Medical Internet Research |
Online Access: | https://www.jmir.org/2021/6/e27344 |
_version_ | 1819292416635568128 |
---|---|
author | Sang Min Nam, Thomas A Peterson, Kyoung Yul Seo, Hyun Wook Han, Jee In Kang |
author_facet | Sang Min Nam, Thomas A Peterson, Kyoung Yul Seo, Hyun Wook Han, Jee In Kang |
author_sort | Sang Min Nam |
collection | DOAJ |
description | Background: In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large.
Objective: Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis.
Methods: An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network.
Results: The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes.
Conclusions: XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data. |
first_indexed | 2024-12-24T03:54:11Z |
format | Article |
id | doaj.art-8dd2341f6c444fefb4b67c9333876b28 |
institution | Directory Open Access Journal |
issn | 1438-8871 |
language | English |
last_indexed | 2024-12-24T03:54:11Z |
publishDate | 2021-06-01 |
publisher | JMIR Publications |
record_format | Article |
series | Journal of Medical Internet Research |
spelling | doaj.art-8dd2341f6c444fefb4b67c9333876b28; 2022-12-21T17:16:30Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2021-06-01; 23; 6; e27344; 10.2196/27344; Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis; Sang Min Nam (https://orcid.org/0000-0001-6903-6333); Thomas A Peterson (https://orcid.org/0000-0002-2562-6574); Kyoung Yul Seo (https://orcid.org/0000-0002-9855-1980); Hyun Wook Han (https://orcid.org/0000-0002-6918-5694); Jee In Kang (https://orcid.org/0000-0002-2818-7183); https://www.jmir.org/2021/6/e27344 |
spellingShingle | Sang Min Nam Thomas A Peterson Kyoung Yul Seo Hyun Wook Han Jee In Kang Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis Journal of Medical Internet Research |
title | Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis |
title_sort | discovery of depression associated factors from a nationwide population based survey epidemiological study using machine learning and network analysis |
url | https://www.jmir.org/2021/6/e27344 |
work_keys_str_mv | AT sangminnam discoveryofdepressionassociatedfactorsfromanationwidepopulationbasedsurveyepidemiologicalstudyusingmachinelearningandnetworkanalysis AT thomasapeterson discoveryofdepressionassociatedfactorsfromanationwidepopulationbasedsurveyepidemiologicalstudyusingmachinelearningandnetworkanalysis AT kyoungyulseo discoveryofdepressionassociatedfactorsfromanationwidepopulationbasedsurveyepidemiologicalstudyusingmachinelearningandnetworkanalysis AT hyunwookhan discoveryofdepressionassociatedfactorsfromanationwidepopulationbasedsurveyepidemiologicalstudyusingmachinelearningandnetworkanalysis AT jeeinkang discoveryofdepressionassociatedfactorsfromanationwidepopulationbasedsurveyepidemiologicalstudyusingmachinelearningandnetworkanalysis |
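
The Methods summary above mentions drawing a correlation network among factors. The record does not say which correlation measure or cutoff was used, so the following is only a minimal sketch under stated assumptions: Pearson correlation, an arbitrary cutoff of 0.5, and invented toy data. The names pearson, correlation_network, and toy are hypothetical and do not come from the article.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence has zero variance (an assumption made
    here only to avoid division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_network(columns, threshold=0.5):
    """Build an edge list {(factor_a, factor_b): r} keeping pairs with |r| >= threshold.

    `columns` maps a factor name to its list of values. The threshold is an
    illustrative choice, not a value taken from the study.
    """
    edges = {}
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Toy data invented for illustration; these are not survey values.
toy = {
    "stress": [1, 2, 3, 4, 5, 6],
    "income": [6, 5, 4, 3, 2, 1],
    "height": [2, 1, 2, 1, 2, 1],
}
network = correlation_network(toy, threshold=0.5)
for (a, b), r in network.items():
    print(f"{a} -- {b}: r = {r:.2f}")
```

In this toy input only the stress and income pair is strongly correlated, so it is the only edge kept. A real analysis of survey data would also need to account for sampling weights and mixed variable types, which this sketch ignores.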