Prediction of lung cancer risk in Chinese population with genetic‐environment factor using extreme gradient boosting

Bibliographic Details
Main Authors: Yutao Li, Zixiu Zou, Zhunyi Gao, Yi Wang, Man Xiao, Chang Xu, Gengxi Jiang, Haijian Wang, Li Jin, Jiucun Wang, Huai Zhou Wang, Shicheng Guo, Junjie Wu
Format: Article
Language: English
Published: Wiley 2022-12-01
Series: Cancer Medicine
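Subjects: Chinese population; extreme gradient boosting; lung cancer; risk model; single nucleotide polymorphisms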
Online Access: https://doi.org/10.1002/cam4.4800
collection DOAJ
description Abstract

Background: Detecting early‐stage lung cancer is critical to reducing lung cancer mortality; however, existing models based on germline variants perform poorly, and new models are needed. This study aimed to use extreme gradient boosting to develop a predictive model for the early diagnosis of lung cancer in a multicenter case–control study.

Materials and Methods: A total of 974 cases and 1005 controls were recruited in Shanghai and Taizhou, and 61 single nucleotide polymorphisms (SNPs) were genotyped. Multivariate logistic regression was used to calculate the association between signal SNPs and lung cancer risk. Logistic regression (LR) and extreme gradient boosting (XGBoost), a large‐scale machine learning algorithm, were used to build the lung cancer risk models. For both models, 10‐fold cross‐validation was performed, and predictive performance was evaluated by the area under the curve (AUC).

Results: After FDR adjustment, TYMS rs3819102 and BAG6 rs1077393 were significantly associated with lung cancer risk (p < 0.05). For lung cancer risk prediction, the model built only on epidemiological factors attained an AUC of 0.703 with LR and 0.744 with XGBoost. Compared with the epidemiology‐only LR model, adding SNPs and applying XGBoost increased the AUC to 0.759 (p < 0.001). BAG6 rs1077393 was the most important predictor among all SNPs in the XGBoost model, followed by TERT rs2735845 and CAMKK1 rs7214723. In stratified analysis, applying XGBoost and adding SNPs significantly raised the AUC for lung adenocarcinoma (ADC) from 0.639 to 0.699 (p = 0.009), whereas the best model for lung squamous cell carcinoma (SCC) was the LR model with epidemiology and SNPs (AUC = 0.833), compared with the XGBoost model (AUC = 0.816).

Conclusion: Our lung cancer risk prediction models in the Chinese population have a strong predictive ability, especially for SCC. Adding SNPs and applying the XGBoost algorithm to the epidemiologic‐based logistic regression risk prediction model significantly improves model performance.
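The evaluation scheme named in the abstract (10‐fold cross‐validation scored by AUC) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the data are synthetic, the single-feature "model" (`fit_mean_difference`) is a hypothetical stand-in for LR or XGBoost, and both the stratified split and the pooling of out-of-fold scores into one AUC are assumptions, since the abstract does not specify them.

```python
import random


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one case and one control")
    rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds, dealing each class out round-robin after shuffling."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validated_auc(xs, ys, fit, k=10, seed=0):
    """Fit on k-1 folds, score the held-out fold, then pool all out-of-fold scores."""
    out_of_fold = [0.0] * len(ys)
    for held_out in stratified_folds(ys, k, seed):
        held = set(held_out)
        train = [i for i in range(len(ys)) if i not in held]
        scorer = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            out_of_fold[i] = scorer(xs[i])
    return auc(ys, out_of_fold)


def fit_mean_difference(train_x, train_y):
    """Hypothetical toy model: learns only whether higher values indicate a case."""
    cases = [x for x, y in zip(train_x, train_y) if y == 1]
    controls = [x for x, y in zip(train_x, train_y) if y == 0]
    higher_in_cases = sum(cases) / len(cases) >= sum(controls) / len(controls)
    direction = 1.0 if higher_in_cases else -1.0
    return lambda x: direction * x


def synthetic_cohort(n_cases, n_controls, seed=0):
    """Synthetic single-feature data (not study data): cases are shifted up by 1 SD."""
    rng = random.Random(seed)
    xs = [rng.gauss(1.0, 1.0) for _ in range(n_cases)]
    xs += [rng.gauss(0.0, 1.0) for _ in range(n_controls)]
    ys = [1] * n_cases + [0] * n_controls
    return xs, ys
```

Calling `cross_validated_auc(*synthetic_cohort(200, 200), fit_mean_difference)` returns one pooled out-of-fold AUC. Replacing `fit_mean_difference` with a function that trains a real classifier and returns its scoring function would give the same evaluation loop for an LR or gradient-boosting model.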
first_indexed 2024-04-11T06:08:33Z
format Article
id doaj.art-5345a1dea55f48e280c4b3c5e39f5ef9
institution Directory Open Access Journal
issn 2045-7634
language English
last_indexed 2025-03-21T10:10:33Z
publishDate 2022-12-01
publisher Wiley
record_format Article
series Cancer Medicine
spelling doaj.art-5345a1dea55f48e280c4b3c5e39f5ef9
2024-07-04T06:51:08Z
eng
Wiley
Cancer Medicine, ISSN 2045-7634
2022-12-01, Volume 11, Issue 23, pp. 4469-4478
10.1002/cam4.4800
Prediction of lung cancer risk in Chinese population with genetic‐environment factor using extreme gradient boosting
Yutao Li (School of Life Sciences, Fudan University, Shanghai, China)
Zixiu Zou (School of Life Sciences, Fudan University, Shanghai, China)
Zhunyi Gao (Company 6 of Basic Medical School, Navy Military Medical University, Shanghai, China)
Yi Wang (School of Life Sciences, Fudan University, Shanghai, China)
Man Xiao (Department of Biochemistry and Molecular Biology, Hainan Medical University, Haikou, China)
Chang Xu (Clinical College of Xiangnan University, Chenzhou, China)
Gengxi Jiang (Department of Thoracic Surgery, the First Affiliated Hospital of Naval Medical University (Second Military Medical University), Shanghai, China)
Haijian Wang (School of Life Sciences, Fudan University, Shanghai, China)
Li Jin (School of Life Sciences, Fudan University, Shanghai, China)
Jiucun Wang (School of Life Sciences, Fudan University, Shanghai, China)
Huai Zhou Wang (Department of Laboratory Diagnosis, the First Affiliated Hospital of Naval Medical University (Second Military Medical University), Shanghai, China)
Shicheng Guo (School of Life Sciences, Fudan University, Shanghai, China)
Junjie Wu (School of Life Sciences, Fudan University, Shanghai, China)
https://doi.org/10.1002/cam4.4800
Chinese population; extreme gradient boosting; lung cancer; risk model; single nucleotide polymorphisms
title Prediction of lung cancer risk in Chinese population with genetic‐environment factor using extreme gradient boosting
topic Chinese population
extreme gradient boosting
lung cancer
risk model
single nucleotide polymorphisms
url https://doi.org/10.1002/cam4.4800