A Diabetes Prediction System Based on Incomplete Fused Data Sources

In recent years, diabetes has been affecting an increasingly young population, making timely and effective prediction of diabetes a key problem, especially when only a single data source is available. Meanwhile, many data sources on diabetes patients have been collected around the world, and it is extremely...

Bibliographic Details
Main Authors: Zhaoyi Yuan, Hao Ding, Guoqing Chao, Mingqiang Song, Lei Wang, Weiping Ding, Dianhui Chu
Format: Article
Language: English
Published: MDPI AG 2023-04-01
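Series: Machine Learning and Knowledge Extraction
Subjects: diabetes prediction; data sources fusion; missing values imputation; graph representation learning; ensemble learning
Online Access: https://www.mdpi.com/2504-4990/5/2/23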
collection DOAJ
description In recent years, diabetes has been affecting an increasingly young population, making timely and effective prediction of diabetes a key problem, especially when only a single data source is available. Meanwhile, many data sources on diabetes patients have been collected around the world, and integrating these heterogeneous sources is extremely important for predicting diabetes accurately. Different data sources may record different predictors; in other words, some features exist only in certain sources, so fusing the sources produces missing values. To account for the uncertainty of the missing values in the fused dataset, multiple imputation and a graph-representation-based method are used to impute them. A logistic regression model and a stacking strategy are then applied for training and prediction on the fused dataset. The results show that combining heterogeneous datasets and imputing the missing values produced by the fusion process can effectively improve diabetes prediction performance. In addition, the proposed method can be extended to any scenario in which heterogeneous datasets share the same label types but have different feature attributes.
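The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names (`glucose`, `bmi`, `hba1c`), the toy records, and the helper functions are all hypothetical, and the imputation shown is a plain random draw from observed values (hot-deck style), standing in for the multiple-imputation and graph-based methods the record mentions. The classifier and stacking stages are not shown.

```python
import random


def fuse(sources):
    """Stack records from several sources over the union of their feature names.

    A feature that a source does not record becomes None (missing) in its rows.
    """
    columns = sorted(
        {k for rows in sources for row in rows for k in row if k != "label"}
    )
    fused = [
        {**{c: row.get(c) for c in columns}, "label": row["label"]}
        for rows in sources
        for row in rows
    ]
    return columns, fused


def impute_once(columns, rows, rng):
    """Fill each missing value with a random draw from that column's observed values."""
    observed = {c: [r[c] for r in rows if r[c] is not None] for c in columns}
    return [
        {**r, **{c: rng.choice(observed[c]) for c in columns if r[c] is None}}
        for r in rows
    ]


def multiple_imputation(columns, rows, m=5, seed=0):
    """Produce m completed copies of the fused data; the input rows are not modified."""
    rng = random.Random(seed)
    return [impute_once(columns, rows, rng) for _ in range(m)]


# Hypothetical toy sources: same label, partly different features.
source_a = [
    {"glucose": 148, "bmi": 33.6, "label": 1},
    {"glucose": 85, "bmi": 26.6, "label": 0},
]
source_b = [
    {"glucose": 160, "hba1c": 7.9, "label": 1},
    {"glucose": 90, "hba1c": 5.2, "label": 0},
]

columns, fused = fuse([source_a, source_b])
completed = multiple_imputation(columns, fused, m=3)
```

Each of the `m` completed datasets could then be passed to a classifier and the predictions combined, which is how multiple imputation carries the uncertainty of the missing values through to the prediction.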
first_indexed 2024-03-11T02:13:48Z
id doaj.art-c62664876db94954af88745a8831b4a6
institution Directory Open Access Journal
issn 2504-4990
last_indexed 2024-03-11T02:13:48Z
spelling doaj.art-c62664876db94954af88745a8831b4a6
2023-11-18T11:22:04Z
eng
MDPI AG
Machine Learning and Knowledge Extraction
2504-4990
2023-04-01
Volume 5, Issue 2, pp. 384-399
10.3390/make5020023
A Diabetes Prediction System Based on Incomplete Fused Data Sources
Zhaoyi Yuan (0), Hao Ding (1), Guoqing Chao (2), Mingqiang Song (3), Lei Wang (4), Weiping Ding (5), Dianhui Chu (6)
(0) School of Computer Sciences and Technology, Harbin Institute of Technology, Weihai 264209, China
(1) School of Computer Sciences and Technology, Harbin Institute of Technology, Weihai 264209, China
(2) School of Computer Sciences and Technology, Harbin Institute of Technology, Weihai 264209, China
(3) Department of Endocrinology and Metabolism, Weihai Municipal Hospital, Affiliated to Shandong University, Weihai 264209, China
(4) CAS Key Laboratory of Bio-Medical Diagnostics, Suzhou Institute of Biomedical Engineering and Technology Chinese Academy of Sciences, Suzhou 215163, China
(5) School of Information Science and Technology, Nantong University, Nantong 226019, China
(6) School of Computer Sciences and Technology, Harbin Institute of Technology, Weihai 264209, China
https://www.mdpi.com/2504-4990/5/2/23
diabetes prediction; data sources fusion; missing values imputation; graph representation learning; ensemble learning
topic diabetes prediction
data sources fusion
missing values imputation
graph representation learning
ensemble learning