Imbalanced target prediction with pattern discovery on clinical data repositories

Abstract Background Clinical data repositories (CDR) have great potential to improve outcome prediction and risk modeling. However, most clinical studies require careful study design, dedicated data collection efforts, and sophisticated modeling techniques before a hypothesis can be tested. We aim t...


Bibliographic Details
Main Authors:	Tak-Ming Chan, Yuxi Li, Choo-Chiap Chiau, Jane Zhu, Jie Jiang, Yong Huo
Format:	Article
Language:	English
Published:	BMC 2017-04-01
Series:	BMC Medical Informatics and Decision Making
Subjects:	Pattern discovery; Data mining; Prediction; Imbalanced data; Clinical data repository
Online Access:	http://link.springer.com/article/10.1186/s12911-017-0443-3

_version_	1818494251008262144
author	Tak-Ming Chan; Yuxi Li; Choo-Chiap Chiau; Jane Zhu; Jie Jiang; Yong Huo
author_sort	Tak-Ming Chan
collection	DOAJ
description	Background: Clinical data repositories (CDR) have great potential to improve outcome prediction and risk modeling. However, most clinical studies require careful study design, dedicated data collection efforts, and sophisticated modeling techniques before a hypothesis can be tested. We aim to bridge this gap, so that clinical domain users can perform first-hand prediction on existing repository data without complicated handling, and obtain insightful patterns of imbalanced targets for a formal study before it is conducted. We specifically target for interpretability for domain users where the model can be conveniently explained and applied in clinical practice.
Methods: We propose an interpretable pattern model which is noise (missing) tolerant for practice data. To address the challenge of imbalanced targets of interest in clinical research, e.g., deaths less than a few percent, the geometric mean of sensitivity and specificity (G-mean) optimization criterion is employed, with which a simple but effective heuristic algorithm is developed.
Results: We compared pattern discovery to clinically interpretable methods on two retrospective clinical datasets. They contain 14.9% deaths in 1 year in the thoracic dataset and 9.1% deaths in the cardiac dataset, respectively. In spite of the imbalance challenge shown on other methods, pattern discovery consistently shows competitive cross-validated prediction performance. Compared to logistic regression, Naïve Bayes, and decision tree, pattern discovery achieves statistically significant (p-values < 0.01, Wilcoxon signed rank test) favorable averaged testing G-means and F1-scores (harmonic mean of precision and sensitivity). Without requiring sophisticated technical processing of data and tweaking, the prediction performance of pattern discovery is consistently comparable to the best achievable performance.
Conclusions: Pattern discovery has demonstrated to be robust and valuable for target prediction on existing clinical data repositories with imbalance and noise. The prediction results and interpretable patterns can provide insights in an agile and inexpensive way for the potential formal studies.
first_indexed	2024-12-10T18:04:07Z
format	Article
id	doaj.art-657561dac3804ff9908914edc8c78b40
institution	Directory Open Access Journal
issn	1472-6947
language	English
last_indexed	2024-12-10T18:04:07Z
publishDate	2017-04-01
publisher	BMC
record_format	Article
series	BMC Medical Informatics and Decision Making
spelling	doaj.art-657561dac3804ff9908914edc8c78b40; 2022-12-22T01:38:40Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2017-04-01; 17; 1; 1; 12; 10.1186/s12911-017-0443-3; Imbalanced target prediction with pattern discovery on clinical data repositories; Tak-Ming Chan (Philips Research China - Health Systems, China); Yuxi Li (Peking University First Hospital); Choo-Chiap Chiau (Philips Research China - Health Systems, China); Jane Zhu (Philips Research China - Health Systems, China); Jie Jiang (Peking University First Hospital); Yong Huo (Peking University First Hospital); http://link.springer.com/article/10.1186/s12911-017-0443-3; Pattern discovery; Data mining; Prediction; Imbalanced data; Clinical data repository
title	Imbalanced target prediction with pattern discovery on clinical data repositories
topic	Pattern discovery; Data mining; Prediction; Imbalanced data; Clinical data repository
url	http://link.springer.com/article/10.1186/s12911-017-0443-3
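
The description field of this record defines two evaluation measures: G-mean (the geometric mean of sensitivity and specificity) and F1-score (the harmonic mean of precision and sensitivity). The following is a minimal Python sketch of those two definitions only; it is not the authors' pattern discovery algorithm. The function name `g_mean_and_f1` and all confusion-matrix counts are hypothetical, chosen to mimic a cohort with roughly 9% positive cases, and are not results from the article.

```python
import math


def g_mean_and_f1(tp, fp, tn, fn):
    """Return (G-mean, F1) computed from confusion-matrix counts.

    Ratios with an empty denominator are treated as 0.0 (an assumption
    made for this sketch so that degenerate classifiers do not raise).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # G-mean: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)

    # F1: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return g_mean, f1


# Hypothetical cohort: 1000 patients, 91 positives (made-up numbers).
# A classifier that always predicts "negative" reaches 90.9% accuracy,
# yet both measures collapse to zero because no positive is found.
print(g_mean_and_f1(tp=0, fp=0, tn=909, fn=91))  # (0.0, 0.0)

# A hypothetical classifier that trades some specificity for sensitivity.
print(g_mean_and_f1(tp=60, fp=150, tn=759, fn=31))
```

The first call illustrates why accuracy alone is misleading on imbalanced targets: a model that ignores the minority class scores well on accuracy but gets a G-mean and F1 of zero.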


Similar Items