Descriptive forest: experiments on a novel tree-structure-generalization method for describing cardiovascular diseases
Abstract Background A decision tree is a crucial tool for describing the factors related to cardiovascular disease (CVD) risk and for predicting and explaining it for patients. Notably, the decision tree must be simplified because patients may have different primary topics or factors related to the...
Main Author: | Peera Liewlom |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2023-07-01 |
Series: | BMC Medical Informatics and Decision Making |
Subjects: | Information Science; Medical Informatics; Data Mining; Cardiovascular Diseases |
Online Access: | https://doi.org/10.1186/s12911-023-02228-x |
author | Peera Liewlom |
collection | DOAJ |
description | Abstract. Background: A decision tree is a crucial tool for describing the factors related to cardiovascular disease (CVD) risk and for predicting and explaining it for patients. Notably, the decision tree must be simplified because patients may have different primary topics or factors related to the CVD risk. Many decision trees can describe the data collected from multiple environmental heart disease risk datasets or a forest, where each tree describes the CVD risk for each primary topic. Methods: We demonstrate the presence of trees, or a forest, using an integrated CVD dataset obtained from multiple datasets. Moreover, we apply a novel method to an association-rule tree to discover each primary topic hidden within a dataset. To generalize the tree structure for descriptive tasks, each primary topic is a boundary node acting as a root node of a C4.5 tree with the least prodigality for the tree structure (PTS). All trees are assigned to a descriptive forest describing the CVD risks in a dataset. A descriptive forest is used to describe each CVD patient’s primary risk topics and related factors. We describe eight primary topics in a descriptive forest acquired from 918 records of a heart failure–prediction dataset with 11 features obtained from five datasets. We apply the proposed method to 253,680 records with 22 features from imbalanced classes of a heart disease health–indicators dataset. Results: The usability of the descriptive forest is demonstrated by a comparative study (on qualitative and quantitative tasks of the CVD-risk explanation) with a C4.5 tree generated from the same dataset but with the least PTS. The qualitative descriptive task confirms that compared to a single C4.5 tree, the descriptive forest is more flexible and can better describe the CVD risk, whereas the quantitative descriptive task confirms that it achieved higher coverage (recall) and correctness (accuracy and precision) and provided more detailed explanations. Additionally, for these tasks, the descriptive forest still outperforms the C4.5 tree. To reduce the problem of imbalanced classes, the ratio of classes in each subdataset generating each tree is investigated. Conclusion: The results provide confidence for using the descriptive forest. |
first_indexed | 2024-03-12T21:09:28Z |
format | Article |
id | doaj.art-e2bc240a347d41bc8b19237fefba2f6b |
institution | Directory Open Access Journal |
issn | 1472-6947 |
language | English |
last_indexed | 2024-03-12T21:09:28Z |
publishDate | 2023-07-01 |
publisher | BMC |
record_format | Article |
series | BMC Medical Informatics and Decision Making |
spelling | Peera Liewlom (Department of Computer and Information Science, Faculty of Science and Engineering, Kasetsart University). Descriptive forest: experiments on a novel tree-structure-generalization method for describing cardiovascular diseases. BMC Medical Informatics and Decision Making (ISSN 1472-6947), vol. 23, no. 1, pp. 1-25, BMC, 2023-07-01. Record timestamp 2023-07-30T11:17:20Z. https://doi.org/10.1186/s12911-023-02228-x |
title | Descriptive forest: experiments on a novel tree-structure-generalization method for describing cardiovascular diseases |
topic | Information Science; Medical Informatics; Data Mining; Cardiovascular Diseases |
url | https://doi.org/10.1186/s12911-023-02228-x |