Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients

Abstract. Background: Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relatio...

Bibliographic Details
Main Authors: Christopher Baechle, Ankur Agarwal, Xingquan Zhu
Format: Article
Language: English
Published: SpringerOpen 2017-04-01
Series: Journal of Big Data
Subjects: Big data; Decision support system; Data mining; Health informatics
Online Access: http://link.springer.com/article/10.1186/s40537-017-0067-6
collection DOAJ
description Abstract
Background: Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relationships with COPD can be difficult because disease prevalence within a population often influences results. A method that better separates occurrence within COPD patients from population prevalence would be desirable. Large hospital systems may have tens of millions of patient records spanning decades of collection, so a scalable big data approach is desirable. The presented method, Co-Occurring Evidence Discovery (COED), provides a methodology and framework to address these issues.
Methods: Natural Language Processing methods are used to examine 64,371 deidentified clinical notes and discover associations between COPD and medical terms. Apache cTAKES is leveraged to annotate and structure clinical notes. Several extensions to cTAKES have been written to parallelize the annotation of large sets of clinical notes. A co-occurrence score that penalizes terms based on their prevalence is presented, alongside a baseline method traditionally used for finding co-occurrence. These scoring systems are implemented using Apache Spark. Dictionaries of ground truth terms for diseases, medications, and symptoms have been created using clinical domain knowledge. COED and baseline methods are compared using precision, recall, and F1 score.
Results: The highest scoring diseases using COED are lung and respiratory diseases. In contrast, baseline methods for co-occurrence rank diseases with high population prevalence highest. Medications and symptoms evaluated with COED show similar results. When evaluated against ground truth dictionaries, the maximum improvements in recall for symptoms, diseases, and medications were 0.212, 0.130, and 0.174, respectively. The maximum improvements in precision for symptoms, diseases, and medications were 0.303, 0.333, and 0.180, respectively. The median increases in F1 score for symptoms, diseases, and medications were 38.1%, 23.0%, and 17.1%, respectively. A paired t-test found the F1 score increases to be statistically significant (p < 0.01).
Conclusion: Penalizing terms that are highly frequent in the corpus results in better precision and recall. Penalizing frequently occurring terms gives a better picture of the diseases, symptoms, and medications co-occurring with COPD. Using a mathematical and computational approach rather than a purely expert-driven approach, large dictionaries of COPD-related terms can be assembled in a short amount of time.
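The contrast the abstract draws between a raw co-occurrence baseline and a prevalence-penalized score can be illustrated with a small sketch. The abstract does not state the COED formula, so the penalized score below is a stand-in (a simple lift-style ratio), not the authors' method. The function names, the toy notes, and the toy ground truth dictionary are all hypothetical, and plain Python is used here instead of the Apache Spark implementation the abstract mentions.

```python
from collections import Counter


def baseline_scores(notes, target):
    """Raw co-occurrence: count of target-containing notes that mention each term."""
    counts = Counter()
    for terms in notes:
        if target in terms:
            counts.update(t for t in terms if t != target)
    return dict(counts)


def penalized_scores(notes, target):
    """Stand-in penalized score (an assumption, NOT the paper's COED formula).

    Divides the share of target notes containing a term by the share of all
    notes containing it, so terms that are common everywhere score lower.
    """
    n_total = len(notes)
    n_target = sum(1 for terms in notes if target in terms)
    corpus = Counter(t for terms in notes for t in terms)
    co = baseline_scores(notes, target)
    return {t: (c / n_target) / (corpus[t] / n_total) for t, c in co.items()}


def rank(scores):
    """Terms sorted by descending score, ties broken alphabetically."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def precision_recall_f1(ranked, truth, k):
    """Precision, recall, and F1 of the top-k ranked terms against a truth set."""
    hits = sum(1 for t in ranked[:k] if t in truth)
    precision = hits / k if k else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy corpus: each note is the set of terms annotated in it.
notes = [
    {"copd", "dyspnea", "hypertension"},
    {"copd", "dyspnea", "hypertension"},
    {"copd", "emphysema", "hypertension"},
    {"hypertension"},
    {"hypertension"},
    {"hypertension", "diabetes"},
]
truth = {"dyspnea", "emphysema"}  # hypothetical ground truth dictionary

for name, scorer in [("baseline", baseline_scores), ("penalized", penalized_scores)]:
    ranked = rank(scorer(notes, "copd"))
    print(name, ranked, precision_recall_f1(ranked, truth, k=2))
```

In this toy corpus "hypertension" appears in every note, so the raw count ranks it first, while the ratio pushes it below the two respiratory terms. That mirrors the behavior described in the Results, although the real scores and dictionaries are far larger.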
first_indexed 2024-12-19T00:46:29Z
format Article
id doaj.art-9dc8c9bc64044894a18299eefd4ed511
institution Directory Open Access Journal
issn 2196-1115
language English
last_indexed 2024-12-19T00:46:29Z
publishDate 2017-04-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj.art-9dc8c9bc64044894a18299eefd4ed511; 2022-12-21T20:44:15Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2017-04-01; 41118; 10.1186/s40537-017-0067-6
Author affiliation (Christopher Baechle, Ankur Agarwal, Xingquan Zhu): Department of Computer & Electrical Engineering and Computer Science, College of Engineering, Florida Atlantic University
topic Big data
Decision support system
Data mining
Health informatics
url http://link.springer.com/article/10.1186/s40537-017-0067-6