Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients
Abstract Background Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relatio...
Main Authors: | Christopher Baechle, Ankur Agarwal, Xingquan Zhu |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2017-04-01 |
Series: | Journal of Big Data |
Subjects: | Big data; Decision support system; Data mining; Health informatics |
Online Access: | http://link.springer.com/article/10.1186/s40537-017-0067-6 |
_version_ | 1818827622534086656 |
---|---|
author | Christopher Baechle, Ankur Agarwal, Xingquan Zhu |
author_sort | Christopher Baechle |
collection | DOAJ |
description | Abstract. Background: Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relationships with COPD can be difficult because disease prevalence within a population often influences results. A method that can better separate occurrence within COPD patients from population prevalence would be desirable. Large hospital systems may have tens of millions of patient records spanning decades of collection, so a scalable big data approach is desirable. The presented method, Co-Occurring Evidence Discovery (COED), provides a methodology and framework to address these issues. Methods: Natural Language Processing methods are used to examine 64,371 deidentified clinical notes and discover associations between COPD and medical terms. Apache cTAKES is leveraged to annotate and structure clinical notes. Several extensions to cTAKES have been written to parallelize the annotation of large sets of clinical notes. A co-occurrence score is presented that penalizes scores based on term prevalence, along with a baseline method traditionally used for finding co-occurrence. These scoring systems are implemented using Apache Spark. Dictionaries of ground truth terms for diseases, medications, and symptoms have been created using clinical domain knowledge. COED and baseline methods are compared using precision, recall, and F1 score. Results: The highest scoring diseases using COED are lung and respiratory diseases. In contrast, baseline methods for co-occurrence rank diseases with high population prevalence highest. Medications and symptoms evaluated with COED share similar results. When evaluated against ground truth dictionaries, the maximum improvements in recall for symptoms, diseases, and medications were 0.212, 0.130, and 0.174. The maximum improvements in precision for symptoms, diseases, and medications were 0.303, 0.333, and 0.180. Median increases in F1 score for symptoms, diseases, and medications were 38.1%, 23.0%, and 17.1%. A paired t-test was performed and the F1 score increases were found to be statistically significant, with p < 0.01. Conclusion: Penalizing terms that are highly frequent in the corpus results in better precision and recall performance. Penalizing frequently occurring terms gives a better picture of the diseases, symptoms, and medications co-occurring with COPD. Using a mathematical and computational approach rather than a purely expert-driven approach, large dictionaries of COPD-related terms can be assembled in a short amount of time. |
first_indexed | 2024-12-19T00:46:29Z |
format | Article |
id | doaj.art-9dc8c9bc64044894a18299eefd4ed511 |
institution | Directory Open Access Journal |
issn | 2196-1115 |
language | English |
last_indexed | 2024-12-19T00:46:29Z |
publishDate | 2017-04-01 |
publisher | SpringerOpen |
record_format | Article |
series | Journal of Big Data |
spelling | doaj.art-9dc8c9bc64044894a18299eefd4ed511; 2022-12-21T20:44:15Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2017-04-01; 41118; 10.1186/s40537-017-0067-6; Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients; Christopher Baechle, Ankur Agarwal, Xingquan Zhu (all: Department of Computer & Electrical Engineering and Computer Science, College of Engineering, Florida Atlantic University); http://link.springer.com/article/10.1186/s40537-017-0067-6; Big data; Decision support system; Data mining; Health informatics |
title | Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients |
topic | Big data; Decision support system; Data mining; Health informatics |
url | http://link.springer.com/article/10.1186/s40537-017-0067-6 |