Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis.

Bibliographic Details
Main Authors: Shang-Ming Zhou, Fabiola Fernandez-Gutierrez, Jonathan Kennedy, Roxanne Cooksey, Mark Atkinson, Spiros Denaxas, Stefan Siebert, William G Dixon, Terence W O'Neill, Ernest Choy, Cathie Sudlow, UK Biobank Follow-up and Outcomes Group, Sinead Brophy
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4852928?pdf=render
collection DOAJ
description OBJECTIVES: 1) To use a data-driven method to examine clinical codes (risk factors) of a medical condition in primary care electronic health records (EHRs) that can accurately predict a diagnosis of the condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis (RA) using primary care EHRs.
METHODS: This study linked routine primary and secondary care EHRs in Wales, UK. A machine-learning-based scheme was used to identify patients with rheumatoid arthritis from primary care EHRs via the following steps: i) selection of variables by comparing the relative frequencies of Read codes in the primary care dataset in disease cases with those in non-disease controls (disease/non-disease based on the secondary care diagnosis); ii) reduction of predictors/associated variables using a Random Forest method; iii) induction of decision rules from a decision tree model. The proposed method was then extensively validated on an independent dataset, and its performance was compared with that of two existing deterministic algorithms for RA which had been developed using expert clinical knowledge.
RESULTS: Primary care EHRs were available for 2,238,360 patients over the age of 16, and of these 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record were found more frequently in those with RA than in those without. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes, such as those for disease modifying anti-rheumatic drugs, and absence of alternative diagnoses such as psoriatic arthritis. The proposed data-driven method performed as well as the methods based on expert clinical knowledge.
CONCLUSION: Data-driven schemes, such as ensemble machine learning methods, have the potential to identify the most informative predictors in a cost-effective and rapid way, and so to classify rheumatoid arthritis or other complex medical conditions in primary care EHRs accurately and reliably.
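Step i) of the METHODS (comparing how often each clinical code appears in cases versus controls) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `min_ratio` threshold, the smoothing floor, and the toy code labels are all hypothetical, and this record does not say which selection criterion the authors actually used.

```python
from collections import Counter


def code_frequencies(patients):
    """Return the fraction of patients whose record contains each code.

    `patients` is a list of code lists, one per patient (hypothetical layout).
    A code is counted once per patient, however often it was recorded.
    """
    if not patients:
        raise ValueError("need at least one patient record")
    counts = Counter()
    for codes in patients:
        counts.update(set(codes))
    total = len(patients)
    return {code: n / total for code, n in counts.items()}


def select_candidate_codes(cases, controls, min_ratio=2.0):
    """Keep codes that are relatively more frequent in cases than in controls.

    `min_ratio` is an assumed cut-off, not a value taken from the paper.
    """
    case_freq = code_frequencies(cases)
    control_freq = code_frequencies(controls)
    # Assumed smoothing floor so codes never seen in controls
    # do not cause a division by zero.
    floor = 1.0 / (len(controls) + 1)
    selected = {}
    for code, freq in case_freq.items():
        ratio = freq / max(control_freq.get(code, 0.0), floor)
        if ratio >= min_ratio:
            selected[code] = ratio
    return selected


# Toy, made-up records; the labels stand in for real Read codes.
cases = [
    ["RA_DIAGNOSIS", "DMARD_RX", "BP_CHECK"],
    ["RA_DIAGNOSIS", "BP_CHECK"],
    ["DMARD_RX", "JOINT_PAIN"],
]
controls = [
    ["BP_CHECK"],
    ["BP_CHECK", "JOINT_PAIN"],
    ["FLU_VACCINE"],
    ["BP_CHECK", "FLU_VACCINE"],
]

candidates = select_candidate_codes(cases, controls)
print(sorted(candidates))
```

In the study itself the case/control label came from the linked secondary care diagnosis, and the surviving codes were then passed to the Random Forest and decision tree steps, which this sketch does not attempt to reproduce.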
first_indexed 2024-04-12T06:39:35Z
id doaj.art-d22f25a17f88446294ac4db9e846f671
institution Directory Open Access Journal
issn 1932-6203
last_indexed 2024-04-12T06:39:35Z
spelling doaj.art-d22f25a17f88446294ac4db9e846f671; 2022-12-22T03:43:46Z; eng; Public Library of Science (PLoS); PLoS ONE; ISSN 1932-6203; 2016-01-01; volume 11, issue 5, e0154515; DOI 10.1371/journal.pone.0154515