Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test

Abstract A Full Blood Count (FBC) is a common blood test including 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the F...

Bibliographic Details
Main Authors: Pradeep S. Virdee, Alice Fuller, Michael Jacobs, Tim Holt, Jacqueline Birks
Format: Article
Language: English
Published: SpringerOpen 2020-11-01
Series: Journal of Big Data
Subjects: Clinical practice research datalink; Full blood count; Blood test; Data quality; Data validation
Online Access: http://link.springer.com/article/10.1186/s40537-020-00375-w
_version_ 1819206077073326080
author Pradeep S. Virdee
Alice Fuller
Michael Jacobs
Tim Holt
Jacqueline Birks
collection DOAJ
description Abstract A Full Blood Count (FBC) is a common blood test comprising 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the FBC data in one EHR. The Test dataset from the Clinical Practice Research Datalink (CPRD), which contains results of tests performed in primary care, such as FBC blood tests, was accessed. Medical codes and entity codes, two coding systems used within CPRD to identify FBC records, were compared, and the level of mismatched coding and the number of mismatches that could be rectified are reported. The reliability of units of measurement is also described and missing data are discussed. There were 14 entity codes and 138 medical codes for the FBC in the data. Medical and entity codes consistently corresponded to the same FBC parameter in 95.2% (n = 217,752,448) of parameters. Among the 4.8% (n = 10,955,006) of mismatches, the most commonly rectified parameter was mean platelet volume (n = 2,041,360); 1,191,540 mismatches could not be rectified and were removed. Units of measurement were often missing, partially entered, or did not appear to correspond to the blood value. The final dataset contained 16,537,017 FBC tests. Applying mathematical equations to derive some missing parameters in these FBCs resulted in 15 of 20 parameters available per FBC on average, with 0.3% of FBCs having all 20 parameters. Performing data quality checks can help to understand the extent of any issues in the dataset. We emphasise balancing large sample sizes with reliability of the data.
first_indexed 2024-12-23T05:01:51Z
format Article
id doaj.art-f1274531db534c5f9f719d6bca792c01
institution Directory Open Access Journal
issn 2196-1115
language English
last_indexed 2024-12-23T05:01:51Z
publishDate 2020-11-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj.art-f1274531db534c5f9f719d6bca792c01
2022-12-21T17:59:12Z
eng
SpringerOpen
Journal of Big Data
2196-1115
2020-11-01
Vol. 7, Iss. 1, pp. 1-18
10.1186/s40537-020-00375-w
Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test
Pradeep S. Virdee (0), Alice Fuller (1), Michael Jacobs (2), Tim Holt (3), Jacqueline Birks (4)
0: Centre for Statistics in Medicine, Botnar Research Centre, Nuffield Orthopaedic Centre, NDORMS, University of Oxford
1: Nuffield Department of Primary Care Health Sciences, University of Oxford
2: BMS Haematology, John Radcliffe Hospital, Oxford University Hospitals
3: Nuffield Department of Primary Care Health Sciences, University of Oxford
4: Centre for Statistics in Medicine, Botnar Research Centre, Nuffield Orthopaedic Centre, NDORMS, University of Oxford
http://link.springer.com/article/10.1186/s40537-020-00375-w
Clinical practice research datalink; Full blood count; Blood test; Data quality; Data validation
title Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test
topic Clinical practice research datalink
Full blood count
Blood test
Data quality
Data validation
url http://link.springer.com/article/10.1186/s40537-020-00375-w