Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test
Abstract A Full Blood Count (FBC) is a common blood test including 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the F...
Main Authors: | Pradeep S. Virdee, Alice Fuller, Michael Jacobs, Tim Holt, Jacqueline Birks |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2020-11-01 |
Series: | Journal of Big Data |
Subjects: | Clinical practice research datalink, Full blood count, Blood test, Data quality, Data validation |
Online Access: | http://link.springer.com/article/10.1186/s40537-020-00375-w |
_version_ | 1819206077073326080 |
---|---|
author | Pradeep S. Virdee Alice Fuller Michael Jacobs Tim Holt Jacqueline Birks |
author_sort | Pradeep S. Virdee |
collection | DOAJ |
description | Abstract A Full Blood Count (FBC) is a common blood test including 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the FBC data in one EHR. The Test dataset from the Clinical Practice Research Datalink (CPRD) was accessed, which contains results of tests performed in primary care, such as FBC blood tests. Medical codes and entity codes, two coding systems used within CPRD to identify FBC records, were compared, and the levels of mismatched coding and the number of mismatches that could be rectified are reported. The reliability of units of measurement is also described and missing data are discussed. There were 14 entity codes and 138 medical codes for the FBC in the data. Medical and entity codes consistently corresponded to the same FBC parameter in 95.2% (n = 217,752,448) of parameters. In the 4.8% (n = 10,955,006) that were mismatched, the most common parameter rectified was mean platelet volume (n = 2,041,360); 1,191,540 could not be rectified and were removed. Units of measurement were often missing, partially entered, or did not appear to correspond to the blood value. The final dataset contained 16,537,017 FBC tests. Applying mathematical equations to derive some missing parameters in these FBCs resulted in 15 of 20 parameters available per FBC on average, with 0.3% of FBCs having all 20 parameters. Performing data quality checks can help to understand the extent of any issues in the dataset. We emphasise balancing large sample sizes with reliability of the data. |
first_indexed | 2024-12-23T05:01:51Z |
format | Article |
id | doaj.art-f1274531db534c5f9f719d6bca792c01 |
institution | Directory Open Access Journal |
issn | 2196-1115 |
language | English |
last_indexed | 2024-12-23T05:01:51Z |
publishDate | 2020-11-01 |
publisher | SpringerOpen |
record_format | Article |
series | Journal of Big Data |
spelling | doaj.art-f1274531db534c5f9f719d6bca792c01; 2022-12-21T17:59:12Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2020-11-01; 7 1 1 18; 10.1186/s40537-020-00375-w; Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test; Pradeep S. Virdee (Centre for Statistics in Medicine, Botnar Research Centre, Nuffield Orthopaedic Centre, NDORMS, University of Oxford); Alice Fuller (Nuffield Department of Primary Care Health Sciences, University of Oxford); Michael Jacobs (BMS Haematology, John Radcliffe Hospital, Oxford University Hospitals); Tim Holt (Nuffield Department of Primary Care Health Sciences, University of Oxford); Jacqueline Birks (Centre for Statistics in Medicine, Botnar Research Centre, Nuffield Orthopaedic Centre, NDORMS, University of Oxford); http://link.springer.com/article/10.1186/s40537-020-00375-w; Clinical practice research datalink; Full blood count; Blood test; Data quality; Data validation |
title | Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test |
topic | Clinical practice research datalink Full blood count Blood test Data quality Data validation |
url | http://link.springer.com/article/10.1186/s40537-020-00375-w |
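
The abstract describes comparing two coding systems (medical codes and entity codes) to see whether both point to the same FBC parameter. The record does not give the authors' implementation, so the following is only a minimal Python sketch of that kind of concordance check. The lookup tables, the code values (`M1`, `E1`, ...) and the field names `medcode` and `enttype` are illustrative assumptions, not taken from the article. The final lines re-derive the 95.2% and 4.8% figures from the counts quoted in the abstract.

```python
from collections import Counter

# Hypothetical lookup tables: each coding system maps a code to an FBC parameter.
# The codes below are invented for illustration and are not real CPRD codes.
MEDCODE_TO_PARAM = {
    "M1": "haemoglobin",
    "M2": "platelets",
    "M3": "mean platelet volume",
}
ENTITY_TO_PARAM = {
    "E1": "haemoglobin",
    "E2": "platelets",
    "E3": "mean platelet volume",
}


def classify(record):
    """Label one test record as 'match', 'mismatch', or 'unknown'.

    A record is a dict with assumed keys 'medcode' and 'enttype'.
    """
    med_param = MEDCODE_TO_PARAM.get(record["medcode"])
    ent_param = ENTITY_TO_PARAM.get(record["enttype"])
    if med_param is None or ent_param is None:
        return "unknown"  # at least one code is not in the lookup tables
    return "match" if med_param == ent_param else "mismatch"


def summarise(records):
    """Count how many records fall into each concordance category."""
    return Counter(classify(r) for r in records)


def percent(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Toy records (invented) to show the check in use.
sample = [
    {"medcode": "M1", "enttype": "E1"},
    {"medcode": "M2", "enttype": "E2"},
    {"medcode": "M3", "enttype": "E2"},  # the two systems disagree
]
print(summarise(sample))

# Re-derive the percentages from the counts quoted in the abstract.
matched, mismatched = 217_752_448, 10_955_006
total = matched + mismatched
print(round(percent(matched, total), 1), round(percent(mismatched, total), 1))
```

Keeping an explicit "unknown" category separates records whose codes cannot be looked up from records where the two systems genuinely disagree, which is a design choice of this sketch rather than something stated in the abstract.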