Synergies between centralized and federated approaches to data quality: a report from the national COVID cohort collaborative

Objective: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the l...


Bibliographic Details
Main Authors: Pfaff, ER, Girvin, AT, Gabriel, DL, Kostka, K, Morris, M, Palchuk, MB, Lehmann, HP, Amor, B, Bissell, M, Bradwell, KR, Gold, S, Hong, SS, Loomba, J, Manna, A, McMurry, JA, Niehaus, E, Qureshi, N, Walden, A, Zhang, XT, Zhu, RL, Moffitt, RA, Haendel, MA, Chute, CG, N3C Consortium, Adams, WG, Al-Shukri, S, Anzalone, A, Baghal, A, Bennett, TD, Bernstam, EV, Bissell, MM, Bush, B, Campion, TR, Castro, V, Chang, J, Chaudhari, DD, Chen, W, Chu, S, Cimino, JJ, Crandall, KA, Crooks, M, Davies, SJD, DiPalazzo, J, Dorr, D, Eckrich, D, Eltinge, SE, Fort, DG, Golovko, G, Gupta, S
Format: Journal article
Language: English
Published: Oxford University Press 2021
collection OXFORD
description Objective: In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations.
Materials and Methods: We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using 4 federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements.
Results: Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source Common Data Model conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 sites (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback.
Discussion: We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for DQ improvement that will support improved research analytics locally and in aggregate.
Conclusion: By combining rapid, continual assessment of DQ with a large volume of multisite data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.
first_indexed 2024-03-07T07:11:15Z
id oxford-uuid:b846d295-7fb7-4e4f-b878-2edbdca8fe09
institution University of Oxford
last_indexed 2024-03-07T07:11:15Z
record_format dspace