Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.

ABSTRACT Background Data integration comprises methods and tools to aggregate data from disparate sources to various purposes. Heterogeneity and uncertainty are technical challenges in this field. The first involves different data representation or meaning, while the second refers to incomplete dat...

Full description

Bibliographic Details
Main Authors: Marcos Barreto, Spiros Denaxas
Format: Article
Language:English
Published: Swansea University 2017-04-01
Series:International Journal of Population Data Science
Online Access:https://ijpds.org/article/view/313
_version_ 1797430671022489600
author Marcos Barreto
Spiros Denaxas
author_facet Marcos Barreto
Spiros Denaxas
author_sort Marcos Barreto
collection DOAJ
description ABSTRACT Background Data integration comprises methods and tools to aggregate data from disparate sources to various purposes. Heterogeneity and uncertainty are technical challenges in this field. The first involves different data representation or meaning, while the second refers to incomplete data or the expectancy that a data item exists in a data source. Our goal is to design and validate a data integration model and computing tools able to address both problems. Such model and tools will support the setup of a population-based cohort comprised by 100 million individuals and the generation of data marts (domain-specific data) to be used in epidemiological studies within an ongoing cooperation Brazil-UK. Such studies will assess the impact of a conditional cash transfer programme (PBF – Bolsa Família) on the occurrence, severity and mortality of several diseases and health problems (hospitalization, mortality, child health etc) over this cohort. Approach We propose a three-dimensional data model to aggregate information on the cohort, exposition (payments received during the observed period) and health outcomes. We treat heterogeneity based on our existing probabilistic linkage pipeline that provides data quality assessment, data conditioning (standardization, cleansing, blocking, and anonymization), two methods for probabilistic record matching, and accuracy assessment. Through this pipeline, we are able to probabilistically link records from PBF, CadastroÚnico (CADU - socioeconomic information) and healthcare databases from the Unified Health System (SUS). Uncertainty is modeled through “possible worlds”, which represent a data instance (record) with a corresponding probability. There exist 2^records possible worlds with the probability distribution being the product of record probabilities. We map the most probable relationships between all the databases involved and create some simulation scenarios in order to validate them. We are seeking for a good balance between the set of possible worlds, not overextending the possibilities, and their proximity to a real scenario. Results The current implementation comprises the linkage of the 2011 extraction of CADU and healthcare databases to populate the proposed model. Such linkage provides timely execution (up to 9 hours depending on the databases) with high accurate data marts (over 95% of true positive matched pairs) for samples with increasing size (from 1,447,512 to 12,036,010 records). Conclusion Our model is able to treat heterogeneity aspects present in huge databases. Our tools provide timely execution of probabilistic linkage with high accuracy. We started to model uncertainty in order to perform simulations and decide how to incorporate it in our model.
first_indexed 2024-03-09T09:31:45Z
format Article
id doaj.art-646deec692b34a00ad1971569590db1c
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T09:31:45Z
publishDate 2017-04-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-646deec692b34a00ad1971569590db1c2023-12-02T03:37:12ZengSwansea UniversityInternational Journal of Population Data Science2399-49082017-04-011110.23889/ijpds.v1i1.313313Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.Marcos Barreto0Spiros Denaxas1Federal University of Bahia (UFBA)Farr Institute of Health Informatics Research LondonABSTRACT Background Data integration comprises methods and tools to aggregate data from disparate sources to various purposes. Heterogeneity and uncertainty are technical challenges in this field. The first involves different data representation or meaning, while the second refers to incomplete data or the expectancy that a data item exists in a data source. Our goal is to design and validate a data integration model and computing tools able to address both problems. Such model and tools will support the setup of a population-based cohort comprised by 100 million individuals and the generation of data marts (domain-specific data) to be used in epidemiological studies within an ongoing cooperation Brazil-UK. Such studies will assess the impact of a conditional cash transfer programme (PBF – Bolsa Família) on the occurrence, severity and mortality of several diseases and health problems (hospitalization, mortality, child health etc) over this cohort. Approach We propose a three-dimensional data model to aggregate information on the cohort, exposition (payments received during the observed period) and health outcomes. We treat heterogeneity based on our existing probabilistic linkage pipeline that provides data quality assessment, data conditioning (standardization, cleansing, blocking, and anonymization), two methods for probabilistic record matching, and accuracy assessment. Through this pipeline, we are able to probabilistically link records from PBF, CadastroÚnico (CADU - socioeconomic information) and healthcare databases from the Unified Health System (SUS). Uncertainty is modeled through “possible worlds”, which represent a data instance (record) with a corresponding probability. There exist 2^records possible worlds with the probability distribution being the product of record probabilities. We map the most probable relationships between all the databases involved and create some simulation scenarios in order to validate them. We are seeking for a good balance between the set of possible worlds, not overextending the possibilities, and their proximity to a real scenario. Results The current implementation comprises the linkage of the 2011 extraction of CADU and healthcare databases to populate the proposed model. Such linkage provides timely execution (up to 9 hours depending on the databases) with high accurate data marts (over 95% of true positive matched pairs) for samples with increasing size (from 1,447,512 to 12,036,010 records). Conclusion Our model is able to treat heterogeneity aspects present in huge databases. Our tools provide timely execution of probabilistic linkage with high accuracy. We started to model uncertainty in order to perform simulations and decide how to incorporate it in our model.https://ijpds.org/article/view/313
spellingShingle Marcos Barreto
Spiros Denaxas
Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
International Journal of Population Data Science
title Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
title_full Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
title_fullStr Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
title_full_unstemmed Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
title_short Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
title_sort treating heterogeneity and uncertainty in data integration study on brazilian healthcare databases
url https://ijpds.org/article/view/313
work_keys_str_mv AT marcosbarreto treatingheterogeneityanduncertaintyindataintegrationstudyonbrazilianhealthcaredatabases
AT spirosdenaxas treatingheterogeneityanduncertaintyindataintegrationstudyonbrazilianhealthcaredatabases