Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.

ABSTRACT Background Data integration comprises methods and tools for aggregating data from disparate sources for various purposes. Heterogeneity and uncertainty are technical challenges in this field. The first involves differences in data representation or meaning, while the second refers to incomplete dat...


Bibliographic Details
Main Authors:	Marcos Barreto, Spiros Denaxas
Format:	Article
Language:	English
Published:	Swansea University 2017-04-01
Series:	International Journal of Population Data Science
Online Access:	https://ijpds.org/article/view/313

_version_	1797430671022489600
author	Marcos Barreto; Spiros Denaxas
author_facet	Marcos Barreto; Spiros Denaxas
author_sort	Marcos Barreto
collection	DOAJ
description	ABSTRACT
Background: Data integration comprises methods and tools for aggregating data from disparate sources for various purposes. Heterogeneity and uncertainty are two technical challenges in this field: the first involves differences in data representation or meaning, while the second refers to incomplete data or to the expectation that a data item exists in a data source. Our goal is to design and validate a data integration model and computing tools able to address both problems. The model and tools will support the setup of a population-based cohort comprising 100 million individuals and the generation of data marts (domain-specific data) for epidemiological studies within an ongoing Brazil-UK cooperation. These studies will assess the impact of a conditional cash transfer programme (PBF, Bolsa Família) on the occurrence, severity and mortality of several diseases and health problems (hospitalization, mortality, child health, etc.) in this cohort.
Approach: We propose a three-dimensional data model that aggregates information on the cohort, exposure (payments received during the observed period) and health outcomes. We treat heterogeneity with our existing probabilistic linkage pipeline, which provides data quality assessment, data conditioning (standardization, cleansing, blocking, and anonymization), two methods for probabilistic record matching, and accuracy assessment. Through this pipeline, we probabilistically link records from PBF, Cadastro Único (CADU, socioeconomic information) and healthcare databases from the Unified Health System (SUS). Uncertainty is modeled through "possible worlds", each of which represents a data instance (record) with a corresponding probability. For n records there are 2^n possible worlds, with the probability distribution given by the product of record probabilities. We map the most probable relationships between all the databases involved and create simulation scenarios to validate them. We are seeking a good balance between the size of the set of possible worlds, without overextending the possibilities, and their proximity to a real scenario.
Results: The current implementation links the 2011 extraction of CADU with healthcare databases to populate the proposed model. The linkage runs in a timely manner (up to 9 hours, depending on the databases) and yields highly accurate data marts (over 95% of true positive matched pairs) for samples of increasing size (from 1,447,512 to 12,036,010 records).
Conclusion: Our model is able to treat the heterogeneity present in very large databases. Our tools provide timely execution of probabilistic linkage with high accuracy. We have started to model uncertainty in order to run simulations and decide how to incorporate it into our model.
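The "possible worlds" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that records are independent, so a world's probability is the product of p for each record it contains and (1 - p) for each record it omits. The function name `possible_worlds` and the record ids and probabilities below are hypothetical.

```python
from itertools import product
from math import prod


def possible_worlds(record_probs):
    """Enumerate all 2^n worlds over n independent records.

    record_probs maps a record id to the probability that the record exists.
    Yields (world, probability), where world is a frozenset of present ids.
    Independence between records is an assumption of this sketch.
    """
    ids = list(record_probs)
    for mask in product((False, True), repeat=len(ids)):
        world = frozenset(i for i, present in zip(ids, mask) if present)
        probability = prod(
            record_probs[i] if present else 1.0 - record_probs[i]
            for i, present in zip(ids, mask)
        )
        yield world, probability


# Hypothetical existence probabilities for three candidate linked records.
links = {"r1": 0.95, "r2": 0.60, "r3": 0.10}
worlds = list(possible_worlds(links))
most_probable = max(worlds, key=lambda wp: wp[1])
```

Because the number of worlds doubles with every record, full enumeration is only practical for small examples; this is one way to see why the abstract speaks of not overextending the set of possibilities.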
first_indexed	2024-03-09T09:31:45Z
format	Article
id	doaj.art-646deec692b34a00ad1971569590db1c
institution	Directory of Open Access Journals
issn	2399-4908
language	English
last_indexed	2024-03-09T09:31:45Z
publishDate	2017-04-01
publisher	Swansea University
record_format	Article
series	International Journal of Population Data Science
spelling	doaj.art-646deec692b34a00ad1971569590db1c; 2023-12-02T03:37:12Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2017-04-01; 1; 1; 10.23889/ijpds.v1i1.313; 313; Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.; Marcos Barreto (Federal University of Bahia (UFBA)); Spiros Denaxas (Farr Institute of Health Informatics Research London); https://ijpds.org/article/view/313
title	Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases.
url	https://ijpds.org/article/view/313


Similar Items