Enabling Fast and Accurate Record Linkage of Large-Scale Health-Related Administrative Databases Through a DNA-Encoded Approach.

Objective Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge. We present...

Bibliographic Details
Main Authors: José Araújo, Juan Silva, André Costa-Martins, Vanderson Sampaio, Daniel Castro, Robson Souza, Jeevan Giddaluru, Pablo Ramos, Robespierre Pita, Maurício Barreto, Manoel Netto, Helder Nakaya
Format: Article
Language: English
Published: Swansea University 2022-08-01
Series: International Journal of Population Data Science
Subjects: DNA-encoded method; record linkage; genomic tools; epidemiology; BLAST
Online Access: https://ijpds.org/article/view/1774
author José Araújo
Juan Silva
André Costa-Martins
Vanderson Sampaio
Daniel Castro
Robson Souza
Jeevan Giddaluru
Pablo Ramos
Robespierre Pita
Maurício Barreto
Manoel Netto
Helder Nakaya
author_sort José Araújo
collection DOAJ
description Objective: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge. We present Tucuxi-BLAST, a versatile tool for probabilistic RL that uses a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Materials and Methods: Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked the tool on a simulated database containing records for 300 million individuals and also on four large administrative databases containing real data on Brazilian patients. Results: Our method was able to overcome misspellings and typographical errors in administrative databases. In processing the RL of the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 hours to perform the RL, while Tucuxi-BLAST took only 23 hours. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. Discussion: By repurposing genomic tools, researchers are able to perform subject tracing across multiple large epidemiological databases using a regular laptop. Conclusion: Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
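The description above says that identification records are encoded into DNA before alignment. The following is a minimal sketch of that general idea only, not the encoding used by Tucuxi-BLAST, which this record does not specify. The 2-bits-per-base mapping, the `NAME|BIRTH_DATE` record layout, and all function names are assumptions made for illustration. It shows why a one-character typo changes only a few bases, so an alignment tool can still recognise the two sequences as near-identical.

```python
# Illustrative only: the mapping and record layout are assumptions,
# not the scheme described in the Tucuxi-BLAST article.
BASES = "ACGT"  # each base carries 2 bits: A=00, C=01, G=10, T=11


def encode_to_dna(text: str) -> str:
    """Encode the UTF-8 bytes of `text` as DNA, 4 bases per byte."""
    out = []
    for byte in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):  # most significant bit pair first
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode_from_dna(seq: str) -> str:
    """Invert encode_to_dna; expects a length that is a multiple of 4."""
    data = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")


def record_to_dna(name: str, birth_date: str) -> str:
    """Hypothetical record layout: upper-cased name, a separator, a date."""
    return encode_to_dna(f"{name.upper()}|{birth_date}")


def identity(a: str, b: str) -> float:
    """Fraction of matching positions in two equal-length sequences.

    A stand-in for an ungapped alignment score; a real aligner such as
    BLASTn also handles insertions and deletions.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    clean = record_to_dna("Maria Silva", "1980-01-01")
    typo = record_to_dna("Maria Zilva", "1980-01-01")  # one substituted letter
    print(clean)
    print(round(identity(clean, typo), 3))
```

In a full pipeline, sequences like these would be written to FASTA files and compared with an aligner; that step is omitted here because the parameters used by the authors are not given in this record.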
first_indexed 2024-03-09T09:30:43Z
format Article
id doaj.art-325e6af5426b4105976e9cc40b416d1a
institution Directory of Open Access Journals
issn 2399-4908
language English
last_indexed 2024-03-09T09:30:43Z
publishDate 2022-08-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-325e6af5426b4105976e9cc40b416d1a
2023-12-02T03:51:50Z
eng
Swansea University
International Journal of Population Data Science
2399-4908
2022-08-01
Volume 7, Issue 3
10.23889/ijpds.v7i3.1774
Enabling Fast and Accurate Record Linkage of Large-Scale Health-Related Administrative Databases Through a DNA-Encoded Approach.
José Araújo (Universidade de São Paulo)
Juan Silva (Universidade de São Paulo)
André Costa-Martins (Universidade de São Paulo)
Vanderson Sampaio (Fundação de Medicina Tropical Dr. Heitor Vieira Dourado)
Daniel Castro (Fundação de Vigilância em Saúde do Amazonas)
Robson Souza (Universidade de São Paulo)
Jeevan Giddaluru (Universidade de São Paulo)
Pablo Ramos (Oswaldo Cruz Foundation)
Robespierre Pita (Oswaldo Cruz Foundation)
Maurício Barreto (Oswaldo Cruz Foundation)
Manoel Netto (Oswaldo Cruz Foundation)
Helder Nakaya (University of São Paulo)
https://ijpds.org/article/view/1774
DNA-encoded method
record linkage
genomic tools
epidemiology
BLAST
title Enabling Fast and Accurate Record Linkage of Large-Scale Health-Related Administrative Databases Through a DNA-Encoded Approach.
topic DNA-encoded method
record linkage
genomic tools
epidemiology
BLAST
url https://ijpds.org/article/view/1774