Training data selection for record linkage classification

This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions we...


Bibliographic Details
Main Authors: Zaturrawiah Ali Omar, Zamira Hasanah Zamzuri, Noratiqah Mohd Ariff, Mohd Aftar Abu Bakar
Format: Article
Language: English
Published: MDPI AG 2023
Subjects:
Online Access: https://eprints.ums.edu.my/id/eprint/42203/1/ABSTRACT.pdf
https://eprints.ums.edu.my/id/eprint/42203/2/FULL%20TEXT.pdf
collection UMS
description This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, with both balanced (symmetry) and imbalanced (asymmetry) distributions tested. The top and imbalanced construction was found to be the most effective in producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, and random forest with the top and imbalanced construction produced an F1-score comparable to probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
first_indexed 2025-03-05T01:34:15Z
format Article
id ums.eprints-42203
institution Universiti Malaysia Sabah
language English
last_indexed 2025-03-05T01:34:15Z
publishDate 2023
publisher MDPI AG
record_format dspace
spelling ums.eprints-42203 2024-12-10T06:57:01Z https://eprints.ums.edu.my/id/eprint/42203/ MDPI AG 2023 Article NonPeerReviewed Zaturrawiah Ali Omar and Zamira Hasanah Zamzuri and Noratiqah Mohd Ariff and Mohd Aftar Abu Bakar (2023) Training data selection for record linkage classification. Symmetry, 15. pp. 1-17. https://doi.org/10.3390/sym15051060
title Training data selection for record linkage classification
topic QA1-939 Mathematics
QA75.5-76.95 Electronic computers. Computer science