MixER: linear interpolation of latent space for entity resolution

Abstract Entity resolution, accurately identifying various representations of the same real-world entities, is a crucial part of data integration systems. While existing learning-based models can achieve good performance, the models are extremely dependent on the quantity and quality of training dat...

Bibliographic Details
Main Authors: Huaiguang Wu, Shuaichao Li
Format: Article
Language: English
Published: Springer 2023-03-01
Series: Complex & Intelligent Systems
Subjects: Entity resolution; Probability distribution; Data augmentation
Online Access: https://doi.org/10.1007/s40747-023-01018-2
_version_ 1797272234532798464
author Huaiguang Wu
Shuaichao Li
collection DOAJ
description Abstract: Entity resolution, the task of accurately identifying different representations of the same real-world entity, is a crucial part of data integration systems. Existing learning-based models can achieve good performance, but they depend heavily on the quantity and quality of training data. This paper proposes the MixER model to alleviate these problems. MixER uses our newly designed data augmentation method, EMix, which maps discrete entity records to continuous latent-space variables (e.g., probability distributions) and then linearly interpolates the records in latent space to generate many augmented training samples. The matching model is then further optimized on the augmented data to strengthen its generalization capability. In the data sensitivity experiments, MixER shows a significant advantage when the training data is below 50. In the robustness experiments, it holds an absolute performance advantage when label noise exceeds 20%. Ablation experiments further demonstrate that EMix effectively improves the generalization ability of the matching model. Overall, the experimental results show that MixER is less sensitive to limited training data and more robust than current state-of-the-art methods.
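The interpolation step described in the abstract can be illustrated with a minimal mixup-style sketch. This is not the authors' EMix implementation: the record does not say how records are encoded or how the interpolation weight is chosen. The sketch assumes each record pair is already encoded as a fixed-length latent vector with a two-class soft label, and that the weight is drawn from a Beta(alpha, alpha) distribution as in standard mixup. The names `interpolate`, `mix_pair`, and `alpha` are hypothetical.

```python
import random


def interpolate(a, b, lam):
    """Return lam * a + (1 - lam) * b for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mix_pair(z1, y1, z2, y2, alpha=0.4, rng=None):
    """Mix two latent vectors and their soft labels with one shared weight.

    Assumption: the weight follows Beta(alpha, alpha), as in standard mixup.
    The source does not state how EMix chooses it.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    return interpolate(z1, z2, lam), interpolate(y1, y2, lam), lam


# Toy latent vectors for one matching and one non-matching record pair
# (illustrative values only, not taken from the paper).
z_match, y_match = [0.9, 0.1, 0.4], [0.0, 1.0]
z_nonmatch, y_nonmatch = [0.2, 0.7, 0.5], [1.0, 0.0]

z_mix, y_mix, lam = mix_pair(
    z_match, y_match, z_nonmatch, y_nonmatch, rng=random.Random(0)
)
print(lam, z_mix, y_mix)
```

Because the same weight is applied to the vectors and to the labels, the mixed label stays a valid probability distribution, and each mixed coordinate lies between the two originals.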
first_indexed 2024-03-07T14:25:23Z
format Article
id doaj.art-9df37368cee74f039e7f51aef43b720d
institution Directory Open Access Journal
issn 2199-4536
2198-6053
language English
last_indexed 2024-03-07T14:25:23Z
publishDate 2023-03-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj.art-9df37368cee74f039e7f51aef43b720d
2024-03-06T08:07:23Z
eng
Springer
Complex & Intelligent Systems
2199-4536
2198-6053
2023-03-01
10
1
3
22
10.1007/s40747-023-01018-2
MixER: linear interpolation of latent space for entity resolution
Huaiguang Wu (0)
Shuaichao Li (1)
College of Computer and Communication Engineering, Zhengzhou University of Light Industry
College of Computer and Communication Engineering, Zhengzhou University of Light Industry
https://doi.org/10.1007/s40747-023-01018-2
Entity resolution
Probability distribution
Data augmentation
title MixER: linear interpolation of latent space for entity resolution
topic Entity resolution
Probability distribution
Data augmentation
url https://doi.org/10.1007/s40747-023-01018-2