MixER: linear interpolation of latent space for entity resolution
Abstract: Entity resolution, accurately identifying various representations of the same real-world entities, is a crucial part of data integration systems. While existing learning-based models can achieve good performance, the models are extremely dependent on the quantity and quality of training dat...
Main Authors: | Huaiguang Wu, Shuaichao Li |
---|---|
Format: | Article |
Language: | English |
Published: | Springer, 2023-03-01 |
Series: | Complex & Intelligent Systems |
Subjects: | Entity resolution, Probability distribution, Data augmentation |
Online Access: | https://doi.org/10.1007/s40747-023-01018-2 |
_version_ | 1797272234532798464 |
---|---|
author | Huaiguang Wu, Shuaichao Li |
collection | DOAJ |
description | Abstract: Entity resolution, the task of accurately identifying different representations of the same real-world entity, is a crucial part of data integration systems. Existing learning-based models can achieve good performance, but they depend heavily on the quantity and quality of training data. This paper proposes the MixER model to alleviate these problems. MixER uses a newly designed data augmentation method called EMix, which maps discrete entity records to continuous latent-space variables (e.g., probability distributions) and then linearly interpolates entity records in latent space to generate many augmented training samples. The matching model is further optimized on the augmented data to strengthen its generalization capability. In the data sensitivity experiments, MixER shows a significant advantage when the training data is below 50. In the robustness experiments, MixER shows an absolute performance advantage when label noise exceeds 20%. Ablation experiments further demonstrate that EMix effectively improves the generalization ability of the matching model. Overall, the experimental results show that MixER exhibits better data sensitivity and robustness than current state-of-the-art methods. |
first_indexed | 2024-03-07T14:25:23Z |
format | Article |
id | doaj.art-9df37368cee74f039e7f51aef43b720d |
institution | Directory of Open Access Journals |
issn | 2199-4536, 2198-6053 |
language | English |
last_indexed | 2024-03-07T14:25:23Z |
publishDate | 2023-03-01 |
publisher | Springer |
record_format | Article |
series | Complex & Intelligent Systems |
spelling | doaj.art-9df37368cee74f039e7f51aef43b720d; 2024-03-06T08:07:23Z; eng; Springer; Complex & Intelligent Systems; 2199-4536; 2198-6053; 2023-03-01; 101322; 10.1007/s40747-023-01018-2; MixER: linear interpolation of latent space for entity resolution; Huaiguang Wu (College of Computer and Communication Engineering, Zhengzhou University of Light Industry); Shuaichao Li (College of Computer and Communication Engineering, Zhengzhou University of Light Industry); https://doi.org/10.1007/s40747-023-01018-2; Entity resolution; Probability distribution; Data augmentation |
title | MixER: linear interpolation of latent space for entity resolution |
topic | Entity resolution, Probability distribution, Data augmentation |
url | https://doi.org/10.1007/s40747-023-01018-2 |
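
The abstract describes EMix only at a high level: entity records are mapped to continuous latent variables, and new training samples are produced by linear interpolation in that latent space. The sketch below illustrates the interpolation step alone, in the style of generic mixup, and should not be read as the authors' implementation. Everything in it is an assumption made for illustration: the function names `interpolate_latents` and `augment`, the use of precomputed latent vectors in place of an encoder, the interpolation of labels alongside latents, and the Beta(alpha, alpha) sampling of the mixing coefficient.

```python
import numpy as np


def interpolate_latents(z_a, y_a, z_b, y_b, lam):
    """Linearly interpolate two latent vectors and their match labels.

    lam is the mixing coefficient in [0, 1]; lam = 1 returns sample a,
    lam = 0 returns sample b. (Hypothetical helper, not from the paper.)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    z_mix = lam * z_a + (1.0 - lam) * z_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return z_mix, y_mix


def augment(latents, labels, n_samples, alpha=0.4, seed=0):
    """Generate n_samples augmented (latent, label) pairs.

    Pairs of existing samples are drawn at random and mixed with a
    coefficient sampled from Beta(alpha, alpha), an assumption borrowed
    from generic mixup rather than taken from the abstract.
    """
    rng = np.random.default_rng(seed)
    n = latents.shape[0]
    idx_a = rng.integers(0, n, size=n_samples)
    idx_b = rng.integers(0, n, size=n_samples)
    lam = rng.beta(alpha, alpha, size=n_samples)
    # Broadcast lam over the latent dimension.
    z_mix = lam[:, None] * latents[idx_a] + (1.0 - lam[:, None]) * latents[idx_b]
    y_mix = lam * labels[idx_a] + (1.0 - lam) * labels[idx_b]
    return z_mix, y_mix
```

For example, mixing the latent vectors `[0, 0]` (label 0) and `[2, 4]` (label 1) with `lam = 0.25` yields the latent `[1.5, 3.0]` with soft label 0.75. A matching model trained on such pairs sees points between the original samples, which is the intuition behind the generalization gains the abstract reports.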