MixER: linear interpolation of latent space for entity resolution

Abstract: Entity resolution, accurately identifying various representations of the same real-world entities, is a crucial part of data integration systems. While existing learning-based models can achieve good performance, the models are extremely dependent on the quantity and quality of training dat...

Bibliographic Details
Main Authors:	Huaiguang Wu, Shuaichao Li
Format:	Article
Language:	English
Published:	Springer 2023-03-01
Series:	Complex & Intelligent Systems
Subjects:	Entity resolution; Probability distribution; Data augmentation
Online Access:	https://doi.org/10.1007/s40747-023-01018-2

_version_	1797272234532798464
author	Huaiguang Wu; Shuaichao Li
collection	DOAJ
description	Entity resolution, the task of accurately identifying different representations of the same real-world entity, is a crucial part of data integration systems. Existing learning-based models can achieve good performance, but they depend heavily on the quantity and quality of training data. This paper proposes the MixER model to alleviate these problems. MixER uses a newly designed data augmentation method called EMix. EMix maps discrete entity records to continuous latent-space variables (e.g., probability distributions) and then linearly interpolates entity records in latent space to generate many augmented training samples. The matching model is further optimized on the augmented data to strengthen its generalization capability. In the data sensitivity experiments, MixER shows significant advantages when training data is below 50. In the robustness experiments, MixER presents an absolute performance advantage when label noise exceeds 20%. In addition, ablation experiments demonstrate that EMix effectively improves the generalization ability of the matching model. Overall, the experimental results indicate that MixER exhibits better data sensitivity and robustness than current state-of-the-art methods.
first_indexed	2024-03-07T14:25:23Z
format	Article
id	doaj.art-9df37368cee74f039e7f51aef43b720d
institution	Directory of Open Access Journals
issn	2199-4536 2198-6053
language	English
last_indexed	2024-03-07T14:25:23Z
publishDate	2023-03-01
publisher	Springer
record_format	Article
series	Complex & Intelligent Systems
spelling	doaj.art-9df37368cee74f039e7f51aef43b720d 2024-03-06T08:07:23Z eng Springer Complex & Intelligent Systems 2199-4536 2198-6053 2023-03-01 101322 10.1007/s40747-023-01018-2 MixER: linear interpolation of latent space for entity resolution Huaiguang Wu (College of Computer and Communication Engineering, Zhengzhou University of Light Industry); Shuaichao Li (College of Computer and Communication Engineering, Zhengzhou University of Light Industry) https://doi.org/10.1007/s40747-023-01018-2 Entity resolution; Probability distribution; Data augmentation
title	MixER: linear interpolation of latent space for entity resolution
topic	Entity resolution; Probability distribution; Data augmentation
url	https://doi.org/10.1007/s40747-023-01018-2
work_keys_str_mv	AT huaiguangwu mixerlinearinterpolationoflatentspaceforentityresolution AT shuaichaoli mixerlinearinterpolationoflatentspaceforentityresolution
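
The abstract in this record describes EMix as mapping entity records to continuous latent-space variables (e.g., probability distributions) and linearly interpolating them to create augmented training samples. The record does not give the actual procedure, so the following is only a minimal sketch of generic mixup-style interpolation, not the authors' implementation. The function names, the Beta-distributed mixing weight, the `alpha` value, and the interpolation of labels are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Turn a vector of scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def interpolate_pair(p_a, y_a, p_b, y_b, alpha=0.4, rng=None):
    """Linearly interpolate two latent representations and their labels.

    p_a, p_b: latent vectors of two training samples (here, probability
              distributions of equal length).
    y_a, y_b: match labels of the two samples (1.0 = match, 0.0 = non-match).
    alpha:    hypothetical Beta(alpha, alpha) parameter for the mixing weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing weight in [0, 1]
    p_mix = lam * p_a + (1.0 - lam) * p_b   # interpolated latent variable
    y_mix = lam * y_a + (1.0 - lam) * y_b   # interpolated (soft) label
    return p_mix, y_mix, lam


# Usage: mix one matching and one non-matching sample.
rng = np.random.default_rng(0)
p_match = softmax(np.array([2.0, 0.5, -1.0]))
p_nonmatch = softmax(np.array([-0.5, 1.5, 0.0]))
p_mix, y_mix, lam = interpolate_pair(p_match, 1.0, p_nonmatch, 0.0, rng=rng)
```

Because the mixing weight lies in [0, 1], the result is a convex combination: interpolating two probability distributions yields another valid probability distribution, which is one reason interpolation is simpler in a continuous latent space than on discrete record text.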

Similar Items