Incremental Entity Blocking over Heterogeneous Streaming Data

Bibliographic Details
Main Authors: Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa, Thiago Pereira da Nóbrega
Format: Article
Language: English
Published: MDPI AG 2022-12-01
Series: Information
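Subjects: entity resolution, incremental processing, parallel computing, schema-agnostic blocking techniques, streaming data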
Online Access: https://www.mdpi.com/2078-2489/13/12/568
author Tiago Brasileiro Araújo
Kostas Stefanidis
Carlos Eduardo Santos Pires
Jyrki Nummenmaa
Thiago Pereira da Nóbrega
author_sort Tiago Brasileiro Araújo
collection DOAJ
description Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average.
first_indexed 2024-03-09T16:18:01Z
format Article
id doaj.art-b48532fa312b4e64acd2945be14b3dc6
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-09T16:18:01Z
publishDate 2022-12-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-b48532fa312b4e64acd2945be14b3dc6
2023-11-24T15:37:16Z
eng
MDPI AG
Information
2078-2489
2022-12-01
13
12
568
10.3390/info13120568
Incremental Entity Blocking over Heterogeneous Streaming Data
Tiago Brasileiro Araújo (Academic Unit of Systems and Computing, Federal University of Campina Grande, Campina Grande 58429-900, Brazil)
Kostas Stefanidis (Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland)
Carlos Eduardo Santos Pires (Academic Unit of Systems and Computing, Federal University of Campina Grande, Campina Grande 58429-900, Brazil)
Jyrki Nummenmaa (Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland)
Thiago Pereira da Nóbrega (Academic Unit of Systems and Computing, Federal University of Campina Grande, Campina Grande 58429-900, Brazil)
https://www.mdpi.com/2078-2489/13/12/568
entity resolution
incremental processing
parallel computing
schema-agnostic blocking techniques
streaming data
title Incremental Entity Blocking over Heterogeneous Streaming Data
topic entity resolution
incremental processing
parallel computing
schema-agnostic blocking techniques
streaming data
url https://www.mdpi.com/2078-2489/13/12/568