Incremental Entity Blocking over Heterogeneous Streaming Data
Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and impr...
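Main Authors: | Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa, Thiago Pereira da Nóbrega |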
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2022-12-01 |
Series: | Information |
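Subjects: | entity resolution, incremental processing, parallel computing, schema-agnostic blocking techniques, streaming data |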
Online Access: | https://www.mdpi.com/2078-2489/13/12/568 |
_version_ | 1797457149913202688 |
---|---|
author | Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa, Thiago Pereira da Nóbrega |
author_facet | Tiago Brasileiro Araújo, Kostas Stefanidis, Carlos Eduardo Santos Pires, Jyrki Nummenmaa, Thiago Pereira da Nóbrega |
author_sort | Tiago Brasileiro Araújo |
collection | DOAJ |
description | Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average. |
first_indexed | 2024-03-09T16:18:01Z |
format | Article |
id | doaj.art-b48532fa312b4e64acd2945be14b3dc6 |
institution | Directory Open Access Journal |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-09T16:18:01Z |
publishDate | 2022-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | doaj.art-b48532fa312b4e64acd2945be14b3dc6; 2023-11-24T15:37:16Z; eng; MDPI AG; Information; 2078-2489; 2022-12-01; 13(12), 568; 10.3390/info13120568; Incremental Entity Blocking over Heterogeneous Streaming Data; Tiago Brasileiro Araújo (Academic Unit of Systems and Computing, Federal University of Campina Grande, Campina Grande 58429-900, Brazil); Kostas Stefanidis (Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland); Carlos Eduardo Santos Pires (Academic Unit of Systems and Computing, Federal University of Campina Grande, Campina Grande 58429-900, Brazil); Jyrki Nummenmaa (Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland); Thiago Pereira da Nóbrega (Academic Unit of Systems and Computing, Federal University of Campina Grande, Campina Grande 58429-900, Brazil); https://www.mdpi.com/2078-2489/13/12/568; entity resolution, incremental processing, parallel computing, schema-agnostic blocking techniques, streaming data |
title | Incremental Entity Blocking over Heterogeneous Streaming Data |
title_sort | incremental entity blocking over heterogeneous streaming data |
topic | entity resolution, incremental processing, parallel computing, schema-agnostic blocking techniques, streaming data |
url | https://www.mdpi.com/2078-2489/13/12/568 |