RED: Redundancy-driven data extraction from result pages

Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not been fully exploited for unsupervised data extraction yet: the result of a search is presented as a paginated list of result records. Each result record conta...

Bibliographic Details
Main Authors:	Guo, J; Crescenzi, V; Furche, T; Grasso, G; Gottlob, G
Format:	Conference item
Published:	Association for Computing Machinery 2019

author	Guo, J; Crescenzi, V; Furche, T; Grasso, G; Gottlob, G
collection	OXFORD
description	Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not yet been fully exploited for unsupervised data extraction: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of a single object and links to a page dedicated to the details of that object. We present RED, an automatic approach and prototype system for extracting data records from sites that follow this publishing pattern. RED leverages the inherent redundancy between result records and their corresponding detail pages to design an effective yet fully unsupervised and domain-independent method. It can extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. Compared with previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology) and can achieve significantly higher accuracy while automatically selecting only object attributes, a task that is out of the scope of traditional fully unsupervised approaches. Compared with previous supervised or semi-supervised methods, RED can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.
first_indexed	2024-03-06T18:26:23Z
format	Conference item
id	oxford-uuid:081daa4a-01f9-430d-8d04-bf35226d72c2
institution	University of Oxford
last_indexed	2024-03-06T18:26:23Z
publishDate	2019
publisher	Association for Computing Machinery
record_format	dspace
title	RED: Redundancy-driven data extraction from result pages

Similar Items