OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.

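The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application interfaces, (2) Precisely capture the relevant data for most web extraction tasks, (3) Scale with the number of visited pages, and (4) Readily embed into existing web technologies. We introduce OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experimentally the theoretical complexity and demonstrate that its evaluation is dominated by the page rendering of the underlying browser. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license. © 2011 VLDB Endowment.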

Bibliographic Details
Main Authors: Furche, T, Gottlob, G, Grasso, G, Schallhart, C, Sellers, A
Format: Journal article
Language: English
Published: 2011
_version_ 1826277012654784512
author Furche, T
Gottlob, G
Grasso, G
Schallhart, C
Sellers, A
author_facet Furche, T
Gottlob, G
Grasso, G
Schallhart, C
Sellers, A
author_sort Furche, T
collection OXFORD
description The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application interfaces, (2) Precisely capture the relevant data for most web extraction tasks, (3) Scale with the number of visited pages, and (4) Readily embed into existing web technologies. We introduce OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experimentally the theoretical complexity and demonstrate that its evaluation is dominated by the page rendering of the underlying browser. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license. © 2011 VLDB Endowment.
first_indexed 2024-03-06T23:22:29Z
format Journal article
id oxford-uuid:6933c7fd-5924-4aad-80a9-f2dc0ea1068e
institution University of Oxford
language English
last_indexed 2024-03-06T23:22:29Z
publishDate 2011
record_format dspace
spelling oxford-uuid:6933c7fd-5924-4aad-80a9-f2dc0ea1068e
2022-03-26T18:49:50Z
OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
Journal article
http://purl.org/coar/resource_type/c_dcae04bc
uuid:6933c7fd-5924-4aad-80a9-f2dc0ea1068e
English
Symplectic Elements at Oxford
2011
Furche, T
Gottlob, G
Grasso, G
Schallhart, C
Sellers, A
The evolution of the web has outpaced itself: The growing wealth of information and the increasing sophistication of interfaces necessitate automated processing. Web automation and extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements of web extraction: (1) Interact with sophisticated web application interfaces, (2) Precisely capture the relevant data for most web extraction tasks, (3) Scale with the number of visited pages, and (4) Readily embed into existing web technologies. We introduce OXPath, an extension of XPath for interacting with web applications and for extracting information thus revealed. It addresses all the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We validate experimentally the theoretical complexity and demonstrate that its evaluation is dominated by the page rendering of the underlying browser. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. OXPath is available under an open source license. © 2011 VLDB Endowment.
spellingShingle Furche, T
Gottlob, G
Grasso, G
Schallhart, C
Sellers, A
OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
title OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
title_full OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
title_fullStr OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
title_full_unstemmed OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
title_short OXPath: A Language for Scalable, Memory-efficient Data Extraction from Web Applications.
title_sort oxpath a language for scalable memory efficient data extraction from web applications
work_keys_str_mv AT furchet oxpathalanguageforscalablememoryefficientdataextractionfromwebapplications
AT gottlobg oxpathalanguageforscalablememoryefficientdataextractionfromwebapplications
AT grassog oxpathalanguageforscalablememoryefficientdataextractionfromwebapplications
AT schallhartc oxpathalanguageforscalablememoryefficientdataextractionfromwebapplications
AT sellersa oxpathalanguageforscalablememoryefficientdataextractionfromwebapplications