OXPath: A language for scalable data extraction, automation, and crawling on the deep web.

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key re...

Bibliographic Details
Main Authors:	Furche, T, Gottlob, G, Grasso, G, Schallhart, C, Sellers, A
Format:	Journal article
Language:	English
Published:	2013

collection	OXFORD
description	The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed, matching all the above requirements. OXPath's page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath's resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. © 2012 Springer-Verlag.
first_indexed	2024-03-07T04:47:03Z
id	oxford-uuid:d3a36771-e283-450b-9874-f4759ff46698
institution	University of Oxford
last_indexed	2024-03-07T04:47:03Z
record_format	dspace

Similar Items