Robust and noise resistant wrapper induction

Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should not only select the annotated content accurately but also other content following the same template. Beyond accurately matching the template, we consider two additiona...


Bibliographic details
Main authors: Furche, T; Guo, J; Maneth, S; Schallhart, C
Format: Conference item
Published: Association for Computing Machinery 2016
_version_ 1826291973006295040
author Furche, T
Guo, J
Maneth, S
Schallhart, C
author_sort Furche, T
collection OXFORD
description Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should accurately select not only the annotated content but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine-generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that, even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise-resistant queries. We evaluate the queries on snapshots of web pages that change over time, as provided by the Internet Archive, and show that the induced queries are as robust as human-made queries. The queries often survive hundreds, sometimes thousands, of days, despite many changes to the relative position of the selected nodes (including changes at the template level). This is due to the few, discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant to positive noise (up to 50%) and negative noise (up to 20%).
first_indexed 2024-03-07T03:07:33Z
format Conference item
id oxford-uuid:b31073dc-d63c-47e7-9fc2-e7b94d3228f1
institution University of Oxford
last_indexed 2024-03-07T03:07:33Z
publishDate 2016
publisher Association for Computing Machinery
record_format dspace
spelling oxford-uuid:b31073dc-d63c-47e7-9fc2-e7b94d3228f1; 2022-03-27T04:16:22Z; Robust and noise resistant wrapper induction; Conference item; http://purl.org/coar/resource_type/c_5794; uuid:b31073dc-d63c-47e7-9fc2-e7b94d3228f1; Symplectic Elements at Oxford; Association for Computing Machinery; 2016; Furche, T; Guo, J; Maneth, S; Schallhart, C
title Robust and noise resistant wrapper induction
title_sort robust and noise resistant wrapper induction
work_keys_str_mv AT furchet robustandnoiseresistantwrapperinduction
AT guoj robustandnoiseresistantwrapperinduction
AT maneths robustandnoiseresistantwrapperinduction
AT schallhartc robustandnoiseresistantwrapperinduction
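
The description attributes robustness to queries that rely on a few discriminative anchor nodes rather than on the relative position of the selected nodes. The following is a minimal sketch of that contrast only, not the authors' query language or induction algorithm. The two page snapshots, the class names, and both queries are hypothetical, and the limited XPath subset of Python's standard xml.etree.ElementTree module stands in for the XPath subset the paper defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical snapshots of one page before and after a template change.
SNAPSHOT_OLD = """
<html><body>
  <div class="nav">Home</div>
  <div class="price"><span>42.00</span></div>
</body></html>
"""

SNAPSHOT_NEW = """
<html><body>
  <div class="banner">Sale!</div>
  <div class="nav">Home</div>
  <section><div class="price"><span>45.00</span></div></section>
</body></html>
"""

# A positional query depends on where the target sits among its siblings.
POSITIONAL_QUERY = "./body/div[2]/span"
# An anchored query depends on one discriminative node (assumed: class="price").
ANCHORED_QUERY = ".//div[@class='price']/span"


def select_text(page, query):
    """Return the text of every node that the query selects on the page."""
    root = ET.fromstring(page.strip())
    return [node.text for node in root.findall(query)]


if __name__ == "__main__":
    for name, query in [("positional", POSITIONAL_QUERY), ("anchored", ANCHORED_QUERY)]:
        print(name, select_text(SNAPSHOT_OLD, query), select_text(SNAPSHOT_NEW, query))
```

In this toy case the positional query stops matching once a banner is inserted and the price block is wrapped in a new element, while the anchored query still selects the price on both snapshots.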