OPAL: automated form understanding for the deep web.

Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has receiv...

Bibliographic Details
Main Authors: Furche, T; Gottlob, G; Grasso, G; Guo, X; Orsi, G; Schallhart, C
Other Authors: Mille, A
Format: Conference item
Published: ACM 2012
_version_ 1797092665408356352
collection OXFORD
description Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists, and previous works disagree even on the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL advances the state of the art. For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near-perfect accuracy (> 98%).
first_indexed 2024-03-07T03:49:16Z
format Conference item
id oxford-uuid:c0a6a9a9-7c1f-44dd-9786-dcffc16b8a21
institution University of Oxford
last_indexed 2024-03-07T03:49:16Z
publishDate 2012
publisher ACM
record_format dspace
spelling oxford-uuid:c0a6a9a9-7c1f-44dd-9786-dcffc16b8a21 | 2022-03-27T05:55:55Z | OPAL: automated form understanding for the deep web. | Conference item | http://purl.org/coar/resource_type/c_5794 | uuid:c0a6a9a9-7c1f-44dd-9786-dcffc16b8a21 | Symplectic Elements at Oxford | ACM | 2012 | Furche, T; Gottlob, G; Grasso, G; Guo, X; Orsi, G; Schallhart, C | Mille, A; Gandon, F; Misselis, J; Rabinovich, M; Staab, S