OPAL: automated form understanding for the deep web.
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has receiv...
Main Authors: | Furche, T; Gottlob, G; Grasso, G; Guo, X; Orsi, G; Schallhart, C |
---|---|
Other Authors: | Mille, A |
Format: | Conference item |
Published: | ACM 2012 |
_version_ | 1797092665408356352 |
---|---|
author | Furche, T Gottlob, G Grasso, G Guo, X Orsi, G Schallhart, C |
author2 | Mille, A |
author_facet | Mille, A Furche, T Gottlob, G Grasso, G Guo, X Orsi, G Schallhart, C |
author_sort | Furche, T |
collection | OXFORD |
description | Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near-perfect accuracy (> 98%). |
first_indexed | 2024-03-07T03:49:16Z |
format | Conference item |
id | oxford-uuid:c0a6a9a9-7c1f-44dd-9786-dcffc16b8a21 |
institution | University of Oxford |
last_indexed | 2024-03-07T03:49:16Z |
publishDate | 2012 |
publisher | ACM |
record_format | dspace |
spelling | oxford-uuid:c0a6a9a9-7c1f-44dd-9786-dcffc16b8a212022-03-27T05:55:55ZOPAL: automated form understanding for the deep web.Conference itemhttp://purl.org/coar/resource_type/c_5794uuid:c0a6a9a9-7c1f-44dd-9786-dcffc16b8a21Symplectic Elements at OxfordACM2012Furche, TGottlob, GGrasso, GGuo, XOrsi, GSchallhart, CMille, AGandon, FMisselis, JRabinovich, MStaab, SForms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%). |
spellingShingle | Furche, T Gottlob, G Grasso, G Guo, X Orsi, G Schallhart, C OPAL: automated form understanding for the deep web. |
title | OPAL: automated form understanding for the deep web. |
title_full | OPAL: automated form understanding for the deep web. |
title_fullStr | OPAL: automated form understanding for the deep web. |
title_full_unstemmed | OPAL: automated form understanding for the deep web. |
title_short | OPAL: automated form understanding for the deep web. |
title_sort | opal automated form understanding for the deep web |
work_keys_str_mv | AT furchet opalautomatedformunderstandingforthedeepweb AT gottlobg opalautomatedformunderstandingforthedeepweb AT grassog opalautomatedformunderstandingforthedeepweb AT guox opalautomatedformunderstandingforthedeepweb AT orsig opalautomatedformunderstandingforthedeepweb AT schallhartc opalautomatedformunderstandingforthedeepweb |