Web page representations and data extraction with BERyL

Bibliographic Details
Main Authors: Kravchenko, A, Fayzrakhmanov, R, Sallinger, E
Format: Conference item
Published: Springer Verlag 2018
Description
Summary: The web contains a huge amount of data, which can be accessed primarily through web data extraction technology. With the increasing complexity of the web development stack and the source code, the visual representation of a web page rendered by the browser is often the only source reflecting the semantics, functional role, and logical structure of its elements. Thus, modern automatic approaches typically target visual cues and structures (e.g., DOM and CSSOM) constructed by the web browser. In this paper, we briefly analyse different representations of web pages and generic approaches, and introduce BERyL, a novel framework and language that consolidates two “worlds”, that is, the two main approaches: the rule-based approach and machine learning. The rule-based approach is used for feature engineering and pattern recognition, whilst machine learning is used for classification based on the inferred features. This is achieved through three stages: (1) feature computation, pattern construction, and application; (2) machine learning; and (3) refinement.