You Need Only One Clue for Effective Record Segmentation

Bibliographic details
Main authors: Wang, C; Furche, T; Gottlob, G; Grasso, G; Orsi, G; Schallhart, C
Format: Conference item
Published: 2011
Description
Summary: Record segmentation is a core problem in data extraction. Previous approaches have focused on increasingly sophisticated heuristics that use no knowledge of the concrete domain. In this work, we demonstrate that with only a single clue about mandatory attributes in a given domain, straightforward rules for record segmentation suffice to achieve 100% precise record extraction from the vast majority of web sites in that domain. These results are the first outcomes of the recently launched ERC project DIADEM on domain-specific intelligent automated data extraction.