SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval

Bibliographic Details
Main Authors: Milos Pavkovic, Jelica Protic
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/8832156/
Description
Summary: In this paper, we present SInFo, a Structure-driven Incremental Forum crawler that targets the latest content in each crawling cycle. On a Web forum, user-generated content is almost never changed or deleted, but it is constantly added. Forum technologies vary widely in how they represent content and in the navigational paths that lead the user to the latest content. Targeting the latest content is not trivial, since adding new content to a forum often shifts old content between pages. Ignoring how forum content is distributed and sorted can lead an incremental crawler to revisit pages holding the same data as in previous crawls. The main goal of SInFo is to avoid transferring duplicate content during incremental forum crawling, using a generic approach that is independent of the forum technology. The problem is reduced to discovering and utilizing two features of the forum technology: (1) how content is represented and sorted on forum index and thread pages, and (2) the navigational options available between pages. With the proposed methods and techniques, we show how to locate the target page by observing the URL signature format and how to minimize the number of downloads required to fetch the page containing the latest content. The experiments were conducted on custom technologies and on a wide range of pre-built forum packages covering more than 80% of representative, widely used software packages. SInFo showed high accuracy and a low level of duplicate transmission, reaching an average of 92.6% new content in each recrawl cycle.
ISSN: 2169-3536
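
Illustrative sketch (not from the article): the summary describes locating the page with the latest content from the URL signature format while keeping downloads to a minimum. The Python fragment below shows one simple way that idea could look. It is not SInFo's actual method. It assumes a thread sorted oldest-first, a fixed number of posts per page, and post counts known from the previous crawl and from the forum index. The names `pages_to_fetch` and `page_url`, and the URL pattern in the usage lines, are hypothetical.

```python
def pages_to_fetch(seen_posts: int, total_posts: int, posts_per_page: int) -> list[int]:
    """Return the 1-based page numbers that hold posts not yet crawled.

    Assumption (not from the article): posts are sorted oldest-first and are
    only ever appended, so every unseen post sits at or after the page that
    contains post number seen_posts + 1.
    """
    if posts_per_page <= 0:
        raise ValueError("posts_per_page must be positive")
    if total_posts <= seen_posts:
        return []  # nothing new since the previous crawl cycle
    first_new_page = seen_posts // posts_per_page + 1
    # Ceiling division without floating point.
    last_page = (total_posts + posts_per_page - 1) // posts_per_page
    return list(range(first_new_page, last_page + 1))


def page_url(url_signature: str, thread_id: int, page: int) -> str:
    """Fill a hypothetical URL signature such as '.../t/{thread_id}?page={page}'."""
    return url_signature.format(thread_id=thread_id, page=page)


# Usage: 45 posts were seen in the last cycle, the index now reports 52,
# and the forum shows 20 posts per page, so only one page is requested.
signature = "https://forum.example/t/{thread_id}?page={page}"
urls = [page_url(signature, 7, p) for p in pages_to_fetch(45, 52, 20)]
```

Under these assumptions the crawler requests only the pages from the first unseen post onward, rather than walking the thread from page 1. A newest-first forum, where new posts push old ones onto later pages, would need a different calculation.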