SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval

In this paper we present a Structure-driven Incremental Forum crawler (SInFo) that targets the latest content in crawling cycles. On a Web forum, user generated content is almost never changed or deleted, but it is constantly added. There is a wide spectrum of forum technologies that have different...


Bibliographic Details
Main Authors: Milos Pavkovic, Jelica Protic
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
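Subjects: Crawling technique; data retrieval; incremental crawling; optimization; traversal strategy; Web forum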
Online Access: https://ieeexplore.ieee.org/document/8832156/
author Milos Pavkovic
Jelica Protic
collection DOAJ
description In this paper we present a Structure-driven Incremental Forum crawler (SInFo) that targets the latest content in crawling cycles. On a Web forum, user-generated content is almost never changed or deleted, but new content is constantly added. There is a wide spectrum of forum technologies, each with its own representations and navigational paths that lead the user to the latest content. Targeting the latest content is not a trivial task, since adding new content to a forum often shifts old content between pages. Ignoring the way forum content is distributed and sorted can lead, during incremental crawling, to repeated visits to pages whose data was already collected in previous crawls. The main goal of SInFo is to avoid transferring duplicate content in incremental forum crawling, using a generic approach that works regardless of the forum technology. The problem is reduced to discovering and utilizing the following forum technology features: (1) how content is represented and sorted on forum index and thread pages, and (2) the navigational options the forum technology offers between pages. With the proposed methods and techniques, we show how to locate the target page by observing the URL signature format, and how to minimize the number of downloads required to fetch the page containing the latest content. The experiments were conducted on custom technologies and on a wide range of pre-built forum packages covering more than 80% of representative, widely used software packages. SInFo showed high accuracy and a low level of duplicate transmission, reaching an average of 92.6% new content in each recrawl cycle.
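The idea of locating the target page from a URL signature can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a thread sorted oldest-first with a fixed number of posts per page, and the function names, the `{thread_id}`/`{page}` template placeholders, and the `forum.example` URL are all hypothetical.

```python
def pages_to_fetch(seen_posts: int, total_posts: int, posts_per_page: int) -> list[int]:
    """Return the 1-based page numbers that contain posts not seen in the last crawl.

    Assumption (not from the paper): posts are sorted oldest-first and every
    page except the last holds exactly `posts_per_page` posts.
    """
    if seen_posts < 0 or total_posts < 0 or posts_per_page <= 0:
        raise ValueError("counts must be non-negative and page size positive")
    if total_posts <= seen_posts:
        return []  # nothing new since the previous crawl
    # Page holding the first unseen post (the post at 0-based index `seen_posts`).
    first_page = seen_posts // posts_per_page + 1
    # Last page of the thread, computed with integer ceiling division.
    last_page = (total_posts + posts_per_page - 1) // posts_per_page
    return list(range(first_page, last_page + 1))


def page_url(template: str, thread_id: int, page: int) -> str:
    """Fill a hypothetical URL signature such as '.../thread/{thread_id}?page={page}'."""
    return template.format(thread_id=thread_id, page=page)


if __name__ == "__main__":
    signature = "https://forum.example/thread/{thread_id}?page={page}"  # hypothetical
    for page in pages_to_fetch(seen_posts=45, total_posts=62, posts_per_page=20):
        print(page_url(signature, thread_id=7, page=page))
```

Under these assumptions only the pages from the first unseen post onward are requested, so pages 1 and 2 of the example thread are never downloaded again. A forum that sorts newest-first, or that has variable page sizes, would need a different calculation.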
first_indexed 2024-12-14T10:23:37Z
format Article
id doaj.art-7232bdd21a57482aad23c8c8583f18ba
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-14T10:23:37Z
publishDate 2019-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-7232bdd21a57482aad23c8c8583f18ba
2022-12-21T23:06:28Z
eng
IEEE
IEEE Access
2169-3536
2019-01-01
7
126941
126961
10.1109/ACCESS.2019.2939872
8832156
SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval
Milos Pavkovic, https://orcid.org/0000-0001-7776-6045, School of Electrical Engineering, University of Belgrade, Belgrade, Serbia
Jelica Protic, School of Electrical Engineering, University of Belgrade, Belgrade, Serbia
https://ieeexplore.ieee.org/document/8832156/
Crawling technique
data retrieval
incremental crawling
optimization
traversal strategy
Web forum
title SInFo – Structure-Driven Incremental Forum Crawler That Optimizes User-Generated Content Retrieval
topic Crawling technique
data retrieval
incremental crawling
optimization
traversal strategy
Web forum
url https://ieeexplore.ieee.org/document/8832156/