ARCOMEM Crawling Architecture

The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited t...

Bibliographic Details
Main Authors:	Vassilis Plachouras, Florent Carpentier, Muhammad Faheem, Julien Masanès, Thomas Risse, Pierre Senellart, Patrick Siehndel, Yannis Stavrakas
Format:	Article
Language:	English
Published:	MDPI AG 2014-08-01
Series:	Future Internet
Subjects:	web archiving; crawling architecture; content acquisition
Online Access:	http://www.mdpi.com/1999-5903/6/3/518

collection	DOAJ
description	The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and we describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and leveraging the information from social media.
first_indexed	2024-12-12T20:59:15Z
format	Article
id	doaj.art-ec5ee55717a648f5aa26fb3022e888c3
institution	Directory Open Access Journal
issn	1999-5903
language	English
last_indexed	2024-12-12T20:59:15Z
publishDate	2014-08-01
publisher	MDPI AG
record_format	Article
series	Future Internet
spelling	doaj.art-ec5ee55717a648f5aa26fb3022e888c3; 2022-12-22T00:12:13Z; eng; MDPI AG; Future Internet; ISSN 1999-5903; 2014-08-01; vol. 6, no. 3, pp. 518-541; DOI 10.3390/fi6030518 (fi6030518); ARCOMEM Crawling Architecture
	Vassilis Plachouras: Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece
	Florent Carpentier: Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France
	Muhammad Faheem: CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France
	Julien Masanès: Internet Memory Foundation, 45 ter rue de la Révolution, 93100 Montreuil, France
	Thomas Risse: Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany
	Pierre Senellart: CNRS LTCI, Institut Mines-Télécom, Télécom ParisTech, 46 rue Barrault, 75634 Paris Cedex 13, France
	Patrick Siehndel: Research Center, University of Hannover, Appelstr. 9a, 30167 Hannover, Germany
	Yannis Stavrakas: Institute for the Management of Information Systems, Athena Research and Innovation Center, Artemidos 6 & Epidavrou, Maroussi 15125, Greece

Similar Items