Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support

In this paper, we describe our work in progress in the scope of web-scale information extraction and information retrieval utilizing distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent sear...


Bibliographic Details
Main Authors:	Stefan Dlugolinsky, Martin Seleng, Michal Laclavik, Ladislav Hluchy
Format:	Article
Language:	English
Published:	AGH University of Science and Technology Press 2012-01-01
Series:	Computer Science
Subjects:	distributed web crawling, information extraction, information retrieval, semantic search, geocoding, spatial search
Online Access:	http://journals.agh.edu.pl/csci/article/download/42/31

_version_	1818241385218703360
author	Stefan Dlugolinsky Martin Seleng Michal Laclavik Ladislav Hluchy
author_facet	Stefan Dlugolinsky Martin Seleng Michal Laclavik Ladislav Hluchy
author_sort	Stefan Dlugolinsky
collection	DOAJ
description	In this paper, we describe our work in progress in the scope of web-scale information extraction and information retrieval utilizing distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of the extracted information and, finally, indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases: the first is search in job offers retrieved from the LinkedIn portal and the second is search in BBC news feeds. We also discuss several problems we had to face during the implementation, and spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
first_indexed	2024-12-12T13:28:30Z
format	Article
id	doaj.art-4bafd727ffcf448eb88e707d92e20b8c
institution	Directory Open Access Journal
issn	1508-2806
language	English
last_indexed	2024-12-12T13:28:30Z
publishDate	2012-01-01
publisher	AGH University of Science and Technology Press
record_format	Article
series	Computer Science
spelling	doaj.art-4bafd727ffcf448eb88e707d92e20b8c 2022-12-22T00:23:07Z eng AGH University of Science and Technology Press Computer Science 1508-2806 2012-01-01 13 4 5 10.7494/csci.2012.13.4.5 Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support Stefan Dlugolinsky 0 Martin Seleng 1 Michal Laclavik 2 Ladislav Hluchy 3 Institute of Informatics, Slovak Academy of Sciences, Bratislava Institute of Informatics, Slovak Academy of Sciences, Bratislava Institute of Informatics, Slovak Academy of Sciences, Bratislava Institute of Informatics, Slovak Academy of Sciences, Bratislava In this paper, we describe our work in progress in the scope of web-scale information extraction and information retrieval utilizing distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. Proposed architecture is focused on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of extracted information and finally on indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases, where the first is search in job offers retrieved from the LinkedIn portal and the second is search in BBC news feeds and discuss several problems we had to face during the implementation. We also discuss spatial search applications for both cases because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process. http://journals.agh.edu.pl/csci/article/download/42/31 distributed web crawling information extraction information retrieval semantic search geocoding spatial search
spellingShingle	Stefan Dlugolinsky Martin Seleng Michal Laclavik Ladislav Hluchy Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support Computer Science distributed web crawling information extraction information retrieval semantic search geocoding spatial search
title	Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support
title_full	Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support
title_fullStr	Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support
title_full_unstemmed	Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support
title_short	Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support
title_sort	distributed web scale infrastructure for crawling indexing and search with semantic support
topic	distributed web crawling, information extraction, information retrieval, semantic search, geocoding, spatial search
url	http://journals.agh.edu.pl/csci/article/download/42/31
work_keys_str_mv	AT stefandlugolinsky distributedwebscaleinfrastructureforcrawlingindexingandsearchwithsemanticsupport AT martinseleng distributedwebscaleinfrastructureforcrawlingindexingandsearchwithsemanticsupport AT michallaclavik distributedwebscaleinfrastructureforcrawlingindexingandsearchwithsemanticsupport AT ladislavhluchy distributedwebscaleinfrastructureforcrawlingindexingandsearchwithsemanticsupport


Similar Items