Effective Web Page Crawler

Bibliographic Details
Main Authors: Hilal Hadi Saleh, Israa Ali
Format: Article
Language: English
Published: University of Technology - Iraq 2011-02-01
Series: Engineering and Technology Journal
Subjects:
Online Access: https://etj.uotechnology.edu.iq/article_26186_36b7272baba534e2fd03611087c6e7c5.pdf
Description
Summary: The World Wide Web (WWW) has grown from a few thousand pages in 1993 to more than eight billion pages at present. Because of this explosion in size, web search engines are becoming increasingly important as the primary means of locating relevant information. This research aims to build a crawler that crawls the most important web pages. The crawling system that was built consists of three main techniques. The first is a Best-First technique, which is used to select the most important pages. The second is a Distributed Crawling technique, based on UbiCrawler, which is used to distribute the URLs of the selected web pages to several machines. The third is a Duplicated Pages Detection technique that uses a proposed document fingerprint algorithm.
ISSN:1681-6900
2412-0758
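
The three techniques named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not describe their scoring function, their assignment scheme, or the proposed fingerprint algorithm. The class and function names (`BestFirstFrontier`, `HostAssigner`, `fingerprint`, `DuplicateDetector`), the caller-supplied importance score, the consistent-hashing ring used as a stand-in for UbiCrawler-style host assignment, and the SHA-1 hash of normalized text used as a stand-in fingerprint are all assumptions made for illustration.

```python
import hashlib
import heapq
from bisect import bisect_right
from urllib.parse import urlsplit


class BestFirstFrontier:
    """Priority queue of URLs; the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, score):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def _hash_int(text):
    """Map a string to a 64-bit integer position on the hash ring."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


class HostAssigner:
    """Consistent-hashing ring that maps a URL's host to a crawler machine.

    Assumed stand-in for host-based URL assignment across machines.
    """

    def __init__(self, machines, replicas=64):
        # Each machine gets several points on the ring to even out the load.
        self._ring = sorted(
            (_hash_int(f"{m}#{i}"), m) for m in machines for i in range(replicas)
        )
        self._keys = [key for key, _ in self._ring]

    def machine_for(self, url):
        host = urlsplit(url).hostname or ""
        idx = bisect_right(self._keys, _hash_int(host)) % len(self._ring)
        return self._ring[idx][1]


def fingerprint(text):
    """Placeholder fingerprint: SHA-1 of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateDetector:
    """Remembers fingerprints of pages already seen."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, text):
        fp = fingerprint(text)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

In this sketch a crawl loop would pop the best URL from the frontier, ask `HostAssigner.machine_for` which machine should fetch it, and pass the fetched text to `DuplicateDetector.is_duplicate` before extracting and scoring new links. Hashing on the host rather than the full URL keeps all pages of one site on the same machine, and an exact hash only catches pages that are identical after normalization, not near-duplicates.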