Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

Duplicate and near-duplicate web pages are a chief concern for web search engines: they consume enormous index storage, which ultimately slows down the serving of results and raises its cost. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other.
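
The title names sentence features and fingerprinting as the comparison step. The sketch below is one possible reading of that step, not the authors' implementation. It assumes a naive punctuation-based sentence splitter, SHA-1 digests truncated to 16 hex characters as fingerprints, Jaccard overlap as the similarity measure, and a threshold of 0.6. All function names and all of these choices are hypothetical, and the clustering stage is left out entirely.

```python
import hashlib
import re


def sentence_fingerprints(text: str) -> set:
    """Hash each normalized sentence of `text` into a short hex fingerprint."""
    # Naive splitter (assumption): break on runs of ., ! or ?
    sentences = re.split(r"[.!?]+", text)
    prints = set()
    for sentence in sentences:
        # Lowercase and collapse whitespace so trivial edits do not change the hash.
        normalized = " ".join(sentence.lower().split())
        if normalized:
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            prints.add(digest[:16])  # truncated digest used as the fingerprint
    return prints


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two fingerprint sets (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a: str, page_b: str, threshold: float = 0.6) -> bool:
    """Flag two pages as near duplicates when their sentence overlap reaches `threshold`."""
    overlap = jaccard(sentence_fingerprints(page_a), sentence_fingerprints(page_b))
    return overlap >= threshold
```

In a larger collection, a comparison like this would only be run on pages that a prior grouping step has placed together, which is the role the title gives to clustering.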

Bibliographic Details
Main Authors: J. Prasanna Kumar, P. Govindarajulu
Format: Article
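Language: English
Published: Springer 2013-02-01
Series: International Journal of Computational Intelligence Systems
Subjects: Web Crawling; Web page; Duplicate web page; Near duplicate web page; Near duplicate detection; fingerprinting
Online Access: https://www.atlantis-press.com/article/25868364.pdf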
author J. Prasanna Kumar
P. Govindarajulu
collection DOAJ
description Duplicate and near-duplicate web pages are a chief concern for web search engines: they consume enormous index storage, which ultimately slows down the serving of results and raises its cost. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other, and the problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be examined, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we identify near-duplicate web pages accurately and efficiently. Experiments on web page collections confirmed the efficiency of the proposed approach in detecting near-duplicate web pages.
first_indexed 2024-04-13T07:29:38Z
format Article
id doaj.art-e6022fd7be8444a9be0d1b49e4c8039f
institution Directory Open Access Journal
issn 1875-6883
language English
last_indexed 2024-04-13T07:29:38Z
publishDate 2013-02-01
publisher Springer
record_format Article
series International Journal of Computational Intelligence Systems
spelling doaj.art-e6022fd7be8444a9be0d1b49e4c8039f | 2022-12-22T02:56:24Z | eng | Springer | International Journal of Computational Intelligence Systems | 1875-6883 | 2013-02-01 | 6 | 1 | 10.1080/18756891.2013.752657 | Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting | J. Prasanna Kumar | P. Govindarajulu | https://www.atlantis-press.com/article/25868364.pdf | Web Crawling | Web page | Duplicate web page | Near duplicate web page | Near duplicate detection | fingerprinting
title Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting
topic Web Crawling
Web page
Duplicate web page
Near duplicate web page
Near duplicate detection
fingerprinting
url https://www.atlantis-press.com/article/25868364.pdf