Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting
Duplicate and near-duplicate web pages are the chief concerns for web search engines. In reality, they incur enormous space to store the indexes, ultimately slowing down and increasing the cost of serving results. A variety of techniques have been developed to identify pairs of web pages that are ...
Main Authors: | J. Prasanna Kumar, P. Govindarajulu |
---|---|
Format: | Article |
Language: | English |
Published: | Springer 2013-02-01 |
Series: | International Journal of Computational Intelligence Systems |
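Subjects: | Web Crawling; Web page; Duplicate web page; Near duplicate web page; Near duplicate detection; fingerprinting |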
Online Access: | https://www.atlantis-press.com/article/25868364.pdf |
_version_ | 1811302465532854272 |
---|---|
author | J. Prasanna Kumar; P. Govindarajulu |
author_facet | J. Prasanna Kumar; P. Govindarajulu |
author_sort | J. Prasanna Kumar |
collection | DOAJ |
description | Duplicate and near-duplicate web pages are the chief concerns for web search engines. In reality, they incur enormous space to store the indexes, ultimately slowing down and increasing the cost of serving results. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other. The problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be checked for near-duplicates, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we identify the near-duplicate web pages precisely and efficiently. Experiments carried out on web page collections confirmed the efficiency of the proposed approach in detecting near-duplicate web pages. |
first_indexed | 2024-04-13T07:29:38Z |
format | Article |
id | doaj.art-e6022fd7be8444a9be0d1b49e4c8039f |
institution | Directory of Open Access Journals |
issn | 1875-6883 |
language | English |
last_indexed | 2024-04-13T07:29:38Z |
publishDate | 2013-02-01 |
publisher | Springer |
record_format | Article |
series | International Journal of Computational Intelligence Systems |
spelling | doaj.art-e6022fd7be8444a9be0d1b49e4c8039f; 2022-12-22T02:56:24Z; eng; Springer; International Journal of Computational Intelligence Systems; 1875-6883; 2013-02-01; 61; 10.1080/18756891.2013.752657; Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting; J. Prasanna Kumar; P. Govindarajulu; Duplicate and near-duplicate web pages are the chief concerns for web search engines. In reality, they incur enormous space to store the indexes, ultimately slowing down and increasing the cost of serving results. A variety of techniques have been developed to identify pairs of web pages that are “similar” to each other. The problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be checked for near-duplicates, we first apply K-mode clustering and then compare sentence features and fingerprints. Using these steps, we identify the near-duplicate web pages precisely and efficiently. Experiments carried out on web page collections confirmed the efficiency of the proposed approach in detecting near-duplicate web pages.; https://www.atlantis-press.com/article/25868364.pdf; Web Crawling; Web page; Duplicate web page; Near duplicate web page; Near duplicate detection; fingerprinting |
spellingShingle | J. Prasanna Kumar; P. Govindarajulu; Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting; International Journal of Computational Intelligence Systems; Web Crawling; Web page; Duplicate web page; Near duplicate web page; Near duplicate detection; fingerprinting |
title | Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting |
title_full | Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting |
title_fullStr | Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting |
title_full_unstemmed | Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting |
title_short | Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting |
title_sort | near duplicate web page detection an efficient approach using clustering sentence feature and fingerprinting |
topic | Web Crawling; Web page; Duplicate web page; Near duplicate web page; Near duplicate detection; fingerprinting |
url | https://www.atlantis-press.com/article/25868364.pdf |
work_keys_str_mv | AT jprasannakumar nearduplicatewebpagedetectionanefficientapproachusingclusteringsentencefeatureandfingerprinting AT pgovindarajulu nearduplicatewebpagedetectionanefficientapproachusingclusteringsentencefeatureandfingerprinting |
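
The description above mentions comparing pages by sentence-level features and fingerprints. This record does not specify the authors' actual features, fingerprint function, or thresholds, so the following is only a minimal sketch of the general idea under stated assumptions: each normalized sentence is hashed with SHA-1 (an illustrative choice), two pages are compared by the Jaccard overlap of their sentence fingerprints, and the `threshold` value is hypothetical. The K-mode clustering stage that would first narrow down candidate pages is omitted.

```python
import hashlib
import re


def sentence_fingerprints(text):
    """Return a set with one short hash per normalized sentence.

    Assumption: splitting on ., ! and ? and hashing with SHA-1 is an
    illustrative scheme, not the method described in the article.
    """
    fingerprints = set()
    for sentence in re.split(r"[.!?]+", text):
        # Lowercase and collapse whitespace so trivial edits do not matter.
        normalized = " ".join(sentence.lower().split())
        if normalized:
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            fingerprints.add(digest[:16])
    return fingerprints


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(page_a, page_b, threshold=0.5):
    """Flag two pages as near-duplicates when enough sentences match.

    Assumption: the 0.5 threshold is hypothetical and would need tuning.
    """
    similarity = jaccard(sentence_fingerprints(page_a),
                         sentence_fingerprints(page_b))
    return similarity >= threshold
```

In a fuller pipeline, a clustering step would first group pages so that this pairwise comparison runs only within each cluster instead of across the whole collection.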