An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm
A topic-precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not directly occur in...
Main Authors: | Sakunthala Prabha K. S., Mahesh C., Raja S. P. |
---|---|
Format: | Article |
Language: | English |
Published: | Sciendo 2021-06-01 |
Series: | Cybernetics and Information Technologies |
Subjects: | probabilistic model, hybrid semantic similarity, web focused crawler, string matching |
Online Access: | https://doi.org/10.2478/cait-2021-0022 |
_version_ | 1818745140602208256 |
---|---|
author | Sakunthala Prabha K. S.; Mahesh C.; Raja S. P. |
author_sort | Sakunthala Prabha K. S. |
collection | DOAJ |
description | A topic-precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not occur directly in the web page. The semantic-based similarity measure yields a precise relevance score even if only synonyms of the given topic occur in the web page. However, when the topic is missing from the ontology, semantic focused crawlers also produce an inaccurate relevance score. This paper addresses these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity-based measure with a probabilistic similarity-based measure. The experimental results show that this algorithm increases the efficiency of focused web crawlers and achieves a better Harvest Rate (HR), Precision (P) and Irrelevance Ratio (IR) than existing focused web crawlers. |
first_indexed | 2024-12-18T02:55:28Z |
format | Article |
id | doaj.art-812fa957c2ab40c385e351924731f2be |
institution | Directory of Open Access Journals |
issn | 1314-4081 |
language | English |
last_indexed | 2024-12-18T02:55:28Z |
publishDate | 2021-06-01 |
publisher | Sciendo |
record_format | Article |
series | Cybernetics and Information Technologies |
spelling | doaj.art-812fa957c2ab40c385e351924731f2be; 2022-12-21T21:23:22Z; eng; Sciendo; Cybernetics and Information Technologies; 1314-4081; 2021-06-01; vol. 21, no. 2, pp. 105-120; 10.2478/cait-2021-0022; An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm; Sakunthala Prabha K. S. (0), Mahesh C. (1), Raja S. P. (2); (0) Department of Information Technology, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India; (1) Department of Information Technology, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India; (2) School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India; https://doi.org/10.2478/cait-2021-0022; probabilistic model; hybrid semantic similarity; web focused crawler; string matching |
title | An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm |
topic | probabilistic model; hybrid semantic similarity; web focused crawler; string matching |
url | https://doi.org/10.2478/cait-2021-0022 |
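
The description in this record notes that a cosine-based measure gives an inaccurate relevance score when the topic term does not occur directly in the page, while a semantic measure still scores pages that contain synonyms. The Python sketch below illustrates only that contrast. It is not the hybrid string-matching algorithm from the article: the term-frequency vectors, the function names, and the small `synonyms` dictionary (a stand-in for an ontology lookup) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def _cosine(vec_a, vec_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_relevance(topic, page):
    """Plain cosine relevance: only exact term matches contribute."""
    return _cosine(Counter(tokenize(topic)), Counter(tokenize(page)))


def synonym_aware_relevance(topic, page, synonyms):
    """Cosine relevance after mapping known synonyms to one canonical term.

    `synonyms` is a hypothetical toy table standing in for an ontology.
    A term that is missing from it gets no help, which mirrors the
    ontology-coverage problem the description mentions.
    """
    def canonical(tokens):
        return Counter(synonyms.get(t, t) for t in tokens)

    return _cosine(canonical(tokenize(topic)), canonical(tokenize(page)))


topic = "car"
page_direct = "a used car for sale"
page_synonym = "a used automobile for sale"
synonyms = {"automobile": "car"}  # assumed mapping, illustration only

print(cosine_relevance(topic, page_direct))
print(cosine_relevance(topic, page_synonym))
print(synonym_aware_relevance(topic, page_synonym, synonyms))
```

The plain cosine score for the synonym-only page is zero because the two vectors share no term, whereas the synonym-aware variant scores it the same as the page that contains the topic term directly. A topic that is absent from the `synonyms` table falls back to the plain cosine behaviour, which is the gap the article's probabilistic component is described as addressing.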