An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm

A topic precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not occur directly in...


Bibliographic Details
Main Authors: Sakunthala Prabha K. S., Mahesh C., Raja S. P.
Format: Article
Language: English
Published: Sciendo 2021-06-01
Series: Cybernetics and Information Technologies
Subjects: probabilistic model; hybrid semantic similarity; web focused crawler; string matching
Online Access: https://doi.org/10.2478/cait-2021-0022
author Sakunthala Prabha K. S.
Mahesh C.
Raja S. P.
collection DOAJ
description A topic precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not occur directly in the web page. The semantic-based similarity measure provides a precise relevance score even if only synonyms of the given topic occur in the web page. However, when the topic is not available in the ontology, semantic focused crawlers also produce an inaccurate relevance score. This paper overcomes these shortcomings with a hybrid string-matching algorithm that combines the semantic similarity-based measure with the probabilistic similarity-based measure. The experimental results show that this algorithm increases the efficiency of focused web crawlers and achieves better Harvest Rate (HR), Precision (P) and Irrelevance Ratio (IR) than existing focused web crawlers.
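The cosine-similarity limitation mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cosine_similarity`, the whitespace tokenization, the raw term-frequency weighting, and the sample strings are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine_similarity(topic: str, page_text: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Hypothetical helper: whitespace tokenization and unweighted term counts
    are simplifying assumptions, not the scheme used in the paper.
    """
    a = Counter(topic.lower().split())
    b = Counter(page_text.lower().split())
    # Only terms shared by both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


topic = "car"
page_with_term = "used car prices and car reviews"
page_with_synonym = "used automobile prices and automobile reviews"

# The topic term occurs directly in the first page, so the score is positive.
score_direct = cosine_similarity(topic, page_with_term)
# The second page only uses a synonym, so no terms overlap and the score is 0.
score_synonym = cosine_similarity(topic, page_with_synonym)
```

Both pages are about the same subject, yet the second receives a zero score because the vectors share no terms. A semantic measure that consults an ontology for synonyms is intended to close that gap.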
first_indexed 2024-12-18T02:55:28Z
format Article
id doaj.art-812fa957c2ab40c385e351924731f2be
institution Directory Open Access Journal
issn 1314-4081
language English
last_indexed 2024-12-18T02:55:28Z
publishDate 2021-06-01
publisher Sciendo
record_format Article
series Cybernetics and Information Technologies
spelling doaj.art-812fa957c2ab40c385e351924731f2be
2022-12-21T21:23:22Z
eng
Sciendo
Cybernetics and Information Technologies
1314-4081
2021-06-01
Volume 21, Issue 2, pp. 105-120
10.2478/cait-2021-0022
An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm
Sakunthala Prabha K. S. (Department of Information Technology, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India)
Mahesh C. (Department of Information Technology, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India)
Raja S. P. (School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India)
https://doi.org/10.2478/cait-2021-0022
probabilistic model; hybrid semantic similarity; web focused crawler; string matching
title An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm
topic probabilistic model
hybrid semantic similarity
web focused crawler
string matching
url https://doi.org/10.2478/cait-2021-0022