An improved framework for content and link-based web spam detection: a combined approach

In the modern digital era, the Web is searched for information using search engines (SEs) as a tool. However, web spammers misuse the Web for financial benefit by ranking irrelevant and spam web pages higher than relevant pages in the search engine's results pag...


Bibliographic Details
Main Author: Shahzad, Asim
Format: Thesis
Language: English
Published: 2021
Subjects: QA76.75-76.765 Computer software
Online Access: http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf
http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf
http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf
_version_ 1796868525538672640
author Shahzad, Asim
collection UTHM
description Search engines (SEs) are now the main tool for finding information on the Web. However, web spammers exploit them for financial gain, using web spamming techniques to rank irrelevant and spam pages above relevant pages in search engine results pages (SERPs). These top-ranked but unrelated pages give the user insufficient or inappropriate information, and web spamming dramatically degrades search engine quality. Researchers have introduced several web spam detection techniques, including content-based features, link-based features, label propagation, label refinement, click-based web spam detection, and real-time web spam detection. However, identifying all spam pages on the Web with high accuracy remains an unsolved problem. This work proposes a content-based web spam detection framework, a link-based web spam detection framework, and a combined approach that identifies both types of web spam with high accuracy and can detect the newly evolved link pyramid. The content-based framework uses three proposed and two improved content-based algorithms. The link-based framework first exposes the relationship network behind link spamming and then uses the paid-links database algorithm, the spam signals algorithm, and an improved link farms algorithm to identify link-based web spam. Finally, combining the content-based and link-based frameworks enhances the accuracy of web spam detection. The performance of the proposed combined approach was evaluated and compared with the J48 classifier, the C4.5 decision tree classifier, the SVM classifier, and a heuristic combined approach. Experiments were conducted to obtain the threshold values using the proposed collection architecture on the well-known datasets WEBSPAM-UK2006 and WEBSPAM-UK2007. The results show that the proposed methods outperform the other methods, with 82.1% precision and an F-measure of 80.6%, which illustrates the proposed framework's effectiveness and applicability.
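The precision and F-measure figures quoted in the description can be related with a short sketch. It assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the description does not state. The function names and the example counts are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the spam class.

    tp: spam pages flagged as spam
    fp: non-spam pages flagged as spam
    fn: spam pages that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1.

    Assumption: the F-measure is the balanced F1 score.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denominator


if __name__ == "__main__":
    # Hypothetical confusion counts, for illustration only.
    print(precision_recall_f1(tp=8, fp=2, fn=2))
    # Recall implied by the quoted 82.1% precision and 80.6% F-measure,
    # valid only under the balanced-F1 assumption.
    print(round(implied_recall(0.821, 0.806), 3))
```

Under that assumption the quoted figures imply a recall of roughly 0.79. This is a derived value, not one reported in the record.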
first_indexed 2024-03-05T21:41:04Z
format Thesis
id uthm.eprints-1777
institution Universiti Tun Hussein Onn Malaysia
language English
last_indexed 2024-03-05T21:41:04Z
publishDate 2021
record_format dspace
spelling uthm.eprints-1777 2021-10-11T07:58:48Z http://eprints.uthm.edu.my/1777/ An improved framework for content and link-based web spam detection: a combined approach Shahzad, Asim QA76.75-76.765 Computer software 2021-05 Thesis NonPeerReviewed text en http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf text en http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf text en http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf Shahzad, Asim (2021) An improved framework for content and link-based web spam detection: a combined approach. Doctoral thesis, Universiti Tun Hussein Onn Malaysia.
spellingShingle QA76.75-76.765 Computer software
Shahzad, Asim
An improved framework for content and link-based web spam detection: a combined approach
title An improved framework for content and link-based web spam detection: a combined approach
topic QA76.75-76.765 Computer software
url http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf
http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf
http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf