Machine learning based web page classifier



Bibliographic Details
Main Author: Setiawan, Andri
Other Authors: Chang Chip Hong
Format: Final Year Project (FYP)
Language: English
Published: 2016
Online Access: http://hdl.handle.net/10356/68086
Description
Summary: In recent years, Internet usage has increased tremendously, and the total number of web pages has become enormous. The Internet is used for a wide range of purposes and is growing very rapidly. In 2015, worldwidewebsize estimated that there were around 50 billion web pages on the Internet [1]. Web directories, such as DMOZ (directory.mozilla.org) and Hotfrog, classify web pages into a set of categories to assist Internet users and search engines such as Google. Search engines have been known to use web directories to find and rank web pages for certain keywords. The largest web directory, DMOZ, is human-edited and lists around 4 million web pages [2]. Most web directories hire web experts to classify web pages into categories, an approach that cannot keep pace with the rate at which the Internet is growing. Hence, to automate web categorization and make it more effective, methods from machine learning and data mining have been researched. In this project, the features used for the classifier are all related to the HTML structure of the web pages: the most common HTML tags, metadata, and images are extracted from the HTML document. The classifiers used are a Neural Network for Pattern Recognition and a Support Vector Machine. Four classes of web pages are chosen for this project: Online Store, Internet Forum, News Article, and Blog Article. The web pages are collected manually through the Google search engine. The final application of this project classifies a web page using its URL as input.
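
The HTML-structure features described in the summary can be illustrated with a minimal sketch. It is not the project's implementation: the list of tracked tags, the two metadata flags, and the names TRACKED_TAGS, TagCounter, and extract_features are assumptions made for illustration, since the summary does not say which tags or metadata fields were used. The sketch relies only on the Python standard library and turns an HTML string into a fixed-length numeric vector.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag list; the summary does not name the tags the project counted.
TRACKED_TAGS = ("a", "div", "p", "img", "form", "input", "li", "h1", "h2", "meta")


class TagCounter(HTMLParser):
    """Counts start tags and records the names of <meta name=...> elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.meta_names = set()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        self.counts[tag] += 1
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.meta_names.add(name.lower())


def extract_features(html):
    """Return tag counts for TRACKED_TAGS plus two assumed metadata flags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    features = [parser.counts[tag] for tag in TRACKED_TAGS]
    # Assumed metadata features: presence of keywords / description meta tags.
    features.append(int("keywords" in parser.meta_names))
    features.append(int("description" in parser.meta_names))
    return features


sample = """<html><head><meta name="description" content="Shop"></head>
<body><div><img src="a.png"><img src="b.png"/><a href="/cart">Cart</a></div></body></html>"""
print(extract_features(sample))
```

A vector of this kind, computed for each labelled page, could then be passed to a classifier such as a support vector machine (for example sklearn.svm.SVC, if scikit-learn is available). For the URL-based application, the HTML would first have to be downloaded, for instance with urllib.request.urlopen.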