Classic term weighting technique for mining web content outliers

Outlier analysis has become a popular topic in data mining, but less work has been done on detecting outliers in web content. Mining web content outliers is used to detect irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) hav...

Bibliographic Details
Main Authors:	Wan Zulkifeli, Wan Rusila; Mustapha, Norwati; Mustapha, Aida
Format:	Conference or Workshop Item
Language:	English
Published:	Planetary Scientific Research Center 2012
Online Access:	http://psasir.upm.edu.my/id/eprint/49837/1/Classic%20term%20weighting%20technique%20for%20mining%20web%20content%20outliers.pdf

author	Wan Zulkifeli, Wan Rusila; Mustapha, Norwati; Mustapha, Aida
author_sort	Wan Zulkifeli, Wan Rusila
collection	UPM
description	Outlier analysis has become a popular topic in data mining, but less work has been done on detecting outliers in web content. Mining web content outliers is used to detect irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to measure the relevance of a term in a web document. However, when document length varies, relative frequency is preferred. This study uses maximum frequency normalization and applies Inverse Document Frequency (IDF) weighting, a traditional term weighting method in IR, to exploit terms that occur in fewer documents, which are considered more discriminative than frequent terms. The dataset is drawn from the 20 Newsgroups dataset. TF.IDF is used in the dissimilarity measure, and the results reach up to 91.10% accuracy, about 17.77% higher than the previous technique.
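The weighting named in the description can be illustrated with a minimal sketch. It assumes TF is each term's count divided by the count of the most frequent term in the same document, and that IDF is log(N / df). The record does not state the exact IDF variant or the dissimilarity measure used in the paper, so both the IDF form and the cosine-based dissimilarity below are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. TF uses maximum frequency
    normalization; IDF is assumed to be log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        vectors.append({
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_dissimilarity(u, v):
    """1 - cosine similarity of two sparse vectors (an assumed measure)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # A vector with no discriminative terms is treated as fully dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Under these assumptions, a term that appears in every document receives weight zero, so a page that shares only such common terms with the rest of a portal scores as maximally dissimilar and becomes an outlier candidate.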
first_indexed	2024-03-06T09:08:11Z
format	Conference or Workshop Item
id	upm.eprints-49837
institution	Universiti Putra Malaysia
language	English
last_indexed	2024-03-06T09:08:11Z
publishDate	2012
publisher	Planetary Scientific Research Center
record_format	dspace
spelling	upm.eprints-49837 2016-12-30T05:31:36Z http://psasir.upm.edu.my/id/eprint/49837/ PeerReviewed application/pdf en Wan Zulkifeli, Wan Rusila and Mustapha, Norwati and Mustapha, Aida (2012) Classic term weighting technique for mining web content outliers. In: International Conference on Computational Techniques and Artificial Intelligence (ICCTAI'2012), 11-12 Feb. 2012, Penang, Malaysia. (pp. 271-275). http://psrcentre.org/proceeding.php?page=2&mode=detail&catid=128&type=1
title	Classic term weighting technique for mining web content outliers
url	http://psasir.upm.edu.my/id/eprint/49837/1/Classic%20term%20weighting%20technique%20for%20mining%20web%20content%20outliers.pdf

Similar Items