Arabic script web page language identification using hybrid-KNN method



Bibliographic Details
Main Authors: Selamat, Ali, Subroto, I. M. I., Ng, Choon Ching
Format: Article
Published: Imperial College Press 2009
Subjects: QA76 Computer software
author Selamat, Ali
Subroto, I. M. I.
Ng, Choon Ching
collection ePrints
description In this paper, we propose hybrid-KNN methods for language identification of Arabic script web pages. A crucial task in text-based language identification among languages that share the same script is to produce reliable features and to cope with the huge number of languages in the world. Specifically, this involves the issues of feature representation, feature selection, identification performance, retrieval performance, and noise tolerance. We therefore evaluate several methods in this work: k-nearest neighbor (KNN), support vector machine (SVM), backpropagation neural network (BPNN), hybrid KNN-SVM, and hybrid KNN-BPNN, in order to assess the capability of state-of-the-art methods. KNN is prominent in data clustering and data filtering, SVM and BPNN are well known in supervised classification, and we propose hybrid-KNN for noise removal in web page language identification. We use the standard measures of accuracy, precision, recall, and F1 to evaluate the effectiveness of the proposed hybrid-KNN. In the experiments, we observed that BPNN produces precise identification when the given data set is clean. However, as the level of noise in the training data increases, KNN-SVM performs better than KNN-BPNN against misclassified data, even at a noise level of 50%. This shows that KNN-SVM produces promising identification performance: KNN reduces the noise in the data set and SVM is reliable in language identification.
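The description above mentions KNN as a noise filter ahead of a supervised classifier, and precision, recall, and F1 as evaluation measures. The following is only a minimal sketch of those two ideas, not the authors' implementation: the character-bigram features, the cosine similarity, the choice `k=3`, and the function names `knn_filter` and `precision_recall_f1` are all assumptions made for illustration, and the SVM or BPNN stage that would be trained on the filtered data is not shown.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams (an assumed feature set)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_filter(samples, k=3):
    """Drop samples whose label disagrees with the majority label of
    their k most similar neighbours (a simple noise-removal heuristic)."""
    feats = [char_bigrams(text) for text, _ in samples]
    kept = []
    for i, (text, label) in enumerate(samples):
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(
            others, key=lambda j: cosine(feats[i], feats[j]), reverse=True
        )[:k]
        votes = Counter(samples[j][1] for j in nearest)
        if votes.most_common(1)[0][0] == label:
            kept.append((text, label))
    return kept


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In this sketch, `knn_filter` would be applied to the labelled training pages first, and the surviving `(text, label)` pairs would then be passed to whichever classifier is under evaluation; `precision_recall_f1` would be called once per language on the held-out predictions.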
first_indexed 2024-03-05T18:25:14Z
format Article
id utm.eprints-13184
institution Universiti Teknologi Malaysia - ePrints
last_indexed 2024-03-05T18:25:14Z
publishDate 2009
publisher Imperial College Press
record_format dspace
spelling utm.eprints-13184 2011-07-22T02:09:00Z http://eprints.utm.my/13184/ QA76 Computer software. Imperial College Press 2009. Article. PeerReviewed. Selamat, Ali and Subroto, I. M. I. and Ng, Choon Ching (2009) Arabic script web page language identification using hybrid-KNN method. International Journal of Computational Intelligence and Applications, 8 (3). pp. 315-343. ISSN 14690268 http://dx.doi.org/10.1142/S146902680900262X DOI: 10.1142/S146902680900262X
title Arabic script web page language identification using hybrid-KNN method
topic QA76 Computer software