Improving language identification of web page using optimum profile

Language is an indispensable tool for human communication, and presently, the language that dominates the Internet is English. Language identification is the process of determining a predetermined language automatically from a given content (e.g., English, Malay, Danish, Estonian, Czech, Slovak, etc...

Bibliographic Details
Main Authors: Ng, C. -C., Selamat, Ali
Format: Book Section
Published: Springer Berlin Heidelberg 2011
Subjects: QA Mathematics
_version_ 1796856507252342784
author Ng, C. -C.
Selamat, Ali
author_facet Ng, C. -C.
Selamat, Ali
author_sort Ng, C. -C.
collection ePrints
description Language is an indispensable tool for human communication, and at present the language that dominates the Internet is English. Language identification is the process of automatically determining which of a predetermined set of languages (e.g., English, Malay, Danish, Estonian, Czech, Slovak) a given piece of content is written in. The ability to identify other languages in relation to English is highly desirable, and the goal of this research is to improve the method used to achieve this. Three methods are studied: distance measurement, the Boolean method, and the proposed method, optimum profile. In initial experiments, we found that distance measurement and the Boolean method are not reliable for identifying the language of European web pages. We therefore propose optimum profile, which uses N-gram frequency and N-gram position to identify the language of a web page. The results show that the proposed method gives the highest performance, with an accuracy of 91.52%.
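The description mentions N-gram frequency profiles and distance measurement without giving details. As a rough illustration only, the sketch below builds character N-gram frequency profiles and compares them with a common rank-order ("out-of-place") distance. It is not the optimum profile method of the record, whose details are not given here; the function names, the `top_k` and `n_max` parameters, and the toy training sentences are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character n-grams (n = 1..n_max), most frequent first."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Descending frequency; ties broken alphabetically so ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place_distance(doc, lang_profiles[lang]),
    )


# Hypothetical toy training data; a real system would use far larger corpora.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and the cat is on the mat with the hat",
    "malay": "saya suka makan nasi goreng dan minum teh di kedai makan yang dekat dengan rumah saya",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

if __name__ == "__main__":
    print(identify("the quick brown fox", profiles))
    print(identify("saya suka makan nasi", profiles))
```

This sketch uses only N-gram frequency rank. A method that also uses N-gram position, as the description states, would need additional features that are not shown here.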
first_indexed 2024-03-05T18:43:59Z
format Book Section
id utm.eprints-29186
institution Universiti Teknologi Malaysia - ePrints
last_indexed 2024-03-05T18:43:59Z
publishDate 2011
publisher Springer Berlin Heidelberg
record_format dspace
spelling utm.eprints-29186 2017-02-04T08:39:25Z http://eprints.utm.my/29186/ Improving language identification of web page using optimum profile Ng, C. -C. Selamat, Ali QA Mathematics Language is an indispensable tool for human communication, and at present the language that dominates the Internet is English. Language identification is the process of automatically determining which of a predetermined set of languages (e.g., English, Malay, Danish, Estonian, Czech, Slovak) a given piece of content is written in. The ability to identify other languages in relation to English is highly desirable, and the goal of this research is to improve the method used to achieve this. Three methods are studied: distance measurement, the Boolean method, and the proposed method, optimum profile. In initial experiments, we found that distance measurement and the Boolean method are not reliable for identifying the language of European web pages. We therefore propose optimum profile, which uses N-gram frequency and N-gram position to identify the language of a web page. The results show that the proposed method gives the highest performance, with an accuracy of 91.52%. Springer Berlin Heidelberg 2011 Book Section PeerReviewed Ng, C. -C. and Selamat, Ali (2011) Improving language identification of web page using optimum profile. In: Software Engineering and Computer Systems: Second International Conference, ICSECS 2011, Kuantan, Pahang, Malaysia, June 27-29, 2011, Proceedings, Part II. Springer Berlin Heidelberg, Dordrecht, South Holland, pp. 157-166. ISBN 978-364222190-3 http://dx.doi.org/10.1007/978-3-642-22191-0_14 10.1007/978-3-642-22191-0_14
spellingShingle QA Mathematics
Ng, C. -C.
Selamat, Ali
Improving language identification of web page using optimum profile
title Improving language identification of web page using optimum profile
title_full Improving language identification of web page using optimum profile
title_fullStr Improving language identification of web page using optimum profile
title_full_unstemmed Improving language identification of web page using optimum profile
title_short Improving language identification of web page using optimum profile
title_sort improving language identification of web page using optimum profile
topic QA Mathematics
work_keys_str_mv AT ngcc improvinglanguageidentificationofwebpageusingoptimumprofile
AT selamatali improvinglanguageidentificationofwebpageusingoptimumprofile