Language identifications of Arabic script web documents using independent component analysis

We analyze the language identification algorithms used to identify the Arabic script web documents such as Arabic, Jawi, Persian and Urdu using independent component analysis (ICA). We have used a combination of Entropy term weighting scheme and class based feature (CPBF) vectors as feature selectio...

Bibliographic Details
Main Authors: Selamat, Ali; Lee, Zhi-Sam
Format: Book Section
Published: Institute of Electrical and Electronics Engineers 2008
Subjects:
author Selamat, Ali
Lee, Zhi-Sam
collection ePrints
description We analyze language identification algorithms for Arabic script web documents written in Arabic, Jawi, Persian, and Urdu using independent component analysis (ICA). We combine an entropy term weighting scheme with class based feature (CPBF) vectors as feature selection methods to choose the best features of Arabic script web documents for web page language identification. The selected features are then processed with singular value decomposition (SVD), based on the identification of latent semantics of user profiles. The SVD removes noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topics of the documents are independent of each other. We evaluate the effectiveness of the proposed algorithm with the information retrieval measures precision, recall, and F1. The experiments show that the proposed method can lead to good Arabic script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA.
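As an illustration of two steps named in the description, the sketch below shows an entropy term weighting and the precision, recall, and F1 measures. It is not the authors' implementation: the record does not give their formulas, so the weighting here assumes the common log-entropy formulation (local weight log(1 + tf), global weight 1 + sum(p log p) / log N), and the function names `entropy_weights` and `precision_recall_f1` and the language labels are hypothetical. The CPBF, SVD, and ICA steps are omitted.

```python
import math


def entropy_weights(docs):
    """Assumed log-entropy weighting (not taken from the paper).

    docs: list of dicts mapping term -> raw count in that document.
    Returns a list of dicts mapping term -> weighted value.
    """
    n_docs = len(docs)

    # Global frequency of each term over the whole collection.
    global_freq = {}
    for doc in docs:
        for term, count in doc.items():
            global_freq[term] = global_freq.get(term, 0) + count

    # Global weight: 1 for a term concentrated in one document,
    # 0 for a term spread evenly over all documents.
    global_weight = {}
    for term, total in global_freq.items():
        entropy_sum = 0.0
        for doc in docs:
            count = doc.get(term, 0)
            if count > 0:
                p = count / total
                entropy_sum += p * math.log(p)
        if n_docs > 1:
            global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)
        else:
            global_weight[term] = 1.0

    # Local weight log(1 + tf) scaled by the global weight.
    return [
        {term: math.log(1 + count) * global_weight[term]
         for term, count in doc.items()}
        for doc in docs
    ]


def precision_recall_f1(true_labels, predicted_labels, target):
    """Precision, recall, and F1 for one target language label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Under this assumed weighting, a term that occurs equally often in every document receives weight 0, while a term confined to a single document keeps its full local weight, which is why such a scheme can help separate languages that share a script.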
first_indexed 2024-03-05T18:23:48Z
format Book Section
id utm.eprints-12612
institution Universiti Teknologi Malaysia - ePrints
last_indexed 2024-03-05T18:23:48Z
publishDate 2008
publisher Institute of Electrical and Electronics Engineers
record_format dspace
spelling utm.eprints-12612 2011-06-14T05:11:15Z http://eprints.utm.my/12612/ Language identifications of Arabic script web documents using independent component analysis Selamat, Ali Lee, Zhi-Sam QA75 Electronic computers. Computer science Institute of Electrical and Electronics Engineers 2008 Book Section PeerReviewed Selamat, Ali and Lee, Zhi-Sam (2008) Language identifications of Arabic script web documents using independent component analysis. In: Proceedings - 2nd Asia International Conference on Modelling and Simulation, AMS 2008. Institute of Electrical and Electronics Engineers, New York, 427-432. ISBN 978-076953136-6 http://dx.doi.org/10.1109/AMS.2008.46 doi:10.1109/AMS.2008.46
title Language identifications of Arabic script web documents using independent component analysis
topic QA75 Electronic computers. Computer science