Toward the Development of Large-Scale Word Embedding for Low-Resourced Language

Word embedding is a key procedure in natural language processing for semantically and syntactically modeling an unlabeled text corpus: it represents the corpus's extracted features in a vector space, enabling NLP tasks such as summary generation, text simplification, next-sentence prediction, etc. ...


Bibliographic Details
Main Authors: Shahzad Nazir, Muhammad Asif, Shahbaz Ahmad Sahi, Shahbaz Ahmad, Yazeed Yasin Ghadi, Muhammad Haris Aziz
Format: Article
Language: English
Published: IEEE, 2022-01-01
Series: IEEE Access
Subjects: Word embedding; Urdu language; word vectors; word2vec; large-scale
Online Access: https://ieeexplore.ieee.org/document/9770772/
_version_ 1818552218337411072
author Shahzad Nazir
Muhammad Asif
Shahbaz Ahmad Sahi
Shahbaz Ahmad
Yazeed Yasin Ghadi
Muhammad Haris Aziz
author_facet Shahzad Nazir
Muhammad Asif
Shahbaz Ahmad Sahi
Shahbaz Ahmad
Yazeed Yasin Ghadi
Muhammad Haris Aziz
author_sort Shahzad Nazir
collection DOAJ
description Word embedding is a key procedure in natural language processing (NLP) for semantically and syntactically modeling an unlabeled text corpus: it represents the corpus's extracted features in a vector space, enabling NLP tasks such as summary generation, text simplification, and next-sentence prediction. Several word-embedding approaches exploit co-occurrence and word frequency, including matrix factorization, skip-gram, hierarchical-structure regularizers, and noise-contrastive estimation. These approaches have produced mature word vectors for most widely spoken languages; the Urdu language, however, with 231.3 million speakers, has received comparatively little attention from the research community. This paper focuses on creating Urdu word embeddings. To this end, we used a dataset covering different categories of news, such as business, sports, health, politics, entertainment, science, and world affairs. Tokenizing this dataset yielded 288 million tokens. For word-vector formation we used the skip-gram variant of the word2vec model, training embeddings with vector dimensions of 100, 128, 200, 256, 300, 400, 500, and 512. For evaluation, the WordSim-353 and LexSim-999 annotated datasets were used. The proposed work achieved a Spearman correlation coefficient of 0.66 on WordSim-353 and 0.439 on LexSim-999, improving on the state-of-the-art results it was compared against.
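The pipeline the abstract describes, extracting skip-gram (center, context) training pairs from a tokenized corpus and then scoring the learned vectors by Spearman correlation against human word-similarity ratings, can be sketched as follows. The English tokens, window size, and similarity scores below are illustrative assumptions, not the paper's 288-million-token News corpus or its actual results:

```python
def skipgram_pairs(tokens, window=2):
    """Pair each center word with every word in its symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def spearman(xs, ys):
    """Spearman's rho for tie-free lists: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    def ranks(vals):
        # Rank positions by value, smallest value gets rank 1.
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Training pairs that would be fed to a skip-gram model (toy sentence, window 1).
print(skipgram_pairs(["large", "scale", "urdu", "embedding"], window=1))

# Evaluation: model similarity scores vs. human ratings for three word pairs;
# identical rankings give a perfect correlation of 1.0.
print(spearman([0.9, 0.5, 0.1], [10.0, 6.0, 2.0]))  # → 1.0
```

In the paper's setting, `xs` would be cosine similarities of the trained Urdu word vectors and `ys` the annotator scores from WordSim-353 or LexSim-999; the closed-form rho used here is valid because such benchmark ratings rarely tie exactly.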
first_indexed 2024-12-12T09:10:19Z
format Article
id doaj.art-02e181df6a0946b985f6ca7e1e6ec432
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-12T09:10:19Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-02e181df6a0946b985f6ca7e1e6ec432 2022-12-22T00:29:32Z eng IEEE IEEE Access 2169-3536 2022-01-01 vol. 10, pp. 54091-54097 doi:10.1109/ACCESS.2022.3173259 article 9770772 Toward the Development of Large-Scale Word Embedding for Low-Resourced Language. Shahzad Nazir; Muhammad Asif (https://orcid.org/0000-0003-1839-2527); Shahbaz Ahmad Sahi (https://orcid.org/0000-0003-0148-4521); Shahbaz Ahmad; Yazeed Yasin Ghadi (https://orcid.org/0000-0002-7121-495X); Muhammad Haris Aziz (https://orcid.org/0000-0001-9584-0093). Affiliations: Department of Computer Science, National Textile University, Faisalabad, Pakistan (Nazir, Asif, Sahi, Ahmad); Department of Computer Science/Software Engineering, Al Ain University, Al Ain, United Arab Emirates (Ghadi); Mechanical Engineering Department, University of Sargodha, Sargodha, Pakistan (Aziz). https://ieeexplore.ieee.org/document/9770772/ Word embedding; Urdu language; word vectors; word2vec; large-scale
spellingShingle Shahzad Nazir
Muhammad Asif
Shahbaz Ahmad Sahi
Shahbaz Ahmad
Yazeed Yasin Ghadi
Muhammad Haris Aziz
Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
IEEE Access
Word embedding
Urdu language
word vectors
word2vec
large-scale
title Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
title_full Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
title_fullStr Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
title_full_unstemmed Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
title_short Toward the Development of Large-Scale Word Embedding for Low-Resourced Language
title_sort toward the development of large scale word embedding for low resourced language
topic Word embedding
Urdu language
word vectors
word2vec
large-scale
url https://ieeexplore.ieee.org/document/9770772/
work_keys_str_mv AT shahzadnazir towardthedevelopmentoflargescalewordembeddingforlowresourcedlanguage
AT muhammadasif towardthedevelopmentoflargescalewordembeddingforlowresourcedlanguage
AT shahbazahmadsahi towardthedevelopmentoflargescalewordembeddingforlowresourcedlanguage
AT shahbazahmad towardthedevelopmentoflargescalewordembeddingforlowresourcedlanguage
AT yazeedyasinghadi towardthedevelopmentoflargescalewordembeddingforlowresourcedlanguage
AT muhammadharisaziz towardthedevelopmentoflargescalewordembeddingforlowresourcedlanguage