Learned Text Representation for Amharic Information Retrieval and Natural Language Processing


Bibliographic Details
Main Authors: Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
Format: Article
Language: English
Published: MDPI AG, 2023-03-01
Series: Information
Subjects: word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification
Online Access: https://www.mdpi.com/2078-2489/14/3/195
_version_ 1797611139623813120
author Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
author_facet Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
author_sort Tilahun Yeshambel
collection DOAJ
description Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into the essence of learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, namely word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
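
To make the query expansion technique mentioned in the abstract concrete, below is a minimal sketch of embedding-based query expansion using gensim's fastText implementation. The toy corpus, hyperparameter values, and English tokens are illustrative assumptions only; they do not reflect the paper's actual Amharic corpora, model settings, or results.

    # Minimal sketch of query expansion with fastText embeddings.
    # All data and parameters here are illustrative assumptions, not
    # the paper's Amharic word-, stem-, or root-based setup.
    from gensim.models import FastText

    # Toy tokenized corpus standing in for an Amharic IR collection.
    corpus = [
        ["information", "retrieval", "evaluation"],
        ["query", "expansion", "with", "word", "embeddings"],
        ["amharic", "text", "retrieval", "query", "terms"],
    ]

    # fastText learns subword (character n-gram) vectors, which suits
    # morphologically rich languages such as Amharic.
    model = FastText(sentences=corpus, vector_size=50, window=3,
                     min_count=1, epochs=20)

    def expand_query(terms, topn=2):
        # Append the topn nearest-neighbor terms of each query word.
        expanded = list(terms)
        for term in terms:
            expanded.extend(w for w, _ in model.wv.most_similar(term, topn=topn))
        return expanded

    print(expand_query(["query", "retrieval"]))
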
first_indexed 2024-03-11T06:24:35Z
format Article
id doaj.art-ef4bf60cec9a4545aa757c1d8cb52bca
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-11T06:24:35Z
publishDate 2023-03-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-ef4bf60cec9a4545aa757c1d8cb52bca
  record updated: 2023-11-17T11:44:24Z
  language: eng
  publisher: MDPI AG
  journal: Information (ISSN 2078-2489)
  published: 2023-03-01, vol. 14, no. 3, article 195
  doi: 10.3390/info14030195
  title: Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
  authors and affiliations:
    Tilahun Yeshambel, IT Doctoral Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
    Josiane Mothe, Composante INSPE, IRIT, UMR5505 CNRS, Université de Toulouse Jean-Jaurès, 118 Rte de Narbonne, F-31400 Toulouse, France
    Yaregal Assabie, Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
  abstract: Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into the essence of learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, namely word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
  url: https://www.mdpi.com/2078-2489/14/3/195
  keywords: word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification
spellingShingle Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
Information
word embeddings
BERT
pre-trained Amharic BERT model
query expansion
learning text representation
text classification
title Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_full Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_fullStr Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_full_unstemmed Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_short Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_sort learned text representation for amharic information retrieval and natural language processing
topic word embeddings
BERT
pre-trained Amharic BERT model
query expansion
learning text representation
text classification
url https://www.mdpi.com/2078-2489/14/3/195
work_keys_str_mv AT tilahunyeshambel learnedtextrepresentationforamharicinformationretrievalandnaturallanguageprocessing
AT josianemothe learnedtextrepresentationforamharicinformationretrievalandnaturallanguageprocessing
AT yaregalassabie learnedtextrepresentationforamharicinformationretrievalandnaturallanguageprocessing