Learned Text Representation for Amharic Information Retrieval and Natural Language Processing


Bibliographic Details
Main Authors: Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
Format: Article
Language: English
Published: MDPI AG, 2023-03-01
Series: Information
Subjects: word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification
Online Access: https://www.mdpi.com/2078-2489/14/3/195
_version_ 1797611139623813120
author Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
author_facet Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
author_sort Tilahun Yeshambel
collection DOAJ
description Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into the essence of learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, namely word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
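
To make the query expansion technique mentioned in the abstract concrete, below is a minimal sketch of embedding-based query expansion using gensim's fastText implementation. The toy corpus, hyperparameter values, and English tokens are illustrative assumptions only; they do not reflect the paper's actual Amharic corpora, model settings, or results.

    # Minimal sketch of query expansion with fastText embeddings.
    # All data and parameters here are illustrative assumptions, not
    # the paper's Amharic word-, stem-, or root-based setup.
    from gensim.models import FastText

    # Toy tokenized corpus standing in for an Amharic IR collection.
    corpus = [
        ["information", "retrieval", "evaluation"],
        ["query", "expansion", "with", "word", "embeddings"],
        ["amharic", "text", "retrieval", "query", "terms"],
    ]

    # fastText learns subword (character n-gram) vectors, which suits
    # morphologically rich languages such as Amharic.
    model = FastText(sentences=corpus, vector_size=50, window=3,
                     min_count=1, epochs=20)

    def expand_query(terms, topn=2):
        # Append the topn nearest-neighbor terms of each query word.
        expanded = list(terms)
        for term in terms:
            expanded.extend(w for w, _ in model.wv.most_similar(term, topn=topn))
        return expanded

    print(expand_query(["query", "retrieval"]))
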
first_indexed 2024-03-11T06:24:35Z
format Article
id doaj.art-ef4bf60cec9a4545aa757c1d8cb52bca
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-11T06:24:35Z
publishDate 2023-03-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-ef4bf60cec9a4545aa757c1d8cb52bca
  record updated: 2023-11-17T11:44:24Z
  language: eng
  publisher: MDPI AG
  journal: Information (ISSN 2078-2489)
  published: 2023-03-01, vol. 14, no. 3, article 195
  doi: 10.3390/info14030195
  title: Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
  authors and affiliations:
    Tilahun Yeshambel, IT Doctoral Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
    Josiane Mothe, Composante INSPE, IRIT, UMR5505 CNRS, Université de Toulouse Jean-Jaurès, 118 Rte de Narbonne, F-31400 Toulouse, France
    Yaregal Assabie, Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
  abstract: Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into the essence of learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, namely word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
  url: https://www.mdpi.com/2078-2489/14/3/195
  keywords: word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification
spellingShingle Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
Information
word embeddings
BERT
pre-trained Amharic BERT model
query expansion
learning text representation
text classification
title Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_full Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_fullStr Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_full_unstemmed Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_short Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
title_sort learned text representation for amharic information retrieval and natural language processing
topic word embeddings
BERT
pre-trained Amharic BERT model
query expansion
learning text representation
text classification
url https://www.mdpi.com/2078-2489/14/3/195
work_keys_str_mv AT tilahunyeshambel learnedtextrepresentationforamharicinformationretrievalandnaturallanguageprocessing
AT josianemothe learnedtextrepresentationforamharicinformationretrievalandnaturallanguageprocessing
AT yaregalassabie learnedtextrepresentationforamharicinformationretrievalandnaturallanguageprocessing