Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
Main Authors: | Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
---|---
Format: | Article
Language: | English
Published: | MDPI AG, 2023-03-01
Series: | Information
Subjects: | word embeddings, BERT, pre-trained Amharic BERT model, query expansion, learning text representation, text classification
Online Access: | https://www.mdpi.com/2078-2489/14/3/195
_version_ | 1797611139623813120 |
author | Tilahun Yeshambel; Josiane Mothe; Yaregal Assabie
collection | DOAJ |
description | Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions for learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, a morphologically complex and under-resourced language for which usable pre-trained models for automatic text processing are not available. This paper investigates learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods (word2vec, GloVe, and fastText) as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embeddings on the word-based corpus.
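As context for the query expansion experiments described in the abstract, the sketch below illustrates embedding-based query expansion with fastText: each query term is expanded with its nearest neighbours in the embedding space. This is a minimal sketch assuming gensim's FastText implementation; the toy corpus, query term, and hyperparameters are illustrative placeholders, not the paper's actual data or settings.

```python
# Minimal sketch of embedding-based query expansion (not the paper's exact setup).
# Assumes gensim is installed; the toy corpus and hyperparameters are illustrative.
from gensim.models import FastText

# Toy tokenized corpus; in practice this would be a word-based Amharic corpus.
corpus = [
    ["ኢትዮጵያ", "ዜና", "መንግስት"],
    ["ዜና", "ስፖርት", "ኳስ"],
    ["መንግስት", "ኢኮኖሚ", "ዜና"],
]

# fastText's character n-gram vectors help with out-of-vocabulary and
# morphologically rich surface forms, which is why it suits Amharic word-based text.
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20)

def expand_query(terms, topn=3):
    """Return the original query terms plus their nearest embedding-space neighbours."""
    expanded = list(terms)
    for term in terms:
        for neighbour, _similarity in model.wv.most_similar(term, topn=topn):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

print(expand_query(["ዜና"]))  # original term plus its nearest neighbours in the toy corpus
```

The expanded terms could then be appended to the query before retrieval; the paper evaluates this kind of expansion on word-based, stem-based, and root-based Amharic test collections.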
first_indexed | 2024-03-11T06:24:35Z |
format | Article |
id | doaj.art-ef4bf60cec9a4545aa757c1d8cb52bca |
institution | Directory Open Access Journal |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-11T06:24:35Z |
publishDate | 2023-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
doi | 10.3390/info14030195
citation | Information, vol. 14, no. 3, article 195, published 2023-03-01
author_affiliations | Tilahun Yeshambel: IT Doctoral Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia; Josiane Mothe: Composante INSPE, IRIT, UMR5505 CNRS, Université de Toulouse Jean-Jaurès, 118 Rte de Narbonne, F31400 Toulouse, France; Yaregal Assabie: Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
title | Learned Text Representation for Amharic Information Retrieval and Natural Language Processing |
topic | word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification
url | https://www.mdpi.com/2078-2489/14/3/195 |