Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features

Information retrieval (IR) is one of the most important research and development areas due to the explosion of digital data and the need to access relevant information from huge corpora. Although IR systems function well for technologically advanced languages such as English, this is not the case...

Bibliographic Details
Main Authors: Tilahun Yeshambel, Josiane Mothe, Yaregal Assabie
Format: Article
Language: English
Published: MDPI AG 2022-01-01
Series: Applied Sciences
Subjects: information retrieval, <i>adhoc</i> retrieval, Amharic, complex morphology, corpus, resources
Online Access: https://www.mdpi.com/2076-3417/12/3/1294
_version_ 1827661648270196736
author Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
author_facet Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
author_sort Tilahun Yeshambel
collection DOAJ
description Information retrieval (IR) is one of the most important research and development areas due to the explosion of digital data and the need to access relevant information from huge corpora. Although IR systems function well for technologically advanced languages such as English, this is not the case for morphologically complex, under-resourced and less-studied languages such as Amharic. Amharic is a Semitic language characterized by a complex morphology in which thousands of words are generated from a single root form through inflection and derivation. This has made the development of Amharic natural language processing (NLP) tools a challenging task. Amharic <i>adhoc</i> retrieval also faces challenges due to the scarcity of linguistic resources, tools and standard evaluation corpora. In this research work, we investigate the impact of morphological features on the representation of Amharic documents and queries for <i>adhoc</i> retrieval. We also analyze the effects of stem-based and root-based text representation, and propose a new Amharic IR system architecture. Moreover, we present the resources and corpora we constructed for the evaluation of Amharic IR systems and other NLP tools. We conduct various experiments with a TREC-like approach for an Amharic IR test collection using a standard evaluation framework and measures. Our findings show that root-based text representation outperforms the conventional stem-based representation on Amharic IR.
first_indexed 2024-03-10T00:14:09Z
format Article
id doaj.art-8b02074b2fd44cf29adb06e481c768a1
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T00:14:09Z
publishDate 2022-01-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-8b02074b2fd44cf29adb06e481c768a1 2023-11-23T15:55:17Z eng MDPI AG Applied Sciences 2076-3417 2022-01-01 12 3 1294 10.3390/app12031294
Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
Tilahun Yeshambel (0): IT PhD Program, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
Josiane Mothe (1): Université Jean-Jaurès, Université de Toulouse, Componsante INSPE, IRIT, UMR5505 CNRS, 118 Rte de Narbonne, F31400 Toulouse, France
Yaregal Assabie (2): Department of Computer Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
https://www.mdpi.com/2076-3417/12/3/1294
information retrieval; <i>adhoc</i> retrieval; Amharic; complex morphology; corpus; resources
spellingShingle Tilahun Yeshambel
Josiane Mothe
Yaregal Assabie
Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
Applied Sciences
information retrieval
<i>adhoc</i> retrieval
Amharic
complex morphology
corpus
resources
title Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
title_full Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
title_fullStr Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
title_full_unstemmed Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
title_short Amharic <i>Adhoc</i> Information Retrieval System Based on Morphological Features
title_sort amharic i adhoc i information retrieval system based on morphological features
topic information retrieval
<i>adhoc</i> retrieval
Amharic
complex morphology
corpus
resources
url https://www.mdpi.com/2076-3417/12/3/1294
work_keys_str_mv AT tilahunyeshambel amhariciadhociinformationretrievalsystembasedonmorphologicalfeatures
AT josianemothe amhariciadhociinformationretrievalsystembasedonmorphologicalfeatures
AT yaregalassabie amhariciadhociinformationretrievalsystembasedonmorphologicalfeatures