Grammar-aware phrase dataset generated using a novel python package

Manual dataset preparation was time-consuming and labor-intensive. Web scraping was another approach to data acquisition, but such tools also introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that ac...

Bibliographic Details
Main Authors: Ebisa A. Gemechu, G.R. Kanagachidambaresan
Format: Article
Language: English
Published: Elsevier 2023-06-01
Series: Data in Brief
Subjects: Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340923003566
author Ebisa A. Gemechu
G.R. Kanagachidambaresan
collection DOAJ
description Manual dataset preparation was time-consuming and labor-intensive. Web scraping was another approach to data acquisition, but such tools also introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical features such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach the language's grammatical structures. The method can readily be reproduced for other languages with a systematic analysis and slight modifications to the affix structures in the algorithm.
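The three stages named in the description (collect root verbs into a list, derive stems, attach affixes alongside personal pronouns) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Oromo-grammar package's code or API: the function names, the infinitive-suffix heuristic, and the pronoun/suffix table are all hypothetical placeholders, and the table ignores the sound changes a real rule set would need to handle.

```python
import re

# Hypothetical, simplified rule tables: placeholders for illustration only,
# not the affix rules used by the Oromo-grammar package.
INFINITIVE_SUFFIX = "uu"
PRONOUN_SUFFIXES = {
    "ani": "e",
    "ati": "te",
    "inni": "e",
    "isheen": "te",
    "nuti": "ne",
    "isin": "tan",
    "isaan": "an",
}


def extract_root_verbs(text):
    """Return unique candidate verbs (words ending in the assumed
    infinitive suffix), in order of first appearance."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    verbs = []
    for word in words:
        is_candidate = (
            word.endswith(INFINITIVE_SUFFIX)
            and len(word) > len(INFINITIVE_SUFFIX)
        )
        if is_candidate and word not in verbs:
            verbs.append(word)
    return verbs


def to_stem(verb):
    """Strip the assumed infinitive suffix to obtain a stem."""
    return verb[: -len(INFINITIVE_SUFFIX)]


def generate_phrases(stems):
    """Pair every stem with every pronoun and its placeholder suffix."""
    return [
        f"{pronoun} {stem}{suffix}"
        for stem in stems
        for pronoun, suffix in PRONOUN_SUFFIXES.items()
    ]


if __name__ == "__main__":
    raw_text = "Deemuu fi dhufuu. deemuu"
    root_verbs = extract_root_verbs(raw_text)
    stems = [to_stem(verb) for verb in root_verbs]
    for phrase in generate_phrases(stems):
        print(phrase)
```

Keeping the affix table as plain data separate from the functions reflects the description's remark that adapting the method to another language mainly means changing the affix structures.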
first_indexed 2024-03-13T03:58:29Z
format Article
id doaj.art-d7ce44302f3745f6b050eb819173c5f7
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-03-13T03:58:29Z
publishDate 2023-06-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-d7ce44302f3745f6b050eb819173c5f7; 2023-06-22T05:04:05Z; eng; Elsevier; Data in Brief; 2352-3409; 2023-06-01; 48; 109237; Grammar-aware phrase dataset generated using a novel python package; Ebisa A. Gemechu (Corresponding author; Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India); G.R. Kanagachidambaresan (Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India); http://www.sciencedirect.com/science/article/pii/S2352340923003566; Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb
title Grammar-aware phrase dataset generated using a novel python package
topic Oromo-grammar
Verb extraction
Machine translation
Grammar-aware
Oromo verb
url http://www.sciencedirect.com/science/article/pii/S2352340923003566