Grammar-aware phrase dataset generated using a novel python package
Manual dataset preparation is time-consuming and labor-intensive. Web scraping, another approach to data acquisition, tends to introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that ac...
Main Authors: | Ebisa A. Gemechu, G.R. Kanagachidambaresan |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2023-06-01 |
Series: | Data in Brief |
Subjects: | Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340923003566 |
_version_ | 1797798034964217856 |
---|---|
author | Ebisa A. Gemechu; G.R. Kanagachidambaresan |
author_facet | Ebisa A. Gemechu; G.R. Kanagachidambaresan |
author_sort | Ebisa A. Gemechu |
collection | DOAJ |
description | Manual dataset preparation is time-consuming and labor-intensive. Web scraping, another approach to data acquisition, tends to introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical elements such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach grammatical structures. The method can be reproduced for other languages through a systematic analysis and slight modifications to the affix structures in the algorithm. |
first_indexed | 2024-03-13T03:58:29Z |
format | Article |
id | doaj.art-d7ce44302f3745f6b050eb819173c5f7 |
institution | Directory Open Access Journal |
issn | 2352-3409 |
language | English |
last_indexed | 2024-03-13T03:58:29Z |
publishDate | 2023-06-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-d7ce44302f3745f6b050eb819173c5f7 2023-06-22T05:04:05Z eng Elsevier Data in Brief 2352-3409 2023-06-01 48 109237 Grammar-aware phrase dataset generated using a novel python package Ebisa A. Gemechu 0 G.R. Kanagachidambaresan 1 Corresponding author.; Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India Manual dataset preparation is time-consuming and labor-intensive. Web scraping, another approach to data acquisition, tends to introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical elements such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach grammatical structures. The method can be reproduced for other languages through a systematic analysis and slight modifications to the affix structures in the algorithm. http://www.sciencedirect.com/science/article/pii/S2352340923003566 Oromo-grammar Verb extraction Machine translation Grammar-aware Oromo verb |
spellingShingle | Ebisa A. Gemechu G.R. Kanagachidambaresan Grammar-aware phrase dataset generated using a novel python package Data in Brief Oromo-grammar Verb extraction Machine translation Grammar-aware Oromo verb |
title | Grammar-aware phrase dataset generated using a novel python package |
title_full | Grammar-aware phrase dataset generated using a novel python package |
title_fullStr | Grammar-aware phrase dataset generated using a novel python package |
title_full_unstemmed | Grammar-aware phrase dataset generated using a novel python package |
title_short | Grammar-aware phrase dataset generated using a novel python package |
title_sort | grammar aware phrase dataset generated using a novel python package |
topic | Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb |
url | http://www.sciencedirect.com/science/article/pii/S2352340923003566 |
work_keys_str_mv | AT ebisaagemechu grammarawarephrasedatasetgeneratedusinganovelpythonpackage AT grkanagachidambaresan grammarawarephrasedatasetgeneratedusinganovelpythonpackage |
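
The pipeline outlined in the description (extract root verbs from raw text, derive stems, then combine stems with affixes and personal pronouns) can be illustrated with a minimal sketch. This is not the API of the Oromo-grammar package: the function names, the heuristic of treating tokens ending in "uu" as infinitive verbs, and the small present-tense suffix table are all assumptions made for illustration, and real Oromo morphology involves further alternations that this sketch ignores.

```python
import re

# Hypothetical, simplified table: personal pronoun -> present-tense suffix.
# Illustrative only; not taken from the Oromo-grammar package.
PRONOUN_SUFFIXES = {
    "ani": "a",
    "ati": "ta",
    "inni": "a",
    "isheen": "ti",
    "nuti": "na",
    "isin": "tu",
    "isaan": "u",
}


def extract_root_verbs(text):
    """Return unique candidate verbs in order of first appearance.

    Assumed heuristic: a token ending in 'uu' is an infinitive verb.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    verbs = []
    for token in tokens:
        if token.endswith("uu") and len(token) > 3 and token not in verbs:
            verbs.append(token)
    return verbs


def to_stem(verb):
    """Strip the assumed infinitive marker 'uu' to obtain the stem."""
    return verb[:-2]


def generate_phrases(stems):
    """Pair every stem with every pronoun and attach the matching suffix."""
    return [
        f"{pronoun} {stem}{suffix}"
        for stem in stems
        for pronoun, suffix in PRONOUN_SUFFIXES.items()
    ]


if __name__ == "__main__":
    verbs = extract_root_verbs("Deemuu fi dhufuu")
    stems = [to_stem(v) for v in verbs]
    for phrase in generate_phrases(stems):
        print(phrase)
```

Keeping the pronoun and suffix pairs in a single table reflects the record's remark that the method can be adapted to other languages by modifying the affix structures: under this sketch's assumptions, only the table and the extraction heuristic would need to change.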