Grammar-aware phrase dataset generated using a novel python package

Manual dataset preparation is time-consuming and labor-intensive. Web scraping, an alternative approach to data acquisition, tends to introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that ac...


Bibliographic Details
Main Authors:	Ebisa A. Gemechu, G.R. Kanagachidambaresan
Format:	Article
Language:	English
Published:	Elsevier 2023-06-01
Series:	Data in Brief
Subjects:	Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340923003566

author	Ebisa A. Gemechu, G.R. Kanagachidambaresan
collection	DOAJ
description	Manual dataset preparation is time-consuming and labor-intensive. Web scraping, an alternative approach to data acquisition, tends to introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical features such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach grammatical structures. The method can readily be reproduced for other languages with a systematic analysis and slight modifications to the affix structures in the algorithm.
first_indexed	2024-03-13T03:58:29Z
format	Article
id	doaj.art-d7ce44302f3745f6b050eb819173c5f7
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2024-03-13T03:58:29Z
publishDate	2023-06-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-d7ce44302f3745f6b050eb819173c5f7; 2023-06-22T05:04:05Z; eng; Elsevier; Data in Brief; ISSN 2352-3409; 2023-06-01; vol. 48; art. 109237; Grammar-aware phrase dataset generated using a novel python package; Ebisa A. Gemechu (corresponding author; Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India); G.R. Kanagachidambaresan (Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India); http://www.sciencedirect.com/science/article/pii/S2352340923003566; Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb
title	Grammar-aware phrase dataset generated using a novel python package
topic	Oromo-grammar; Verb extraction; Machine translation; Grammar-aware; Oromo verb
url	http://www.sciencedirect.com/science/article/pii/S2352340923003566
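
The three-step pipeline outlined in the description (root-verb extraction, stem formation, phrase synthesis) might look roughly like the sketch below. This is not the Oromo-grammar package's API: the function names, the "-uu" infinitive heuristic, and the small past-tense suffix table are all hypothetical simplifications chosen for illustration, and real Oromo morphology involves sound changes that this toy version ignores.

```python
import re

# Hypothetical, simplified suffix table (an assumption for illustration,
# not taken from the Oromo-grammar package): pronoun -> past-tense suffix.
PAST_SUFFIXES = {
    "ani": "e",      # I
    "ati": "te",     # you (singular)
    "inni": "e",     # he
    "isheen": "te",  # she
    "nuti": "ne",    # we
    "isin": "tan",   # you (plural)
    "isaan": "an",   # they
}


def extract_root_verbs(text):
    """Collect unique words ending in '-uu', in order of first appearance.

    Treating every '-uu' word as a verb is a crude assumption; a real
    extractor would need a lexicon or further morphological checks.
    """
    roots = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word.endswith("uu") and len(word) > 3 and word not in roots:
            roots.append(word)
    return roots


def to_stem(root_verb):
    """Strip the assumed '-uu' ending to obtain the stem."""
    return root_verb[:-2]


def synthesize_phrases(stems):
    """Pair every stem with every pronoun and attach the matching suffix."""
    return [
        f"{pronoun} {stem}{suffix}"
        for stem in stems
        for pronoun, suffix in PAST_SUFFIXES.items()
    ]


sample_text = "Inni deemuu barbaada. Isheen dhufuu dandeessi. Deemuu gaarii dha."
root_verbs = extract_root_verbs(sample_text)
stems = [to_stem(verb) for verb in root_verbs]
phrases = synthesize_phrases(stems)
```

Keeping the affix rules in a plain dictionary reflects the description's remark that adapting the method to another language mainly means changing the affix structures; under that assumption only `PAST_SUFFIXES` and the extraction heuristic would need replacing.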


Similar Items