An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus

Bibliographic Details
Main Authors: Hebah Elgibreen, Mohammed Faisal, Mansour Al Sulaiman, Sherif Abdou, Mohamed Amine Mekhtiche, Abdullah M. Moussa, Yousef A. Alohali, Wadood Abdul, Ghulam Muhammad, Mohsen Rashwan, Mohammed Algabri
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Saudi dialect, corpus, natural language processing, data preprocessing
Online Access: https://ieeexplore.ieee.org/document/9458313/
collection DOAJ
description Due to rapid developments in technology and the sudden expansion of social media use, Dialect Arabic has become an important source of data that needs to be addressed when building Arabic corpora. In this paper, thirty-three Arabic corpora are surveyed to show that, despite the developments in the literature, Saudi dialect (SD) corpora still need further expansion. This paper contributes to the literature on SD corpora by creating the largest Saudi corpus, the King Saud University Saudi Corpus (KSUSC), with more than 1B total words, including more than 119M SD words. The KSUSC is not only the newest and largest SD corpus but is also diverse, covering 26 domains in text collected from five different sources. This paper also contributes a new incremental preprocessing system that creates relevant lexicons, which are then used to clean and normalize the collected data. This incremental system is scalable and can be adapted to different resources and dialects. Moreover, the collection process for building the KSUSC is discussed in detail, and the challenges of collecting SD text on each platform are highlighted. Finally, design criteria are proposed and applied to the KSUSC, leading to the conclusion that the resulting corpus can be of great benefit to researchers who are interested in integrating the corpus with their own work or using its resulting lexicons in Saudi-based NLP tasks.
first_indexed 2024-12-17T03:46:45Z
format Article
id doaj.art-82c7253562704e71bcb170607187366d
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-17T03:46:45Z
publishDate 2021-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-82c7253562704e71bcb170607187366d, 2022-12-21T22:04:52Z, eng, IEEE, IEEE Access, ISSN 2169-3536, 2021-01-01, vol. 9, pp. 88405-88428, DOI 10.1109/ACCESS.2021.3089924, document 9458313
An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus
Hebah Elgibreen (0), https://orcid.org/0000-0002-3764-6169, Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Mohammed Faisal (1), https://orcid.org/0000-0001-7720-0076, Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Mansour Al Sulaiman (2), https://orcid.org/0000-0003-2866-184X, Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Sherif Abdou (3), Department of Information Technology, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Mohamed Amine Mekhtiche (4), https://orcid.org/0000-0001-9478-9206, Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Abdullah M. Moussa (5), https://orcid.org/0000-0001-7556-9267, Department of Information Technology, Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Yousef A. Alohali (6), Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Wadood Abdul (7), Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Ghulam Muhammad (8), https://orcid.org/0000-0002-9781-3969, Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
Mohsen Rashwan (9), Department of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, Giza, Egypt
Mohammed Algabri (10), https://orcid.org/0000-0001-7962-8121, Center of Smart Robotics Research, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
https://ieeexplore.ieee.org/document/9458313/
Saudi dialect, corpus, natural language processing, data preprocessing
title An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus
topic Saudi dialect
corpus
natural language processing
data preprocessing
url https://ieeexplore.ieee.org/document/9458313/