Transfer learning and multi-phase training for accurate diacritization of Arabic poetry

Bibliographic Details
Main Authors: Gheith A. Abandah, Ashraf E. Suyyagh, Mohammad R. Abdel-Majeed
Format: Article
Language: English
Published: Elsevier 2022-06-01
Series: Journal of King Saud University: Computer and Information Sciences
Online Access: http://www.sciencedirect.com/science/article/pii/S1319157822001227
Description
Summary: Most Arabic poetry is undiacritized or only partially diacritized (written without short vowels). Diacritizing Arabic poetry would allow people of various ages and language mastery levels to read and chant it easily and properly. Moreover, diacritizing a poetry verse is an essential step in analyzing it for classification and evaluation. Unfortunately, the available automatic poetry diacritization solutions are inaccurate. Diacritizing Arabic poetry is a difficult task for people and machines alike because Arabic has numerous complex diacritization rules, and Arabic poetry adds special cases and rich, vibrant compositions. Deep machine learning could deliver the desired diacritization solution, provided that adequate training datasets are available. Unfortunately, the available datasets are insufficient, and new ones are expensive to develop. In this paper, we propose solutions to improve the automatic diacritization of Arabic poetry using deep machine learning. We mitigate the difficulty of diacritizing Arabic poetry verses by employing transfer learning to leverage pattern features from a pretrained classification model. We also overcome the training dataset deficiency by training the composite diacritization model in multiple phases on carefully selected sub-datasets. Compared with the best-known previous results, the proposed solutions reduce the diacritization error rate from 6.08% to 3.54% (a 42% relative improvement).
ISSN: 1319-1578
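
The 42% figure quoted in the summary appears to be a relative reduction in the error rate, not a difference in percentage points (the absolute drop is 2.54 points). The short Python sketch below reproduces that arithmetic from the two error rates given in the summary. The function name `relative_improvement` is illustrative only and does not come from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the fraction by which the error rate dropped relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Error rates quoted in the summary, in percent.
baseline = 6.08   # best-known previous result
proposed = 3.54   # result reported for the proposed solutions

absolute_drop = baseline - proposed              # percentage points
relative_drop = relative_improvement(baseline, proposed)

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"relative improvement: {relative_drop:.0%}")
```

Reporting both numbers avoids ambiguity: the absolute drop shows how many fewer errors occur per 100 diacritics, while the relative figure shows how much of the previous error was eliminated.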