Arabic Diacritization Using Bidirectional Long Short-Term Memory Neural Networks With Conditional Random Fields

Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine transla...

Bibliographic Details
Main Authors:	Abdulmohsen Al-Thubaity, Atheer Alkhalifa, Abdulrahman Almuhareb, Waleed Alsanie
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Arabic diacritic restoration; bi-directional long short-term memory; computational linguistics; conditional random fields; deep learning; neural network
Online Access:	https://ieeexplore.ieee.org/document/9174712/

author	Abdulmohsen Al-Thubaity; Atheer Alkhalifa; Abdulrahman Almuhareb; Waleed Alsanie
author_sort	Abdulmohsen Al-Thubaity
collection	DOAJ
description	Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine translation, and part-of-speech tagging. This article discusses the use of bidirectional long short-term memory neural networks with conditional random fields for Arabic diacritization. This approach requires no morphological analyzers, dictionary, or feature engineering, but rather uses a sequence-to-sequence schema. The input is a sequence of characters that constitute the sentence, and the output consists of the corresponding diacritic(s) for each character in that sentence. The performance of the proposed approach was examined using four datasets with different sizes and genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). For training, 60% of the sentences were randomly selected from each dataset, 20% were selected for validation, and 20% were selected for testing. The trained models achieved diacritic error rates of 3.41%, 1.34%, 1.57%, and 2.13% and word error rates of 14.46%, 4.92%, 5.65%, and 8.43% on the KACST TTS, Holy Quran, Sahih Al-Bukhary, and ATB datasets, respectively. Comparison of the proposed method with those used in other studies and existing systems revealed that its results are comparable to or better than those of the state-of-the-art methods.
first_indexed	2024-12-16T07:00:15Z
format	Article
id	doaj.art-45a642c1dfb3491e94bf8a6701c801e2
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-12-16T07:00:15Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-45a642c1dfb3491e94bf8a6701c801e2; 2022-12-21T22:40:10Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; 8; 154984; 154996; 10.1109/ACCESS.2020.3018885; 9174712; Arabic Diacritization Using Bidirectional Long Short-Term Memory Neural Networks With Conditional Random Fields; Abdulmohsen Al-Thubaity [0] https://orcid.org/0000-0003-2376-0849; Atheer Alkhalifa [1] https://orcid.org/0000-0002-6576-0735; Abdulrahman Almuhareb [2] https://orcid.org/0000-0001-5053-6530; Waleed Alsanie [3] https://orcid.org/0000-0002-8525-4645; National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia; National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia; National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia; National Center for Data Analytics and Artificial Intelligence, KACST, Riyadh, Saudi Arabia; Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine translation, and part-of-speech tagging. This article discusses the use of bidirectional long short-term memory neural networks with conditional random fields for Arabic diacritization. This approach requires no morphological analyzers, dictionary, or feature engineering, but rather uses a sequence-to-sequence schema. The input is a sequence of characters that constitute the sentence, and the output consists of the corresponding diacritic(s) for each character in that sentence. The performance of the proposed approach was examined using four datasets with different sizes and genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). For training, 60% of the sentences were randomly selected from each dataset, 20% were selected for validation, and 20% were selected for testing. The trained models achieved diacritic error rates of 3.41%, 1.34%, 1.57%, and 2.13% and word error rates of 14.46%, 4.92%, 5.65%, and 8.43% on the KACST TTS, Holy Quran, Sahih Al-Bukhary, and ATB datasets, respectively. Comparison of the proposed method with those used in other studies and existing systems revealed that its results are comparable to or better than those of the state-of-the-art methods.; https://ieeexplore.ieee.org/document/9174712/; Arabic diacritic restoration; bi-directional long short-term memory; computational linguistics; conditional random fields; deep learning; neural network
title	Arabic Diacritization Using Bidirectional Long Short-Term Memory Neural Networks With Conditional Random Fields
topic	Arabic diacritic restoration; bi-directional long short-term memory; computational linguistics; conditional random fields; deep learning; neural network
url	https://ieeexplore.ieee.org/document/9174712/
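
The description frames diacritization as labeling each character with its diacritic(s) and reports diacritic error rate (DER) and word error rate (WER). The record does not define either metric, so the sketch below assumes common definitions: DER is the share of non-whitespace characters whose diacritic label is wrong, and WER is the share of words containing at least one such character. The function names `split_labels` and `error_rates` are hypothetical, and this is not the authors' evaluation code; published results often use other counting conventions (for example, excluding case endings or non-Arabic characters).

```python
import unicodedata

# Arabic diacritic marks from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_labels(text):
    """Split diacritized text into base characters and per-character labels.

    labels[i] holds the diacritic(s) attached to chars[i], or "" if none.
    NFC normalization puts stacked marks (e.g. shadda + fatha) in one order.
    """
    chars, labels = [], []
    for ch in unicodedata.normalize("NFC", text):
        if ch in DIACRITICS:
            if chars:  # a mark with no preceding base character is dropped
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


def error_rates(gold, pred):
    """Return (DER, WER) under the assumptions stated above."""
    g_chars, g_labels = split_labels(gold)
    p_chars, p_labels = split_labels(pred)
    if g_chars != p_chars:
        raise ValueError("gold and predicted text differ in base characters")

    char_total = char_errors = word_total = word_errors = 0
    in_word = word_wrong = False
    # A trailing space acts as a sentinel that closes the last word.
    for ch, g, p in zip(g_chars + [" "], g_labels + [""], p_labels + [""]):
        if ch.isspace():
            if in_word:
                word_total += 1
                word_errors += int(word_wrong)
            in_word = word_wrong = False
            continue
        in_word = True
        char_total += 1
        if g != p:
            char_errors += 1
            word_wrong = True
    return char_errors / (char_total or 1), word_errors / (word_total or 1)
```

Calling `error_rates(gold, pred)` on a reference sentence and a system output with identical base characters returns the two rates as fractions; multiply by 100 to compare against percentages such as those quoted in the description.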

Similar Items