Simplification of Arabic text: A hybrid approach integrating machine translation and transformer-based lexical model

Bibliographic Details
Main Authors: Suha S. Al-Thanyyan, Aqil M. Azmi (corresponding author)
Affiliation: Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Format: Article
Language: English
Published: Elsevier, 2023-09-01
Series: Journal of King Saud University: Computer and Information Sciences, Volume 35, Issue 8, Article 101662
ISSN: 1319-1578
Subjects: Text simplification; Arabic text simplification; Lexical simplification; Neural machine translation; Transformers; Arabic corpora
Online Access: http://www.sciencedirect.com/science/article/pii/S1319157823002161
Description: The process of text simplification (TS) is crucial for enhancing the comprehension of written material, especially for people with low literacy levels and those who struggle to understand written content. In this study, we introduce the first automated approach to TS that combines word-level and sentence-level simplification techniques for Arabic text. We employ three models: a neural machine translation model, an Arabic-BERT-based lexical model, and a hybrid model that combines both methods to simplify the text. To evaluate the models, we created and utilized two Arabic datasets, namely EW-SEW and WikiLarge, comprising 82,585 and 249 sentence pairs, respectively. As resources were scarce, we made these datasets available to other researchers. The EW-SEW dataset is a commonly used English TS corpus that aligns each sentence in the original English Wikipedia (EW) with a simpler reference sentence from Simple English Wikipedia (SEW). In contrast, the WikiLarge dataset has eight simplified reference sentences for each of the 249 test sentences. The hybrid model outperformed the other models, achieving a BLEU score of 55.68, a SARI score of 37.15, and an FBERT score of 86.7% on the WikiLarge dataset, demonstrating the effectiveness of the combined approach.
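
Note: The abstract refers to an Arabic-BERT-based lexical model working alongside the neural machine translation component. As a rough illustration of how a masked-language-model substitution step of this kind is commonly set up (not the authors' actual implementation), the following Python sketch uses the Hugging Face transformers library; the checkpoint name asafaya/bert-base-arabic and the candidate-filtering heuristic are assumptions made here for illustration only.

# A minimal sketch, assuming the `transformers` library and the publicly
# available `asafaya/bert-base-arabic` checkpoint (commonly called Arabic-BERT).
# Illustrative only; this is not the paper's implementation.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="asafaya/bert-base-arabic")

def candidate_substitutes(sentence: str, complex_word: str, top_k: int = 10):
    """Mask the complex word and let the model propose in-context replacements."""
    masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token, 1)
    predictions = fill_mask(masked, top_k=top_k)
    # Discard the original word; a full lexical-simplification system would rank
    # the remaining candidates by a simplicity signal such as corpus frequency.
    return [p["token_str"] for p in predictions if p["token_str"] != complex_word]

The sentence-level and hybrid components described in the abstract operate on top of such a substitution step, and the reported metrics (BLEU, SARI, FBERT) are standard corpus-level scores available in libraries such as sacrebleu, EASSE, and bert-score.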