Machine Translation Utilizing the Frequent-Item Set Concept

In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules that join phrases in the source language with phrases in the target language. It has to be noted that...

Bibliographic Details
Main Authors:	Hanan A. Hosni Mahmoud, Hanan Abdullah Mengash
Format:	Article
Language:	English
Published:	MDPI AG 2021-02-01
Series:	Sensors
Subjects:	machine translation; frequent-item set; bilingual corpus; BLEU score
Online Access:	https://www.mdpi.com/1424-8220/21/4/1493

_version_	1797395686413565952
author	Hanan A. Hosni Mahmoud; Hanan Abdullah Mengash
author_facet	Hanan A. Hosni Mahmoud; Hanan Abdullah Mengash
author_sort	Hanan A. Hosni Mahmoud
collection	DOAJ
description	In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules that join phrases in the source language with phrases in the target language. It has to be noted that a sequential scan of the corpus for such phrases will increase the response time in an unexpected manner. We introduce pre-processing of the bilingual corpus by proposing a data structure called the Corpus-Trie (CT), which renders a bilingual parallel corpus in a compact form representing frequent data item sets. We also present algorithms that use the CT to respond to translation requests, and we explore novel techniques in exhaustive experiments. Experiments were performed on specific language pairs, although the proposed method is not restricted to any specific language. Moreover, the proposed Corpus-Trie can be extended from bilingual corpora to accommodate multi-language corpora. Experiments indicated that the response time of a translation request is logarithmic in the number of distinct phrases in the original bilingual corpus (and thus in the Corpus-Trie size). In practical situations, 5–20% of the log of the number of nodes has to be visited. The experimental results indicate that the BLEU score of the proposed CT system increases with the number of phrases in the CT, for both English-Arabic and English-French translation. The proposed CT system outperformed both Omega-T and Apertium in translation quality for corpus sizes exceeding 1,600,000 phrases for English-Arabic translation and 300,000 phrases for English-French translation.
first_indexed	2024-03-09T00:39:06Z
format	Article
id	doaj.art-12507ceb5a3648e99645b5be21b84637
institution	Directory of Open Access Journals
issn	1424-8220
language	English
last_indexed	2024-03-09T00:39:06Z
publishDate	2021-02-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-12507ceb5a3648e99645b5be21b84637 | 2023-12-11T17:54:48Z | eng | MDPI AG | Sensors | 1424-8220 | 2021-02-01 | 21 | 4 | 1493 | 10.3390/s21041493 | Machine Translation Utilizing the Frequent-Item Set Concept | Hanan A. Hosni Mahmoud [0]; Hanan Abdullah Mengash [1] | [0] Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh P.O. Box 11671, Saudi Arabia | [1] Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, Riyadh P.O. Box 11671, Saudi Arabia | https://www.mdpi.com/1424-8220/21/4/1493 | machine translation; frequent-item set; bilingual corpus; BLEU score
spellingShingle	Hanan A. Hosni Mahmoud Hanan Abdullah Mengash Machine Translation Utilizing the Frequent-Item Set Concept Sensors machine translation frequent-item set bilingual corpus BLEU score
title	Machine Translation Utilizing the Frequent-Item Set Concept
title_full	Machine Translation Utilizing the Frequent-Item Set Concept
title_fullStr	Machine Translation Utilizing the Frequent-Item Set Concept
title_full_unstemmed	Machine Translation Utilizing the Frequent-Item Set Concept
title_short	Machine Translation Utilizing the Frequent-Item Set Concept
title_sort	machine translation utilizing the frequent item set concept
topic	machine translation; frequent-item set; bilingual corpus; BLEU score
url	https://www.mdpi.com/1424-8220/21/4/1493
work_keys_str_mv	AT hananahosnimahmoud machinetranslationutilizingthefrequentitemsetconcept AT hananabdullahmengash machinetranslationutilizingthefrequentitemsetconcept
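
The description in this record outlines pre-processing a bilingual corpus into a trie so that a translation request does not require a sequential scan. The record does not give the Corpus-Trie layout or the authors' algorithms, so the following is only a minimal sketch of the general idea under stated assumptions: a word-level trie keyed on source phrases, with target phrases and their occurrence counts stored at the nodes, and a greedy longest-match lookup. The names `TrieNode`, `PhraseTrie`, `add`, `longest_match`, and `translate`, as well as the toy phrase pairs, are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Next source word -> child node.
    children: dict = field(default_factory=dict)
    # Target phrases seen for the source phrase ending here, with counts.
    targets: Counter = field(default_factory=Counter)


class PhraseTrie:
    """Hypothetical word-level phrase trie (an assumption, not the paper's CT)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, source, target):
        """Record one aligned (source phrase, target phrase) pair."""
        node = self.root
        for word in source.lower().split():
            node = node.children.setdefault(word, TrieNode())
        node.targets[target] += 1

    def longest_match(self, words, start):
        """Return (end, target) for the longest stored phrase at `start`.

        Returns (start, None) when no stored phrase begins at `start`.
        """
        node = self.root
        best_end, best = start, None
        for i in range(start, len(words)):
            node = node.children.get(words[i])
            if node is None:
                break
            if node.targets:
                # Prefer the most frequently observed target phrase.
                best_end, best = i + 1, node.targets.most_common(1)[0][0]
        return best_end, best

    def translate(self, sentence):
        """Greedy left-to-right lookup; unknown words pass through unchanged."""
        words = sentence.lower().split()
        out, i = [], 0
        while i < len(words):
            end, target = self.longest_match(words, i)
            if target is None:
                out.append(words[i])
                i += 1
            else:
                out.append(target)
                i = end
        return " ".join(out)


# Toy data, invented for illustration only.
trie = PhraseTrie()
trie.add("good morning", "bonjour")
trie.add("good", "bon")
trie.add("thank you", "merci")
trie.add("thank you", "merci")
trie.add("thank you", "je vous remercie")
print(trie.translate("Good morning thank you friend"))
```

In this sketch each lookup walks at most one node per word of the matched phrase, so its cost depends on phrase length rather than on the number of stored phrase pairs. It does not reproduce the response-time or BLEU measurements reported in the description.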
