Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition

Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries.

Bibliographic Details
Main Authors:	Rina Buoy, Masakazu Iwamura, Sovila Srun, Koichi Kise
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Khmer script; non-Latin scripts; character stacking; no explicit word boundaries; text recognition; image chunking
Online Access:	https://ieeexplore.ieee.org/document/10316307/

collection	DOAJ
description	Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries. In addition, from a natural language processing (NLP) perspective, most non-Latin languages are considered low-resource due to the scarcity of large-scale data. This paper presents a convolutional Transformer-based text recognition method for low-resource non-Latin scripts, which uses local two-dimensional (2D) feature maps. The proposed method can handle images of arbitrarily long textlines, which may occur with non-Latin writing without explicit word boundaries, without resizing them to a fixed size by using an improved image chunking and merging strategy. It has a low time complexity in self-attention layers and allows efficient training. The Khmer script is used as the representative of non-Latin scripts because it shares many features with other non-Latin scripts, which makes the construction of an optical character recognition (OCR) method for Khmer as hard as that for other non-Latin scripts. Thus, by analogy with the AI-complete concept, a Khmer OCR method can be considered as one of the non-Latin-complete methods and can be used as a low-resource non-Latin baseline method. The proposed 2D method was trained on synthetic datasets and outperformed the baseline models on both synthetic and real datasets. Fine-tuning experiments using Khmer handwritten palm leaf manuscripts and other non-Latin scripts demonstrated the feasibility of transfer learning from the Khmer OCR method. To contribute to the low-resource language community, the training and evaluation datasets will be made publicly available.
first_indexed	2024-03-10T14:13:11Z
id	doaj.art-ed9b51c91500439c9b57783f8d92cef4
institution	Directory of Open Access Journals
issn	2169-3536
last_indexed	2024-03-10T14:13:11Z
spelling	doaj.art-ed9b51c91500439c9b57783f8d92cef4; 2023-11-21T00:01:28Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11; pp. 128044-128060; DOI 10.1109/ACCESS.2023.3332361; IEEE document 10316307
	Rina Buoy (https://orcid.org/0000-0002-6960-4262), Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
	Masakazu Iwamura (https://orcid.org/0000-0003-2508-2869), Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
	Sovila Srun, Department of Information Technology Engineering, Faculty of Engineering, Royal University of Phnom Penh, Phnom Penh, Cambodia
	Koichi Kise, Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
