Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition

Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit wor...

Bibliographic Details
Main Authors: Rina Buoy, Masakazu Iwamura, Sovila Srun, Koichi Kise
Format: Article
Language: English
Published: IEEE 2023-01-01
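Series: IEEE Access
Subjects: Khmer script; non-Latin scripts; character stacking; no explicit word boundaries; text recognition; image chunking
Online Access: https://ieeexplore.ieee.org/document/10316307/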
author Rina Buoy
Masakazu Iwamura
Sovila Srun
Koichi Kise
collection DOAJ
description Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries. In addition, from a natural language processing (NLP) perspective, most non-Latin languages are considered low-resource due to the scarcity of large-scale data. This paper presents a convolutional Transformer-based text recognition method for low-resource non-Latin scripts, which uses local two-dimensional (2D) feature maps. The proposed method can handle images of arbitrarily long textlines, which may occur with non-Latin writing without explicit word boundaries, without resizing them to a fixed size by using an improved image chunking and merging strategy. It has a low time complexity in self-attention layers and allows efficient training. The Khmer script is used as the representative of non-Latin scripts because it shares many features with other non-Latin scripts, which makes the construction of an optical character recognition (OCR) method for Khmer as hard as that for other non-Latin scripts. Thus, by analogy with the AI-complete concept, a Khmer OCR method can be considered as one of the non-Latin-complete methods and can be used as a low-resource non-Latin baseline method. The proposed 2D method was trained on synthetic datasets and outperformed the baseline models on both synthetic and real datasets. Fine-tuning experiments using Khmer handwritten palm leaf manuscripts and other non-Latin scripts demonstrated the feasibility of transfer learning from the Khmer OCR method. To contribute to the low-resource language community, the training and evaluation datasets will be made publicly available.
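The chunking idea named in the description can be illustrated with a minimal sketch. The abstract does not state the chunk width, the overlap, or the merging rule, so everything here is an assumption: `chunk_image`, `merge_chunks`, `chunk_width`, and `overlap` are hypothetical names, and the merge works on raw pixel columns only to show that a long textline can be split without resizing and reassembled without loss. It is not the authors' improved strategy.

```python
import numpy as np


def chunk_image(image, chunk_width, overlap):
    """Split an (H, W) textline image into fixed-width chunks along the width.

    Hypothetical helper: neighbouring chunks share `overlap` columns and the
    last chunk is zero-padded on the right, so the image is never resized.
    """
    if not 0 <= overlap < chunk_width:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_width")
    height, width = image.shape
    stride = chunk_width - overlap
    # Ceiling division: number of chunks needed to cover every column.
    n_chunks = max(1, -(-(width - overlap) // stride))
    padded_width = (n_chunks - 1) * stride + chunk_width
    padded = np.zeros((height, padded_width), dtype=image.dtype)
    padded[:, :width] = image
    return [padded[:, i * stride:i * stride + chunk_width] for i in range(n_chunks)]


def merge_chunks(chunks, overlap, width):
    """Reassemble chunks: keep the first whole, drop the shared leading
    columns of every later chunk, then trim the right-hand padding."""
    parts = [chunks[0]] + [chunk[:, overlap:] for chunk in chunks[1:]]
    return np.concatenate(parts, axis=1)[:, :width]


# A toy 3 x 11 "textline" whose width is not a multiple of the stride.
image = np.arange(33).reshape(3, 11)
chunks = chunk_image(image, chunk_width=4, overlap=2)
restored = merge_chunks(chunks, overlap=2, width=image.shape[1])
```

In a recognition pipeline the merge would more likely act on per-chunk feature maps or decoded text than on pixels. A fixed chunk width also bounds the sequence length that any per-chunk self-attention would see, which is one plausible (unconfirmed) reading of the low time complexity mentioned in the description.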
first_indexed 2024-03-10T14:13:11Z
format Article
id doaj.art-ed9b51c91500439c9b57783f8d92cef4
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-10T14:13:11Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-ed9b51c91500439c9b57783f8d92cef4 (2023-11-21T00:01:28Z)
doi 10.1109/ACCESS.2023.3332361
volume 11
pages 128044-128060
author_orcid Rina Buoy: https://orcid.org/0000-0002-6960-4262
Masakazu Iwamura: https://orcid.org/0000-0003-2508-2869
author_affiliation Rina Buoy: Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
Masakazu Iwamura: Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
Sovila Srun: Department of Information Technology Engineering, Faculty of Engineering, Royal University of Phnom Penh, Phnom Penh, Cambodia
Koichi Kise: Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan
title Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition
topic Khmer script
non-Latin scripts
character stacking
no explicit word boundaries
text recognition
image chunking
url https://ieeexplore.ieee.org/document/10316307/