Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition
Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit wor...
Main Authors: | Rina Buoy, Masakazu Iwamura, Sovila Srun, Koichi Kise |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
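Subjects: | Khmer script; non-Latin scripts; character stacking; no explicit word boundaries; text recognition; image chunking |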
Online Access: | https://ieeexplore.ieee.org/document/10316307/ |
_version_ | 1827700511523995648 |
---|---|
author | Rina Buoy; Masakazu Iwamura; Sovila Srun; Koichi Kise |
author_facet | Rina Buoy; Masakazu Iwamura; Sovila Srun; Koichi Kise |
author_sort | Rina Buoy |
collection | DOAJ |
description | Many existing text recognition methods rely on the structure of Latin characters and words. Such methods may not be able to deal with non-Latin scripts that have highly complex features, such as character stacking, diacritics, ligatures, non-uniform character widths, and writing without explicit word boundaries. In addition, from a natural language processing (NLP) perspective, most non-Latin languages are considered low-resource due to the scarcity of large-scale data. This paper presents a convolutional Transformer-based text recognition method for low-resource non-Latin scripts, which uses local two-dimensional (2D) feature maps. The proposed method can handle images of arbitrarily long textlines, which may occur with non-Latin writing without explicit word boundaries, without resizing them to a fixed size by using an improved image chunking and merging strategy. It has a low time complexity in self-attention layers and allows efficient training. The Khmer script is used as the representative of non-Latin scripts because it shares many features with other non-Latin scripts, which makes the construction of an optical character recognition (OCR) method for Khmer as hard as that for other non-Latin scripts. Thus, by analogy with the AI-complete concept, a Khmer OCR method can be considered as one of the non-Latin-complete methods and can be used as a low-resource non-Latin baseline method. The proposed 2D method was trained on synthetic datasets and outperformed the baseline models on both synthetic and real datasets. Fine-tuning experiments using Khmer handwritten palm leaf manuscripts and other non-Latin scripts demonstrated the feasibility of transfer learning from the Khmer OCR method. To contribute to the low-resource language community, the training and evaluation datasets will be made publicly available. |
first_indexed | 2024-03-10T14:13:11Z |
format | Article |
id | doaj.art-ed9b51c91500439c9b57783f8d92cef4 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-10T14:13:11Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-ed9b51c91500439c9b57783f8d92cef4; 2023-11-21T00:01:28Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11; pp. 128044-128060; DOI 10.1109/ACCESS.2023.3332361; document 10316307; Rina Buoy (https://orcid.org/0000-0002-6960-4262), Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan; Masakazu Iwamura (https://orcid.org/0000-0003-2508-2869), Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan; Sovila Srun, Department of Information Technology Engineering, Faculty of Engineering, Royal University of Phnom Penh, Phnom Penh, Cambodia; Koichi Kise, Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Sakai, Osaka, Japan |
title | Toward a Low-Resource Non-Latin-Complete Baseline: An Exploration of Khmer Optical Character Recognition |
topic | Khmer script; non-Latin scripts; character stacking; no explicit word boundaries; text recognition; image chunking |
url | https://ieeexplore.ieee.org/document/10316307/ |