Multi-Lingual Optical Character Recognition System Using the Reinforcement Learning of Character Segmenter

In this article, we present a new multi-lingual Optical Character Recognition (OCR) system for scanned documents. For Latin characters, current open-source systems such as Tesseract provide very high accuracy. However, the accuracy on multi-lingual documents, including Asian character...

Bibliographic Details
Main Authors:	Jaewoo Park, Eunji Lee, Yoonsik Kim, Isaac Kang, Hyung Il Koo, Nam Ik Cho
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Deep learning; document analysis; optical character recognition
Online Access:	https://ieeexplore.ieee.org/document/9203882/

_version_	1818873537458339840
author	Jaewoo Park; Eunji Lee; Yoonsik Kim; Isaac Kang; Hyung Il Koo; Nam Ik Cho
author_sort	Jaewoo Park
collection	DOAJ
description	In this article, we present a new multi-lingual Optical Character Recognition (OCR) system for scanned documents. For Latin characters, current open-source systems such as Tesseract provide very high accuracy. However, accuracy on multi-lingual documents that include Asian characters is usually lower than on Latin-only documents. For example, when a document mixes English, Chinese, and/or Korean characters, OCR accuracy is lower than for English-only documents because the character and text properties of Chinese and Korean are quite different from those of Latin-type characters. To tackle these problems, we propose a new framework that uses three neural blocks (a segmenter, a switcher, and multiple recognizers) and reinforcement learning of the segmenter: the segmenter partitions a given word image into multiple character images, the switcher assigns a recognizer to each sub-image, and the recognizers recognize their assigned sub-images. Training the recognizers and the switcher can be treated as traditional image classification tasks, so we train them with supervised learning. However, supervised learning of the segmenter has two critical drawbacks: its objective function is sub-optimal, and its training requires a large amount of annotation effort. Thus, by adopting the REINFORCE algorithm, we train the segmenter to optimize the overall performance, i.e., we minimize the edit distance of the final recognition results. Experimental results show that the proposed method significantly improves performance on multi-lingual scripts and large-character-set languages without using character boundary labels.
first_indexed	2024-12-19T12:56:17Z
format	Article
id	doaj.art-e6acd7633bd847f2b5a4f89032f161ec
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-19T12:56:17Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-e6acd7633bd847f2b5a4f89032f161ec; 2022-12-21T20:20:22Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8, pp. 174437-174448; DOI 10.1109/ACCESS.2020.3025769; article no. 9203882
	Jaewoo Park (https://orcid.org/0000-0002-6816-4381), Department of Electrical and Computer Engineering, INMC, Seoul National University, Seoul, South Korea
	Eunji Lee (https://orcid.org/0000-0002-7991-0618), Department of Electrical and Computer Engineering, INMC, Seoul National University, Seoul, South Korea
	Yoonsik Kim (https://orcid.org/0000-0001-8023-8278), Department of Electrical and Computer Engineering, INMC, Seoul National University, Seoul, South Korea
	Isaac Kang, Department of Electrical and Computer Engineering, INMC, Seoul National University, Seoul, South Korea
	Hyung Il Koo (https://orcid.org/0000-0002-6955-8083), Department of Electrical and Computer Engineering, Ajou University, Suwon, South Korea
	Nam Ik Cho (https://orcid.org/0000-0001-5297-4649), Department of Electrical and Computer Engineering, INMC, Seoul National University, Seoul, South Korea
title	Multi-Lingual Optical Character Recognition System Using the Reinforcement Learning of Character Segmenter
topic	Deep learning; document analysis; optical character recognition
url	https://ieeexplore.ieee.org/document/9203882/

Similar Items