An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text

Different types of OCR errors often occur in OCR texts due to the low quality of scanned document images or limitations in OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates for OCR errors are generated and explored in their neighbor...

Bibliographic Details
Main Authors: Quoc-Dung Nguyen, Nguyet-Minh Phan, Pavel Kromer, Duc-Anh Le
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: OCR; character edit; error correction; attention-based encoder-decoder; hill climbing
Online Access: https://ieeexplore.ieee.org/document/10144767/
author Quoc-Dung Nguyen
Nguyet-Minh Phan
Pavel Kromer
Duc-Anh Le
collection DOAJ
description Different types of OCR errors often occur in OCR texts due to the low quality of scanned document images or limitations in OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates for OCR errors are generated and explored in their neighborhoods using correction character edits controlled by an adapted hill-climbing algorithm. Correction characters are extracted from only original ground truth texts, which do not depend on OCR texts in training data. A weighted objective function used to score and rank correction candidates is heuristically tested to find optimal weight combinations. The proposed model is evaluated on an OCR text dataset originating from the Vietnamese handwritten database in the ICFHR 2018 Vietnamese online handwritten text recognition competition. The proposed model is also verified concerning its stability and complexity. The experimental results show that our model achieves competitive performance compared to the other models in the ICFHR 2018 competition.
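The search described in the abstract (candidates generated by character edits, explored by hill climbing, ranked by a weighted objective) can be illustrated with a minimal sketch. This is not the authors' implementation: the character set, the toy lexicon, the two scoring terms, and the weights `w_freq` and `w_sim` are all assumptions made for illustration, and the example uses unaccented ASCII tokens rather than real Vietnamese text.

```python
from difflib import SequenceMatcher

# Hypothetical toy data. In the paper, correction characters are extracted
# from ground-truth texts; here they are hard-coded for illustration.
CORRECTION_CHARS = "aeghinotv"
WORD_FREQ = {"viet": 50, "nam": 40, "van": 30, "ban": 20}  # toy lexicon counts


def neighbors(word):
    """Return all strings one character edit away from `word`."""
    out = set()
    for i in range(len(word) + 1):
        for c in CORRECTION_CHARS:
            out.add(word[:i] + c + word[i:])          # insertion
        if i < len(word):
            out.add(word[:i] + word[i + 1:])          # deletion
            for c in CORRECTION_CHARS:
                out.add(word[:i] + c + word[i + 1:])  # substitution
    out.discard(word)
    out.discard("")
    return out


def score(candidate, original, w_freq=0.7, w_sim=0.3):
    """Assumed weighted objective: lexicon frequency plus similarity to the OCR token."""
    freq = WORD_FREQ.get(candidate, 0) / max(WORD_FREQ.values())
    sim = SequenceMatcher(None, candidate, original).ratio()
    return w_freq * freq + w_sim * sim


def hill_climb(ocr_token, max_steps=5):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = ocr_token
    for _ in range(max_steps):
        # sorted() makes tie-breaking between equal scores deterministic
        best = max(sorted(neighbors(current)), key=lambda c: score(c, ocr_token))
        if score(best, ocr_token) <= score(current, ocr_token):
            break  # local optimum reached
        current = best
    return current


print(hill_climb("vjet"))
```

With this toy lexicon, `hill_climb("vjet")` moves to `viet` in one substitution and then stops, while a token already in the lexicon such as `nam` is left unchanged. Plain hill climbing can stall at a local optimum when no single edit improves the score, which is presumably one reason the paper adapts the algorithm.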
first_indexed 2024-03-13T03:58:46Z
format Article
id doaj.art-2a9f5237186b4ff39c107edce93b7d99
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-13T03:58:46Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-2a9f5237186b4ff39c107edce93b7d99; 2023-06-21T23:00:30Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 58406-58421; DOI 10.1109/ACCESS.2023.3283340; IEEE document 10144767
An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
Quoc-Dung Nguyen (https://orcid.org/0000-0003-1580-9032), Faculty of Mechanical-Electrical and Computer Engineering, School of Technology, Van Lang University, Ho Chi Minh City, Vietnam
Nguyet-Minh Phan, Faculty of Information Technology, Saigon University, Ho Chi Minh City, Vietnam
Pavel Kromer (https://orcid.org/0000-0001-8428-3332), Department of Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic
Duc-Anh Le (https://orcid.org/0000-0002-9359-9686), The Institute of Statistical Mathematics, Tokyo, Japan
https://ieeexplore.ieee.org/document/10144767/
Keywords: OCR; character edit; error correction; attention-based encoder-decoder; hill climbing
title An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
topic OCR
character edit
error correction
attention-based encoder-decoder
hill climbing
url https://ieeexplore.ieee.org/document/10144767/