An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
Different types of OCR errors often occur in OCR texts due to the low quality of scanned document images or limitations in OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates for OCR errors are generated and explored in their neighbor...
Main Authors: | Quoc-Dung Nguyen, Nguyet-Minh Phan, Pavel Kromer, Duc-Anh Le |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | OCR; character edit; error correction; attention-based encoder-decoder; hill climbing |
Online Access: | https://ieeexplore.ieee.org/document/10144767/ |
_version_ | 1797798122740514816 |
---|---|
author | Quoc-Dung Nguyen; Nguyet-Minh Phan; Pavel Kromer; Duc-Anh Le |
author_sort | Quoc-Dung Nguyen |
collection | DOAJ |
description | Different types of OCR errors often occur in OCR texts due to the low quality of scanned document images or limitations in OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates for OCR errors are generated and explored in their neighborhoods using correction character edits controlled by an adapted hill-climbing algorithm. Correction characters are extracted from only original ground truth texts, which do not depend on OCR texts in training data. A weighted objective function used to score and rank correction candidates is heuristically tested to find optimal weight combinations. The proposed model is evaluated on an OCR text dataset originating from the Vietnamese handwritten database in the ICFHR 2018 Vietnamese online handwritten text recognition competition. The proposed model is also verified concerning its stability and complexity. The experimental results show that our model achieves competitive performance compared to the other models in the ICFHR 2018 competition. |
first_indexed | 2024-03-13T03:58:46Z |
format | Article |
id | doaj.art-2a9f5237186b4ff39c107edce93b7d99 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-13T03:58:46Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-2a9f5237186b4ff39c107edce93b7d99; 2023-06-21T23:00:30Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 58406-58421; DOI 10.1109/ACCESS.2023.3283340; IEEE Xplore document 10144767; An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text; Quoc-Dung Nguyen (https://orcid.org/0000-0003-1580-9032; Faculty of Mechanical-Electrical and Computer Engineering, School of Technology, Van Lang University, Ho Chi Minh City, Vietnam); Nguyet-Minh Phan (Faculty of Information Technology, Saigon University, Ho Chi Minh City, Vietnam); Pavel Kromer (https://orcid.org/0000-0001-8428-3332; Department of Computer Science, VSB–Technical University of Ostrava, Ostrava, Czech Republic); Duc-Anh Le (https://orcid.org/0000-0002-9359-9686; The Institute of Statistical Mathematics, Tokyo, Japan); https://ieeexplore.ieee.org/document/10144767/; keywords: OCR, character edit, error correction, attention-based encoder-decoder, hill climbing |
title | An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text |
topic | OCR; character edit; error correction; attention-based encoder-decoder; hill climbing |
url | https://ieeexplore.ieee.org/document/10144767/ |
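The description above mentions correction candidates explored in their neighborhoods through character edits, guided by hill climbing and ranked by a weighted objective function. The following is a minimal Python sketch of that general idea only, not the authors' implementation. The toy lexicon, the two score terms, the weights `w_lex` and `w_sim`, and all function names are assumptions made for illustration; the paper's actual correction characters, objective function, and weights are not given in this record.

```python
# Hypothetical stand-in for vocabulary and characters taken from ground-truth text.
LEXICON = ["correction", "character", "error"]
ALPHABET = sorted(set("".join(LEXICON)))


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def neighbors(word):
    """All distinct strings exactly one character edit away from word."""
    out = set()
    for i in range(len(word) + 1):
        for c in ALPHABET:
            out.add(word[:i] + c + word[i:])          # insertion
        if i < len(word):
            out.add(word[:i] + word[i + 1:])          # deletion
            for c in ALPHABET:
                out.add(word[:i] + c + word[i + 1:])  # substitution
    out.discard(word)
    return out


def score(candidate, ocr_token, w_lex=0.7, w_sim=0.3):
    """Assumed weighted objective: stay near the lexicon and near the OCR token."""
    lex = min(edit_distance(candidate, w) for w in LEXICON)
    sim = edit_distance(candidate, ocr_token)
    return -(w_lex * lex + w_sim * sim)


def hill_climb(ocr_token, max_steps=10):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current, current_score = ocr_token, score(ocr_token, ocr_token)
    for _ in range(max_steps):
        # Ties are broken by string order so the result is deterministic.
        best = max(neighbors(current), key=lambda c: (score(c, ocr_token), c))
        best_score = score(best, ocr_token)
        if best_score <= current_score:
            break
        current, current_score = best, best_score
    return current


if __name__ == "__main__":
    for token in ["corection", "chracten", "error"]:
        print(token, "->", hill_climb(token))
```

In this sketch the search stops at the first local optimum, which is the defining behavior of plain hill climbing; changing the assumed weights shifts the balance between trusting the lexicon and staying close to the OCR output.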