Two bigrams based language model for auto correction of Arabic OCR errors

In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a two bi-grams based language model that uses Wikipedia's database is presented. The method can perform auto detection and correction of non-word errors in Arabic OCR...

Bibliographic Details
Main Authors: Habeeb, Imad Q., Mohd Yusof, Shahrul Azmi, Ahmad, Faudziah
Format: Article
Language: English
Published: AICIT, Korea 2014
Subjects: QA76 Computer software
Online Access: https://repo.uum.edu.my/id/eprint/12602/1/JDCTA3630PPL.pdf
_version_ 1825803035621720064
author Habeeb, Imad Q.
Mohd Yusof, Shahrul Azmi
Ahmad, Faudziah
author_facet Habeeb, Imad Q.
Mohd Yusof, Shahrul Azmi
Ahmad, Faudziah
author_sort Habeeb, Imad Q.
collection UUM
description In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a two bi-grams based language model that uses Wikipedia's database is presented. The method can perform auto detection and correction of non-word errors in Arabic OCR text, and auto detection of real-word errors. The method consists of two parts: extracting the context information from Wikipedia's database, and implementing the auto detection and correction of incorrect words. This method can be applied to any language with little modification. The experimental results show successful extraction of context information from Wikipedia's articles. Furthermore, they show that using this method can reduce the error rate of Arabic OCR text.
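The record does not give the paper's algorithm, so the following is only a minimal sketch of the general idea in the description: collect word and bigram counts from a corpus, then replace each out-of-vocabulary (non-word) token with the candidate best supported by the bigram to its left and the bigram to its right. Reading "two bigrams" as left-plus-right context is an assumption, as are the toy English sentences standing in for Wikipedia text, the one-edit candidate rule, and every function name.

```python
from collections import Counter


def build_model(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    vocab = Counter()
    bigrams = Counter()
    for tokens in sentences:
        vocab.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return vocab, bigrams


def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]  # one extra character in b


def correct(tokens, vocab, bigrams):
    """Replace out-of-vocabulary tokens using left and right bigram counts."""
    out = list(tokens)
    for i, word in enumerate(out):
        if word in vocab:
            continue  # this sketch handles non-word errors only
        candidates = [v for v in vocab if within_one_edit(word, v)]
        if not candidates:
            continue  # nothing close enough; leave the token unchanged

        def score(c):
            left = bigrams[(out[i - 1], c)] if i > 0 else 0
            right = bigrams[(c, out[i + 1])] if i + 1 < len(out) else 0
            # Context support first, overall word frequency as tie-breaker.
            return (left + right, vocab[c])

        out[i] = max(candidates, key=score)
    return out


# Hypothetical stand-in corpus; the paper uses Wikipedia's database.
sentences = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
    "a car drove on the road".split(),
]
vocab, bigrams = build_model(sentences)
print(correct("the cax sat on the mat".split(), vocab, bigrams))
print(correct("a cax drove on the road".split(), vocab, bigrams))
```

The same misread token "cax" is resolved differently in the two inputs because its neighbours differ, which is the point of using context rather than word frequency alone. Detecting real-word errors, which the description also mentions, is not covered by this sketch.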
first_indexed 2024-07-04T05:50:20Z
format Article
id uum-12602
institution Universiti Utara Malaysia
language English
last_indexed 2024-07-04T05:50:20Z
publishDate 2014
publisher AICIT, Korea
record_format eprints
spelling uum-12602 2016-05-15T01:07:50Z https://repo.uum.edu.my/id/eprint/12602/ Two bigrams based language model for auto correction of Arabic OCR errors Habeeb, Imad Q. Mohd Yusof, Shahrul Azmi Ahmad, Faudziah QA76 Computer software In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a two bi-grams based language model that uses Wikipedia's database is presented. The method can perform auto detection and correction of non-word errors in Arabic OCR text, and auto detection of real-word errors. The method consists of two parts: extracting the context information from Wikipedia's database, and implementing the auto detection and correction of incorrect words. This method can be applied to any language with little modification. The experimental results show successful extraction of context information from Wikipedia's articles. Furthermore, they show that using this method can reduce the error rate of Arabic OCR text. AICIT, Korea 2014-02 Article PeerReviewed application/pdf en https://repo.uum.edu.my/id/eprint/12602/1/JDCTA3630PPL.pdf Habeeb, Imad Q. and Mohd Yusof, Shahrul Azmi and Ahmad, Faudziah (2014) Two bigrams based language model for auto correction of Arabic OCR errors. International Journal of Digital Content Technology and its Applications (JDCTA), 8 (1). pp. 72-80. ISSN 2233-9310 http://www.aicit.org/jdcta/global/paper_detail.html?jname=JDCTA&q=3630
spellingShingle QA76 Computer software
Habeeb, Imad Q.
Mohd Yusof, Shahrul Azmi
Ahmad, Faudziah
Two bigrams based language model for auto correction of Arabic OCR errors
title Two bigrams based language model for auto correction of Arabic OCR errors
topic QA76 Computer software
url https://repo.uum.edu.my/id/eprint/12602/1/JDCTA3630PPL.pdf