Two bigrams based language model for auto correction of Arabic OCR errors
In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a language model based on two bi-grams that uses Wikipedia's database is presented. The method can automatically detect and correct non-word errors in Arabic OCR...
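The general idea of bigram-based correction of non-word errors can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "two bi-grams" refers to the bigram formed with the preceding word and the bigram formed with the following word, that a non-word error is any token missing from the vocabulary, and that candidates come from simple string similarity. The function names `build_model` and `correct` and the toy English corpus are hypothetical stand-ins for context information extracted from Wikipedia.

```python
from collections import Counter
from difflib import get_close_matches


def build_model(sentences):
    """Collect word counts and bigram counts from whitespace-tokenized text."""
    vocab = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return vocab, bigrams


def correct(tokens, vocab, bigrams, max_candidates=5):
    """Replace out-of-vocabulary tokens, ranking candidates by context.

    Assumption: each candidate is scored by the count of the
    (previous word, candidate) bigram plus the count of the
    (candidate, next word) bigram; word frequency breaks ties.
    """
    corrected = list(tokens)
    for i, token in enumerate(corrected):
        if token in vocab:
            continue  # real-word errors are not handled in this sketch
        candidates = get_close_matches(
            token, list(vocab), n=max_candidates, cutoff=0.6
        )
        if not candidates:
            continue  # no plausible replacement, leave the token as-is
        prev_word = corrected[i - 1] if i > 0 else None
        next_word = corrected[i + 1] if i + 1 < len(corrected) else None

        def score(candidate):
            left = bigrams[(prev_word, candidate)] if prev_word else 0
            right = bigrams[(candidate, next_word)] if next_word else 0
            return (left + right, vocab[candidate])

        corrected[i] = max(candidates, key=score)
    return corrected


# Toy corpus standing in for text extracted from Wikipedia articles.
corpus = ["the cat sat on the mat", "he cut the rope"]
vocab, bigrams = build_model(corpus)

print(correct("the cot sat on the mat".split(), vocab, bigrams))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(correct("he cot the rope".split(), vocab, bigrams))
# ['he', 'cut', 'the', 'rope']
```

The same misrecognized token is corrected differently in the two sentences because the surrounding words differ, which is the reason for consulting context rather than string similarity alone.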
Main Authors: | Habeeb, Imad Q.; Mohd Yusof, Shahrul Azmi; Ahmad, Faudziah |
---|---|
Format: | Article |
Language: | English |
Published: | AICIT, Korea, 2014 |
Subjects: | QA76 Computer software |
Online Access: | https://repo.uum.edu.my/id/eprint/12602/1/JDCTA3630PPL.pdf |
_version_ | 1825803035621720064 |
---|---|
author | Habeeb, Imad Q.; Mohd Yusof, Shahrul Azmi; Ahmad, Faudziah |
author_sort | Habeeb, Imad Q. |
collection | UUM |
description | In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a language model based on two bi-grams that uses Wikipedia's database is presented. The method can automatically detect and correct non-word errors in Arabic OCR text, and can automatically detect real-word errors. The method consists of two parts: extracting context information from Wikipedia's database, and implementing the automatic detection and correction of incorrect words. This method can be applied to any language with little modification. The experimental results show successful extraction of context information from Wikipedia's articles. Furthermore, they show that using this method can reduce the error rate of Arabic OCR text. |
first_indexed | 2024-07-04T05:50:20Z |
format | Article |
id | uum-12602 |
institution | Universiti Utara Malaysia |
language | English |
last_indexed | 2024-07-04T05:50:20Z |
publishDate | 2014 |
publisher | AICIT, Korea |
record_format | eprints |
spelling | uum-12602; 2016-05-15T01:07:50Z; https://repo.uum.edu.my/id/eprint/12602/; QA76 Computer software; AICIT, Korea; 2014-02; Article; PeerReviewed; application/pdf; en; https://repo.uum.edu.my/id/eprint/12602/1/JDCTA3630PPL.pdf; Habeeb, Imad Q. and Mohd Yusof, Shahrul Azmi and Ahmad, Faudziah (2014) Two bigrams based language model for auto correction of Arabic OCR errors. International Journal of Digital Content Technology and its Applications (JDCTA), 8 (1). pp. 72-80. ISSN 2233-9310; http://www.aicit.org/jdcta/global/paper_detail.html?jname=JDCTA&q=3630 |
title | Two bigrams based language model for auto correction of Arabic OCR errors |
topic | QA76 Computer software |
url | https://repo.uum.edu.my/id/eprint/12602/1/JDCTA3630PPL.pdf |