Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR

Applications based on Long Short-Term Memory (LSTM) require large amounts of data for their training. Tesseract LSTM is a popular Optical Character Recognition (OCR) engine that has been trained and used in various languages. However, its training becomes obstructed when the target language is not r...


Bibliographic Details
Main Authors:	Saman Idrees, Hossein Hassani
Format:	Article
Language:	English
Published:	MDPI AG 2021-10-01
Series:	Applied Sciences
Subjects:	optical character recognition; tesseract; printed-document OCR; Kurdish-OCR system; offline character recognition system
Online Access:	https://www.mdpi.com/2076-3417/11/20/9752

_version_	1797515349575335936
author	Saman Idrees, Hossein Hassani
affiliation	Department of Computer Science and Engineering, University of Kurdistan Hewlêr, 30 Meter, Kurdistan Region, Erbil 44001, Iraq (both authors)
collection	DOAJ
description	Applications based on Long Short-Term Memory (LSTM) require large amounts of data for their training. Tesseract LSTM is a popular Optical Character Recognition (OCR) engine that has been trained and used in various languages. However, its training is hindered when the target language is less-resourced. This research suggests a remedy for the problem of scant data in training Tesseract LSTM for a new language by exploiting a training dataset for a language with a similar script. The target of the experiment is Kurdish, a multi-dialect language that is considered less-resourced. We choose Sorani, one of the Kurdish dialects, which is mostly written in Persian-Arabic script. We train Tesseract using an Arabic dataset, and then we use a considerably small amount of text in Persian-Arabic script to train the engine to recognize Sorani texts. Our dataset is based on a series of court case documents in the Kurdistan Region of Iraq. We also fine-tune the engine using 10 Unikurd fonts. We use Lstmeval and Ocreval to evaluate the outputs. The results indicate an accuracy of 95.45%. We also test the engine using texts outside the context of court cases. The accuracy of the system remains close to the earlier result, indicating that script similarity can be used to overcome the lack of large-scale data.
doi	10.3390/app11209752
first_indexed	2024-03-10T06:44:15Z
format	Article
id	doaj.art-5077598697104fcc9ee94afd9f2804e2
institution	Directory Open Access Journal
issn	2076-3417
language	English
last_indexed	2024-03-10T06:44:15Z
publishDate	2021-10-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
volume	11
issue	20
article number	9752
url	https://www.mdpi.com/2076-3417/11/20/9752


Similar Items