Optical Character Recognition Applied to Romanian Printed Texts of the 18th–20th Century

Bibliographic Details
Main Authors: Svetlana Cojocaru, Alexandru Colesnicov, Ludmila Malahov, Tudor Bumbu
Format: Article
Language: English
Published: Vladimir Andrunachievici Institute of Mathematics and Computer Science 2016-04-01
Series: Computer Science Journal of Moldova
Online Access: http://www.math.md/files/csjm/v24-n1/v24-n1-(pp106-117).pdf
Description
Summary: The paper discusses Optical Character Recognition (OCR) of historical texts of the 18th–20th century in the Romanian language using the Cyrillic script. We distinguish three epochs (approximately the 18th, 19th, and 20th centuries), each with a different usage of the Cyrillic alphabet in Romanian and, correspondingly, a different approach to OCR. We developed historical alphabets and sets of glyph recognition templates specific to each epoch. Dictionaries in the appropriate alphabets and orthographies were also created. In addition, virtual keyboards, fonts, transliteration utilities, etc. were developed. The resulting technology and toolset permit successful recognition of historical Romanian texts in the Cyrillic script. After transliteration to the modern Latin script, we obtain barrier-free access to historical documents.
ISSN: 1561-4042