CArDIS: A Swedish Historical Handwritten Character and Word Dataset
This paper introduces a new publicly available image-based Swedish historical handwritten character and word dataset named Character Arkiv Digital Sweden (CArDIS) (https://cardisdataset.git...
Main Authors: | Amir Yavariabdi, Huseyin Kusetogullari, Turgay Celik, Shivani Thummanapally, Sakib Rijwan, Johan Hall |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2022-01-01 |
Series: | IEEE Access |
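Subjects: | Character and word recognition; machine learning methods; optical character recognition (OCR); old handwritten style; Swedish handwritten character dataset; Swedish handwritten word dataset |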
Online Access: | https://ieeexplore.ieee.org/document/9775079/ |
author | Amir Yavariabdi; Huseyin Kusetogullari; Turgay Celik; Shivani Thummanapally; Sakib Rijwan; Johan Hall |
collection | DOAJ |
description | This paper introduces a new publicly available image-based Swedish historical handwritten character and word dataset named Character Arkiv Digital Sweden (CArDIS) (https://cardisdataset.github.io/CARDIS/). The samples in CArDIS are collected from 64,084 Swedish historical documents written by several anonymous priests between 1800 and 1900. The character dataset contains 116,000 images of Swedish alphabet characters in RGB color space across 29 classes, whereas the word dataset contains 30,000 image samples of ten popular Swedish names as well as 1,000 region names in Sweden. To examine the performance of different machine learning classifiers on the CArDIS dataset, three experiments are conducted. In the first experiment, Support Vector Machine (SVM), Artificial Neural Network (ANN), k-Nearest Neighbor (k-NN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Random Forest (RF) classifiers are trained on existing character datasets, namely Extended Modified National Institute of Standards and Technology (EMNIST), IAM, and CVL, and tested on the CArDIS dataset. In the second and third experiments, the same classifiers, as well as pre-trained VGG-16 and VGG-19 classifiers, are trained and tested on the CArDIS character and word datasets. The experiments show that machine learning methods trained on existing handwritten character datasets struggle to recognize characters in the CArDIS dataset, proving that the characters in CArDIS contain unique features and characteristics. Moreover, in the last two experiments, the deep learning-based classifiers provide the best recognition rates. |
first_indexed | 2024-04-14T00:17:03Z |
id | doaj.art-788a50842cd74f388764b0d5c6ac201e |
institution | Directory Open Access Journal |
issn | 2169-3536 |
last_indexed | 2024-04-14T00:17:03Z |
spelling | doaj.art-788a50842cd74f388764b0d5c6ac201e; 2022-12-22T02:23:06Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2022-01-01; vol. 10, pp. 55338-55349; DOI 10.1109/ACCESS.2022.3175197; document 9775079. Authors and affiliations: Amir Yavariabdi (https://orcid.org/0000-0002-6264-5010), Department of Mechatronics Engineering, KTO Karatay University, Konya, Turkey; Huseyin Kusetogullari (https://orcid.org/0000-0001-5762-6678), Department of Computer Science, Blekinge Institute of Technology, Karlskrona, Sweden; Turgay Celik, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa; Shivani Thummanapally, Department of Computer Science, Blekinge Institute of Technology, Karlskrona, Sweden; Sakib Rijwan, Department of Computer Science, Blekinge Institute of Technology, Karlskrona, Sweden; Johan Hall (https://orcid.org/0000-0003-4537-341X), Arkiv Digital, Stockholm, Sweden |
title | CArDIS: A Swedish Historical Handwritten Character and Word Dataset |
topic | Character and word recognition; machine learning methods; optical character recognition (OCR); old handwritten style; Swedish handwritten character dataset; Swedish handwritten word dataset |
url | https://ieeexplore.ieee.org/document/9775079/ |