The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, th...

Bibliographic Details
Main Authors:	Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler
Format:	Article
Language:	English
Published:	Ubiquity Press 2017-09-01
Series:	Journal of Open Research Software
Subjects:	Text extraction; OCR; Document storage; Apache Kafka; Java; Spring Framework
Online Access:	https://openresearchsoftware.metajnl.com/articles/164

Similar Items

Utilization of OCR and text feature extraction to create a database of labour complaints
by: Yan Puspitarani, et al.
Published: (2020-08-01)

Analysis of Recent Deep Learning Techniques for Arabic Handwritten-Text OCR and Post-OCR Correction
by: Rayyan Najam, et al.
Published: (2023-06-01)

A New Big Data Processing Framework for the Online Roadshow
by: Kang-Ren Leow, et al.
Published: (2023-06-01)

Gradual OCR: An Effective OCR Approach Based on Gradual Detection of Texts
by: Youngki Park, et al.
Published: (2023-11-01)

Basic Test Framework for the Evaluation of Text Line Segmentation and Text Parameter Extraction
by: Darko Brodić, et al.
Published: (2010-05-01)

The Use of Blockchain Technology and OCR in E-Government for Document Management: Inbound Invoice Management as an Example
by: Fatima Azzam, et al.
Published: (2023-07-01)

LSTM Network and OCR Performance for Classification of Decimal Dewey Classification Code
by: Yesy Diah Rosita, et al.
Published: (2020-04-01)

Experimental evaluation of Arabic OCR systems
by: Mansoor Alghamdi, et al.
Published: (2017-11-01)

The Impact of OCR and Machine Learning Software in the Processing of Financial-Accounting Documents and Information
by: Stoica Raluca Andreea, et al.
Published: (2023-12-01)

Generating an Ordered Data Set from an OCR Text File
by: Jon Crump
Published: (2014-11-01)

Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR
by: Saman Idrees, et al.
Published: (2021-10-01)

中文OCR文件檢索測試集之製作與應用 Construction and Application of an Chinese OCR Test Collection for Information Retrieval
by: Mung-Chu Tsai, et al.
Published: (2003-03-01)

A Regularization-Based Big Data Framework for Winter Precipitation Forecasting on Streaming Data
by: Andreas Kanavos, et al.
Published: (2021-08-01)

Comparative analysis of message brokers
by: Mateusz Kaczor, et al.
Published: (2022-06-01)

IRONEDGE: Stream Processing Architecture for Edge Applications
by: João Pedro Vitorino, et al.
Published: (2023-02-01)

An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
by: Quoc-Dung Nguyen, et al.
Published: (2023-01-01)

A Novel True Real-Time Spatiotemporal Data Stream Processing Framework
by: Ature Angbera, et al.
Published: (2022-09-01)

Improving AI Text Recognition Accuracy with Enhanced OCR For Automated Guided Vehicle
by: Florentinus Budi Setiawan, et al.
Published: (2022-10-01)

Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)
by: Thomas Schmidt, et al.
Published: (2024-06-01)

OCR Applied for Identification of Vehicles with Irregular Documentation Using IoT
by: Luiz Alfonso Glasenapp, et al.
Published: (2023-02-01)

Efficient Text Bounding Box Identification Using Mask R-CNN: Case of Thai Documents
by: Phanthakan Kiatphaisansophon, et al.
Published: (2024-01-01)

Improving Scene Text Recognition for Indian Languages with Transfer Learning and Font Diversity
by: Sanjana Gunna, et al.
Published: (2022-03-01)

Engineering Resource-Efficient Data Management for Smart Cities with Apache Kafka
by: Theofanis P. Raptis, et al.
Published: (2023-01-01)

Thinning: A Preprocessing Technique for an OCR System for the Brahmi Script
by: H. K. Anasuya Devi
Published: (2006-12-01)

A Real-Time Streaming System for Customized Network Traffic Capture
by: Adrian-Tiberiu Costin, et al.
Published: (2023-07-01)

Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR)
by: Matthias Gille Levenson
Published: (2023-10-01)

Volltexte für die Forschung: OCR partizipativ, iterativ und on Demand
by: Anke Hertling, et al.
Published: (2022-08-01)

Projekt OCR-BW
by: Dorothee Huff, et al.
Published: (2022-11-01)

SCADA-Based Message Generator for Multi-Vendor Smart Grids: Distributed Integration and Verification of TASE.2
by: Petr Ilgner, et al.
Published: (2021-10-01)

LAN Traffic Capture Applications Using the Libtins Library
by: Adrian-Tiberiu Costin, et al.
Published: (2021-12-01)

Termination as the Basis for Classification of Document Texts
by: Marina V. Kosova, et al.
Published: (2017-12-01)

Semantic Text Segmentation from Synthetic Images of Full-Text Documents
by: Lukáš Bureš, et al.
Published: (2019-12-01)

Thresholding: A Pixel-Level Image Processing Methodology Preprocessing Technique for an OCR System for the Brahmi Script
by: H. K. Anasuya Devi
Published: (2006-12-01)

Publish/Subscribe Method for Real-Time Data Processing in Massive IoT Leveraging Blockchain for Secured Storage
by: Mohammadhossein Ataei, et al.
Published: (2023-12-01)

Event-Driven Interoperable Manufacturing Ecosystem for Energy Consumption Monitoring
by: Andre Dionisio Rocha, et al.
Published: (2021-06-01)

Artificially Intelligent Readers: An Adaptive Framework for Original Handwritten Numerical Digits Recognition with OCR Methods
by: Parth Hasmukh Jain, et al.
Published: (2023-05-01)

Multilingual character recognition dataset for Moroccan official documents
by: Ali Benaissa, et al.
Published: (2024-02-01)

Real-time Twitter data analysis using Hadoop ecosystem
by: Anisha P. Rodrigues, et al.
Published: (2018-01-01)

Copyright Protection for Text Documents
by: Dujan Taha, et al.
Published: (2007-12-01)

A Study on Big Data Collecting and Utilizing Smart Factory Based Grid Networking Big Data Using Apache Kafka
by: Sangil Park, et al.
Published: (2023-01-01)