The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents

Bibliographic Details
Main Authors: Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler
Format: Article
Language: English
Published: Ubiquity Press 2017-09-01
Series: Journal of Open Research Software
Online Access: https://openresearchsoftware.metajnl.com/articles/164
Description
Summary: In the digital humanities, there is a constant need to turn images and PDF files into plain text so that analyses such as topic modelling and named entity recognition can be applied. Although several solutions exist for extracting text embedded in PDF files or running OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented in Java with the Spring Framework and are available under an open source license on GitHub (https://github.com/diging/). Funding statement: Funding was provided by grants from NSF SES 1656284, the ASU Presidential Strategic Initiative Fund, and the Smart Family Foundation.
ISSN: 2049-9647