The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents
In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, th...
Main Authors: | Julia Damerow, B. R. Erick Peirson, Manfred D. Laubichler |
---|---|
Format: | Article |
Language: | English |
Published: | Ubiquity Press, 2017-09-01 |
Series: | Journal of Open Research Software |
Subjects: | Text extraction, OCR, Document storage, Apache Kafka, Java, Spring Framework |
Online Access: | https://openresearchsoftware.metajnl.com/articles/164 |
_version_ | 1811293890701950976 |
---|---|
author | Julia Damerow; B. R. Erick Peirson; Manfred D. Laubichler |
collection | DOAJ |
description | In the digital humanities, there is a constant need to turn images and PDF files into plain text to apply analyses such as topic modelling, named entity recognition, and other techniques. However, although there exist different solutions to extract text embedded in PDF files or run OCR on images, they typically require additional training (for example, scholars have to learn how to use the command line) or are difficult to automate without programming skills. The Giles Ecosystem is a distributed system based on Apache Kafka that allows users to upload documents for text and image extraction. The system components are implemented using Java and the Spring Framework and are available under an Open Source license on GitHub (https://github.com/diging/). Funding statement: Funding was provided by grants from NSF SES 1656284, ASU Presidential Strategic Initiative Fund and the Smart Family Foundation. |
first_indexed | 2024-04-13T05:08:21Z |
format | Article |
id | doaj.art-81b79b76877c440b84393650a78a9d65 |
institution | Directory Open Access Journal |
issn | 2049-9647 |
language | English |
last_indexed | 2024-04-13T05:08:21Z |
publishDate | 2017-09-01 |
publisher | Ubiquity Press |
record_format | Article |
series | Journal of Open Research Software |
title | The Giles Ecosystem – Storage, Text Extraction, and OCR of Documents |
topic | Text extraction; OCR; Document storage; Apache Kafka; Java; Spring Framework |
url | https://openresearchsoftware.metajnl.com/articles/164 |