A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far ei...

Bibliographic Details
Main Authors:	Martin Gerlach, Francesc Font-Clos
Format:	Article
Language:	English
Published:	MDPI AG 2020-01-01
Series:	Entropy
Subjects:	project gutenberg; jensen–shannon divergence; reproducibility; quantitative linguistics; natural language processing
Online Access:	https://www.mdpi.com/1099-4300/22/1/126

author	Martin Gerlach, Francesc Font-Clos
collection	DOAJ
description	The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than $3 \times 10^{9}$ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
first_indexed	2024-04-11T21:46:07Z
format	Article
id	doaj.art-b9a45c3917664a0aa1026ab9c6984b9f
institution	Directory of Open Access Journals
issn	1099-4300
language	English
last_indexed	2024-04-11T21:46:07Z
publishDate	2020-01-01
publisher	MDPI AG
record_format	Article
series	Entropy
spelling	doaj.art-b9a45c3917664a0aa1026ab9c6984b9f; 2022-12-22T04:01:25Z; eng; MDPI AG; Entropy; 1099-4300; 2020-01-01; 22; 1; 126; 10.3390/e22010126; e22010126; Martin Gerlach (Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA); Francesc Font-Clos (Center for Complexity and Biosystems, Department of Physics, University of Milan, 20133 Milano, Italy)
title	A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics
topic	project gutenberg; jensen–shannon divergence; reproducibility; quantitative linguistics; natural language processing
url	https://www.mdpi.com/1099-4300/22/1/126
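
The description above names three levels of granularity (raw text, a sequence of word tokens, and word counts), and the subject terms include the Jensen–Shannon divergence. The following is a minimal sketch of how those three levels relate and how two count tables could be compared. It is not the SPGC pipeline: the regex tokenizer, the function names, and the two example sentences are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(raw_text):
    """Lowercase the text and keep runs of letters.

    Hypothetical simplification; the record does not specify the
    tokenizer used for the published corpus.
    """
    return re.findall(r"[^\W\d_]+", raw_text.lower())


def jensen_shannon_divergence(counts_p, counts_q):
    """Jensen-Shannon divergence, in bits, between two word-count tables."""
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    divergence = 0.0
    for word in set(counts_p) | set(counts_q):
        p = counts_p.get(word, 0) / total_p
        q = counts_q.get(word, 0) / total_q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Level 1: raw text (toy examples, not taken from the corpus)
raw_a = "The cat sat on the mat."
raw_b = "The dog sat on the log."

# Level 2: sequence of word tokens
tokens_a = tokenize(raw_a)
tokens_b = tokenize(raw_b)

# Level 3: counts of words
counts_a = Counter(tokens_a)
counts_b = Counter(tokens_b)

jsd = jensen_shannon_divergence(counts_a, counts_b)
print(f"Jensen-Shannon divergence: {jsd:.4f} bits")
```

With base-2 logarithms the divergence lies between 0 (identical word distributions) and 1 (no shared vocabulary), which makes values comparable across pairs of texts of different length.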

Similar Items