Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum.
Main Authors: | Alexander Koplenig, Sascha Wolfer, Carolin Müller-Spitzer |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2019-05-01 |
Series: | Entropy |
Subjects: | generalized entropy; generalized divergence; Jensen–Shannon divergence; sample size; text length; Zipf’s law |
Online Access: | https://www.mdpi.com/1099-4300/21/5/464 |
_version_ | 1828120526029062144 |
---|---|
author | Alexander Koplenig; Sascha Wolfer; Carolin Müller-Spitzer |
author_facet | Alexander Koplenig; Sascha Wolfer; Carolin Müller-Spitzer |
author_sort | Alexander Koplenig |
collection | DOAJ |
description | Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine <i>Der Spiegel</i> (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages. |
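The generalized entropy of order α referred to in the description is, in its standard (Tsallis-type) form, H_α(p) = (1 − Σ_i p_i^α)/(α − 1), and the similarity measure is the Jensen–Shannon-type divergence built from it. The Python snippet below is a minimal illustrative sketch of that idea, not the authors' implementation; the function names, the chosen α values, and the toy texts are assumptions made here for illustration.

```python
"""Minimal sketch (not the article's code): a generalized (Tsallis-type)
entropy of order alpha and the derived generalized Jensen-Shannon
divergence between two word-frequency distributions."""
import math
from collections import Counter


def word_distribution(text):
    """Relative word frequencies of a whitespace-tokenized, lowercased text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def generalized_entropy(p, alpha):
    """H_alpha(p) = (1 - sum_i p_i**alpha) / (alpha - 1);
    alpha -> 1 recovers the Shannon entropy (natural logarithm)."""
    probs = [x for x in p.values() if x > 0]
    if abs(alpha - 1.0) < 1e-9:
        return -sum(x * math.log(x) for x in probs)
    return (1.0 - sum(x ** alpha for x in probs)) / (alpha - 1.0)


def generalized_jsd(p, q, alpha):
    """Entropy of the 50/50 mixture of p and q minus the average of their
    entropies. Small alpha stresses rare word types, large alpha frequent ones."""
    vocab = set(p) | set(q)
    mix = {w: 0.5 * p.get(w, 0.0) + 0.5 * q.get(w, 0.0) for w in vocab}
    return generalized_entropy(mix, alpha) - 0.5 * (
        generalized_entropy(p, alpha) + generalized_entropy(q, alpha)
    )


if __name__ == "__main__":
    # Two invented toy samples (not taken from Der Spiegel).
    p = word_distribution("the new government announced the new budget plan")
    q = word_distribution("the old parliament debated the old election law")
    for alpha in (0.5, 1.0, 2.0):
        print(f"alpha = {alpha}: D_alpha = {generalized_jsd(p, q, alpha):.4f}")
```

Small α gives more weight to the many rare (content) word types, while large α emphasizes the few highly frequent (function) words; because the number of observed rare types keeps growing with text length under Zipf’s law, estimates of such divergences shift systematically with sample size, which is the problem the article investigates.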
first_indexed | 2024-04-11T14:06:24Z |
format | Article |
id | doaj.art-e4e4a72b2ae44ad2a6f9bec2c704847b |
institution | Directory Open Access Journal |
issn | 1099-4300 |
language | English |
last_indexed | 2024-04-11T14:06:24Z |
publishDate | 2019-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-e4e4a72b2ae44ad2a6f9bec2c704847b; 2022-12-22T04:19:53Z; eng; MDPI AG; Entropy; 1099-4300; 2019-05-01; vol. 21, issue 5, article 464; doi:10.3390/e21050464; e21050464; Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size; Alexander Koplenig, Sascha Wolfer, Carolin Müller-Spitzer, all: Department of Lexical Studies, Institute for the German language (IDS), 68161 Mannheim, Germany; abstract as given in the description field above; https://www.mdpi.com/1099-4300/21/5/464; generalized entropy; generalized divergence; Jensen–Shannon divergence; sample size; text length; Zipf’s law |
spellingShingle | Alexander Koplenig; Sascha Wolfer; Carolin Müller-Spitzer; Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size; Entropy; generalized entropy; generalized divergence; Jensen–Shannon divergence; sample size; text length; Zipf’s law |
title | Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size |
title_full | Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size |
title_fullStr | Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size |
title_full_unstemmed | Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size |
title_short | Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size |
title_sort | studying lexical dynamics and language change via generalized entropies the problem of sample size |
topic | generalized entropy; generalized divergence; Jensen–Shannon divergence; sample size; text length; Zipf’s law |
url | https://www.mdpi.com/1099-4300/21/5/464 |
work_keys_str_mv | AT alexanderkoplenig studyinglexicaldynamicsandlanguagechangeviageneralizedentropiestheproblemofsamplesize AT saschawolfer studyinglexicaldynamicsandlanguagechangeviageneralizedentropiestheproblemofsamplesize AT carolinmullerspitzer studyinglexicaldynamicsandlanguagechangeviageneralizedentropiestheproblemofsamplesize |