From Boltzmann to Zipf through Shannon and Jaynes


Bibliographic Details
Main Authors: Álvaro Corral, Montserrat García del Muro
Format: Article
Language:English
Published: MDPI AG 2020-02-01
Series:Entropy
Subjects:
Online Access:https://www.mdpi.com/1099-4300/22/2/179
author Álvaro Corral
Montserrat García del Muro
collection DOAJ
description The word-frequency distribution provides the fundamental building blocks that generate discourse in natural language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf’s law, at least approximately. Following Stephens and Bialek (2010), we interpret the frequency of any word as arising from the interaction potentials between its constituent letters. Indeed, Jaynes’ maximum-entropy principle, with the constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of the all-to-all pairwise (two-letter) potentials. The so-called improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. We considerably extend Stephens and Bialek’s results, applying this formalism to words of up to six letters from the English subset of the recently created Standardized Project Gutenberg Corpus. We find that the model is able to reproduce Zipf’s law, but with some limitations: the general Zipf’s power-law regime is obtained, but the probability of individual words shows considerable scatter. In this way, a purely statistical-physics framework is used to describe the probabilities of words. As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws.
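As a rough illustration of the formalism the abstract describes, the sketch below fits pairwise (two-letter) Boltzmann factors to the empirical two-letter marginals of a tiny invented corpus, then reads off word probabilities from the resulting Boltzmann distribution. The alphabet, corpus, and word length are hypothetical toy choices, not the paper's data, and plain iterative proportional fitting is used here in place of the improved iterative-scaling algorithm named in the abstract.

```python
import itertools
import math
from collections import Counter

# Hypothetical toy corpus of 3-letter words over a 3-letter alphabet
# (the paper itself uses the English subset of the Standardized
# Project Gutenberg Corpus, with words of up to six letters).
alphabet = "abc"
L = 3
corpus = ["aba", "abc", "cab", "bac", "aba", "aab", "cba", "abc", "aba", "bca"]

# Empirical two-letter marginals p_ij(x, y) for every position pair i < j.
pairs = [(i, j) for i in range(L) for j in range(i + 1, L)]
emp = {ij: Counter() for ij in pairs}
for word in corpus:
    for (i, j) in pairs:
        emp[(i, j)][(word[i], word[j])] += 1
for ij in pairs:
    total = sum(emp[ij].values())
    emp[ij] = {xy: c / total for xy, c in emp[ij].items()}

# Boltzmann weights exp(-V_ij(x, y)) stored directly as factors, all
# initialised to 1 (i.e. zero interaction potentials).
factor = {ij: {(x, y): 1.0 for x in alphabet for y in alphabet} for ij in pairs}
words = ["".join(t) for t in itertools.product(alphabet, repeat=L)]

def model_probs():
    """Boltzmann distribution: P(w) proportional to the product of all
    pairwise factors, i.e. exp(-sum of pairwise potentials)."""
    unnorm = {w: math.prod(factor[(i, j)][(w[i], w[j])] for (i, j) in pairs)
              for w in words}
    Z = sum(unnorm.values())  # partition function
    return {w: v / Z for w, v in unnorm.items()}

# Iterative proportional fitting: rescale each pair's factor so the
# model's two-letter marginal matches the empirical one.
for _ in range(200):
    for ij in pairs:
        p = model_probs()
        marg = Counter()
        for w, pw in p.items():
            marg[(w[ij[0]], w[ij[1]])] += pw
        for xy in factor[ij]:
            target = emp[ij].get(xy, 0.0)
            factor[ij][xy] *= target / marg[xy] if marg[xy] > 0 else 0.0

p = model_probs()
# At convergence the model marginal for pair (0, 1) and letters ("a", "b")
# matches the empirical value.
i, j = pairs[0]
fit = sum(pw for w, pw in p.items() if (w[i], w[j]) == ("a", "b"))
print(round(fit, 3), round(emp[pairs[0]].get(("a", "b"), 0.0), 3))
```

Because the marginals come from an actual empirical distribution, they are consistent, and iterative proportional fitting converges to the unique maximum-entropy (Boltzmann) distribution matching all of them; the fitted potentials are then V_ij(x, y) = -log of the corresponding factor.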
format Article
id doaj.art-2548812c53704d07a4833241a7a0dcb6
institution Directory Open Access Journal
issn 1099-4300
language English
publishDate 2020-02-01
publisher MDPI AG
record_format Article
series Entropy
doi 10.3390/e22020179
affiliation Álvaro Corral: Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain
affiliation Montserrat García del Muro: Departament de Física de la Matèria Condensada, Universitat de Barcelona, Martí i Franquès 1, E-08028 Barcelona, Spain
title From Boltzmann to Zipf through Shannon and Jaynes
topic maximum entropy principle
two-letter interactions
boltzmann factor
word-frequency distribution
zipf’s law
quantitative linguistics
power laws
url https://www.mdpi.com/1099-4300/22/2/179
work_keys_str_mv AT alvarocorral fromboltzmanntozipfthroughshannonandjaynes
AT montserratgarciadelmuro fromboltzmanntozipfthroughshannonandjaynes