From Boltzmann to Zipf through Shannon and Jaynes
The word-frequency distribution provides the fundamental building blocks that generate discourse in natural language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf’s law, at least approximately. Following Stephens and Bialek (2010)…
Main Authors: | Álvaro Corral; Montserrat García del Muro |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-02-01 |
Series: | Entropy |
Subjects: | maximum entropy principle; two-letter interactions; boltzmann factor; word-frequency distribution; zipf’s law; quantitative linguistics; power laws |
Online Access: | https://www.mdpi.com/1099-4300/22/2/179 |
_version_ | 1828275585754857472 |
author | Álvaro Corral; Montserrat García del Muro |
author_facet | Álvaro Corral; Montserrat García del Muro |
author_sort | Álvaro Corral |
collection | DOAJ |
description | The word-frequency distribution provides the fundamental building blocks that generate discourse in natural language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf’s law, at least approximately. Following Stephens and Bialek (2010), we interpret the frequency of any word as arising from the interaction potentials between its constituent letters. Indeed, Jaynes’ maximum-entropy principle, with the constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of the all-to-all pairwise (two-letter) potentials. The so-called improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. We considerably extend Stephens and Bialek’s results, applying this formalism to words with lengths of up to six letters from the English subset of the recently created Standardized Project Gutenberg Corpus. We find that the model is able to reproduce Zipf’s law, but with some limitations: the general Zipf’s power-law regime is obtained, but the probability of individual words shows considerable scattering. In this way, a pure statistical-physics framework is used to describe the probabilities of words. As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws. |
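The procedure the abstract describes can be illustrated with a small, self-contained sketch. This is not the authors' code: it uses a hypothetical two-letter alphabet and length-3 "words", and fits the pairwise potentials with plain generalized iterative scaling rather than the improved iterative-scaling variant used in the paper. The maximum-entropy model with two-letter marginal constraints takes the Boltzmann form P(w) ∝ exp(−E(w)), with E(w) the sum of pairwise potentials, and at convergence the fitted model reproduces the empirical two-letter marginals.

```python
# Hypothetical sketch (toy alphabet, plain iterative scaling), not the
# authors' implementation: fit pairwise letter potentials J so that the
# maximum-entropy model P(w) ∝ exp(-E(w)), with E(w) the sum of the
# pairwise potentials, matches the empirical two-letter marginals.
import itertools
import numpy as np

ALPHABET = "ab"   # toy alphabet; the paper uses the full English alphabet
L = 3             # toy word length; the paper goes up to six letters
A = len(ALPHABET)

# Toy "corpus": a random empirical distribution over all length-3 strings.
rng = np.random.default_rng(0)
words = ["".join(w) for w in itertools.product(ALPHABET, repeat=L)]
p_emp = rng.dirichlet(np.ones(len(words)))

def marginals(p):
    """Two-letter marginal table p_ij(a, b) for every position pair i < j."""
    m = {}
    for i, j in itertools.combinations(range(L), 2):
        t = np.zeros((A, A))
        for w, pw in zip(words, p):
            t[ALPHABET.index(w[i]), ALPHABET.index(w[j])] += pw
        m[(i, j)] = t
    return m

emp = marginals(p_emp)

# One A x A potential table per position pair; start from zero energy.
J = {k: np.zeros((A, A)) for k in emp}

def model_probs():
    """Boltzmann distribution: P(w) ∝ exp(-E(w)), E(w) = sum of potentials."""
    E = np.array([sum(J[(i, j)][ALPHABET.index(w[i]), ALPHABET.index(w[j])]
                      for (i, j) in J)
                  for w in words])
    p = np.exp(-E)
    return p / p.sum()

# Generalized iterative scaling: each word activates exactly len(J) pair
# features, so the update is damped by 1/len(J). The minus sign is because
# the potentials enter the exponent as -J.
eps = 1e-12
for _ in range(1000):
    mod = marginals(model_probs())
    for k in J:
        J[k] -= np.log((emp[k] + eps) / (mod[k] + eps)) / len(J)

fitted = marginals(model_probs())
err = max(np.abs(fitted[k] - emp[k]).max() for k in emp)
print(f"max two-letter marginal error: {err:.2e}")
```

Note that matching the pairwise marginals does not force the model to match the full word distribution: on three letters the pairwise model lacks the three-way interaction, which is the same kind of limitation the abstract reports as scattering of individual word probabilities.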
first_indexed | 2024-04-13T06:49:09Z |
format | Article |
id | doaj.art-2548812c53704d07a4833241a7a0dcb6 |
institution | Directory Open Access Journal |
issn | 1099-4300 |
language | English |
last_indexed | 2024-04-13T06:49:09Z |
publishDate | 2020-02-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-2548812c53704d07a4833241a7a0dcb6 2022-12-22T02:57:28Z | eng | MDPI AG | Entropy, ISSN 1099-4300, 2020-02-01, vol. 22, no. 2, art. 179, doi:10.3390/e22020179 (e22020179) | From Boltzmann to Zipf through Shannon and Jaynes | Álvaro Corral (Centre de Recerca Matemàtica, Edifici C, Campus Bellaterra, E-08193 Barcelona, Spain); Montserrat García del Muro (Departament de Física de la Matèria Condensada, Universitat de Barcelona, Martí i Franquès 1, E-08028 Barcelona, Spain) |
spellingShingle | Álvaro Corral; Montserrat García del Muro; From Boltzmann to Zipf through Shannon and Jaynes; Entropy; maximum entropy principle; two-letter interactions; boltzmann factor; word-frequency distribution; zipf’s law; quantitative linguistics; power laws |
title | From Boltzmann to Zipf through Shannon and Jaynes |
title_full | From Boltzmann to Zipf through Shannon and Jaynes |
title_fullStr | From Boltzmann to Zipf through Shannon and Jaynes |
title_full_unstemmed | From Boltzmann to Zipf through Shannon and Jaynes |
title_short | From Boltzmann to Zipf through Shannon and Jaynes |
title_sort | from boltzmann to zipf through shannon and jaynes |
topic | maximum entropy principle; two-letter interactions; boltzmann factor; word-frequency distribution; zipf’s law; quantitative linguistics; power laws |
url | https://www.mdpi.com/1099-4300/22/2/179 |
work_keys_str_mv | AT alvarocorral fromboltzmanntozipfthroughshannonandjaynes AT montserratgarciadelmuro fromboltzmanntozipfthroughshannonandjaynes |