Zipf’s law – Of dwarves and giants

Imagine this: around 6 percent of everything you say and write is the word “the”. It is the most frequent word in the English language, and you probably use it far more often than any other word. But this fact is just the tip of the iceberg of a rather puzzling and remarkable property of human language. Look at the 20 most frequent words in English: the, of, and, to, a, in, that, it, is, was, I, for, on, you, he, be, with, as, by, at (cf. http://www.wordcount.org/main.php or, as another source, http://www.wordfrequency.info). These words follow a highly regular and systematic frequency distribution, the so-called Zipf’s law, named after the linguist George Kingsley Zipf (1902-1950) (cf. Pustet 2004). According to his work, the second most frequent word of a language appears about half as often as the most frequent word, the third most frequent one a third as often, the fourth a fourth as often, and so on, until you get something like this:

And this works for all the words in a language, from highly frequent ones like the to less frequent ones like jellyfish. The frequency of a word is thus roughly proportional to 1 over its rank: it follows the Zipfian power law, a set pattern, so to speak. Word frequencies in a natural language therefore vary enormously, which is not trivial at all. As a result, there are a few ‘giant words’ such as the or with and countless ‘dwarves’ such as ravioli or catamaran, and the giants account for a huge share of the language produced.
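To make the rank rule concrete, here is a minimal Python sketch of an idealized Zipf distribution. It assumes an exponent of exactly 1 and a hypothetical vocabulary of 50,000 words; both are illustrative choices rather than figures from the sources cited here, and the function name `zipf_shares` is made up for this example.

```python
def zipf_shares(vocab_size: int, exponent: float = 1.0) -> list[float]:
    """Return the idealized share of all word tokens held by each rank.

    Rank r gets the weight 1 / r**exponent; dividing by the sum of all
    weights turns the weights into shares that add up to 1.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    shares = zipf_shares(50_000)  # hypothetical vocabulary size (assumption)
    for rank in (1, 2, 3, 4):
        print(f"rank {rank}: {shares[rank - 1]:.2%} of all tokens")
    print(f"top 20 ranks together: {sum(shares[:20]):.2%}")
```

The absolute percentages depend on the assumed vocabulary size, but the ratios do not: in this idealized model rank 2 always gets half the share of rank 1, rank 3 a third, and so on.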

And this is true not only of English but of all languages for which data is available so far, even of languages that have not yet been deciphered, e.g. Meroitic (cf. Smith 2007), which could indicate that this pattern applies to all languages in the world. Just have a look at this:

(cf. Bentz et al. 2015 or Piantadosi 2014: 1117 for even more languages: Spanish, Russian, Greek, Portuguese, Chinese, Swahili, Chilean, Finnish, Estonian, French, Czech, Turkish, Polish, Basque, Maori, Tok Pisin)

It is to some extent even true of the roughly 470 words in this short blog post:
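A check like this can be run on any text. The following is a minimal Python sketch and not the method used to produce the figures in this post: it assumes a deliberately simple tokenizer (lowercased runs of letters, keeping internal apostrophes), and the function name `rank_frequency` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[int, str, int, float]]:
    """Return (rank, word, count, zipf_prediction) rows, most frequent first.

    zipf_prediction is the count a rank would have if frequencies fell off
    exactly as 1/rank, starting from the count of the top-ranked word.
    """
    # Simplistic tokenizer (an assumption): runs of letters, apostrophes kept.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    ranked = Counter(words).most_common()
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"  # made-up sample text
    for rank, word, count, predicted in rank_frequency(sample):
        print(f"{rank:>2}  {word:<6} observed={count}  zipf={predicted:.1f}")
```

On a text of only a few hundred words the observed counts will follow the 1/rank prediction only loosely, which fits the "to some extent" above; the match usually becomes clearer as the text gets longer.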

But why is that? Many linguists have tried to figure this out and give a good reason for it; the longer-than-usual bibliography below gives an impression of that. For example, Altmann et al. (2011) claim that a word’s niche, that is, the characteristic properties and contexts in which it is used, has a strong impact on its frequency in a language and on how that frequency changes over time. To put it simply, people start to use the word chat once the concept of chat is ‘invented’, and the total number of occurrences increases from there. However, this is only one of many explanations, notions, or implications of the Zipfian distribution, the language riddle of giants and dwarves, and there is still a lot to explore about it.

For further reading explore the literature below.

Jonas Schreiber (FAU Erlangen-Nürnberg)
Intern at Brill’s Linguistic Bibliography

Bibliography

Altmann, Gabriel: Zipfian linguistics. – Glottometrics 3, 2002, 19-26.

Altmann, Eduardo G.; Pierrehumbert, Janet B.; Motter, Adilson E.: Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. – PLoS ONE 4(11): e7678, 2009. doi:10.1371/journal.pone.000767

Baayen, R. Harald: Word frequency distributions. – Dordrecht: Kluwer Academic Publishers, 2001.

Balasubrahmanyan, V. K.; Naranan, S.: Algorithmic information, complexity and Zipf’s law. – Glottometrics 4, 2002, 1-26.

Bentz, Christian; Verkerk, Annemarie; Kiela, Douwe; Hill, Felix; Buttery, Paula: Adaptive communication : languages with more non-native speakers tend to have fewer word forms. – PLoS ONE 10(6): e0128254, 2015. doi:10.1371/journal.pone.0128254.

Borin, Lars: Med Zipf mot framtiden – en integrerad lexikonresurs för svensk språkteknologi. – LexicoNordica 17, 2010, 35-54.

Dębowski, Lukasz: Zipf’s law against the text size: a half-rational model. – Glottometrics 4, 2002, 49-60.

Ellis, Nick C.: Formulaic language and second language acquisition : Zipf and the phrasal teddy bear. – Annual review of applied linguistics 32, 2012, 17-44.

Fenk-Oczlon, Gertraud; Fenk, August: Zipf’s tool analogy and word order. – Glottometrics 5, 2002, 22-28.

Ferrer i Cancho, Ramon: Hidden communication aspects in the exponent of Zipf’s law. – Glottometrics 11, 2005, 98-119.

Ferrer i Cancho, Ramon; Solé, Ricard V.: Two regimes in the frequency of words and the origins of complex lexicons : Zipf’s Law revisited. – Journal of quantitative linguistics 8/3, 2001, 165-173.

Ferrer i Cancho, Ramon; Servedio, Vito: Can simple models explain Zipf’s law for all exponents? – Glottometrics 11, 2005, 1-8.

Grzybek, Peter; Kelih, Emmerich: Häufigkeiten von Buchstaben / Graphemen / Phonemen : Konvergenzen des Rangierungsverhaltens. – Glottometrics 9, 2005, 62-73.

Hatzigeorgiu, Nick; Mikros, Georgios K.; [Karagiannis, Giorgios] Carayannis, George: Word Length, Word Frequencies and Zipf’s Law in the Greek Language. – Journal of quantitative linguistics 8/3, 2001, 175-185.

Hřebíček, Luděk: Zipf’s law and text. – Glottometrics 3, 2002, 27-38.

Kromer, Victor: Zipf’s law and its modification possibilities. – Glottometrics 5, 2002, 1-13.

Manin, Dmitrii Y.: Mandelbrot’s model for Zipf’s law : can Mandelbrot’s model explain Zipf’s law for language? – Journal of quantitative linguistics 16/3, 2009, 274-285.

Montemurro, Marcello A.; Zanette, D.: Frequency-rank distribution of words in large text samples: phenomenology and models. – Glottometrics 4, 2002, 87-98.

Németh, Géza; Zainkó, Csaba: Multilingual statistical text analysis, Zipf’s law and Hungarian speech generation. – Acta linguistica Hungarica: an international journal of linguistics 49/3-4, 2002, 385-405.

Piantadosi, Steven T.: Zipf’s word frequency law in natural language: a critical review and future directions. – Psychonomic bulletin & review 21, 2014, 1112-1130.

Pine, Julian M.; Freudenthal, Daniel; Krajewski, Grzegorz; Gobet, Fernand R.: Do young children have adult-like syntactic categories? : Zipf’s law and the case of the determiner. – Cognition 127/3, 2013, 345-360.

Prün, Claudia: A text linguistic hypothesis of G. K. Zipf. – Journal of quantitative linguistics 4, 1997, 244-251.

Prün, Claudia; Zipf, Robert: Biographical notes on G. K. Zipf [1902-1950]. – Glottometrics 3, 2002, 1-10.

Pustet, Regina: Zipf and His Heirs. – Language sciences 26/1, 2004, 1-25.

Rousseau, Ronald: George Kingsley Zipf [1902-1950]: life, ideas, his law and informetrics. – Glottometrics 3, 2002, 11-18.

Sigurd, Bengt; Eeg-Olofsson, Mats; Weijer, Joost van de: Word length, sentence length and frequency : Zipf revisited. – Studia linguistica : a journal of general linguistics 58/1, 2004, 37-52 | With data from English, Swedish and German.

Smith, Reginald: Investigation of the Zipf-plot of the extinct Meroitic language. – Glottometrics 15, 2007, 53-61.

Uhlířová, Ludmila: Zipf’s notion of “economy” on the text level. – Glottometrics 3, 2002, 39-60.

Wheeler, Eric S.: Zipf’s law and why it works everywhere. – Glottometrics 4, 2002, 45-48.

Zanette, D.; Montemurro, Marcello A.: Dynamics of Text Generation with Realistic Zipf’s Distribution. – Journal of quantitative linguistics 12/1, 2005, 29-40.

Zipf, George K.: Human behavior and the principle of least effort. – Cambridge, MA: Addison-Wesley Press, 1949.
