Zipf’s law – Of dwarves and giants

Imagine this: around 6 percent of the words you say and write are “the”, and that’s it: the is the most frequent word of the English language, and you use it far more often than any other word. But this fact is just the tip of the iceberg of a rather puzzling and remarkable property of human language. When you look at the frequency ranking of the top 20 words in English, namely: the of and to a in that it is was I for on you he be with as by at (cf. http://www.wordcount.org/main.php or another source, http://www.wordfrequency.info), the words occur according to a highly regular and systematic frequency distribution, the so-called Zipf’s law, named after the linguist George Kingsley Zipf (1902-1950) (cf. Pustet 2004). According to his work, the second most frequent word of a language appears half as often as the most frequent one, the third most frequent one a third as often, the fourth a fourth as often, and so on, until you get something like this:

And this works for all the words in a language, from highly frequent ones like the to rare ones like jellyfish. The frequency of a word is thus roughly 1 over its rank, following what is known as a Zipfian power law. Word frequencies in a natural language therefore vary enormously, which is not trivial at all: there are a few ‘giant’ words such as the or with and countless ‘dwarves’ such as ravioli or catamaran, and those few giants cover an enormous share of the language we produce.
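You can try this yourself on any text. The sketch below is a minimal, hypothetical illustration (the function name and the toy input are mine, not from any of the works cited): it ranks words by frequency and compares each observed frequency to the Zipfian prediction f(r) = f(1) / r, i.e. the top frequency divided by the rank. Real corpora fit the law only approximately, and estimating the exponent properly takes far more care than this.

```python
from collections import Counter

def zipf_check(text, top_n=5):
    """Rank words by raw frequency and pair each observed frequency
    with the Zipfian prediction f(r) = f(1) / r."""
    words = text.lower().split()
    counts = Counter(words).most_common(top_n)
    f1 = counts[0][1]  # frequency of the top-ranked word
    rows = []
    for rank, (word, freq) in enumerate(counts, start=1):
        predicted = f1 / rank  # what Zipf's law would expect at this rank
        rows.append((rank, word, freq, predicted))
    return rows

# Toy text built to be perfectly Zipfian: 12 'the', 6 'of', 4 'and', 3 'to'
toy = "the " * 12 + "of " * 6 + "and " * 4 + "to " * 3
for rank, word, freq, predicted in zipf_check(toy, top_n=4):
    print(rank, word, freq, predicted)
```

On real text the observed and predicted columns will diverge, but the overall 1-over-rank shape tends to survive, which is exactly the puzzle the literature below tries to explain.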

And this is not only true for English but for all languages for which data is so far available, even for languages that have not yet been deciphered, such as Meroitic (cf. Smith 2007), which could indicate that this pattern applies to all languages in the world. Just have a look at this:

(cf. Bentz et al. 2015 or Piantadosi 2014: 1117 for even more languages: Spanish, Russian, Greek, Portuguese, Chinese, Swahili, Chilean, Finnish, Estonian, French, Czech, Turkish, Polish, Basque, Maori, Tok Pisin)

It even holds, to some extent, for the roughly 470 words in this tiny little piece of blog:

But why is that? Many linguists have tried to figure this out and give a good reason for it; the longer-than-usual bibliography below gives an impression of that. For example, Altmann et al. (2011) claim that a word’s niche, that is, its characteristic properties and the contexts in which it is used, has a strong impact on its frequency in a language and on how that frequency changes over time. To put it simply, people start to use the word chat once the concept of chat is ‘invented’, and with that its total number of occurrences increases. However, this is only one of many explanations, notions or implications of the Zipfian distribution, the language riddle of giants and dwarves, and there is still a lot to explore about it.

For further reading explore the literature below.

Jonas Schreiber (FAU Erlangen-Nürnberg)
Intern at Brill’s Linguistic Bibliography

Bibliography

Altmann, Gabriel: Zipfian linguistics. – Glottometrics 3, 2002, 19-26.

Altmann, Eduardo G.; Pierrehumbert, Janet B.; Motter, Adilson E.: Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. – PLoS ONE 4(11): e7678, 2009. doi:10.1371/journal.pone.0007678

Baayen, R. Harald. Word frequency distributions. Dordrecht: Kluwer Academic Publishers, 2001.

Balasubrahmanyan, V. K.; Naranan, S.: Algorithmic information, complexity and Zipf’s law. – Glottometrics 4, 2002, 1-26.

Bentz, Christian; Verkerk, Annemarie; Kiela, Douwe; Hill, Felix; Buttery, Paula: Adaptive communication : languages with more non-native speakers tend to have fewer word forms. – PLoS ONE 10(6): e0128254, 2015. doi:10.1371/journal.pone.0128254.

Borin, Lars: Med Zipf mot framtiden – en integrerad lexikonresurs för svensk språkteknologi. – LexicoNordica 17, 2010, 35-54.

Dębowski, Lukasz: Zipf’s law against the text size: a half-rational model. – Glottometrics 4, 2002, 49-60.

Ellis, Nick C.: Formulaic language and second language acquisition : Zipf and the phrasal teddy bear. – Annual review of applied linguistics 32, 2012, 17-44.

Fenk-Oczlon, Gertraud; Fenk, August: Zipf’s tool analogy and word order. – Glottometrics 5, 2002, 22-28.

Ferrer i Cancho, Ramon: Hidden communication aspects in the exponent of Zipf’s law. – Glottometrics 11, 2005, 98-119.

Ferrer i Cancho, Ramon; Solé, Ricard V.: Two regimes in the frequency of words and the origins of complex lexicons : Zipf’s Law revisited. – Journal of quantitative linguistics 8/3, 2001, 165-173.

Ferrer i Cancho, Ramon; Servedio, Vito: Can simple models explain Zipf’s law for all exponents? – Glottometrics 11, 2005, 1-8.

Grzybek, Peter; Kelih, Emmerich: Häufigkeiten von Buchstaben / Graphemen / Phonemen : Konvergenzen des Rangierungsverhaltens. – Glottometrics 9, 2005, 62-73.

Hatzigeorgiu, Nick; Mikros, Georgios K.; [Karagiannis, Giorgios] Carayannis, George: Word Length, Word Frequencies and Zipf’s Law in the Greek Language. – Journal of quantitative linguistics 8/3, 2001, 175-185.

Hřebíček, Luděk: Zipf’s law and text. – Glottometrics 3, 2002, 27-38.

Kromer, Victor: Zipf’s law and its modification possibilities. – Glottometrics 5, 2002, 1-13.

Manin, Dmitrii Y.: Mandelbrot’s model for Zipf’s law : can Mandelbrot’s model explain Zipf’s law for language? – Journal of quantitative linguistics 16/3, 2009, 274-285.

Montemurro, Marcello A.; Zanette, D.: Frequency-rank distribution of words in large text samples: phenomenology and models. – Glottometrics 4, 2002, 87-98.

Németh, Géza; Zainkó, Csaba: Multilingual statistical text analysis, Zipf’s law and Hungarian speech generation. – Acta linguistica Hungarica: an international journal of linguistics 49/3-4, 2002, 385-405.

Piantadosi, Steven T.: Zipf’s word frequency law in natural language: a critical review and future directions. – Psychonomic bulletin & review 21, 2014, 1112-1130.

Pine, Julian M.; Freudenthal, Daniel; Krajewski, Grzegorz; Gobet, Fernand R.: Do young children have adult-like syntactic categories? : Zipf’s law and the case of the determiner. – Cognition 127/3, 2013, 345-360.

Prün, Claudia: A text linguistic hypothesis of G. K. Zipf. – Journal of quantitative linguistics 4, 1997, 244-251.

Prün, Claudia; Zipf, Robert: Biographical notes on G. K. Zipf [1902-1950]. – Glottometrics 3, 2002, 1-10.

Pustet, Regina: Zipf and His Heirs. – Language sciences 26/1, 2004, 1-25.

Rousseau, Ronald: George Kingsley Zipf [1902-1950]: life, ideas, his law and informetrics. – Glottometrics 3, 2002, 11-18.

Sigurd, Bengt; Eeg-Olofsson, Mats; Weijer, Joost van de: Word length, sentence length and frequency : Zipf revisited. – Studia linguistica : a journal of general linguistics 58/1, 2004, 37-52 | With data from English, Swedish and German.

Smith, Reginald: Investigation of the Zipf-plot of the extinct Meroitic language. – Glottometrics 15, 2007, 53-61.

Uhlířová, Ludmila: Zipf’s notion of “economy” on the text level. – Glottometrics 3, 2002, 39-60.

Wheeler, Eric S.: Zipf’s law and why it works everywhere. – Glottometrics 4, 2002, 45-48.

Zanette, D.; Montemurro, Marcello A.: Dynamics of Text Generation with Realistic Zipf’s Distribution. – Journal of quantitative linguistics 12/1, 2005, 29-40.

Zipf, George K.: Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley Press, 1949.

Semantic prosody

When we think of words like happen, cause, perfectly, or totally with respect to their contexts of use, they seem more or less neutral in their associative meaning. Anything could cause something to happen, whether perfectly or totally or whatever. To sum it up: any combination seems possible; words apparently combine without restrictions.

Yet when you look closely at corpus data for each of those examples, especially in KWIC (Key Word In Context) mode, you will see that the words used in combination with e.g. cause “group in interesting ways”, as Hoey (2005: 22) puts it:

From Hunston (2007, 251)

We see concern (14, 15), problems (13), anger (4), damage (8), misery (11) and several unpleasant diseases, including dizziness and vomiting (9), a kidney stone (6) or even inflammation of the liver (10). Now you could claim that this particular data set is simply biased. Stubbs (1995) and several others, however, found that over 90% of the occurrences of cause in the British National Corpus – a rather representative sample of the English language – are associated with negative meaning. Hunston (2002: 142) even writes that this “can be observed only by looking at a large number of instances of a word or phrase, because it relies on the typical use of a word or phrase.” The whole concept therefore seems to be corpus-driven rather than merely a corpus-based theory. So altogether it is not a marginal phenomenon at all, and many further examples can be found by having a closer look at corpus data.
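The KWIC view shown above is simple to reproduce in principle. The following is a minimal sketch of the idea (the function name and the toy sentence are my own illustrative assumptions, not from the works cited): for each occurrence of a keyword, print it with a few tokens of left and right context, so that recurring collocates like the negative objects of cause become visible at a glance. Real concordancers and corpus query tools handle tokenisation, lemmatisation and sorting far more carefully.

```python
def kwic(tokens, keyword, window=4):
    """Return Key Word In Context lines: the keyword in brackets with
    up to `window` tokens of left and right context per occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Toy example (invented): two occurrences of 'cause'
toks = "smoking can cause serious damage and stress may cause problems".split()
for line in kwic(toks, "cause", window=2):
    print(line)
```

Scanning down the right-hand context column of such lines over a large corpus is exactly how the “grouping in interesting ways” that Hoey describes becomes observable.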

Sinclair (1991: 112) sums it up as follows: “[M]any uses of words and phrases show a tendency to occur in a certain semantic environment, for example the word happen is associated with unpleasant things – accidents and the like.” Or, as Vincent Vega in Quentin Tarantino’s Pulp Fiction states…

This phenomenon is called semantic prosody or discourse prosody: a “consistent aura of meaning” (Louw 1993: 157) emerging around words as they are frequently used in certain environments, as shown above. The expression was first introduced by Louw (1993), following John Rupert Firth’s description of prosody in phonological terms: “Firth (1957) argued that when we pronounce a word such as /ʃɪp/ our mouth is already shaping the [ɪ] sound even as it makes the [ʃ] sound” (Hoey 2005: 22). Just as the sounds of a word interact while we pronounce it, a word’s meaning seems to interact with the meanings of its habitual neighbours, emerging from usage.

So now we know for sure what to expect when we ask ourselves what could possibly happen…

For further reading explore the literature below.

Jonas Schreiber (FAU Erlangen-Nürnberg)
Intern at Brill’s Linguistic Bibliography

Bibliography

Begagić, Mirna: Semantic preference and semantic prosody of the collocation make sense. – Jezikoslovlje 14/2-3, 2013, 403-416.

Hoey, Michael: Lexical priming : a new theory of words and language. – London : Routledge, 2005.

Hunston, Susan: Corpora in applied linguistics. – Cambridge : Cambridge UP, 2002.

Hunston, Susan: Semantic prosody revisited. – International journal of corpus linguistics 12/2, 2007, 249-268.

Louw, Bill: Irony in the text or insincerity in the writer : the diagnostic potential of semantic prosodies. – In: Text and technology : in honour of John Sinclair / Ed. by Mona Baker ; Gill Francis ; Elena Tognini-Bonelli. – Amsterdam : Benjamins, 1993, 157-176.

Morley, John; Partington, Alan Scott: A few frequently asked questions about semantic – or evaluative – prosody. – International journal of corpus linguistics 14/2, 2009, 139-158.

Partington, Alan Scott: “Utterly content in each other’s company” : semantic prosody and semantic preference. – International journal of corpus linguistics 9/1, 2004, 131-156.

Sinclair, John McH.: Corpus, concordance, collocation. – Oxford : Oxford UP, 1991.

Stubbs, Michael W.: Collocations and semantic profiles : on the cause of the trouble with quantitative studies. – Functions of language 2/1, 1995, 23-55.