What is Zipf’s law?
In 1935, the American linguist George Kingsley Zipf set out to explain a peculiarity he had noticed about the way we use words in a language: very few words are used frequently, while most words are used very rarely. When he ranked words by their popularity, a pattern surfaced. The most popular word was used about twice as often as the second most popular word and three times as often as the third most frequently used word.
But he soon realized that this pattern was not limited to words in a language.
The same pattern has since been noticed across a wide range of datasets, including neural activity, firm sizes, city sizes, and amino acid sequences, and has been named Zipf’s law.
It establishes a relationship between rank order and frequency of occurrence: when we rank observations by their frequency, the frequency of an observation is inversely proportional to its rank.
Even though it was originally formulated for word frequencies, Zipf’s law has been observed in a wide range of domains: city sizes, firm sizes, mutual fund sizes, amino acid sequences, neural activity, the genome, family names, income, financial markets, internet file sizes, and human behavior. There are models that explain Zipf’s law in each of these domains, but these explanations tend to be domain-specific.
Recently, techniques from statistical physics were used to demonstrate that a fairly broad class of models provides a general explanation of Zipf’s law. The explanation rests on the observation that real-world data is often generated from underlying, unobserved causes, known as latent variables. A latent variable mixes together several component models, none of which obeys Zipf’s law on its own, yet the combined model does.
This explanation is made up of two parts. The first part is that Zipf’s law implies a broad range of frequencies. Mora and Bialek quantified this notion by demonstrating that a perfectly flat distribution over a range of frequencies is mathematically equivalent to Zipf’s law over that range, a result that applies in any domain. But it is critical to understand the realistic case: how a finite range of frequencies with an uneven distribution could lead to something similar to, but not exactly, Zipf’s law. Extending Mora and Bialek’s result, a general relationship can be derived that quantifies deviations from Zipf’s law for arbitrary distributions over frequency, from very broad to very narrow, and also for multi-modal distributions. This relationship shows that Zipf’s law emerges whenever the distribution over frequency is broad enough, even if it is not particularly flat. The second part is that latent variables can, but do not have to, induce a broad range of frequencies. It has been demonstrated, both theoretically and empirically, that in a range of important domains latent variables do give rise to a broad range of frequencies, and therefore to Zipf’s law.
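To make the latent-variable mechanism concrete, here is a minimal Python sketch. It is a stylized illustration, not the model from the literature: each individual geometric distribution decays exponentially and is far from Zipfian, but marginalizing over a latent scale drawn from a broad, log-uniform range yields an approximately Zipfian marginal, with r x freq(r) / N roughly constant over the middle ranks. The range [1, 1000] is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

# Latent variable: a scale drawn log-uniformly over a broad range.
# The breadth of this range is the key assumption; see the text above.
scale = np.exp(rng.uniform(np.log(1.0), np.log(1000.0), size=n_samples))

# One observation per latent draw: a geometric count with mean ~ 1 + scale.
# Any single geometric distribution decays exponentially, not Zipfian.
obs = rng.geometric(p=1.0 / (1.0 + scale))

# Rank values by frequency and check whether r x freq(r) / N is constant.
values, counts = np.unique(obs, return_counts=True)
freq = np.sort(counts)[::-1]

for r in [3, 10, 30, 100, 300]:
    print(f"rank {r:>4}: r x Prob(r) = {r * freq[r - 1] / n_samples:.3f}")
```

Narrowing the latent range (say, to [1, 2]) collapses the mixture back toward a single geometric distribution and the constancy disappears, which is exactly the claim that a broad range of frequencies is what produces Zipf’s law.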
Is Zipf's law the biggest mystery in computational linguistics?
Sander Lestrade, a linguist at Radboud University in Nijmegen, the Netherlands, says that Zipf's law can safely be considered the biggest mystery in computational linguistics. According to him, in spite of decades of theorizing, the origins of Zipf’s law remain elusive.
Lestrade demonstrated that Zipf’s law can be explained by the interaction between the structure of sentences (syntax) and the meaning of words (semantics) in a text. Using computer simulations, he showed that neither syntax nor semantics suffices to induce a Zipfian distribution on its own; syntax and semantics 'need' each other for that.
What is the formula for Zipf’s law?
Let us say that r is the rank of an observation.
Prob(r) is the probability of the observation at rank r.
freq(r) is the number of times the observation at rank r appears in the dataset.
N is the total number of observations in a dataset. It is not the number of unique observations.
We know that Prob(r) = freq(r)/N
According to Zipf’s law,
r x Prob(r) = A
A is a constant that is determined empirically from the data. For word frequencies in natural-language text, A is approximately 0.1.
Zipf’s law is a statistical law: it holds for most observations, but not all.
Since Prob(r) = freq(r)/N, Zipf’s law can be rewritten like this:
r x freq(r) = A x N
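To make the arithmetic concrete, here is a minimal sketch, assuming the commonly cited A = 0.1 and a hypothetical corpus of N = 1,000,000 words:

```python
# Expected frequency at rank r under Zipf's law, from r x freq(r) = A x N.
# A = 0.1 and N = 1,000,000 are illustrative assumptions, not measured values.
A = 0.1
N = 1_000_000

for r in [1, 2, 3, 10, 100]:
    print(f"rank {r}: expected frequency ~ {A * N / r:,.0f}")
```

This reproduces the pattern Zipf noticed: the rank-1 word at about 100,000 occurrences, the rank-2 word at about 50,000, and the rank-3 word at about 33,000.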
How do you verify Zipf’s law?
To verify Zipf’s law, we calculate the frequency of every observation in a dataset, rank the observations by frequency, and compute r x freq(r), checking whether it is approximately the same across the dataset. It does not need to be an exact match for every single observation, but it should be a close match for most observations.
Keep in mind that Zipf’s law deviates most for the most frequent and the least frequent observations. Avoid judging it solely on those extremes.
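Here is one way to put this procedure into code: a sketch assuming a plain-text corpus in a file named corpus.txt (a placeholder; substitute any large text file). Per the caveat above, it samples ranks away from both extremes.

```python
from collections import Counter

# "corpus.txt" is a placeholder; substitute any large plain-text file.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

N = len(words)                          # total observations, not unique ones
ranked = Counter(words).most_common()   # (word, freq), most frequent first

# r x Prob(r) = r x freq(r) / N should be roughly constant (that constant is A).
for r in [10, 30, 100, 300, 1000]:
    if r <= len(ranked):
        word, freq = ranked[r - 1]
        print(f"rank {r:>4} ({word!r}): r x Prob(r) = {r * freq / N:.3f}")
```

If the printed values cluster around a single number, the dataset is behaving Zipf-like, and that number is the empirical estimate of A.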
The best way to verify Zipf’s law is to plot it on a graph.
Plot log(r) on the x-axis and log(freq(r)) on the y-axis. If the points fall on a line with a slope of -1, Zipf’s law holds for the dataset. In that case, if the line intersects the x-axis at point A and the y-axis at point B, and O is the origin, then OA equals OB.
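Here is a sketch of that plot using matplotlib, reusing the same placeholder corpus; Zipf’s law predicts a fitted slope close to -1:

```python
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

# Frequencies sorted from most to least common; ranks are 1, 2, 3, ...
freqs = np.array(sorted(Counter(words).values(), reverse=True))
ranks = np.arange(1, len(freqs) + 1)

# Fit a line to log(freq) vs log(rank); Zipf's law predicts slope ~ -1.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"fitted slope: {slope:.2f}")

plt.loglog(ranks, freqs, marker=".", linestyle="none")
plt.xlabel("rank (log scale)")
plt.ylabel("frequency (log scale)")
plt.show()
```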
Do all languages follow Zipf’s law?
Not all languages follow Zipf’s law perfectly, but it holds at least approximately in almost all languages, including extinct languages and languages that have not yet been deciphered.