A remarkable phenomenon in probability theory is that of universality – that many seemingly unrelated probability distributions, which ostensibly involve large numbers of unknown parameters, can end up converging to a universal law that may only depend on a small handful of parameters. One of the most famous examples of the universality phenomenon is the central limit theorem; another rich source of examples comes from random matrix theory, which is one of the areas of my own research.
Analogous universality phenomena also show up in empirical distributions – the distributions of a statistic $X$ from a large population of “real-world” objects. Examples include Benford’s law, Zipf’s law, and the Pareto distribution (of which the Pareto principle or 80-20 law is a special case). These laws govern the asymptotic distribution of many statistics $X$ which
- (i) take values as positive numbers;
- (ii) range over many different orders of magnitude;
- (iii) arise from a complicated combination of largely independent factors (with different samples of $X$ arising from different independent factors); and
- (iv) have not been artificially rounded, truncated, or otherwise constrained in size.
Examples here include the population of countries or cities, the frequency of occurrence of words in a language, the mass of astronomical objects, or the net worth of individuals or corporations. The laws are then as follows:
- Benford’s law: For $k = 1, \ldots, 9$, the proportion of $X$ whose first digit is $k$ is approximately $\log_{10} \frac{k+1}{k}$. Thus, for instance, $X$ should have a first digit of $1$ about $30\%$ of the time, but a first digit of $9$ only about $5\%$ of the time.
- Zipf’s law: The $n^{\text{th}}$ largest value of $X$ should obey an approximate power law, i.e. it should be approximately $C n^{-\alpha}$ for the first few $n = 1, 2, 3, \ldots$ and some parameters $C, \alpha > 0$. In many cases, $\alpha$ is close to $1$.
- Pareto distribution: The proportion of $X$ with at least $m$ digits (before the decimal point), where $m$ is above the median number of digits, should obey an approximate exponential law, i.e. be approximately of the form $c \, 10^{-m/\alpha}$ for some $c, \alpha > 0$. Again, in many cases $\alpha$ is close to $1$.
Benford’s law and the Pareto distribution are stated here for base $10$, which is what we are most familiar with, but the laws hold for any base (after replacing all the occurrences of $10$ in the above laws with the new base, of course). The laws tend to break down if the hypotheses (i)-(iv) are dropped. For instance, if the statistic $X$ concentrates around its mean (as opposed to being spread over many orders of magnitude), then the normal distribution tends to be a much better model (as indicated by such results as the central limit theorem). If instead the various samples of the statistic are highly correlated with each other, then other laws can arise (for instance, the eigenvalues of a random matrix, as well as many empirically observed matrices, are correlated to each other, with the behaviour of the largest eigenvalues being governed by laws such as the Tracy-Widom law rather than Zipf’s law, and the bulk distribution being governed by laws such as the semicircular law rather than the normal or Pareto distributions).
To illustrate these laws, let us take as a data set the populations of 235 countries and regions of the world in 2007 (using the CIA world factbook); I have put the raw data here. This is a relatively small sample (cf. my previous post), but is already enough to discern these laws in action. For instance, here is how the data set tracks with Benford’s law (rounded to three significant figures):
Digit | Countries | Number | Benford prediction |
1 | Angola, Anguilla, Aruba, Bangladesh, Belgium, Botswana, Brazil, Burkina Faso, Cambodia, Cameroon, Chad, Chile, China, Christmas Island, Cook Islands, Cuba, Czech Republic, Ecuador, Estonia, Gabon, (The) Gambia, Greece, Guam, Guatemala, Guinea-Bissau, India, Japan, Kazakhstan, Kiribati, Malawi, Mali, Mauritius, Mexico, (Federated States of) Micronesia, Nauru, Netherlands, Niger, Nigeria, Niue, Pakistan, Portugal, Russia, Rwanda, Saint Lucia, Saint Vincent and the Grenadines, Senegal, Serbia, Swaziland, Syria, Timor-Leste (East-Timor), Tokelau, Tonga, Trinidad and Tobago, Tunisia, Tuvalu, (U.S.) Virgin Islands, Wallis and Futuna, Zambia, Zimbabwe | 59 (25.1%) | 71 (30.1%) |
2 | Armenia, Australia, Barbados, British Virgin Islands, Cote d’Ivoire, French Polynesia, Ghana, Gibraltar, Indonesia, Iraq, Jamaica, (North) Korea, Kosovo, Kuwait, Latvia, Lesotho, Macedonia, Madagascar, Malaysia, Mayotte, Mongolia, Mozambique, Namibia, Nepal, Netherlands Antilles, New Caledonia, Norfolk Island, Palau, Peru, Romania, Saint Martin, Samoa, San Marino, Sao Tome and Principe, Saudi Arabia, Slovenia, Sri Lanka, Svalbard, Taiwan, Turks and Caicos Islands, Uzbekistan, Vanuatu, Venezuela, Yemen | 44 (18.7%) | 41 (17.6%) |
3 | Afghanistan, Albania, Algeria, (The) Bahamas, Belize, Brunei, Canada, (Rep. of the) Congo, Falkland Islands (Islas Malvinas), Iceland, Kenya, Lebanon, Liberia, Liechtenstein, Lithuania, Maldives, Mauritania, Monaco, Morocco, Oman, (Occupied) Palestinian Territory, Panama, Poland, Puerto Rico, Saint Kitts and Nevis, Uganda, United States of America, Uruguay, Western Sahara | 29 (12.3%) | 29 (12.5%) |
4 | Argentina, Bosnia and Herzegovina, Burma (Myanmar), Cape Verde, Cayman Islands, Central African Republic, Colombia, Costa Rica, Croatia, Faroe Islands, Georgia, Ireland, (South) Korea, Luxembourg, Malta, Moldova, New Zealand, Norway, Pitcairn Islands, Singapore, South Africa, Spain, Sudan, Suriname, Tanzania, Ukraine, United Arab Emirates | 27 (11.5%) | 22 (9.69%) |
5 | (Macao SAR) China, Cocos Islands, Denmark, Djibouti, Eritrea, Finland, Greenland, Italy, Kyrgyzstan, Montserrat, Nicaragua, Papua New Guinea, Slovakia, Solomon Islands, Togo, Turkmenistan | 16 (6.81%) | 19 (7.92%) |
6 | American Samoa, Bermuda, Bhutan, (Dem. Rep. of the) Congo, Equatorial Guinea, France, Guernsey, Iran, Jordan, Laos, Libya, Marshall Islands, Montenegro, Paraguay, Sierra Leone, Thailand, United Kingdom | 17 (7.23%) | 16 (6.69%) |
7 | Bahrain, Bulgaria, (Hong Kong SAR) China, Comoros, Cyprus, Dominica, El Salvador, Guyana, Honduras, Israel, (Isle of) Man, Saint Barthelemy, Saint Helena, Saint Pierre and Miquelon, Switzerland, Tajikistan, Turkey | 17 (7.23%) | 14 (5.80%) |
8 | Andorra, Antigua and Barbuda, Austria, Azerbaijan, Benin, Burundi, Egypt, Ethiopia, Germany, Haiti, Holy See (Vatican City), Northern Mariana Islands, Qatar, Seychelles, Vietnam | 15 (6.38%) | 12 (5.12%) |
9 | Belarus, Bolivia, Dominican Republic, Fiji, Grenada, Guinea, Hungary, Jersey, Philippines, Somalia, Sweden | 11 (4.68%) | 11 (4.58%) |
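The comparison in this table can be recomputed from the first-digit counts alone. The following is a minimal Python sketch, not the code used to prepare the post; the chi-square summary at the end is an extra diagnostic introduced here for illustration and does not appear in the original analysis.

```python
import math

# First-digit counts for the 235 populations, copied from the table above.
observed = {1: 59, 2: 44, 3: 29, 4: 27, 5: 16, 6: 17, 7: 17, 8: 15, 9: 11}


def benford_probability(digit: int) -> float:
    """Benford's predicted proportion of values whose first digit is `digit`."""
    return math.log10((digit + 1) / digit)


total = sum(observed.values())
expected = {d: total * benford_probability(d) for d in observed}

# Pearson chi-square statistic with 8 degrees of freedom (an added,
# hypothetical summary measure; the post only compares the raw counts).
chi_square = sum((observed[d] - expected[d]) ** 2 / expected[d] for d in observed)

for d in sorted(observed):
    print(f"{d}: observed {observed[d]:3d} ({observed[d] / total:.1%}), "
          f"expected {expected[d]:5.1f} ({benford_probability(d):.1%})")
print(f"chi-square = {chi_square:.2f}")
```

The statistic comes out at roughly 5, well below the usual 5% critical value of about 15.5 for 8 degrees of freedom, so this small sample shows no significant departure from Benford’s law.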
Here is how the same data tracks Zipf’s law for the first twenty values of $n$, with the parameters $C \approx 1.28 \times 10^9$ and $\alpha \approx 1.03$ (selected by log-linear regression), again rounding to three significant figures:
$n$ | Country | Population | Zipf prediction | Deviation from prediction |
1 | China | 1,330,000,000 | 1,280,000,000 | |
2 | India | 1,150,000,000 | 626,000,000 | |
3 | USA | 304,000,000 | 412,000,000 | |
4 | Indonesia | 238,000,000 | 307,000,000 | |
5 | Brazil | 196,000,000 | 244,000,000 | |
6 | Pakistan | 173,000,000 | 202,000,000 | |
7 | Bangladesh | 154,000,000 | 172,000,000 | |
8 | Nigeria | 146,000,000 | 150,000,000 | |
9 | Russia | 141,000,000 | 133,000,000 | |
10 | Japan | 128,000,000 | 120,000,000 | |
11 | Mexico | 110,000,000 | 108,000,000 | |
12 | Philippines | 96,100,000 | 98,900,000 | |
13 | Vietnam | 86,100,000 | 91,100,000 | |
14 | Ethiopia | 82,600,000 | 84,400,000 | |
15 | Germany | 82,400,000 | 78,600,000 | |
16 | Egypt | 81,700,000 | 73,500,000 | |
17 | Turkey | 71,900,000 | 69,100,000 | |
18 | Congo | 66,500,000 | 65,100,000 | |
19 | Iran | 65,900,000 | 61,600,000 | |
20 | Thailand | 65,500,000 | 58,400,000 |
As one sees, Zipf’s law is not particularly precise at the extreme edge of the statistics (when $n$ is very small), but becomes reasonably accurate (given the small sample size, and given that we are fitting twenty data points using only two parameters) for moderate sizes of $n$.
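The log-linear regression behind the Zipf table can be sketched as follows. This is an illustrative reconstruction from the rounded populations listed above, not the original computation; the deviation is reported as (actual − predicted) / predicted, which is an assumed sign convention.

```python
import math

# Top twenty populations (three significant figures), copied from the table above.
populations = [
    1_330_000_000, 1_150_000_000, 304_000_000, 238_000_000, 196_000_000,
    173_000_000, 154_000_000, 146_000_000, 141_000_000, 128_000_000,
    110_000_000, 96_100_000, 86_100_000, 82_600_000, 82_400_000,
    81_700_000, 71_900_000, 66_500_000, 65_900_000, 65_500_000,
]


def fit_zipf(values):
    """Least-squares fit of log(value) = log(C) - alpha * log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    alpha = -slope
    scale = math.exp(y_mean + alpha * x_mean)  # the constant C in C * n**(-alpha)
    return scale, alpha


C, alpha = fit_zipf(populations)
print(f"C = {C:.3g}, alpha = {alpha:.3f}")

for rank, actual in enumerate(populations, start=1):
    predicted = C * rank ** (-alpha)
    deviation = (actual - predicted) / predicted  # assumed sign convention
    print(f"{rank:2d}  actual {actual:>13,}  predicted {predicted:>15,.0f}  {deviation:+.1%}")
```

Fitting in log-log coordinates turns the power law $C n^{-\alpha}$ into a straight line, so ordinary least squares suffices; the fitted values agree with the parameters implied by the prediction column of the table.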
This data set has too few scales in base $10$ to illustrate the Pareto distribution effectively – over half of the country populations are either seven or eight digits in that base. But if we instead work in base $2$, then country populations range in a decent number of scales (the majority of countries have population between $2^{22}$ and $2^{31}$), and we begin to see the law emerge, where $m$ is now the number of digits in binary, and the best-fit parameters are $\alpha \approx 1.18$ and $c \approx 1.7 \times 2^{26} / 235$:
$m$ | Countries with $\geq m$ binary digit populations | Number | Pareto prediction |
31 | China, India | 2 | 1 |
30 | “ | 2 | 2 |
29 | “, United States of America | 3 | 5 |
28 | “, Indonesia, Brazil, Pakistan, Bangladesh, Nigeria, Russia | 9 | 8 |
27 | “, Japan, Mexico, Philippines, Vietnam, Ethiopia, Germany, Egypt, Turkey | 17 | 15 |
26 | “, (Dem. Rep. of the) Congo, Iran, Thailand, France, United Kingdom, Italy, South Africa, (South) Korea, Burma (Myanmar), Ukraine, Colombia, Spain, Argentina, Sudan, Tanzania, Poland, Kenya, Morocco, Algeria | 36 | 27 |
25 | “, Canada, Afghanistan, Uganda, Nepal, Peru, Iraq, Saudi Arabia, Uzbekistan, Venezuela, Malaysia, (North) Korea, Ghana, Yemen, Taiwan, Romania, Mozambique, Sri Lanka, Australia, Cote d’Ivoire, Madagascar, Syria, Cameroon | 58 | 49 |
24 | “, Netherlands, Chile, Kazakhstan, Burkina Faso, Cambodia, Malawi, Ecuador, Niger, Guatemala, Senegal, Angola, Mali, Zambia, Cuba, Zimbabwe, Greece, Portugal, Belgium, Tunisia, Czech Republic, Rwanda, Serbia, Chad, Hungary, Guinea, Belarus, Somalia, Dominican Republic, Bolivia, Sweden, Haiti, Burundi, Benin | 91 | 88 |
23 | “, Austria, Azerbaijan, Honduras, Switzerland, Bulgaria, Tajikistan, Israel, El Salvador, (Hong Kong SAR) China, Paraguay, Laos, Sierra Leone, Jordan, Libya, Papua New Guinea, Togo, Nicaragua, Eritrea, Denmark, Slovakia, Kyrgyzstan, Finland, Turkmenistan, Norway, Georgia, United Arab Emirates, Singapore, Bosnia and Herzegovina, Croatia, Central African Republic, Moldova, Costa Rica | 123 | 159 |
Thus, with each new scale, the number of countries introduced increases by a factor of a little less than $2$, on the average. This approximate doubling of countries with each new scale begins to falter at about the population $2^{22}$ (i.e. at around $4$ million), for the simple reason that one has begun to run out of countries. (Note that the median-population country in this set, Singapore, has a population with $23$ binary digits.)
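The “a little less than doubling” claim can be checked directly from the two numeric columns of the binary table. The sketch below is an illustration added here; the helper name `mean_growth_factor` is hypothetical, and the geometric mean is just one reasonable way to average the per-scale ratios.

```python
# Cumulative number of countries with at least m binary digits (table above),
# together with the corresponding Pareto predictions.
observed = {31: 2, 30: 2, 29: 3, 28: 9, 27: 17, 26: 36, 25: 58, 24: 91, 23: 123}
predicted = {31: 1, 30: 2, 29: 5, 28: 8, 27: 15, 26: 27, 25: 49, 24: 88, 23: 159}


def mean_growth_factor(counts):
    """Geometric mean of count(m - 1) / count(m) over the tabulated scales."""
    scales = sorted(counts)
    steps = scales[-1] - scales[0]
    return (counts[scales[0]] / counts[scales[-1]]) ** (1 / steps)


observed_factor = mean_growth_factor(observed)
predicted_factor = mean_growth_factor(predicted)
print(f"observed growth per binary scale:  {observed_factor:.2f}")
print(f"predicted growth per binary scale: {predicted_factor:.2f}")
```

Both averages fall below $2$; the observed one is dragged down further by the last scale, where the supply of countries starts to run out.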
These laws are not merely interesting statistical curiosities; for instance, Benford’s law is often used to help detect fraudulent statistics (such as those arising from accounting fraud), as many such statistics are invented by choosing digits at random, and will therefore deviate significantly from Benford’s law. (This is nicely discussed in Robert Matthews’ New Scientist article “The power of one”; this article can also be found on the web at a number of other places.) In a somewhat analogous spirit, Zipf’s law and the Pareto distribution can be used to mathematically test various models of real-world systems (e.g. formation of astronomical objects, accumulation of wealth, population growth of countries, etc.), without necessarily having to fit all the parameters of that model with the actual data.
Being empirically observed phenomena rather than abstract mathematical facts, Benford’s law, Zipf’s law, and the Pareto distribution cannot be “proved” the same way a mathematical theorem can be proved. However, one can still support these laws mathematically in a number of ways, for instance showing how these laws are compatible with each other, and with other plausible hypotheses on the source of the data. In this post I would like to describe a number of ways (both technical and non-technical) in which one can do this; these arguments do not fully explain these laws (in particular, the empirical fact that the exponent $\alpha$ in Zipf’s law or the Pareto distribution is often close to $1$ is still quite a mysterious phenomenon), and do not always have the same universal range of applicability as these laws seem to have, but I hope that they do demonstrate that these laws are not completely arbitrary, and ought to have a satisfactory basis of mathematical support.
— 1. Scale invariance —
One consistency check that is enjoyed by all of these laws is that of scale invariance – they are invariant under rescalings of the data (for instance, by changing the units).
For example, suppose for sake of argument that the country populations $X$ of the world in 2007 obey Benford’s law, thus for instance about $30.1\%$ of the countries have population with first digit $1$, $17.6\%$ have population with first digit $2$, and so forth. Now, imagine that several decades in the future, say in 2067, all of the countries in the world double their population, from $X$ to a new population $\tilde X := 2X$. (This makes the somewhat implausible assumption that growth rates are uniform across all countries; I will talk about what happens when one omits this hypothesis later.) To further simplify the experiment, suppose that no countries are created or dissolved during this time period. What happens to Benford’s law when passing from $X$ to $\tilde X$?
The key observation here, of course, is that the first digit of $X$ is linked to the first digit of $\tilde X = 2X$. If, for instance, the first digit of $X$ is $1$, then the first digit of $\tilde X$ is either $2$ or $3$; conversely, if the first digit of $\tilde X$ is $2$ or $3$, then the first digit of $X$ is $1$. As a consequence, the proportion of $X$’s with first digit $1$ is equal to the proportion of $\tilde X$’s with first digit $2$, plus the proportion of $\tilde X$’s with first digit $3$. This is consistent with Benford’s law holding for both $X$ and $\tilde X$, since
$$\log_{10} \frac{2}{1} = \log_{10} \frac{3}{2} + \log_{10} \frac{4}{3} \quad \left( = \log_{10} \frac{4}{2} \right)$$
(or numerically, $30.1\% \approx 17.6\% + 12.5\%$ after rounding). Indeed one can check the other digit ranges also and conclude that Benford’s law for $X$ is compatible with Benford’s law for $\tilde X$; to pick a contrasting example, a uniformly distributed model in which each digit from $1$ to $9$ occurs as the first digit of $X$ with probability $1/9$ totally fails to be preserved under doubling.
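The digit-by-digit bookkeeping for doubling can be checked mechanically. In the sketch below (an illustration added here, with hypothetical helper names), `DOUBLING` records which first digits of $2X$ arise from each first digit of $X$: a leading $1$ becomes $2$ or $3$, a leading $2$ becomes $4$ or $5$, and so on, while every $X$ with leading digit $5$ to $9$ doubles to a number with leading digit $1$.

```python
import math

# First digit of X -> possible first digits of 2X (digits 5..9 all map to 1).
DOUBLING = {1: (2, 3), 2: (4, 5), 3: (6, 7), 4: (8, 9)}


def benford(digit: int) -> float:
    """Benford's first-digit law."""
    return math.log10((digit + 1) / digit)


def uniform(digit: int) -> float:
    """The contrasting model: every first digit equally likely."""
    return 1 / 9


def is_compatible_with_doubling(law) -> bool:
    """Check that `law` can hold for X and for 2X simultaneously."""
    checks = [
        math.isclose(law(d), sum(law(k) for k in images))
        for d, images in DOUBLING.items()
    ]
    # X has first digit 5, 6, 7, 8 or 9 exactly when 2X has first digit 1.
    checks.append(math.isclose(sum(law(d) for d in range(5, 10)), law(1)))
    return all(checks)


print("Benford:", is_compatible_with_doubling(benford))
print("Uniform:", is_compatible_with_doubling(uniform))
```

The Benford check succeeds because each sum telescopes, whereas the uniform model would need $1/9 = 2/9$.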
One can be even more precise. Observe (through telescoping series) that Benford’s law implies that
$$\mathbf{P}( 10^n \leq X < \alpha 10^n \text{ for some integer } n ) = \log_{10} \alpha \qquad (1)$$
for all integers $1 \leq \alpha < 10$, where the left-hand side denotes the proportion of data for which $X$ lies between $10^n$ and $\alpha 10^n$ for some integer $n$. Suppose now that we generalise Benford’s law to the continuous Benford’s law, which asserts that (1) is true for all real numbers $1 \leq \alpha < 10$. Then it is not hard to show that a statistic $X$ obeys the continuous Benford’s law if and only if its dilate $\tilde X = 2X$ does, and similarly with $2$ replaced by any other constant growth factor. (This is most easily seen by observing that (1) is equivalent to asserting that the fractional part of $\log_{10} X$ is uniformly distributed.) In fact, the continuous Benford law is the only distribution for the quantities on the left-hand side of (1) with this scale-invariance property; this fact is a special case of the general fact that Haar measures are unique (see e.g. these lecture notes).
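A quick simulation makes the scale invariance of the continuous law concrete. The sketch below is illustrative only: it draws samples whose $\log_{10}$ is uniform over six orders of magnitude (an arbitrary choice that makes the fractional part of $\log_{10} X$ uniform), dilates them by a few arbitrary constants, and measures how far the left-hand side of the continuous Benford law strays from $\log_{10} a$ for several real thresholds $a$ between $1$ and $10$.

```python
import math
import random


def benford_statistic(samples, a: float) -> float:
    """Proportion of samples x with 10**n <= x < a * 10**n for some integer n."""
    threshold = math.log10(a)
    hits = sum(1 for x in samples if math.log10(x) % 1.0 < threshold)
    return hits / len(samples)


rng = random.Random(0)
# log10(X) uniform on [0, 6]: its fractional part is then uniform on [0, 1).
samples = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

max_error = 0.0
for factor in (1.0, 2.0, 3.7, 0.01):           # arbitrary dilation constants
    dilated = [factor * x for x in samples]
    for a in (1.5, 2.0, 5.5, 9.0):             # arbitrary real thresholds
        error = abs(benford_statistic(dilated, a) - math.log10(a))
        max_error = max(max_error, error)

print(f"largest deviation from log10(a): {max_error:.4f}")
```

The deviation stays at the level of sampling noise for every dilation factor, which is what uniformity of the fractional part of $\log_{10} X$ predicts: multiplying by a constant merely shifts $\log_{10} X$.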
It is also easy to see that Zipf’s law and the Pareto distribution also enjoy this sort of scale-invariance property, as long as one generalises the Pareto distribution
$$\mathbf{P}( X \geq 10^m ) = c \, 10^{-m/\alpha} \qquad (2)$$
from integer $m$ to real $m$, just as with Benford’s law. Once one does that, one can phrase the Pareto distribution law independently of any base as
$$\mathbf{P}( X \geq x ) = c \, x^{-1/\alpha} \qquad (3)$$
for any $x$ much larger than the median value of $X$, at which point the scale-invariance is easily seen.
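To spell out why the base-free form makes the scale invariance immediate, here is the one-line computation for a dilate $\lambda X$ by a fixed constant $\lambda > 0$ (the symbol $\lambda$ is introduced here purely for illustration), writing the Pareto law as $\mathbf{P}(X \geq x) \approx c \, x^{-1/\alpha}$:

```latex
\mathbf{P}(\lambda X \geq x)
  = \mathbf{P}(X \geq x/\lambda)
  \approx c \, (x/\lambda)^{-1/\alpha}
  = \left( c \, \lambda^{1/\alpha} \right) x^{-1/\alpha}
```

So $\lambda X$ obeys the same law with the same exponent $\alpha$, and only the constant $c$ changes to $c \lambda^{1/\alpha}$. Likewise, the $n^{\text{th}}$ largest value of $\lambda X$ is $\lambda$ times the $n^{\text{th}}$ largest value of $X$, so Zipf’s law $C n^{-\alpha}$ persists with $C$ replaced by $\lambda C$.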
One may object that the above thought-experiment was too idealised, because it assumed uniform growth rates for all the statistics at once. What happens if there are non-uniform growth rates? To keep the computations simple, let us consider the following toy model, where we take the same 2007 population statistics $X$ as before, and assume that half of the countries (the “high-growth” countries) will experience a population doubling by 2067, while the other half (the “zero-growth” countries) will keep their population constant, thus the 2067 population statistic $\tilde X$ is equal to $2X$ half the time and $X$ half the time. (We will assume that our sample sizes are large enough that the law of large numbers kicks in, and we will therefore ignore issues such as what happens to this “half the time” if the number of samples is odd.) Furthermore, we make the plausible but crucial assumption that the event that a country is a high-growth or a zero-growth country is independent of the first digit of the 2007 population; thus, for instance, a country whose population begins with $1$ is assumed to be just as likely to be high-growth as one whose population begins with $2$.
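Now let’s have a look again at the proportion of countries whose 2067 population $\tilde X$ begins with either $2$ or $3$. There are exactly two ways in which a country can fall into this category: either it is a zero-growth country whose 2007 population $X$ also began with either $2$ or $3$, or it was a high-growth country whose population in 2007 began with $1$. Since all countries have a probability $1/2$ of being high-growth regardless of the first digit of their population, we conclude the identity
$$\mathbf{P}( \tilde X \text{ has first digit } 2, 3 ) = \frac{1}{2} \mathbf{P}( X \text{ has first digit } 2, 3 ) + \frac{1}{2} \mathbf{P}( X \text{ has first digit } 1 ) \qquad (4)$$
which is once again compatible with Benford’s law for $\tilde X$ since
$$\log_{10} \frac{4}{2} = \frac{1}{2} \log_{10} \frac{4}{2} + \frac{1}{2} \log_{10} \frac{2}{1}.$$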
More generally, it is not hard to show that if $X$ obeys the continuous Benford’s law (1), and one multiplies $X$ by some positive multiplier $Y$ which is independent of the fractional part of $\log_{10} X$ (and hence, in particular, independent of the first digit of $X$), one obtains another quantity $\tilde X = XY$ which also obeys the continuous Benford’s law. (Indeed, we have already seen this to be the case when $Y$ is a deterministic constant, and the case when $Y$ is random then follows simply by conditioning $Y$ to be fixed.)
In particular, we see an absorptive property of Benford’s law: if $X$ obeys Benford’s law, and $Y$ is any positive statistic independent of $X$, then the product $XY$ also obeys Benford’s law – even if $Y$ did not obey this law. Thus, if a statistic is the product of many independent factors, then it only requires a single factor to obey Benford’s law in order for the whole product to obey the law also. For instance, the population of a country is the product of its area and its population density. Assuming that the population density of a country is independent of the size of that country (which is not a completely reasonable assumption, but let us take it for the sake of argument), then we see that Benford’s law for the population would follow if just one of the area or population density obeyed this law. It is also clear that Benford’s law is the only distribution with this absorptive property (if there were another law with this property, what would happen if one multiplied a statistic with that law with an independent statistic with Benford’s law?). Thus we begin to get a glimpse as to why Benford’s law is universal for quantities which are the product of many separate factors, in a manner that no other law could be.
As an example: for any given number $N$, the uniform distribution from $0$ to $N$ does not obey Benford’s law; for instance, if one picks a random number from $0$ to $1{,}000{,}000$ then each digit from $1$ to $9$ appears as the first digit with an equal probability of $1/9$ each. However, if $N$ is not fixed, but instead obeys Benford’s law, then a random number selected from $0$ to $N$ also obeys Benford’s law (ignoring for now the distinction between continuous and discrete distributions), as it can be viewed as the product of $N$ with an independent random number selected from between $0$ and $1$.
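The absorptive property in this example is easy to observe numerically. The sketch below is an illustration under stated assumptions: the Benford-distributed upper limit is simulated as $10^U$ with $U$ uniform on $[0, 6]$ (an arbitrary range), the independent factor is uniform on $(0, 1]$, and the distance from Benford’s law is measured as the largest first-digit discrepancy (a hypothetical choice of metric).

```python
import math
import random

BENFORD = {d: math.log10((d + 1) / d) for d in range(1, 10)}


def first_digit(x: float) -> int:
    """Leading decimal digit of a positive number, via scientific notation."""
    return int(f"{x:.15e}"[0])


def distance_from_benford(samples) -> float:
    """Largest gap between observed first-digit frequencies and Benford's law."""
    counts = {d: 0 for d in range(1, 10)}
    for x in samples:
        counts[first_digit(x)] += 1
    return max(abs(counts[d] / len(samples) - BENFORD[d]) for d in BENFORD)


rng = random.Random(1)
size = 100_000
upper_limits = [10 ** rng.uniform(0, 6) for _ in range(size)]   # obeys Benford
fractions = [1.0 - rng.random() for _ in range(size)]           # uniform on (0, 1]
products = [n * u for n, u in zip(upper_limits, fractions)]     # uniform on (0, N]

print(f"uniform factor alone: {distance_from_benford(fractions):.3f}")
print(f"Benford upper limit:  {distance_from_benford(upper_limits):.3f}")
print(f"product:              {distance_from_benford(products):.3f}")
```

The uniform factor on its own is far from Benford’s law (each first digit has frequency about $1/9$), yet the product inherits the law from the single Benford factor.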
Actually, one can say something even stronger than the absorption property. Suppose that the continuous Benford’s law (1) for a statistic $X$ did not hold exactly, but instead held with some accuracy $\varepsilon > 0$, thus
$$\log_{10} \alpha - \varepsilon \leq \mathbf{P}( 10^n \leq X < \alpha 10^n \text{ for some integer } n ) \leq \log_{10} \alpha + \varepsilon \qquad (5)$$
for all $1 \leq \alpha < 10$. Then it is not hard to see that any dilated statistic, such as $\tilde X = 2X$, or more generally $\tilde X = XY$ for any fixed deterministic $Y$, also obeys (5) with exactly the same accuracy $\varepsilon$. But now suppose one uses a variable multiplier; for instance, suppose one uses the model discussed earlier in which $\tilde X$ is equal to $2X$ half the time and $X$ half the time. Then the relationship between the distribution of the first digit of $\tilde X$ and the first digit of $X$ is given by formulae such as (4). Now, in the right-hand side of (4), each of the two terms $\mathbf{P}( X \text{ has first digit } 2, 3 )$ and $\mathbf{P}( X \text{ has first digit } 1 )$ differs from the Benford’s law predictions of $\log_{10} \frac{4}{2}$ and $\log_{10} \frac{2}{1}$ respectively by at most $\varepsilon$. Since the left-hand side of (4) is the average of these two terms, it also differs from the Benford law prediction by at most $\varepsilon$. But the averaging opens up an opportunity for cancelling; for instance, an overestimate of $+\varepsilon$ for $\mathbf{P}( X \text{ has first digit } 2, 3 )$ could cancel an underestimate of $-\varepsilon$ for $\mathbf{P}( X \text{ has first digit } 1 )$ to produce a spot-on prediction for $\tilde X$. Thus we see that variable multipliers (or variable growth rates) not only preserve Benford’s law, but in fact stabilise it by averaging out the errors. In fact, if one started with a distribution which did not initially obey Benford’s law, and then started applying some variable (and independent) growth rates to the various samples in the distribution, then under reasonable assumptions one can show that the resulting distribution will converge to Benford’s law over time. This helps explain the universality of Benford’s law for statistics such as populations, for which the independent variable growth law is not so unreasonable (at least, until the population hits some maximum capacity threshold).
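The convergence claim can be illustrated with a small simulation. This is a sketch under assumptions chosen for convenience rather than realism: every sample starts at exactly $1$ (so the initial first-digit distribution is as far from Benford’s law as possible), and each step multiplies every sample by an independent growth factor drawn uniformly from $[1, 1.5]$, a hypothetical growth model.

```python
import math
import random

BENFORD = {d: math.log10((d + 1) / d) for d in range(1, 10)}


def first_digit(x: float) -> int:
    """Leading decimal digit of a positive number, via scientific notation."""
    return int(f"{x:.15e}"[0])


def distance_from_benford(samples) -> float:
    """Largest gap between observed first-digit frequencies and Benford's law."""
    counts = {d: 0 for d in range(1, 10)}
    for x in samples:
        counts[first_digit(x)] += 1
    return max(abs(counts[d] / len(samples) - BENFORD[d]) for d in BENFORD)


rng = random.Random(3)
samples = [1.0] * 20_000                       # every first digit is 1
distances = {0: distance_from_benford(samples)}

for step in range(1, 101):
    # Independent, variable growth factor for each sample (assumed model).
    samples = [x * rng.uniform(1.0, 1.5) for x in samples]
    if step in (10, 30, 100):
        distances[step] = distance_from_benford(samples)

for step, distance in distances.items():
    print(f"after {step:3d} steps: distance {distance:.3f}")
```

As the independent multipliers accumulate, $\log_{10}$ of each sample spreads out over more than one unit interval, its fractional part becomes nearly uniform, and the first-digit frequencies settle close to Benford’s law.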
Note that the independence property is crucial; if for instance population growth always slowed down for some inexplicable reason to a crawl whenever the first digit of the population was $6$, then there would be a noticeable deviation from Benford’s law, particularly in digits $6$ and $7$, due to this growth bottleneck. But this is not a particularly plausible scenario (being somewhat analogous to Maxwell’s demon in thermodynamics).
The above analysis can also be carried over to some extent to the Pareto distribution and Zipf’s law; if a statistic $X$ obeys these laws approximately, then after multiplying by an independent variable $Y$, the product $\tilde X = XY$ will obey the same laws with equal or higher accuracy, so long as $Y$ is small compared to the number of scales that $X$ typically ranges over. (One needs a restriction such as this because the Pareto distribution and Zipf’s law must break down below the median.) These laws are also stable under other multiplicative processes, for instance if some fraction of the samples in $X$ spontaneously split into two smaller pieces, or conversely if two samples in $X$ spontaneously merge into one; as before, the key is that the occurrence of these events should be independent of the actual size of the objects being split. If one considers a generalisation of the Pareto or Zipf law in which the exponent $\alpha$ is not fixed, but varies with $n$ or $m$, then the effect of these sorts of multiplicative changes is to blur and average together the various values of $\alpha$, thus “flattening” the $\alpha$ curve over time and making the distribution approach Zipf’s law and/or the Pareto distribution. This helps explain why $\alpha$ eventually becomes constant; however, I do not have a good explanation as to why $\alpha$ is often close to $1$.
— 2. Compatibility between laws —
Another mathematical line of support for Benford’s law, Zipf’s law, and the Pareto distribution is that the laws are highly compatible with each other. For instance, Zipf’s law and the Pareto distribution are formally equivalent: if there are $N$ samples of $X$, then applying (3) with $x$ equal to the $n^{\text{th}}$ largest value $X_n$ of $X$ gives
$$\frac{n}{N} = \mathbf{P}( X \geq X_n ) = c \, X_n^{-1/\alpha}$$
which implies Zipf’s law $X_n = C n^{-\alpha}$ with $C := (Nc)^\alpha$. Conversely one can deduce the Pareto distribution from Zipf’s law. These deductions are only formal in nature, because the Pareto distribution can only hold exactly for continuous distributions, whereas Zipf’s law only makes sense for discrete distributions, but one can generate more rigorous variants of these deductions without much difficulty.
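The formal deduction of Zipf’s law from the Pareto distribution can be tested on synthetic data. The sketch below is illustrative only; the exponent, the cutoff `x_min` and the sample size are hypothetical choices. It samples from an exact Pareto law $\mathbf{P}(X \geq x) = c \, x^{-1/\alpha}$ for $x \geq x_{\min}$ (so that $c = x_{\min}^{1/\alpha}$) by inverse transform sampling, and compares the $n^{\text{th}}$ largest sample with the Zipf prediction $C n^{-\alpha}$, $C = (Nc)^\alpha$.

```python
import random

alpha = 1.0          # hypothetical exponent
x_min = 1.0          # hypothetical lower cutoff of the Pareto tail
c = x_min ** (1 / alpha)
sample_size = 100_000

rng = random.Random(2)
# Inverse transform sampling: if U is uniform on (0, 1], then
# X = x_min * U**(-alpha) satisfies P(X >= x) = c * x**(-1/alpha) for x >= x_min.
samples = sorted(
    (x_min * (1.0 - rng.random()) ** (-alpha) for _ in range(sample_size)),
    reverse=True,
)

C = (sample_size * c) ** alpha       # Zipf constant predicted by the deduction

relative_errors = {}
for n in (100, 1_000, 10_000):
    actual = samples[n - 1]          # n-th largest sample
    predicted = C * n ** (-alpha)
    relative_errors[n] = abs(actual - predicted) / predicted
    print(f"n = {n:6d}: actual {actual:10.2f}, Zipf prediction {predicted:10.2f}")
```

The agreement improves as $n$ grows, since the $n^{\text{th}}$ largest value fluctuates by a relative amount of order $n^{-1/2}$; this matches the observation that Zipf’s law is least precise at the extreme edge.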
In some literature, Zipf’s law is applied primarily near the extreme edge of the distribution (e.g. the top $0.1\%$ of the sample space), whereas the Pareto distribution is applied in regions closer to the bulk (e.g. between the top $0.1\%$ and the top $10\%$). But this is mostly a difference of degree rather than of kind, though in some cases (such as with the example of the 2007 country populations data set) the exponent for the Pareto distribution in the bulk can differ slightly from the exponent for Zipf’s law at the extreme edge.
The relationship between Zipf’s law or the Pareto distribution and Benford’s law is more subtle. For instance Benford’s law predicts that the proportion of $X$ with initial digit $1$ should equal the proportion of $X$ with initial digit $2$ or $3$. But if one formally uses the Pareto distribution (3) to compare those $X$ between $10^m$ and $2 \times 10^m$, and those $X$ between $2 \times 10^m$ and $4 \times 10^m$, it seems that the former is larger by a factor of $2^{1/\alpha}$, which upon summing by $m$ appears inconsistent with Benford’s law (unless $\alpha$ is extremely large). A similar inconsistency is revealed if one uses Zipf’s law instead.
However, the fallacy here is that the Pareto distribution (or Zipf’s law) does not apply on the entire range of $X$, but only on the upper tail region when $X$ is significantly higher than the median; it is a law for the outliers of $X$ only. In contrast, Benford’s law concerns the behaviour of typical values of $X$; the behaviour of the top $0.1\%$ is of negligible significance to Benford’s law, though it is of prime importance for Zipf’s law and the Pareto distribution. Thus the two laws describe different components of the distribution and thus complement each other. Roughly speaking, Benford’s law asserts that the bulk distribution of $\log_{10} X$ is locally uniform at unit scales, while the Pareto distribution (or Zipf’s law) asserts that the tail distribution of $\log_{10} X$ decays exponentially. Note that Benford’s law only describes the fine-scale behaviour of the bulk distribution; the coarse-scale distribution can be a variety of distributions (e.g. log-gaussian).
50 comments
4 July, 2009 at 8:00 am
Kevin O'Bryant
Another use of these laws: detecting election fraud . See http://science.slashdot.org/story/09/06/16/2137203/Statistical-Suspicions-In-Irans-Election for Benford’s Law, and http://yro.slashdot.org/article.pl?sid=07/12/07/2222212 for a different sort of election fraud.
4 July, 2009 at 10:36 am
Steve Lawford
Prof. Tao,
Xavier Gabaix (NYU stern) has done a lot of work on this in Economics, e.g. http://pages.stern.nyu.edu/%7Exgabaix/papers/zipf.pdf and the survey paper http://ssrn.com/abstract=1257822.
Regards,
Steve Lawford
5 July, 2009 at 12:07 pm
Mario Figueiredo
Very interesting post.
Maybe you already know this, but an interesting “derivation” of Benford’s law can be found in the following paper by Ted Hill (“A statistical derivation of the significant-digit law,” Statistical Science 10, 354-363 (1995)):
http://www.tphill.net/publications/BENFORD%20PAPERS/statisticalDerivationSigDigitLaw1995.pdf
5 July, 2009 at 9:14 pm
robert
Has anyone done a visual representation of Terence Tao’s interesting findings? Thanks.
6 July, 2009 at 10:30 pm
Qiaochu Yuan
This is probably standard material, but are there similar scale invariance-type arguments supporting the Gaussian distribution?
7 July, 2009 at 1:08 pm
robertfurber
@ Qiaochu Yuan
The Gaussian distribution is supported by the central limit theorem, and is to do with adding random variables rather than multiplying. There are also the “stable distributions” which can turn up if some of the assumptions of the CLT fail to hold.
http://en.wikipedia.org/wiki/Central_limit_theorem
http://en.wikipedia.org/wiki/Stable_distribution
11 July, 2009 at 4:32 pm
tom w
The above post was very enjoyable reading. I’ve seen power-law distributions derived from maximum entropy principles, but the multiplicative property is very neat and indicates that there is a (to my mind) more satisfactory explanation.
14 July, 2009 at 10:50 pm
eitan bachmat
Very interesting post. One can relate the appearance of $\alpha=1$ to another well known law, Murphy’s law.
It can be shown that in several queueing theoretic settings, the Pareto distribution with parameter $\alpha=1$ (or its discrete counterparts) is the worst possible service time distribution in the sense that it leads (asymptotically over ever increasing scales) to the longest waiting times or slowdown.
This is partially related to the fact that the distribution satisfies a modular transformation law with weight 4. Following the Pollaczek-Khinchine formula and Riemann’s trick for producing functional equations from modular transformations, this distribution has some nice symmetries when computing waiting times.
Its appearance in measurements of file and job sizes in large computing environments may explain why we sometimes wait a long time.
3 August, 2009 at 2:05 pm
Nathan Shewmaker
The best explanation I’ve read of Benford’s Law was in terms of Digital Signal Processing, written by Steven W. Smith.
http://www.dspguide.com/ch34.htm
29 October, 2009 at 12:47 am
Oded
Below is a maximum entropy explanation of all these rules based on the statistics of inert balls in boxes.
http://arxiv.org/abs/0907.4852
7 November, 2009 at 8:23 am
Takis Konstantopoulos
I’d be interested to find out how, in an elementary probability class, you would introduce the normal distribution. As you know, most classes do it by fiat: here is a formula, here is how we use it, and, perhaps (…) a bit of a posteriori motivation via the CLT (which is never, really, explained). I happen to like explaining the reason right upfront (the universality, that is), even though I can’t be 100% rigorous. Nevertheless, it seems (to me) to be hitting more notes if you can explain something through its properties rather than as a blunt formula.
11 November, 2009 at 11:24 am
Terence Tao
Hmm. I’ve actually never taught a probability class, but if one had to, I suppose one could first begin with an explicit derivation of CLT for Bernoulli distributions (discrete random walk) using Stirling’s formula. This is still quite lengthy but it is at least elementary.
One can also do a relatively short computation showing that the sum of two independent gaussians is again a gaussian. This already shows that the only possible universal limit for the CLT is a gaussian, and if one also uses the Lindeberg exchange trick one can in fact derive the full CLT from this.
The “correct” way to view the CLT, I guess, is via either the Fourier transform or the heat equation, but these are not easy to introduce rapidly in an elementary probability theory class. One can also start from the observation that the gaussian is the stationary point of the Ornstein-Uhlenbeck process (which, infinitesimally, corresponds to renormalised Brownian motion) but this is even less elementary…
8 November, 2009 at 12:00 am
Oded
When I wrote my paper about the maximum entropy distribution functions of system with high occupation number, I decided to add a paragraph about the standard maximum entropy solution namely, the low occupation canonic distribution. To my surprise I have not found any analysis based on probability that connects the canonic exponential distribution to the normal (Gaussian) distribution. (Maybe someone in this forum can show such a derivation?).
When I derived it myself, I found that the probability function that results from the canonic distribution is a distorted bell-like function very similar to the distribution of speed of molecule in gas and the death distribution function. However, this function is not the normal distribution. When the density of a particle increases the distribution function decreases exponentially and yields the Maxwell Boltzmann distribution as explained in my paper.
19 May, 2010 at 2:23 pm
Versions of Benford’s Law « Xi'an's Og
[...] note picks at the scale-invariant characterisation of Benford’s Law when Terry Tao’s entry represents it as a special case of Haar [...]
28 June, 2010 at 11:21 pm
Accelerating Future » Why Benford’s Law?
[...] law. A new preprint has been published on the topic at arXiv, kickstarting an entertaining round of debate. Here’s the Wikipedia description: Benford’s law, also called the first-digit law, [...]
8 September, 2010 at 7:26 am
Irrational rotations of the circle and Benford’s law « Division by Zero
[...] distributed, but obey Benford’s law: the leading digit d occurs with frequency log_10(1 + 1/d). [See Terence Tao's post for details on when Benford's law applies, but roughly speaking the data must be very spread out [...]
14 September, 2010 at 10:15 pm
A second draft of a non-technical article on universality « What’s new
[...] Tao, “Benford’s law, Zipf’s law, and the Pareto distribution“, blog post, [...]
21 November, 2010 at 8:32 am
What is a random positive integer? | Quasi-Coherent
[...] recommend reading Terence Tao‘s article on Benford’s Law and other phenomena of a similar flavor for further [...]
4 March, 2011 at 9:04 am
Michael Hardy
If you like the Pareto distribution, here’s a paper I wrote showing equivalence of 80/20-type laws and the power law:
Michael Hardy (2010) “Pareto’s Law”, Mathematical Intelligencer, 32 (3), 38–43. doi: 10.1007/s00283-010-9159-2
24 June, 2011 at 5:17 am
Jason Long
It’s much more of a layman’s demonstration, but a friend and I are working on a site that shows Benford’s Law in action against publicly available datasets (hopefully more to come).
http://testingbenfordslaw.com
21 July, 2011 at 12:15 pm
Power Laws - Blog – Stack Exchange
[...] don’t have a good conceptual explanation for Lotka’s law. There are certain mathematical reasons why power laws are plausible in certain situations, but I don’t know of any good explanations [...]
21 July, 2011 at 12:34 pm
Update « Annoying Precision
[...] Curiously enough, the Zipf distribution which shows up in that post is the same as the zeta distribution that shows up when trying to motivate the definition of the Riemann zeta function. I’m sure there is a conceptual explanation of this connection somewhere, probably coming from statistical mechanics, but I don’t know it. I suppose the approximate scale invariance of the zeta distribution is relevant to its appearance in many real-life statistics, as described in Terence Tao’s blog post on the subject here. [...]
22 July, 2011 at 4:26 am
Sune Kristian Jakobsen
I think you write 30,7% two times where it should be 30,1% [Corrected, thanks - T.]
26 July, 2011 at 1:46 pm
Quora
Why is it that, in many data sets, there are about six times more numbers starting with the digit 1 than with the digit 9 — a phenomenon called Benford’s Law?…
If you list all the countries in the world and their populations, 27% of the numbers will start with the digit 1. Only 3% of them will start with the digit 9. Something very similar holds if you look at the heights of the 60 tallest structures in the w…
30 July, 2011 at 11:06 am
Pearson-Wong Diffusions | Research Notebook
[...] (1999) or Tao (2009). [...]
28 May, 2012 at 11:01 am
Drawing a natural number at random, foundations of probability, determinism, Laplace (8) « Frank Waaldijk's math & science & philosophy blog
[...] and discrepancy between Benford’s law and Zipf’s law A good reference for this is Terry Tao’s blogpost on Benford’s law. We quote: Analogous universality phenomena also show up in empirical distributions – the [...]
30 December, 2013 at 3:29 am
fwaaldijk
Since I’m being listed here anyway due to some pingback, I might as well add my 2¢ worth. Benford’s law opens up an interesting view on the question: `can we draw a natural number at random?’. It turns out that we can, provided that we speak of relative probabilities only. (The role of the number 0 remains mysterious).
For 0 < n, m in N, let’s denote the relative chance of drawing n vs. drawing m by P(n)/P(m). Then P(n)/P(m) = log((n+1)/n) / log((m+1)/m); see the series posted last year on my blog fwaaldijk.wordpress.com. These relative probabilities can be derived heuristically, and are in accordance with Benford's law. The heuristics are interesting I think, but also quite speculative.
A `discrete' or `Zipfian' case can also be stated, which then yields P(n)/P(m)=m/n. But I did not study this in any detail.
I would be happy to receive some feedback on these ideas. Kind regards, Frank
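One way to see why the relative weights log((n+1)/n) are in accordance with Benford's law is that they telescope. The sketch below is only an illustration, with hypothetical function names and with the restriction to a single decade taken as an assumption: it normalises the weights over the digits 1 to 9 and checks that the numbers with leading digit d inside a later decade carry the same total weight as the digit d itself.

```python
import math

def relative_weight(n: int) -> float:
    """Relative weight log((n+1)/n) attached to the natural number n."""
    return math.log((n + 1) / n)

def first_digit_shares() -> dict[int, float]:
    """Normalise the weights of 1..9; the total telescopes to log(10)."""
    total = math.fsum(relative_weight(d) for d in range(1, 10))
    return {d: relative_weight(d) / total for d in range(1, 10)}

def decade_weight(d: int, k: int) -> float:
    """Total weight of the integers in [d * 10^k, (d+1) * 10^k), i.e. leading digit d."""
    return math.fsum(relative_weight(n) for n in range(d * 10**k, (d + 1) * 10**k))

shares = first_digit_shares()
for d in range(1, 10):
    # Second column is the Benford frequency log10(1 + 1/d).
    print(d, round(shares[d], 4), round(math.log10(1 + 1 / d), 4))
```

Since the weights have infinite total mass over all of N, only ratios and finite blocks such as a single decade can be normalised this way.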
18 June, 2012 at 8:04 pm
M. Hampton
In the book “Information Theory, Inference, and Learning Algorithms”, by David MacKay, he points out that the scale-free (improper) prior distribution proportional to 1/x can be thought of as the exponential of a uniform distribution on ln(x). The empirical prevalence of distributions close to C/x (i.e. Zipf’s law in the discrete case) might reflect that if a positive quantity is arising from an approximately scale-free system (over some finite but large range of scales), the probability density of the logarithm of that quantity should be approximately uniform.
While imprecise, I find this a compelling line of thought.
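The link between a uniform density on ln(x) and first-digit statistics can be tried out with a small simulation. This is a sketch under stated assumptions: the six-decade range, the sample size, the seed and the helper names are all arbitrary choices made for illustration. It draws samples whose logarithm is uniform and tabulates their leading digits next to the Benford frequencies.

```python
import math
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read off its scientific notation."""
    return int(f"{x:.15e}"[0])

def log_uniform_digit_shares(samples: int, decades: int, seed: int = 0) -> dict[int, float]:
    """Leading-digit frequencies when log10(x) is uniform on [0, decades]."""
    rng = random.Random(seed)
    counts = Counter(
        leading_digit(10 ** rng.uniform(0, decades)) for _ in range(samples)
    )
    return {d: counts[d] / samples for d in range(1, 10)}

shares = log_uniform_digit_shares(samples=100_000, decades=6)
for d in range(1, 10):
    # Empirical share next to the Benford value log10(1 + 1/d).
    print(d, round(shares[d], 4), round(math.log10(1 + 1 / d), 4))
```

With a whole number of decades the match is exact in expectation; a range that is not a whole number of decades would show a visible deviation.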
18 June, 2012 at 10:52 pm
Oded Kafri
I would like to draw your attention to the paper arXiv:0907.4852, “The distributions in nature and entropy principle”, which shows that the solutions to Benford’s law, the Pareto rule and Zipf’s law are all based on the Planck equation and require no parameters at all!
Have fun.
19 June, 2012 at 1:11 pm
petequinn
Thanks for sharing that Oded, I will read it more carefully when I find some quiet time!
9 July, 2012 at 9:10 pm
Anonymous
Of interest is Hamming’s paper “On the Distribution of Numbers” at http://www.alcatel-lucent.com/bstj/vol49-1970/articles/bstj49-8-1609.pdf
In it, he presents a derivation of Benford’s law for multiplicative processes.
29 July, 2012 at 8:41 pm
Benford’s Law and Baseball « Gödel’s Lost Letter and P=NP
[...] to certain mixtures of uniform distributions by Élise Janvresse and Thierry de la Rue, and a post three years ago by Terry Tao. The upshot is that data resulting from a complex mix of factors will [...]
25 September, 2012 at 5:45 am
OMG, so much Data! | Notepad PlusPlus
[...] [5] https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/ [...]
25 September, 2012 at 6:16 am
OMG, so much Data! | ugresearchspur
[...] [5] https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/ [...]
1 January, 2013 at 7:00 am
Benford's Law | Citizen Scientists League
[...] Note: For the more mathematically inclined, Terence Tao wrote an excellent blog post on Benford’s Law and other universal laws, which can be found here: https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/. [...]
12 January, 2013 at 3:40 pm
Roland
Thank you Terence, this is a precious post.
I was actually looking for turbulence, where this blog also provides some brilliant food for thought; I guess I am becoming a new fan.
Never heard about Benford before so I appreciate especially the Benford application to country population.
It is beautiful in, conversely, showing that
a) national borders are perfectly chaotic
b) since Benford overestimates 1, but consistently underestimates 2..9, that human population is far from infinite.
13 January, 2013 at 7:28 am
Oded Kafri
I read the above comment about the description of Benford’s law as the result of phenomena “which ostensibly involve large numbers of unknown parameters”. Since I provided a derivation of all these laws, based on Planck’s 1901 calculation of the radiation law, I would like to comment that the Pareto, Zipf and Benford laws are purely statistical, exactly like the bell-like distribution (Maxwell-Boltzmann distribution). There are no free parameters, etc. The Zipf distribution is an approximation of Planck’s distribution for high occupation numbers, which means a random statistical distribution in which there are more particles than boxes. A good example of this distribution is the distribution of population in cities, as there are obviously more people than cities. Another example is Zipf’s famous original example of the distribution of words in texts: there are more words in a text than different words. The Maxwell-Boltzmann distribution is an approximation of the Planck distribution for low occupation numbers, which means a random distribution in which the number of particles is smaller than the number of boxes. IQ is a good example, as every man has only one IQ number out of many possibilities. Another example is the death probability: every man dies on one date out of numerous possibilities.
For more reading O. Kafri “The distribution in nature and entropy principle” arXiv:0907.4852 and references therein.
16 January, 2013 at 5:14 pm
Roland
To whom it may concern:
i am well aware of a weak form of maxwellian demon who is looking ceaselessly after all these threads.
i have recently seen him at work. he is cleaning things up very helpfully.
i am trying to make a meaningful point here.
in doing so, various fragments of this post are slightly missing the focus of this thread’s subject matter.
perhaps more slightly than is tolerable by the blog administrator.
also, please accept my apologies for all that self promotion after you have read this entire post.
this is an empirical experiment on the acceptable limits of commenting.
if the experiment does not go terribly wrong,
i am excited about becoming a happy participant of this lively community.
any feedback whatsoever is highly appreciated.
thank you very much for your attention.
********************************************
Dear blog administrator,
in the event that interference with this post is inevitable,
please inform any one concerned, by any means deemed necessary, of what is going on.
thank you.
********************************************
Dear Oded,
you were so promptly drawing my attention to your paper.
i could not fail to notice how many times you drew attention to some one before, in the previous couple of years, on this very thread.
you deserve feedback.
so i felt obliged to commit my quiet time to your paper.
Oded, i am very sorry but you are not deriving all these laws.
i want to believe your concepts, am almost there, but not quite, no.
you have truly stunning ideas and you firmly know your stuff.
but in its current form, i am not able to comprehend your paper fully.
your narrative is bumpy – that is, lengthy in the easy parts but nonexistent in the difficult (and interesting) parts.
also i am too lazy to do all of your research (again).
therefore it is not possible for me (and i suppose not very much any one) to have a meaningful discussion of its content.
(as if i knew anything !!)
Let me draw your attention to Terry’s wonderful (as usual) advice “on writing” here:
https://terrytao.wordpress.com/advice-on-writing-papers/
You tripped some of those land mines, if not most of them.
Your paper needs major pimping in order to be taken seriously.
On a different level:
exactly the form of your paper (which i just criticized so cruelly)
reveals much of your personality.
a personality that reminds me much of my self when i was at graduate school.
*** self promotion alert, start ***
oh how i love the proof by excel sheet !!
also, i am reminded (painfully so)
of how the pursuit of academic maturity is so incredibly painful.
and it never stops once it got started.
academic maturity has little to do with academic status or glory.
sorry i was stealing that line from somewhere. i will do it again.
having studied neither pure math nor computer science,
i bet aspirants in both these fields get as good a deal of masochistic drill as in any discipline (except medicine where doctor training aims at sadistic tendencies).
i have failed to achieve ph.d. status on three separate accounts and have never ever published anything.
one such failure relates closely to me being fed up with latex. latex sucks!
i assure any one caring to ask:
academic status is non essential and academic maturity is unlearnable from university alone.
academic status in fact hurts certain career options.
i have met math doctors working for a fraction of my income,
*** self promotion alert, end ***
and i earn low coin.
Dear Oded, you are well on your way.
You need to cheer up and keep going.
Please do your homework: pimp that paper and re-load it.
You owe it to your self.
And the community.
If you re-load, i promise i will re-view.
Perhaps i will motivate myself and do some home work of my own.
Turbulence needs fixing, after all.
Most sincerely,
Roland
********************************************
In relating to the subject matter at hand, i know of other good Benford examples i like to share:
my undergrad math teacher (with a background in telecommunication) talked about huge samples
of lengths of telephone conversations.
it seemed there was no way of predicting how much longer a conversation (in progress) is going to last, given it is in progress already for so many minutes.
so conversation statistics would be more perfectly chaotic than death statistics.
in analogy to Oded’s notion, there are more conversation boxes than death boxes.
everything else from said math class (which may very well have been about Benford) i have genuinely forgotten by the way.
coincidentally, i stumbled on a very enlightening non math article elaborating on (among other interesting things) the randomness of cancer. cancer positively qualifies as the mother of all Benfordian statistics with an infinite number of boxes.
I fell in love with the wording. Two thumbs up !!
16 January, 2013 at 5:27 pm
Roland
Whoops, i always seem to forget something important at the end.
The paper on perfect randomness of cancer:
https://www.nsfwcorp.com/dispatch/watson-cancer
11 November, 2013 at 7:03 am
Oded Kafri
Hi Roland, thank you for reading my arXiv post. I just published a popular book about logical distributions that includes the derivations only in the appendixes. The book’s name is “Entropy God’s Dice Game”. I hope that you and also Professor Tao will be exposed to a physical approach to probability. There is also a blog, http://www.entropy-book.com, that discusses the subject matter. I hope that you will find this material helpful.
27 September, 2013 at 12:28 am
Artificial Intelligence Blog · Zipf’s Law, ArtEnt Blog Hits
[…] (Though Zipf’s “law” has been known for a long time, this post is at least partly inspired by Terence Tao’s wonderful post “Benford’s law, Zipf’s law, and the Pareto distribution“.) […]
29 September, 2013 at 6:54 pm
Joe Wright
Amazing work. It’s amazing how many things this theory applies to.
10 January, 2014 at 5:25 am
Richard Johnson
These phenomena are plausibly results of the Central Limit Theorem rather than any other law. If you add i.i.d. errors with a finite variance you get a normal distribution. In the Zipf law settings one tends to multiply together i.i.d. errors instead. But the log of this product of errors is again a sum of i.i.d. errors which is asymptotically normal. Thus the log of e.g. city size is asymptotically normal and the distribution of city sizes is lognormal.
This would certainly explain how ‘Zipf’s law seems to turn up everywhere’ if it is a manifestation of the Central Limit Theorem.
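The multiplicative argument above can be illustrated with a short simulation. This is only a sketch: the factor distribution (uniform on [0.5, 1.5]), the number of factors, the sample size and the function names are assumptions chosen for illustration. It multiplies i.i.d. positive factors and checks one signature of normality for the logarithm of the product, namely that roughly 68% of the values fall within one standard deviation of the mean.

```python
import math
import random
import statistics

def log_of_product(rng: random.Random, factors: int) -> float:
    """Logarithm of a product of i.i.d. factors, computed as a sum of logs."""
    return sum(math.log(rng.uniform(0.5, 1.5)) for _ in range(factors))

def share_within_one_sd(samples: int = 5000, factors: int = 40, seed: int = 0) -> float:
    """Fraction of log-products within one sample standard deviation of the sample mean."""
    rng = random.Random(seed)
    logs = [log_of_product(rng, factors) for _ in range(samples)]
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    return sum(abs(v - mean) <= sd for v in logs) / samples

# A normal distribution would put about 68.27% of the mass in this window.
print(share_within_one_sd())
```

This checks the bulk of the distribution only; it says nothing about the extreme tail.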
10 January, 2014 at 9:36 am
Terence Tao
Benford’s law can indeed often be explained through a central limit theorem analysis, but Zipf’s law and Pareto’s law cannot; the log-normal distribution has different tails than the power law distribution, and so something else is going on at the extreme end of city sizes (or extremely popular words, etc.) I posed a question on MathOverflow about this at http://mathoverflow.net/questions/39224/is-there-a-natural-random-process-that-is-rigorously-known-to-produce-zipfs-law , and got some fairly satisfactory answers for the case of frequent words, but the situation was murkier for things like the largest cities.
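The difference in tails can be made concrete by comparing how fast the survival function P(X > x) decays on a log-log scale. The sketch below uses assumed parameters (a Pareto exponent of 1, a log-normal with sigma = 2) and hypothetical helper names: a power law has a constant log-log slope, while the log-normal slope keeps getting steeper as x grows.

```python
import math

def pareto_survival(x: float, alpha: float = 1.0) -> float:
    """P(X > x) for a Pareto law with minimum value 1 and exponent alpha."""
    return x ** (-alpha)

def lognormal_survival(x: float, sigma: float = 2.0) -> float:
    """P(X > x) when ln X is normal with mean 0 and standard deviation sigma."""
    return 0.5 * math.erfc(math.log(x) / (sigma * math.sqrt(2)))

def local_slope(survival, x: float, step: float = 1.1) -> float:
    """Finite-difference slope of log P(X > x) against log x."""
    return (math.log(survival(x * step)) - math.log(survival(x))) / math.log(step)

for x in (10, 100, 1000, 10000):
    print(
        x,
        round(local_slope(pareto_survival, x), 3),
        round(local_slope(lognormal_survival, x), 3),
    )
```

Over a limited range the two can look alike, which is one reason the distinction only becomes visible at the extreme end of the data.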