You are currently browsing the tag archive for the ‘randomness’ tag.
Given a function on the natural numbers taking values in , one can invoke the Furstenberg correspondence principle to locate a measure preserving system – a probability space together with a measure-preserving shift (or equivalently, a measure-preserving -action on ) – together with a measurable function (or “observable”) that has essentially the same statistics as in the sense that
for any integers . In particular, one has
whenever the limit on the right-hand side exists. We will refer to the system together with the designated function as a Furstenberg limit ot the sequence . These Furstenberg limits capture some, but not all, of the asymptotic behaviour of ; roughly speaking, they control the typical “local” behaviour of , involving correlations such as in the regime where are much smaller than . However, the control on error terms here is usually only qualitative at best, and one usually does not obtain non-trivial control on correlations in which the are allowed to grow at some significant rate with (e.g. like some power of ).
The correspondence principle is discussed in these previous blog posts. One way to establish the principle is by introducing a Banach limit that extends the usual limit functional on the subspace of consisting of convergent sequences while still having operator norm one. Such functionals cannot be constructed explicitly, but can be proven to exist (non-constructively and non-uniquely) using the Hahn-Banach theorem; one can also use a non-principal ultrafilter here if desired. One can then seek to construct a system and a measurable function for which one has the statistics
for all . One can explicitly construct such a system as follows. One can take to be the Cantor space with the product -algebra and the shift
with the function being the coordinate function at zero:
(so in particular for any ). The only thing remaining is to construct the invariant measure . In order to be consistent with (2), one must have
for any distinct integers and signs . One can check that this defines a premeasure on the Boolean algebra of defined by cylinder sets, and the existence of then follows from the Hahn-Kolmogorov extension theorem (or the closely related Kolmogorov extension theorem). One can then check that the correspondence (2) holds, and that is translation-invariant; the latter comes from the translation invariance of the (Banach-)Césaro averaging operation . A variant of this construction shows that the Furstenberg limit is unique up to equivalence if and only if all the limits appearing in (1) actually exist.
One can obtain a slightly tighter correspondence by using a smoother average than the Césaro average. For instance, one can use the logarithmic Césaro averages in place of the Césaro average , thus one replaces (2) by
Whenever the Césaro average of a bounded sequence exists, then the logarithmic Césaro average exists and is equal to the Césaro average. Thus, a Furstenberg limit constructed using logarithmic Banach-Césaro averaging still obeys (1) for all when the right-hand side limit exists, but also obeys the more general assertion
whenever the limit of the right-hand side exists.
In a recent paper of Frantizinakis, the Furstenberg limits of the Liouville function (with logarithmic averaging) were studied. Some (but not all) of the known facts and conjectures about the Liouville function can be interpreted in the Furstenberg limit. For instance, in a recent breakthrough result of Matomaki and Radziwill (discussed previously here), it was shown that the Liouville function exhibited cancellation on short intervals in the sense that
In terms of Furstenberg limits of the Liouville function, this assertion is equivalent to the assertion that
for all Furstenberg limits of Liouville (including those without logarithmic averaging). Invoking the mean ergodic theorem (discussed in this previous post), this assertion is in turn equivalent to the observable that corresponds to the Liouville function being orthogonal to the invariant factor of ; equivalently, the first Gowers-Host-Kra seminorm of (as defined for instance in this previous post) vanishes. The Chowla conjecture, which asserts that
for all distinct integers , is equivalent to the assertion that all the Furstenberg limits of Liouville are equivalent to the Bernoulli system ( with the product measure arising from the uniform distribution on , with the shift and observable as before). Similarly, the logarithmically averaged Chowla conjecture
is equivalent to the assertion that all the Furstenberg limits of Liouville with logarithmic averaging are equivalent to the Bernoulli system. Recently, I was able to prove the two-point version
of the logarithmically averaged Chowla conjecture, for any non-zero integer ; this is equivalent to the perfect strong mixing property
for any Furstenberg limit of Liouville with logarithmic averaging, and any .
The situation is more delicate with regards to the Sarnak conjecture, which is equivalent to the assertion that
for any zero-entropy sequence (see this previous blog post for more discussion). Morally speaking, this conjecture should be equivalent to the assertion that any Furstenberg limit of Liouville is disjoint from any zero entropy system, but I was not able to formally establish an implication in either direction due to some technical issues regarding the fact that the Furstenberg limit does not directly control long-range correlations, only short-range ones. (There are however ergodic theoretic interpretations of the Sarnak conjecture that involve the notion of generic points; see this paper of El Abdalaoui, Lemancyk, and de la Rue.) But the situation is currently better with the logarithmically averaged Sarnak conjecture
as I was able to show that this conjecture was equivalent to the logarithmically averaged Chowla conjecture, and hence to all Furstenberg limits of Liouville with logarithmic averaging being Bernoulli; I also showed the conjecture was equivalent to local Gowers uniformity of the Liouville function, which is in turn equivalent to the function having all Gowers-Host-Kra seminorms vanishing in every Furstenberg limit with logarithmic averaging. In this recent paper of Frantzikinakis, this analysis was taken further, showing that the logarithmically averaged Chowla and Sarnak conjectures were in fact equivalent to the much milder seeming assertion that all Furstenberg limits with logarithmic averaging were ergodic.
Actually, the logarithmically averaged Furstenberg limits have more structure than just a -action on a measure preserving system with a single observable . Let denote the semigroup of affine maps on the integers with and positive. Also, let denote the profinite integers (the inverse limit of the cyclic groups ). Observe that acts on by taking the inverse limit of the obvious actions of on .
Proposition 1 (Enriched logarithmically averaged Furstenberg limit of Liouville) Let be a Banach limit. Then there exists a probability space with an action of the affine semigroup , as well as measurable functions and , with the following properties:
- (i) (Affine Furstenberg limit) For any , and any congruence class , one has
- (ii) (Equivariance of ) For any , one has
for -almost every .
- (iii) (Multiplicativity at fixed primes) For any prime , one has
for -almost every , where is the dilation map .
- (iv) (Measure pushforward) If is of the form and is the set , then the pushforward of by is equal to , that is to say one has
for every measurable .
Note that can be viewed as the subgroup of consisting of the translations . If one only keeps the -portion of the action and forgets the rest (as well as the function ) then the action becomes measure-preserving, and we recover an ordinary Furstenberg limit with logarithmic averaging. However, the additional structure here can be quite useful; for instance, one can transfer the proof of (3) to this setting, which we sketch below the fold, after proving the proposition.
The observable , roughly speaking, means that points in the Furstenberg limit constructed by this proposition are still “virtual integers” in the sense that one can meaningfully compute the residue class of modulo any natural number modulus , by first applying and then reducing mod . The action of means that one can also meaningfully multiply by any natural number, and translate it by any integer. As with other applications of the correspondence principle, the main advantage of moving to this more “virtual” setting is that one now acquires a probability measure , so that the tools of ergodic theory can be readily applied.
Given a random variable that takes on only finitely many values, we can define its Shannon entropy by the formula
with the convention that . (In some texts, one uses the logarithm to base rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables taking on finitely many values, the joint variable is also a random variable taking on finitely many values, and also has an entropy . It obeys the Shannon inequalities
so we can define some further nonnegative quantities, the mutual information
and the conditional entropies
More generally, given three random variables , one can define the conditional mutual information
and the final of the Shannon entropy inequalities asserts that this quantity is also non-negative.
The mutual information is a measure of the extent to which and fail to be independent; indeed, it is not difficult to show that vanishes if and only if and are independent. Similarly, vanishes if and only if and are conditionally independent relative to . At the other extreme, is a measure of the extent to which fails to depend on ; indeed, it is not difficult to show that if and only if is determined by in the sense that there is a deterministic function such that . In a related vein, if and are equivalent in the sense that there are deterministic functional relationships , between the two variables, then is interchangeable with for the purposes of computing the above quantities, thus for instance , , , , etc..
One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables being considered come from restricting a single random (and uniformly distributed) boolean function on a given finite domain to some subset of :
In this case, has the law of a random uniformly distributed boolean function from to , and the entropy here can be easily computed to be , where denotes the cardinality of . If is the restriction of to , and is the restriction of to , then the joint variable is equivalent to the restriction of to . If one discards the normalisation factor , one then obtains the following dictionary between entropy and the combinatorics of finite sets:
Random variables | Finite sets |
Entropy | Cardinality |
Joint variable | Union |
Mutual information | Intersection cardinality |
Conditional entropy | Set difference cardinality |
Conditional mutual information | |
independent | disjoint |
determined by | a subset of |
conditionally independent relative to |
Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality becomes the union bound , and the definition of mutual information becomes the inclusion-exclusion formula
For a more advanced example, consider the data processing inequality that asserts that if are conditionally independent relative to , then . Specialising to sets, this now says that if are disjoint outside of , then ; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if and are not necessarily disjoint outside of , then a consideration of Venn diagrams gives the more general inequality
and a further inspection of the diagram then reveals the more precise identity
Using the dictionary in the reverse direction, one is then led to conjecture the identity
which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set is preserved by unions:
Indeed, one has the union bound
Applying the dictionary in the reverse direction, one might now conjecture that if was independent of and was independent of , then should also be independent of , and furthermore that
but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take to be independent, uniformly distributed random elements of the finite field of two elements, and take to be the sum of these two field elements. One can easily check that each of and is separately independent of , but the joint variable determines and thus is not independent of .
From the inclusion-exclusion identities
one can check that (1) is equivalent to the trivial lower bound . The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection . (Even the double intersection only exists information theoretically in a “virtual” sense; the mutual information allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)
However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities associated to just two variables are those that are also necessarily obeyed by their combinatorial analogues . (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)
One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let be some finite-dimensional vector space over a finite field , and let be a random linear functional on , selected uniformly among all such functions. Every subspace of then gives rise to a random variable formed by restricting to . This random variable is also distributed uniformly amongst all linear functions on , and its entropy can be easily computed to be . Given two random variables formed by restricting to respectively, the joint random variable determines the random linear function on the union on the two spaces, and thus by linearity on the Minkowski sum as well; thus is equivalent to the restriction of to . In particular, . This implies that and also , where is the quotient map. After discarding the normalising constant , this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:
Random variables | Subspaces |
Entropy | Dimension |
Joint variable | Sum |
Mutual information | Dimension of intersection |
Conditional entropy | Dimension of projection |
Conditional mutual information | |
independent | transverse () |
determined by | a subspace of |
conditionally independent relative to | , transverse. |
The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking to be the vector space over the finite field of two elements, and only considering those subspaces that are coordinate subspaces associated to various subsets of .
As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of ). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).
The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces such that and are separately transverse to , but their sum is not; for instance, in a two-dimensional vector space , one can take to be the one-dimensional subspaces spanned by , , and respectively. Note that this is essentially the same counterexample from before (which took to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables (as well as the corresponding quantities on the above table).
However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality
for any subspaces . This is easiest to see when the three terms on the right-hand side vanish; then are transverse, which implies that ; similarly . But and are transverse, and this clearly implies that and are themselves transverse. To prove the general case of Ingleton’s inequality, one can define and use (and similarly for instead of ) to reduce to establishing the inequality
which can be rearranged using (and similarly for instead of ) and as
but this is clear since .
Returning to the entropy setting, the analogue
of (3) is true (exercise!), but the analogue
of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then are conditionally independent relative to , and relative to , and and are independent, and the claim (4) would then be asserting that and are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take to be independent uniform variables from , and take and to be (say) and respectively (thus are the indicators of the events and respectively). Once one conditions on either or , one of has positive conditional entropy and the other has zero entropy, and so are conditionally independent relative to either or ; also, or are independent of each other. But and are not independent of each other (they cannot be simultaneously equal to ). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces has a well-defined intersection that is also a subspace, whereas for arbitrary random variables , there does not necessarily exist the analogue of an intersection, namely a “common information” random variable that has the entropy of and is determined either by or by .
I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality
One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that
whenever was sufficiently large depending on , where is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale between and , one can form certain random variables . The random variable is a sign pattern of the form where is a random number chosen from to (with logarithmic weighting). The random variable was tuple of reductions of to primes comparable to . Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of , the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound
on the mutual information between and . From translation invariance, this also gives the more general lower bound
for any , where denotes the shifted sign pattern . On the other hand, one had the entropy bounds
and from concatenating sign patterns one could see that is equivalent to the joint random variable for any . Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once was allowed to become sufficiently large compared to , but the bound was quite weak (coming ultimately from the unboundedness of as the interval of values of under consideration becomes large), something of the order of ; the quantity needs at various junctures to be less than a small power of , so the relationship between and becomes essentially quadruple exponential in nature, . The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate of the mean entropy, in that this quantity decreased by as increased from to , basically by dividing into components , and observing from (6) each of these shares a bit of common information with the same variable . This is relatively clear when one works in a set model, in which is modeled by a set of size , and is modeled by a set of the form
for various sets of size (also there is some translation symmetry that maps to a shift while preserving all of the ).
However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables are basically jointly independent as ranges over dyadic values that are much smaller than , which in the set model corresponds to the all being disjoint. One can then establish a variant
of (6), which in the set model roughly speaking asserts that each claims a portion of the of cardinality that is not claimed by previous choices of . This leads to a more efficient contradiction (relying on the unboundedness of rather than ) that looks like it removes one order of exponential growth, thus the relationship between and is now . Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form
for a small constant , which on iterating and using the boundedness of gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges greater than or so, although at this range the theorem is not significantly simpler than the general case).
Klaus Roth, who made fundamental contributions to analytic number theory, died this Tuesday, aged 90.
I never met or communicated with Roth personally, but was certainly influenced by his work; he wrote relatively few papers, but they tended to have outsized impact. For instance, he was one of the key people (together with Bombieri) to work on simplifying and generalising the large sieve, taking it from the technically formidable original formulation of Linnik and Rényi to the clean and general almost orthogonality principle that we have today (discussed for instance in these lecture notes of mine). The paper of Roth that had the most impact on my own personal work was his three-page paper proving what is now known as Roth’s theorem on arithmetic progressions:
Theorem 1 (Roth’s theorem on arithmetic progressions) Let be a set of natural numbers of positive upper density (thus ). Then contains infinitely many arithmetic progressions of length three (with non-zero of course).
At the heart of Roth’s elegant argument was the following (surprising at the time) dichotomy: if had some moderately large density within some arithmetic progression , either one could use Fourier-analytic methods to detect the presence of an arithmetic progression of length three inside , or else one could locate a long subprogression of on which had increased density. Iterating this dichotomy by an argument now known as the density increment argument, one eventually obtains Roth’s theorem, no matter which side of the dichotomy actually holds. This argument (and the many descendants of it), based on various “dichotomies between structure and randomness”, became essential in many other results of this type, most famously perhaps in Szemerédi’s proof of his celebrated theorem on arithmetic progressions that generalised Roth’s theorem to progressions of arbitrary length. More recently, my recent work on the Chowla and Elliott conjectures that was a crucial component of the solution of the Erdös discrepancy problem, relies on an entropy decrement argument which was directly inspired by the density increment argument of Roth.
The Erdös discrepancy problem also is connected with another well known theorem of Roth:
Theorem 2 (Roth’s discrepancy theorem for arithmetic progressions) Let be a sequence in . Then there exists an arithmetic progression in with positive such that
for an absolute constant .
In fact, Roth proved a stronger estimate regarding mean square discrepancy, which I am not writing down here; as with the Roth theorem in arithmetic progressions, his proof was short and Fourier-analytic in nature (although non-Fourier-analytic proofs have since been found, for instance the semidefinite programming proof of Lovasz). The exponent is known to be sharp (a result of Matousek and Spencer).
As a particular corollary of the above theorem, for an infinite sequence of signs, the sums are unbounded in . The Erdös discrepancy problem asks whether the same statement holds when is restricted to be zero. (Roth also established discrepancy theorems for other sets, such as rectangles, which will not be discussed here.)
Finally, one has to mention Roth’s most famous result, cited for instance in his Fields medal citation:
Theorem 3 (Roth’s theorem on Diophantine approximation) Let be an irrational algebraic number. Then for any there is a quantity such that
From the Dirichlet approximation theorem (or from the theory of continued fractions) we know that the exponent in the denominator cannot be reduced to or below. A classical and easy theorem of Liouville gives the claim with the exponent replaced by the degree of the algebraic number ; work of Thue and Siegel reduced this exponent, but Roth was the one who obtained the near-optimal result. An important point is that the constant is ineffective – it is a major open problem in Diophantine approximation to produce any bound significantly stronger than Liouville’s theorem with effective constants. This is because the proof of Roth’s theorem does not exclude any single rational from being close to , but instead very ingeniously shows that one cannot have two different rationals , that are unusually close to , even when the denominators are very different in size. (I refer to this sort of argument as a “dueling conspiracies” argument; they are strangely prevalent throughout analytic number theory.)
The most fundamental unsolved problem in complexity theory is undoubtedly the P=NP problem, which asks (roughly speaking) whether a problem which can be solved by a non-deterministic polynomial-time (NP) algorithm, can also be solved by a deterministic polynomial-time (P) algorithm. The general belief is that , i.e. there exist problems which can be solved by non-deterministic polynomial-time algorithms but not by deterministic polynomial-time algorithms.
One reason why the question is so difficult to resolve is that a certain generalisation of this question has an affirmative answer in some cases, and a negative answer in other cases. More precisely, if we give all the algorithms access to an oracle, then for one choice of this oracle, all the problems that are solvable by non-deterministic polynomial-time algorithms that calls (), can also be solved by a deterministic polynomial-time algorithm algorithm that calls (), thus ; but for another choice of this oracle, there exist problems solvable by non-deterministic polynomial-time algorithms that call , which cannot be solved by a deterministic polynomial-time algorithm that calls , thus . One particular consequence of this result (which is due to Baker, Gill, and Solovay) is that there cannot be any relativisable proof of either or , where “relativisable” means that the proof would also work without any changes in the presence of an oracle.
The Baker-Gill-Solovay result was quite surprising, but the idea of the proof turns out to be rather simple. To get an oracle such that , one basically sets to be a powerful simulator that can simulate non-deterministic machines (and, furthermore, can also simulate itself); it turns out that any PSPACE-complete oracle would suffice for this task. To get an oracle for which , one has to be a bit sneakier, setting to be a query device for a sparse set of random (or high-complexity) strings, which are too complex to be guessed at by any deterministic polynomial-time algorithm.
Unfortunately, the simple idea of the proof can be obscured by various technical details (e.g. using Turing machines to define and precisely), which require a certain amount of time to properly absorb. To help myself try to understand this result better, I have decided to give a sort of “allegory” of the proof, based around a (rather contrived) story about various students trying to pass a multiple choice test, which avoids all the technical details but still conveys the basic ideas of the argument. This allegory was primarily for my own benefit, but I thought it might also be of interest to some readers here (and also has some tangential relation to the proto-polymath project of determinstically finding primes), so I reproduce it below.
This week I am in Bremen, where the 50th International Mathematical Olympiad is being held. A number of former Olympians (Béla Bollobás, Tim Gowers, Laci Lovasz, Stas Smirnov, Jean-Christophe Yoccoz, and myself) were invited to give a short talk (20 minutes in length) at the celebratory event for this anniversary. I chose to talk on a topic I have spoken about several times before, on “Structure and randomness in the prime numbers“. Given the time constraints, there was a limit as to how much substance I could put into the talk; but I try to describe, in very general terms, what we know about the primes, and what we suspect to be true, but cannot yet establish. As I have mentioned in previous talks, the key problem is that we suspect the distribution of the primes to obey no significant patterns (other than “local” structure, such as having a strong tendency to be odd (which is local information at the 2 place), or obeying the prime number theorem (which is local information at the infinity place)), but we still do not have fully satisfactory tools for establishing the absence of a pattern. (This is in contrast with many types of Olympiad problems, where the key to solving a problem often lies in discovering the right pattern or structure in the problem to exploit.)
The PDF of the talk is here; I decided to try out the Beamer LaTeX package for a change.
A remarkable phenomenon in probability theory is that of universality – that many seemingly unrelated probability distributions, which ostensibly involve large numbers of unknown parameters, can end up converging to a universal law that may only depend on a small handful of parameters. One of the most famous examples of the universality phenomenon is the central limit theorem; another rich source of examples comes from random matrix theory, which is one of the areas of my own research.
Analogous universality phenomena also show up in empirical distributions – the distributions of a statistic from a large population of “real-world” objects. Examples include Benford’s law, Zipf’s law, and the Pareto distribution (of which the Pareto principle or 80-20 law is a special case). These laws govern the asymptotic distribution of many statistics which
- (i) take values as positive numbers;
- (ii) range over many different orders of magnitude;
- (iiii) arise from a complicated combination of largely independent factors (with different samples of arising from different independent factors); and
- (iv) have not been artificially rounded, truncated, or otherwise constrained in size.
Examples here include the population of countries or cities, the frequency of occurrence of words in a language, the mass of astronomical objects, or the net worth of individuals or corporations. The laws are then as follows:
- Benford’s law: For , the proportion of whose first digit is is approximately . Thus, for instance, should have a first digit of about of the time, but a first digit of only about of the time.
- Zipf’s law: The largest value of should obey an approximate power law, i.e. it should be approximately for the first few and some parameters . In many cases, is close to .
- Pareto distribution: The proportion of with at least digits (before the decimal point), where is above the median number of digits, should obey an approximate exponential law, i.e. be approximately of the form for some . Again, in many cases is close to .
Benford’s law and Pareto distribution are stated here for base , which is what we are most familiar with, but the laws hold for any base (after replacing all the occurrences of in the above laws with the new base, of course). The laws tend to break down if the hypotheses (i)-(iv) are dropped. For instance, if the statistic concentrates around its mean (as opposed to being spread over many orders of magnitude), then the normal distribution tends to be a much better model (as indicated by such results as the central limit theorem). If instead the various samples of the statistics are highly correlated with each other, then other laws can arise (for instance, the eigenvalues of a random matrix, as well as many empirically observed matrices, are correlated to each other, with the behaviour of the largest eigenvalues being governed by laws such as the Tracy-Widom law rather than Zipf’s law, and the bulk distribution being governed by laws such as the semicircular law rather than the normal or Pareto distributions).
To illustrate these laws, let us take as a data set the populations of 235 countries and regions of the world in 2007 (using the CIA world factbook); I have put the raw data here. This is a relatively small sample (cf. my previous post), but is already enough to discern these laws in action. For instance, here is how the data set tracks with Benford’s law (rounded to three significant figures):
Countries | Number | Benford prediction | |
1 | Angola, Anguilla, Aruba, Bangladesh, Belgium, Botswana, Brazil, Burkina Faso, Cambodia, Cameroon, Chad, Chile, China, Christmas Island, Cook Islands, Cuba, Czech Republic, Ecuador, Estonia, Gabon, (The) Gambia, Greece, Guam, Guatemala, Guinea-Bissau, India, Japan, Kazakhstan, Kiribati, Malawi, Mali, Mauritius, Mexico, (Federated States of) Micronesia, Nauru, Netherlands, Niger, Nigeria, Niue, Pakistan, Portugal, Russia, Rwanda, Saint Lucia, Saint Vincent and the Grenadines, Senegal, Serbia, Swaziland, Syria, Timor-Leste (East-Timor), Tokelau, Tonga, Trinidad and Tobago, Tunisia, Tuvalu, (U.S.) Virgin Islands, Wallis and Futuna, Zambia, Zimbabwe | 59 () | 71 () |
2 | Armenia, Australia, Barbados, British Virgin Islands, Cote d’Ivoire, French Polynesia, Ghana, Gibraltar, Indonesia, Iraq, Jamaica, (North) Korea, Kosovo, Kuwait, Latvia, Lesotho, Macedonia, Madagascar, Malaysia, Mayotte, Mongolia, Mozambique, Namibia, Nepal, Netherlands Antilles, New Caledonia Norfolk Island, Palau, Peru, Romania, Saint Martin, Samoa, San Marino, Sao Tome and Principe, Saudi Arabia, Slovenia, Sri Lanka, Svalbard, Taiwan, Turks and Caicos Islands, Uzbekistan, Vanuatu, Venezuela, Yemen | 44 () | 41 () |
3 | Afghanistan, Albania, Algeria, (The) Bahamas, Belize, Brunei, Canada, (Rep. of the) Congo, Falkland Islands (Islas Malvinas), Iceland, Kenya, Lebanon, Liberia, Liechtenstein, Lithuania, Maldives, Mauritania, Monaco, Morocco, Oman, (Occupied) Palestinian Territory, Panama, Poland, Puerto Rico, Saint Kitts and Nevis, Uganda, United States of America, Uruguay, Western Sahara | 29 () | 29 () |
4 | Argentina, Bosnia and Herzegovina, Burma (Myanmar), Cape Verde, Cayman Islands, Central African Republic, Colombia, Costa Rica, Croatia, Faroe Islands, Georgia, Ireland, (South) Korea, Luxembourg, Malta, Moldova, New Zealand, Norway, Pitcairn Islands, Singapore, South Africa, Spain, Sudan, Suriname, Tanzania, Ukraine, United Arab Emirates | 27 () | 22 () |
5 | (Macao SAR) China, Cocos Islands, Denmark, Djibouti, Eritrea, Finland, Greenland, Italy, Kyrgyzstan, Montserrat, Nicaragua, Papua New Guinea, Slovakia, Solomon Islands, Togo, Turkmenistan | 16 () | 19 () |
6 | American Samoa, Bermuda, Bhutan, (Dem. Rep. of the) Congo, Equatorial Guinea, France, Guernsey, Iran, Jordan, Laos, Libya, Marshall Islands, Montenegro, Paraguay, Sierra Leone, Thailand, United Kingdom | 17 () | 16 () |
7 | Bahrain, Bulgaria, (Hong Kong SAR) China, Comoros, Cyprus, Dominica, El Salvador, Guyana, Honduras, Israel, (Isle of) Man, Saint Barthelemy, Saint Helena, Saint Pierre and Miquelon, Switzerland, Tajikistan, Turkey | 17 () | 14 () |
8 | Andorra, Antigua and Barbuda, Austria, Azerbaijan, Benin, Burundi, Egypt, Ethiopia, Germany, Haiti, Holy See (Vatican City), Northern Mariana Islands, Qatar, Seychelles, Vietnam | 15 () | 12 () |
9 | Belarus, Bolivia, Dominican Republic, Fiji, Grenada, Guinea, Hungary, Jersey, Philippines, Somalia, Sweden | 11 () | 11 () |
Here is how the same data tracks Zipf’s law for the first twenty values of , with the parameters and (selected by log-linear regression), again rounding to three significant figures:
Country | Population | Zipf prediction | Deviation from prediction | |
1 | China | 1,330,000,000 | 1,280,000,000 | |
2 | India | 1,150,000,000 | 626,000,000 | |
3 | USA | 304,000,000 | 412,000,000 | |
4 | Indonesia | 238,000,000 | 307,000,000 | |
5 | Brazil | 196,000,000 | 244,000,000 | |
6 | Pakistan | 173,000,000 | 202,000,000 | |
7 | Bangladesh | 154,000,000 | 172,000,000 | |
8 | Nigeria | 146,000,000 | 150,000,000 | |
9 | Russia | 141,000,000 | 133,000,000 | |
10 | Japan | 128,000,000 | 120,000,000 | |
11 | Mexico | 110,000,000 | 108,000,000 | |
12 | Philippines | 96,100,000 | 98,900,000 | |
13 | Vietnam | 86,100,000 | 91,100,000 | |
14 | Ethiopia | 82,600,000 | 84,400,000 | |
15 | Germany | 82,400,000 | 78,600,000 | |
16 | Egypt | 81,700,000 | 73,500,000 | |
17 | Turkey | 71,900,000 | 69,100,000 | |
18 | Congo | 66,500,000 | 65,100,000 | |
19 | Iran | 65,900,000 | 61,600,000 | |
20 | Thailand | 65,500,000 | 58,400,000 |
As one sees, Zipf’s law is not particularly precise at the extreme edge of the statistics (when is very small), but becomes reasonably accurate (given the small sample size, and given that we are fitting twenty data points using only two parameters) for moderate sizes of .
This data set has too few scales in base to illustrate the Pareto distribution effectively – over half of the country populations are either seven or eight digits in that base. But if we instead work in base , then country populations range in a decent number of scales (the majority of countries have population between and ), and we begin to see the law emerge, where is now the number of digits in binary, the best-fit parameters are and :
Countries with binary digit populations | Number | Pareto prediction | |
31 | China, India | 2 | 1 |
30 | ” | 2 | 2 |
29 | “, United States of America | 3 | 5 |
28 | “, Indonesia, Brazil, Pakistan, Bangladesh, Nigeria, Russia | 9 | 8 |
27 | “, Japan, Mexico, Philippines, Vietnam, Ethiopia, Germany, Egypt, Turkey | 17 | 15 |
26 | “, (Dem. Rep. of the) Congo, Iran, Thailand, France, United Kingdom, Italy, South Africa, (South) Korea, Burma (Myanmar), Ukraine, Colombia, Spain, Argentina, Sudan, Tanzania, Poland, Kenya, Morocco, Algeria | 36 | 27 |
25 | “, Canada, Afghanistan, Uganda, Nepal, Peru, Iraq, Saudi Arabia, Uzbekistan, Venezuela, Malaysia, (North) Korea, Ghana, Yemen, Taiwan, Romania, Mozambique, Sri Lanka, Australia, Cote d’Ivoire, Madagascar, Syria, Cameroon | 58 | 49 |
24 | “, Netherlands, Chile, Kazakhstan, Burkina Faso, Cambodia, Malawi, Ecuador, Niger, Guatemala, Senegal, Angola, Mali, Zambia, Cuba, Zimbabwe, Greece, Portugal, Belgium, Tunisia, Czech Republic, Rwanda, Serbia, Chad, Hungary, Guinea, Belarus, Somalia, Dominican Republic, Bolivia, Sweden, Haiti, Burundi, Benin | 91 | 88 |
23 | “, Austria, Azerbaijan, Honduras, Switzerland, Bulgaria, Tajikistan, Israel, El Salvador, (Hong Kong SAR) China, Paraguay, Laos, Sierra Leone, Jordan, Libya, Papua New Guinea, Togo, Nicaragua, Eritrea, Denmark, Slovakia, Kyrgyzstan, Finland, Turkmenistan, Norway, Georgia, United Arab Emirates, Singapore, Bosnia and Herzegovina, Croatia, Central African Republic, Moldova, Costa Rica | 123 | 159 |
Thus, with each new scale, the number of countries introduced increases by a factor of a little less than , on the average. This approximate doubling of countries with each new scale begins to falter at about the population (i.e. at around million), for the simple reason that one has begun to run out of countries. (Note that the median-population country in this set, Singapore, has a population with binary digits.)
These laws are not merely interesting statistical curiosities; for instance, Benford’s law is often used to help detect fraudulent statistics (such as those arising from accounting fraud), as many such statistics are invented by choosing digits at random, and will therefore deviate significantly from Benford’s law. (This is nicely discussed in Robert Matthews’ New Scientist article “The power of one“; this article can also be found on the web at a number of other places.) In a somewhat analogous spirit, Zipf’s law and the Pareto distribution can be used to mathematically test various models of real-world systems (e.g. formation of astronomical objects, accumulation of wealth, population growth of countries, etc.), without necessarily having to fit all the parameters of that model with the actual data.
Being empirically observed phenomena rather than abstract mathematical facts, Benford’s law, Zipf’s law, and the Pareto distribution cannot be “proved” the same way a mathematical theorem can be proved. However, one can still support these laws mathematically in a number of ways, for instance showing how these laws are compatible with each other, and with other plausible hypotheses on the source of the data. In this post I would like to describe a number of ways (both technical and non-technical) in which one can do this; these arguments do not fully explain these laws (in particular, the empirical fact that the exponent in Zipf’s law or the Pareto distribution is often close to is still quite a mysterious phenomenon), and do not always have the same universal range of applicability as these laws seem to have, but I hope that they do demonstrate that these laws are not completely arbitrary, and ought to have a satisfactory basis of mathematical support. Read the rest of this entry »
One further paper in this stream: László Erdős, José Ramírez, Benjamin Schlein, Van Vu, Horng-Tzer Yau, and myself have just uploaded to the arXiv the paper “Bulk universality for Wigner hermitian matrices with subexponential decay“, submitted to Mathematical Research Letters. (Incidentally, this is my first six-author paper I have been involved in, not counting the polymath projects of course, though I have had a number of five-author papers.)
This short paper (9 pages) combines the machinery from two recent papers on the universality conjecture for the eigenvalue spacings in the bulk for Wigner random matrices (see my earlier blog post for more discussion). On the one hand, the paper of Erdős-Ramírez-Schlein-Yau established this conjecture under the additional hypothesis that the distribution of the individual entries obeyed some smoothness and exponential decay conditions. Meanwhile, the paper of Van Vu and myself (which I discussed in my earlier blog post) established the conjecture under a somewhat different set of hypotheses, namely that the distribution of the individual entries obeyed some moment conditions (in particular, the third moment had to vanish), a support condition (the entries had to have real part supported in at least three points), and an exponential decay condition.
After comparing our results, the six of us realised that our methods could in fact be combined rather easily to obtain a stronger result, establishing the universality conjecture assuming only a exponential decay (or more precisely, sub-exponential decay) bound on the coefficients; thus all regularity, moment, and support conditions have been eliminated. (There is one catch, namely that we can no longer control a single spacing for a single fixed i, but must now average over all before recovering the universality. This is an annoying technical issue but it may be resolvable in the future with further refinements to the method.)
I can describe the main idea behind the unified approach here. One can arrange the Wigner matrices in a hierarchy, from most structured to least structured:
- The most structured (or special) ensemble is the Gaussian Unitary Ensemble (GUE), in which the coefficients are gaussian. Here, one has very explicit and tractable formulae for the eigenvalue distributions, gap spacing, etc.
- The next most structured ensemble of Wigner matrices are the Gaussian-divisible or Johansson matrices, which are matrices H of the form , where is another Wigner matrix, V is a GUE matrix independent of , and is a fixed parameter independent of n. Here, one still has quite explicit (though not quite as tractable) formulae for the joint eigenvalue distribution and related statistics. Note that the limiting case t=1 is GUE.
- After this, one has the Ornstein-Uhlenbeck-evolved matrices, which are also of the form , but now decays at a power rate with n, rather than being comparable to 1. Explicit formulae still exist for these matrices, but extracting universality out of this is hard work (and occupies the bulk of the paper of Erdős-Ramírez-Schlein-Yau).
- Finally, one has arbitrary Wigner matrices, which can be viewed as the t=0 limit of the above Ornstein-Uhlenbeck process.
The arguments in the paper of Erdős-Ramírez-Schlein-Yau can be summarised as follows (I assume subexponential decay throughout this discussion):
- (Structured case) The universality conjecture is true for Ornstein-Uhlenbeck-evolved matrices with for any . (The case was treated in an earlier paper of Erdős-Ramírez-Schlein-Yau, while the case where t is comparable to 1 was treated by Johansson.)
- (Matching) Every Wigner matrix with suitable smoothness conditions can be “matched” with an Ornstein-Uhlenbeck-evolved matrix, in the sense that the eigenvalue statistics for the two matrices are asymptotically identical. (This is relatively easy due to the fact that can be taken arbitrarily close to zero.)
- Combining 1. and 2. one obtains universality for all Wigner matrices obeying suitable smoothness conditions.
The arguments in the paper of Van and myself can be summarised as follows:
- (Structured case) The universality conjecture is true for Johansson matrices, by the paper of Johansson.
- (Matching) Every Wigner matrix with some moment and support conditions can be “matched” with a Johansson matrix, in the sense that the first four moments of the entries agree, and hence (by the Lindeberg strategy in our paper) have asymptotically identical statistics.
- Combining 1. and 2. one obtains universality for all Wigner matrices obtaining suitable moment and support conditions.
What we realised is by combining the hard part 1. of the paper of Erdős-Ramírez-Schlein-Yau with the hard part 2. of the paper of Van and myself, we can remove all regularity, moment, and support conditions. Roughly speaking, the unified argument proceeds as follows:
- (Structured case) By the arguments of Erdős-Ramírez-Schlein-Yau, the universality conjecture is true for Ornstein-Uhlenbeck-evolved matrices with for any .
- (Matching) Every Wigner matrix can be “matched” with an Ornstein-Uhlenbeck-evolved matrix for (say), in the sense that the first four moments of the entries almost agree, which is enough (by the arguments of Van and myself) to show that these two matrices have asymptotically identical statistics on the average.
- Combining 1. and 2. one obtains universality for the averaged statistics for all Wigner matrices.
The averaging should be removable, but this would require better convergence results to the semicircular law than are currently known (except with additional hypotheses, such as vanishing third moment). The subexponential decay should also be relaxed to a condition of finiteness for some fixed moment , but we did not pursue this direction in order to keep the paper short.
One of the most important topological concepts in analysis is that of compactness (as discussed for instance in my Companion article on this topic). There are various flavours of this concept, but let us focus on sequential compactness: a subset E of a topological space X is sequentially compact if every sequence in E has a convergent subsequence whose limit is also in E. This property allows one to do many things with the set E. For instance, it allows one to maximise a functional on E:
Proposition 1. (Existence of extremisers) Let E be a non-empty sequentially compact subset of a topological space X, and let be a continuous function. Then the supremum is attained at at least one point , thus for all . (In particular, this supremum is finite.) Similarly for the infimum.
Proof. Let be the supremum . By the definition of supremum (and the axiom of (countable) choice), one can find a sequence in E such that . By compactness, we can refine this sequence to a subsequence (which, by abuse of notation, we shall continue to call ) such that converges to a limit x in E. Since we still have , and f is continuous at x, we conclude that f(x)=L, and the claim for the supremum follows. The claim for the infimum is similar.
Remark 1. An inspection of the argument shows that one can relax the continuity hypothesis on F somewhat: to attain the supremum, it suffices that F be upper semicontinuous, and to attain the infimum, it suffices that F be lower semicontinuous.
We thus see that sequential compactness is useful, among other things, for ensuring the existence of extremisers. In finite-dimensional spaces (such as vector spaces), compact sets are plentiful; indeed, the Heine-Borel theorem asserts that every closed and bounded set is compact. However, once one moves to infinite-dimensional spaces, such as function spaces, then the Heine-Borel theorem fails quite dramatically; most of the closed and bounded sets one encounters in a topological vector space are non-compact, if one insists on using a reasonably “strong” topology. This causes a difficulty in (among other things) calculus of variations, which is often concerned to finding extremisers to a functional on a subset E of an infinite-dimensional function space X.
In recent decades, mathematicians have found a number of ways to get around this difficulty. One of them is to weaken the topology to recover compactness, taking advantage of such results as the Banach-Alaoglu theorem (or its sequential counterpart). Of course, there is a tradeoff: weakening the topology makes compactness easier to attain, but makes the continuity of F harder to establish. Nevertheless, if F enjoys enough “smoothing” or “cancellation” properties, one can hope to obtain continuity in the weak topology, allowing one to do things such as locate extremisers. (The phenomenon that cancellation can lead to continuity in the weak topology is sometimes referred to as compensated compactness.)
Another option is to abandon trying to make all sequences have convergent subsequences, and settle just for extremising sequences to have convergent subsequences, as this would still be enough to retain Theorem 1. Pursuing this line of thought leads to the Palais-Smale condition, which is a substitute for compactness in some calculus of variations situations.
But in many situations, one cannot weaken the topology to the point where the domain E becomes compact, without destroying the continuity (or semi-continuity) of F, though one can often at least find an intermediate topology (or metric) in which F is continuous, but for which E is still not quite compact. Thus one can find sequences in E which do not have any subsequences that converge to a constant element , even in this intermediate metric. (As we shall see shortly, one major cause of this failure of compactness is the existence of a non-trivial action of a non-compact group G on E; such a group action can cause compensated compactness or the Palais-Smale condition to fail also.) Because of this, it is a priori conceivable that a continuous function F need not attain its supremum or infimum.
Nevertheless, even though a sequence does not have any subsequences that converge to a constant x, it may have a subsequence (which we also call ) which converges to some non-constant sequence (in the sense that the distance between the subsequence and the new sequence in a this intermediate metric), where the approximating sequence is of a very structured form (e.g. “concentrating” to a point, or “travelling” off to infinity, or a superposition of several concentrating or travelling profiles of this form). This weaker form of compactness, in which superpositions of a certain type of profile completely describe all the failures (or defects) of compactness, is known as concentration compactness, and the decomposition of the subsequence is known as the profile decomposition. In many applications, it is a sufficiently good substitute for compactness that one can still do things like locate extremisers for functionals F – though one often has to make some additional assumptions of F to compensate for the more complicated nature of the compactness. This phenomenon was systematically studied by P.L. Lions in the 80s, and found great application in calculus of variations and nonlinear elliptic PDE. More recently, concentration compactness has been a crucial and powerful tool in the non-perturbative analysis of nonlinear dispersive PDE, in particular being used to locate “minimal energy blowup solutions” or “minimal mass blowup solutions” for such a PDE (analogously to how one can use calculus of variations to find minimal energy solutions to a nonlinear elliptic equation); see for instance this recent survey by Killip and Visan.
In typical applications, the concentration compactness phenomenon is exploited in moderately sophisticated function spaces (such as Sobolev spaces or Strichartz spaces), with the failure of traditional compactness being connected to a moderately complicated group G of symmetries (e.g. the group generated by translations and dilations). Because of this, concentration compactness can appear to be a rather complicated and technical concept when it is first encountered. In this note, I would like to illustrate concentration compactness in a simple toy setting, namely in the space of absolutely summable sequences, with the uniform () metric playing the role of the intermediate metric, and the translation group playing the role of the symmetry group G. This toy setting is significantly simpler than any model that one would actually use in practice [for instance, in most applications X is a Hilbert space], but hopefully it serves to illuminate this useful concept in a less technical fashion.
The U.S. presidential election is now only a few weeks away. The politics of this election are of course interesting and important, but I do not want to discuss these topics here (there is not exactly a shortage of other venues for such a discussion), and would request that readers refrain from doing so in the comments to this post. However, I thought it would be apropos to talk about some of the basic mathematics underlying electoral polling, and specifically to explain the fact, which can be highly unintuitive to those not well versed in statistics, that polls can be accurate even when sampling only a tiny fraction of the entire population.
Take for instance a nationwide poll of U.S. voters on which presidential candidate they intend to vote for. A typical poll will ask a number of randomly selected voters for their opinion; a typical value here is . In contrast, the total voting-eligible population of the U.S. – let’s call this set – is about 200 million. (The actual turnout in the election is likely to be closer to 100 million, but let’s ignore this fact for the sake of discussion.) Thus, such a poll would sample about 0.0005% of the total population – an incredibly tiny fraction. Nevertheless, the margin of error (at the 95% confidence level) for such a poll, if conducted under idealised conditions (see below), is about 3%. In other words, if we let denote the proportion of the entire population that will vote for a given candidate , and let denote the proportion of the polled voters that will vote for , then the event will occur with probability at least 0.95. Thus, for instance (and oversimplifying a little – see below), if the poll reports that 55% of respondents would vote for A, then the true percentage of the electorate that would vote for A has at least a 95% chance of lying between 52% and 58%. Larger polls will of course give a smaller margin of error; for instance the margin of error for an (idealised) poll of 2,000 voters is about 2%.
I’ll give a rigorous proof of a weaker version of the above statement (giving a margin of error of about 7%, rather than 3%) in an appendix at the end of this post. But the main point of my post here is a little different, namely to address the common misconception that the accuracy of a poll is a function of the relative sample size rather than the absolute sample size, which would suggest that a poll involving only 0.0005% of the population could not possibly have a margin of error as low as 3%. I also want to point out some limitations of the mathematical analysis; depending on the methodology and the context, some polls involving 1000 respondents may have a much higher margin of error than the idealised rate of 3%.
Recent Comments