Theorem 1 (Halasz inequality) Let be a multiplicative function bounded in magnitude by , and suppose that , , and are such that

As a qualitative corollary, we conclude (by standard compactness arguments) that if

as . In more recent work, in this paper of Granville and Soundararajan, the sharper bound

is obtained (with a more precise description of the term).

The usual proofs of Halasz’s theorem are somewhat lengthy (though there has been a recent simplification, in forthcoming work of Granville, Harper, and Soundararajan). Below the fold I would like to give a relatively short proof of the following “cheap” version of the inequality, which has slightly weaker quantitative bounds, but still suffices to give qualitative conclusions such as (2).

Theorem 2 (Cheap Halasz inequality) Let be a multiplicative function bounded in magnitude by . Let and , and suppose that is sufficiently large depending on . If (1) holds for all , then

The non-optimal exponent can probably be improved a bit by being more careful with the exponents, but I did not try to optimise it here. A similar bound appears in the first paper of Halasz on this topic.

The idea of the argument is to split as a Dirichlet convolution where is the portion of coming from “small”, “medium”, and “large” primes respectively (with the dividing line between the three types of primes being given by various powers of ). Using a Perron-type formula, one can express this convolution in terms of the product of the Dirichlet series of respectively at various complex numbers with . One can use based estimates to control the Dirichlet series of , while using the hypothesis (1) one can get estimates on the Dirichlet series of . (This is similar to the Fourier-analytic approach to ternary additive problems, such as Vinogradov’s theorem on representing large odd numbers as the sum of three primes.) This idea was inspired by a similar device used in the work of Granville, Harper, and Soundararajan. A variant of this argument also appears in unpublished work of Adam Harper.

I thank Andrew Granville for helpful comments which led to significant simplifications of the argument.

** — 1. Basic estimates — **

We need the following basic tools from analytic number theory. We begin with a variant of the classical Perron formula.

Proposition 3 (Perron type formula) Let be an arithmetic function bounded in magnitude by , and let . Assume that the Dirichlet series is absolutely convergent for . Then

*Proof:* By telescoping series (and treating the contribution of trivially), it suffices to show that

whenever .

The left-hand side can be written as

where . We now introduce the mollified version

of , where

and is a fixed smooth function supported on that equals at the origin. Basic Fourier analysis then tells us that is a Schwartz function with total mass one. This gives the crude bound

for any . For or , we use the bound (say) to arrive at the bound

for we again use and write

and use the Lipschitz bound for to obtain

for such . Putting all these bounds together, we see that

for all . In particular, we can write (3) as

The expression is bounded by when , is bounded by when , is bounded by when or , and is bounded by otherwise. From these bounds, a routine calculation (using the hypothesis ) shows that

and so it remains to show that

Writing

where

we see from the triangle inequality and the support of that

But from integration by parts we see that , and the claim follows.

Next, we recall a standard mean value estimate for Dirichlet series:

Proposition 4 ( mean value estimate) Let be an arithmetic function, and let . Assume that the Dirichlet series is absolutely convergent for . Then

*Proof:* This follows from Lemma 7.1 of Iwaniec-Kowalski; for the convenience of the reader we reproduce the short proof here. Introducing the normalised sinc function , we have

But a standard Fourier-analytic computation shows that vanishes unless , in which case the integral is , and the claim follows.

Now we recall a basic sieve estimate:

Proposition 5 (Sieve bound) Let , let be an interval of length , and let be a set of primes up to . If we remove one residue class mod from for every , the number of remaining natural numbers in is at most .

*Proof:* This follows for instance from the fundamental lemma of sieve theory (see e.g. Corollary 19 of this blog post). (One can also use the Selberg sieve or the large sieve.)

Finally, we record a standard estimate on the number of smooth numbers:

Proposition 6 Let and , and suppose that is sufficiently large depending on . Then the number of natural numbers in which have no prime factor larger than is at most .

*Proof:* See Corollary 1.3 of this paper of Hildebrand and Tenenbaum. (The result also follows from the more classical work of Dickman.) We sketch a short proof here due to Kevin Ford. Let denote the set of numbers that are “smooth” in the sense that they have no prime factor larger than . It then suffices to prove the bound

since the contribution of those less than (say) is negligible, and for the other values of , is comparable to . Writing , we can rearrange the left-hand side as

By the prime number theorem, the contribution to of those is , and the contribution of those with consists only of prime powers, which contribute . Combining these estimates, we can get a bound of the form

where is a quantity to be chosen later. Thus we can bound the left-hand side of (4) by

which by Euler products can be bounded by

By the mean value theorem applied to the function , we can bound by for . By Mertens’ theorem, we thus get a bound of

If we make the choice , we obtain the required bound (4).
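To get a feel for the quantity bounded in Proposition 6, one can count smooth numbers by brute force. The following is only a minimal sketch: the function names are hypothetical, and the thresholds in the demo loop are illustrative choices rather than the parameters used in the proposition.

```python
def largest_prime_factor_table(limit):
    """Return lpf with lpf[n] = largest prime factor of n (and lpf[1] = 1)."""
    lpf = [1] * (limit + 1)
    for p in range(2, limit + 1):
        if lpf[p] == 1:  # no smaller prime has marked p, so p is prime
            for multiple in range(p, limit + 1, p):
                # primes are visited in increasing order, so the last write wins
                lpf[multiple] = p
    return lpf


def count_smooth(x, y):
    """Count the natural numbers n <= x with no prime factor larger than y."""
    lpf = largest_prime_factor_table(x)
    return sum(1 for n in range(1, x + 1) if lpf[n] <= y)


# Illustrative demo: the proportion of x^(1/u)-smooth numbers up to x drops
# quickly as u grows (x and the range of u are arbitrary choices).
x = 10 ** 5
for u in (1, 2, 3, 4):
    y = int(round(x ** (1.0 / u)))
    print(u, y, count_smooth(x, y) / x)
```

The sieve overwrites each entry with successively larger primes, which avoids factoring every integer separately.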

** — 2. Proof of theorem — **

By increasing as necessary we may assume that (say). Let be small parameters (depending on ) to be optimised later; we assume to be sufficiently large depending on . Call a prime *small* if , *medium* if , and *large* if . Observe that for any we can factorise as a Dirichlet convolution

where

- is the restriction of to those natural numbers whose prime factors are all small;
- is the restriction of to those natural numbers whose prime factors are all medium;
- is the restriction of to those natural numbers whose prime factors are all large.
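The factorisation into three restricted pieces can be checked numerically. The sketch below uses the Möbius function as a sample multiplicative function bounded by one, and the cutoffs 3 and 11 separating small, medium, and large primes are hypothetical stand-ins for the powers of the main parameter used in the argument; all function names are illustrative.

```python
def prime_factors(n):
    """Return the set of primes dividing n, by trial division."""
    factors = set()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def mobius(n):
    """Moebius function, used here as a sample 1-bounded multiplicative function."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # n is divisible by p^2
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result


def restrict(values, limit, in_class):
    """Keep values[n] only when every prime factor of n satisfies in_class."""
    return [0] + [
        values[n] if all(in_class(p) for p in prime_factors(n)) else 0
        for n in range(1, limit + 1)
    ]


def dirichlet_convolution(g, h, limit):
    """Return (g * h)(n) = sum over d*e = n of g(d) h(e), for n <= limit."""
    out = [0] * (limit + 1)
    for d in range(1, limit + 1):
        if g[d] == 0:
            continue
        for e in range(1, limit // d + 1):
            out[d * e] += g[d] * h[e]
    return out


LIMIT = 300
SMALL_CUTOFF, MEDIUM_CUTOFF = 3, 11  # hypothetical dividing lines

f = [0] + [mobius(n) for n in range(1, LIMIT + 1)]
f1 = restrict(f, LIMIT, lambda p: p <= SMALL_CUTOFF)
f2 = restrict(f, LIMIT, lambda p: SMALL_CUTOFF < p <= MEDIUM_CUTOFF)
f3 = restrict(f, LIMIT, lambda p: p > MEDIUM_CUTOFF)

product = dirichlet_convolution(dirichlet_convolution(f1, f2, LIMIT), f3, LIMIT)
print(product[1:] == f[1:])
```

The check works because every natural number factors uniquely into a part built from small primes, a part built from medium primes, and a part built from large primes, so only one term of the triple convolution is non-zero.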

It is convenient to remove the Dirac function from , so we write

and split

Note that is the restriction of to those numbers whose prime factors are all small or medium. By Proposition 6, the number of such can certainly be bounded by if is sufficiently large. Thus the contribution of this term to (5) is .

Similarly, is the restriction of to those numbers which contain at least one large prime factor, but no medium prime factors. By Proposition 5 the number of such is bounded by if is sufficiently large. Thus the contribution of this term to (5) is , and hence

Note that is only supported on numbers whose prime factors do not exceed , so the Dirichlet series of is absolutely convergent for and is equal to , where are the Dirichlet series of respectively. Since is bounded in magnitude by (being a restriction of ), we may apply Proposition 3 and conclude (for large enough, and discarding the denominator) that

We now record some estimates:

Lemma 7 For sufficiently large , we have and

*Proof:* We just prove the former inequality, as the latter is similar. By Proposition 4, we have

The term vanishes unless , and we have , so we can bound the right-hand side by

The inner summand is bounded by and supported on those that are not divisible by any small primes. From Proposition 5 and Mertens’ theorem we conclude that

and thus

as desired.

We also have an estimate:

Lemma 8 For sufficiently large , we have for all .

*Proof:* From Euler products, Mertens’ theorem, and (1) we have

as desired.

Applying Hölder’s inequality, we conclude that

Setting and we obtain the claim.


Theorem 1 (Central limit theorem) Let be iid copies of a real random variable of mean and variance , and write . Then, for any fixed , we have

This is however not the end of the matter; there are many variants, refinements, and generalisations of the central limit theorem, and the purpose of this set of notes is to present a small sample of these variants.

First of all, the above theorem does not quantify the *rate* of convergence in (1). We have already addressed this issue to some extent with the Berry-Esséen theorem, which roughly speaking gives a convergence rate of uniformly in if we assume that has finite third moment. However there are still some quantitative versions of (1) which are not addressed by the Berry-Esséen theorem. For instance one may be interested in bounding the *large deviation probabilities*

in the setting where grows with . The central limit theorem (1) suggests that this probability should be bounded by something like ; however, this theorem only kicks in when is very large compared with . For instance, if one uses the Berry-Esséen theorem, one would need as large as or so to reach the desired bound of , even under the assumption of finite third moment. Basically, the issue is that convergence-in-distribution results, such as the central limit theorem, only really control the *typical* behaviour of statistics in ; they are much less effective at controlling the very rare *outlier* events in which the statistic strays far from its typical behaviour. Fortunately, there are large deviation inequalities (or *concentration of measure inequalities*) that do provide exponential type bounds for quantities such as (2), which are valid for both small and large values of . A basic example of this is the Chernoff bound that made an appearance in Exercise 47 of Notes 4; here we give some further basic inequalities of this type, including versions of the Bennett and Hoeffding inequalities.

In the other direction, we can also look at the fine scale behaviour of the sums by trying to control probabilities such as

where is now bounded (but can grow with ). The central limit theorem predicts that this quantity should be roughly , but even if one is able to invoke the Berry-Esséen theorem, one cannot quite see this main term because it is dominated by the error term in Berry-Esséen. There is good reason for this: if for instance takes integer values, then also takes integer values, and can vanish when is less than and is slightly larger than an integer. However, this turns out to essentially be the only obstruction; if does not lie in a lattice such as , then we can establish a *local limit theorem* controlling (3), and when does take values in a lattice like , there is a discrete local limit theorem that controls probabilities such as . Both of these limit theorems will be proven by the Fourier-analytic method used in the previous set of notes.

We also discuss other limit theorems in which the limiting distribution is something other than the normal distribution. Perhaps the most common examples of these are the Poisson limit theorems, in which one sums a large number of indicator variables (or approximate indicator variables), each of which is rarely non-zero, but which collectively add up to a random variable of medium-sized mean. In this case, it turns out that the limiting distribution should be a Poisson random variable; this again is an easy application of the Fourier method. We also briefly discuss limit theorems for stable laws other than the normal distribution, which are suitable for summing random variables of infinite variance, such as the Cauchy distribution.

Finally, we mention a very important class of generalisations of the CLT (and of the variants of the CLT discussed in this post), in which the hypothesis of joint independence between the variables is relaxed; for instance, one could assume only that the form a martingale. Many (though not all) of the proofs of the CLT extend to these more general settings, and this turns out to be important for many applications in which one does not expect joint independence. However, we will not discuss these generalisations in this course, as they are better suited for subsequent courses in this series, when the theory of martingales, conditional expectation, and related tools is developed.

** — 1. Large deviation inequalities — **

We now look at some upper bounds for the large deviation probability (2). To get some intuition as to what kinds of bounds one can expect, we first consider some examples. First suppose that has the standard normal distribution; then , , and has the distribution of , so that has the distribution of . We thus have

which on using the inequality leads to the bound

Next, we consider the example when is a Bernoulli random variable drawn uniformly at random from . Then , , and has the standard binomial distribution on , thus . By symmetry, we then have

We recall Stirling’s formula, which we write crudely as

as , where denotes a quantity with as (and similarly for other uses of the notation in the sequel). If and is bounded away from zero and one, we then have the asymptotic

where is the entropy function

(compare with Exercise 17 of Notes 3). One can check that is decreasing for , and so one can compute that

as for any fixed . To compare this with (4), observe from Taylor expansion that

as .
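The entropy asymptotic for binomial coefficients in this example is easy to test numerically. The sketch below assumes the standard form of the entropy function, namely theta log(1/theta) + (1 - theta) log(1/(1 - theta)), and the standard statement that the logarithm of the binomial coefficient, divided by the number of trials, approaches this entropy; the function names are hypothetical.

```python
import math


def entropy(theta):
    """Assumed entropy function: theta*log(1/theta) + (1-theta)*log(1/(1-theta))."""
    return -theta * math.log(theta) - (1 - theta) * math.log(1 - theta)


def log_binomial(n, m):
    """Natural logarithm of the binomial coefficient C(n, m), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)


def normalised_log_binomial(n, theta):
    """(1/n) * log C(n, m) with m the nearest integer to theta * n."""
    m = round(theta * n)
    return log_binomial(n, m) / n


# The gap to the entropy shrinks as n grows (theta = 0.3 is an arbitrary choice).
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, normalised_log_binomial(n, 0.3) - entropy(0.3))
```

Working with log-gamma rather than the binomial coefficient itself keeps the computation in floating point even for very large arguments.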

Finally, consider the example where takes values in with and for some small , thus and . We have , and hence

with

Here, we see that the large deviation probability is somewhat larger than the gaussian prediction of . Instead, the exponent is approximately related to and by the formula

We now give a general large deviations inequality that is consistent with the above examples.

Proposition 2 (Cheap Bennett inequality) Let , and let be independent random variables, each of which takes values in an interval of length at most . Write , and write for the mean of . Let be such that has variance at most . Then for any , we have

There is a more precise form of this inequality known as Bennett’s inequality, but we will not prove it here.

The first term in the minimum dominates when , and the second term dominates when . Sometimes it is convenient to weaken the estimate by discarding the logarithmic factor, leading to

(possibly with a slightly different choice of ); thus we have Gaussian type large deviation estimates for as large as , and (slightly better than) exponential decay after that.

In the case when are iid copies of a random variable of mean and variance taking values in an interval of length , we have and , and the above inequality simplifies slightly to

*Proof:* We first begin with some quick reductions. Firstly, by dividing the (and , , and ) by , we may normalise ; by subtracting the mean from each of the , we may assume that the have mean zero, so that as well. We also write for the variance of each , so that . Our task is to show that

for all . We will just prove the upper tail bound

the lower tail bound then follows by replacing all with their negations , and the claim then follows by summing the two estimates.

We use the “exponential moment method”, previously seen in proving the Chernoff bound (Exercise 47 of Notes 4), in which one uses the exponential moment generating function of . On the one hand, from Markov’s inequality one has

for any real parameter . On the other hand, from the joint independence of the one has

Since the take values in an interval of length at most and have mean zero, we have and so . This leads to the Taylor expansion

so on taking expectations we have

Putting all this together, we conclude that

If (say), one can then set to be a small multiple of to obtain a bound of the form

If instead , one can set to be a small multiple of to obtain a bound of the form

In either case, the claim follows.
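The exponential moment method can be tried out on the simplest case of sums of independent random signs. The sketch below assumes the signs are uniform on {-1, +1}, so that the exponential moment of each sign is cosh(t); it compares the exact upper tail with the Markov bound optimised over a grid of parameters t. The function names and the grid are hypothetical choices.

```python
import math


def exact_upper_tail(n, threshold):
    """P(S_n >= threshold) for S_n a sum of n independent uniform +-1 signs."""
    # S_n = 2k - n, where k counts the +1 signs and is Binomial(n, 1/2).
    k_min = max(0, math.ceil((threshold + n) / 2))
    favourable = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return favourable / 2 ** n


def exponential_moment_bound(n, threshold, t):
    """Markov bound exp(-t*threshold) * E exp(t*S_n), with E exp(t*S_n) = cosh(t)**n."""
    return math.exp(-t * threshold + n * math.log(math.cosh(t)))


def optimised_bound(n, lam, grid=200):
    """Minimise the Markov bound for P(S_n >= lam*sqrt(n)) over a grid of t > 0."""
    threshold = lam * math.sqrt(n)
    t_gauss = lam / math.sqrt(n)  # the choice of t that yields exp(-lam^2/2)
    candidates = [t_gauss] + [2 * t_gauss * (j + 1) / grid for j in range(grid)]
    return min(exponential_moment_bound(n, threshold, t) for t in candidates)


n, lam = 100, 2.0
print(exact_upper_tail(n, lam * math.sqrt(n)))  # true tail probability
print(optimised_bound(n, lam))                  # exponential moment bound
print(math.exp(-lam ** 2 / 2))                  # gaussian-type bound
```

As expected, the exact tail sits below the optimised bound, which in turn sits below the gaussian-type bound; the method loses a constant factor but captures the correct exponential decay.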

The following variant of the above proposition is also useful, in which we get a simpler bound at the expense of worsening the quantity slightly:

Proposition 3 (Cheap Hoeffding inequality) Let be independent random variables, with each taking values in an interval with . Write , and write for the mean of , and write

In fact one can take , a fact known as Hoeffding’s inequality; see Exercise 6 below for a special case of this.

*Proof:* We again normalise the to have mean zero, so that . We then have for each , so by Taylor expansion we have for any real that

and thus

Multiplying in , we then have

and one can now repeat the previous arguments (but without the factor to deal with).

Remark 4 In the above examples, the underlying random variable was assumed to either be restricted to an interval, or to be subgaussian. This type of hypothesis is necessary if one wishes to have estimates on (2) that are similarly subgaussian. For instance, suppose has a zeta distribution for some and all natural numbers , where . One can check that this distribution has finite mean and variance for . On the other hand, since we trivially have , we have the crude lower bound

which shows that in this case the expression (2) only decays at a polynomial rate in rather than an exponential or subgaussian rate.

Exercise 5 (Khintchine inequality) Let be iid copies of a Bernoulli random variable drawn uniformly from .

- (i) For any non-negative reals and any , show that
for some constant depending only on . When , show that one can take and equality holds.

- (ii) With the hypotheses in (i), obtain the matching lower bound
for some depending only on . (Hint: use (i) and Hölder’s inequality.)

- (iii) For any and any functions on a measure space , show that
and
with the same constants as in (i), (ii). When , show that one can take and equality holds.

- (iv) (Marcinkiewicz-Zygmund theorem) The Khintchine inequality is very useful in real analysis; we give one example here. Let be measure spaces, let , and suppose is a linear operator obeying the bound
for all and some finite . Show that for any finite sequence , one has the bound
for some constant depending only on . (Hint: test against a random sum .)

- (v) By using gaussian sums in place of random signs, show that one can take the constant in (iv) to be one. (For simplicity, let us take the functions in to be real valued.)

In this set of notes we have not focused on getting explicit constants in the large deviation inequalities, but it is not too difficult to do so with a little extra work. We give just one example here:

Exercise 6 Let be iid copies of a Bernoulli random variable drawn uniformly from .

- (i) Show that for any real . (Hint: expand both sides as an infinite Taylor series in .)
- (ii) Show that for any real numbers and any , we have
(Note that this is consistent with (6) with .)
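Both parts of this exercise can be sanity-checked numerically, though of course this is no substitute for a proof. The sketch below assumes that part (i) is the inequality cosh(t) <= exp(t^2/2) and that part (ii) is the bound 2 exp(-lambda^2/2) for the probability that a weighted sum of random signs exceeds lambda times the Euclidean norm of the weights; these are the standard forms, and the function names and sample weights are hypothetical.

```python
import itertools
import math


def cosh_gap(t):
    """exp(t^2/2) - cosh(t); non-negative if the assumed inequality in (i) holds."""
    return math.exp(t * t / 2) - math.cosh(t)


def weighted_sign_tail(weights, lam):
    """Exact P(|sum a_i X_i| >= lam * sqrt(sum a_i^2)) over all sign patterns."""
    norm = math.sqrt(sum(a * a for a in weights))
    hits = sum(
        1
        for signs in itertools.product((-1, 1), repeat=len(weights))
        if abs(sum(a * s for a, s in zip(weights, signs))) >= lam * norm
    )
    return hits / 2 ** len(weights)


weights = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary sample coefficients
for lam in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(lam, weighted_sign_tail(weights, lam), 2 * math.exp(-lam ** 2 / 2))
```

Enumerating all sign patterns is only feasible for a handful of weights, but it gives exact probabilities rather than Monte Carlo estimates.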

There are many large deviation inequalities beyond the ones presented here. For instance, the Azuma-Hoeffding inequality gives Hoeffding-type bounds when the random variables are not assumed to be jointly independent, but are instead required to form a martingale. Concentration of measure inequalities such as McDiarmid’s inequality handle the situation in which the sum is replaced by a more nonlinear function of the input variables . There are also a number of refinements of the Chernoff estimate from the previous notes that are collectively referred to as “Chernoff bounds”. The Bernstein inequalities handle situations in which the underlying random variable is not bounded, but enjoys good moment bounds. See this previous blog post for these inequalities and some further discussion. Last, but certainly not least, there is an extensively developed theory of large deviations which is focused on the precise exponent in the exponential decay rate for tail probabilities such as (2) when is very large (of the order of ); there is also a complementary theory of *moderate deviations* that gives precise estimates in the regime where is much larger than one, but much less than , for which we generally expect gaussian behaviour rather than exponentially decaying bounds. These topics are beyond the scope of this course.

** — 2. Local limit theorems — **

Let be iid copies of a random variable of mean and variance , and write . On the one hand, the central limit theorem tells us that should behave like the normal distribution , which has probability density function . On the other hand, if is discrete, then must also be discrete. For instance, if takes values in the integers , then takes values in the integers as well. In this case, we would expect (much as we expect a Riemann sum to approximate an integral) the probability distribution of to behave like the probability density function predicted by the central limit theorem, thus we expect

for integer . This is not a direct consequence of the central limit theorem (which does not distinguish between continuous or discrete random variables ), and in any case is not true in some cases: if is restricted to an infinite subprogression of for some and integer , then is similarly restricted to the infinite subprogression , so that (7) totally fails when is outside of (and when does lie in , one would now expect the left-hand side of (7) to be about times larger than the right-hand side, to keep the total probability close to ). However, this turns out to be the only obstruction:

Theorem 7 (Discrete local limit theorem) Let be iid copies of an integer-valued random variable of mean and variance . Suppose furthermore that there is no infinite subprogression of with for which takes values almost surely in . Then one has for all and all integers , where the error term is uniform in . In other words, we have

as .

Note for comparison that the Berry-Esséen theorem (writing as, say, ) would give (assuming finite third moment) an error term of instead of , which would overwhelm the main term which is also of size .

*Proof:* Unlike previous arguments, we do not have the luxury here of using an affine change of variables to normalise to mean zero and variance one, as this would disrupt the hypothesis that takes values in .

Fix and . Since and are integers, we have the Fourier identity

which upon taking expectations and using Fubini’s theorem gives

where is the characteristic function of . Expanding and noting that are iid copies of , we have

It will be convenient to make the change of variables , to obtain

As in the Fourier-analytic proof of the central limit theorem, we have

as , so by Taylor expansion we have

as for any fixed . This suggests (but does not yet prove) that

A standard Fourier-analytic calculation gives

so it will now suffice to establish that

uniformly in . From dominated convergence we have

so by the triangle inequality, it suffices to show that

This will follow from (9) and the dominated convergence theorem, as soon as we can dominate the integrands by an absolutely integrable function.

From (8), there is an such that

for all , and hence

for . This gives the required domination in the region , so it remains to handle the region .

From the triangle inequality we have for all . Actually we have the stronger bound for . Indeed, if for some such , this would imply that is a deterministic constant, which means that takes values in for some real , which implies that takes values in ; since also takes values in , this would place either in a singleton set or in an infinite subprogression of (depending on whether and are rational), a contradiction. As is continuous and the region is compact, there exists such that for all . This allows us to dominate by for , which is in turn bounded by for some independent of , giving the required domination.

Of course, if the random variable in Theorem 7 did take values almost surely in some subprogression , then either is almost surely constant, or there is a minimal progression with this property (since must divide the difference of any two integers that attains with positive probability). One can then make the affine change of variables (modifying and appropriately) and apply the above theorem to obtain a similar local limit theorem, which we will not write here. For instance, if is the uniform distribution on , then this argument gives

when is an integer of the same parity as (of course, will vanish otherwise). A further affine change of variables handles the case when is not integer valued, but takes values in some other lattice , where and are now real-valued.
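The random sign case of the local limit theorem can be compared against exact binomial probabilities. The sketch below assumes the signs are uniform on {-1, +1} and that the predicted main term is 2/sqrt(2 pi n) times exp(-m^2/(2n)) when m has the same parity as n; the factor of 2 reflects the fact that the sum only visits every other integer. The function names are hypothetical.

```python
import math


def exact_point_probability(n, m):
    """P(S_n = m) for S_n a sum of n independent uniform +-1 signs."""
    if (n + m) % 2 != 0 or abs(m) > n:
        return 0.0  # wrong parity or out of range
    return math.comb(n, (n + m) // 2) / 2 ** n


def local_limit_prediction(n, m):
    """Assumed main term: twice the N(0, n) density evaluated at m."""
    return 2.0 / math.sqrt(2 * math.pi * n) * math.exp(-m * m / (2 * n))


n = 1000
for m in (0, 10, 20, 40, 80):
    print(m, exact_point_probability(n, m), local_limit_prediction(n, m))
```

The agreement is good for m of size comparable to the square root of n, and slowly degrades further out in the tail, where the error term is no longer small compared to the main term.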

We can complement these local limit theorems with the following result that handles the non-lattice case:

Theorem 8 (Continuous local limit theorem) Let be iid copies of a real-valued random variable of mean and variance . Suppose furthermore that there is no infinite progression with real for which takes values almost surely in . Then for any , one has for all and all , where the error term is uniform in (but may depend on ).

Equivalently, if we let be a normal random variable with mean and variance , then

uniformly in . Again, this can be compared with the Berry-Esséen theorem, which (assuming finite third moment) has an error term of which is uniform in both and .

*Proof:* Unlike the discrete case, we have the luxury here of normalising and , and we shall now do so.

We first observe that it will suffice to show that

whenever is a Schwartz function whose Fourier transform

is compactly supported, and where the error can depend on but is uniform in . Indeed, if this bound (10) holds, then (after replacing by to make it positive on some interval, and then rescaling) we obtain a bound of the form

for any and , where the term can depend on but not on . Then, by convolving by an approximation to the identity of some width much smaller than with compactly supported Fourier transform, applying (10) to the resulting function, and using (11) to control the error between that function and , we see that

uniformly in , where the term can depend on and but is uniform in . Letting tend slowly to zero, we obtain the claim.

It remains to establish (10). We adapt the argument from the discrete case. By the Fourier inversion formula we may write

By Fubini’s theorem as before, we thus have

and similarly

so it suffices by the triangle inequality and the boundedness and compact support of to show that

for any fixed (where the term can depend on ). We have and , so by making the change of variables , we now need to show that

as . But this follows from the argument used to handle the discrete case.

** — 3. The Poisson central limit theorem — **

The central limit theorem (after normalising the random variables to have mean zero) studies the fluctuations of sums where each individual term is quite small (typically of size ). Now we consider a variant situation, in which one considers a sum of random variables which are *usually* zero, but occasionally equal to a larger value such as . (Such sums arise in many real-life settings when compiling aggregate statistics on rare events, e.g. the number of car crashes in a short period of time.) In these cases, one can get a different distribution than the gaussian distribution, namely a Poisson distribution with some intensity , that is to say, a random variable taking values in the non-negative integers with probability distribution

One can check that this distribution has mean and variance .

Theorem 9 (Poisson central limit theorem) Let be a triangular array of real random variables, where for each , the variables are jointly independent. Assume furthermore that

- (i) ( mostly ) One has as .
- (ii) ( rarely ) One has as .
- (iii) (Convergent expectation) One has as for some .
Then the random variables converge in distribution to a Poisson random variable of intensity .

*Proof:* From hypothesis (i) and the union bound we see that for each , we have that all of the lie in with probability as . Thus, if we replace each by the restriction , the random variable is only modified on an event of probability , which does not affect distributional limits (Slutsky’s theorem). Thus, we may assume without loss of generality that the take values in .

By Exercise 20 of Notes 4, a Poisson random variable of intensity has characteristic function . Applying the Lévy convergence theorem (Theorem 27 of Notes 4), we conclude that it suffices to show that

as for any fixed .

Fix . By the independence of the , we may write

Since only takes on the values and , we can write

where . By hypothesis (ii), we have , so by using a branch of the complex logarithm that is analytic near , we can write

By Taylor expansion we have

and hence by (ii), (iii)

as , and the claim follows.
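The simplest instance of this theorem, in which each row consists of n independent indicator variables that equal one with probability lambda/n, can be explored numerically by comparing the binomial and Poisson probability mass functions. The sketch below is only an illustration: the function names are hypothetical, and the comparison uses the total variation distance, which is one of several reasonable ways to quantify convergence in distribution here.

```python
import math


def binomial_pmf(n, p):
    """List of P(Bin(n, p) = k) for k = 0..n, via the ratio recurrence (needs p < 1)."""
    pmf = [(1 - p) ** n]
    for k in range(n):
        pmf.append(pmf[-1] * (n - k) / (k + 1) * p / (1 - p))
    return pmf


def poisson_pmf(lam, k_max):
    """List of P(Poisson(lam) = k) for k = 0..k_max."""
    pmf = [math.exp(-lam)]
    for k in range(k_max):
        pmf.append(pmf[-1] * lam / (k + 1))
    return pmf


def total_variation(n, lam):
    """Total variation distance between Bin(n, lam/n) and Poisson(lam)."""
    b = binomial_pmf(n, lam / n)
    q = poisson_pmf(lam, n)
    poisson_tail = max(0.0, 1.0 - sum(q))  # Poisson mass beyond n
    return 0.5 * (sum(abs(x - y) for x, y in zip(b, q)) + poisson_tail)


for n in (10, 100, 1000):
    print(n, total_variation(n, 2.0))
```

The distance decays roughly like 1/n, consistent with the classical Le Cam bound of lambda^2/n for this setting.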

Exercise 10 Establish the conclusion of Theorem 9 directly from explicit computation of the probabilities in the case when each takes values in with for some fixed .

The Poisson central limit theorem can be viewed as a degenerate limit of the central limit theorem, as seen by the next two exercises.

Exercise 11 Suppose we replace the hypothesis (iii) in Theorem 9 with the alternative hypothesis that the quantities go to infinity as , while leaving hypotheses (i) and (ii) unchanged. Show that converges in distribution to the normal distribution .

Exercise 12 For each , let be a Poisson random variable with intensity . Show that as , the random variables converge in distribution to the normal distribution . Discuss how this is consistent with Theorem 9 and the previous exercise.

** — 4. Stable laws — **

Let be a real random variable. We say that has a *stable law* or a *stable distribution* if for any positive reals , there exists a positive real and a real such that whenever are iid copies of . In terms of the characteristic function of , we see that has a stable law if for any positive reals , there exist a positive real and a real for which we have the functional equation

for all real .

For instance, a normally distributed variable is stable thanks to Lemma 12 of Notes 4; one can also see this from the characteristic function . A Cauchy distribution , with probability density , is also stable, as is most easily seen from the characteristic function . As a more degenerate example, any deterministic random variable is stable. It is possible (though somewhat tedious) to completely classify all the stable distributions; see for instance the Wikipedia entry on these laws for the full classification.
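The stability of the normal and Cauchy laws can be verified numerically from the functional equation for the characteristic function. The sketch below assumes the equation takes the form phi(at) phi(bt) = phi(ct) exp(idt), with phi(t) = exp(-t^2/2) for the standard normal and phi(t) = exp(-|t|) for the standard Cauchy distribution; the function names and sample values of a and b are hypothetical.

```python
import cmath
import math


def normal_cf(t):
    """Characteristic function of the standard normal distribution."""
    return cmath.exp(-t * t / 2)


def cauchy_cf(t):
    """Characteristic function of the standard Cauchy distribution."""
    return cmath.exp(-abs(t))


def stability_defect(cf, a, b, c, d, ts):
    """Largest value of |cf(a t) cf(b t) - cf(c t) exp(i d t)| over the sample points ts."""
    return max(abs(cf(a * t) * cf(b * t) - cf(c * t) * cmath.exp(1j * d * t)) for t in ts)


ts = [-3 + 0.1 * k for k in range(61)]
a, b = 2.0, 3.0

# Normal law: the scales add in quadrature, c = sqrt(a^2 + b^2), with d = 0.
print(stability_defect(normal_cf, a, b, math.sqrt(a * a + b * b), 0.0, ts))
# Cauchy law: the scales add linearly, c = a + b, with d = 0.
print(stability_defect(cauchy_cf, a, b, a + b, 0.0, ts))
# Using the Cauchy scaling for the normal law leaves a visible defect.
print(stability_defect(normal_cf, a, b, a + b, 0.0, ts))
```

The two scaling rules correspond to the two different stability exponents: quadrature addition for the normal law and linear addition for the Cauchy law.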

If is stable, and are iid copies of , then by iterating the stable law hypothesis we see that the sums are all equal in distribution to some affine rescaling of . For instance, we have for some , and a routine induction then shows that

for all natural numbers (with the understanding that when ). In particular, the random variables all have the same distribution as .

More generally, given two real random variables and , we say that is in the *basin of attraction* for if, whenever are iid copies of and , there exist constants and such that converges in distribution to . Thus, any stable law is in its own basin of attraction, while the central limit theorem asserts that any random variable of finite variance is in the basin of attraction of a normal distribution. One can check that every random variable lies in the basin of attraction of a deterministic random variable such as , simply by letting go to infinity rapidly enough. To avoid this degenerate case, we now restrict to laws that are *non-degenerate*, in the sense that they are not almost surely constant. Then we have the following useful technical lemma:

Proposition 13 (Convergence of types) Let be a sequence of real random variables converging in distribution to a non-degenerate limit . Let and be real numbers such that converges in distribution to a non-degenerate limit . Then and converge to some finite limits respectively, and .

*Proof:* Suppose first that goes to zero. The sequence converges in distribution, hence is tight, hence converges in probability to zero. In particular, if is an independent copy of , then converges in probability to zero; but also converges in distribution to where is an independent copy of , and is not almost surely zero since is non-degenerate. This is a contradiction. The same argument applies if has any subsequence that goes to zero. We conclude that is bounded away from zero. Rewriting as and reversing the roles of and , we conclude also that is bounded away from zero, thus is bounded.

Since is tight and is bounded, is tight; since is also tight, this implies that is tight, that is to say is bounded.

Let be a limit point of the . By Slutsky’s theorem, a subsequence of the then converges in distribution to , thus . If the limit point is unique then we are done, so suppose there are two limit points , . Thus , which on rearranging gives for some and real with .

If then on iteration we have for any natural number , which clearly leads to a contradiction as since . If then iteration gives for any natural number , which on passing to the limit in distribution as gives , again a contradiction. If then we rewrite as and again obtain a contradiction, and the claim follows.

One can use this proposition to verify that basins of attraction of genuinely distinct laws are disjoint:

Exercise 14 Let and be non-degenerate real random variables. Suppose that a random variable lies in the basin of attraction of both and . Then there exist and real such that .

If lies in the basin of attraction for a non-degenerate law , then converges in distribution to ; since is equal in distribution to the sum of two iid copies of , we see that converges in distribution to the sum of two iid copies of . On the other hand, converges in distribution to . Using Proposition 13 we conclude that for some and real . One can go further and conclude that in fact has a stable law; see the following exercise. Thus stable laws are the only laws that have a non-empty basin of attraction.

Exercise 15 Let lie in the basin of attraction for a non-degenerate law .

- (i) Show that for any iid copies of , there exists a unique and such that . Also show that for all natural numbers .
- (ii) Show that the are strictly increasing, with for all natural numbers . (Hint: study the absolute value of the characteristic function, using the non-degeneracy of to ensure that this absolute value is usually strictly less than one.) Also show that for all natural numbers .
- (iii) Show that there exists such that for all . (Hint: first show that is a Cauchy sequence in .)
- (iv) If , and are iid copies of , show that for all natural numbers and some bounded real . Then show that has a stable law in this case.
- (v) If , show that for some real and all . Then show that has a stable law in this case.

Exercise 16 (Classification of stable laws) Let be a non-degenerate stable law; then lies in its own basin of attraction and one can then define as in the preceding exercise.

- (i) If , and is as in part (v) of the preceding exercise, show that for all and . Then show that
for some real . (One can use the identity to restrict attention to the case of positive .)

- (ii) Now suppose . Show that for all (where the implied constant in the notation is allowed to depend on ). Conclude that for all .
- (iii) We continue to assume . Show that for some real number . (Hint: first show this when is a power of a fixed natural number (with possibly depending on ). Then use the estimates from part (ii) to show that does not actually depend on . One may need to invoke the Dirichlet approximation theorem to show that for any given , one can find a power of that is somewhat close to a power of .)
- (iv) We continue to assume . Show that for all and . Then show that
for all and some real .

It is also possible to determine which choices of parameters are actually achievable by some random variable , but we will not do so here.

It is possible to associate a central limit theorem to each stable law, which precisely determines their basin of attraction. We will not do this in full generality, but just illustrate the situation for the Cauchy distribution.

Exercise 17 Let be a real random variable which is symmetric (that is, has the same distribution as ) and obeys the distribution identity

for all , where is a function which is *slowly varying* in the sense that as for all .

- (i) Show that
as , where denotes a quantity such that as . (You may need to establish the identity , which can be done by contour integration.)

- (ii) Let be iid copies of . Show that converges in distribution to a copy of the standard Cauchy distribution (i.e., to a random variable with probability density function ).


between consecutive primes up to , in which we improved the Rankin bound of

to

for large (where we use the abbreviations , , and ). Here, we obtain an analogous result for the quantity

which measures how far apart the gaps between chains of consecutive primes can be. Our main result is

whenever is sufficiently large depending on , with the implied constant here absolute (and effective). The factor of is inherent to the method, and related to the basic probabilistic fact that if one selects numbers at random from the unit interval , then one expects the minimum gap between adjacent numbers to be about (i.e. smaller than the mean spacing of by an additional factor of ).
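The probabilistic fact just mentioned is easy to test numerically. The following sketch (with hypothetical function names and an arbitrary choice of sample sizes; writing $n$ for the number of random points is my own labelling) estimates the average minimum gap between adjacent points among $n$ uniform samples and compares it with $1/n^2$.

```python
import random

def min_gap(n, rng):
    """Smallest gap between adjacent points among n uniform samples in [0, 1]."""
    points = sorted(rng.random() for _ in range(n))
    return min(b - a for a, b in zip(points, points[1:]))

def mean_min_gap(n, trials, seed=0):
    """Average of the minimum gap over independent trials."""
    rng = random.Random(seed)
    return sum(min_gap(n, rng) for _ in range(trials)) / trials

for n in (10, 50, 250):
    # Ratio of the average minimum gap to 1/n^2; it should stay of order one,
    # even though the mean spacing is only of size about 1/n.
    print(n, mean_min_gap(n, trials=2000) * n * n)
```

The printed ratios should stay close to one, which is the extra factor of $1/n$ relative to the mean spacing that the text refers to.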

Our arguments combine those from the previous paper with the matrix method of Maier, who (in our notation) showed that

for an infinite sequence of going to infinity. (Maier needed to restrict to an infinite sequence to avoid Siegel zeroes, but we are able to resolve this issue by the now standard technique of simply eliminating a prime factor of an exceptional conductor from the sieve-theoretic portion of the argument. As a byproduct, this also makes all of the estimates in our paper effective.)

As its name suggests, the Maier matrix method is usually presented by imagining a matrix of numbers, and using information about the distribution of primes in the columns of this matrix to deduce information about the primes in at least one of the rows of the matrix. We found it convenient to interpret this method in an equivalent probabilistic form as follows. Suppose one wants to find an interval which contains a block of at least primes, each separated from each other by at least (ultimately, will be something like and something like ). One can do this by the probabilistic method: pick to be a random large natural number (with the precise distribution to be chosen later), and try to lower bound the probability that the interval contains at least primes, no two of which are within of each other.

By carefully choosing the residue class of with respect to small primes, one can eliminate several of the from consideration of being prime immediately. For instance, if is chosen to be large and even, then the with even have no chance of being prime and can thus be eliminated; similarly if is large and odd, then cannot be prime for any odd . Using the methods of our previous paper, we can find a residue class (where is a product of a large number of primes) such that, if one chooses to be a large random element of (that is, for some large random integer ), then the set of shifts for which still has a chance of being prime has size comparable to something like ; furthermore this set is fairly well distributed in in the sense that it does not concentrate too strongly in any short subinterval of . The main new difficulty, not present in the previous paper, is to get *lower* bounds on the size of in addition to upper bounds, but this turns out to be achievable by a suitable modification of the arguments.

Using a version of the prime number theorem in arithmetic progressions due to Gallagher, one can show that for each remaining shift , is going to be prime with probability comparable to , so one expects about primes in the set . An upper bound sieve (e.g. the Selberg sieve) also shows that for any distinct , the probability that and are both prime is . Using this and some routine second moment calculations, one can then show that with large probability, the set will indeed contain about primes, no two of which are closer than to each other; with no other numbers in this interval being prime, this gives a lower bound on .


I never met or communicated with Roth personally, but was certainly influenced by his work; he wrote relatively few papers, but they tended to have outsized impact. For instance, he was one of the key people (together with Bombieri) to work on simplifying and generalising the large sieve, taking it from the technically formidable original formulation of Linnik and Rényi to the clean and general almost orthogonality principle that we have today (discussed for instance in these lecture notes of mine). The paper of Roth that had the most impact on my own personal work was his three-page paper proving what is now known as Roth’s theorem on arithmetic progressions:

Theorem 1 (Roth’s theorem on arithmetic progressions) Let $A$ be a set of natural numbers of positive upper density (thus $\limsup_{N \rightarrow \infty} |A \cap \{1,\dots,N\}|/N > 0$). Then $A$ contains infinitely many arithmetic progressions $a, a+r, a+2r$ of length three (with $r$ non-zero of course).

At the heart of Roth’s elegant argument was the following (surprising at the time) dichotomy: if had some moderately large density within some arithmetic progression , either one could use Fourier-analytic methods to detect the presence of an arithmetic progression of length three inside , or else one could locate a long subprogression of on which had increased density. Iterating this dichotomy by an argument now known as the *density increment argument*, one eventually obtains Roth’s theorem, no matter which side of the dichotomy actually holds. This argument (and the many descendants of it), based on various “dichotomies between structure and randomness”, became essential in many other results of this type, most famously perhaps in Szemerédi’s proof of his celebrated theorem on arithmetic progressions that generalised Roth’s theorem to progressions of arbitrary length. More recently, my work on the Chowla and Elliott conjectures, which was a crucial component of the solution of the Erdös discrepancy problem, relies on an *entropy decrement argument* which was directly inspired by the density increment argument of Roth.

The Erdös discrepancy problem also is connected with another well known theorem of Roth:

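Theorem 2 (Roth’s discrepancy theorem for arithmetic progressions) Let $f(1),\dots,f(n)$ be a sequence in $\{-1,+1\}$. Then there exists an arithmetic progression $a+r, a+2r, \dots, a+kr$ in $\{1,\dots,n\}$ with $r$ positive such that

$$\left| \sum_{j=1}^k f(a+jr) \right| \geq c n^{1/4}$$

for an absolute constant $c > 0$.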

In fact, Roth proved a stronger estimate regarding mean square discrepancy, which I am not writing down here; as with the Roth theorem in arithmetic progressions, his proof was short and Fourier-analytic in nature (although non-Fourier-analytic proofs have since been found, for instance the semidefinite programming proof of Lovasz). The exponent is known to be sharp (a result of Matousek and Spencer).

As a particular corollary of the above theorem, for an infinite sequence of signs, the sums are unbounded in . The Erdös discrepancy problem asks whether the same statement holds when is restricted to be zero. (Roth also established discrepancy theorems for other sets, such as rectangles, which will not be discussed here.)

Finally, one has to mention Roth’s most famous result, cited for instance in his Fields medal citation:

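Theorem 3 (Roth’s theorem on Diophantine approximation) Let $\alpha$ be an irrational algebraic number. Then for any $\varepsilon > 0$ there is a quantity $c_{\alpha,\varepsilon} > 0$ such that

$$\left| \alpha - \frac{a}{q} \right| > \frac{c_{\alpha,\varepsilon}}{q^{2+\varepsilon}}$$

for all rational numbers $a/q$ with $q$ a natural number.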

From the Dirichlet approximation theorem (or from the theory of continued fractions) we know that the exponent in the denominator cannot be reduced to or below. A classical and easy theorem of Liouville gives the claim with the exponent replaced by the degree of the algebraic number ; work of Thue and Siegel reduced this exponent, but Roth was the one who obtained the near-optimal result. An important point is that the constant is *ineffective* – it is a major open problem in Diophantine approximation to produce any bound significantly stronger than Liouville’s theorem with effective constants. This is because the proof of Roth’s theorem does not exclude any *single* rational from being close to , but instead very ingeniously shows that one cannot have *two* different rationals , that are unusually close to , even when the denominators are very different in size. (I refer to this sort of argument as a “dueling conspiracies” argument; they are strangely prevalent throughout analytic number theory.)


Applications for Postdoctoral Fellowships and Research Memberships for this program (and for other MSRI programs in this time period, namely the companion program in Harmonic Analysis and the Fall program in Geometric Group Theory, as well as the complementary program in all other areas of mathematics) remain open until Dec 1. Applications are open to everyone, but require supporting documentation, such as a CV, statement of purpose, and letters of recommendation from other mathematicians; see the application page for more details.


and

Then, as computed in previous notes, the normalised fluctuation also has mean zero and variance one:

This and Chebyshev’s inequality already indicate that the “typical” size of is , thus for instance goes to zero in probability for any that goes to infinity as . If we also have a finite fourth moment , then the calculations of the previous notes also give a fourth moment estimate

From this and the Paley-Zygmund inequality (Exercise 42 of Notes 1) we also get some lower bound for of the form

for some absolute constant and for sufficiently large; this indicates in particular that does not converge in any reasonable sense to something finite for any that goes to infinity.

The question remains as to what happens to the ratio itself, without multiplying or dividing by any factor . A first guess would be that these ratios converge in probability or almost surely, but this is unfortunately not the case:

Proposition 1 Let be iid copies of an absolutely integrable real scalar random variable with mean zero, variance one, and finite fourth moment, and write . Then the random variables do not converge in probability or almost surely to any limit, and neither does any subsequence of these random variables.

*Proof:* Suppose for contradiction that some sequence converged in probability or almost surely to a limit . By passing to a further subsequence we may assume that the convergence is in the almost sure sense. Since all of the have mean zero, variance one, and bounded fourth moment, Theorem 24 of Notes 1 implies that the limit also has mean zero and variance one. On the other hand, is a tail random variable and is thus almost surely constant by the Kolmogorov zero-one law from Notes 3. Since constants have variance zero, we obtain the required contradiction.

Nevertheless there is an important limit for the ratio , which requires one to replace the notions of convergence in probability or almost sure convergence by the weaker concept of convergence in distribution.

Definition 2 (Vague convergence and convergence in distribution) Let $R$ be a locally compact Hausdorff topological space with the Borel $\sigma$-algebra. A sequence of finite measures $\mu_n$ on $R$ is said to converge vaguely to another finite measure $\mu$ if one has

$$\int_R G(x)\ d\mu_n(x) \rightarrow \int_R G(x)\ d\mu(x)$$

as $n \rightarrow \infty$ for all continuous compactly supported functions $G: R \rightarrow {\bf R}$. (Vague convergence is also known as weak convergence, although strictly speaking the terminology weak-* convergence would be more accurate.) A sequence of random variables $X_n$ taking values in $R$ is said to *converge in distribution* (or *converge weakly* or *converge in law*) to another random variable $X$ if the distributions $\mu_{X_n}$ converge vaguely to the distribution $\mu_X$, or equivalently if

$${\bf E} G(X_n) \rightarrow {\bf E} G(X)$$

as $n \rightarrow \infty$ for all continuous compactly supported functions $G: R \rightarrow {\bf R}$.

One could in principle try to extend this definition beyond the locally compact Hausdorff setting, but certain pathologies can occur when doing so (e.g. failure of the Riesz representation theorem), and we will never need to consider vague convergence in spaces that are not locally compact Hausdorff, so we restrict to this setting for simplicity.

Note that the notion of convergence in distribution depends only on the distribution of the random variables involved. One consequence of this is that convergence in distribution does not produce unique limits: if converges in distribution to , and has the same distribution as , then also converges in distribution to . However, limits are unique up to equivalence in distribution (this is a consequence of the Riesz representation theorem, discussed for instance in this blog post). As a consequence of the insensitivity of convergence in distribution to equivalence in distribution, we may also legitimately talk about convergence in distribution of a sequence of random variables to another random variable even when all the random variables and involved are being modeled by different probability spaces (e.g. each is modeled by , and is modeled by , with no coupling presumed between these spaces). This is in contrast to the stronger notions of convergence in probability or almost sure convergence, which require all the random variables to be modeled by a common probability space. Also, by an abuse of notation, we can say that a sequence of random variables converges in distribution to a probability measure , when converges vaguely to . Thus we can talk about a sequence of random variables converging in distribution to a uniform distribution, a gaussian distribution, etc.

From the dominated convergence theorem (available for both convergence in probability and almost sure convergence) we see that convergence in probability or almost sure convergence implies convergence in distribution. The converse is not true, due to the insensitivity of convergence in distribution to equivalence in distribution; for instance, if are iid copies of a non-deterministic scalar random variable , then the trivially converge in distribution to , but will not converge in probability or almost surely (as one can see from the zero-one law). However, there are some partial converses that relate convergence in distribution to convergence in probability; see Exercise 10 below.

Remark 3 The notion of convergence in distribution is somewhat similar to the notion of convergence in the sense of distributions that arises in distribution theory (discussed for instance in this previous blog post); however, strictly speaking the two notions of convergence are distinct and should not be confused with each other, despite the very similar names.

The notion of convergence in distribution simplifies in the case of real scalar random variables:

Proposition 4 Let $X_n$ be a sequence of scalar random variables, and let $X$ be another scalar random variable. Then the following are equivalent:

- (i) $X_n$ converges in distribution to $X$.
- (ii) $F_{X_n}(t)$ converges to $F_X(t)$ for each continuity point $t$ of $F_X$ (i.e. for all real numbers $t \in {\bf R}$ at which $F_X$ is continuous). Here $F_X(t) := {\bf P}(X \leq t)$ is the cumulative distribution function of $X$.

*Proof:* First suppose that converges in distribution to , and is continuous at . For any , one can find a such that

for every . One can also find an larger than such that and . Thus

and

Let be a continuous function supported on that equals on . Then by the above discussion we have

and hence

for large enough . In particular

A similar argument, replacing with a continuous function supported on that equals on gives

for large enough. Putting the two estimates together gives

for large enough; sending , we obtain the claim.

Conversely, suppose that converges to at every continuity point of . Let be a continuous compactly supported function, then it is uniformly continuous. As is monotone increasing, it can only have countably many points of discontinuity. From these two facts one can find, for any , a simple function for some that are points of continuity of , and real numbers , such that for all . Thus

Similarly for replaced by . Subtracting and taking limit superior, we conclude that

and on sending , we obtain that converges in distribution to as claimed.

The restriction to continuity points of is necessary. Consider for instance the deterministic random variables , then converges almost surely (and hence in distribution) to , but does not converge to .
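To spell out the computation behind this counterexample, here is a worked instance, under the assumption (the formulas in this paragraph did not survive, so this is my reading of it) that the deterministic random variables are $X_n = 1/n$ and the limit is $X = 0$:

```latex
F_{X_n}(t) = 1_{t \geq 1/n}, \qquad F_X(t) = 1_{t \geq 0},
\qquad \hbox{so} \qquad
\lim_{n \rightarrow \infty} F_{X_n}(0) = 0 \neq 1 = F_X(0),
\quad \hbox{while} \quad
\lim_{n \rightarrow \infty} F_{X_n}(t) = F_X(t) \hbox{ for every } t \neq 0.
```

Thus convergence of the cumulative distribution functions fails only at $t = 0$, which is exactly the one point of discontinuity of $F_X$.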

Example 5 For any natural number , let be a discrete random variable drawn uniformly from the finite set , and let be the continuous random variable drawn uniformly from . Then converges in distribution to . Thus we see that a continuous random variable can emerge as the limit of discrete random variables.

Example 6 For any natural number , let be a continuous random variable drawn uniformly from , then converges in distribution to the deterministic real number . Thus we see that discrete (or even deterministic) random variables can emerge as the limit of continuous random variables.

Exercise 7 (Portmanteau theorem) Show that the properties (i) and (ii) in Proposition 4 are also equivalent to the following three statements:

- (iii) One has for all closed sets .
- (iv) One has for all open sets .
- (v) For any Borel set whose topological boundary is such that , one has .
(Note: to prove this theorem, you may wish to invoke Urysohn’s lemma. To deduce (iii) from (i), you may wish to start with the case of compact .)

We can now state the famous central limit theorem:

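Theorem 8 (Central limit theorem) Let $X_1, X_2, X_3, \dots$ be iid copies of a scalar random variable $X$ of finite mean $\mu := {\bf E} X$ and finite non-zero variance $\sigma^2 := {\bf Var}(X)$. Let $S_n := X_1 + \dots + X_n$. Then the random variables $\frac{\sqrt{n}}{\sigma} (S_n/n - \mu)$ converge in distribution to a random variable with the standard normal distribution $N(0,1)$ (that is to say, a random variable with probability density function $x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$). Thus, by abuse of notation,

$$\frac{\sqrt{n}}{\sigma} \left( \frac{S_n}{n} - \mu \right) \rightarrow N(0,1).$$

In the normalised case when $X$ has mean zero and unit variance, this simplifies to

$$\frac{S_n}{\sqrt{n}} \rightarrow N(0,1).$$

Using Proposition 4 (and the fact that the cumulative distribution function associated to $N(0,1)$ is continuous), the central limit theorem is equivalent to asserting that

$${\bf P}\left( \frac{\sqrt{n}}{\sigma} \left( \frac{S_n}{n} - \mu \right) \leq t \right) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx$$

as $n \rightarrow \infty$ for any $t \in {\bf R}$, or equivalently that

$${\bf P}\left( a \leq \frac{\sqrt{n}}{\sigma} \left( \frac{S_n}{n} - \mu \right) \leq b \right) \rightarrow \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\ dx$$

as $n \rightarrow \infty$ for any real numbers $a < b$.

Informally, one can think of the central limit theorem as asserting that $S_n$ approximately behaves like it has distribution $N(n\mu, n\sigma^2)$ for large $n$, where $N(\mu,\sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$, that is to say the distribution with probability density function $x \mapsto \frac{1}{\sqrt{2\pi} \sigma} e^{-(x-\mu)^2/2\sigma^2}$. The integrals $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx$ can be written in terms of the error function $\hbox{erf}$ as $\frac{1}{2} + \frac{1}{2} \hbox{erf}(t/\sqrt{2})$.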
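One can see this approximation emerge in a small simulation. The sketch below is only an illustration: the helper names, the sample sizes, and the choice of underlying random variable (uniform on $[-\sqrt{3},\sqrt{3}]$, which has mean zero and variance one) are all my own assumptions. It compares the empirical probability that the normalised sum is at most $t$ with the normal cumulative distribution function, written via the error function.

```python
import math
import random

def normal_cdf(t):
    """Standard normal CDF via the error function: 1/2 + erf(t / sqrt(2)) / 2."""
    return 0.5 + 0.5 * math.erf(t / math.sqrt(2.0))

def empirical_cdf_of_normalised_sum(n, t, trials, seed=0):
    """Estimate P(S_n / sqrt(n) <= t), where S_n sums n iid mean-zero,
    variance-one variables (uniform on [-sqrt(3), sqrt(3)], an arbitrary choice)."""
    rng = random.Random(seed)
    half_width = math.sqrt(3.0)
    hits = 0
    for _ in range(trials):
        s = sum(rng.uniform(-half_width, half_width) for _ in range(n))
        if s / math.sqrt(n) <= t:
            hits += 1
    return hits / trials

for t in (-1.0, 0.0, 1.0):
    print(t, empirical_cdf_of_normalised_sum(50, t, trials=5000), normal_cdf(t))
```

Replacing the uniform distribution by any other mean-zero, variance-one distribution should give similar output for large $n$, although the sampling error and the rate of convergence will of course vary.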

The central limit theorem is a basic example of the *universality phenomenon* in probability – many statistics involving a large system of many independent (or weakly dependent) variables (such as the normalised sums ) end up having a universal asymptotic limit (in this case, the normal distribution), regardless of the precise makeup of the underlying random variable that comprised that system. Indeed, the universality of the normal distribution is such that it arises in many other contexts than the fluctuation of iid random variables; the central limit theorem is merely the first place in probability theory where it makes a prominent appearance.

We will give several proofs of the central limit theorem in these notes; each of these proofs has its advantages and disadvantages, and each can be extended to prove many further results beyond the central limit theorem. We first give Lindeberg’s proof of the central limit theorem, based on exchanging (or swapping) each component of the sum in turn. This proof gives an accessible explanation as to why there should be a universal limit for the central limit theorem; one then computes directly with gaussians to verify that it is the normal distribution which is the universal limit. Our second proof is the most popular one taught in probability texts, namely the Fourier-analytic proof based around the concept of the characteristic function of a real random variable . Thanks to the powerful identities and other results of Fourier analysis, this gives a quite short and direct proof of the central limit theorem, although the arguments may seem rather magical to readers who are not already familiar with Fourier methods. Finally, we give a proof based on the moment method, in the spirit of the arguments in the previous notes; this argument is more combinatorial, but is straightforward and is particularly robust, in particular being well equipped to handle some dependencies between components; we will illustrate this by proving the Erdos-Kac law in number theory by this method. Some further discussion of the central limit theorem (including some further proofs, such as one based on Stein’s method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and large deviation inequalities, will be discussed in the next (and final) set of notes.

The following exercise illustrates the power of the central limit theorem, by establishing combinatorial estimates which would otherwise require the use of Stirling’s formula to establish.

Exercise 9 (De Moivre-Laplace theorem) Let be a Bernoulli random variable, taking values in with , thus has mean and variance . Let be iid copies of , and write .

- (i) Show that takes values in with . (This is an example of a binomial distribution.)
- (ii) Assume Stirling’s formula
where is a function of that goes to zero as . (A proof of this formula may be found in this previous blog post.) Using this formula, and without using the central limit theorem, show that

as for any fixed real numbers .

The above special case of the central limit theorem was first established by de Moivre and Laplace.
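Although the exercise asks for a proof, the conclusion can be checked numerically without any appeal to the central limit theorem, by summing the binomial probabilities directly. The sketch below is a sanity check only; the function names, the parameter $p = 0.3$ and the interval $[-1,1]$ are illustrative assumptions, and the binomial probabilities are evaluated in log space with `math.lgamma` rather than via Stirling's formula.

```python
import math

def binomial_pmf(n, k, p):
    """P(S_n = k) for S_n ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def normalised_binomial_prob(n, p, a, b):
    """Exact P(a <= (S_n - n p) / sqrt(n p (1 - p)) <= b), up to rounding."""
    sigma = math.sqrt(n * p * (1.0 - p))
    lo = math.ceil(n * p + a * sigma)
    hi = math.floor(n * p + b * sigma)
    return sum(binomial_pmf(n, k, p) for k in range(max(lo, 0), min(hi, n) + 1))

def gaussian_integral(a, b):
    """(1 / sqrt(2 pi)) times the integral of exp(-x^2 / 2) over [a, b]."""
    return 0.5 * (math.erf(b / math.sqrt(2.0)) - math.erf(a / math.sqrt(2.0)))

for n in (100, 1000, 10000):
    print(n, normalised_binomial_prob(n, 0.3, -1.0, 1.0), gaussian_integral(-1.0, 1.0))
```

The two printed columns should approach each other as $n$ increases, consistent with the claimed limit.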

We close this section with some basic facts about convergence in distribution that will be useful in the sequel.

Exercise 10 Let , be sequences of real random variables, and let be further real random variables.

- (i) If is deterministic, show that converges in distribution to if and only if converges in probability to .
- (ii) Suppose that is independent of for each , and independent of . Show that converges in distribution to if and only if converges in distribution to and converges in distribution to . (The shortest way to prove this is by invoking the Stone-Weierstrass theorem, but one can also proceed by proving some version of Proposition 4.) What happens if the independence hypothesis is dropped?
- (iii) If converges in distribution to , show that for every there exists such that for all sufficiently large . (That is to say, is a tight sequence of random variables.)
- (iv) Show that converges in distribution to if and only if, after extending the probability space model if necessary, one can find copies and of and respectively such that converges almost surely to . (Hint: use the Skorohod representation, Exercise 29 of Notes 0.)
- (v) If converges in distribution to , and is continuous, show that converges in distribution to . Generalise this claim to the case when takes values in an arbitrary locally compact Hausdorff space.
- (vi) (Slutsky’s theorem) If converges in distribution to , and converges in probability to a *deterministic* limit , show that converges in distribution to , and converges in distribution to . (Hint: either use (iv), or else use (iii) to control some error terms.) This statement combines particularly well with (i). What happens if is not assumed to be deterministic?
- (vii) (Fatou lemma) If is continuous, and converges in distribution to , show that .
- (viii) (Bounded convergence) If is continuous and bounded, and converges in distribution to , show that .
- (ix) (Dominated convergence) If converges in distribution to , and there is an absolutely integrable such that almost surely for all , show that .

For future reference we also mention (but will not prove) Prokhorov’s theorem that gives a partial converse to part (iii) of the above exercise:

Theorem 11 (Prokhorov’s theorem) Let $X_n$ be a sequence of real random variables which is tight (that is, for every $\varepsilon > 0$ there exists $K > 0$ such that ${\bf P}(|X_n| \geq K) < \varepsilon$ for all sufficiently large $n$). Then there exists a subsequence $X_{n_j}$ which converges in distribution to some random variable $X$ (which may possibly be modeled by a different probability space model than the $X_n$).

The proof of this theorem relies on the Riesz representation theorem, and is beyond the scope of this course; but see for instance Exercise 29 of this previous blog post. (See also the closely related Helly selection theorem, covered in Exercise 30 of the same post.)

** — 1. The Lindeberg approach to the central limit theorem — **

We now give the Lindeberg argument establishing the central limit theorem. The proof splits into two unrelated components. The first component is to establish the central limit theorem for a *single* choice of underlying random variable . The second component is to show that the limiting distribution of is *universal* in the sense that it does not depend on the choice of underlying random variable. Putting the two components together gives Theorem 8.

We begin with the first component of the argument. One could use the Bernoulli distribution from Exercise 9 as the choice of underlying random variable, but a simpler choice of distribution (in the sense that no appeal to Stirling’s formula is required) is the normal distribution itself. The key computation is:

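Lemma 12 (Sum of independent Gaussians) Let $X_1, X_2$ be independent real random variables with normal distributions $N(\mu_1,\sigma_1^2)$, $N(\mu_2,\sigma_2^2)$ respectively for some $\mu_1, \mu_2 \in {\bf R}$ and $\sigma_1, \sigma_2 > 0$. Then $X_1 + X_2$ has the normal distribution $N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$.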

This is of course consistent with the additivity of mean and variance for independent random variables, given that random variables with the distribution have mean and variance .

*Proof:* By subtracting and from respectively, we may normalise , by dividing through by we may also normalise . Thus

and

for any Borel sets . As are independent, this implies that

for any Borel set (this follows from the uniqueness of product measure, or equivalently one can use the monotone class lemma starting from the case when is a finite boolean combination of product sets ). In particular, we have

for any . Making the change of variables (and using the Fubini-Tonelli theorem as necessary) we can write the right-hand side as

We can complete the square using to write (after some routine algebra)

so on using the identity for any and , we can write (2) as

and so has the cumulative distribution function of , giving the claim.

In the next section we give an alternate proof of the above lemma using the machinery of characteristic functions. A more geometric argument can be given as follows. With the same normalisations as in the above proof, we can write and for some . Then we can write and where are iid copies of . But the joint probability density function of is rotation invariant, so has the same distribution as , and the claim follows.

From the above lemma we see that if are iid copies of a normal distribution of mean and variance , then has distribution , and hence has distribution exactly equal to . Thus the central limit theorem is clearly true in this case.
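Lemma 12 can be sanity-checked by simulation. In the sketch below, the particular parameters (means $1$ and $-3$, standard deviations $2$ and $1.5$), the sample size and the helper names are arbitrary choices of mine; the check compares the sample mean and variance of the sum with the predicted values, and also the proportion of samples within one predicted standard deviation of the predicted mean, which for a normal distribution is $\hbox{erf}(1/\sqrt{2}) \approx 0.6827$.

```python
import math
import random
import statistics

def sample_sum_of_gaussians(mu1, sigma1, mu2, sigma2, trials, seed=0):
    """Draw samples of X_1 + X_2 with independent X_i ~ N(mu_i, sigma_i^2)."""
    rng = random.Random(seed)
    return [rng.gauss(mu1, sigma1) + rng.gauss(mu2, sigma2) for _ in range(trials)]

def fraction_within(samples, center, radius):
    """Proportion of samples lying within the given radius of the center."""
    return sum(1 for x in samples if abs(x - center) <= radius) / len(samples)

samples = sample_sum_of_gaussians(1.0, 2.0, -3.0, 1.5, trials=20000)
# Lemma 12 predicts mean 1 + (-3) = -2 and variance 2^2 + 1.5^2 = 6.25
print(statistics.fmean(samples), statistics.pvariance(samples))
# One predicted standard deviation is sqrt(6.25) = 2.5
print(fraction_within(samples, -2.0, 2.5), math.erf(1.0 / math.sqrt(2.0)))
```

Matching mean and variance alone would follow from independence; it is the third printed quantity that gives some (weak) numerical evidence that the sum is actually normally distributed.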

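Exercise 13 (Probabilistic interpretation of convolution) Let $f, g: {\bf R} \rightarrow [0,+\infty]$ be measurable functions with $\int_{\bf R} f(x)\ dx = \int_{\bf R} g(x)\ dx = 1$. Define the convolution $f*g$ of $f$ and $g$ to be

$$f*g(x) := \int_{\bf R} f(y) g(x-y)\ dy.$$

Show that if $X, Y$ are independent real random variables with probability density functions $f, g$ respectively, then $X+Y$ has probability density function $f*g$.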

Now we turn to the general case of the central limit theorem. By subtracting from (and from each of the ) we may normalise ; by dividing (and each of the ) by we may also normalise . Thus , and our task is to show that

as , for any continuous compactly supported functions , where is a random variable distributed according to (possibly modeled by a different probability space than the original ). Since any continuous compactly supported function can be approximated uniformly by smooth compactly supported functions (as can be seen from the Weierstrass or Stone-Weierstrass theorems), it suffices to show this for smooth compactly supported .

Let be iid copies of ; by extending the probability space used to model (using Proposition 26 from Notes 2), we can model the and by a common model, in such a way that the combined collection of random variables are jointly independent. As we have already proved the central limit theorem in the normally distributed case, we already have

as (indeed we even have equality here). So it suffices to show that

We first establish this claim under the additional simplifying assumption of a finite third moment: . Rather than swap all of the with all of the , let us just swap the final to a , that is to say let us consider the expression

Writing , we can write this as

To compute this expression we use Taylor expansion. As is smooth and compactly supported, the first three derivatives of are bounded, leading to the Taylor approximation

where the implied constant depends on . Taking expectations, we conclude that

Now for a key point: as the random variable only depends on , it is independent of , and so we can decouple the expectations to obtain

The same considerations apply after swapping with (which also has a bounded third moment):

But by hypothesis, and have matching moments to second order: and . Thus on subtraction we have

A similar argument (permuting the indices, and replacing some of the with ) gives

for all . Summing the telescoping series, we conclude that

which gives (4). Note how it was important to Taylor expand to at least third order to obtain a total error bound that went to zero, which explains why it is the first two moments of (or equivalently, the mean and variance) that play such a decisive role in the central limit theorem.
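To see the size of the exchange error in a concrete case, here is a small sketch under assumptions of my own choosing: the summands are signs $\pm 1$ with equal probability (so the normalised sum has an explicit binomial law), and the test function is $G(x) = e^{-x^2}$, which is smooth with bounded derivatives but not compactly supported (only the bounded third derivative is used in the Taylor step). For a standard normal $N$ one has $\mathbf{E} e^{-N^2} = 1/\sqrt{3}$, so the error can be computed exactly without any sampling.

```python
import math


def expected_G_of_normalised_sum(n, G):
    """Exact E G(S_n / sqrt(n)) when S_n is a sum of n iid signs +1/-1."""
    total = 0.0
    for k in range(n + 1):  # k = number of +1 signs, so S_n = 2k - n
        prob = math.comb(n, k) / 2 ** n
        total += prob * G((2 * k - n) / math.sqrt(n))
    return total


def G(x):
    """Smooth test function with bounded derivatives (hypothetical choice)."""
    return math.exp(-x * x)


GAUSSIAN_VALUE = 1 / math.sqrt(3)  # E exp(-N^2) for N ~ N(0, 1)


def exchange_error(n):
    """|E G(Z_n) - E G(N)| for the sign example."""
    return abs(expected_G_of_normalised_sum(n, G) - GAUSSIAN_VALUE)


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, exchange_error(n), 1 / math.sqrt(n))
```

In this symmetric example the third moments of the summands and of the gaussian also agree (both vanish), so one should expect the observed error to be considerably smaller than the general $n^{-1/2}$ rate.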

Now we remove the hypothesis of finite third moment. As in the previous set of notes, we use the truncation method, taking advantage of the “room” inherent in the factor in the error term of the above analysis. For technical reasons we have to modify the usual truncation slightly to preserve the mean zero condition. Let . We split , where and with ; we split respectively (we are suppressing the dependence of on and to simplify the notation). The random variables (and ) are chosen to have mean zero, but their variance is not quite one. However, from dominated convergence, the quantity converges to , and the variance converges to , as .

Let be iid copies of . The previous arguments then give

for large enough. We can bound

and thus

for large enough . By Lemma 12, is distributed as , and hence

as . We conclude that

Next, we consider the error term

The variable has mean zero, and by dominated convergence, the variance of goes to zero as . The above error term is also mean zero and has the same variance as . In particular, from Cauchy-Schwarz we have

As is smooth and compactly supported, it is Lipschitz continuous, and hence

Taking expectations, and then combining all these estimates, we conclude that

for sufficiently large; letting go to infinity we obtain (3) as required. This concludes the proof of Theorem 8.

The above argument can be generalised to a central limit theorem for certain triangular arrays, known as the Lindeberg central limit theorem:

Exercise 14 (Lindeberg central limit theorem) Let be a sequence of natural numbers going to infinity in . For each natural number , let be jointly independent real random variables of mean zero and finite variance. (We do not require the random variables to be jointly independent in , or even to be modeled by a common probability space.) Let be defined by and assume that for all .

- (i) If one assumes the Lindeberg condition that
as for any , then show that the random variables converge in distribution to a random variable with the normal distribution .

- (ii) Show that the Lindeberg condition implies the Feller condition
as .

Note that Theorem 8 (after normalising to the mean zero case ) corresponds to the special case and (or, if one wishes, ) of the Lindeberg central limit theorem. It was shown by Feller that if the situation is as in the above exercise and the Feller condition holds, then the Lindeberg condition is necessary as well as sufficient for to converge in distribution to a random variable with normal distribution ; the combined result is sometimes known as the *Lindeberg-Feller theorem*.

Exercise 15 (Weak Berry-Esséen theorem) Let be iid copies of a real random variable of mean zero, unit variance, and finite third moment.

- (i) Show that
whenever is three times continuously differentiable and compactly supported, with distributed as and the implied constant in the notation absolute.

- (ii) Show that
for any , with the implied constant absolute.

We will strengthen the conclusion of this exercise in Theorem 37 below.

Remark 16 The Lindeberg exchange method explains why the limiting distribution of statistics such as depends primarily on the first two moments of the component random variables , if there is a suitable amount of independence between the . It turns out that there is an analogous application of the Lindeberg method in random matrix theory, which (very roughly speaking) asserts that appropriate statistics of random matrices such as depend primarily on the first four moments of the matrix components , if there is a suitable amount of independence between the . See for instance this survey article of Van Vu and myself for more discussion of this. The Lindeberg method also suggests that the more moments of one assumes to match with the Gaussian variable , the faster the rate of convergence (because one can use higher order Taylor expansions).

We now use the Lindeberg central limit theorem to obtain the converse direction of the Kolmogorov three-series theorem (Exercise 29 of Notes 3).

Exercise 17 (Kolmogorov three-series theorem, converse direction) Let be a sequence of jointly independent real random variables, with the property that the series is almost surely convergent (i.e., the partial sums are almost surely convergent), and let .

- (i) Show that is finite. (Hint: argue by contradiction and use the second Borel-Cantelli lemma.)
- (ii) Show that is finite. (Hint: first use (i) and the Borel-Cantelli lemma to reduce to the case where almost surely. If is infinite, use Exercise 14 to show that converges in distribution to a standard normal distribution, and use this to contradict the almost sure convergence of .)
- (iii) Show that the series is convergent. (Hint: reduce as before to the case where almost surely, and apply the forward direction of the three-series theorem to .)

** — 2. The Fourier-analytic approach to the central limit theorem — **

Let us now give the standard Fourier-analytic proof of the central limit theorem. Given any real random variable $X$, we introduce the characteristic function $\varphi_X: \mathbb{R} \to \mathbb{C}$, defined by the formula

$$\varphi_X(t) := \mathbf{E} e^{itX}. \qquad (5)$$

Equivalently, is the Fourier transform of the probability measure . One should caution that the term “characteristic function” has several other unrelated meanings in mathematics; particularly confusingly, in real analysis “characteristic function” is used to denote what in probability one would call an “indicator function”. Note that no moment hypotheses are required to define the characteristic function, because the random variable is bounded even when is not absolutely integrable.

Example 18 The signed Bernoulli distribution, which takes the values $+1$ and $-1$ with probabilities $1/2$ of each, has characteristic function $\varphi(t) = \cos(t)$.

Most of the standard random variables in probability have characteristic functions that are quite simple and explicit. For the purposes of proving the central limit theorem, the most important such explicit form of the characteristic function is of the normal distribution:

Exercise 19 Show that the normal distribution $N(\mu,\sigma^2)$ has characteristic function $\varphi(t) = e^{it\mu} e^{-\sigma^2 t^2/2}$.

We record the explicit characteristic functions of some other standard distributions:

Exercise 20 Let , and let be a Poisson random variable with intensity , thus takes values in the non-negative integers with . Show that for all .

Exercise 21 Let be uniformly distributed in some interval . Show that for all non-zero .

Exercise 22 Let and , and let be a Cauchy random variable with parameters , which means that is a real random variable with probability density function . Show that for all .

The characteristic function $\varphi_X$ is clearly bounded in magnitude by $1$, and equals $1$ at the origin. By the dominated convergence theorem, $\varphi_X$ is continuous in $t$.

Exercise 23 (Riemann-Lebesgue lemma) Show that if is a real random variable that has an absolutely integrable probability density function , then as . (Hint: first show the claim when is a finite linear combination of indicator functions of intervals, then for the general case show that can be approximated in norm by such finite linear combinations.) Note from Example 18 that the claim can fail if does not have a probability density function. (In Fourier analysis, this fact is known as the Riemann-Lebesgue lemma.)

Exercise 24 Show that the characteristic function of a real random variable is in fact uniformly continuous on its domain.

Let $X$ be a real random variable. If we Taylor expand $e^{itX}$ and formally interchange the series and expectation, we arrive at the heuristic identity

$$\varphi_X(t) = \sum_{k=0}^\infty \frac{(it)^k}{k!} \mathbf{E} X^k \qquad (6)$$

which thus interprets the characteristic function of a real random variable as a kind of generating function for the moments. One rigorous version of this identity is as follows.

Exercise 25 (Taylor expansion of characteristic function) Let be a real random variable with finite moment for some . Show that is times continuously differentiable, with for all . Conclude in particular the partial Taylor expansion

where is a quantity that goes to zero as , times .

Exercise 26 Let be a real random variable, and assume that it is subgaussian in the sense that there exist constants such that for all . (Thus for instance a bounded random variable is subgaussian, as is any gaussian random variable.) Rigorously establish (6) in this case, and show that the series converges locally uniformly in .

Note that the characteristic function depends only on the distribution of : if and are equal in distribution, then . The converse statement is true also: if , then and are equal in distribution. This follows from a more general (and useful) fact, known as Lévy’s continuity theorem.

Theorem 27 (Lévy continuity theorem, special case) Let be a sequence of real random variables, and let be an additional real random variable. Then the following statements are equivalent:

- (i) converges pointwise to .
- (ii) converges in distribution to .

*Proof:* The implication of (i) from (ii) is immediate from (5) and Exercise 10(viii).

Now suppose that (i) holds, and we wish to show that (ii) holds. We need to show that

whenever is a continuous, compactly supported function. As in the Lindeberg argument, it suffices to prove this when is smooth and compactly supported, in particular is a Schwartz function (infinitely differentiable, with all derivatives rapidly decreasing). But then we have the Fourier inversion formula

where

is Schwartz, and is in particular absolutely integrable (see e.g. these lecture notes of mine). From the Fubini-Tonelli theorem, we thus have

and similarly for . The claim now follows from the Lebesgue dominated convergence theorem.

Remark 28 Setting for all , we see in particular the previous claim that if and only if , have the same distribution. It is instructive to use the above proof as a guide to prove this claim directly.

There is one subtlety with the Lévy continuity theorem: it is possible for a sequence of characteristic functions to converge pointwise, but for the limit to not be the characteristic function of any random variable, in which case will not converge in distribution. For instance, if , then converges pointwise to for any , but this is clearly not the characteristic function of any random variable (as characteristic functions are continuous). However, this lack of continuity is the only obstruction:
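A quick numerical look at this obstruction may help. The sketch below assumes, for illustration, that the example is a centred normal variable of variance $n$, whose characteristic function is $e^{-nt^2/2}$; the function names are made up here. It evaluates the characteristic function for a large variance, and also the probability mass outside a fixed interval, which is the quantity that tightness would require to be small.

```python
import math


def char_fn_centred_normal(t, variance):
    """Characteristic function exp(-variance * t^2 / 2) of N(0, variance)."""
    return math.exp(-variance * t * t / 2)


def tail_mass(variance, radius):
    """P(|X| > radius) for X ~ N(0, variance), via the complementary error function."""
    return math.erfc(radius / math.sqrt(2 * variance))


if __name__ == "__main__":
    for variance in (1, 100, 10_000, 1_000_000):
        print(
            variance,
            char_fn_centred_normal(0.0, variance),  # always 1 at the origin
            char_fn_centred_normal(0.1, variance),  # decays to 0 away from the origin
            tail_mass(variance, 10.0),               # mass escaping [-10, 10]
        )
```

As the variance grows, the value at any fixed non-zero frequency collapses to zero while the value at the origin stays at one, and almost all of the mass leaves any fixed bounded interval, which is the failure of tightness in this example.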

Exercise 29 (Lévy’s continuity theorem, full version) Let be a sequence of real valued random variables. Suppose that converges pointwise to a limit . Show that the following are equivalent:

- (i) is continuous at .
- (ii) is a tight sequence (as in Exercise 10(iii)).
- (iii) is the characteristic function of a real valued random variable (possibly after extending the sample space).
- (iv) converges in distribution to some real valued random variable (possibly after extending the sample space).

Hint: To get from (ii) to the other conclusions, use Theorem 11 and Theorem 27. To get back to (ii) from (i), use (7) for a suitable Schwartz function . The other implications are easy once Theorem 27 is in hand.

Remark 30 Lévy’s continuity theorem is very similar in spirit to Weyl’s criterion in equidistribution theory.

Exercise 31 (Esséen concentration inequality) Let be a random variable taking values in . Then for any , , show that for some constant depending only on . (Hint: Use (7) for a suitable Schwartz function .) The left-hand side of (8) (as well as higher dimensional analogues of this quantity) is known as the small ball probability of at radius .

In Fourier analysis, we learn that the Fourier transform is a particularly well-suited tool for studying convolutions. The probability theory analogue of this fact is that characteristic functions are a particularly well-suited tool for studying sums of independent random variables. More precisely, we have

Exercise 32 (Fourier identities) Let be independent real random variables. Then for all . Also, for any scalar , one has

and more generally, for any linear transformation , one has

Remark 33 Note that this identity (9), combined with Exercise 19 and Remark 28, gives a quick alternate proof of Lemma 12.

In particular, we have the simple relationship

that describes the characteristic function of in terms of that of .

We now have enough machinery to give a quick proof of the central limit theorem:

*Proof:* (Fourier proof of Theorem 8) We may normalise to have mean zero and variance . By Exercise 25, we thus have

for sufficiently small , or equivalently

for sufficiently small . Applying (10), we conclude that

as for any fixed . But by Exercise 19, is the characteristic function of the normal distribution . The claim now follows from the Lévy continuity theorem.
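The convergence in this proof can be watched numerically in the simplest case. The sketch below takes the summands to be signs $\pm 1$ with equal probability (my choice for illustration), whose characteristic function is $\cos t$, so that the normalised sum of $n$ copies has characteristic function $\cos(t/\sqrt{n})^n$ by the product and scaling identities. It compares this with the gaussian characteristic function $e^{-t^2/2}$.

```python
import math


def char_fn_normalised_sum(t, n):
    """Characteristic function of (X_1 + ... + X_n) / sqrt(n) for iid signs X_i."""
    return math.cos(t / math.sqrt(n)) ** n


def gaussian_char_fn(t):
    """Characteristic function of the standard normal distribution."""
    return math.exp(-t * t / 2)


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0):
        for n in (10, 100, 10_000):
            gap = abs(char_fn_normalised_sum(t, n) - gaussian_char_fn(t))
            print(t, n, gap)
```

For each fixed frequency the gap shrinks as $n$ grows, which is the pointwise convergence that the Lévy continuity theorem converts into convergence in distribution.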

The above machinery extends without difficulty to vector-valued random variables taking values in . The analogue of the characteristic function is then the function defined by

We leave the routine extension of the above results and proofs to the higher dimensional case to the reader. Most interesting is what happens to the central limit theorem:

Exercise 34 (Vector-valued central limit theorem) Let be a random variable taking values in with finite second moment. Define the covariance matrix to be the matrix whose entry is the covariance .

- Show that the covariance matrix is positive semi-definite real symmetric.
- Conversely, given any positive definite real symmetric matrix and , show that the multivariate normal distribution , given by the absolutely continuous measure
has mean and covariance matrix , and has a characteristic function given by

How would one define the normal distribution if degenerated to be merely positive semi-definite instead of positive definite?

- If is the sum of iid copies of , show that converges in distribution to . (For this exercise, you may assume without proof that the Lévy continuity theorem extends to .)

Exercise 35 (Complex central limit theorem) Let be a complex random variable of mean , whose real and imaginary parts have variance and covariance . Let be iid copies of . Show that as , the normalised sums converge in distribution to the standard complex gaussian , defined as the measure on with for Borel , where is Lebesgue measure on (identified with in the usual fashion).

Exercise 36 Use characteristic functions and the truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14).

A more sophisticated version of the Fourier-analytic method gives a more quantitative form of the central limit theorem, namely the Berry-Esséen theorem.

Theorem 37 (Berry-Esséen theorem) Let have mean zero and unit variance. Let , where are iid copies of . Then we have uniformly for all , where has the distribution of , and the implied constant is absolute.

*Proof:* (Optional) Write ; our task is to show that

for all . We may of course assume that , as the claim is trivial otherwise. (In particular, has finite third moment.)

Let be a small absolute constant to be chosen later. Let be a non-negative Schwartz function with total mass whose Fourier transform is supported in ; such an can be constructed by taking the inverse Fourier transform of a smooth function supported on and equal to one at the origin, and then multiplying that transform by its complex conjugate to make it non-negative.

Let be the smoothed out version of , defined as

Observe that is decreasing from to . As is rapidly decreasing and has mean one, we also have the bound

(say) for any , where subscript indicates that the implied constant depends on .

We claim that it suffices to show that

for every , where the subscript means that the implied constant depends on . Indeed, suppose that (14) held. Replacing by and for some large absolute constant and subtracting, we have

for any . From (13) we see that the function is bounded by , and hence by the bounded probability density of

Also, is non-negative, and for large enough, it is bounded from below by (say) on . We conclude (after choosing appropriately) that

for all real . This implies that

as can be seen by covering the real line by intervals and applying (15) to each such interval. From (13) we conclude that

A similar argument gives

and (12) then follows from (14).

It remains to establish (14). We write

(with the expression in the limit being uniformly bounded) and

to conclude (after applying Fubini’s theorem) that

using the compact support of . Hence

By the fundamental theorem of calculus we have . This factor of cancels a similar factor on the denominator to make the expression inside the limit dominated by an absolutely integrable random variable. Thus by the dominated convergence theorem

uniformly in . Applying the triangle inequality and the compact support of , it suffices to show that

for any ; taking expectations and using the definition of we have

and in particular

if and is small enough. Applying (10), we conclude that

if . Meanwhile, from Exercise 19 we have . Elementary calculus then gives us

(say) if is small enough. Inserting this bound into (16) we obtain the claim.

Exercise 38 Show that the error terms in Theorem 37 are sharp (up to constants) when is a signed Bernoulli random variable, or more precisely when takes values in with probability for each.
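The sharpness claimed in this exercise can be checked numerically (this is an illustration, not a solution). Assuming the summands are signs $\pm 1$ with equal probability, the normalised sum has an explicit binomial law, so the uniform distance between its distribution function and the gaussian one can be computed exactly by inspecting each atom from the left and from the right. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Distribution function of the standard normal, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def kolmogorov_distance(n):
    """sup over a of |P(Z_n < a) - P(G < a)|, Z_n = S_n / sqrt(n), S_n a sum of n iid signs."""
    worst = 0.0
    cdf_left = 0.0  # P(Z_n < x_k): mass strictly to the left of the current atom
    for k in range(n + 1):
        x = (2 * k - n) / math.sqrt(n)
        mass = math.comb(n, k) / 2 ** n
        target = normal_cdf(x)
        # The step function is constant between atoms, so the supremum is
        # approached at an atom, either from the left or from the right.
        worst = max(worst, abs(cdf_left - target), abs(cdf_left + mass - target))
        cdf_left += mass
    return worst


if __name__ == "__main__":
    for n in (100, 400, 900):
        print(n, kolmogorov_distance(n), math.sqrt(n) * kolmogorov_distance(n))
```

For even $n$ the origin is an atom of the normalised sum, and the rescaled distance $\sqrt{n}$ times the supremum stays of order one rather than tending to zero, which is what sharpness of the $n^{-1/2}$ rate means here.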

Exercise 39 Let be a sequence of real random variables which converge in distribution to a real random variable , and let be a sequence of real random variables which converge in distribution to a real random variable . Suppose that, for each , and are independent, and suppose also that and are independent. Show that converges in distribution to . (Hint: use the Lévy continuity theorem.)

The following exercise shows that the central limit theorem fails when the variance is infinite.

Exercise 40 Let be iid copies of an absolutely integrable random variable of mean zero.

- (i) In this part we assume that is symmetric, which means that and have the same distribution. Show that for any and , (Hint: relate both sides of this inequality to the probability of the event that and , using the symmetry of the situation.)
- (ii) If is symmetric and converges in distribution to a real random variable , show that has finite variance. (Hint: if this is not the case, then will have arbitrarily large variance as increases. On the other hand, can be made arbitrarily small by taking large enough. For a large threshold , apply (i) (with ) to obtain a contradiction.)
- (iii) Generalise (ii) by removing the hypothesis that is symmetric. (Hint: apply the symmetrisation trick of replacing by , where is an independent copy of , and use the previous exercise. One may need to utilise some truncation arguments to show that has infinite variance whenever has infinite variance.)

Exercise 41

- (i) If is a real random variable of mean zero and variance , and is a real number, show that
and that

(Hint: first establish the Taylor bounds and .)

- (ii) Establish the pointwise inequality
whenever are complex numbers in the disk .

- (iii) Suppose that for each , are jointly independent real random variables of mean zero and finite variance, obeying the uniform bound
for all and some going to zero as , and also obeying the variance bound

as for some . If , use (i) and (ii) to show that

as for any given real .

- (iv) Use (iii) and a truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14). (Note: one has to address the issue that truncating a random variable may alter its mean slightly.)

** — 3. The moment method — **

The above Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the “standard” proof given in probability textbooks. However, it relies quite heavily on the Fourier-analytic identities in Exercise 32, which in turn are extremely dependent on both the commutative nature of the situation (as it uses the identity ) and on the independence of the situation (as it uses identities of the form ). As one or both of these factors can be absent when trying to generalise this theorem, it is also important to look for non-Fourier based methods to prove results such as the central limit theorem. These methods often lead to proofs that are lengthier and more technical than the Fourier proofs, but also tend to be more robust.

The most elementary (but still remarkably effective) method available in this regard is the *moment method*, which we have already used in previous notes. In principle, this method is equivalent to the Fourier method, through the identity (6); but in practice, the moment method proofs tend to look somewhat different than the Fourier-analytic ones, and it is often more apparent how to modify them to non-independent or non-commutative settings.

We first need an analogue of the Lévy continuity theorem. Here we encounter a technical issue: whereas the Fourier phases were bounded, the moment functions become unbounded at infinity. However, one can deal with this issue as long as one has sufficient decay:

Exercise 42 (Subgaussian random variables) Let be a real random variable. Show that the following statements are equivalent:

- (i) There exist such that for all .
- (ii) There exist such that for all .
- (iii) There exist such that for all .
Furthermore, show that if (i) holds for some , then (ii) holds for depending only on , and similarly for any of the other implications. Variables obeying (i), (ii), or (iii) are called subgaussian. The function is known as the moment generating function of ; it is of course closely related to the characteristic function of .

Exercise 43 Use the truncation method to show that in order to prove the central limit theorem (Theorem 8), it suffices to do so in the case when the underlying random variable is bounded (and in particular subgaussian).

Once we restrict to the subgaussian case, we have an analogue of the Lévy continuity theorem:

Theorem 44 (Moment continuity theorem) Let be a sequence of real random variables, and let be a subgaussian random variable. Suppose that for every , converges pointwise to . Then converges in distribution to .

*Proof:* Let , then by the preceding exercise we have

for some independent of . In particular,

when is sufficiently large depending on . From Taylor’s theorem with remainder (and Stirling’s formula (1)) we conclude

uniformly in , for sufficiently large. Similarly for . Taking limits using (i) we see that

Then letting , keeping fixed, we see that converges pointwise to for each , and the claim now follows from the Lévy continuity theorem.

Remark 45 One corollary of Theorem 44 is that the distribution of a subgaussian random variable is uniquely determined by its moments (actually, this could already be deduced from Exercise 26 and Remark 28). The situation can fail for distributions with slower tails, for much the same reason that a smooth function is not determined by its derivatives at one point if that function is not analytic.

The Fourier inversion formula provides an easy way to recover the distribution from the characteristic function. Recovering a distribution from its moments is more difficult, and sometimes requires tools such as analytic continuation; this problem is known as the inverse moment problem and will not be discussed here.

Exercise 46 (Converse direction of moment continuity theorem) Let be a sequence of uniformly subgaussian random variables (thus there exist such that for all and all ), and suppose converges in distribution to a limit . Show that for any , converges pointwise to .

We now give the moment method proof of the central limit theorem. As discussed above we may assume without loss of generality that is bounded (and in particular subgaussian); we may also normalise to have mean zero and unit variance. By Theorem 44, it suffices to show that

for all , where is a standard gaussian variable.

The moments were already computed in Exercise 36 of Notes 1. So now we need to compute . Using linearity of expectation, we can expand this as

To understand this expression, let us first look at some small values of .

- For , this expression is trivially .
- For , this expression is trivially , thanks to the mean zero hypothesis on .
- For , we can split this expression into the diagonal and off-diagonal components:
Each summand in the first sum is , as has unit variance. Each summand in the second sum is , as the have mean zero and are independent. So the second moment is .

- For , we have a similar expansion
The summands in the latter two sums vanish because of the (joint) independence and mean zero hypotheses. The summands in the first sum need not vanish, but are , so the first term is , which is asymptotically negligible, so the third moment goes to .

- For , the expansion becomes quite complicated:
Again, most terms vanish, except for the first sum, which is and is asymptotically negligible, and the sum , which by the independence and unit variance assumptions works out to . Thus the fourth moment goes to (as it should).

Now we tackle the general case. Ordering the indices as for some , with each occurring with multiplicity and using elementary enumerative combinatorics, we see that is the sum of all terms of the form

where , are positive integers adding up to , and is the multinomial coefficient

The total number of such terms depends only on (in fact, it is (exercise!), though we will not need this fact).

As we already saw from the small examples, most of the terms vanish, and many of the other terms are negligible in the limit . Indeed, if any of the are equal to , then every summand in (17) vanishes, by joint independence and the mean zero hypothesis. Thus, we may restrict attention to those expressions (17) for which all the are at least . Since the sum up to , we conclude that is at most .

On the other hand, the total number of summands in (17) is clearly at most (in fact it is ), and the summands are bounded (for fixed ) since is bounded. Thus, if is *strictly* less than , then the expression in (17) is and goes to zero as . So, asymptotically, the only terms (17) which are still relevant are those for which is *equal* to . This already shows that goes to zero when is odd. When is even, the only surviving term in the limit is now when and . But then by independence and unit variance, the expectation in (17) is , and so this term is equal to

and the main term is happily equal to the moment as computed in Exercise 36 of Notes 1. (One could also appeal to Lemma 12 here, specialising to the case when is normally distributed, to explain this coincidence.) This concludes the proof of the central limit theorem.
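The moment computation can be checked exactly in a toy case. The sketch below assumes the summands are signs $\pm 1$ with equal probability (a bounded variable of mean zero and unit variance), computes the moments of the sum from the binomial distribution using exact rational arithmetic, and compares the normalised even moments with the gaussian moments, which are $(k-1)!! = 1 \cdot 3 \cdots (k-1)$ for even $k$ and zero for odd $k$. All names are chosen for this illustration.

```python
from fractions import Fraction
from math import comb


def moment_of_sign_sum(n, k):
    """Exact E S_n^k, where S_n is a sum of n iid signs +1/-1."""
    return sum(Fraction(comb(n, j), 2 ** n) * (2 * j - n) ** k for j in range(n + 1))


def gaussian_moment(k):
    """E G^k for a standard gaussian G: 0 for odd k, (k-1)!! for even k."""
    if k % 2 == 1:
        return 0
    result = 1
    for odd in range(1, k, 2):
        result *= odd
    return result


def normalised_even_moment(n, k):
    """Exact E (S_n / sqrt(n))^k for even k, as a rational number."""
    assert k % 2 == 0
    return moment_of_sign_sum(n, k) / n ** (k // 2)


if __name__ == "__main__":
    for k in (2, 4, 6):
        for n in (10, 100, 1000):
            print(k, n, float(normalised_even_moment(n, k)), gaussian_moment(k))
```

For these signs the fourth moment of the normalised sum works out to $3 - 2/n$: the diagonal terms contribute $n$ and the paired terms contribute $3n(n-1)$ before dividing by $n^2$, matching the case analysis above.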

Exercise 47 (Chernoff bound) Let be iid copies of a real random variable of mean zero and unit variance, which is subgaussian in the sense of Exercise 42. Write .

- (i) Show that there exist such that for all . Conclude that for all . (Hint: the first claim follows directly from Exercise 42 when ; for , use the Taylor approximation .)
- (ii) Conclude the Chernoff bound
for some , all , and all .
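For a concrete instance of such a gaussian-type tail bound, one can take the summands to be signs $\pm 1$ with equal probability. In that case Hoeffding's inequality (a standard result, used here in place of the unspecified constants of the exercise) gives the explicit bound $2 e^{-\lambda^2/2}$ for the probability that the normalised sum has magnitude at least $\lambda$. The sketch below computes the exact tail from the binomial distribution and compares it with that bound; the function names are hypothetical.

```python
import math


def two_sided_tail(n, lam):
    """Exact P(|S_n / sqrt(n)| >= lam) for S_n a sum of n iid signs +1/-1."""
    threshold = lam * math.sqrt(n)
    return sum(
        math.comb(n, k) / 2 ** n
        for k in range(n + 1)
        if abs(2 * k - n) >= threshold
    )


def hoeffding_bound(lam):
    """Hoeffding's bound 2 * exp(-lam^2 / 2) for sums of independent signs."""
    return 2 * math.exp(-lam * lam / 2)


if __name__ == "__main__":
    for n in (10, 100, 500):
        for lam in (0.5, 1.0, 2.0, 3.0):
            print(n, lam, two_sided_tail(n, lam), hoeffding_bound(lam))
```

The bound is uniform in $n$, which is the feature that distinguishes it from the central limit theorem: the latter only controls the tail for fixed $\lambda$ as $n$ goes to infinity.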

Exercise 48 (Erdös-Kac theorem) For any natural number , let be a natural number drawn uniformly at random from the natural numbers , and let denote the number of distinct prime factors of .

- (i) Show that for any , one has
as if is odd, and

as if is even. (Hint: adapt the arguments in Exercise 16 of Notes 3, estimating by , using Mertens’ theorem and induction on to deal with lower order errors, and treating the random variables as being approximately independent and approximately of mean zero.)

- (ii) Establish the Erdös-Kac theorem
as for any fixed .

Informally, the Erdös-Kac theorem asserts that behaves like for “random” . Note that this refines the Hardy-Ramanujan theorem (Exercise 16 of Notes 3).
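The statistics in this theorem are easy to tabulate for moderate ranges. The following sketch counts distinct prime factors with a sieve and reports the empirical mean and variance of the count over the integers up to a limit, next to $\log\log$ of the limit. The function names and the choice of limit are mine, and since $\log\log$ grows extremely slowly, agreement at accessible sizes should only be expected up to bounded lower order terms.

```python
import math


def distinct_prime_factor_counts(limit):
    """Return a list omega with omega[m] = number of distinct primes dividing m (0 <= m <= limit)."""
    omega = [0] * (limit + 1)
    for p in range(2, limit + 1):
        if omega[p] == 0:  # no smaller prime divides p, so p is prime
            for multiple in range(p, limit + 1, p):
                omega[multiple] += 1
    return omega


def summary(limit):
    """Empirical mean and variance of omega over 1..limit, together with log log limit."""
    omega = distinct_prime_factor_counts(limit)[1:]
    mean = sum(omega) / limit
    variance = sum((w - mean) ** 2 for w in omega) / limit
    return mean, variance, math.log(math.log(limit))


if __name__ == "__main__":
    for limit in (10 ** 3, 10 ** 4, 10 ** 5):
        print(limit, summary(limit))
```

A histogram of the standardised counts could be compared with the gaussian density in the same way, though the slow growth of $\log\log$ means the discreteness of the count remains very visible at these sizes.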

Filed under: 275A - probability theory, math.CA, math.PR Tagged: central limit theorem, Lindeberg exchange method, moment method

as for any distinct integers , where is the Liouville function. (The usual formulation of the conjecture also allows one to consider more general linear forms than the shifts , but for sake of discussion let us focus on the shift case.) This conjecture remains open for , though there are now some partial results when one averages either in or in the , as discussed in this recent post.

A natural generalisation of the Chowla conjecture is the Elliott conjecture. Its original formulation was basically as follows: one had

whenever were bounded completely multiplicative functions and were distinct integers, and one of the was “non-pretentious” in the sense that

for all Dirichlet characters and real numbers . It is easy to see that some condition like (2) is necessary; for instance if and has period then can be verified to be bounded away from zero as .

In a previous paper with Matomaki and Radziwill, we provided a counterexample to the original formulation of the Elliott conjecture, and proposed that (2) be replaced with the stronger condition

as for any Dirichlet character . To support this conjecture, we proved an averaged and non-asymptotic version of this conjecture which roughly speaking showed a bound of the form

whenever was an arbitrarily slowly growing function of , was sufficiently large (depending on and the rate at which grows), and one of the obeyed the condition

for some that was sufficiently large depending on , and all Dirichlet characters of period at most . As further support of this conjecture, I recently established the bound

under the same hypotheses, where is an arbitrarily slowly growing function of .

In view of these results, it is tempting to conjecture that the condition (4) for one of the should be sufficient to obtain the bound

when is large enough depending on . This may well be the case for . However, the purpose of this blog post is to record a simple counterexample for . Let’s take for simplicity. Let be a quantity much larger than but much smaller than (e.g. ), and set

For , Taylor expansion gives

and

and hence

and hence

On the other hand one can easily verify that all of the obey (4) (the restriction there prevents from getting anywhere close to ). So it seems the correct non-asymptotic version of the Elliott conjecture is the following:

Conjecture 1 (Non-asymptotic Elliott conjecture) Let be a natural number, and let be integers. Let , let be sufficiently large depending on , and let be sufficiently large depending on . Let be bounded multiplicative functions such that for some , one has for all Dirichlet characters of conductor at most . Then

The case of this conjecture follows from the work of Halasz; in my recent paper a logarithmically averaged version of the case of this conjecture is established. The requirement to take to be as large as does not emerge in the averaged Elliott conjecture in my previous paper with Matomaki and Radziwill; it thus seems that this averaging has concealed some of the subtler features of the Elliott conjecture. (However, this subtlety does not seem to affect the asymptotic version of the conjecture formulated in that paper, in which the hypothesis is of the form (3), and the conclusion is of the form (1).)

A similar subtlety arises when trying to control the maximal integral

In my previous paper with Matomaki and Radziwill, we could show that the easier expression

was small (for a slowly growing function of ) if was bounded and completely multiplicative, and one had a condition of the form

for some large . However, to obtain an analogous bound for (5) it now appears that one needs to strengthen the above condition to

in order to address the counterexample in which for some between and . This seems to suggest that proving (5) (which is closely related to the case of the Chowla conjecture) could in fact be rather difficult; the estimation of (6) relied primarily on prior work of Matomaki and Radziwill which used the hypothesis (7), but as this hypothesis is not sufficient to conclude (5), some additional input must also be used.


The first fundamental result about these sums is the law of large numbers (or LLN for short), which comes in two formulations, weak (WLLN) and strong (SLLN). To state these laws, we first must define the notion of convergence in probability.

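Definition 1 Let $X_n$ be a sequence of random variables taking values in a separable metric space $R = (R,d)$ (e.g. the $X_n$ could be scalar random variables, taking values in $\mathbf{R}$ or $\mathbf{C}$), and let $X$ be another random variable taking values in $R$. We say that $X_n$ converges in probability to $X$ if, for every radius $\varepsilon > 0$, one has $\mathbf{P}( d(X_n,X) > \varepsilon ) \to 0$ as $n \to \infty$. Thus, if $X_n, X$ are scalar, we have $X_n$ converging to $X$ in probability if $\mathbf{P}( |X_n - X| > \varepsilon ) \to 0$ as $n \to \infty$ for any given $\varepsilon > 0$.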

The measure-theoretic analogue of convergence in probability is convergence in measure.

It is instructive to compare the notion of convergence in probability with almost sure convergence. It is easy to see that $X_n$ converges almost surely to $X$ if and only if, for every radius $\varepsilon > 0$, one has $\mathbf{P}( \bigcup_{n \geq N} \{ d(X_n,X) > \varepsilon \} ) \to 0$ as $N \to \infty$; thus, roughly speaking, convergence in probability is good for controlling how a single random variable $X_n$ is close to its putative limiting value $X$, while almost sure convergence is good for controlling how the entire *tail* $(X_n)_{n \geq N}$ of a sequence of random variables is close to its putative limit $X$.

We have the following easy relationships between convergence in probability and almost sure convergence:

Exercise 2 Let $X_n$ be a sequence of scalar random variables, and let $X$ be another scalar random variable.

- (i) If $X_n \to X$ almost surely, show that $X_n \to X$ in probability. Give a counterexample to show that the converse does not necessarily hold.
- (ii) Suppose that $\sum_n \mathbf{P}( |X_n - X| > \varepsilon ) < \infty$ for all $\varepsilon > 0$. Show that $X_n \to X$ almost surely. Give a counterexample to show that the converse does not necessarily hold.
- (iii) If $X_n \to X$ in probability, show that there is a subsequence $X_{n_j}$ of the $X_n$ such that $X_{n_j} \to X$ almost surely.
- (iv) If $X_n, X$ are absolutely integrable and $\mathbf{E} |X_n - X| \to 0$ as $n \to \infty$, show that $X_n \to X$ in probability. Give a counterexample to show that the converse does not necessarily hold.
- (v) (Urysohn subsequence principle) Suppose that every subsequence $X_{n_j}$ of $X_n$ has a further subsequence $X_{n_{j_k}}$ that converges to $X$ in probability. Show that $X_n$ also converges to $X$ in probability.
- (vi) Does the Urysohn subsequence principle still hold if “in probability” is replaced with “almost surely” throughout?
- (vii) If $X_n$ converges in probability to $X$, and $F: \mathbf{R} \to \mathbf{R}$ or $F: \mathbf{C} \to \mathbf{C}$ is continuous, show that $F(X_n)$ converges in probability to $F(X)$. More generally, if for each $i=1,\dots,k$, $X^{(i)}_n$ is a sequence of scalar random variables that converge in probability to $X^{(i)}$, and $F: \mathbf{R}^k \to \mathbf{R}$ or $F: \mathbf{C}^k \to \mathbf{C}$ is continuous, show that $F(X^{(1)}_n,\dots,X^{(k)}_n)$ converges in probability to $F(X^{(1)},\dots,X^{(k)})$. (Thus, for instance, if $X_n$ and $Y_n$ converge in probability to $X$ and $Y$ respectively, then $X_n + Y_n$ and $X_n Y_n$ converge in probability to $X+Y$ and $XY$ respectively.)
- (viii) (Fatou’s lemma for convergence in probability) If $X_n$ are non-negative and converge in probability to $X$, show that $\mathbf{E} X \leq \liminf_{n \to \infty} \mathbf{E} X_n$.
- (ix) (Dominated convergence in probability) If $X_n$ converge in probability to $X$, and one almost surely has $|X_n| \leq Y$ for all $n$ and some absolutely integrable $Y$, show that $\mathbf{E} X_n$ converges to $\mathbf{E} X$.

Exercise 3 Let $X_n$ be a sequence of scalar random variables converging in probability to another random variable $X$.

- (i) Suppose that there is a random variable $Y$ which is independent of $X_n$ for each individual $n$. Show that $Y$ is also independent of $X$.
- (ii) Suppose that the $X_n$ are jointly independent. Show that $X$ is almost surely constant (i.e. there is a deterministic scalar $c$ such that $X = c$ almost surely).

We can now state the weak and strong law of large numbers, in the model case of iid random variables.

Theorem 4 (Law of large numbers, model case) Let $X_1, X_2, \dots$ be an iid sequence of copies of an absolutely integrable random variable $X$ (thus the $X_i$ are independent and all have the same distribution as $X$). Write $\mu := \mathbf{E} X$, and for each natural number $n$, let $S_n$ denote the random variable $S_n := X_1 + \dots + X_n$.

- (i) (Weak law of large numbers) The random variables $S_n/n$ converge in probability to $\mu$.
- (ii) (Strong law of large numbers) The random variables $S_n/n$ converge almost surely to $\mu$.

Informally: if $X_1,\dots,X_n$ are iid with mean $\mu$, then $X_1 + \dots + X_n \approx \mu n$ for $n$ large. Clearly the strong law of large numbers implies the weak law, but the weak law is easier to prove (and has somewhat better quantitative estimates). There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, or when the random variable $X$ is not absolutely integrable, or if one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants below the fold.

It is instructive to compare the law of large numbers with what one can obtain from the Kolmogorov zero-one law, discussed in Notes 2. Observe that if the $X_n$ are real-valued, then the limit superior $\limsup_{n \to \infty} S_n/n$ and limit inferior $\liminf_{n \to \infty} S_n/n$ are tail random variables in the sense that they are not affected if one changes finitely many of the $X_n$; in particular, events such as $\limsup_{n \to \infty} S_n/n > t$ are tail events for any $t \in \mathbf{R}$. From this and the zero-one law we see that there must exist deterministic quantities $-\infty \leq \mu_- \leq \mu_+ \leq +\infty$ such that $\liminf_{n \to \infty} S_n/n = \mu_-$ and $\limsup_{n \to \infty} S_n/n = \mu_+$ almost surely. The strong law of large numbers can then be viewed as the assertion that $\mu_- = \mu_+ = \mathbf{E} X$ when $X$ is absolutely integrable. On the other hand, the zero-one law argument does not require absolute integrability (and one can replace the denominator $n$ by other functions of $n$ that go to infinity as $n \to \infty$).

The law of large numbers asserts, roughly speaking, that the theoretical expectation $\mathbf{E} X$ of a random variable $X$ can be approximated by taking a large number of independent samples $X_1,\dots,X_n$ of $X$ and then forming the empirical mean $S_n/n = \frac{X_1+\dots+X_n}{n}$. This ability to approximate the theoretical statistics of a probability distribution through empirical data is one of the basic starting points for mathematical statistics, though this is not the focus of the course here. The tendency of statistics such as $S_n/n$ to cluster closely around their mean value $\mu$ is the simplest instance of the concentration of measure phenomenon, which is of tremendous significance not only within probability, but also in applications of probability to disciplines such as statistics, theoretical computer science, combinatorics, random matrix theory and high dimensional geometry. We will not discuss these topics much in this course, but see this previous blog post for some further discussion.

There are several ways to prove the law of large numbers (in both forms). One basic strategy is to use the *moment method*, that is, to control statistics such as $S_n/n$ by computing moments such as the mean $\mathbf{E} S_n/n$, variance $\mathbf{E} |S_n/n - \mathbf{E} S_n/n|^2$, or higher moments such as $\mathbf{E} |S_n/n - \mathbf{E} S_n/n|^k$ for $k = 4, 6, \dots$. The joint independence of the $X_i$ makes such moments fairly easy to compute, requiring only some elementary combinatorics. A direct application of the moment method typically requires one to make a finite moment assumption such as $\mathbf{E} |X|^k < \infty$, but as we shall see, one can reduce fairly easily to this case by a truncation argument.

For the strong law of large numbers, one can also use methods relating to the theory of martingales, such as stopping time arguments and maximal inequalities; we present some classical arguments of Kolmogorov in this regard.

** — 1. The moment method — **

We begin by using the moment method to establish both the strong and weak law of large numbers for sums of iid random variables, under additional moment hypotheses.

We first make a very simple observation: in order to prove the weak or strong law of large numbers for complex variables, it suffices to do so for real variables, as the complex case follows from the real case after taking real and imaginary parts. Thus we shall restrict attention henceforth to real random variables, in order to avoid some unnecessary complications involving complex conjugation.

Let $X_1, X_2, \dots$ be a sequence of iid copies of a scalar random variable $X$, and define the partial sums $S_n := X_1 + \dots + X_n$. Suppose that $X$ is absolutely integrable, with expectation (or *mean*) $\mathbf{E} X = \mu$. Then we can use linearity of expectation to compute the expectation (or *first moment*) of $S_n$:

$$ \mathbf{E} S_n = \mathbf{E} X_1 + \dots + \mathbf{E} X_n = n \mu. $$

In particular, the expectation of $S_n/n$ is $\mu$. This looks consistent with the strong and weak law of large numbers, but does not immediately imply these laws. However, thanks to Markov’s inequality, we do at least get the following very weak bound

$$ \mathbf{P}\left( \frac{S_n}{n} \geq \lambda \mu \right) \leq \frac{1}{\lambda} \quad (1) $$

for any $\lambda > 0$, in the case that $X$ is unsigned and absolutely integrable. Thus, in the unsigned case at least, we see that $S_n/n$ usually doesn’t get much larger than the mean $\mu$. We will refer to (1) as a *first moment* bound on $S_n$, as it was obtained primarily through a computation of the first moment $\mathbf{E} S_n$ of $S_n$.

Now we turn to second moment bounds on $S_n$, obtained through computations of second moments such as $\mathbf{E} |S_n|^2$ or $\mathbf{Var}(S_n) = \mathbf{E} |S_n - \mathbf{E} S_n|^2$. It will be convenient to normalise the mean $\mu$ to equal zero, by replacing each $X_i$ with $X_i - \mu$ (and $X$ by $X - \mu$), so that $S_n$ gets replaced by $S_n - \mu n$ (and $S_n/n$ by $S_n/n - \mu$). With this normalisation, we see that to prove the strong or weak law of large numbers, it suffices to do so in the mean zero case $\mu = 0$. (On the other hand, if $X$ is unsigned, then normalising $X$ in this fashion will almost certainly destroy the unsigned property, so it is not always desirable to perform this normalisation.)

Suppose that $X$ has finite second moment (i.e. $\mathbf{E} |X|^2 < \infty$, that is to say $X$ is square-integrable) and has been normalised to have mean zero. We write the variance $\mathbf{E} |X|^2$ as $\sigma^2$. The first moment calculation then shows that $S_n$ has mean zero. Now we compute the variance of $S_n$, which in the mean zero case is simply $\mathbf{E} |S_n|^2$; note from the triangle inequality that this quantity is finite. By linearity of expectation, we have

$$ \mathbf{Var}(S_n) = \mathbf{E} |X_1 + \dots + X_n|^2 = \sum_{1 \leq i,j \leq n} \mathbf{E} X_i X_j. $$

(All expressions here are absolutely integrable thanks to the Cauchy-Schwarz inequality.) If $i = j$, then the term $\mathbf{E} X_i X_j$ is equal to $\mathbf{E} |X|^2 = \sigma^2$. If $i \neq j$, then by hypothesis $X_i$ and $X_j$ are independent and mean zero, and thus

$$ \mathbf{E} X_i X_j = (\mathbf{E} X_i) (\mathbf{E} X_j) = 0. $$

Putting all this together, we obtain

$$ \mathbf{Var}(S_n) = n \sigma^2 $$

or equivalently

$$ \mathbf{Var}(S_n/n) = \frac{1}{n} \sigma^2. $$

This bound was established in the mean zero case, but it is clear that it also holds in general, since subtracting a constant from a random variable does not affect its variance. Thus we see that while $S_n/n$ has the same mean $\mu$ as $X$, it has a much smaller variance: $\sigma^2/n$ in place of $\sigma^2$. This is the first demonstration of the concentration of measure effect that comes from combining many independent random variables $X_1,\dots,X_n$. (At the opposite extreme to the independent case, suppose we took $X_1,\dots,X_n$ to all be exactly the same random variable: $X_1 = \dots = X_n = X$. Then $S_n/n = X$ has exactly the same mean and variance as $X$. Decorrelating the $X_1,\dots,X_n$ does not affect the mean of $S_n$, but produces significant cancellations that reduce the variance.)
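As a sanity check on this variance computation, one can enumerate a small example exactly. The sketch below is illustrative only: the helper names `mean_and_variance` and `sample_mean_distribution` and the choice of a fair die are assumptions made for the example, and exact rational arithmetic is used so that no rounding enters the comparison.

```python
from fractions import Fraction
from itertools import product

def mean_and_variance(dist):
    """Exact mean and variance of a finite distribution {value: probability}."""
    mean = sum(p * v for v, p in dist.items())
    var = sum(p * (v - mean) ** 2 for v, p in dist.items())
    return mean, var

def sample_mean_distribution(dist, n):
    """Exact distribution of the empirical mean of n iid draws from dist,
    obtained by brute-force enumeration of all outcomes."""
    out = {}
    for outcome in product(dist.items(), repeat=n):
        value = sum(Fraction(v) for v, _ in outcome) / n
        prob = Fraction(1)
        for _, p in outcome:
            prob *= p
        out[value] = out.get(value, 0) + prob
    return out

# Hypothetical example: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, sigma2 = mean_and_variance(die)
for n in (1, 2, 3, 4):
    m, v = mean_and_variance(sample_mean_distribution(die, n))
    print(n, m, v, sigma2 / n)  # the last two columns should agree
```

The mean stays fixed at $7/2$ while the variance of the empirical mean shrinks by exactly the factor $1/n$.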

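If we insert this variance bound into Chebyshev’s inequality, we obtain the bound

$$ \mathbf{P}\left( \left|\frac{S_n}{n} - \mu\right| \geq \varepsilon \right) \leq \frac{1}{n} \frac{\sigma^2}{\varepsilon^2} \quad (2) $$

for any natural number $n$ and $\varepsilon > 0$, whenever $X$ has mean $\mu$ and a finite variance $\sigma^2 < \infty$. The right-hand side goes to zero as $n \to \infty$ for fixed $\varepsilon$, so we have in fact established the weak law of large numbers in the case that $X$ has finite variance.

Note that (2) implies that

$$ \frac{\sqrt{n}}{\omega(n)} \left( \frac{S_n}{n} - \mu \right) $$

converges to zero in probability whenever $\omega(n)$ is a function of $n$ that goes to infinity as $n \to \infty$. Thus for instance

$$ \frac{\sqrt{n}}{\log n} \left( \frac{S_n}{n} - \mu \right) \to 0 $$

in probability. Informally, this means that $S_n$ tends to stray by not much more than $\sqrt{n}$ from $\mu n$ typically. This intuition will be reinforced in the next set of notes when we study the central limit theorem and related results such as the Chernoff inequality. (It is also supported by the law of the iterated logarithm, which we will probably not be able to get to in this set of notes.)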
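To get a feel for how lossy the second moment bound (2) is, one can compare it against an exact tail probability in a case where the latter is computable. The sketch below does this for sums of fair coin flips; the function names are hypothetical, and exact rational arithmetic is used so that boundary cases of the inequality $|S_n/n - p| \geq \varepsilon$ are decided correctly.

```python
from fractions import Fraction
from math import comb

def binomial_tail(n, p, eps):
    """Exact P(|S_n/n - p| >= eps) for S_n ~ Binomial(n, p); p, eps are Fractions."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) >= eps
    )

def chebyshev_bound(n, sigma2, eps):
    """Right-hand side of the second moment bound: sigma^2 / (n eps^2)."""
    return sigma2 / (n * eps**2)

p, eps = Fraction(1, 2), Fraction(1, 10)
for n in (10, 100, 1000):
    exact = binomial_tail(n, p, eps)
    bound = chebyshev_bound(n, p * (1 - p), eps)
    print(n, float(exact), float(bound))
```

The bound is valid in every row but far from sharp for large $n$, which is consistent with the remark that sharper tools (such as the Chernoff inequality) are available later.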

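One can hope to use (2) and the Borel-Cantelli lemma (Exercise 2(ii)) to also obtain the strong law of large numbers in the second moment case, but unfortunately the quantities $\frac{1}{n} \frac{\sigma^2}{\varepsilon^2}$ are not summable in $n$. To resolve this issue, we will go to higher moments than the second moment. One could calculate third moments such as $\mathbf{E} S_n^3$, but this turns out to not convey too much information (unless $X$ is unsigned) because of the signed nature of $S_n^3$; the expression $\mathbf{E} |S_n|^3$ would in principle convey more usable information, but is difficult to compute as $|S_n|^3$ is not a polynomial combination of the $X_i$. Instead, we move on to the fourth moment. Again, we normalise $X$ to have mean $\mu = 0$, and now assume a finite fourth moment $\mathbf{E} |X|^4 < \infty$ (which, by the Hölder or Jensen inequalities, implies that all lower moments such as $\mathbf{E} |X|^2$ are finite). We again use $\sigma^2 = \mathbf{E} |X|^2$ to denote the variance of $X$. We can expand

$$ \mathbf{E} |S_n|^4 = \mathbf{E} |X_1 + \dots + X_n|^4 = \sum_{1 \leq i,j,k,l \leq n} \mathbf{E} X_i X_j X_k X_l. $$

Note that all expectations here are absolutely integrable by Hölder’s inequality and the hypothesis of finite fourth moment. The correlation $\mathbf{E} X_i X_j X_k X_l$ looks complicated, but fortunately it simplifies greatly in most cases. Suppose for instance that $i$ is distinct from $j,k,l$, then $X_i$ is independent of $(X_j,X_k,X_l)$ (even if some of the $j,k,l$ are equal to each other) and so

$$ \mathbf{E} X_i X_j X_k X_l = (\mathbf{E} X_i) (\mathbf{E} X_j X_k X_l) = 0 $$

since $\mathbf{E} X_i = 0$. Similarly for permutations. This leaves only a few quadruples $(i,j,k,l)$ for which $\mathbf{E} X_i X_j X_k X_l$ could be non-zero: the three cases $i=j \neq k=l$, $i=k \neq j=l$, $i=l \neq j=k$ where each of the indices is paired up with exactly one other index; and the diagonal case $i=j=k=l$. If for instance $i=j \neq k=l$, then

$$ \mathbf{E} X_i X_j X_k X_l = \mathbf{E} X_i^2 X_k^2 = (\mathbf{E} X_i^2) (\mathbf{E} X_k^2) = \sigma^2 \times \sigma^2. $$

Similarly for the cases $i=k \neq j=l$ and $i=l \neq j=k$, which gives a total contribution of $3n(n-1) \sigma^4$ to $\mathbf{E} |S_n|^4$. Finally, when $i=j=k=l$, then $\mathbf{E} X_i X_j X_k X_l = \mathbf{E} X^4$, and there are $n$ contributions of this form to $\mathbf{E} |S_n|^4$. We conclude that

$$ \mathbf{E} |S_n|^4 = 3n(n-1) \sigma^4 + n \mathbf{E} |X|^4 $$

and hence by Markov’s inequality

$$ \mathbf{P}\left( \left|\frac{S_n}{n}\right| \geq \varepsilon \right) \leq \frac{3n(n-1) \sigma^4 + n \mathbf{E} |X|^4}{\varepsilon^4 n^4} $$

for any $\varepsilon > 0$. If we remove the normalisation $\mu = 0$, we conclude that

$$ \mathbf{P}\left( \left|\frac{S_n}{n} - \mu\right| \geq \varepsilon \right) \leq \frac{3n(n-1) \sigma^4 + n \mathbf{E} |X - \mu|^4}{\varepsilon^4 n^4}. $$

The right-hand side decays like $O(1/n^2)$, which is now summable in $n$ (in contrast to (2)). Thus we may now apply the Borel-Cantelli lemma and conclude the *strong* law of large numbers in the case when one has bounded fourth moment $\mathbf{E} |X|^4 < \infty$.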
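The combinatorics of the fourth moment computation can be checked by brute force in the simplest case. The sketch below assumes the summands are iid uniform random signs (so the variance and the fourth moment both equal $1$), in which case the count of paired and diagonal quadruples predicts a fourth moment of $3n(n-1) + n$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def fourth_moment_of_sign_sum(n):
    """E |S_n|^4 for S_n a sum of n iid uniform signs, by enumerating all 2^n outcomes."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2**n)

def predicted_fourth_moment(n, sigma2=1, fourth=1):
    """3 n (n-1) sigma^4 + n E|X|^4; the defaults are the values for random signs."""
    return 3 * n * (n - 1) * sigma2**2 + n * fourth

for n in range(1, 9):
    print(n, fourth_moment_of_sign_sum(n), predicted_fourth_moment(n))
```

Both columns agree, and they grow like $3n^2$ rather than $n^4$; this is the gain that makes the resulting tail bound summable in $n$.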

One can of course continue to compute higher and higher moments of (assuming suitable finite moment hypotheses on ), though as one can already see from the fourth moment calculation, the computations become increasingly combinatorial in nature. We will pursue this analysis more in the next set of notes, when we discuss the central limit theorem. For now, we turn to some applications and variants of the moment method (many of which are taken from Durrett’s book).

We begin with two quick applications of the weak law of large numbers to topics outside of probability. We first give an explicit version of the Weierstrass approximation theorem, which asserts that continuous functions on (say) the unit interval can be approximated by polynomials.

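Proposition 5 (Approximation by Bernstein polynomials) Let $f: [0,1] \to \mathbf{R}$ be a continuous function. Then the Bernstein polynomials $$ f_n(t) := \sum_{i=0}^n \binom{n}{i} t^i (1-t)^{n-i} f\left(\frac{i}{n}\right) $$ converge uniformly to $f$ as $n \to \infty$.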

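*Proof:* We first establish the pointwise bound $f_n(t) \to f(t)$ for each $0 \leq t \leq 1$. Fix $t$, and let $X_1, X_2, \dots$ be iid Bernoulli variables (i.e. variables taking values in $\{0,1\}$) with each $X_i$ equal to $1$ with probability $t$. The mean of the $X_i$ is clearly $t$, and the variance is bounded crudely by $1$ (in fact it is $t(1-t)$), so if we set $S_n := X_1 + \dots + X_n$ then by the weak law of large numbers for random variables of finite second moment, we see that $S_n/n$ converges in probability to $t$ (note that $S_n/n$ is clearly dominated by $1$). By the dominated convergence theorem in probability, we conclude that $\mathbf{E} f(S_n/n)$ converges to $f(t)$. But from direct computation we see that $S_n/n$ takes values in $\{0, 1/n, \dots, n/n\}$, with each $i/n$ being attained with probability $\binom{n}{i} t^i (1-t)^{n-i}$ (i.e. $S_n$ has a binomial distribution), and so from the definition of the Bernstein polynomials $f_n$ we see that $\mathbf{E} f(S_n/n) = f_n(t)$. This concludes the pointwise convergence claim.

To establish the uniform convergence, we use the *proof* of the weak law of large numbers, rather than the *statement* of that law, to get the desired uniformity in the parameter $t$. For a given $t$, we see from (2) that

$$ \mathbf{P}\left( \left|\frac{S_n}{n} - t\right| \geq \delta \right) \leq \frac{1}{n} \frac{t(1-t)}{\delta^2} $$

for any $\delta > 0$. On the other hand, as $f$ is continuous on $[0,1]$, it is uniformly continuous, and so for any $\varepsilon > 0$ there exists a $\delta > 0$ such that $|f(x) - f(y)| < \varepsilon$ whenever $x, y \in [0,1]$ with $|x-y| < \delta$. For such an $\varepsilon$ and $\delta$, we conclude that

$$ \mathbf{P}\left( \left|f\left(\frac{S_n}{n}\right) - f(t)\right| \geq \varepsilon \right) \leq \frac{1}{n} \frac{t(1-t)}{\delta^2}. $$

On the other hand, being continuous on $[0,1]$, $f$ must be bounded in magnitude by some bound $M$, so that $|f(S_n/n) - f(t)| \leq 2M$. This leads to the upper bound

$$ \mathbf{E} \left|f\left(\frac{S_n}{n}\right) - f(t)\right| \leq \varepsilon + \frac{2M}{n} \frac{t(1-t)}{\delta^2} $$

and thus by the triangle inequality and the identity $\mathbf{E} f(S_n/n) = f_n(t)$

$$ |f_n(t) - f(t)| \leq \varepsilon + \frac{2M}{n} \frac{t(1-t)}{\delta^2}. $$

Since $t(1-t)$ is bounded by (say) $1$, and $\varepsilon$ can be made arbitrarily small, we conclude that $f_n$ converges uniformly to $f$ as required.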
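The Bernstein polynomials are easy to evaluate numerically, which gives a quick way to watch the uniform convergence happen. The following is a minimal sketch: the function names, the evaluation grid and the test function $|t - 1/2|$ are all choices made for illustration, and the supremum is only approximated on a finite grid.

```python
from math import comb

def bernstein(f, n, t):
    """Evaluate the n-th Bernstein polynomial of f at a point t in [0, 1],
    i.e. the expectation of f(S_n/n) for S_n ~ Binomial(n, t)."""
    return sum(comb(n, i) * t**i * (1 - t) ** (n - i) * f(i / n) for i in range(n + 1))

def sup_error(f, n, grid=201):
    """Maximum of |f_n(t) - f(t)| over an evenly spaced grid in [0, 1]."""
    points = [j / (grid - 1) for j in range(grid)]
    return max(abs(bernstein(f, n, t) - f(t)) for t in points)

def f(t):
    # Hypothetical test function: continuous on [0, 1] but not differentiable at 1/2.
    return abs(t - 0.5)

for n in (10, 40, 160):
    print(n, sup_error(f, n))
```

For this test function the error is largest near the kink at $t = 1/2$ and decays slowly in $n$, roughly like $n^{-1/2}$, which matches the $\sqrt{n}$ fluctuation scale of $S_n$.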

The other application of the weak law of large numbers is to the geometry of high-dimensional cubes, giving the rather unintuitive conclusion that most of the volume of the high-dimensional cube is contained in a thin annulus.

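Proposition 6 Let $\varepsilon > 0$. Then, for sufficiently large $n$, a proportion of at least $1 - \varepsilon$ of the cube $[-1,1]^n$ (by $n$-dimensional Lebesgue measure) is contained in the annulus $\{ x \in \mathbf{R}^n: (1-\varepsilon) \sqrt{n/3} \leq |x| \leq (1+\varepsilon) \sqrt{n/3} \}$.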

This proposition already indicates that high-dimensional geometry can behave in a manner quite differently from what one might naively expect from low-dimensional geometric intuition; one needs to develop a rather distinct high-dimensional geometric intuition before one can accurately make predictions in large dimensions.

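*Proof:* Let $X_1, X_2, \dots$ be iid random variables drawn uniformly from $[-1,1]$. Then the random vector $(X_1,\dots,X_n)$ is uniformly distributed on the cube $[-1,1]^n$. The variables $X_1^2, X_2^2, \dots$ are also iid, and (by the change of variables formula) have mean

$$ \mathbf{E} X_i^2 = \int_{-1}^1 x^2 \frac{dx}{2} = \frac{1}{3}. $$

Hence, by the weak law of large numbers, the quantity $\frac{X_1^2 + \dots + X_n^2}{n}$ converges in probability to $\frac{1}{3}$, so in particular the probability

$$ \mathbf{P}\left( \left| \sqrt{X_1^2 + \dots + X_n^2} - \sqrt{n/3} \right| > \varepsilon \sqrt{n/3} \right) $$

goes to zero as $n$ goes to infinity. But this quantity is precisely the proportion of $[-1,1]^n$ that lies outside the annulus $\{ x \in \mathbf{R}^n: (1-\varepsilon) \sqrt{n/3} \leq |x| \leq (1+\varepsilon) \sqrt{n/3} \}$, and the claim follows.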
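This concentration is easy to observe by sampling. The sketch below is a Monte Carlo estimate rather than an exact computation: it assumes the cube is $[-1,1]^n$, so that the typical norm is $\sqrt{n/3}$, and the function name, sample size and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt

def fraction_in_annulus(n, eps, samples=2000, seed=0):
    """Monte Carlo estimate of the proportion of the cube [-1,1]^n whose Euclidean
    norm lies between (1 - eps) sqrt(n/3) and (1 + eps) sqrt(n/3)."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    target = sqrt(n / 3)
    hits = 0
    for _ in range(samples):
        norm = sqrt(sum(rng.uniform(-1, 1) ** 2 for _ in range(n)))
        if (1 - eps) * target <= norm <= (1 + eps) * target:
            hits += 1
    return hits / samples

for n in (3, 30, 300):
    print(n, fraction_in_annulus(n, eps=0.1))
```

In three dimensions only a minority of the cube lies in the shell, while in a few hundred dimensions essentially all sampled points do.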

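The first and second moment method are very general, and apply to sums $S_n = X_1 + \dots + X_n$ of random variables $X_1,\dots,X_n$ that do not need to be identically distributed, or even independent (although the bounds can get weaker and more complicated if one strays too far from these hypotheses). For instance, it is clear from linearity of expectation that $S_n$ has mean

$$ \mathbf{E} S_n = \mathbf{E} X_1 + \dots + \mathbf{E} X_n \quad (3) $$

(assuming of course that $X_1,\dots,X_n$ are absolutely integrable) and variance

$$ \mathbf{Var}(S_n) = \mathbf{Var}(X_1) + \dots + \mathbf{Var}(X_n) + \sum_{1 \leq i,j \leq n: i \neq j} \mathrm{Cov}(X_i,X_j) $$

(assuming now that $X_1,\dots,X_n$ are square-integrable). (For the latter claim, it is convenient, as before, to first normalise each of the $X_i$ to have mean zero.) If the $X_1,\dots,X_n$ are pairwise independent in addition to being square-integrable, then all the covariances vanish, and we obtain additivity of the variance:

$$ \mathbf{Var}(S_n) = \mathbf{Var}(X_1) + \dots + \mathbf{Var}(X_n). \quad (4) $$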

Remark 7 Viewing the variance as the square of the standard deviation, the identity (4) can be interpreted as a rigorous instantiation of the following informal *principle of square root cancellation*: if one has a sum $S_n = X_1 + \dots + X_n$ of random (or pseudorandom) variables that “oscillate” in the sense that their mean is either zero or close to zero, and each $X_i$ has an expected magnitude of about $a_i$ (in the sense that a statistic such as the standard deviation of $X_i$ is comparable to $a_i$), and the $X_i$ “behave independently”, then the sum $S_n$ is expected to have a magnitude of about $(\sum_{i=1}^n |a_i|^2)^{1/2}$. Thus for instance a sum of $n$ unbiased signs $X_i \in \{-1,+1\}$ would be expected to have magnitude about $\sqrt{n}$ if the $X_i$ do not exhibit strong correlations with each other. This principle turns out to be remarkably broadly applicable (at least as a heuristic, if not as a rigorous argument), even in situations for which no randomness is evident (e.g. in considering the type of exponential sums that occur in analytic number theory). We will see some further instantiations of this principle in later notes.

These identities, together with Chebyshev’s inequality, already give some useful control on many statistics, including some which are not obviously of the form of a sum of independent random variables. A classic example of this is the coupon collector problem, which we formulate as follows. Let $N$ be a natural number, and let $Y_1, Y_2, \dots$ be an infinite sequence of “coupons”, which are iid and uniformly distributed from the finite set $\{1,\dots,N\}$. Let $T_N$ denote the first time at which one has collected all $N$ different types of coupons, thus $T_N$ is the first natural number for which the set $\{Y_1,\dots,Y_{T_N}\}$ attains the maximal cardinality of $N$ (or $T_N = \infty$ if no such natural number exists, though it is easy to see that this is a null event, indeed note from the strong law of large numbers that almost surely one will collect infinitely many of each coupon over time). The question is then to describe the behaviour of $T_N$ as $N$ gets large.

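At first glance, $T_N$ does not seem to be easily describable as the sum of many independent random variables. However, if one looks at it the right way, one can see such a structure emerge (and much of the art of probability is in finding useful and different ways of thinking of the same random variable). Namely, for each $i = 0,\dots,N$, let $T_{i,N}$ denote the first time one has collected $i$ coupons out of $N$, thus $T_{i,N}$ is the first non-negative integer such that $\{Y_1,\dots,Y_{T_{i,N}}\}$ has cardinality $i$, with

$$ 0 = T_{0,N} < T_{1,N} < \dots < T_{N,N} = T_N. $$

If we then write $X_{i,N} := T_{i,N} - T_{i-1,N}$ for $i = 1,\dots,N$, then the $X_{i,N}$ take values in the natural numbers, and we have the telescoping sum

$$ T_N = X_{1,N} + X_{2,N} + \dots + X_{N,N}. $$

Remarkably, the random variables $X_{1,N},\dots,X_{N,N}$ have a simple structure:

Proposition 8 The random variables $X_{1,N},\dots,X_{N,N}$ are jointly independent, and each $X_{i,N}$ has a geometric distribution with parameter $\frac{N-i+1}{N}$, in the sense that $$ \mathbf{P}( X_{i,N} = j ) = \left(1 - \frac{N-i+1}{N}\right)^{j-1} \frac{N-i+1}{N} $$ for $j = 1, 2, \dots$ and $i = 1,\dots,N$.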

The joint independence of the reflects the “Markovian” or “memoryless” nature of a certain process relating to the coupon collector problem, and can be easily established once one has understood the concept of conditional expectation, but the further exploration of these concepts will have to be deferred to the course after this one (which I will not be teaching or writing notes for). But as the coupon collecting problem is so simple, we shall proceed instead by direct computation.

*Proof:* It suffices to show that

for any choice of natural numbers . In order for the event to hold, the first coupon can be arbitrary, but the coupons have to be equal to ; then must be from one of the remaining elements of not equal to , and must be from the two-element set ; and so on and so forth up to . The claim then follows from a routine application of elementary combinatorics to count all the possible values for the tuple of the above form and dividing by the total number of such tuples.

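Exercise 9 Show that if $X$ is a geometric distribution with parameter $p$ for some $0 < p \leq 1$ (thus $\mathbf{P}(X = j) = (1-p)^{j-1} p$ for all $j = 1, 2, \dots$) then $X$ has mean $\frac{1}{p}$ and variance $\frac{1-p}{p^2}$.

From the above proposition and exercise as well as (3), (4) we see that

$$ \mathbf{E} T_N = \sum_{i=1}^N \frac{N}{N-i+1} $$

and

$$ \mathbf{Var}(T_N) = \sum_{i=1}^N \frac{N^2}{(N-i+1)^2} \frac{i-1}{N}. $$

From the integral test (and crudely bounding $\frac{i-1}{N} = O(1)$) one can thus obtain the bounds

$$ \mathbf{E} T_N = N \log N + O(N) $$

and

$$ \mathbf{Var}(T_N) = O(N^2) $$

where we use the usual asymptotic notation of denoting by $O(X)$ any quantity bounded in magnitude by a constant multiple of $X$. From Chebyshev’s inequality we thus see that

$$ \mathbf{P}( |T_N - N \log N| \geq \lambda N ) = O(\lambda^{-2}) $$

for any $\lambda > 0$ (note the bound is trivial unless $\lambda$ is large). This implies in particular that $\frac{T_N}{N \log N}$ converges to $1$ in probability as $N \to \infty$ (assuming that our underlying probability space can model a separate coupon collector problem for each choice of $N$). Thus, roughly speaking, we see that one expects to take about $N \log N$ units of time to collect all $N$ coupons.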
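The mean and variance of the collection time can be computed exactly from the decomposition into independent geometric waiting times, which makes the asymptotics easy to check numerically. In the sketch below, the $i$-th waiting time is taken to be geometric with success probability $(N-i+1)/N$, with mean $1/p$ and variance $(1-p)/p^2$; the function names are hypothetical.

```python
from fractions import Fraction
from math import log

def coupon_mean(N):
    """Exact expected collection time: the sum of 1/p_i with p_i = (N - i + 1)/N."""
    return sum(Fraction(N, N - i + 1) for i in range(1, N + 1))

def coupon_variance(N):
    """Exact variance of the collection time: the sum of (1 - p_i)/p_i^2,
    using additivity of variance for independent summands."""
    total = Fraction(0)
    for i in range(1, N + 1):
        p = Fraction(N - i + 1, N)
        total += (1 - p) / p**2
    return total

for N in (10, 100, 1000):
    # Columns: N, exact mean, N log N, variance divided by N^2.
    print(N, float(coupon_mean(N)), N * log(N), float(coupon_variance(N)) / N**2)
```

The exact mean exceeds $N \log N$ by a bounded multiple of $N$, and the variance divided by $N^2$ stays bounded, so the standard deviation is of lower order than the mean.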

Another application of the weak and strong law of large numbers (even with the moment hypotheses currently imposed on these laws) is a converse to the Borel-Cantelli lemma in the jointly independent case:

Exercise 10 (Second Borel-Cantelli lemma) Let $E_1, E_2, \dots$ be a sequence of jointly independent events. If $\sum_{n=1}^\infty \mathbf{P}(E_n) = \infty$, show that almost surely an infinite number of the $E_n$ hold simultaneously. (Hint: compute the mean and variance of $S_n = \sum_{i=1}^n 1_{E_i}$. One can also compute the fourth moment if desired, but it is not necessary to do so for this result.)

One application of the second Borel-Cantelli lemma has the colourful name of the “infinite monkey theorem”:

Exercise 11 (Infinite monkey theorem) Let $X_1, X_2, \dots$ be iid random variables drawn uniformly from a finite alphabet $A$. Show that almost surely, every finite word $a_1 \dots a_k$ of letters $a_1,\dots,a_k$ in the alphabet appears infinitely often in the string $X_1 X_2 X_3 \dots$.

In the usual formulation of the weak or strong law of large numbers, we draw the sums from a single infinite sequence of iid random variables. One can generalise the situation slightly by working instead with sums from rows of a triangular array, which are jointly independent within rows but not necessarily across rows:

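Exercise 12 (Triangular arrays) Let $(X_{i,n})_{i,n \in \mathbf{N}: i \leq n}$ be a triangular array of scalar random variables $X_{i,n}$, such that for each $n$, the row $X_{1,n},\dots,X_{n,n}$ is a collection of independent random variables. For each $n$, we form the partial sums $S_n = X_{1,n} + \dots + X_{n,n}$.

- (i) (Weak law) If all the $X_{i,n}$ have mean $\mu$ and $\sup_{i,n} \mathbf{E} |X_{i,n}|^2 < \infty$, show that $S_n/n$ converges in probability to $\mu$.
- (ii) (Strong law) If all the $X_{i,n}$ have mean $\mu$ and $\sup_{i,n} \mathbf{E} |X_{i,n}|^4 < \infty$, show that $S_n/n$ converges almost surely to $\mu$.

Note that the weak and strong laws of large numbers established previously correspond to the case when the triangular array collapses to a single sequence $X_1, X_2, \dots$ of iid random variables.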

We now illustrate the use of the moment method and the law of large numbers on two important examples of random structures, namely random graphs and random matrices.

Exercise 13 For a natural number $n$ and a parameter $0 \leq p \leq 1$, define an Erdös-Renyi graph on $n$ vertices with parameter $p$ to be a random graph $(V,E)$ on a (deterministic) vertex set $V$ of $n$ vertices (thus $(V,E)$ is a random variable taking values in the discrete space of all $2^{\binom{n}{2}}$ possible graphs one can place on $V$) such that the events $\{i,j\} \in E$ for unordered pairs $\{i,j\}$ in $V$ are jointly independent and each occur with probability $p$. For each $n$, let $(V_n,E_n)$ be an Erdös-Renyi graph on $n$ vertices with parameter $p$ (we do not require the graphs to be independent of each other).

- (i) If $|E_n|$ is the number of edges in $(V_n,E_n)$, show that $|E_n| / \binom{n}{2}$ converges almost surely to $p$. (Hint: use Exercise 12.)
- (ii) If $|T_n|$ is the number of triangles in $(V_n,E_n)$ (i.e. the set of unordered triples $\{i,j,k\}$ in $V_n$ such that $\{i,j\}, \{i,k\}, \{j,k\} \in E_n$), show that $|T_n| / \binom{n}{3}$ converges in probability to $p^3$. (Note: there is not quite enough joint independence here to directly apply the law of large numbers, however the second moment method still works nicely.)
- (iii) Show in fact that $|T_n| / \binom{n}{3}$ converges almost surely to $p^3$. (Note: in contrast with the situation with the strong law of large numbers, the fourth moment does not need to be computed here.)
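Before attempting the exercise, it can be helpful to simulate a single such graph and look at the edge and triangle densities. The sketch below is only an experiment and not a solution: it assumes the natural normalisations (edges divided by the number of pairs, triangles divided by the number of triples), and the function names, the seed and the parameters $n = 60$, $p = 1/2$ are arbitrary.

```python
import random
from itertools import combinations
from math import comb

def erdos_renyi(n, p, rng):
    """Sample the edge set of a random graph on vertices 0..n-1 in which each
    unordered pair is included independently with probability p."""
    return {pair for pair in combinations(range(n), 2) if rng.random() < p}

def triangle_count(n, edges):
    """Count unordered triples of vertices whose three pairs are all edges."""
    return sum(
        1
        for i, j, k in combinations(range(n), 3)
        if (i, j) in edges and (i, k) in edges and (j, k) in edges
    )

rng = random.Random(0)  # fixed seed so the run is reproducible
n, p = 60, 0.5
edges = erdos_renyi(n, p, rng)
print(len(edges) / comb(n, 2), p)                   # edge density against p
print(triangle_count(n, edges) / comb(n, 3), p**3)  # triangle density against p^3
```

The triangle indicators share edges and so are not jointly independent, which is why the triangle density fluctuates somewhat more than a naive count of independent trials would suggest.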

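Exercise 14 For each $n$, let $A_n = (a_{ij,n})_{1 \leq i,j \leq n}$ be a random $n \times n$ matrix (i.e. a random variable taking values in the space $\mathbf{R}^{n \times n}$ or $\mathbf{C}^{n \times n}$ of $n \times n$ matrices) such that the entries $a_{ij,n}$ of $A_n$ are jointly independent in $i,j$ and take values in $\{-1,+1\}$ with a probability of $1/2$ each. (Such matrices are known as *random sign matrices*.) We do not assume any independence for the sequence $A_1, A_2, \dots$.

- (i) Show that the random variables $\mathrm{tr}( A_n A_n^* ) / n^2$ are deterministically equal to $1$, where $A_n^*$ denotes the adjoint (which, in this case, is also the transpose) of $A_n$ and $\mathrm{tr}$ denotes the trace (sum of the diagonal entries) of a matrix.
- (ii) Show that for any natural number $k$, the quantities $\mathbf{E}\, \mathrm{tr}( (A_n A_n^*)^k ) / n^{k+1}$ are bounded uniformly in $n$ (i.e. they are bounded by a quantity $C_k$ that can depend on $k$ but not on $n$). (You may wish to first work with simple cases like $k=2$ or $k=3$ to gain intuition.)
- (iii) If $\|A_n\|_{op}$ denotes the operator norm of $A_n$, and $\varepsilon > 0$, show that $\|A_n\|_{op} / n^{1/2+\varepsilon}$ converges almost surely to zero, and that $\|A_n\|_{op} / n^{1/2-\varepsilon}$ diverges almost surely to infinity. (Hint: use the spectral theorem to relate $\|A_n\|_{op}$ with the quantities $\mathrm{tr}( (A_n A_n^*)^k )$.)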

One can obtain much sharper information on quantities such as the operator norm of a random matrix; see this previous blog post for further discussion.

Exercise 15 The Cramér random model for the primes is a random subset $\mathcal{P}$ of the natural numbers with $1 \not\in \mathcal{P}$, $2 \in \mathcal{P}$, and the events $n \in \mathcal{P}$ for $n = 3, 4, \dots$ being jointly independent with $\mathbf{P}( n \in \mathcal{P} ) = \frac{1}{\log n}$ (the restriction to $n \geq 3$ is to ensure that $\frac{1}{\log n}$ is less than $1$). It is a simple, yet reasonably convincing, probabilistic model for the primes $\{2,3,5,7,\dots\}$, which can be used to provide heuristic confirmations for many conjectures in analytic number theory. (It can be refined to give what are believed to be more accurate predictions; see this previous blog post for further discussion.)

- (i) (Probabilistic prime number theorem) Prove that almost surely, the quantity $\frac{1}{x/\log x} |\{ n \leq x: n \in \mathcal{P} \}|$ converges to one as $x \to \infty$.
- (ii) (Probabilistic Riemann hypothesis) Show that if $\varepsilon > 0$, then the quantity $$ \frac{1}{x^{1/2+\varepsilon}} \left( |\{ n \leq x: n \in \mathcal{P} \}| - \int_2^x \frac{dt}{\log t} \right) $$ converges almost surely to zero as $x \to \infty$.
- (iii) (Probabilistic twin prime conjecture) Show that almost surely, there are an infinite number of elements $p$ of $\mathcal{P}$ such that $p+2$ also lies in $\mathcal{P}$.
- (iv) (Probabilistic Goldbach conjecture) Show that almost surely, all but finitely many natural numbers $n$ are expressible as the sum of two elements of $\mathcal{P}$.

Probabilistic methods are not only useful for getting *heuristic* predictions about the primes; they can also give *rigorous* results about the primes. We give one basic example, namely a probabilistic proof (due to Turán) of a theorem of Hardy and Ramanujan, which roughly speaking asserts that a typical large number $n$ has about $\log\log n$ distinct prime factors.

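Exercise 16 (Hardy-Ramanujan theorem) Let $x \geq 100$ be a natural number (so in particular $\log\log x \geq 1$), and let $n$ be a natural number drawn uniformly at random from $1$ to $x$. Assume Mertens’ theorem $$ \sum_{p \leq y} \frac{1}{p} = \log\log y + O(1) $$ for all $y \geq 100$, where the sum is over primes up to $y$.

- (i) Show that the random variable $\sum_{p \leq x^{1/10}} 1_{p|n}$ (where $1_{p|n}$ is $1$ when $p$ divides $n$ and $0$ otherwise, and the sum is over primes up to $x^{1/10}$) has mean $\log\log x + O(1)$ and variance $O(\log\log x)$. (Hint: compute (up to reasonable error) the means, variances and covariances of the random variables $1_{p|n}$.)
- (ii) If $\omega(n)$ denotes the number of distinct prime factors of $n$, show that $\frac{\omega(n)}{\log\log n}$ converges to $1$ in probability as $x \to \infty$. (Hint: first show that $\omega(n) = \sum_{p \leq x^{1/10}} 1_{p|n} + O(1)$.) More precisely, show that $$ \frac{\omega(n) - \log\log n}{g(n) \sqrt{\log\log n}} $$ converges in probability to zero, whenever $g: \mathbf{N} \to \mathbf{R}$ is any function such that $g(n)$ goes to infinity as $n \to \infty$.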
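One can look at the phenomenon numerically by tabulating the number of distinct prime factors with a sieve and comparing its empirical mean and variance over $1 \leq n \leq x$ with $\log\log x$. This is a sketch for experimentation only: the function name `omega_table` and the cutoff $x = 10^5$ are arbitrary, and since $\log\log x$ grows extremely slowly the agreement is only up to a bounded error.

```python
from math import log

def omega_table(x):
    """Return a list omega with omega[n] = number of distinct prime factors of n,
    for 0 <= n <= x, computed with a sieve."""
    omega = [0] * (x + 1)
    for p in range(2, x + 1):
        if omega[p] == 0:  # no smaller prime divides p, so p is prime
            for multiple in range(p, x + 1, p):
                omega[multiple] += 1
    return omega

x = 10**5
omega = omega_table(x)
loglog = log(log(x))
mean = sum(omega[1:]) / x
variance = sum((w - mean) ** 2 for w in omega[1:]) / x
print(loglog, mean, variance)
```

The empirical mean differs from $\log\log x$ by a bounded amount, and the variance is of comparable or smaller size, which is the input needed for a Chebyshev-type concentration argument.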

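Exercise 17 (Shannon entropy) Let $A$ be a finite non-empty set of some cardinality $|A|$, and let $X$ be a random variable taking values in $A$. Define the Shannon entropy $\mathbf{H}(X)$ to be the quantity $$ \mathbf{H}(X) := \sum_{x \in A} \mathbf{P}(X = x) \log \frac{1}{\mathbf{P}(X = x)} $$ with the convention that $0 \log \frac{1}{0} = 0$. (In some texts, the logarithm to base $2$ is used instead of the natural logarithm $\log$.)

- (i) Show that $0 \leq \mathbf{H}(X) \leq \log |A|$. (Hint: use Jensen’s inequality.) Determine when the equality holds.
- (ii) Let $\varepsilon > 0$ and $n$ be a natural number. Let $X_1,\dots,X_n$ be $n$ iid copies of $X$, thus $\vec X := (X_1,\dots,X_n)$ is a random variable taking values in $A^n$, and the distribution $\mu_{\vec X}$ is a probability measure on $A^n$. Let $\Omega \subset A^n$ denote the set $$ \Omega := \{ \vec x \in A^n: \exp( -(1+\varepsilon) n \mathbf{H}(X) ) \leq \mu_{\vec X}( \{ \vec x \} ) \leq \exp( -(1-\varepsilon) n \mathbf{H}(X) ) \}. $$ Show that if $n$ is sufficiently large, then $$ \mathbf{P}( \vec X \in \Omega ) \geq 1 - \varepsilon $$ and $$ \exp( (1-2\varepsilon) n \mathbf{H}(X) ) \leq |\Omega| \leq \exp( (1+2\varepsilon) n \mathbf{H}(X) ). $$ (Hint: use the weak law of large numbers to understand the number of times each element of $A$ occurs in $\vec X$.)

Thus, roughly speaking, while $\vec X$ in principle takes values in all of $A^n$, in practice it is concentrated in a set of size about $\exp( n \mathbf{H}(X) )$, and is roughly uniformly distributed on that set. This is the beginning of the microstate interpretation of Shannon entropy, but we will not develop the theory of Shannon entropy further in this course.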
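The link between entropy and the probability of a typical sequence can be seen in a short computation. The sketch below assumes the entropy is defined with the natural logarithm and the convention that zero-probability symbols contribute nothing; the function names and the three-symbol distribution are hypothetical choices for illustration.

```python
from math import log

def shannon_entropy(probs):
    """Entropy sum of p log(1/p) with natural logarithm; terms with p = 0 are skipped."""
    return sum(p * log(1 / p) for p in probs if p > 0)

def log_prob_of_sequence(probs, counts):
    """Logarithm of the probability that n iid draws produce one specific sequence
    in which symbol i occurs counts[i] times."""
    return sum(c * log(p) for p, c in zip(probs, counts) if c > 0)

probs = [0.5, 0.25, 0.25]  # hypothetical distribution on a three-element set
n = 1000
typical_counts = [round(n * p) for p in probs]  # the counts the weak law predicts
print(shannon_entropy(probs))
print(-log_prob_of_sequence(probs, typical_counts) / n)
```

A sequence in which each symbol appears with its expected frequency has probability exactly $\exp(-n \mathbf{H}(X))$, and the weak law of large numbers says that most sequences have frequencies close to these.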

** — 2. Truncation — **

The weak and strong laws of large numbers have been proven under additional moment assumptions (of finite second moment and finite fourth moment respectively). To remove these assumptions, we use the simple but effective *truncation method*, decomposing a general scalar random variable $X$ into a truncated component such as $X 1_{|X| \leq M}$ for some suitable threshold $M$, and a tail $X 1_{|X| > M}$. The main term $X 1_{|X| \leq M}$ is bounded and thus has all moments finite. The tail $X 1_{|X| > M}$ will not have much better moment properties than the original variable $X$, but one can still hope to make it “small” in various ways. There is a tradeoff regarding the selection of the truncation parameter $M$: if $M$ is too large, then the truncated component has poor estimates, but if $M$ is too small then the tail causes trouble.

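Let’s first see how this works with the weak law of large numbers. As before, we assume we have an iid sequence $X_1, X_2, \dots$ of copies of a real absolutely integrable $X$ with no additional finite moment hypotheses. We write $S_n = X_1 + \dots + X_n$ and $\mu := \mathbf{E} X$. At present, we cannot take variances of the $X_i$ or $S_n$ and so the second moment method is not directly available. But we now perform a truncation; it turns out that a good choice of threshold here is $n$, thus we write $X_i = X_{i,\leq n} + X_{i,>n}$ where $X_{i,\leq n} := X_i 1_{|X_i| \leq n}$ and $X_{i,>n} := X_i 1_{|X_i| > n}$, and then similarly decompose $S_n = S_{n,\leq} + S_{n,>}$ where

$$ S_{n,\leq} := X_{1,\leq n} + \dots + X_{n,\leq n} $$

and

$$ S_{n,>} := X_{1,>n} + \dots + X_{n,>n}. $$

We wish to show that

$$ \mathbf{P}\left( \left|\frac{S_n}{n} - \mu\right| \geq \varepsilon \right) \to 0 \quad (5) $$

as $n \to \infty$ for any given $\varepsilon > 0$ (not depending on $n$). By the triangle inequality, we can split

$$ \mathbf{P}\left( \left|\frac{S_n}{n} - \mu\right| \geq \varepsilon \right) \leq \mathbf{P}\left( \left|\frac{S_{n,\leq}}{n} - \mu\right| \geq \frac{\varepsilon}{2} \right) + \mathbf{P}\left( \left|\frac{S_{n,>}}{n}\right| \geq \frac{\varepsilon}{2} \right). $$

We begin by studying $S_{n,\leq}$.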

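The random variables $X_{1,\leq n},\dots,X_{n,\leq n}$ are iid with mean $\mu_{\leq n} := \mathbf{E} X 1_{|X| \leq n}$ and variance at most

$$ \mathbf{E} |X|^2 1_{|X| \leq n}. $$

Thus, $S_{n,\leq}/n$ has mean $\mu_{\leq n}$ and variance at most $\frac{1}{n} \mathbf{E} |X|^2 1_{|X| \leq n}$. By dominated convergence, $\mu_{\leq n} \to \mu$ as $n \to \infty$, so for sufficiently large $n$ we can bound

$$ |\mu_{\leq n} - \mu| \leq \frac{\varepsilon}{4} $$

and hence by the Chebyshev inequality

$$ \mathbf{P}\left( \left|\frac{S_{n,\leq}}{n} - \mu\right| \geq \frac{\varepsilon}{2} \right) \leq \mathbf{P}\left( \left|\frac{S_{n,\leq}}{n} - \mu_{\leq n}\right| \geq \frac{\varepsilon}{4} \right) \leq \frac{16}{\varepsilon^2} \mathbf{E} \frac{|X|^2}{n} 1_{|X| \leq n}. $$

Observe that $\frac{|X|^2}{n} 1_{|X| \leq n}$ is bounded surely by the absolutely integrable $|X|$, and goes to zero as $n \to \infty$, so by dominated convergence we conclude that

$$ \mathbf{P}\left( \left|\frac{S_{n,\leq}}{n} - \mu\right| \geq \frac{\varepsilon}{2} \right) \to 0 $$

as $n \to \infty$ (keeping $\varepsilon$ fixed).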

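To handle $S_{n,>}$, we observe that each $X_{i,>n}$ is only non-zero with probability $\mathbf{P}( |X| > n )$, and hence by subadditivity

$$ \mathbf{P}\left( \left|\frac{S_{n,>}}{n}\right| \geq \frac{\varepsilon}{2} \right) \leq \mathbf{P}( S_{n,>} \neq 0 ) \leq n \mathbf{P}( |X| > n ) \leq \mathbf{E} |X| 1_{|X| > n}. $$

By dominated convergence again, $\mathbf{E} |X| 1_{|X| > n} \to 0$ as $n \to \infty$, and thus

$$ \mathbf{P}\left( \left|\frac{S_{n,>}}{n}\right| \geq \frac{\varepsilon}{2} \right) \to 0. $$

Putting all this together, we conclude (5) as required. This concludes the proof of the weak law of large numbers (in the iid case) for arbitrary absolutely integrable $X$. For future reference, we observe that the above arguments give the bound

$$ \mathbf{P}\left( \left|\frac{S_n}{n} - \mu\right| \geq \varepsilon \right) \leq \frac{16}{\varepsilon^2} \mathbf{E} \frac{|X|^2}{n} 1_{|X| \leq n} + n \mathbf{P}( |X| > n ) \quad (6) $$

whenever $n$ is sufficiently large depending on $\varepsilon$.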

Due to the reliance on the dominated convergence theorem, the above argument does not provide any uniform rate of decay in (5). Indeed there is no such uniform rate. Consider for instance the sum $S_n = X_1 + \dots + X_n$ where $X_1, \dots, X_n$ are iid random variables that equal $n$ with probability $1/n$ and $0$ with probability $1 - 1/n$. Then the $X_i$ are unsigned and all have mean $1$, but $S_n$ vanishes with probability $(1-1/n)^n$, which converges to $1/e$ as $n \rightarrow \infty$. Thus we see that the probability that $S_n/n$ stays a distance $1$ from the mean value of $1$ is bounded away from zero. This is not inconsistent with the weak law of large numbers, because the underlying random variable $X$ depends on $n$ in this example. However, it rules out an estimate of the form

$$ {\bf P}( |S_n/n - {\bf E} X| \geq \varepsilon ) \leq c_{\varepsilon,M}(n) $$

that holds uniformly whenever $X$ obeys the bound ${\bf E} |X| \leq M$, and $c_{\varepsilon,M}(n)$ is a quantity that goes to zero as $n \rightarrow \infty$ for a fixed choice of $\varepsilon, M$. (Contrast this with (2), which does provide such a uniform bound if one also assumes a bound on the second moment ${\bf E} |X|^2$.)
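The counterexample can be checked numerically. The sketch below assumes the natural reading of the example, namely that each $X_i$ equals $n$ with probability $1/n$ and $0$ otherwise, so that $S_n$ vanishes exactly when all $n$ summands vanish; the function name is hypothetical.

```python
import math


def prob_sum_vanishes(n):
    """P(S_n = 0) when X_1..X_n are iid, equal to n with probability 1/n and 0 otherwise."""
    return (1 - 1 / n) ** n


if __name__ == "__main__":
    for n in (10, 100, 10_000, 1_000_000):
        # The probabilities approach exp(-1), so they do not decay to zero.
        print(n, prob_sum_vanishes(n), math.exp(-1))
```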

One can ask what happens to the $S_n/n$ when the underlying random variable $X$ is not absolutely integrable. In the unsigned case, we have

Exercise 18 Let $X_1, X_2, \dots$ be iid copies of an unsigned random variable $X$ with infinite mean, and write $S_n := X_1 + \dots + X_n$. Show that $S_n/n$ diverges to infinity in probability, in the sense that ${\bf P}( S_n/n \geq M ) \rightarrow 1$ as $n \rightarrow \infty$ for any fixed $M < \infty$.

The above exercise shows that $S_n$ grows faster than $n$ (in probability, at least) when $X$ is unsigned with infinite mean, but this does not completely settle the question of the precise rate at which $S_n$ does grow. We will not answer this question in full generality here, but content ourselves with analysing a classic example of the unsigned infinite mean setting, namely the Saint Petersburg paradox. The paradox can be formulated as follows. Suppose we have a lottery whose payout $X$ takes values in the powers of two $2, 2^2, 2^3, \dots$ with

$$ {\bf P}( X = 2^i ) = 2^{-i} $$

for $i = 1, 2, \dots$. The question is to work out what is the “fair” or “breakeven” price to pay for a lottery ticket. If one plays this lottery $n$ times, the total payout is $S_n = X_1 + \dots + X_n$ where $X_1, X_2, \dots$ are independent copies of $X$, so the question boils down to asking what one expects the value of $S_n/n$ to be. If $X$ were absolutely integrable, the strong or weak law of large numbers would indicate that ${\bf E} X$ is the fair price to pay, but in this case we have

$$ {\bf E} X = \sum_{i=1}^\infty 2^i \times 2^{-i} = \sum_{i=1}^\infty 1 = \infty. $$

This suggests, paradoxically, that any finite price for this lottery, no matter how high, would be a bargain!

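To clarify this paradox, we need to get a better understanding of the random variable $S_n/n$. For a given $n$, we let $M$ be a truncation parameter to be chosen later, and split $X_i = X_{i,\leq M} + X_{i,>M}$ where $X_{i,\leq M} := X_i 1_{X_i \leq M}$ and $X_{i,>M} := X_i 1_{X_i > M}$ as before (we no longer need the absolute value signs here as all random variables are unsigned). Since the $X_i$ take values in powers of two, we may as well also set $M = 2^m$ to be a power of two. We split $S_n = S_{n,\leq} + S_{n,>}$ where

$$ S_{n,\leq} := X_{1,\leq M} + \dots + X_{n,\leq M} $$

and

$$ S_{n,>} := X_{1,>M} + \dots + X_{n,>M}. $$

The random variable $X_{i,\leq M}$ can be computed to have mean

$$ {\bf E} X_{i,\leq M} = \sum_{j=1}^m 2^j \times 2^{-j} = m $$

and we can upper bound the variance by

$$ {\bf Var}( X_{i,\leq M} ) \leq {\bf E} X_{i,\leq M}^2 = \sum_{j=1}^m 2^{2j} \times 2^{-j} \leq 2^{m+1} $$

and hence $S_{n,\leq}/n$ has mean $m$ and variance at most $2^{m+1}/n$. By Chebyshev’s inequality, we thus have

$$ {\bf P}( |S_{n,\leq}/n - m| \geq \lambda ) \leq \frac{2^{m+1}}{n \lambda^2} $$

for any $\lambda > 0$.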

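Now we turn to $S_{n,>}$. We cannot use the first or second moment methods here because the $X_{i,>M}$ are not absolutely integrable. However, we can instead use the following “zeroth moment method” argument. Observe that the random variable $X_{i,>M}$ is only nonzero with probability $2^{-m}$ (that is to say, the “zeroth moment” ${\bf E} X_{i,>M}^0$ is $2^{-m}$, using the convention $0^0 = 0$). Thus $S_{n,>}$ is nonzero with probability at most $n 2^{-m}$. We conclude that

$$ {\bf P}( |S_n/n - m| \geq \lambda ) \leq \frac{2^{m+1}}{n \lambda^2} + n 2^{-m}. $$

This bound is valid for any natural number $m$ and any $\lambda > 0$. Of course for this bound to be useful, we want to select parameters so that the right-hand side is somewhat small. If we pick for instance $m$ to be the integer part of $\log_2 n + \frac{1}{2} \log_2 \log_2 n$, and $\lambda$ to be $\sqrt{\log_2 n}$, we see that

$$ {\bf P}( |S_n/n - \log_2 n - \tfrac{1}{2} \log_2 \log_2 n + O(1)| \geq \sqrt{\log_2 n} ) = O( \log_2^{-1/2} n ) $$

which (for large $n$) implies that

$$ {\bf P}\left( \left|\frac{S_n}{n \log_2 n} - 1\right| \geq \frac{2}{\sqrt{\log_2 n}} \right) = O( \log_2^{-1/2} n ). $$

In particular, we see that $\frac{S_n}{n \log_2 n}$ converges in probability to one. This suggests that the fair price to pay for the Saint Petersburg lottery is a function of the number $n$ of tickets one wishes to play, and should be approximately $\log_2 n$ when $n$ is large. In particular, the lottery is indeed worth paying out any finite cost $M$, but one needs to buy about $2^M$ tickets before one breaks even!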
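The truncated moments and the resulting deviation bound can be computed exactly. The following Python sketch assumes the payout law ${\bf P}(X = 2^i) = 2^{-i}$ and the bound in the form of a Chebyshev term $2^{m+1}/(n\lambda^2)$ plus a zeroth-moment term $n 2^{-m}$; the function names are hypothetical.

```python
import math
from fractions import Fraction


def truncated_mean(m):
    """E[X 1_{X <= 2^m}] when P(X = 2^i) = 2^-i for i = 1, 2, ..."""
    return sum(Fraction(2**i) * Fraction(1, 2**i) for i in range(1, m + 1))


def truncated_second_moment(m):
    """E[X^2 1_{X <= 2^m}], which bounds the variance of the truncated payout."""
    return sum(Fraction(2**i) ** 2 * Fraction(1, 2**i) for i in range(1, m + 1))


def deviation_bound(n, m, lam):
    """Chebyshev term 2^(m+1) / (n lam^2) plus zeroth-moment term n 2^-m."""
    return 2 ** (m + 1) / (n * lam**2) + n * 2.0 ** (-m)


def suggested_parameters(n):
    """m = integer part of log2 n + (1/2) log2 log2 n, and lam = sqrt(log2 n)."""
    log2n = math.log2(n)
    return math.floor(log2n + 0.5 * math.log2(log2n)), math.sqrt(log2n)


if __name__ == "__main__":
    for k in (10, 20, 40, 80):
        n = 2**k
        m, lam = suggested_parameters(n)
        print(k, m, deviation_bound(n, m, lam))
```

The printed bounds shrink only like $\log_2^{-1/2} n$, which is why one needs an enormous number of tickets before the average payout is reliably close to $\log_2 n$.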

In contrast to the absolutely integrable case, in which the weak law can be upgraded to the strong law, there is no strong law for the Saint Petersburg paradox:

Exercise 19 With the notation as in the above analysis of the Saint Petersburg paradox, show that $\frac{S_n}{n \log_2 n}$ is almost surely unbounded. (Hint: it suffices to show that $\frac{X_n}{n \log_2 n}$ is almost surely unbounded. For this, use the second Borel-Cantelli lemma.)

The following exercise can be viewed as a continuous analogue of the Saint Petersburg paradox.

Exercise 20 A real random variable $X$ is said to have a standard Cauchy distribution if it has the probability density function $x \mapsto \frac{1}{\pi} \frac{1}{1+x^2}$.

- (i) Verify that standard Cauchy distributions exist (this boils down to checking that the integral of the probability density function is $1$).
- (ii) Show that a real random variable with the standard Cauchy distribution is not absolutely integrable.
- (iii) If $X_1, X_2, \dots$ are iid copies of a random variable $X$ with the standard Cauchy distribution, show that $\frac{|X_1| + \dots + |X_n|}{n \log n}$ converges in probability to $\frac{2}{\pi}$ but is almost surely unbounded.

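Exercise 21 (Weak law of large numbers for triangular arrays) Let $(X_{i,n})_{i,n \in {\bf N}: i \leq n}$ be a triangular array of random variables, with the variables $X_{1,n}, \dots, X_{n,n}$ jointly independent for each $n$. Let $M_n$ be a sequence going to infinity, and write $X_{i,n,\leq} := X_{i,n} 1_{|X_{i,n}| \leq M_n}$ and $\mu_n := \sum_{i=1}^n {\bf E} X_{i,n,\leq}$. Assume that

$$ \sum_{i=1}^n {\bf P}( |X_{i,n}| > M_n ) \rightarrow 0 $$

and

$$ \frac{1}{M_n^2} \sum_{i=1}^n {\bf E} |X_{i,n,\leq}|^2 \rightarrow 0 $$

as $n \rightarrow \infty$. Show that

$$ \frac{X_{1,n} + \dots + X_{n,n} - \mu_n}{M_n} \rightarrow 0 $$

in probability.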

Now we turn to establishing the strong law of large numbers in full generality. A first attempt would be to apply the Borel-Cantelli lemma to the bound (6). However, the decay rates for quantities such as are far too weak to be absolutely summable, in large part due to the reliance on the dominated convergence theorem. To get around this we follow some arguments of Etemadi. We first need to make a few preliminary reductions, aimed at “sparsifying” the set of times that one needs to control. It is here that we will genuinely use the fact that the averages are being drawn from a single sequence of random variables, rather than from a triangular array.

We turn to the details. In previous arguments it was convenient to normalise the underlying random variable $X$ to have mean zero. Here we will use a different reduction, namely to the case when $X$ is unsigned; the strong law for real absolutely integrable $X$ clearly follows from the unsigned case by expressing $X$ as the difference of two unsigned absolutely integrable variables $\max(X,0)$ and $\max(-X,0)$ (and splitting the $X_n$ similarly).

Henceforth we assume $X$ (and hence the $X_n$) to be unsigned. Crucially, this now implies that the partial sums $S_n$ are monotone: $S_n \leq S_{n+1}$. While this does not quite imply any monotonicity on the sequence $S_n/n$, it does make it significantly easier to show that it converges. The key point is as follows.

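Lemma 22 Let $0 \leq S_1 \leq S_2 \leq \dots$ be an increasing sequence, and let $\mu$ be a real number. Suppose that for any $c > 0$, the sequence $S_{n_j}/n_j$ converges to $\mu$ as $j \rightarrow \infty$, where $n_j := \lfloor (1+c)^j \rfloor$. Then $S_n/n$ converges to $\mu$.

*Proof:* Let $c > 0$. For any $n$, let $j$ be the index such that $n_j \leq n < n_{j+1}$. Then we have

$$ S_{n_j} \leq S_n \leq S_{n_{j+1}} $$

and (for sufficiently large $n$)

$$ n_{j+1} \leq (1+2c) n_j $$

and thus

$$ \frac{1}{1+2c} \frac{S_{n_j}}{n_j} \leq \frac{S_n}{n} \leq (1+2c) \frac{S_{n_{j+1}}}{n_{j+1}}. $$

Taking limit inferior and superior, we conclude that

$$ \frac{1}{1+2c} \mu \leq \liminf_{n \rightarrow \infty} \frac{S_n}{n} \leq \limsup_{n \rightarrow \infty} \frac{S_n}{n} \leq (1+2c) \mu $$

and then sending $c \rightarrow 0$ we obtain the claim.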
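The sandwich step in this proof can be sanity-checked on a concrete increasing sequence. The Python sketch below assumes the lacunary sequence $n_j = \lfloor (1+c)^j \rfloor$ and tests the two-sided inequality with the factor $1+2c$; the helper names and the test sequence $S_n = n + \sqrt{n}$ are hypothetical choices for illustration.

```python
import math


def lacunary(c, j_max):
    """n_j = floor((1 + c)^j) for j = 1, ..., j_max."""
    return [math.floor((1 + c) ** j) for j in range(1, j_max + 1)]


def sandwich_holds(S, n_seq, c, n):
    """Check S(n_j) / ((1+2c) n_j) <= S(n)/n <= (1+2c) S(n_{j+1}) / n_{j+1}
    for the index j with n_j <= n < n_{j+1}."""
    for lo, hi in zip(n_seq, n_seq[1:]):
        if lo <= n < hi:
            lower = S(lo) / ((1 + 2 * c) * lo)
            upper = (1 + 2 * c) * S(hi) / hi
            return lower <= S(n) / n <= upper
    raise ValueError("n is outside the range covered by n_seq")


if __name__ == "__main__":
    c = 0.1
    seq = lacunary(c, 120)
    S = lambda n: n + math.sqrt(n)  # an increasing, non-negative test sequence
    print(all(sandwich_holds(S, seq, c, n) for n in range(100, 5000)))
```

The check starts at $n = 100$ because the inequality $n_{j+1} \leq (1+2c) n_j$ is only claimed once $n$ is sufficiently large.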

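An inspection of the above argument shows that we only need to verify the hypothesis for a countable sequence of $c$ (e.g. $c = 1/m$ for natural number $m$). Thus, to show that $S_n/n$ converges to $\mu$ almost surely, it suffices to show that for any $c > 0$, one has $S_{n_j}/n_j \rightarrow \mu$ almost surely as $j \rightarrow \infty$.

Fix $c > 0$. The point is that the “lacunary” sequence $n_j$ is much sparser than the sequence of natural numbers $n$, and one now will lose a lot less from the Borel-Cantelli argument. Indeed, for any $\varepsilon > 0$, we can apply (6) to conclude that

$$ {\bf P}( |S_{n_j}/n_j - \mu| \geq \varepsilon ) \leq \frac{16}{\varepsilon^2} \frac{1}{n_j} {\bf E} |X|^2 1_{|X| \leq n_j} + n_j {\bf P}( |X| > n_j ) $$

whenever $j$ is sufficiently large depending on $c, \varepsilon$. Thus, by the Borel-Cantelli lemma, it will suffice to show that the sums

$$ \sum_{j=1}^\infty \frac{1}{n_j} {\bf E} |X|^2 1_{|X| \leq n_j} $$

and

$$ \sum_{j=1}^\infty n_j {\bf P}( |X| > n_j ) $$

are finite. Using the monotone convergence theorem to interchange the sum and expectation, it thus suffices to show the pointwise estimates

$$ \sum_{j=1}^\infty \frac{1}{n_j} |X|^2 1_{|X| \leq n_j} \leq C_c |X| $$

and

$$ \sum_{j=1}^\infty n_j 1_{|X| > n_j} \leq C_c |X| $$

for some $C_c$ depending only on $c$. But this follows from the geometric series formula (the first sum is over the elements of the sequence $n_j$ that are greater than or equal to $|X|$, while the latter is over those that are less than $|X|$). This proves the strong law of large numbers for arbitrary absolutely integrable iid $X_n$.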
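The two pointwise geometric-series estimates can be checked numerically for a few values of $x = |X|$. In the Python sketch below the constant $2(1+c)/c$ is one admissible choice worked out for this illustration (it is not taken from the text), the infinite sums are truncated at a large index, and the function names are hypothetical.

```python
import math


def lacunary(c, j_max):
    """n_j = floor((1 + c)^j) for j = 1, ..., j_max."""
    return [math.floor((1 + c) ** j) for j in range(1, j_max + 1)]


def head_sum(x, n_seq):
    """Sum of x^2 / n_j over those n_j with n_j >= x (truncated to the given n_seq)."""
    return sum(x * x / n for n in n_seq if n >= x)


def tail_sum(x, n_seq):
    """Sum of n_j over those n_j with n_j < x."""
    return sum(n for n in n_seq if n < x)


if __name__ == "__main__":
    c = 0.5
    seq = lacunary(c, 60)
    C = 2 * (1 + c) / c  # assumed constant for this sketch
    for x in (0.3, 1, 7.5, 100, 12345.6):
        print(x, head_sum(x, seq) / x, tail_sum(x, seq) / x, C)
```

Both ratios stay below the constant uniformly in $x$, which is what makes the two series summable after taking expectations.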

We remark that by carefully inspecting the above proof of the strong law of large numbers, we see that the hypothesis of joint independence of the $X_n$ can be relaxed to pairwise independence.

The next exercise shows how one can use the strong law of large numbers to approximate the cumulative distribution function of a random variable by an empirical cumulative distribution function.

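Exercise 23 Let $X_1, X_2, \dots$ be iid copies of a real random variable $X$.

- (i) Show that for every real number $t$, one has almost surely that

$$ \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \} | \rightarrow {\bf P}( X \leq t ) $$

and

$$ \frac{1}{n} | \{ 1 \leq i \leq n: X_i < t \} | \rightarrow {\bf P}( X < t ) $$

as $n \rightarrow \infty$.

- (ii) Establish the Glivenko-Cantelli theorem: almost surely, one has

$$ \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \} | \rightarrow {\bf P}( X \leq t ) $$

uniformly in $t$ as $n \rightarrow \infty$. (Hint: For any natural number $m$, let $f_m(t)$ denote the largest integer multiple of $1/m$ less than $t$. Show first that $f_m( \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \} | )$ is within $O(1/m)$ of $f_m( {\bf P}( X \leq t ) )$ for all $t$ when $n$ is sufficiently large.)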
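For readers who want to experiment with the empirical cumulative distribution function, here is a minimal Python sketch. The function names are hypothetical, and the formula used for the uniform distance assumes distinct samples and a continuous target distribution function, in which case the supremum over $t$ is attained at (or just before) a sample point.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return the function t -> (1/n) * #{i : X_i <= t} for the given samples."""
    xs = sorted(samples)
    n = len(xs)
    return lambda t: bisect_right(xs, t) / n


def sup_distance(samples, cdf):
    """sup_t |F_n(t) - F(t)| for distinct samples and a continuous cdf F."""
    xs = sorted(samples)
    n = len(xs)
    return max(
        max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
        for i, x in enumerate(xs)
    )


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    for n in (10, 100, 1000, 10000):
        data = [rng.random() for _ in range(n)]  # uniform on [0, 1), so F(t) = t
        print(n, sup_distance(data, lambda t: t))
```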

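Exercise 24 (Lack of strong law for triangular arrays) Let $X$ be a random variable taking values in the natural numbers with ${\bf P}( X = n ) = \frac{1}{\zeta(3)} \frac{1}{n^3}$, where $\zeta(3) := \sum_{n=1}^\infty \frac{1}{n^3}$ (this is an example of a zeta distribution).

- (i) Show that $X$ is absolutely integrable.
- (ii) Let $(X_{i,n})_{i,n \in {\bf N}: i \leq n}$ be jointly independent copies of $X$. Show that the random variables $\frac{X_{1,n} + \dots + X_{n,n}}{n}$ are almost surely unbounded. (Hint: for any constant $A$, show that $\frac{X_{1,n} + \dots + X_{n,n}}{n} > A$ occurs with probability at least $\varepsilon/n$ for some $\varepsilon > 0$ depending on $A$. Then use the second Borel-Cantelli lemma.)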

** — 3. The Kolmogorov maximal inequality — **

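Let $X_1, X_2, \dots$ be a sequence of jointly independent square-integrable real random variables of mean zero; we do not assume the $X_i$ to be identically distributed. As usual, we form the sums $S_n := X_1 + \dots + X_n$, then $S_n$ has mean zero and variance

$$ {\bf Var}(S_n) = {\bf E} S_n^2 = \sum_{i=1}^n {\bf Var}(X_i). \ \ \ \ \ (7) $$

From Chebyshev’s inequality, we thus have

$$ {\bf P}( |S_n| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^n {\bf Var}(X_i) $$

for any $0 < \lambda < \infty$ and natural number $n$. Perhaps surprisingly, we have the following improvement to this bound, known as the Kolmogorov maximal inequality:

Theorem 25 (Kolmogorov maximal inequality) With the notation and hypotheses as above, we have

$$ {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^n {\bf Var}(X_i). $$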

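*Proof:* For each $1 \leq i \leq n$, let $E_i$ be the event that $|S_i| \geq \lambda$, but that $|S_j| < \lambda$ for all $1 \leq j < i$. It is clear that the event $\sup_{1 \leq i \leq n} |S_i| \geq \lambda$ is the disjunction of the disjoint events $E_1, \dots, E_n$, thus

$$ {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq \lambda ) = \sum_{i=1}^n {\bf P}(E_i). $$

On the event $E_i$, we have $S_i^2 \geq \lambda^2$, and hence

$$ {\bf P}(E_i) \leq \frac{1}{\lambda^2} {\bf E} 1_{E_i} S_i^2. $$

We will shortly prove the inequality

$$ {\bf E} 1_{E_i} S_i^2 \leq {\bf E} 1_{E_i} S_n^2 \ \ \ \ \ (8) $$

for all $1 \leq i \leq n$. Assuming this inequality for the moment, we can put together all the above estimates, using the disjointness of the $E_i$, to conclude that

$$ {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq \lambda ) \leq \frac{1}{\lambda^2} {\bf E} S_n^2 $$

and the claim follows from (7).

It remains to prove (8). Since

$$ S_n^2 = S_i^2 + 2 S_i (S_n - S_i) + (S_n - S_i)^2 \geq S_i^2 + 2 S_i (S_n - S_i) $$

we have

$$ {\bf E} 1_{E_i} S_n^2 \geq {\bf E} 1_{E_i} S_i^2 + 2 {\bf E} 1_{E_i} S_i (S_n - S_i). $$

But note that the random variable $1_{E_i} S_i$ is completely determined by $X_1, \dots, X_i$, while $S_n - S_i$ is completely determined by the $X_{i+1}, \dots, X_n$. Thus $1_{E_i} S_i$ and $S_n - S_i$ are independent. Since $S_n - S_i$ also has mean zero, we have

$$ {\bf E} 1_{E_i} S_i (S_n - S_i) = 0 $$

and the claim (8) follows.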
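For a small sanity check of the maximal inequality, one can enumerate every path of a short random walk. The Python sketch below assumes iid steps equal to $\pm 1$ with probability $1/2$ each (so each step has mean zero and variance one); this specific choice and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import accumulate, product


def max_exceedance_probability(n, lam):
    """P(max_{1<=i<=n} |S_i| >= lam) for iid uniform +-1 steps,
    computed exactly by enumerating all 2^n sign patterns."""
    hits = sum(
        1
        for steps in product((-1, 1), repeat=n)
        if max(abs(s) for s in accumulate(steps)) >= lam
    )
    return Fraction(hits, 2**n)


def kolmogorov_bound(n, lam):
    """(1 / lam^2) * (sum of variances); each +-1 step has variance 1."""
    return Fraction(n, lam**2)


if __name__ == "__main__":
    n = 12
    for lam in range(2, 7):
        print(lam, max_exceedance_probability(n, lam), kolmogorov_bound(n, lam))
```

Note that for $n = 2$ and $\lambda = 2$ the two sides agree, so the inequality cannot be improved in general.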

An inspection of the above proof reveals that the key ingredient is the *lack of correlation between past and future* – a variable such as $1_{E_i} S_i$, which is determined by the portion of the sequence to the “past” (and present) of $i$, is uncorrelated with a variable such as $S_n - S_i$ that depends only on the “future” of $i$. One can formalise such a lack of correlation through the concept of a martingale, which will be covered in later courses in this sequence but which is beyond the scope of these notes. The use of the first time $i$ at which $|S_i|$ exceeds or attains the threshold $\lambda$ is a simple example of a stopping time, which will be a heavily used concept in the theory of martingales (and is also used extensively in harmonic analysis, which is also greatly interested in establishing maximal inequalities).

The Kolmogorov maximal inequality gives the following variant of the strong law of large numbers.

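Theorem 26 (Convergence of random series) Let $X_1, X_2, \dots$ be jointly independent square-integrable real random variables of mean zero with

$$ \sum_{i=1}^\infty {\bf Var}(X_i) < \infty. $$

Then the series $\sum_{i=1}^\infty X_i$ is almost surely convergent (i.e., the partial sums converge almost surely).

*Proof:* From the Kolmogorov maximal inequality and continuity from below we have

$$ {\bf P}( \sup_n |S_n| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^\infty {\bf Var}(X_i). \ \ \ \ \ (9) $$

This is already enough to show that the partial sums $S_n$ are almost surely bounded in $n$, but this isn’t quite enough to establish conditional convergence. To finish the job, we apply (9) with $X_1, X_2, \dots$ replaced by the shifted sequence $X_{N+1}, X_{N+2}, \dots$ for a natural number $N$ to conclude that

$$ {\bf P}( \sup_{n > N} |S_n - S_N| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=N+1}^\infty {\bf Var}(X_i). $$

Sending $N$ to infinity using continuity from above, we conclude that

$$ {\bf P}( \sup_{n > N} |S_n - S_N| \geq \lambda \hbox{ for all } N ) = 0 $$

for all $\lambda > 0$; applying this for all rational $\lambda > 0$, we conclude that $S_n$ is almost surely a Cauchy sequence, and the claim follows.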

We can use this result together with some elementary manipulation of sums to give the following alternative proof of the strong law of large numbers.

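Theorem 27 (Strong law of large numbers) Let $X_1, X_2, \dots$ be iid copies of an absolutely integrable variable $X$ of mean ${\bf E} X = \mu$, and let $S_n := X_1 + \dots + X_n$. Then $S_n/n$ converges almost surely to $\mu$.

*Proof:* We may normalise $X$ to be real valued with $\mu = 0$. Note that

$$ \sum_{n=1}^\infty {\bf P}( |X_n| > n ) = \sum_{n=1}^\infty {\bf P}( |X| > n ) \leq {\bf E} |X| < \infty $$

and hence by the Borel-Cantelli lemma we almost surely have $|X_n| \leq n$ for all but finitely many $n$. Thus if we write $X'_n := X_n 1_{|X_n| \leq n}$ and $S'_n := X'_1 + \dots + X'_n$, then the difference between $S_n/n$ and $S'_n/n$ almost surely goes to zero as $n \rightarrow \infty$. Thus it will suffice to show that $S'_n/n$ goes to zero almost surely.

The random variables $X'_n$ are still jointly independent, but are not quite mean zero. However, the normalised random variables $Y_n := \frac{X'_n - {\bf E} X'_n}{n}$ are of mean zero and

$$ \sum_{n=1}^\infty {\bf Var}(Y_n) \leq \sum_{n=1}^\infty \frac{{\bf E} |X|^2 1_{|X| \leq n}}{n^2} = {\bf E} \sum_{n \geq |X|} \frac{|X|^2}{n^2} = O( {\bf E} |X| ) < \infty $$

so by Theorem 26 we see that the sum $\sum_{n=1}^\infty Y_n$ is almost surely convergent.

Write $T_n := Y_1 + \dots + Y_n$, thus the sequence $T_n$ is almost surely a Cauchy sequence. From the identity

$$ \frac{1}{n} \sum_{i=1}^n i Y_i = T_n - \frac{1}{n} \sum_{i=1}^{n-1} T_i $$

we conclude that the sequence $\frac{1}{n} \sum_{i=1}^n i Y_i$ almost surely converges to zero, that is to say

$$ \frac{S'_n - {\bf E} S'_n}{n} \rightarrow 0 $$

almost surely. On the other hand, we have

$$ {\bf E} X'_n = {\bf E} X 1_{|X| \leq n} $$

and hence by dominated convergence

$$ {\bf E} X'_n \rightarrow {\bf E} X = 0 $$

as $n \rightarrow \infty$, which implies that

$$ \frac{{\bf E} S'_n}{n} \rightarrow 0 $$

as $n \rightarrow \infty$. By the triangle inequality, we conclude that $S'_n/n \rightarrow 0$ almost surely, as required.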
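The summation-by-parts identity behind this step, $\frac{1}{n} \sum_{i \leq n} i Y_i = T_n - \frac{1}{n} \sum_{i < n} T_i$ with $T_i := Y_1 + \dots + Y_i$, can be verified in exact arithmetic. The Python sketch below uses hypothetical function names and a deterministic test sequence chosen only for illustration.

```python
from fractions import Fraction


def weighted_average(Y):
    """(1/n) * sum_{i=1}^n i * Y_i for Y = [Y_1, ..., Y_n]."""
    n = len(Y)
    return sum((i + 1) * y for i, y in enumerate(Y)) / n


def check_identity(Y):
    """Verify (1/n) sum_{i<=n} i Y_i == T_n - (1/n) sum_{i<n} T_i exactly."""
    n = len(Y)
    partial_sums, running = [], Fraction(0)
    for y in Y:
        running += y
        partial_sums.append(running)
    rhs = partial_sums[-1] - sum(partial_sums[:-1], Fraction(0)) / n
    return weighted_average(Y) == rhs


if __name__ == "__main__":
    # Y_i = (-1)^i / i gives a convergent series, and the weighted averages tend to 0.
    Y = [Fraction((-1) ** i, i) for i in range(1, 50)]
    print(check_identity(Y), weighted_average(Y))
```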

Exercise 28 (Kronecker lemma) Let $\sum_{i=1}^\infty Y_i$ be a convergent series of real numbers $Y_i$, and let $0 < b_1 \leq b_2 \leq \dots$ be a sequence tending to infinity. Show that $\frac{1}{b_n} \sum_{i=1}^n b_i Y_i$ converges to zero as $n \rightarrow \infty$. (This is known as Kronecker’s lemma; the special case $b_i = i$ was implicitly used in the above argument.)

Exercise 29 (Kolmogorov three-series theorem, one direction) Let $X_1, X_2, \dots$ be a sequence of jointly independent real random variables, and let $Y_i := X_i 1_{|X_i| \leq 1}$. Suppose that the two series $\sum_{i=1}^\infty {\bf P}( |X_i| > 1 )$ and $\sum_{i=1}^\infty {\bf Var}(Y_i)$ are absolutely convergent, and the third series $\sum_{i=1}^\infty {\bf E} Y_i$ is convergent. Show that the series $\sum_{i=1}^\infty X_i$ is almost surely convergent. (The converse claim is also true, and will be discussed in later notes; the two claims are known collectively as the Kolmogorov three-series theorem.)

One advantage that the maximal inequality approach to the strong law of large numbers has over the moment method approach is that it tends to offer superior bounds on the (almost sure) *rate* of convergence. We content ourselves with just one example of this:

Exercise 30 (Cheap law of iterated logarithm) Let $X_1, X_2, \dots$ be a sequence of jointly independent real random variables of mean zero and bounded variance (thus $\sup_n {\bf E} |X_n|^2 < \infty$). Write $S_n := X_1 + \dots + X_n$. Show that $\frac{S_n}{n^{1/2} \log^{1/2+\varepsilon} n}$ converges almost surely to zero as $n \rightarrow \infty$ for any given $\varepsilon > 0$. (Hint: use Theorem 26 and the Kronecker lemma for a suitable weighted sum of the $X_n$.) There is a more precise version of this fact known as the law of the iterated logarithm, which is beyond the scope of these notes.

The exercises below will be moved to a more appropriate location later, but are currently placed here in order to not disrupt existing numbering.

Exercise 31 Let $X_1, X_2, \dots$ be iid copies of an absolutely integrable random variable $X$ with mean $\mu$. Show that the averages $\frac{S_n}{n} = \frac{X_1 + \dots + X_n}{n}$ converge in $L^1$ to $\mu$, that is to say that

$$ {\bf E} \left| \frac{S_n}{n} - \mu \right| \rightarrow 0 $$

as $n \rightarrow \infty$.

Exercise 32 A scalar random variable $X$ is said to be in weak $L^1$ if one has

$$ \sup_{t > 0} t {\bf P}( |X| \geq t ) < \infty. $$

Thus Markov’s inequality tells us that every absolutely integrable random variable is in weak $L^1$, but the converse is not true (e.g. random variables with the Cauchy distribution are weak $L^1$ but not absolutely integrable). Show that if $X_1, X_2, \dots$ are iid copies of an unsigned weak $L^1$ random variable, then there exist quantities $a_n \rightarrow \infty$ such that $S_n/a_n$ converges in probability to $1$, where $S_n = X_1 + \dots + X_n$. (Thus: there is a weak law of large numbers for weak $L^1$ random variables, and a strong law for strong $L^1$ (i.e. absolutely integrable) random variables.)


** — 1. Product measures — **

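It is intuitively obvious that Lebesgue measure $m^2$ on ${\bf R}^2$ ought to be related to Lebesgue measure $m$ on ${\bf R}$ by the relationship

$$ m^2( E_1 \times E_2 ) = m(E_1) m(E_2) \ \ \ \ \ (1) $$

for any Borel sets $E_1, E_2 \subset {\bf R}$. This is in fact true (see Exercise 4 below), and is part of a more general phenomenon, which we phrase here in the case of probability measures:

Theorem 1 (Product of two probability spaces) Let $(\Omega_1, {\mathcal F}_1, \mu_1)$ and $(\Omega_2, {\mathcal F}_2, \mu_2)$ be probability spaces. Then there is a unique probability measure $\mu_1 \times \mu_2$ on $(\Omega_1 \times \Omega_2, {\mathcal F}_1 \times {\mathcal F}_2)$ with the property that

$$ (\mu_1 \times \mu_2)( E_1 \times E_2 ) = \mu_1(E_1) \mu_2(E_2) \ \ \ \ \ (2) $$

for all $E_1 \in {\mathcal F}_1, E_2 \in {\mathcal F}_2$. Furthermore, we have the following two facts:

- (Tonelli theorem) If $f: \Omega_1 \times \Omega_2 \rightarrow [0,+\infty]$ is measurable, then for each $\omega_1 \in \Omega_1$, the function $\omega_2 \mapsto f(\omega_1,\omega_2)$ is measurable on $\Omega_2$, and the function $\omega_1 \mapsto \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)$ is measurable on $\Omega_1$. Similarly, for each $\omega_2 \in \Omega_2$, the function $\omega_1 \mapsto f(\omega_1,\omega_2)$ is measurable on $\Omega_1$ and $\omega_2 \mapsto \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)$ is measurable on $\Omega_2$. Finally, we have
$$ \int_{\Omega_1 \times \Omega_2} f\ d(\mu_1 \times \mu_2) = \int_{\Omega_1} \left( \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2) \right) d\mu_1(\omega_1) = \int_{\Omega_2} \left( \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1) \right) d\mu_2(\omega_2). $$
- (Fubini theorem) If $f: \Omega_1 \times \Omega_2 \rightarrow {\bf C}$ is absolutely integrable, then for $\mu_1$-almost every $\omega_1 \in \Omega_1$, the function $\omega_2 \mapsto f(\omega_1,\omega_2)$ is absolutely integrable on $\Omega_2$, and the function $\omega_1 \mapsto \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2)$ is absolutely integrable on $\Omega_1$. Similarly, for $\mu_2$-almost every $\omega_2 \in \Omega_2$, the function $\omega_1 \mapsto f(\omega_1,\omega_2)$ is absolutely integrable on $\Omega_1$ and $\omega_2 \mapsto \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1)$ is absolutely integrable on $\Omega_2$. Finally, we have
$$ \int_{\Omega_1 \times \Omega_2} f\ d(\mu_1 \times \mu_2) = \int_{\Omega_1} \left( \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2) \right) d\mu_1(\omega_1) = \int_{\Omega_2} \left( \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1) \right) d\mu_2(\omega_2). $$

The Fubini and Tonelli theorems are often used together (so much so that one may refer to them as a single theorem, the Fubini-Tonelli theorem, often also just referred to as *Fubini’s theorem* in the literature). For instance, given an absolutely integrable function $f_1: \Omega_1 \rightarrow {\bf C}$ and an absolutely integrable function $f_2: \Omega_2 \rightarrow {\bf C}$, the Tonelli theorem tells us that the tensor product $f_1 \otimes f_2: \Omega_1 \times \Omega_2 \rightarrow {\bf C}$ defined by

$$ (f_1 \otimes f_2)(\omega_1, \omega_2) := f_1(\omega_1) f_2(\omega_2) $$

for $\omega_1 \in \Omega_1, \omega_2 \in \Omega_2$, is absolutely integrable and one has the factorisation

$$ \int_{\Omega_1 \times \Omega_2} f_1 \otimes f_2\ d(\mu_1 \times \mu_2) = \left( \int_{\Omega_1} f_1\ d\mu_1 \right) \left( \int_{\Omega_2} f_2\ d\mu_2 \right). $$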
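In the finitely supported case the product measure and the factorisation of tensor products can be written down directly. The following Python sketch represents a finite probability space as a dictionary from outcomes to probabilities; this representation, the sample measures, and the function names are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def product_measure(mu1, mu2):
    """Product of two finite probability spaces given as {outcome: probability} dicts."""
    return {(a, b): p * q for (a, p), (b, q) in product(mu1.items(), mu2.items())}


def integrate(f, mu):
    """Integral of f against a finitely supported measure mu."""
    return sum(f(w) * p for w, p in mu.items())


if __name__ == "__main__":
    mu1 = {0: Fraction(1, 3), 1: Fraction(2, 3)}
    mu2 = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
    f1 = lambda x: 2 * x + 1
    f2 = lambda y: 5 if y == "a" else -1

    joint = product_measure(mu1, mu2)
    lhs = integrate(lambda w: f1(w[0]) * f2(w[1]), joint)  # integral of the tensor product
    rhs = integrate(f1, mu1) * integrate(f2, mu2)          # product of the two integrals
    print(lhs, rhs)
```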

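Our proof of Theorem 1 will be based on the monotone class lemma that allows one to conveniently generate a $\sigma$-algebra from a Boolean algebra. (In Durrett, the closely related $\pi-\lambda$ theorem is used in place of the monotone class lemma.) Define a *monotone class* in a set $\Omega$ to be a collection ${\mathcal F}$ of subsets of $\Omega$ with the following two closure properties:

- If $E_1 \subset E_2 \subset \dots$ are a countable increasing sequence of sets in ${\mathcal F}$, then $\bigcup_{n=1}^\infty E_n \in {\mathcal F}$.
- If $E_1 \supset E_2 \supset \dots$ are a countable decreasing sequence of sets in ${\mathcal F}$, then $\bigcap_{n=1}^\infty E_n \in {\mathcal F}$.

Thus for instance any $\sigma$-algebra is a monotone class, but not conversely. Nevertheless, there is a key way in which monotone classes “behave like” $\sigma$-algebras:

Lemma 2 (Monotone class lemma) Let ${\mathcal A}$ be a Boolean algebra on $\Omega$. Then the $\sigma$-algebra $\langle {\mathcal A} \rangle$ generated by ${\mathcal A}$ is the smallest monotone class that contains ${\mathcal A}$.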

*Proof:* Let be the intersection of all the monotone classes that contain . Since is clearly one such class, is a subset of . Our task is then to show that contains .

It is also clear that is a monotone class that contains . By replacing all the elements of with their complements, we see that is necessarily closed under complements.

For any , consider the set of all sets such that , , , and all lie in . It is clear that contains ; since is a monotone class, we see that is also. By definition of , we conclude that for all .

Next, let be the set of all such that , , , and all lie in for all . By the previous discussion, we see that contains . One also easily verifies that is a monotone class. By definition of , we conclude that . Since is also closed under complements, this implies that is closed with respect to finite unions. Since this class also contains , which contains , we conclude that is a Boolean algebra. Since is also closed under increasing countable unions, we conclude that it is closed under arbitrary countable unions, and is thus a -algebra. As it contains , it must also contain .

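We now begin the proof of Theorem 1. We begin with the uniqueness claim. Suppose that we have two measures $\mu, \mu'$ on $(\Omega_1 \times \Omega_2, {\mathcal F}_1 \times {\mathcal F}_2)$ that are product measures of $\mu_1$ and $\mu_2$ in the sense that

$$ \mu( E_1 \times E_2 ) = \mu'( E_1 \times E_2 ) = \mu_1(E_1) \mu_2(E_2) \ \ \ \ \ (4) $$

for all $E_1 \in {\mathcal F}_1$ and $E_2 \in {\mathcal F}_2$. If we then set ${\mathcal F}$ to be the collection of all $E \in {\mathcal F}_1 \times {\mathcal F}_2$ such that $\mu(E) = \mu'(E)$, then ${\mathcal F}$ contains all sets of the form $E_1 \times E_2$ with $E_1 \in {\mathcal F}_1$ and $E_2 \in {\mathcal F}_2$. In fact ${\mathcal F}$ contains the collection ${\mathcal A}$ of all sets that are “elementary” in the sense that they are of the form $\bigcup_{i=1}^n E_{1,i} \times E_{2,i}$ for finite $n$ and $E_{1,i} \in {\mathcal F}_1, E_{2,i} \in {\mathcal F}_2$ for $i=1,\dots,n$, since such sets can be easily decomposed into a finite union of *disjoint* products $E'_{1,j} \times E'_{2,j}$, at which point the claim follows from (4) and finite additivity. But ${\mathcal A}$ is a Boolean algebra that generates ${\mathcal F}_1 \times {\mathcal F}_2$ as a $\sigma$-algebra, and from continuity from above and below we see that ${\mathcal F}$ is a monotone class. By the monotone class lemma, we conclude that ${\mathcal F}$ is all of ${\mathcal F}_1 \times {\mathcal F}_2$, and hence $\mu = \mu'$. This gives uniqueness.

Now we prove existence. We first claim that for any measurable set $E \in {\mathcal F}_1 \times {\mathcal F}_2$, the sets $E_{\omega_1} := \{ \omega_2 \in \Omega_2: (\omega_1, \omega_2) \in E \}$ are measurable in ${\mathcal F}_2$. Indeed, the claim is obvious for sets $E$ that are “elementary” in the sense that they belong to the Boolean algebra ${\mathcal A}$ defined previously, and the collection of all such sets is a monotone class, so the claim follows from the monotone class lemma. A similar argument (relying on monotone or dominated convergence) shows that the function

$$ \omega_1 \mapsto \mu_2( E_{\omega_1} ) $$

is measurable in ${\mathcal F}_1$ for all $E \in {\mathcal F}_1 \times {\mathcal F}_2$. Thus, for any $E \in {\mathcal F}_1 \times {\mathcal F}_2$, we can define the quantity $(\mu_1 \times \mu_2)(E)$ by

$$ (\mu_1 \times \mu_2)(E) := \int_{\Omega_1} \mu_2( E_{\omega_1} )\ d\mu_1(\omega_1). $$

A routine application of the monotone convergence theorem verifies that $\mu_1 \times \mu_2$ is a countably additive measure; one easily checks that (2) holds for all $E_1 \in {\mathcal F}_1, E_2 \in {\mathcal F}_2$, and in particular $\mu_1 \times \mu_2$ is a probability measure.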

By construction, we see that the identity

holds (with all functions integrated being measurable) whenever is an indicator function with . By linearity of integration, the same identity holds (again with all functions measurable) when is an unsigned simple function. Since any unsigned measurable function can be expressed as the monotone non-decreasing limit of unsigned simple functions (for instance, one can round down to the largest multiple of that is less than and ), the above identity also holds for unsigned measurable by the monotone convergence theorem. Applying this fact to the absolute value of an absolutely integrable function , we conclude for such functions that

which by Markov’s inequality implies that

for -almost every . In other words, the function is absolutely integrable on for -almost every . By monotonicity we conclude that

and hence the function is absolutely integrable. Hence it makes sense to ask whether the identity

holds for absolutely integrable , as both sides are well-defined. We have already established this claim when is unsigned and absolutely integrable; by subtraction this implies the claim for real-valued absolutely integrable , and by taking real and imaginary parts we obtain the claim for complex-valued absolutely integrable .

We may reverse the roles of and , and define instead by the formula

By the previously proved uniqueness of product measure, we see that this defines the same product measure as previously. Repeating the previous arguments we obtain all the above claims with the roles of and reversed. This gives all the claims required for Theorem 1.

One can extend the product construction easily to finite products:

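Exercise 3 (Finite products) Show that for any finite collection $(\Omega_i, {\mathcal F}_i, \mu_i)_{i \in A}$ of probability spaces, there exists a unique probability measure $\prod_{i \in A} \mu_i$ on $(\prod_{i \in A} \Omega_i, \prod_{i \in A} {\mathcal F}_i)$ such that

$$ \left(\prod_{i \in A} \mu_i\right)\left(\prod_{i \in A} E_i\right) = \prod_{i \in A} \mu_i(E_i) $$

whenever $E_i \in {\mathcal F}_i$ for $i \in A$. Furthermore, show that

$$ \prod_{i \in A} \mu_i = \left(\prod_{i \in A_1} \mu_i\right) \times \left(\prod_{i \in A_2} \mu_i\right) $$

for any partition $A = A_1 \uplus A_2$ (after making the obvious identification between $\prod_{i \in A} \Omega_i$ and $(\prod_{i \in A_1} \Omega_i) \times (\prod_{i \in A_2} \Omega_i)$). Thus for instance one has the associativity property

$$ (\mu_1 \times \mu_2) \times \mu_3 = \mu_1 \times (\mu_2 \times \mu_3) = \mu_1 \times \mu_2 \times \mu_3 $$

for any probability spaces $(\Omega_i, {\mathcal F}_i, \mu_i)$ for $i=1,2,3$.

By writing $\prod_{i \in A} \Omega_i$ as products of pairs of probability spaces in many different ways, one can obtain a higher-dimensional analogue of the Fubini and Tonelli theorems; we leave the precise statement of such a theorem to the interested reader.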

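It is important to be aware that the Fubini theorem identity

$$ \int_{\Omega_1} \left( \int_{\Omega_2} f(\omega_1,\omega_2)\ d\mu_2(\omega_2) \right) d\mu_1(\omega_1) = \int_{\Omega_2} \left( \int_{\Omega_1} f(\omega_1,\omega_2)\ d\mu_1(\omega_1) \right) d\mu_2(\omega_2) \ \ \ \ \ (5) $$

for measurable functions $f: \Omega_1 \times \Omega_2 \rightarrow {\bf C}$ that are not unsigned, is usually only justified when $f$ is absolutely integrable on $\Omega_1 \times \Omega_2$, or equivalently (by the Tonelli theorem) when the function $\omega_1 \mapsto \int_{\Omega_2} |f(\omega_1,\omega_2)|\ d\mu_2(\omega_2)$ is absolutely integrable on $\Omega_1$ (or when $\omega_2 \mapsto \int_{\Omega_1} |f(\omega_1,\omega_2)|\ d\mu_1(\omega_1)$ is absolutely integrable on $\Omega_2$). Without this joint absolute integrability (and without any unsigned property on $f$), the identity (5) can fail even if both sides are well-defined. For instance, let $\Omega_1 = \Omega_2$ be the unit interval $[0,1]$, and let $\mu_1 = \mu_2$ be the uniform probability measure on this interval, and set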

One can check that both sides of (5) are well-defined, but that the left-hand side is and the right-hand side is . Of course, this function is neither unsigned nor jointly absolutely integrable, so this counterexample does not violate either of the Fubini or Tonelli theorems. Thus one should take care to only interchange integrals when the integrands are known to be either unsigned or jointly absolutely integrable, or if one has another way to rigorously justify the exchange of integrals.
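The same failure of interchange already occurs for double sums, which are integrals against counting measure on the natural numbers (a $\sigma$-finite rather than probability setting). The Python sketch below uses a standard discrete array, not the function from the text: $+1$ on the diagonal and $-1$ just above it. The function names are hypothetical.

```python
def a(i, j):
    """Doubly indexed array (i, j >= 1): +1 on the diagonal, -1 just above it, 0 elsewhere."""
    if j == i:
        return 1
    if j == i + 1:
        return -1
    return 0


def row_sum(i):
    """sum_j a(i, j); only j = i and j = i + 1 contribute, so a finite range suffices."""
    return sum(a(i, j) for j in range(1, i + 3))


def col_sum(j):
    """sum_i a(i, j); only i = j - 1 and i = j contribute, so a finite range suffices."""
    return sum(a(i, j) for i in range(1, j + 2))


if __name__ == "__main__":
    # Every row sums to 0, so summing rows first gives 0;
    # the first column sums to 1 and the rest to 0, so summing columns first gives 1.
    print([row_sum(i) for i in range(1, 6)])
    print([col_sum(j) for j in range(1, 6)])
```

As with the example in the text, the array is neither unsigned nor absolutely summable, so neither the Tonelli nor the Fubini theorem applies.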

The above theory extends from probability spaces to finite measure spaces, and more generally to measure spaces that are $\sigma$-finite, that is to say they are expressible as the countable union of sets of finite measure. (With a bit of care, some portions of product measure theory are even extendible to non-sigma-finite settings, though I urge caution in applying these results blindly in that case.) We will not give the details of these generalisations here, but content ourselves with one example:

Exercise 4 Establish (1) for all Borel sets $E_1, E_2 \subset {\bf R}$. (Hint: ${\bf R}$ can be viewed as the disjoint union of a countable sequence of sets of measure $1$.)

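Remark 5 When doing real analysis (as opposed to probability), it is convenient to complete the Borel $\sigma$-algebra ${\mathcal B}[{\bf R}^n]$ on spaces such as ${\bf R}^n$, to form the larger Lebesgue $\sigma$-algebra ${\mathcal L}[{\bf R}^n]$, defined as the collection of all subsets $E$ in ${\bf R}^n$ that differ from a Borel set $F$ in ${\bf R}^n$ by a sub-null set, in the sense that $E \Delta F \subset G$ for some Borel subset $G$ of ${\bf R}^n$ of zero Lebesgue measure. There are analogues of the Fubini and Tonelli theorems for such complete $\sigma$-algebras; see this previous lecture notes for details. However one should be cautioned that the product ${\mathcal L}[{\bf R}^{n_1}] \times {\mathcal L}[{\bf R}^{n_2}]$ of Lebesgue $\sigma$-algebras is not the Lebesgue $\sigma$-algebra ${\mathcal L}[{\bf R}^{n_1+n_2}]$, but is instead an intermediate $\sigma$-algebra between ${\mathcal B}[{\bf R}^{n_1+n_2}]$ and ${\mathcal L}[{\bf R}^{n_1+n_2}]$, which causes some additional small complications. For instance, if $f: {\bf R}^{n_1+n_2} \rightarrow {\bf C}$ is Lebesgue measurable, then the functions $x_2 \mapsto f(x_1,x_2)$ can only be found to be Lebesgue measurable on ${\bf R}^{n_2}$ for almost every $x_1 \in {\bf R}^{n_1}$, rather than for all $x_1 \in {\bf R}^{n_1}$. We will not dwell on these subtleties further here, as we will rarely have any need to complete the $\sigma$-algebras used in probability theory.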

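It is also important in probability theory applications to form the product of an *infinite* number of probability spaces $(\Omega_\alpha, {\mathcal F}_\alpha, \mu_\alpha)$ for $\alpha \in A$, where $A$ can be infinite or even uncountable. Recall from Notes 0 that the product $\sigma$-algebra ${\mathcal F}_A := \prod_{\alpha \in A} {\mathcal F}_\alpha$ on $\Omega_A := \prod_{\alpha \in A} \Omega_\alpha$ is defined to be the $\sigma$-algebra generated by the sets $\pi_\beta^{-1}(E_\beta)$ for $\beta \in A$ and $E_\beta \in {\mathcal F}_\beta$, where $\pi_\beta: \Omega_A \rightarrow \Omega_\beta$ is the usual coordinate projection. Equivalently, if we define an *elementary set* to be a subset of $\Omega_A$ of the form $\pi_B^{-1}(E_B)$, where $B$ is a finite subset of $A$, $\pi_B: \Omega_A \rightarrow \Omega_B$ is the obvious projection map to $\Omega_B := \prod_{\beta \in B} \Omega_\beta$, and $E_B$ is a measurable set in ${\mathcal F}_B := \prod_{\beta \in B} {\mathcal F}_\beta$, then ${\mathcal F}_A$ can be defined as the $\sigma$-algebra generated by the collection of elementary sets. (Elementary sets are the measure-theoretic analogue of cylinder sets in point set topology.) For future reference we note the useful fact that the collection of elementary sets is a Boolean algebra.

We define a *product measure* $\mu_A = \prod_{\alpha \in A} \mu_\alpha$ to be a probability measure on the measurable space $(\Omega_A, {\mathcal F}_A)$ which extends all of the finite products in the sense that

$$ \mu_A( \pi_B^{-1}(E_B) ) = \mu_B( E_B ) $$

for all finite subsets $B$ of $A$ and all $E_B$ in ${\mathcal F}_B$, where $\mu_B := \prod_{\beta \in B} \mu_\beta$. If this product measure exists, it is unique:

Exercise 6 Show that for any collection of probability spaces $(\Omega_\alpha, {\mathcal F}_\alpha, \mu_\alpha)$ for $\alpha \in A$, there is at most one product measure $\mu_A$. (Hint: adapt the uniqueness argument in Theorem 1 that used the monotone class lemma.)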

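Exercise 7 Let $\mu_1, \dots, \mu_n$ be probability measures on ${\bf R}$, and let $F_1, \dots, F_n: {\bf R} \rightarrow [0,1]$ be their Stieltjes measure functions. Show that $\mu_1 \times \dots \times \mu_n$ is the unique probability measure on ${\bf R}^n$ whose Stieltjes measure function is the tensor product $(t_1, \dots, t_n) \mapsto F_1(t_1) \dots F_n(t_n)$ of $F_1, \dots, F_n$.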

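In the case of finite $A$, the finite product constructed in Exercise 3 is clearly the unique product. But for infinite $A$, the construction of product measure is a more nontrivial issue. We can generalise the problem as follows:

Problem 8 (Extension problem) Let $(\Omega_\alpha, {\mathcal F}_\alpha)_{\alpha \in A}$ be a collection of measurable spaces. For each finite $B \subset A$, let $\mu_B$ be a probability measure on $(\Omega_B, {\mathcal F}_B)$ obeying the compatibility condition

$$ \mu_B( E_B ) = \mu_{B'}( \pi_{B' \rightarrow B}^{-1}(E_B) ) \ \ \ \ \ (6) $$

for all finite $B \subset B' \subset A$ and $E_B \in {\mathcal F}_B$, where $\pi_{B' \rightarrow B}: \Omega_{B'} \rightarrow \Omega_B$ is the obvious restriction. Can one then define a probability measure $\mu_A$ on $(\Omega_A, {\mathcal F}_A)$ such that

$$ \mu_A( \pi_B^{-1}(E_B) ) = \mu_B( E_B ) \ \ \ \ \ (7) $$

for all finite $B \subset A$ and $E_B \in {\mathcal F}_B$?

Note that the compatibility condition (6) is clearly necessary if one is to find a measure $\mu_A$ obeying (7).

Again, one has uniqueness:

Exercise 9 Show that for any $(\Omega_\alpha, {\mathcal F}_\alpha)_{\alpha \in A}$ and $\mu_B$ for finite $B \subset A$ as in the above extension problem, there is at most one probability measure $\mu_A$ with the stated properties.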

The extension problem is trivial for finite $A$, but for infinite $A$ there are unfortunately examples where the probability measure $\mu_A$ fails to exist. However, there is one key case in which we can build the extension, thanks to the Kolmogorov extension theorem. Call a measurable space $(\Omega, {\mathcal F})$ standard Borel if it is isomorphic as a measurable space to a Borel subset of the unit interval $[0,1]$ with the Borel $\sigma$-algebra, that is to say there is a bijection $f$ from $\Omega$ to a Borel subset $E$ of $[0,1]$ such that $f: \Omega \rightarrow E$ and $f^{-1}: E \rightarrow \Omega$ are both measurable. (In Durrett, such spaces are called *nice spaces*.) Note that one can easily replace $[0,1]$ by other standard spaces such as ${\bf R}$ if desired, since these spaces are isomorphic as measurable spaces (why?).

Theorem 10 (Kolmogorov extension theorem) Let the situation be as in Problem 8. If all the measurable spaces $(\Omega_\alpha, {\mathcal F}_\alpha)$ are standard Borel, then there exists a probability measure $\mu_A$ solving the extension problem (which is then unique, thanks to Exercise 9).

The proof of this theorem is lengthy and is deferred to the next (optional) section. Specialising to the product case, we conclude

Corollary 11 Let $(\Omega_\alpha, {\mathcal F}_\alpha, \mu_\alpha)_{\alpha \in A}$ be a collection of probability spaces with $(\Omega_\alpha, {\mathcal F}_\alpha)$ standard Borel. Then there exists a product measure $\prod_{\alpha \in A} \mu_\alpha$ (which is then unique, thanks to Exercise 6).

Of course, to use this theorem we would like to have a large supply of standard Borel spaces. Here is one tool that often suffices:

Lemma 12 Let $(X,d)$ be a complete separable metric space, and let $\Omega$ be a Borel subset of $X$. Then $\Omega$ (with the Borel $\sigma$-algebra) is standard Borel.

*Proof:* Let us call two topological spaces Borel isomorphic if their corresponding Borel structures are isomorphic as measurable spaces. Using the binary expansion, we see that $[0,1]$ is Borel isomorphic to $\{0,1\}^{\bf N}$ (the countable number of points that have two binary expansions can be easily permuted to obtain a genuine isomorphism). Similarly $[0,1]^{\bf N}$ is Borel isomorphic to $(\{0,1\}^{\bf N})^{\bf N}$. Since ${\bf N} \times {\bf N}$ is in bijection with ${\bf N}$, we conclude that $[0,1]^{\bf N}$ is Borel isomorphic to $[0,1]$. Thus it will suffice to show that every complete separable metric space $(X,d)$ is Borel isomorphic to a Borel subset of $[0,1]^{\bf N}$. But if we let $q_1, q_2, \dots$ be a countable dense subset in $X$, the map

$$ x \mapsto \left( \frac{d(x,q_i)}{1+d(x,q_i)} \right)_{i \in {\bf N}} $$

can easily be seen to be a homeomorphism between $X$ and a subset of $[0,1]^{\bf N}$, which is completely metrisable and hence Borel (in fact it is a $G_\delta$ set – the countable intersection of open sets – why?). The claim follows.

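Exercise 13 (Kolmogorov extension theorem, alternate form) For each natural number $n$, let $\mu_n$ be a probability measure on ${\bf R}^n$ with the property that

$$ \mu_{n+1}( B \times {\bf R} ) = \mu_n( B ) $$

for $n \geq 1$ and any box $B = [a_1,b_1] \times \dots \times [a_n,b_n]$ in ${\bf R}^n$, where we identify ${\bf R}^n \times {\bf R}$ with ${\bf R}^{n+1}$ in the usual manner. Show that there exists a unique probability measure $\mu_{\bf N}$ on ${\bf R}^{\bf N}$ (with the product $\sigma$-algebra, or equivalently the Borel $\sigma$-algebra on the product topology) such that

$$ \mu_{\bf N}( \{ (\omega_i)_{i \in {\bf N}}: (\omega_1, \dots, \omega_n) \in E \} ) = \mu_n( E ) $$

for all $n \geq 1$ and Borel sets $E \subset {\bf R}^n$.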

** — 2. Proof of the Kolmogorov extension theorem (optional) — **

We now prove Theorem 10. By the definition of a standard Borel space, we may assume without loss of generality that each $\Omega_\alpha$ is a Borel subset of $[0,1]$ with the Borel $\sigma$-algebra, and then by extending each $\Omega_\alpha$ to $[0,1]$ we may in fact assume without loss of generality that each $\Omega_\alpha$ is simply $[0,1]$ with the Borel $\sigma$-algebra. Thus each $\mu_B$ for finite $B \subset A$ is a probability measure on the cube $[0,1]^B$.

We will exploit the regularity properties of such measures:

Exercise 14 Let $B$ be a finite set, and let $\mu_B$ be a probability measure on $[0,1]^B$ (with the Borel $\sigma$-algebra). For any Borel set $E$ in $[0,1]^B$, establish the inner regularity property

$$ \mu_B(E) = \sup_{K \subset E, K \hbox{ compact}} \mu_B(K) $$

and the outer regularity property

$$ \mu_B(E) = \inf_{U \supset E, U \hbox{ open}} \mu_B(U). $$

Hint: use the monotone class lemma.

Another way of stating the above exercise is that finite Borel measures on the cube are automatically Radon measures. In fact there is nothing particularly special about the unit cube here; the claim holds for any compact separable metric spaces. Radon measures are often used in real analysis (see e.g. these lecture notes) but we will not develop their theory further here.

Observe that one can define the *elementary measure* $\mu_0(E)$ of any elementary set $E = \pi_B^{-1}(E_B)$ in $\Omega_A = [0,1]^A$ by defining

$$ \mu_0( \pi_B^{-1}(E_B) ) := \mu_B( E_B ) $$

for any finite $B \subset A$ and any Borel $E_B \subset [0,1]^B$. This definition is well-defined thanks to the compatibility hypothesis (6). From the finite additivity of the $\mu_B$ it is easy to see that $\mu_0$ is a *finitely* additive probability measure on the Boolean algebra of elementary sets.

We would like to extend $\mu_0$ to a *countably* additive probability measure on ${\mathcal F}_A$. The standard approach to do this is via the Carathéodory extension theorem in measure theory (or the closely related Hahn-Kolmogorov theorem); this approach is presented in these previous lecture notes, and a similar approach is taken in Durrett. Here, we will try to avoid developing the Carathéodory extension theorem, and instead take a more direct approach similar to the direct construction of Lebesgue measure, given for instance in these previous lecture notes.

Given any subset $E \subset [0,1]^A$ (not necessarily Borel), we define its outer measure $\mu^*(E)$ to be the quantity

$$ \mu^*(E) := \inf \left\{ \sum_{n=1}^\infty \mu_0(E_n): (E_n)_{n=1}^\infty \hbox{ is an open elementary cover of } E \right\} $$

where we say that $(E_n)_{n=1}^\infty$ is an *open elementary cover* of $E$ if each $E_n$ is an open elementary set, and $E \subset \bigcup_{n=1}^\infty E_n$. Some properties of this outer measure are easily established:

Exercise 15

- (i) Show that $\mu^*(\emptyset) = 0$.
- (ii) (Monotonicity) Show that if $E \subset F \subset [0,1]^A$ then $\mu^*(E) \leq \mu^*(F)$.
- (iii) (Countable subadditivity) For any countable sequence $E_1, E_2, \dots$ of subsets of $[0,1]^A$, show that $\mu^*( \bigcup_{n=1}^\infty E_n ) \leq \sum_{n=1}^\infty \mu^*(E_n)$. In particular (from part (i)) we have the finite subadditivity $\mu^*(E \cup F) \leq \mu^*(E) + \mu^*(F)$ for all $E, F \subset [0,1]^A$.
- (iv) (Elementary sets) If $E$ is an elementary set, show that $\mu^*(E) = \mu_0(E)$. (Hint: first establish the claim when $E$ is compact, relying heavily on the regularity properties of the $\mu_B$ provided by Exercise 14, then extend to the general case by further heavy reliance on regularity.) In particular, we have $\mu^*([0,1]^A) = 1$.
- (v) (Approximation) Show that if $E \in {\mathcal F}_A$, then for any $\varepsilon > 0$ there exists an elementary set $F$ such that $\mu^*(E \Delta F) \leq \varepsilon$. (Hint: use the monotone class lemma. When dealing with an increasing sequence of measurable sets $E_n$ obeying the required property, approximate these sets by an increasing sequence of elementary sets $F_n$, and use the finite additivity of elementary measure and the fact that bounded monotone sequences converge.)

From part (v) of the above exercise, we see that every can be viewed as a “limit” of a sequence of elementary sets such that . From parts (iii), (iv) we see that the sequence is a Cauchy sequence and thus converges to a limit, which we denote ; one can check from further application of (iii), (iv) that this quantity does not depend on the specific choice of . (Indeed, from subadditivity we see that .) From definition we see that extends (thus for any elementary set ), and from the above exercise one checks that is countably additive. Thus is a probability measure with the desired properties, and the proof of the Kolmogorov extension theorem is complete.

** — 3. Independence — **

Using the notion of product measure, we can now quickly define the notion of independence:

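Definition 16 A collection $(X_\alpha)_{\alpha \in A}$ of random variables (each of which take values in some measurable space $R_\alpha$) is said to be jointly independent, if the distribution of $(X_\alpha)_{\alpha \in A}$ is the product of the distributions of the $X_\alpha$. Or equivalently (after expanding all the definitions), we have

$$ {\bf P}\left( \bigwedge_{\alpha \in B} (X_\alpha \in S_\alpha) \right) = \prod_{\alpha \in B} {\bf P}( X_\alpha \in S_\alpha ) $$

for all finite $B \subset A$ and all measurable subsets $S_\alpha$ of $R_\alpha$. We say that two random variables $X, Y$ are independent (or that $X$ is independent of $Y$) if the pair $(X,Y)$ is jointly independent.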

It is worth reiterating that unless otherwise specified,