This is however not the end of the matter; there are many variants, refinements, and generalisations of the central limit theorem, and the purpose of this set of notes is to present a small sample of these variants.
First of all, the above theorem does not quantify the rate of convergence in (1). We have already addressed this issue to some extent with the Berry-Esséen theorem, which roughly speaking gives a convergence rate of uniformly in if we assume that has finite third moment. However there are still some quantitative versions of (1) which are not addressed by the Berry-Esséen theorem. For instance one may be interested in bounding the large deviation probabilities
in the setting where grows with . Chebyshev’s inequality gives an upper bound of for this quantity, but one can often do much better than this in practice. For instance, the central limit theorem (1) suggests that this probability should be bounded by something like ; however, this theorem only kicks in when is very large compared with . For instance, if one uses the Berry-Esséen theorem, one would need as large as or so to reach the desired bound of , even under the assumption of finite third moment. Basically, the issue is that convergence-in-distribution results, such as the central limit theorem, only really control the typical behaviour of statistics in ; they are much less effective at controlling the very rare outlier events in which the statistic strays far from its typical behaviour. Fortunately, there are large deviation inequalities (or concentration of measure inequalities) that do provide exponential type bounds for quantities such as (2), which are valid for both small and large values of . A basic example of this is the Chernoff bound that made an appearance in Exercise 47 of Notes 4; here we give some further basic inequalities of this type, including versions of the Bennett and Hoeffding inequalities.
where is now bounded (but can grow with ). The central limit theorem predicts that this quantity should be roughly , but even if one is able to invoke the Berry-Esséen theorem, one cannot quite see this main term because it is dominated by the error term in Berry-Esséen. There is good reason for this: if for instance takes integer values, then also takes integer values, and can vanish when is less than and is slightly larger than an integer. However, this turns out to essentially be the only obstruction; if does not lie in a lattice such as , then we can establish a local limit theorem controlling (3), and when does take values in a lattice like , there is a discrete local limit theorem that controls probabilities such as . Both of these limit theorems will be proven by the Fourier-analytic method used in the previous set of notes.
We also discuss other limit theorems in which the limiting distribution is something other than the normal distribution. Perhaps the most common example of these theorems is the Poisson limit theorems, in which one sums a large number of indicator variables (or approximate indicator variables), each of which is rarely non-zero, but which collectively add up to a random variable of medium-sized mean. In this case, it turns out that the limiting distribution should be a Poisson random variable; this again is an easy application of the Fourier method. Finally, we briefly discuss limit theorems for other stable laws than the normal distribution, which are suitable for summing random variables of infinite variance, such as the Cauchy distribution.
Finally, we mention a very important class of generalisations to the CLT (and to the variants of the CLT discussed in this post), in which the hypothesis of joint independence between the variables is relaxed, for instance one could assume only that the form a martingale. Many (though not all) of the proofs of the CLT extend to these more general settings, and this turns out to be important for many applications in which one does not expect joint independence. However, we will not discuss these generalisations in this course, as they are better suited for subsequent courses in this series when the theory of martingales, conditional expectation, and related tools are developed.
— 1. Large deviation inequalities —
We now look at some upper bounds for the large deviation probability (2). To get some intuition as to what kinds of bounds one can expect, we first consider some examples. First suppose that has the standard normal distribution, then , , and has the distribution of , so that has the distribution of . We thus have
We recall Stirling’s formula, which we write crudely as
as , where denotes a quantity with as (and similarly for other uses of the notation in the sequel). If and is bounded away from zero and one, we then have the asymptotic
where is the entropy function
(compare with Exercise 17 of Notes 3). One can check that is decreasing for , and so one can compute that
as for any fixed . To compare this with (4), observe from Taylor expansion that
Finally, consider the example where takes values in with and for some small , thus and . We have , and hence
Here, we see that the large deviation probability is somewhat larger than the gaussian prediction of . Instead, the exponent is approximately related to and by the formula
We now give a general large deviations inequality that is consistent with the above examples.
Proposition 2 (Cheap Bennett inequality) Let , and let be independent random variables, each of which takes values in an interval of length at most . Write , and write for the mean of . Let be such that has variance at most . Then for any , we have
There is more precise form of this inequality known as Bennett’s inequality, but we will not prove it here.
The first term in the minimum dominates when , and the second term dominates when . Sometimes it is convenient to weaken the estimate by discarding the logarithmic factor, leading to
(possibly with a slightly different choice of ); thus we have Gaussian type large deviation estimates for as large as , and (slightly better than) exponential decay after that.
In the case when are iid copies of a random variable of mean and variance taking values in an interval of length , we have and , and the above inequality simplifies slightly to
Proof: We first begin with some quick reductions. Firstly, by dividing the (and , , and ) by , we may normalise ; by subtracting the mean from each of the , we may assume that the have mean zero, so that as well. We also write for the variance of each , so that . Our task is to show that
for all . We will just prove the upper tail bound
the lower tail bound then follows by replacing all with their negations , and the claim then follows by summing the two estimates.
We use the “exponential moment method”, previously seen in proving the Chernoff bound (Exercise 47 of Notes 4), in which one uses the exponential moment generating function of . On the one hand, from Markov’s inequality one has
for any real parameter . On the other hand, from the joint independence of the one has
Since the take values in an interval of length at most and have mean zero, we have and so . This leads to the Taylor expansion
so on taking expectations we have
Putting all this together, we conclude that
If (say), one can then set to be a small multiple of to obtain a bound of the form
If instead , one can set to be a small multiple of to obtain a bound of the form
In either case, the claim follows.
The following variant of the above proposition is also useful, in which we get a simpler bound at the expense of worsening the quantity slightly:
Proposition 3 (Cheap Hoeffding inequality) Let be independent random variables, with each taking values in an interval with . Write , and write for the mean of , and write
Proof: We again normalise the to have mean zero, so that . We then have for each , so by Taylor expansion we have for any real that
Multiplying in , we then have
and one can now repeat the previous arguments (but without the factor to deal with).
Remark 4 In the above examples, the underlying random variable was assumed to either be restricted to an interval, or to be subgaussian. This type of hypothesis is necessary if one wishes to have estimates on (2) that are similarly subgaussian. For instance, suppose has a zeta distribution
for some and all natural numbers , where . One can check that this distribution has finite mean and variance for . On the other hand, since we trivially have , we have the crude lower bound
which shows that in this case the expression (2) only decays at a polynomial rate in rather than an exponential or subgaussian rate.
Exercise 5 (Khintchine inequality) Let be iid copies of a Bernoulli random variable drawn uniformly from .
- (i) For any non-negative reals and any , show that
for some constant depending only on . When , show that one can take and equality holds.
- (ii) With the hypotheses in (i), obtain the matching lower bound
for some depending only on . (Hint: use (i) and Hölder’s inequality.)
- (iii) For any and any functions on a measure space , show that
with the same constants as in (i), (ii). When , show that one can take and equality holds.
- (iv) (Marcinkiewicz-Zygmund theorem) The Khintchine inequality is very useful in real analysis; we give one example here. Let be measure spaces, let , and suppose is a linear operator obeying the bound
for all and some finite . Show that for any finite sequence , one has the bound
for some constant depending only on . (Hint: test against a random sum .)
- (v) By using gaussian sums in place of random signs, show that one can take the constant in (iv) to be one. (For simplicity, let us take the functions in to be real valued.)
In this set of notes we have not focused on getting explicit constants in the large deviation inequalities, but it is not too difficult to do so with a little extra work. We give just one example here:
- (i) Show that for any real . (Hint: expand both sides as an infinite Taylor series in .)
- (ii) Show that for any real numbers and any , we have
(Note that this is consistent with (6) with .)
There are many further large deviation inequalities than the ones presented here. For instance, the Azuma-Hoeffding inequality gives Hoeffding-type bounds when the random variables are not assumed to be jointly independent, but are instead required to form a martingale. Concentration of measure inequalities such as McDiarmid’s inequality handle the situation in which the sum is replaced by a more nonlinear function of the input variables . There are also a number of refinements of the Chernoff estimate from the previous notes, that are collectively referred to as “Chernoff bounds“. The Bernstein inequalities handle situations in which the underlying random variable is not bounded, but enjoys good moment bounds. See this previous blog post for these inequalities and some further discussion. Last, but certainly not least, there is an extensively developed theory of large deviations which is focused on the precise exponent in the exponential decay rate for tail probabilities such as (2) when is very large (of the order of ); there is also a complementary theory of moderate deviations that gives precise estimates in the regime where is much larger than one, but much less than , for which we generally expect gaussian behaviour rather than exponentially decaying bounds. These topics are beyond the scope of this course.
— 2. Local limit theorems —
Let be iid copies of a random variable of mean and variance , and write . On the one hand, the central limit theorem tells us that should behave like the normal distribution , which has probability density function . On the other hand, if is discrete, then must also be discrete. For instance, if takes values in the integers , then takes values in the integers as well. In this case, we would expect (much as we expect a Riemann sum to approximate an integral) the probability distribution of to behave like the probability density function predicted by the central limit theorem, thus we expect
for integer . This is not a direct consequence of the central limit theorem (which does not distinguish between continuous or discrete random variables ), and in any case is not true in some cases: if is restricted to an infinite subprogression of for some and integer , then is similarly restricted to the infinite subprogression , so that (7) totally fails when is outside of (and when does lie in , one would now expect the left-hand side of (7) to be about times larger than the right-hand side, to keep the total probability close to ). However, this turns out to be the only obstruction:
Theorem 7 (Discrete local limit theorem) Let be iid copies of an integer-valued random variable of mean and variance . Suppose furthermore that there is no infinite subprogression of with for which takes values almost surely in . Then one has
for all and all integers , where the error term is uniform in . In other words, we have
Note for comparison that the Berry-Esséen theorem (writing as, say, ) would give (assuming finite third moment) an error term of instead of , which would overwhelm the main term which is also of size .
Proof: Unlike previous arguments, we do not have the luxury here use an affine change of variables to normalise to mean zero and variance one, as this would disrupt the hypothesis that takes values in .
Fix and . Since and are integers, we have the Fourier identity
which upon taking expectations and using Fubini’s theorem gives
where is the characteristic function of . Expanding and noting that are iid copies of , we have
It will be convenient to make the change of variables , to obtain
A standard Fourier-analytic calculation gives
so it will now suffice to establish that
uniformly in . From dominated convergence we have
so by the triangle inequality, it suffices to show that
This will follow from (9) and the dominated convergence theorem, as soon as we can dominate the integrands by an absolutely integrable function.
From (8), there is an such that
for all , and hence
for . This gives the required domination in the region , so it remains to handle the region .
From the triangle inequality we have for all . Actually we have the stronger bound for . Indeed, if for some such , this would imply that is a deterministic constant, which means that takes values in for some real , which implies that takes values in ; since also takes values in , this would place either in a singleton set or in an infinite subprogression of (depending on whether and are rational), a contradiction. As is continuous and the region is compact, there exists such that for all . This allows us to dominate by for , which is in turn bounded by for some independent of , giving the required domination.
Of course, if the random variable in Theorem 7 did take values almost surely in some subprogression , then either is almost surely constant, or there is a minimal progression with this property (since must divide the difference of any two integers that attains with positive probability). One can then make the affine change of variables (modifying and appropriately) and apply the above theorem to obtain a similar local limit theorem, which we will not write here. For instance, if is the uniform distribution on , then this argument gives
when is an integer of the same parity as (of course, will vanish otherwise). A further affine change of variables handles the case when is not integer valued, but takes values in some other lattice , where and are now real-valued.
We can complement these local limit theorems with the following result that handles the non-lattice case:
Theorem 8 (Continuous local limit theorem) Let be iid copies of an real-valued random variable of mean and variance . Suppose furthermore that there is no infinite progression with real for which takes values almost surely in . Then for any , one has
for all and all , where the error term is uniform in (but may depend on ).
Equivalently, if we let be a normal random variable with mean and variance , then
uniformly in . Again, this can be compared with the Berry-Esséen theorem, which (assuming finite third moment) has an error term of which is uniform in both and .
Proof: Unlike the discrete case, we have the luxury here of normalising and , and we shall now do so.
is compactly supported, and where the error can depend on but is uniform in . Indeed, if this bound (10) holds, then (after replacing by to make it positive on some interval, and then rescaling) we obtain a bound of the form
for any and , where the term can depend on but not on . Then, by convolving by an approximation to the identity of some width much smaller than with compactly supported Fourier transform, applying (10) to the resulting function, and using (11) to control the error between that function and , we see that
uniformly in , where the term can depend on and but is uniform in . Letting tend slowly to zero, we obtain the claim.
It remains to establish (10). We adapt the argument from the discrete case. By the Fourier inversion formula we may write
By Fubini’s theorem as before, we thus have
so it suffices by the triangle inequality and the boundedness and compact support of to show that
for any fixed (where the term can depend on ). We have and , so by making the change of variables , we now need to show that
as . But this follows from the argument used to handle the discrete case.
— 3. The Poisson central limit theorem —
The central limit theorem (after normalising the random variables to have mean zero) studies the fluctuations of sums where each individual term is quite small (typically of size ). Now we consider a variant situation, in which one considers a sum of random variables which are usually zero, but occasionally equal to a larger value such as . (This situation arises in many real-life situations when compiling aggregate statistics on rare events, e.g. the number of car crashes in a short period of time.) In these cases, one can get a different distribution than the gaussian distribution, namely a Poisson distribution with some intensity – that is to say, a random variable taking values in the non-negative integers with probability distribution
One can check that this distribution has mean and variance .
- (i) ( mostly ) One has as .
- (ii) ( rarely ) One has as .
- (iii) (Convergent expectation) One has as for some .
Then the random variables converge in distribution to a Poisson random variable of intensity .
Proof: From hypothesis (i) and the union bound we see that for each , we have that all of the lie in with probability as . Thus, if we replace each by the restriction , the random variable is only modified on an event of probability , which does not affect distributional limits (Slutsky’s theorem). Thus, we may assume without loss of generality that the take values in .
By Exercise 20 of Notes 4, a Poisson random variable of intensity has characteristic function . Applying the Lévy convergence theorem (Theorem 27 of Notes 4), we conclude that it suffices to show that
as for any fixed .
Fix . By the independence of the , we may write
Since only takes on the values and , we can write
where . By hypothesis (ii), we have , so by using a branch of the complex logarithm that is analytic near , we can write
By Taylor expansion we have
and hence by (ii), (iii)
as , and the claim follows.
Exercise 10 Establish the conclusion of Theorem 9 directly from explicit computation of the probabilities in the case when each takes values in with for some fixed .
The Poisson central limit theorem can be viewed as a degenerate limit of the central limit theorem, as seen by the next two exercises.
Exercise 11 Suppose we replace the hypothesis (iii) in Theorem 9 with the alternative hypothesis that the quantities go to infinity as , while leaving hypotheses (i) and (ii) unchanged. Show that converges in distribution to the normal distribution .
Exercise 12 For each , let be a Poisson random variable with intensity . Show that as , the random variables converge in distribution to the normal distribution . Discuss how this is consistent with Theorem 9 and the previous exercise.
— 4. Stable laws —
Let be a real random variable. We say that has a stable law or a stable distribution if for any positive reals , there exists a positive real and a real such that whenever are iid copies of . In terms of the characteristic function of , we see that has a stable law if for any positive reals , there exist a positive real and a real for which we have the functional equation
for all real .
For instance, a normally distributed variable is stable thanks to Lemma 12 of Notes 4; one can also see this from the characteristic function . A Cauchy distribution , with probability density can also be seen to be stable, as is most easily seen from the characteristic function . As a more degenerate example, any deterministic random variable is stable. It is possible (though somewhat tedious) to completely classify all the stable distributions, see for instance the Wikipedia entry on these laws for the full classification.
If is stable, and are iid copies of , then by iterating the stable law hypothesis we see that the sums are all equal in distribution to some affine rescaling of . For instance, we have for some , and a routine induction then shows that
for all natural numbers (with the understanding that when ). In particular, the random variables all have the same distribution as .
More generally, given two real random variables and , we say that is in the basin of attraction for if, whenever are iid copies of and , there exist constants and such that converges in distribution to . Thus, any stable law is in its own basin of attraction, while the central limit theorem asserts that any random variable of finite variance is in the basin of attraction of a normal distribution. One can check that every random variable lies in the basin of attraction of a deterministic random variable such as , simply by letting go to infinity rapidly enough. To avoid this degenerate case, we now restrict to laws that are non-degenerate, in the sense that they are not almost surely constant. Then we have the following useful technical lemma:
Proposition 13 (Convergence of types) Let be a sequence of real random variables converging in distribution to a non-degenerate limit . Let and be real numbers such that converges in distribution to a non-degenerate limit . Then and converge to some finite limits respectively, and .
Proof: Suppose first that goes to zero. The sequence converges in distribution, hence is tight, hence converges in probability to zero. In particular, if is an independent copy of , then converges in probability to zero; but also converges in distribution to where is an independent copy of , and is not almost surely zero since is non-degenerate. This is a contradiction. Similarly if has any subsequence that goes to zero. We conclude that is bounded away from zero. Rewriting as and reversing the roles of and , we conclude also that is bounded away from zero, thus is bounded.
Since is tight and is bounded, is tight; since is also tight, this implies that is tight, that is to say is bounded.
Let be a limit point of the . By Slutsky’s theorem, a subsequence of the then converges in distribution to , thus . If the limit point is unique then we are done, so suppose there are two limit points , . Thus , which on rearranging gives for some and real with .
If then on iteration we have for any natural number , which clearly leads to a contradiction as since . If then iteration gives for any natural number , which on passing to the limit in distribution as gives , again a contradiction. If then we rewrite as and again obtain a contradiction, and the claim follows.
One can use this proposition to verify that basins of attraction of genuinely distinct laws are disjoint:
Exercise 14 Let and be non-degenerate real random variables. Suppose that a random variable lies in the basin of attraction of both and . Then there exist and real such that .
If lies in the basin of attraction for a non-degenerate law , then converges in distribution to ; since is equal in distribution to the sum of two iid copies of , we see that converges in distribution to the sum of two iid copies of . On the other hand, converges in distribution to . Using Proposition 13 we conclude that for some and real . One can go further and conclude that in fact has a stable law; see the following exercise. Thus stable laws are the only laws that have a non-empty basin of attraction.
Exercise 15 Let lie in the basin of attraction for a non-degenerate law .
- (i) Show that for any iid copies of , there exists a unique and such that . Also show that for all natural numbers .
- (ii) Show that the are strictly increasing, with for all natural numbers . (Hint: study the absolute value of the characteristic function, using the non-degeneracy of to ensure that this absolute value is usually strictly less than one.) Also show that for all natural numbers .
- (iii) Show that there exists such that for all . (Hint: first show that is a Cauchy sequence in .)
- (iv) If , and are iid copies of , show that for all natural numbers and some bounded real . Then show that has a stable law in this case.
- (v) If , show that for some real and all . Then show that has a stable law in this case.
Exercise 16 (Classification of stable laws) Let be a non-degenerate stable law, then lies in its own basin of attraction and one can then define as in the preceding exercise.
- (i) If , and is as in part (v) of the preceding exercise, show that for all and . Then show that
for some real . (One can use the identity to restrict attention to the case of positive .)
- (ii) Now suppose . Show that for all (where the implied constant in the notation is allowed to depend on ). Conclude that for all .
- (iii) We continue to assume . Show that for some real number . (Hint: first show this when is a power of a fixed natural number (with possibly depending on ). Then use the estimates from part (ii) to show that does not actually depend on . (One may need to invoke the Dirichlet approximation theorem to show that for any given , one can find a power of that is somewhat close to a power of .)
- (iv) We continue to assume . Show that for all and . Then show that
for all and some real .
It is also possible to determine which choices of parameters are actually achievable by some random variable , but we will not do so here.
It is possible to associate a central limit theorem to each stable law, which precisely determines their basin of attraction. We will not do this in full generality, but just illustrate the situation for the Cauchy distribution.
Exercise 17 Let be a real random variable which is symmetric (that is, has the same distribution as ) and obeys the distribution identity
for all , where is a function which is slowly varying in the sense that as for all .
- (i) Show that
as , where denotes a quantity such that as . (You may need to establish the identity , which can be done by contour integration.)
- (ii) Let be iid copies of . Show that converges in distribution to a copy of the standard Cauchy distribution (i.e., to a random variable with probability density function ).