Let $X_1, X_2, \dots$ be iid copies of an absolutely integrable real scalar random variable $X$, and form the partial sums $S_n := X_1 + \dots + X_n$. As we saw in the last set of notes, the law of large numbers ensures that the empirical averages $S_n/n$ converge (both in probability and almost surely) to a deterministic limit, namely the mean $\mu := \mathbf{E} X$ of the reference variable $X$. Furthermore, under some additional moment hypotheses on the underlying variable $X$, we can obtain *square root cancellation* for the fluctuation $\frac{S_n}{n} - \mu$ of the empirical average from the mean. To simplify the calculations, let us first restrict to the case of mean zero and variance one, thus

$\mathbf{E} X = 0$

and

$\mathbf{Var}(X) = \mathbf{E} X^2 = 1.$

Then, as computed in previous notes, the normalised fluctuation $Z_n := \frac{S_n}{\sqrt{n}}$ also has mean zero and variance one:

$\mathbf{E} Z_n = 0, \quad \mathbf{Var}(Z_n) = \mathbf{E} Z_n^2 = 1.$

This and Chebyshev's inequality already indicate that the "typical" size of $Z_n$ is $O(1)$; thus for instance $Z_n/\omega(n)$ goes to zero in probability for any $\omega(n)$ that goes to infinity as $n \to \infty$. If we also have a finite fourth moment $\mathbf{E} |X|^4 < \infty$, then the calculations of the previous notes also give a fourth moment estimate

$\mathbf{E} Z_n^4 = 3 + O\Big( \frac{\mathbf{E} |X|^4}{n} \Big).$

From this and the Paley-Zygmund inequality (Exercise 42 of Notes 1) we also get some lower bound for $Z_n$ of the form

$\mathbf{P}( |Z_n| \geq \varepsilon ) \geq \varepsilon$

for some absolute constant $\varepsilon > 0$ and for $n$ sufficiently large; this indicates in particular that $\omega(n) Z_n$ does not converge in any reasonable sense to something finite for any $\omega(n)$ that goes to infinity.
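
As a quick numerical sanity check of this square root cancellation, the following sketch (illustrative only, using numpy; the uniform distribution on $[-\sqrt{3},\sqrt{3}]$ is just one convenient choice of a mean zero, variance one variable) estimates the mean, variance, and fourth moment of $Z_n$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# X uniform on [-sqrt(3), sqrt(3)]: mean zero, variance one, E|X|^4 = 9/5.
def sample_Z(n, trials=20000):
    X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
    return X.sum(axis=1) / np.sqrt(n)   # Z_n = S_n / sqrt(n)

for n in [10, 100, 1000]:
    Z = sample_Z(n)
    # mean ~ 0, variance ~ 1, fourth moment ~ 3 + O(1/n), all stable in n
    print(n, Z.mean(), Z.var(), (Z**4).mean())
```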

The question remains as to what happens to the ratio $Z_n = S_n/\sqrt{n}$ itself, without multiplying or dividing by any factor $\omega(n)$. A first guess would be that these ratios converge in probability or almost surely, but this is unfortunately not the case:

Proposition 1. Let $X_1, X_2, \dots$ be iid copies of an absolutely integrable real scalar random variable $X$ with mean zero, variance one, and finite fourth moment, and write $Z_n := S_n/\sqrt{n}$. Then the random variables $Z_n$ do not converge in probability or almost surely to any limit, and neither does any subsequence of these random variables.

*Proof:* Suppose for contradiction that some subsequence $Z_{n_j}$ converged in probability or almost surely to a limit $Z$. By passing to a further subsequence we may assume that the convergence is in the almost sure sense. Since all of the $Z_{n_j}$ have mean zero, variance one, and bounded fourth moment, Theorem 24 of Notes 1 implies that the limit $Z$ also has mean zero and variance one. On the other hand, $Z$ is a tail random variable and is thus almost surely constant by the Kolmogorov zero-one law from Notes 3. Since constants have variance zero, we obtain the required contradiction. $\Box$

Nevertheless there is an important limit for the ratio $S_n/\sqrt{n}$, which requires one to replace the notions of convergence in probability or almost sure convergence by the weaker concept of *convergence in distribution*.

Definition 2 (Vague convergence and convergence in distribution). Let $R$ be a locally compact Hausdorff topological space with the Borel $\sigma$-algebra. A sequence of finite measures $\mu_n$ on $R$ is said to converge vaguely to another finite measure $\mu$ if one has

$\int_R G(x)\ d\mu_n(x) \to \int_R G(x)\ d\mu(x)$

as $n \to \infty$ for all continuous compactly supported functions $G: R \to \mathbf{R}$. (Vague convergence is also known as weak convergence, although strictly speaking the terminology weak-* convergence would be more accurate.) A sequence of random variables $X_n$ taking values in $R$ is said to converge in distribution (or converge weakly or converge in law) to another random variable $X$ if the distributions $\mu_{X_n}$ converge vaguely to the distribution $\mu_X$, or equivalently if

$\mathbf{E} G(X_n) \to \mathbf{E} G(X)$

as $n \to \infty$ for all continuous compactly supported functions $G: R \to \mathbf{R}$.

One could in principle try to extend this definition beyond the locally compact Hausdorff setting, but certain pathologies can occur when doing so (e.g. failure of the Riesz representation theorem), and we will never need to consider vague convergence in spaces that are not locally compact Hausdorff, so we restrict to this setting for simplicity.

Note that the notion of convergence in distribution depends only on the distribution of the random variables involved. One consequence of this is that convergence in distribution does not produce unique limits: if $X_n$ converges in distribution to $X$, and $Y$ has the same distribution as $X$, then $X_n$ also converges in distribution to $Y$. However, limits are unique up to equivalence in distribution (this is a consequence of the Riesz representation theorem, discussed for instance in this blog post). As a consequence of the insensitivity of convergence in distribution to equivalence in distribution, we may also legitimately talk about convergence in distribution of a sequence $X_n$ of random variables to another random variable $X$ even when all the random variables $X_n$ and $X$ involved are being modeled by different probability spaces (e.g. each $X_n$ is modeled by $\Omega_n$, and $X$ is modeled by $\Omega$, with no coupling presumed between these spaces). This is in contrast to the stronger notions of convergence in probability or almost sure convergence, which require all the random variables to be modeled by a common probability space. Also, by an abuse of notation, we can say that a sequence $X_n$ of random variables converges in distribution to a probability measure $\mu$, when $\mu_{X_n}$ converges vaguely to $\mu$. Thus we can talk about a sequence of random variables converging in distribution to a uniform distribution, a gaussian distribution, etc.

From the dominated convergence theorem (available for both convergence in probability and almost sure convergence) we see that convergence in probability or almost sure convergence implies convergence in distribution. The converse is not true, due to the insensitivity of convergence in distribution to equivalence in distribution; for instance, if $X_1, X_2, \dots$ are iid copies of a non-deterministic scalar random variable $X$, then the $X_n$ trivially converge in distribution to $X$, but will not converge in probability or almost surely (as one can see from the zero-one law). However, there are some partial converses that relate convergence in distribution to convergence in probability; see Exercise 10 below.

Remark 3. The notion of convergence in distribution is somewhat similar to the notion of convergence in the sense of distributions that arises in distribution theory (discussed for instance in this previous blog post); however, strictly speaking, the two notions of convergence are distinct and should not be confused with each other, despite the very similar names.

The notion of convergence in distribution simplifies in the case of real scalar random variables:

Proposition 4. Let $X_n$ be a sequence of scalar random variables, and let $X$ be another scalar random variable. Then the following are equivalent:

- (i) $X_n$ converges in distribution to $X$.
- (ii) $F_{X_n}(t)$ converges to $F_X(t)$ for each continuity point $t$ of $F_X$ (i.e. for all real numbers $t$ at which $F_X$ is continuous). Here $F_X(t) := \mathbf{P}( X \leq t )$ is the cumulative distribution function of $X$.

*Proof:* First suppose that $X_n$ converges in distribution to $X$, and $F_X$ is continuous at $t$. For any $\varepsilon > 0$, one can find a $\delta > 0$ such that

$F_X(t') \in [ F_X(t) - \varepsilon, F_X(t) + \varepsilon ]$

for every $t' \in [t - \delta, t + \delta]$. One can also find an $N$ larger than $|t| + \delta$ such that $F_X(-N) \leq \varepsilon$ and $F_X(N) \geq 1 - \varepsilon$. Thus

$\mathbf{P}( |X| \geq N ) = O( \varepsilon )$

and

$\mathbf{P}( X \in [t - \delta, t + \delta] ) = O( \varepsilon ).$

Let $\varphi: \mathbf{R} \to [0,1]$ be a continuous function supported on $[-N, t]$ that equals $1$ on $[-N + \delta, t - \delta]$. Then by the above discussion we have

$\mathbf{E} \varphi( X ) = F_X(t) + O( \varepsilon )$

and hence

$\mathbf{E} \varphi( X_n ) = F_X(t) + O( \varepsilon )$

for $n$ large enough. In particular

$\mathbf{P}( X_n \leq t ) \geq F_X(t) - O( \varepsilon ).$

A similar argument, replacing $\varphi$ with a continuous function supported on $[-N, t + \delta]$ that equals $1$ on $[-N + \delta, t]$, gives

$\mathbf{P}( X_n \leq t ) \leq F_X(t) + O( \varepsilon )$

for $n$ large enough. Putting the two estimates together gives

$F_{X_n}(t) = F_X(t) + O( \varepsilon )$

for $n$ large enough; sending $\varepsilon \to 0$, we obtain the claim.

Conversely, suppose that $F_{X_n}(t)$ converges to $F_X(t)$ at every continuity point $t$ of $F_X$. Let $G: \mathbf{R} \to \mathbf{R}$ be a continuous compactly supported function; then it is uniformly continuous. As $F_X$ is monotone increasing, it can only have countably many points of discontinuity. From these two facts one can find, for any $\varepsilon > 0$, a simple function $g(t) = \sum_{i=1}^m c_i 1_{(t_i, t_{i+1}]}(t)$ for some $t_1 < \dots < t_{m+1}$ that are points of continuity of $F_X$, and real numbers $c_i$, such that $|G(t) - g(t)| \leq \varepsilon$ for all $t$. Thus

$\mathbf{E} G( X_n ) = \mathbf{E} g( X_n ) + O( \varepsilon ) = \sum_{i=1}^m c_i ( F_{X_n}(t_{i+1}) - F_{X_n}(t_i) ) + O( \varepsilon ).$

Similarly for $X_n$ replaced by $X$. Subtracting and taking limit superior, we conclude that

$\limsup_{n \to \infty} | \mathbf{E} G( X_n ) - \mathbf{E} G( X ) | = O( \varepsilon ),$

and on sending $\varepsilon \to 0$, we obtain that $X_n$ converges in distribution to $X$ as claimed. $\Box$

The restriction to continuity points of $F_X$ is necessary. Consider for instance the deterministic random variables $X_n := 1/n$; then $X_n$ converges almost surely (and hence in distribution) to $0$, but $F_{X_n}(0) = 0$ does not converge to $F_0(0) = 1$.

Example 5. For any natural number $n$, let $X_n$ be a discrete random variable drawn uniformly from the finite set $\{ \frac{1}{n}, \frac{2}{n}, \dots, \frac{n}{n} \}$, and let $X$ be the continuous random variable drawn uniformly from $[0,1]$. Then $X_n$ converges in distribution to $X$. Thus we see that a continuous random variable can emerge as the limit of discrete random variables.

Example 6. For any natural number $n$, let $X_n$ be a continuous random variable drawn uniformly from $[0, 1/n]$; then $X_n$ converges in distribution to the deterministic real number $0$. Thus we see that discrete (or even deterministic) random variables can emerge as the limit of continuous random variables.

Exercise 7 (Portmanteau theorem). Show that the properties (i) and (ii) in Proposition 4 are also equivalent to the following three statements:

- (iii) One has $\limsup_{n \to \infty} \mathbf{P}( X_n \in K ) \leq \mathbf{P}( X \in K )$ for all closed sets $K \subset \mathbf{R}$.
- (iv) One has $\liminf_{n \to \infty} \mathbf{P}( X_n \in U ) \geq \mathbf{P}( X \in U )$ for all open sets $U \subset \mathbf{R}$.
- (v) For any Borel set $\Omega \subset \mathbf{R}$ whose topological boundary $\partial \Omega$ is such that $\mathbf{P}( X \in \partial \Omega ) = 0$, one has $\lim_{n \to \infty} \mathbf{P}( X_n \in \Omega ) = \mathbf{P}( X \in \Omega )$.

(Note: to prove this theorem, you may wish to invoke Urysohn's lemma. To deduce (iii) from (i), you may wish to start with the case of compact $K$.)

We can now state the famous central limit theorem:

Theorem 8 (Central limit theorem). Let $X_1, X_2, \dots$ be iid copies of a scalar random variable $X$ of finite mean $\mu$ and finite non-zero variance $\sigma^2$. Let $S_n := X_1 + \dots + X_n$. Then the random variables $\frac{S_n - n\mu}{\sqrt{n} \sigma}$ converge in distribution to a random variable with the standard normal distribution $N(0,1)$ (that is to say, a random variable with probability density function $x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$). Thus, by abuse of notation,

$\frac{S_n - n\mu}{\sqrt{n} \sigma} \to N(0,1).$

In the normalised case $\mu = 0$, $\sigma^2 = 1$ when $X$ has mean zero and unit variance, this simplifies to

$\frac{S_n}{\sqrt{n}} \to N(0,1).$

Using Proposition 4 (and the fact that the cumulative distribution function associated to $N(0,1)$ is continuous), the central limit theorem is equivalent to asserting that

$\mathbf{P}\Big( \frac{S_n - n\mu}{\sqrt{n} \sigma} \leq t \Big) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx$

as $n \to \infty$ for any $t \in \mathbf{R}$, or equivalently that

$\mathbf{P}\Big( a \leq \frac{S_n - n\mu}{\sqrt{n} \sigma} \leq b \Big) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\ dx$

for any $a < b$.

Informally, one can think of the central limit theorem as asserting that $S_n$ approximately behaves like it has distribution $N( n\mu, n\sigma^2 )$ for large $n$, where $N(\mu, \sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$, that is to say the distribution with probability density function $x \mapsto \frac{1}{\sqrt{2\pi} \sigma} e^{-(x-\mu)^2/2\sigma^2}$. The integrals $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx$ can be written in terms of the error function $\mathrm{erf}$ as $\frac{1}{2} + \frac{1}{2} \mathrm{erf}( t / \sqrt{2} )$.
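
The following quick Monte Carlo sketch (an illustration only, using numpy; the exponential distribution recentred to mean zero is an arbitrary choice of underlying $X$ with unit variance) compares the empirical distribution of the normalised sums against the gaussian cumulative distribution function written via the error function as above:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, trials = 200, 50000

# Exponential(1), recentred to mean zero; it already has unit variance.
X = rng.exponential(1.0, size=(trials, n)) - 1.0
Z = X.sum(axis=1) / np.sqrt(n)          # (S_n - n*mu) / (sqrt(n)*sigma)

Phi = lambda t: 0.5 + 0.5 * erf(t / sqrt(2))   # gaussian CDF via erf, as above
for t in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(t, (Z <= t).mean(), Phi(t))          # the two columns should agree
```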

The central limit theorem is a basic example of the *universality phenomenon* in probability – many statistics involving a large system of many independent (or weakly dependent) variables (such as the normalised sums $\frac{S_n - n\mu}{\sqrt{n}\sigma}$) end up having a universal asymptotic limit (in this case, the normal distribution), regardless of the precise makeup of the underlying random variable $X$ that comprised that system. Indeed, the universality of the normal distribution is such that it arises in many other contexts than the fluctuation of iid random variables; the central limit theorem is merely the first place in probability theory where it makes a prominent appearance.

We will give several proofs of the central limit theorem in these notes; each of these proofs has its advantages and disadvantages, and each can extend to prove many further results beyond the central limit theorem. We first give Lindeberg's proof of the central limit theorem, based on exchanging (or swapping) each component $X_i$ of the sum $S_n$ in turn. This proof gives an accessible explanation as to why there should be a universal limit for the central limit theorem; one then computes directly with gaussians to verify that it is the normal distribution which is the universal limit. Our second proof is the most popular one taught in probability texts, namely the Fourier-analytic proof based around the concept of the characteristic function $t \mapsto \mathbf{E} e^{itX}$ of a real random variable $X$. Thanks to the powerful identities and other results of Fourier analysis, this gives a quite short and direct proof of the central limit theorem, although the arguments may seem rather magical to readers who are not already familiar with Fourier methods. Finally, we give a proof based on the moment method, in the spirit of the arguments in the previous notes; this argument is more combinatorial, but is straightforward and is particularly robust, in particular being well equipped to handle some dependencies between components; we will illustrate this by proving the Erdös-Kac law in number theory by this method. Some further discussion of the central limit theorem (including some further proofs, such as one based on Stein's method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and large deviation inequalities, will be discussed in the next (and final) set of notes.

The following exercise illustrates the power of the central limit theorem, by establishing combinatorial estimates which would otherwise require the use of Stirling’s formula to establish.

Exercise 9 (De Moivre-Laplace theorem). Let $X$ be a Bernoulli random variable, taking values in $\{0, 1\}$ with $\mathbf{P}(X = 0) = \mathbf{P}(X = 1) = 1/2$, thus $X$ has mean $1/2$ and variance $1/4$. Let $X_1, X_2, \dots$ be iid copies of $X$, and write $S_n := X_1 + \dots + X_n$.

- (i) Show that $S_n$ takes values in $\{0, \dots, n\}$ with $\mathbf{P}( S_n = i ) = \frac{1}{2^n} \binom{n}{i}$. (This is an example of a binomial distribution.)
- (ii) Assume Stirling's formula

$n! = ( 1 + o(1) ) \sqrt{2\pi n}\, n^n e^{-n} \quad (1)$

where $o(1)$ is a function of $n$ that goes to zero as $n \to \infty$. (A proof of this formula may be found in this previous blog post.) Using this formula, and without using the central limit theorem, show that

$\mathbf{P}\Big( a \leq \frac{S_n - n/2}{\sqrt{n}/2} \leq b \Big) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\ dx$

as $n \to \infty$ for any fixed real numbers $a < b$.

The above special case of the central limit theorem was first established by de Moivre and Laplace.
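
One can check this statement numerically without any sampling, since the distribution of $S_n$ is an explicit binomial; the following sketch (using scipy, purely as an illustration) compares the binomial probability with the gaussian integral:

```python
import numpy as np
from scipy import stats

a, b = -1.0, 1.0
for n in [10, 100, 1000, 10000]:
    # P(a <= (S_n - n/2)/(sqrt(n)/2) <= b) with S_n ~ Binomial(n, 1/2)
    lo = np.ceil(n / 2 + a * np.sqrt(n) / 2)
    hi = np.floor(n / 2 + b * np.sqrt(n) / 2)
    p = stats.binom.cdf(hi, n, 0.5) - stats.binom.cdf(lo - 1, n, 0.5)
    print(n, p, stats.norm.cdf(b) - stats.norm.cdf(a))
```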

We close this section with some basic facts about convergence of distribution that will be useful in the sequel.

Exercise 10. Let $X_n$, $Y_n$ be sequences of real random variables, and let $X$, $Y$ be further real random variables.

- (i) If $X$ is deterministic, show that $X_n$ converges in distribution to $X$ if and only if $X_n$ converges in probability to $X$.
- (ii) Suppose that $X_n$ is independent of $Y_n$ for each $n$, and $X$ is independent of $Y$. Show that $(X_n, Y_n)$ converges in distribution to $(X, Y)$ if and only if $X_n$ converges in distribution to $X$ and $Y_n$ converges in distribution to $Y$. (The shortest way to prove this is by invoking the Stone-Weierstrass theorem, but one can also proceed by proving some version of Proposition 4.) What happens if the independence hypothesis is dropped?
- (iii) If $X_n$ converges in distribution to $X$, show that for every $\varepsilon > 0$ there exists $K > 0$ such that $\mathbf{P}( |X_n| \geq K ) \leq \varepsilon$ for all sufficiently large $n$. (That is to say, $X_n$ is a tight sequence of random variables.)
- (iv) Show that $X_n$ converges in distribution to $X$ if and only if, after extending the probability space model if necessary, one can find copies $X'_n$ and $X'$ of $X_n$ and $X$ respectively such that $X'_n$ converges almost surely to $X'$. (Hint: use the Skorohod representation, Exercise 29 of Notes 0.)
- (v) If $X_n$ converges in distribution to $X$, and $f: \mathbf{R} \to \mathbf{R}$ is continuous, show that $f(X_n)$ converges in distribution to $f(X)$. Generalise this claim to the case when $X$ takes values in an arbitrary locally compact Hausdorff space.
- (vi) (Slutsky's theorem) If $X_n$ converges in distribution to $X$, and $Y_n$ converges in probability to a *deterministic* limit $c$, show that $X_n + Y_n$ converges in distribution to $X + c$, and $X_n Y_n$ converges in distribution to $c X$. (Hint: either use (iv), or else use (iii) to control some error terms.) This statement combines particularly well with (i). What happens if $c$ is not assumed to be deterministic?
- (vii) (Fatou lemma) If $G: \mathbf{R} \to [0, +\infty)$ is continuous, and $X_n$ converges in distribution to $X$, show that $\liminf_{n \to \infty} \mathbf{E} G( X_n ) \geq \mathbf{E} G( X )$.
- (viii) (Bounded convergence) If $G: \mathbf{R} \to \mathbf{R}$ is continuous and bounded, and $X_n$ converges in distribution to $X$, show that $\lim_{n \to \infty} \mathbf{E} G( X_n ) = \mathbf{E} G( X )$.
- (ix) (Dominated convergence) If $X_n$ converges in distribution to $X$, and there is an absolutely integrable $Y$ such that $|X_n| \leq Y$ almost surely for all $n$, show that $\lim_{n \to \infty} \mathbf{E} X_n = \mathbf{E} X$.

For future reference we also mention (but will not prove) Prokhorov’s theorem that gives a partial converse to part (iii) of the above exercise:

Theorem 11 (Prokhorov's theorem). Let $X_n$ be a sequence of real random variables which is tight (that is, for every $\varepsilon > 0$ there exists $K > 0$ such that $\mathbf{P}( |X_n| \geq K ) \leq \varepsilon$ for all sufficiently large $n$). Then there exists a subsequence $X_{n_j}$ which converges in distribution to some random variable $X$ (which may possibly be modeled by a different probability space model than the $X_n$).

The proof of this theorem relies on the Riesz representation theorem, and is beyond the scope of this course; but see for instance Exercise 29 of this previous blog post. (See also the closely related Helly selection theorem, covered in Exercise 30 of the same post.)

** — 1. The Lindeberg approach to the central limit theorem — **

We now give the Lindeberg argument establishing the central limit theorem. The proof splits into two separate components. The first component is to establish the central limit theorem for a *single* choice of underlying random variable $X$. The second component is to show that the limiting distribution of $\frac{S_n - n\mu}{\sqrt{n}\sigma}$ is *universal* in the sense that it does not depend on the choice of underlying random variable $X$. Putting the two components together gives Theorem 8.

We begin with the first component of the argument. One could use the Bernoulli distribution from Exercise 9 as the choice of underlying random variable, but a simpler choice of distribution (in the sense that no appeal to Stirling’s formula is required) is the normal distribution itself. The key computation is:

Lemma 12 (Sum of independent Gaussians). Let $X$, $Y$ be independent real random variables with normal distributions $N( \mu_1, \sigma_1^2 )$, $N( \mu_2, \sigma_2^2 )$ respectively, for some $\mu_1, \mu_2 \in \mathbf{R}$ and $\sigma_1, \sigma_2 > 0$. Then $X + Y$ has the normal distribution $N( \mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2 )$.

This is of course consistent with the additivity of mean and variance for independent random variables, given that random variables with the distribution $N( \mu, \sigma^2 )$ have mean $\mu$ and variance $\sigma^2$.

*Proof:* By subtracting $\mu_1$ and $\mu_2$ from $X$ and $Y$ respectively, we may normalise $\mu_1 = \mu_2 = 0$; by dividing through by $\sqrt{\sigma_1^2 + \sigma_2^2}$ we may also normalise $\sigma_1^2 + \sigma_2^2 = 1$. Thus

$\mathbf{P}( X \in A ) = \frac{1}{\sqrt{2\pi} \sigma_1} \int_A e^{-x^2 / 2\sigma_1^2}\ dx$

and

$\mathbf{P}( Y \in B ) = \frac{1}{\sqrt{2\pi} \sigma_2} \int_B e^{-y^2 / 2\sigma_2^2}\ dy$

for any Borel sets $A, B \subset \mathbf{R}$. As $X, Y$ are independent, this implies that

$\mathbf{P}( (X, Y) \in C ) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_C e^{-x^2/2\sigma_1^2} e^{-y^2/2\sigma_2^2}\ dx\, dy$

for any Borel set $C \subset \mathbf{R}^2$ (this follows from the uniqueness of product measure, or equivalently one can use the monotone class lemma starting from the case when $C$ is a finite boolean combination of product sets $A \times B$). In particular, we have

$\mathbf{P}( X + Y \leq t ) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_{\{ (x,y): x + y \leq t \}} e^{-x^2/2\sigma_1^2} e^{-y^2/2\sigma_2^2}\ dx\, dy$

for any $t \in \mathbf{R}$. Making the change of variables $s := x + y$ (and using the Fubini-Tonelli theorem as necessary) we can write the right-hand side as

$\frac{1}{2\pi \sigma_1 \sigma_2} \int_{-\infty}^t \int_{\mathbf{R}} e^{-x^2/2\sigma_1^2} e^{-(s - x)^2/2\sigma_2^2}\ dx\ ds. \quad (2)$

We can complete the square using $\sigma_1^2 + \sigma_2^2 = 1$ to write (after some routine algebra)

$\frac{x^2}{2\sigma_1^2} + \frac{(s - x)^2}{2\sigma_2^2} = \frac{s^2}{2} + \frac{(x - \sigma_1^2 s)^2}{2 \sigma_1^2 \sigma_2^2},$

so on using the identity $\frac{1}{\sqrt{2\pi} \sigma} \int_{\mathbf{R}} e^{-(x - \mu)^2/2\sigma^2}\ dx = 1$ for any $\mu \in \mathbf{R}$ and $\sigma > 0$, we can write (2) as

$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-s^2/2}\ ds,$

and so $X + Y$ has the cumulative distribution function of $N(0,1)$, giving the claim. $\Box$

In the next section we give an alternate proof of the above lemma using the machinery of characteristic functions. A more geometric argument can be given as follows. With the same normalisations as in the above proof, we can write $( \sigma_1, \sigma_2 ) = ( \cos \theta, \sin \theta )$ for some angle $\theta$. Then we can write $X = (\cos \theta) Z_1$ and $Y = (\sin \theta) Z_2$ where $Z_1, Z_2$ are iid copies of $N(0,1)$. But the joint probability density function $\frac{1}{2\pi} e^{-(z_1^2 + z_2^2)/2}$ of $(Z_1, Z_2)$ is rotation invariant, so $(\cos \theta) Z_1 + (\sin \theta) Z_2$ has the same distribution as $Z_1$, and the claim follows.

From the above lemma we see that if $X_1, X_2, \dots$ are iid copies of a normal distribution $N( \mu, \sigma^2 )$ of mean $\mu$ and variance $\sigma^2 > 0$, then $S_n$ has distribution $N( n\mu, n\sigma^2 )$, and hence $\frac{S_n - n\mu}{\sqrt{n} \sigma}$ has distribution exactly equal to $N(0,1)$. Thus the central limit theorem is clearly true in this case.

Exercise 13 (Probabilistic interpretation of convolution). Let $f, g: \mathbf{R} \to [0, +\infty)$ be measurable functions with $\int_{\mathbf{R}} f(x)\ dx = \int_{\mathbf{R}} g(x)\ dx = 1$. Define the convolution $f * g$ of $f$ and $g$ to be

$f * g(x) := \int_{\mathbf{R}} f(y) g(x - y)\ dy.$

Show that if $X, Y$ are independent real random variables with probability density functions $f, g$ respectively, then $X + Y$ has probability density function $f * g$.
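
As a numerical illustration of this exercise (a sketch only; the choice of a uniform density for $f$ and a gaussian density for $g$ is arbitrary), one can compare a grid-based computation of $f*g$ against a Monte Carlo histogram of $X + Y$:

```python
import numpy as np

# Grid-based convolution of two densities f, g on [-L, L].
L, m = 10.0, 4001
x = np.linspace(-L, L, m)
dx = x[1] - x[0]

f = np.where(np.abs(x) <= 0.5, 1.0, 0.0)          # uniform on [-1/2, 1/2]
g = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # standard normal density

h = np.convolve(f, g, mode="same") * dx           # f*g sampled on the same grid

# Monte Carlo check: histogram of X + Y for X ~ f, Y ~ g.
rng = np.random.default_rng(2)
s = rng.uniform(-0.5, 0.5, 100000) + rng.normal(size=100000)
hist, edges = np.histogram(s, bins=80, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# small, limited only by Monte Carlo and grid error:
print(np.max(np.abs(hist - np.interp(centers, x, h))))
```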

Now we turn to the general case of the central limit theorem. By subtracting $\mu$ from $X$ (and from each of the $X_i$) we may normalise $\mu = 0$; by dividing $X$ (and each of the $X_i$) by $\sigma$ we may also normalise $\sigma = 1$. Thus $\frac{S_n - n\mu}{\sqrt{n}\sigma} = \frac{S_n}{\sqrt{n}}$, and our task is to show that

$\mathbf{E} \varphi\Big( \frac{S_n}{\sqrt{n}} \Big) \to \mathbf{E} \varphi( G ) \quad (3)$

as $n \to \infty$, for any continuous compactly supported function $\varphi: \mathbf{R} \to \mathbf{R}$, where $G$ is a random variable distributed according to $N(0,1)$ (possibly modeled by a different probability space than the original $X_i$). Since any continuous compactly supported function can be approximated uniformly by smooth compactly supported functions (as can be seen from the Weierstrass or Stone-Weierstrass theorems), it suffices to show this for smooth compactly supported $\varphi$.

Let $Y_1, Y_2, \dots$ be iid copies of $G$; by extending the probability space used to model the $X_1, X_2, \dots$ (using Proposition 26 from Notes 2), we can model the $X_i$ and $Y_j$ by a common model, in such a way that the combined collection of random variables $X_1, X_2, \dots, Y_1, Y_2, \dots$ are jointly independent. As we have already proved the central limit theorem in the normally distributed case, we already have

$\mathbf{E} \varphi\Big( \frac{Y_1 + \dots + Y_n}{\sqrt{n}} \Big) \to \mathbf{E} \varphi( G )$

as $n \to \infty$ (indeed we even have equality here). So it suffices to show that

$\mathbf{E} \varphi\Big( \frac{X_1 + \dots + X_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( \frac{Y_1 + \dots + Y_n}{\sqrt{n}} \Big) \to 0. \quad (4)$

We first establish this claim under the additional simplifying assumption of a finite third moment: $\mathbf{E} |X|^3 < \infty$. Rather than swap all of the $X_i$ with all of the $Y_i$ at once, let us just swap the final $X_n$ with a $Y_n$, that is to say let us consider the expression

$\mathbf{E} \varphi\Big( \frac{X_1 + \dots + X_{n-1} + X_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( \frac{X_1 + \dots + X_{n-1} + Y_n}{\sqrt{n}} \Big).$

Writing $W := \frac{X_1 + \dots + X_{n-1}}{\sqrt{n}}$, we can write this as

$\mathbf{E} \varphi\Big( W + \frac{X_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( W + \frac{Y_n}{\sqrt{n}} \Big).$

To compute this expression we use Taylor expansion. As $\varphi$ is smooth and compactly supported, the first three derivatives of $\varphi$ are bounded, leading to the Taylor approximation

$\varphi\Big( W + \frac{X_n}{\sqrt{n}} \Big) = \varphi( W ) + \frac{X_n}{\sqrt{n}} \varphi'( W ) + \frac{X_n^2}{2n} \varphi''( W ) + O\Big( \frac{|X_n|^3}{n^{3/2}} \Big)$

where the implied constant depends on $\varphi$. Taking expectations, we conclude that

$\mathbf{E} \varphi\Big( W + \frac{X_n}{\sqrt{n}} \Big) = \mathbf{E} \varphi( W ) + \frac{1}{\sqrt{n}} \mathbf{E} X_n \varphi'( W ) + \frac{1}{2n} \mathbf{E} X_n^2 \varphi''( W ) + O\Big( \frac{\mathbf{E} |X|^3}{n^{3/2}} \Big).$

Now for a key point: as the random variable $W$ only depends on $X_1, \dots, X_{n-1}$, it is independent of $X_n$, and so we can decouple the expectations to obtain

$\mathbf{E} \varphi\Big( W + \frac{X_n}{\sqrt{n}} \Big) = \mathbf{E} \varphi( W ) + \frac{1}{\sqrt{n}} ( \mathbf{E} X_n ) ( \mathbf{E} \varphi'( W ) ) + \frac{1}{2n} ( \mathbf{E} X_n^2 ) ( \mathbf{E} \varphi''( W ) ) + O\Big( \frac{\mathbf{E} |X|^3}{n^{3/2}} \Big).$

The same considerations apply after swapping $X_n$ with $Y_n$ (which also has a bounded third moment):

$\mathbf{E} \varphi\Big( W + \frac{Y_n}{\sqrt{n}} \Big) = \mathbf{E} \varphi( W ) + \frac{1}{\sqrt{n}} ( \mathbf{E} Y_n ) ( \mathbf{E} \varphi'( W ) ) + \frac{1}{2n} ( \mathbf{E} Y_n^2 ) ( \mathbf{E} \varphi''( W ) ) + O\Big( \frac{\mathbf{E} |G|^3}{n^{3/2}} \Big).$

But by hypothesis, $X_n$ and $Y_n$ have matching moments to second order: $\mathbf{E} X_n = \mathbf{E} Y_n = 0$ and $\mathbf{E} X_n^2 = \mathbf{E} Y_n^2 = 1$. Thus on subtraction we have

$\mathbf{E} \varphi\Big( W + \frac{X_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( W + \frac{Y_n}{\sqrt{n}} \Big) = O( n^{-3/2} ),$

where the implied constant depends on $\varphi$ and on $\mathbf{E}|X|^3$. A similar argument (permuting the indices, and replacing some of the $X_i$ with $Y_i$) gives

$\mathbf{E} \varphi\Big( \frac{X_1 + \dots + X_i + Y_{i+1} + \dots + Y_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( \frac{X_1 + \dots + X_{i-1} + Y_i + \dots + Y_n}{\sqrt{n}} \Big) = O( n^{-3/2} )$

for all $1 \leq i \leq n$. Summing the telescoping series, we conclude that

$\mathbf{E} \varphi\Big( \frac{X_1 + \dots + X_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( \frac{Y_1 + \dots + Y_n}{\sqrt{n}} \Big) = O( n^{-1/2} ),$

which gives (4). Note how it was important to Taylor expand to at least third order to obtain a total error bound that went to zero, which explains why it is the first two moments of $X$ (or equivalently, the mean and variance) that play such a decisive role in the central limit theorem.
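
To see this swapping stability numerically, here is a small sketch (illustrative only): with $X$ a Rademacher sign and $\varphi(x) = e^{-x^2}$ as a smooth stand-in for a test function, one can compute $\mathbf{E}\varphi(S_n/\sqrt{n})$ exactly from binomial weights and compare it with the fully gaussian-swapped value $\mathbf{E}\varphi(G) = 1/\sqrt{3}$ (from the identity $\mathbf{E} e^{-sG^2} = (1+2s)^{-1/2}$):

```python
import numpy as np
from scipy import stats

phi = lambda x: np.exp(-x**2)      # smooth test function with bounded derivatives

# E phi(S_n/sqrt(n)) computed exactly for Rademacher X_i (binomial weights),
# versus E phi(G) = 1/sqrt(3) for the fully swapped gaussian sum.
for n in [4, 16, 64, 256, 1024]:
    k = np.arange(n + 1)
    w = stats.binom.pmf(k, n, 0.5)                 # P(S_n = 2k - n)
    lhs = np.dot(w, phi((2 * k - n) / np.sqrt(n)))
    print(n, abs(lhs - 1 / np.sqrt(3)))
```

Since the Rademacher sign happens to match the gaussian third moment as well (both vanish), the gap here decays like $O(1/n)$ rather than the generic $O(n^{-1/2})$, illustrating the point about higher order moment matching made in Remark 16 below.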

Now we remove the hypothesis of finite third moment. As in the previous set of notes, we use the truncation method, taking advantage of the "room" inherent in the $O(n^{-1/2})$ factor in the error term of the above analysis. For technical reasons we have to modify the usual truncation slightly to preserve the mean zero condition. Let $N > 0$ be a truncation parameter. We split $X = X' + X''$, where $X' := X 1_{|X| \leq N} - \mathbf{E}( X 1_{|X| \leq N} )$ and $X'' := X 1_{|X| > N} - \mathbf{E}( X 1_{|X| > N} )$; we split $X_i = X'_i + X''_i$ respectively (we are suppressing the dependence of $X', X''$ on $N$ to simplify the notation). The random variables $X'$ (and $X''$) are chosen to have mean zero, but their variance is not quite one. However, from dominated convergence, the quantity $\sigma'^2 := \mathbf{E} |X'|^2$ converges to $1$, and the variance $\mathbf{E} |X''|^2$ converges to $0$, as $N \to \infty$.

Let $Y'_1, Y'_2, \dots$ be iid copies of the normal distribution $N( 0, \sigma'^2 )$, modeled independently of the $X'_i$. The previous arguments then give

$\mathbf{E} \varphi\Big( \frac{X'_1 + \dots + X'_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( \frac{Y'_1 + \dots + Y'_n}{\sqrt{n}} \Big) = O\Big( \frac{\mathbf{E} |X'|^3}{n^{1/2}} \Big)$

for $n$ large enough. We can bound

$\mathbf{E} |X'|^3 \leq 2N\, \mathbf{E} |X'|^2 = O( N )$

and thus

$\mathbf{E} \varphi\Big( \frac{X'_1 + \dots + X'_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi\Big( \frac{Y'_1 + \dots + Y'_n}{\sqrt{n}} \Big) = O\Big( \frac{N}{n^{1/2}} \Big)$

for $n$ large enough. By Lemma 12, $\frac{Y'_1 + \dots + Y'_n}{\sqrt{n}}$ is distributed as $N( 0, \sigma'^2 )$, and hence

$\mathbf{E} \varphi\Big( \frac{Y'_1 + \dots + Y'_n}{\sqrt{n}} \Big) \to \mathbf{E} \varphi( G )$

as $N \to \infty$, uniformly in $n$ (since $\sigma' \to 1$). We conclude that

$\limsup_{n \to \infty} \Big| \mathbf{E} \varphi\Big( \frac{X'_1 + \dots + X'_n}{\sqrt{n}} \Big) - \mathbf{E} \varphi( G ) \Big| \to 0 \hbox{ as } N \to \infty.$

Next, we consider the error term

$\frac{X''_1 + \dots + X''_n}{\sqrt{n}}.$

The variable $X''$ has mean zero, and by dominated convergence, the variance $\mathbf{E} |X''|^2$ of $X''$ goes to zero as $N \to \infty$. The above error term is also mean zero and has the same variance as $X''$. In particular, from Cauchy-Schwarz we have

$\mathbf{E} \Big| \frac{X''_1 + \dots + X''_n}{\sqrt{n}} \Big| \leq ( \mathbf{E} |X''|^2 )^{1/2}.$

As $\varphi$ is smooth and compactly supported, it is Lipschitz continuous, and hence

$\varphi\Big( \frac{S_n}{\sqrt{n}} \Big) = \varphi\Big( \frac{X'_1 + \dots + X'_n}{\sqrt{n}} \Big) + O_\varphi\Big( \Big| \frac{X''_1 + \dots + X''_n}{\sqrt{n}} \Big| \Big).$

Taking expectations, and then combining all these estimates, we conclude that

$\mathbf{E} \varphi\Big( \frac{S_n}{\sqrt{n}} \Big) = \mathbf{E} \varphi( G ) + o(1) + O_\varphi\big( ( \mathbf{E} |X''|^2 )^{1/2} \big)$

for $n$ sufficiently large, where the $o(1)$ and $O_\varphi( (\mathbf{E}|X''|^2)^{1/2} )$ terms both go to zero as $N \to \infty$; letting $N$ go to infinity we obtain (3) as required. This concludes the proof of Theorem 8.

The above argument can be generalised to a central limit theorem for certain triangular arrays, known as the Lindeberg central limit theorem:

Exercise 14 (Lindeberg central limit theorem). Let $m_n$ be a sequence of natural numbers going to infinity in $n$. For each natural number $n$, let $X_{n,1}, \dots, X_{n,m_n}$ be jointly independent real random variables of mean zero and finite variance. (We do not require the random variables $X_{n,i}$ to be jointly independent in $n$, or even to be modeled by a common probability space.) Let $\sigma_n \geq 0$ be defined by

$\sigma_n^2 := \sum_{i=1}^{m_n} \mathbf{Var}( X_{n,i} )$

and assume that $\sigma_n > 0$ for all $n$.

- (i) If one assumes the Lindeberg condition that

$\frac{1}{\sigma_n^2} \sum_{i=1}^{m_n} \mathbf{E}\big( |X_{n,i}|^2 1_{|X_{n,i}| \geq \varepsilon \sigma_n} \big) \to 0$

as $n \to \infty$ for any $\varepsilon > 0$, then show that the random variables $\frac{X_{n,1} + \dots + X_{n,m_n}}{\sigma_n}$ converge in distribution to a random variable with the normal distribution $N(0,1)$.

- (ii) Show that the Lindeberg condition implies the Feller condition

$\sup_{1 \leq i \leq m_n} \frac{\mathbf{Var}( X_{n,i} )}{\sigma_n^2} \to 0$

as $n \to \infty$.

Note that Theorem 8 (after normalising to the mean zero case $\mu = 0$) corresponds to the special case $m_n := n$ and $X_{n,i} := X_i$ (or, if one wishes, $X_{n,i} := X_i / \sqrt{n}$) of the Lindeberg central limit theorem. It was shown by Feller that if one is in the situation of the above exercise and the Feller condition holds, then the Lindeberg condition is necessary as well as sufficient for $\frac{X_{n,1} + \dots + X_{n,m_n}}{\sigma_n}$ to converge in distribution to a random variable with normal distribution $N(0,1)$; the combined result is sometimes known as the *Lindeberg-Feller theorem*.

Exercise 15 (Weak Berry-Esséen theorem). Let $X_1, \dots, X_n$ be iid copies of a real random variable $X$ of mean zero, unit variance, and finite third moment, and write $S_n := X_1 + \dots + X_n$.

- (i) Show that

$\mathbf{E} \varphi\Big( \frac{S_n}{\sqrt{n}} \Big) = \mathbf{E} \varphi( G ) + O\Big( \frac{\mathbf{E} |X|^3}{n^{1/2}} \sup_x |\varphi'''(x)| \Big)$

whenever $\varphi: \mathbf{R} \to \mathbf{R}$ is three times continuously differentiable and compactly supported, with $G$ distributed as $N(0,1)$ and the implied constant in the $O()$ notation absolute.

- (ii) Show that

$\mathbf{P}\Big( \frac{S_n}{\sqrt{n}} \leq t \Big) = \mathbf{P}( G \leq t ) + O\Big( \Big( \frac{\mathbf{E} |X|^3}{n^{1/2}} \Big)^{1/4} \Big)$

for any $t \in \mathbf{R}$, with the implied constant absolute.

We will strengthen the conclusion of this exercise in Theorem 37 below.

Remark 16. The Lindeberg exchange method explains why the limiting distribution of statistics such as $\frac{X_1 + \dots + X_n}{\sqrt{n}}$ depends primarily on the first two moments of the component random variables $X_i$, if there is a suitable amount of independence between the $X_i$. It turns out that there is an analogous application of the Lindeberg method in random matrix theory, which (very roughly speaking) asserts that appropriate statistics of random matrices with independent components depend primarily on the first four moments of the matrix components, if there is a suitable amount of independence between these components. See for instance this survey article of Van Vu and myself for more discussion of this. The Lindeberg method also suggests that the more moments of $X$ one assumes to match with the Gaussian variable $G$, the faster the rate of convergence (because one can use higher order Taylor expansions).

We now use the Lindeberg central limit theorem to obtain the converse direction of the Kolmogorov three-series theorem (Exercise 29 of Notes 3).

Exercise 17 (Kolmogorov three-series theorem, converse direction). Let $X_1, X_2, \dots$ be a sequence of jointly independent real random variables, with the property that the series $\sum_{n=1}^\infty X_n$ is almost surely convergent (i.e., the partial sums $\sum_{n=1}^N X_n$ are almost surely convergent as $N \to \infty$), and let $A > 0$.

- (i) Show that $\sum_{n=1}^\infty \mathbf{P}( |X_n| > A )$ is finite. (Hint: argue by contradiction and use the second Borel-Cantelli lemma.)
- (ii) Show that $\sum_{n=1}^\infty \mathbf{Var}( X_n 1_{|X_n| \leq A} )$ is finite. (Hint: first use (i) and the Borel-Cantelli lemma to reduce to the case where $|X_n| \leq A$ almost surely. If $\sum_{n=1}^\infty \mathbf{Var}( X_n )$ is infinite, use Exercise 14 to show that $\frac{\sum_{n=1}^N ( X_n - \mathbf{E} X_n )}{( \sum_{n=1}^N \mathbf{Var}( X_n ) )^{1/2}}$ converges in distribution to a standard normal distribution as $N \to \infty$, and use this to contradict the almost sure convergence of $\sum_{n=1}^N X_n$.)
- (iii) Show that the series $\sum_{n=1}^\infty \mathbf{E}( X_n 1_{|X_n| \leq A} )$ is convergent. (Hint: reduce as before to the case where $|X_n| \leq A$ almost surely, and apply the forward direction of the three-series theorem to the random variables $X_n - \mathbf{E} X_n$.)

** — 2. The Fourier-analytic approach to the central limit theorem — **

Let us now give the standard Fourier-analytic proof of the central limit theorem. Given any real random variable $X$, we introduce the characteristic function $F_X: \mathbf{R} \to \mathbf{C}$, defined by the formula

$F_X(t) := \mathbf{E} e^{itX}. \quad (5)$

Equivalently, $F_X$ is the Fourier transform of the probability measure $\mu_X$. One should caution that the term "characteristic function" has several other unrelated meanings in mathematics; particularly confusingly, in real analysis "characteristic function" is used to denote what in probability one would call an "indicator function". Note that no moment hypotheses are required to define the characteristic function, because the random variable $e^{itX}$ is bounded even when $X$ is not absolutely integrable.

Example 18. The signed Bernoulli distribution, which takes the values $+1$ and $-1$ with probabilities of $1/2$ each, has characteristic function $F_X(t) = \cos(t)$.

Most of the standard random variables in probability have characteristic functions that are quite simple and explicit. For the purposes of proving the central limit theorem, the most important such explicit form of the characteristic function is that of the normal distribution:

Exercise 19. Show that the normal distribution $N( \mu, \sigma^2 )$ has characteristic function $F(t) = e^{it\mu} e^{-\sigma^2 t^2 / 2}$.

We record the explicit characteristic functions of some other standard distributions:

Exercise 20. Let $\lambda > 0$, and let $X$ be a Poisson random variable with intensity $\lambda$, thus $X$ takes values in the non-negative integers with $\mathbf{P}( X = k ) = e^{-\lambda} \frac{\lambda^k}{k!}$. Show that $F_X(t) = \exp( \lambda ( e^{it} - 1 ) )$ for all $t \in \mathbf{R}$.

Exercise 21. Let $X$ be uniformly distributed in some interval $[a, b]$. Show that $F_X(t) = \frac{e^{itb} - e^{ita}}{it(b - a)}$ for all non-zero $t$.

Exercise 22. Let $x_0 \in \mathbf{R}$ and $\gamma > 0$, and let $X$ be a Cauchy random variable with parameters $x_0, \gamma$, which means that $X$ is a real random variable with probability density function $f(x) := \frac{1}{\pi} \frac{\gamma}{(x - x_0)^2 + \gamma^2}$. Show that $F_X(t) = e^{itx_0} e^{-\gamma |t|}$ for all $t \in \mathbf{R}$.
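
These closed forms are easy to test empirically; the following sketch (illustrative, with arbitrary parameter choices) compares Monte Carlo estimates of $\mathbf{E} e^{itX}$ against the formulas from Exercises 20, 21, and 22:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.array([0.5, 1.0, 2.0])
m = 200000

def emp_cf(samples):    # empirical characteristic function E e^{itX}
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)

lam = 3.0               # Poisson (Exercise 20)
print(emp_cf(rng.poisson(lam, m)), np.exp(lam * (np.exp(1j * t) - 1)))
a, b = -1.0, 2.0        # uniform on [a, b] (Exercise 21)
print(emp_cf(rng.uniform(a, b, m)), (np.exp(1j*t*b) - np.exp(1j*t*a)) / (1j*t*(b-a)))
x0, gam = 1.0, 0.5      # Cauchy with parameters x0, gamma (Exercise 22)
print(emp_cf(x0 + gam * rng.standard_cauchy(m)), np.exp(1j*t*x0 - gam*np.abs(t)))
```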

The characteristic function $F_X$ is clearly bounded in magnitude by $1$, and equals $1$ at the origin. By the dominated convergence theorem, $F_X$ is continuous in $t$.

Exercise 23 (Riemann-Lebesgue lemma). Show that if $X$ is a real random variable that has an absolutely integrable probability density function $f$, then $F_X(t) \to 0$ as $t \to \pm \infty$. (Hint: first show the claim when $f$ is a finite linear combination of indicator functions of intervals, then for the general case show that $f$ can be approximated in $L^1$ norm by such finite linear combinations.) Note from Example 18 that the claim can fail if $X$ does not have a probability density function. (In Fourier analysis, this fact is known as the Riemann-Lebesgue lemma.)

Exercise 24. Show that the characteristic function $F_X$ of a real random variable $X$ is in fact uniformly continuous on its domain.

Let $X$ be a real random variable. If we Taylor expand $e^{itX}$ and formally interchange the series and expectation, we arrive at the heuristic identity

$F_X(t) = \sum_{k=0}^\infty \frac{(it)^k}{k!} \mathbf{E} X^k \quad (6)$

which thus interprets the characteristic function of a real random variable $X$ as a kind of generating function for the moments. One rigorous version of this identity is as follows.

Exercise 25 (Taylor expansion of characteristic function). Let $X$ be a real random variable with finite $k^{th}$ moment $\mathbf{E} |X|^k < \infty$ for some $k \geq 1$. Show that $F_X$ is $k$ times continuously differentiable, with

$\frac{d^j}{dt^j} F_X(t) = \mathbf{E} (iX)^j e^{itX}$

for all $0 \leq j \leq k$. Conclude in particular the partial Taylor expansion

$F_X(t) = \sum_{j=0}^k \frac{(it)^j}{j!} \mathbf{E} X^j + o( |t|^k )$

where $o( |t|^k )$ denotes a quantity that goes to zero as $t \to 0$, times $|t|^k$.

Exercise 26. Let $X$ be a real random variable, and assume that it is *subgaussian* in the sense that there exist constants $C, c > 0$ such that

$\mathbf{P}( |X| \geq t ) \leq C e^{-ct^2}$

for all $t > 0$. (Thus for instance a bounded random variable is subgaussian, as is any gaussian random variable.) Rigorously establish (6) in this case, and show that the series converges locally uniformly in $t$.

Note that the characteristic function $F_X$ depends only on the distribution of $X$: if $X$ and $Y$ are equal in distribution, then $F_X = F_Y$. The converse statement is true also: if $F_X = F_Y$, then $X$ and $Y$ are equal in distribution. This follows from a more general (and useful) fact, known as Lévy's continuity theorem.

Theorem 27 (Lévy continuity theorem, special case). Let $X_n$ be a sequence of real random variables, and let $X$ be an additional real random variable. Then the following statements are equivalent:

- (i) $F_{X_n}$ converges pointwise to $F_X$.
- (ii) $X_n$ converges in distribution to $X$.

*Proof:* The implication of (i) from (ii) is immediate from (5) and Exercise 10(viii) (applied to the real and imaginary parts of $e^{itX_n}$).

Now suppose that (i) holds, and we wish to show that (ii) holds. We need to show that

$\mathbf{E} \varphi( X_n ) \to \mathbf{E} \varphi( X )$

whenever $\varphi: \mathbf{R} \to \mathbf{R}$ is a continuous, compactly supported function. As in the Lindeberg argument, it suffices to prove this when $\varphi$ is smooth and compactly supported, in which case $\varphi$ is a Schwartz function (infinitely differentiable, with all derivatives rapidly decreasing). But then we have the Fourier inversion formula

$\varphi(x) = \int_{\mathbf{R}} \hat \varphi(t) e^{itx}\ dt$

where

$\hat \varphi(t) := \frac{1}{2\pi} \int_{\mathbf{R}} \varphi(x) e^{-itx}\ dx$

is Schwartz, and is in particular absolutely integrable (see e.g. these lecture notes of mine). From the Fubini-Tonelli theorem, we thus have

$\mathbf{E} \varphi( X_n ) = \int_{\mathbf{R}} \hat \varphi(t) F_{X_n}(t)\ dt \quad (7)$

and similarly for $X$. The claim now follows from the Lebesgue dominated convergence theorem. $\Box$

Remark 28. Setting $X_n := Y$ for all $n$, we see in particular the previous claim that $F_X = F_Y$ if and only if $X$, $Y$ have the same distribution. It is instructive to use the above proof as a guide to prove this claim directly.

There is one subtlety with the Lévy continuity theorem: it is possible for a sequence $F_{X_n}$ of characteristic functions to converge pointwise, but for the limit to not be the characteristic function of any random variable, in which case $X_n$ will not converge in distribution. For instance, if $X_n \equiv N( 0, n )$, then $F_{X_n}(t) = e^{-nt^2/2}$ converges pointwise to $1_{t=0}$ for any $t$, but this is clearly not the characteristic function of any random variable (as characteristic functions are continuous). However, this lack of continuity is the only obstruction:

Exercise 29 (Lévy's continuity theorem, full version). Let $X_n$ be a sequence of real valued random variables. Suppose that $F_{X_n}$ converges pointwise to a limit $F$. Show that the following are equivalent:

- (i) $F$ is continuous at $0$.
- (ii) $X_n$ is a tight sequence (as in Exercise 10(iii)).
- (iii) $F$ is the characteristic function of a real valued random variable $X$ (possibly after extending the sample space).
- (iv) $X_n$ converges in distribution to some real valued random variable $X$ (possibly after extending the sample space).

Hint: To get from (ii) to the other conclusions, use Theorem 11 and Theorem 27. To get back to (ii) from (i), use (7) for a suitable Schwartz function $\varphi$. The other implications are easy once Theorem 27 is in hand.

Remark 30Lévy’s continuity theorem is very similar in spirit to Weyl’s criterion in equidistribution theory.

Exercise 31 (Esséen concentration inequality). Let $X$ be a random variable taking values in $\mathbf{R}$. Then for any $r > 0$ and $t_0 \in \mathbf{R}$, show that

$\mathbf{P}( |X - t_0| \leq r ) \leq C r \int_{|t| \leq 1/r} |F_X(t)|\ dt \quad (8)$

for some absolute constant $C$. (Hint: use (7) for a suitable Schwartz function $\varphi$.) The left-hand side of (8) (as well as higher dimensional analogues of this quantity) is known as the *small ball probability* of $X$ at radius $r$.

In Fourier analysis, we learn that the Fourier transform is a particularly well-suited tool for studying convolutions. The probability theory analogue of this fact is that characteristic functions are a particularly well-suited tool for studying sums of independent random variables. More precisely, we have

Exercise 32 (Fourier identities). Let $X, Y$ be independent real random variables. Then

$F_{X+Y}(t) = F_X(t) F_Y(t) \quad (9)$

for all $t \in \mathbf{R}$. Also, for any scalar $c \in \mathbf{R}$, one has

$F_{cX}(t) = F_X( ct ),$

and more generally, for any affine transformation $x \mapsto ax + b$ with $a, b \in \mathbf{R}$, one has

$F_{aX+b}(t) = e^{itb} F_X( at ).$
Remark 33Note that this identity (9), combined with Exercise 19 and Remark 28, gives a quick alternate proof of Lemma 12.

In particular, we have the simple relationship

$F_{S_n/\sqrt{n}}(t) = F_X\Big( \frac{t}{\sqrt{n}} \Big)^n \quad (10)$

that describes the characteristic function of $S_n/\sqrt{n}$ in terms of that of $X$.

We now have enough machinery to give a quick proof of the central limit theorem:

*Proof:* (Fourier proof of Theorem 8) We may normalise $X$ to have mean zero and variance $1$. By Exercise 25, we thus have

$F_X(t) = 1 - \frac{t^2}{2} + o( |t|^2 )$

for sufficiently small $t$, or equivalently

$F_X(t) = \exp\Big( -\frac{t^2}{2} + o( |t|^2 ) \Big)$

for sufficiently small $t$. Applying (10), we conclude that

$F_{S_n/\sqrt{n}}(t) = \exp\Big( -\frac{t^2}{2} + o(1) \Big)$

as $n \to \infty$ for any fixed $t$. But by Exercise 19, $e^{-t^2/2}$ is the characteristic function of the normal distribution $N(0,1)$. The claim now follows from the Lévy continuity theorem. $\Box$
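
For a concrete instance of this convergence of characteristic functions, recall from Example 18 that the signed Bernoulli variable has $F_X(t) = \cos(t)$; the following one-liner sketch checks that $\cos( t/\sqrt{n} )^n$ indeed approaches $e^{-t^2/2}$:

```python
import numpy as np

# F_{S_n/sqrt(n)}(t) = cos(t/sqrt(n))^n for signed Bernoulli X; compare e^{-t^2/2}.
for t in [0.5, 1.0, 2.0]:
    for n in [10, 100, 10000]:
        print(t, n, np.cos(t / np.sqrt(n))**n, np.exp(-t**2 / 2))
```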

The above machinery extends without difficulty to vector-valued random variables $X = ( X_1, \dots, X_d )$ taking values in $\mathbf{R}^d$. The analogue of the characteristic function is then the function $F_X: \mathbf{R}^d \to \mathbf{C}$ defined by

$F_X(t) := \mathbf{E} e^{i t \cdot X}$

where $\cdot$ denotes the Euclidean inner product on $\mathbf{R}^d$.
We leave the routine extension of the above results and proofs to the higher dimensional case to the reader. Most interesting is what happens to the central limit theorem:

Exercise 34 (Vector-valued central limit theorem). Let $X = ( X_1, \dots, X_d )$ be a random variable taking values in $\mathbf{R}^d$ with finite second moment. Define the covariance matrix $\Sigma(X)$ to be the $d \times d$ matrix whose $ij^{th}$ entry is the covariance $\mathbf{E}( ( X_i - \mathbf{E} X_i )( X_j - \mathbf{E} X_j ) )$.

- Show that the covariance matrix is positive semi-definite real symmetric.
- Conversely, given any positive definite $d \times d$ real symmetric matrix $\Sigma$ and $\mu \in \mathbf{R}^d$, show that the multivariate normal distribution $N( \mu, \Sigma )_{\mathbf{R}^d}$, given by the absolutely continuous measure

$\frac{1}{( (2\pi)^d \det \Sigma )^{1/2}} e^{-\frac{1}{2} (x - \mu) \cdot \Sigma^{-1} (x - \mu)}\ dx,$

has mean $\mu$ and covariance matrix $\Sigma$, and has a characteristic function given by

$F(t) = e^{i \mu \cdot t} e^{-t \cdot \Sigma t / 2}.$

How would one define the normal distribution $N( \mu, \Sigma )_{\mathbf{R}^d}$ if $\Sigma$ degenerated to be merely positive semi-definite instead of positive definite?

- If $S_n := X^{(1)} + \dots + X^{(n)}$ is the sum of $n$ iid copies $X^{(1)}, \dots, X^{(n)}$ of $X$, show that $\frac{S_n - n \mu}{\sqrt{n}}$ converges in distribution to $N( 0, \Sigma(X) )_{\mathbf{R}^d}$, where $\mu := \mathbf{E} X$. (For this exercise, you may assume without proof that the Lévy continuity theorem extends to $\mathbf{R}^d$.)

Exercise 35 (Complex central limit theorem). Let $X$ be a complex random variable of mean $\mu \in \mathbf{C}$, whose real and imaginary parts have variance $1/2$ each and covariance $0$. Let $X_1, X_2, \dots$ be iid copies of $X$. Show that as $n \to \infty$, the normalised sums $\frac{S_n - n\mu}{\sqrt{n}}$ converge in distribution to the standard complex gaussian $N(0,1)_{\mathbf{C}}$, defined as the measure on $\mathbf{C}$ with

$\mathbf{P}( Z \in A ) = \frac{1}{\pi} \int_A e^{-|z|^2}\ dz$

for Borel $A \subset \mathbf{C}$, where $dz$ is Lebesgue measure on $\mathbf{C}$ (identified with $\mathbf{R}^2$ in the usual fashion).

Exercise 36. Use characteristic functions and the truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14).

A more sophisticated version of the Fourier-analytic method gives a more quantitative form of the central limit theorem, namely the Berry-Esséen theorem.

Theorem 37 (Berry-Esséen theorem). Let $X$ have mean zero and unit variance. Let $S_n := X_1 + \dots + X_n$, where $X_1, \dots, X_n$ are iid copies of $X$. Then we have

$\mathbf{P}\Big( \frac{S_n}{\sqrt{n}} \leq t \Big) = \mathbf{P}( G \leq t ) + O\Big( \frac{\mathbf{E} |X|^3}{n^{1/2}} \Big) \quad (11)$

uniformly for all $t \in \mathbf{R}$, where $G$ has the distribution of $N(0,1)$, and the implied constant is absolute.

*Proof:* (Optional) Write $\varepsilon := \mathbf{E} |X|^3 / n^{1/2}$; our task is to show that

$\mathbf{P}\Big( \frac{S_n}{\sqrt{n}} \leq t \Big) = \mathbf{P}( G \leq t ) + O( \varepsilon ) \quad (12)$

for all $t$. We may of course assume that $\varepsilon < 1$, as the claim is trivial otherwise. (In particular, $X$ has finite third moment.)

Let $c > 0$ be a small absolute constant to be chosen later. Let $\eta: \mathbf{R} \to \mathbf{R}$ be a non-negative Schwartz function with total mass $1$ whose Fourier transform $\hat \eta$ is supported in $[-c, c]$; such an $\eta$ can be constructed by taking the inverse Fourier transform of a smooth function supported on $[-c/2, c/2]$ and equal to one at the origin, multiplying that transform by its complex conjugate to make it non-negative, and normalising the total mass.

Let $F: \mathbf{R} \to \mathbf{R}$ be the smoothed out version of $1_{(-\infty, 0]}$, defined as

$F(x) := \int_{\mathbf{R}} 1_{(-\infty, 0]}( x - \varepsilon y )\, \eta(y)\ dy = \int_{x/\varepsilon}^{\infty} \eta(y)\ dy.$

Observe that $F$ is decreasing from $1$ to $0$. As $\eta$ is rapidly decreasing and has total mass one, we also have the bound

$F(x) = 1_{(-\infty, 0]}(x) + O_\eta\Big( \Big( 1 + \frac{|x|}{\varepsilon} \Big)^{-100} \Big) \quad (13)$

(say) for any $x$, where the subscript indicates that the implied constant depends on $\eta$.

We claim that it suffices to show that

$\mathbf{E} F\Big( \frac{S_n}{\sqrt{n}} - t \Big) = \mathbf{E} F( G - t ) + O_\eta( \varepsilon ) \quad (14)$

for every $t$, where the subscript means that the implied constant depends on $\eta$. Indeed, suppose that (14) held. Replacing $t$ by $t + K\varepsilon$ and by $t$ for some large absolute constant $K$, and subtracting, we have

$\mathbf{E} \psi\Big( \frac{S_n}{\sqrt{n}} - t \Big) = \mathbf{E} \psi( G - t ) + O_\eta( \varepsilon )$

for any $t$, where $\psi(x) := F( x - K\varepsilon ) - F( x )$. From (13) we see that the function $\psi$ is bounded by $O_{\eta, K}( (1 + |x|/\varepsilon)^{-100} )$, and hence by the bounded probability density of $G$

$\mathbf{E} \psi( G - t ) = O_{\eta, K}( \varepsilon ).$

Also, $\psi$ is non-negative, and for $K$ large enough, it is bounded from below by a positive constant $c_\eta$ (say) on $[\varepsilon, K\varepsilon - \varepsilon]$. We conclude (after choosing $K$ appropriately) that

$\mathbf{P}\Big( t \leq \frac{S_n}{\sqrt{n}} \leq t + \varepsilon \Big) = O_\eta( \varepsilon ) \quad (15)$

for all real $t$. This implies that

$\mathbf{P}\Big( \frac{S_n}{\sqrt{n}} \in I \Big) = O_\eta( |I| + \varepsilon )$

for any interval $I$, as can be seen by covering the real line by intervals of length $\varepsilon$ and applying (15) to each such interval. From (13) we conclude that

$\mathbf{P}\Big( \frac{S_n}{\sqrt{n}} \leq t \Big) = \mathbf{E} F\Big( \frac{S_n}{\sqrt{n}} - t \Big) + O_\eta( \varepsilon ).$

A similar argument gives

$\mathbf{P}( G \leq t ) = \mathbf{E} F( G - t ) + O_\eta( \varepsilon ),$

and (12) then follows from (14).

It remains to establish (14). We write

$F\Big( \frac{S_n}{\sqrt{n}} - t \Big) = \lim_{T \to +\infty} \int_{\frac{S_n}{\sqrt{n}} - t}^{T} \frac{1}{\varepsilon} \eta\Big( \frac{y}{\varepsilon} \Big)\ dy$

(with the expression in the limit being uniformly bounded) and

$\frac{1}{\varepsilon} \eta\Big( \frac{y}{\varepsilon} \Big) = \frac{1}{2\pi} \int_{\mathbf{R}} \hat \eta( \varepsilon \xi ) e^{i y \xi}\ d\xi$

to conclude (after applying Fubini's theorem) that

$F\Big( \frac{S_n}{\sqrt{n}} - t \Big) = \lim_{T \to +\infty} \frac{1}{2\pi} \int_{|\xi| \leq c/\varepsilon} \hat \eta( \varepsilon \xi ) \frac{ e^{i T \xi} - e^{i ( \frac{S_n}{\sqrt{n}} - t ) \xi} }{ i \xi }\ d\xi,$

using the compact support of $\hat \eta$. By the fundamental theorem of calculus we have

$e^{i T \xi} - e^{i ( \frac{S_n}{\sqrt{n}} - t ) \xi} = O\Big( |\xi| \Big( |T| + \Big| \frac{S_n}{\sqrt{n}} - t \Big| \Big) \Big).$

This factor of $|\xi|$ cancels a similar factor on the denominator to make the integrand absolutely integrable for each fixed $T$; as the expression inside the limit is also uniformly bounded, we may take expectations and apply the dominated convergence theorem to obtain

$\mathbf{E} F\Big( \frac{S_n}{\sqrt{n}} - t \Big) = \lim_{T \to +\infty} \frac{1}{2\pi} \int_{|\xi| \leq c/\varepsilon} \hat \eta( \varepsilon \xi ) \frac{ e^{i T \xi} - e^{-it\xi} F_{S_n/\sqrt{n}}( \xi ) }{ i \xi }\ d\xi,$

and similarly with $S_n/\sqrt{n}$ replaced by $G$. Subtracting (the terms involving $e^{iT\xi}$ cancel), it suffices to show that

$\int_{|\xi| \leq c/\varepsilon} \hat \eta( \varepsilon \xi ) e^{-it\xi} \frac{ F_{S_n/\sqrt{n}}( \xi ) - F_G( \xi ) }{ i \xi }\ d\xi = O( \varepsilon )$

uniformly in $t$. Applying the triangle inequality and the compact support of $\hat \eta$, it suffices to show that

$\int_{|\xi| \leq c/\varepsilon} \frac{ | F_{S_n/\sqrt{n}}( \xi ) - F_G( \xi ) | }{ |\xi| }\ d\xi = O( \varepsilon ). \quad (16)$

From the Taylor expansion $e^{i\theta} = 1 + i\theta - \frac{\theta^2}{2} + O( |\theta|^3 )$, applied with $\theta := \xi X / \sqrt{n}$; taking expectations and using the definition of $\varepsilon$ we have

$F_X\Big( \frac{\xi}{\sqrt{n}} \Big) = 1 - \frac{\xi^2}{2n} + O\Big( \frac{\varepsilon |\xi|^3}{n} \Big)$

and in particular

$F_X\Big( \frac{\xi}{\sqrt{n}} \Big) = \exp\Big( -\frac{\xi^2}{2n} + O\Big( \frac{\varepsilon |\xi|^3}{n} \Big) \Big)$

if $|\xi| \leq c/\varepsilon$ and $c$ is small enough (note that $\varepsilon \geq n^{-1/2}$, since $\mathbf{E} |X|^3 \geq ( \mathbf{E} |X|^2 )^{3/2} = 1$). Applying (10), we conclude that

$F_{S_n/\sqrt{n}}( \xi ) = \exp\Big( -\frac{\xi^2}{2} + O( \varepsilon |\xi|^3 ) \Big)$

if $|\xi| \leq c/\varepsilon$. Meanwhile, from Exercise 19 we have $F_G( \xi ) = e^{-\xi^2/2}$. Elementary calculus then gives us

$| F_{S_n/\sqrt{n}}( \xi ) - F_G( \xi ) | = O\big( \varepsilon |\xi|^3 e^{-\xi^2/4} \big)$

(say) if $c$ is small enough. Inserting this bound into (16) we obtain the claim. $\Box$

Exercise 38. Show that the error term in Theorem 37 is sharp (up to constants) when $X$ is a signed Bernoulli random variable, or more precisely when $X$ takes values in $\{ -1, +1 \}$ with probability $1/2$ for each.
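
For a numerical look at this, the following sketch (using scipy; the exact binomial distribution of $S_n$ avoids sampling error) measures the Kolmogorov distance between $S_n/\sqrt{n}$ and $N(0,1)$ in the signed Bernoulli case, where Theorem 37 predicts a decay rate of $O(1/\sqrt{n})$ and Exercise 38 asserts this is sharp:

```python
import numpy as np
from scipy import stats

# Kolmogorov distance between S_n/sqrt(n) (X = signed Bernoulli) and N(0,1).
for n in [16, 64, 256, 1024]:
    k = np.arange(n + 1)
    pmf = stats.binom.pmf(k, n, 0.5)
    pts = (2 * k - n) / np.sqrt(n)      # atoms of S_n/sqrt(n)
    cdf = np.cumsum(pmf)                # CDF of S_n/sqrt(n) at the atoms
    # the sup over t is attained at the atoms, from the right or the left
    err = max(np.max(np.abs(cdf - stats.norm.cdf(pts))),
              np.max(np.abs(cdf - pmf - stats.norm.cdf(pts))))
    print(n, err, err * np.sqrt(n))     # err*sqrt(n) stays roughly constant
```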

Exercise 39. Let $X_n$ be a sequence of real random variables which converge in distribution to a real random variable $X$, and let $Y_n$ be a sequence of real random variables which converge in distribution to a real random variable $Y$. Suppose that, for each $n$, $X_n$ and $Y_n$ are independent, and suppose also that $X$ and $Y$ are independent. Show that $X_n + Y_n$ converges in distribution to $X + Y$. (Hint: use the Lévy continuity theorem.)

The following exercise shows that the central limit theorem fails when the variance is infinite.

Exercise 40. Let $X_1, X_2, \dots$ be iid copies of an absolutely integrable random variable $X$ of mean zero.

- (i) In this part we assume that $X$ is *symmetric*, which means that $X$ and $-X$ have the same distribution. Show that for any $n$ and $t > 0$,

$\mathbf{P}\Big( \sup_{1 \leq i \leq n} |X_i| \geq t \Big) \leq 2 \mathbf{P}( |S_n| \geq t ).$

(Hint: relate both sides of this inequality to the probability of the event that $|X_i| \geq t$ for at least one $1 \leq i \leq n$, using the symmetry of the situation.)

- (ii) If $X$ is symmetric and $S_n/\sqrt{n}$ converges in distribution to a real random variable $Z$, show that $X$ has finite variance. (Hint: if this is not the case, then the truncated variable $X 1_{|X| \leq N}$ will have arbitrarily large variance as $N$ increases. On the other hand, $\limsup_{n \to \infty} \mathbf{P}( |S_n| \geq \lambda \sqrt{n} )$ can be made arbitrarily small by taking $\lambda$ large enough. For a large threshold $N$, apply (i) (with $t := \lambda \sqrt{n}$) to obtain a contradiction.)
- (iii) Generalise (ii) by removing the hypothesis that $X$ is symmetric. (Hint: apply the *symmetrisation* trick of replacing $X$ by $X - X'$, where $X'$ is an independent copy of $X$, and use the previous exercise. One may need to utilise some truncation arguments to show that $X - X'$ has infinite variance whenever $X$ has infinite variance.)

Exercise 41.

- (i) If $X$ is a real random variable of mean zero and variance $\sigma^2$, and $t$ is a real number, show that

$\mathbf{E} e^{itX} = 1 - \frac{t^2 \sigma^2}{2} + O( |t|^3 \mathbf{E} |X|^3 )$

and that

$\mathbf{E} e^{itX} = 1 + O( t^2 \sigma^2 ).$

(Hint: first establish the Taylor bounds $e^{i\theta} = 1 + i\theta - \frac{\theta^2}{2} + O( |\theta|^3 )$ and $e^{i\theta} = 1 + i\theta + O( |\theta|^2 )$.)

- (ii) Establish the pointwise inequality

$| z_1 \cdots z_n - w_1 \cdots w_n | \leq \sum_{i=1}^n | z_i - w_i |$

whenever $z_1, \dots, z_n, w_1, \dots, w_n$ are complex numbers in the disk $\{ z \in \mathbf{C}: |z| \leq 1 \}$.

- (iii) Suppose that for each $n$, $X_{n,1}, \dots, X_{n,m_n}$ are jointly independent real random variables of mean zero and finite variance, obeying the uniform bound

$| X_{n,i} | \leq \varepsilon_n$

for all $1 \leq i \leq m_n$ and some $\varepsilon_n$ going to zero as $n \to \infty$, and also obeying the variance bound

$\sum_{i=1}^{m_n} \mathbf{Var}( X_{n,i} ) \to \sigma^2$

as $n \to \infty$ for some $\sigma > 0$. If $S_n := X_{n,1} + \dots + X_{n,m_n}$, use (i) and (ii) to show that

$F_{S_n}(t) \to e^{-\sigma^2 t^2 / 2}$

as $n \to \infty$ for any given real $t$.

- (iv) Use (iii) and a truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14). (Note: one has to address the issue that truncating a random variable may alter its mean slightly.)

** — 3. The moment method — **

The above Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the "standard" proof given in probability textbooks. However, it relies quite heavily on the Fourier-analytic identities in Exercise 32, which in turn are extremely dependent on both the commutative nature of the situation (as it uses the identity $e^{it(X+Y)} = e^{itX} e^{itY}$) and on the independence of the situation (as it uses identities of the form $\mathbf{E}( e^{itX} e^{itY} ) = ( \mathbf{E} e^{itX} )( \mathbf{E} e^{itY} )$). As one or both of these factors can be absent when trying to generalise this theorem, it is also important to look for non-Fourier based methods to prove results such as the central limit theorem. These methods often lead to proofs that are lengthier and more technical than the Fourier proofs, but also tend to be more robust.

The most elementary (but still remarkably effective) method available in this regard is the *moment method*, which we have already used in previous notes. In principle, this method is equivalent to the Fourier method, through the identity (6); but in practice, the moment method proofs tend to look somewhat different than the Fourier-analytic ones, and it is often more apparent how to modify them to non-independent or non-commutative settings.

We first need an analogue of the Lévy continuity theorem. Here we encounter a technical issue: whereas the Fourier phases $x \mapsto e^{itx}$ were bounded, the moment functions $x \mapsto x^k$ become unbounded at infinity. However, one can deal with this issue as long as one has sufficient decay:

Exercise 42 (Subgaussian random variables). Let $X$ be a real random variable. Show that the following statements are equivalent:

- (i) There exist $C, c > 0$ such that $\mathbf{P}( |X| \geq t ) \leq C e^{-ct^2}$ for all $t > 0$.
- (ii) There exists $C > 0$ such that $\mathbf{E} |X|^k \leq ( Ck )^{k/2}$ for all $k \geq 1$.
- (iii) There exists $C > 0$ such that $\mathbf{E} e^{tX} \leq C e^{Ct^2}$ for all $t \in \mathbf{R}$.

Furthermore, show that if (i) holds for some $C, c$, then (ii) holds for a constant depending only on $C, c$, and similarly for any of the other implications. Variables obeying (i), (ii), or (iii) are called *subgaussian*. The function $t \mapsto \mathbf{E} e^{tX}$ is known as the moment generating function of $X$; it is of course closely related to the characteristic function of $X$.

Exercise 43. Use the truncation method to show that in order to prove the central limit theorem (Theorem 8), it suffices to do so in the case when the underlying random variable $X$ is bounded (and in particular subgaussian).

Once we restrict to the subgaussian case, we have an analogue of the Lévy continuity theorem:

Theorem 44 (Moment continuity theorem). Let $X_n$ be a sequence of real random variables, and let $X$ be a subgaussian random variable. Suppose that for every $k = 0, 1, 2, \dots$, $\mathbf{E} X_n^k$ converges to $\mathbf{E} X^k$. Then $X_n$ converges in distribution to $X$.

*Proof:* Let $k \geq 1$; then by the preceding exercise we have

$\mathbf{E} |X|^k \leq ( Ck )^{k/2}$

for some $C$ independent of $k$. In particular (bounding odd absolute moments by even moments via the Cauchy-Schwarz inequality, and enlarging $C$ slightly),

$\mathbf{E} |X_n|^k \leq 2 ( Ck )^{k/2}$

when $n$ is sufficiently large depending on $k$. From Taylor's theorem with remainder (and Stirling's formula (1)) we conclude

$F_{X_n}(t) = \sum_{j=0}^{k} \frac{(it)^j}{j!} \mathbf{E} X_n^j + O\Big( \Big( \frac{O(|t|)}{\sqrt{k}} \Big)^{k} \Big)$

uniformly in $t$ on compact sets, for $n$ sufficiently large. Similarly for $X$. Taking limits using the convergence of moments, we see that

$\limsup_{n \to \infty} | F_{X_n}(t) - F_X(t) | = O\Big( \Big( \frac{O(|t|)}{\sqrt{k}} \Big)^{k} \Big).$

Then letting $k \to \infty$, keeping $t$ fixed, we see that $F_{X_n}(t)$ converges pointwise to $F_X(t)$ for each $t$, and the claim now follows from the Lévy continuity theorem. $\Box$

We remark that in place of Stirling's formula, one can use the cruder lower bound $k! \geq k^k e^{-k}$, which is immediate from the observation that $\frac{k^k}{k!}$ occurs as one of the terms in the Taylor expansion $e^k = \sum_{j=0}^\infty \frac{k^j}{j!}$ of $e^k$.

Remark 45. One corollary of Theorem 44 is that the distribution of a subgaussian random variable is uniquely determined by its moments (actually, this could already be deduced from Exercise 26 and Remark 28). This can fail for distributions with slower tails, for much the same reason that a smooth function is not determined by its derivatives at one point if that function is not analytic. The Fourier inversion formula provides an easy way to recover the distribution from the characteristic function. Recovering a distribution from its moments is more difficult, and sometimes requires tools such as analytic continuation; this problem is known as the *inverse moment problem* and will not be discussed here.

Exercise 46 (Converse direction of moment continuity theorem). Let $X_n$ be a sequence of uniformly subgaussian random variables (thus there exist $C, c > 0$ such that $\mathbf{P}( |X_n| \geq t ) \leq C e^{-ct^2}$ for all $n$ and all $t > 0$), and suppose $X_n$ converges in distribution to a limit $X$. Show that for any $k = 0, 1, 2, \dots$, $\mathbf{E} X_n^k$ converges to $\mathbf{E} X^k$.

We now give the moment method proof of the central limit theorem. As discussed above we may assume without loss of generality that $X$ is bounded (and in particular subgaussian); we may also normalise $X$ to have mean zero and unit variance. By Theorem 44, it suffices to show that

$\mathbf{E} \Big( \frac{S_n}{\sqrt{n}} \Big)^k \to \mathbf{E} G^k$

as $n \to \infty$ for all $k = 0, 1, 2, \dots$, where $G \equiv N(0,1)$ is a standard gaussian variable.

The moments $\mathbf{E} G^k$ were already computed in Exercise 36 of Notes 1. So now we need to compute $\mathbf{E} ( \frac{S_n}{\sqrt{n}} )^k$. Using linearity of expectation, we can expand this as

$\mathbf{E} \Big( \frac{S_n}{\sqrt{n}} \Big)^k = n^{-k/2} \sum_{1 \leq i_1, \dots, i_k \leq n} \mathbf{E} X_{i_1} \cdots X_{i_k}.$

To understand this expression, let us first look at some small values of $k$.

- For $k = 0$, this expression is trivially $1$.
- For $k = 1$, this expression is trivially $0$, thanks to the mean zero hypothesis on $X$.
- For $k = 2$, we can split this expression into the diagonal and off-diagonal components:

$n^{-1} \sum_{1 \leq i \leq n} \mathbf{E} X_i^2 + n^{-1} \sum_{1 \leq i < j \leq n} 2\, \mathbf{E} X_i X_j.$

Each summand in the first sum is $1$, as $X$ has unit variance. Each summand in the second sum is $0$, as the $X_i$ have mean zero and are independent. So the second moment is $1$.

- For $k = 3$, we have a similar expansion

$n^{-3/2} \sum_{1 \leq i \leq n} \mathbf{E} X_i^3 + n^{-3/2} \sum_{1 \leq i < j \leq n} 3\, \mathbf{E} X_i^2 X_j + 3\, \mathbf{E} X_i X_j^2 + n^{-3/2} \sum_{1 \leq i < j < k \leq n} 6\, \mathbf{E} X_i X_j X_k.$

The summands in the latter two sums vanish because of the (joint) independence and mean zero hypotheses. The summands in the first sum need not vanish, but are $O(1)$, so the first term is $O( n^{-1/2} )$, which is asymptotically negligible; so the third moment goes to $0$.

- For $k = 4$, the expansion becomes quite complicated:

$n^{-2} \sum_{1 \leq i \leq n} \mathbf{E} X_i^4 + n^{-2} \sum_{1 \leq i < j \leq n} 4\, \mathbf{E} X_i^3 X_j + 4\, \mathbf{E} X_i X_j^3 + 6\, \mathbf{E} X_i^2 X_j^2 + n^{-2} \sum_{1 \leq i < j < k \leq n} 12\, \mathbf{E} X_i^2 X_j X_k + 12\, \mathbf{E} X_i X_j^2 X_k + 12\, \mathbf{E} X_i X_j X_k^2 + n^{-2} \sum_{1 \leq i < j < k < l \leq n} 24\, \mathbf{E} X_i X_j X_k X_l.$

Again, most terms vanish, except for the first sum, which is $O( n^{-1} )$ and is asymptotically negligible, and the sum $n^{-2} \sum_{1 \leq i < j \leq n} 6\, \mathbf{E} X_i^2 X_j^2$, which by the independence and unit variance assumptions works out to $n^{-2} \cdot 6 \binom{n}{2} = 3 - O( n^{-1} )$. Thus the fourth moment goes to $3$ (as it should).

Now we tackle the general case. Ordering the indices $i_1, \dots, i_k$ as $j_1 < \dots < j_m$ for some $1 \leq m \leq k$, with each $j_r$ occurring with multiplicity $a_r \geq 1$, and using elementary enumerative combinatorics, we see that $\mathbf{E} ( \frac{S_n}{\sqrt{n}} )^k$ is the sum of all terms of the form

$n^{-k/2} \sum_{1 \leq j_1 < \dots < j_m \leq n} \binom{k}{a_1, \dots, a_m} \mathbf{E} X_{j_1}^{a_1} \cdots X_{j_m}^{a_m}, \quad (17)$

where $1 \leq m \leq k$, $a_1, \dots, a_m$ are positive integers adding up to $k$, and $\binom{k}{a_1, \dots, a_m}$ is the multinomial coefficient

$\binom{k}{a_1, \dots, a_m} := \frac{k!}{a_1! \cdots a_m!}.$

The total number of such terms depends only on $k$ (in fact, it is $2^{k-1}$ (exercise!), though we will not need this fact).

As we already saw from the small examples, most of the terms vanish, and many of the other terms are negligible in the limit $n \to \infty$. Indeed, if any of the $a_r$ are equal to $1$, then every summand in (17) vanishes, by joint independence and the mean zero hypothesis. Thus, we may restrict attention to those expressions (17) for which all the $a_r$ are at least $2$. Since the $a_r$ sum up to $k$, we conclude that $m$ is at most $k/2$.

On the other hand, the total number of summands in (17) is clearly at most $n^m$ (in fact it is $\binom{n}{m}$), and the summands are bounded (for fixed $k$) since $X$ is bounded. Thus, if $m$ is *strictly* less than $k/2$, then the expression in (17) is $O( n^{m - k/2} )$ and goes to zero as $n \to \infty$. So, asymptotically, the only terms (17) which are still relevant are those for which $m$ is *equal* to $k/2$. This already shows that $\mathbf{E} ( \frac{S_n}{\sqrt{n}} )^k$ goes to zero when $k$ is odd. When $k$ is even, the only surviving term in the limit is now when $m = k/2$ and $a_1 = \dots = a_m = 2$. But then by independence and unit variance, the expectation in (17) is $1$, and so this term is equal to

$n^{-k/2} \binom{n}{k/2} \binom{k}{2, \dots, 2} = \frac{k!}{2^{k/2} (k/2)!} ( 1 + o(1) ),$

and the main term is happily equal to the moment $\mathbf{E} G^k = \frac{k!}{2^{k/2} (k/2)!}$ as computed in Exercise 36 of Notes 1. (One could also appeal to Lemma 12 here, specialising to the case when $X$ is normally distributed, to explain this coincidence.) This concludes the proof of the central limit theorem.
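
One can watch these moment asymptotics emerge numerically; the following Monte Carlo sketch (illustrative only, with the signed Bernoulli variable as the bounded underlying $X$; higher moments carry more sampling noise) compares $\mathbf{E}(S_n/\sqrt{n})^k$ with the gaussian moments $\frac{k!}{2^{k/2}(k/2)!}$ for even $k$ and $0$ for odd $k$:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(5)
n, trials = 500, 200000
# X = signed Bernoulli: bounded, mean zero, unit variance
Z = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1) / np.sqrt(n)

for k in range(1, 9):
    gauss = 0 if k % 2 else factorial(k) // (2**(k // 2) * factorial(k // 2))
    print(k, (Z**k).mean(), gauss)   # empirical moment vs E G^k
```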

Exercise 47 (Chernoff bound). Let $X_1, \dots, X_n$ be iid copies of a real random variable $X$ of mean zero and unit variance, which is subgaussian in the sense of Exercise 42. Write $S_n := X_1 + \dots + X_n$.

- (i) Show that there exists $C > 0$ such that $\mathbf{E} e^{tX} \leq e^{Ct^2}$ for all $t \in \mathbf{R}$. Conclude that $\mathbf{E} e^{tS_n} \leq e^{Cnt^2}$ for all $t \in \mathbf{R}$. (Hint: the first claim follows directly from Exercise 42 when $|t| \geq 1$; for $|t| < 1$, use the Taylor approximation $e^{tX} = 1 + tX + O( t^2 X^2 e^{|X|} )$.)
- (ii) Conclude the Chernoff bound

$\mathbf{P}( |S_n| \geq \lambda \sqrt{n} ) \leq C e^{-c\lambda^2}$

for some $C, c > 0$, all $\lambda > 0$, and all $n$.
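
For a concrete instance of this tail bound, the signed Bernoulli case admits the explicit Hoeffding constants $C = 2$, $c = 1/2$ (a standard fact for bounded variables); the following sketch (illustrative only, subject to Monte Carlo error in the extreme tail) checks the bound empirically:

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 400, 400000
S = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)

for lam in [1.0, 2.0, 3.0, 4.0]:
    emp = (np.abs(S) >= lam * np.sqrt(n)).mean()
    # Hoeffding for Rademacher sums gives the explicit bound 2 exp(-lam^2/2)
    print(lam, emp, 2 * np.exp(-lam**2 / 2))
```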

Exercise 48 (Erdös-Kac theorem). For any natural number $N$, let $n$ be a natural number drawn uniformly at random from the natural numbers $\{1, \dots, N\}$, and let $\omega(n)$ denote the number of distinct prime factors of $n$.

- (i) Show that for any fixed $k = 0, 1, 2, \dots$, one has

$\mathbf{E} \Big( \frac{\omega(n) - \log\log N}{\sqrt{\log\log N}} \Big)^k \to 0$

as $N \to \infty$ if $k$ is odd, and

$\mathbf{E} \Big( \frac{\omega(n) - \log\log N}{\sqrt{\log\log N}} \Big)^k \to \frac{k!}{2^{k/2} (k/2)!}$

as $N \to \infty$ if $k$ is even. (Hint: adapt the arguments in Exercise 16 of Notes 3, estimating $\omega(n)$ by $\sum_{p \leq N^{1/\log\log N}} 1_{p | n}$, using Mertens' theorem and induction on $k$ to deal with lower order errors, and treating the random variables $1_{p | n} - \frac{1}{p}$ as being approximately independent and approximately of mean zero.)
- (ii) Establish the Erdös-Kac theorem

$\mathbf{P}\Big( a \leq \frac{\omega(n) - \log\log N}{\sqrt{\log\log N}} \leq b \Big) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\ dx$

as $N \to \infty$ for any fixed real numbers $a < b$.

Informally, the Erdös-Kac theorem asserts that $\omega(n)$ behaves like $\log\log n + \sqrt{\log\log n}\, N(0,1)$ for "random" $n$. Note that this refines the Hardy-Ramanujan theorem (Exercise 16 of Notes 3).
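
One can probe this numerically with a simple sieve; the sketch below (illustrative only) tabulates $\omega(n)$ for $n \leq 10^6$ and computes the moments of the normalised statistic. Bear in mind that the convergence here is famously slow, since the relevant scale is $\log\log N \approx 2.6$ even at $N = 10^6$, so only rough agreement with the gaussian moments $0, 1, 0, 3$ should be expected:

```python
import numpy as np

N = 10**6
# omega(n) for n <= N by a sieve: add 1 for every prime dividing n.
omega = np.zeros(N + 1, dtype=np.int64)
for p in range(2, N + 1):
    if omega[p] == 0:            # no smaller prime divides p, so p is prime
        omega[p::p] += 1

T = np.log(np.log(N))
Z = (omega[2:] - T) / np.sqrt(T)   # normalised omega, n uniform in [2, N]
for k in [1, 2, 3, 4]:
    print(k, (Z**k).mean())        # compare (roughly) with 0, 1, 0, 3
```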

## 69 comments

Comments feed for this article

6 January, 2010 at 2:02 am

JeanVery nice notes! Thank you!

Just a remark: in http://arxiv.org/abs/0705.1224 and http://arxiv.org/abs/0908.0391, one can find some applications of Stein’s method in random matrix theory.

6 January, 2010 at 7:27 am

Mark MeckesMore applications of Stein’s method in random matrix theory appear here, here, and here.

6 January, 2010 at 2:53 am

karabasovA couple of very minor typos:

1) in the second-last paragraph of section 3, it should be “to those terms in (17)”

2) Beginning of section 4: “Lindeberg observed that” – the comma between observed and that is not correct.

6 January, 2010 at 3:25 am

karabasovI think there might be a parsing error around “where L is the rhox” in the last section… or maybe I just don’t get it.

6 January, 2010 at 9:12 am

Giovanni PeccatiHi Terry,

thank you for these inspiring lecture notes !

Small typo, in the brackets starting 5 lines after formula (30) (If we secretely…) there is a square root missing in the normalizing factor of the Gaussian density.

Best, G

6 January, 2010 at 12:04 pm

Terence TaoThanks for the corrections!

Somewhat embarrassingly, I was only dimly aware of the work on applying Stein’s method to random matrices; it seems that the fraction of literature on the subject that I am familiar with is still not fully representative. Thanks for the references!

7 January, 2010 at 5:45 am

Tim vBDear Terry,

thanks for this marvelous lecture notes!

I hope the following two remarks are not too pedantic, but both points made me stop for a short while:

1. The part “now we get the Fokker-Planck-Equation and that can be solved exactly” confused me a bit, because the Fokker-Planck-Equation can “usually” not be solved excatly – the one for the Ornstein-Uhlenbeck process can. It’s of course clear from the context that you mean exactly this, but I think it would help me if you redundantly write ” the above computations heuristically lead us eventually to the Fokker-Planck equation of the Ornstein-Uhlenbeck process”.

2. Wouldn’t most people say the Ornstein-Uhlenbeck process is the solution of a stochastic ordinary differential equation (stochastic ODE) rather than a stochastic partial differential equation?

[Fair enough; I’ve adjusted the notes accordingly.]15 January, 2010 at 12:31 pm

Steven HeilmanMore tedious [potential] corrections:

1. Two Eqs. after Eq. (14): P(…)\leq E(…)

2. Eq (16): square in denominator correct?

3. Theorem 4,5: F should be phi (or vice versa)

4(a). Proof of Theorem 5: several missing parentheses? (or E symbols), e.g.

\displaystyle {\bf E} (\varphi(Z_n) – \varphi(W_n)) = o(1).

This is done quite a lot, so I will assume it was done on purpose.

4(b). telescoping sum, multiply by -1

5. After Eq. (26): tex error for bound on f. I assume you want

{f(x) = O_\varphi( 1/|x|)}

[Corrected, thanks – T.]17 January, 2010 at 7:28 pm

salazarSecond 1 and 3. For 4, since W_n is a RV, there should be no ambiguity. For 2, I haven’t been able to see how to replace \hat{eta}.

17 January, 2010 at 8:55 pm

Terence Taocan be bounded by an absolute constant. (The denominator of was a relic from an earlier version of the argument, but is now redundant due to the restriction to the range .)

18 January, 2010 at 6:29 pm

254A, Notes 3b: Brownian motion and Dyson Brownian motion « What’s new[…] the central limit theorem from the fundamental solution of the heat equation (cf. Section 7 of Notes 2), although the derivation is only heuristic because one first needs to know that some limiting […]

29 January, 2010 at 7:16 am

Andreas NaiveExcellent notes!

Just a small fix:

In (30), should be , which slightly simplify the following argument.

[Corrected, thanks – T.]2 February, 2010 at 1:34 pm

254A, Notes 4: The semi-circular law « What’s new[…] vein, we may apply the truncation argument (much as was done for the central limit theorem in Notes 2) to reduce the semi-circular law to the bounded case: Exercise 5 Show that in order to prove the […]

10 February, 2010 at 10:56 pm

245A, Notes 5: Free probability « What’s new[…] an interesting analogue in the freely independent setting. For instance, the central limit theorem (Notes 2) for averages of classically independent random variables, which roughly speaking asserts that such […]

25 February, 2010 at 10:19 am

PDEbeginnerDear Prof. Tao,

I have some problems on Ex 12 (Esseen concentration inequaity) and Thm 3:

Ex 12. Let , the value on LHS of (9) should be 1, while that on RHS should be 0. Then the inequality breaks down.

Thm 3. In the paragraph immediately below (14), I do not understand how to apply an integration by parts argument, since one does not know the exact distribution of .

Thanks in advance!

25 February, 2010 at 11:09 am

Terence TaoAh, there was a factor of r^d missing in the concentration inequality; it is fixed now.

As for Theorem 3, the point is that the distribution of X is the Stieltjes derivative of the cumulative distribution function of X; . If we integrate the latter integral by parts, we see that small perturbations of F in the uniform norm lead to small perturbations of when f has bounded total variation.

25 February, 2010 at 10:20 am

PDEbeginnerSorry, it is Ex 11.

26 February, 2010 at 2:41 pm

YPThe derivation of Gaussian in Section 7 is amazing! (To me it seems very much in the spirit of physics). Isn’t it strange that to our brain it seems very natural to make this argument as it is, whereas for scientific community you actually need to add days of hard work (in Sobolev spaces etc)?

27 February, 2010 at 6:44 pm

vedadiDear Prof. Tao,

We know that for i.i.d mean zero, variance one random variables, converges in distribution to the point mass at zero, and standard normal r.v. for respectively. Do we have convergence in distribution for other values

Thanks

6 March, 2010 at 7:14 pm

254A, Notes 7: The least singular value « What’s new[…] understand this walk, we apply (a slight variant) of the Berry-Esséen theorem from Notes 2: Exercise 1 Show […]

10 March, 2010 at 10:35 am

PDEbeginnerDear Prof. Tao,

I finished reading this note. As usual, I have some problems :-)

1. For the moment method, we first apply Chernoff bound (3), and then we can assume that are uniformly guassian. But when proving Chernoff bound in Note 1, we have used the identity . When we are dealing random matrices, if we also use Chernoff bounds, then it seems we shall have some trouble.

2. I still do not know how to prove the conclusion in Ex 11. Suppose one is trying to bound the easy probability (with and ):

.

By Fourier transform, we obtain

Clearly, to show has the same bound as in exercise, we need to show that has that bound. I don’t know to show this.

3. A small typo: the in Theorem 5 seems to be .

10 March, 2010 at 6:38 pm

Terence Tao1. For random matrices, one does not apply the Chernoff bound to the matrix directly, but to other scalar expressions related to that matrix, e.g. linear combinations of the matrix entries.

2. One first needs to replace the sharp cutoff to the ball of radius r by a smoother cutoff. This is the method of smoothing sums:

http://www.tricki.org/article/Smoothing_sums

3. Thanks for the correction!

5 May, 2010 at 8:48 pm

AnonymousThese notes are very helpful, thank you very much!

I do not have a strong background in the Fourier analysis, could you point me to a refernce to the solutions to Exercise 11 and Exercise 15?

A few comments I have:

In Exercise 13, the characteristic function should be $e^{i\mu}t…$ instead of $e^{-i\mu}t$?

In Part 1 of this note, the “N” in the last two singled-out equations should be replaced with “N_n”?

6 May, 2010 at 9:06 am

Terence TaoThanks for the corrections,’

For Ex11, see Lemma 7.17 of my book “Additive combinatorics” with Van Vu. Ex15 can be found in most graduate texts in probability, including those listed at the link given.

27 July, 2010 at 3:50 pm

Ahmet ArivanGreat notes. I have probably a very stupid question. But in the Stein method section, it is claimed that if f(t) =O(1/1+|t|) and f'(t)=O(1) then tf(t) is Lipschitz with constant of O(1). Is it possible to give a hint on why this is true?

Thanks,

31 July, 2010 at 8:35 am

Terence Tao: Ah yes, that requires an additional argument (which needs the Lipschitz hypothesis on $\phi$). I've modified the text accordingly (and in particular weakened the Berry-Esseen type claim, since the Lipschitz property is not present in that case).

29 October, 2012 at 3:53 pm

Nick Cook: Small thing: I think the last part of the proof of Theorem 6, showing the stated bound with implicit constant linear in the bounded Lipschitz norm of $\phi$, belongs just after this proof, since we aren't assuming $\phi$ is Lipschitz here – or leave it in with a modification of the phrase "using the Lipschitz nature of $\phi$" :)

[Corrected by assuming $\phi$ to be Lipschitz. -T.]

4 November, 2010 at 8:33 am

Qiaochu Yuan: Regarding Section 7: I had always vaguely figured that if $Z_n$ were going to converge to a particular distribution, it would have to converge to an eigenfunction of the Fourier transform. Do you know of any heuristic (or rigorous!) derivations of the CLT using this idea?

4 November, 2010 at 9:02 am

Terence Tao: I think the more accurate statement would be that a stable law should have a Fourier transform whose logarithm is an eigenfunction of the scaling operation, since the convolution of the distribution with itself needs to be a rescaling of that distribution.
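
(To make this concrete in a simple case, a sketch under a symmetry assumption: if $X_1, X_2$ are iid copies of $X$ and the stability condition $X_1 + X_2 \equiv 2^{1/\alpha} X$ holds in distribution, then the characteristic function $\phi(t) := \mathbf{E} e^{itX}$ obeys

$$\phi(t)^2 = \phi(2^{1/\alpha} t), \quad \text{i.e.} \quad \log \phi(2^{1/\alpha} t) = 2 \log \phi(t),$$

so $\log \phi$ is indeed an eigenfunction of scaling; in the symmetric case this forces, for reasonable solutions, $\log \phi(t) = -c|t|^\alpha$, recovering the gaussian when $\alpha = 2$.)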

16 November, 2010 at 1:04 pm

karabasov: At the beginning of the proof of Theorem 1, it seems that the first "for sufficiently small $t$" is not needed; only the second is needed.

15 March, 2011 at 6:57 pm

Sujit: Hi Terry,

I am a bit puzzled about Exercise 1. I am thinking of $Z_n$ as a map from $n$ copies of $\Omega$ to $R$. If this is the case, what does it mean to say $Z_n$ converges almost surely? Isn’t the underlying sample space varying as $n$ varies?

The CLT says something about the push forward of the measures from $\Omega^n$ to $R$ and I can understand it. So even if the sample space changes, it doesn’t matter as we are comparing the push forward measures. In the strong LLN case, it still sounds fine to me if I interpret the almost sure convergence as almost sure convergence to the constant map on $\Omega^n$ given by the expectation. But I am having trouble trying to understand what Exercise 1 means.

Thanks

15 March, 2011 at 7:08 pm

Terence Tao: The intent here is to work in a common extension of all these probability spaces, where we have an infinite sequence of iid random variables (and this is also the correct way to interpret the strong law of large numbers).

16 March, 2011 at 1:03 pm

Anonymous: Dear Prof. Tao,

Let $M_X(t) := \mathbf{E} e^{tX}$ be the moment generating function of a non-degenerate r.v. $X$. Is $M_X$ then always strictly increasing and strictly convex on $\mathbf{R}$?

Thanks

28 August, 2011 at 6:12 am

The law of large numbers and the central limit theorem | Controlled Complexity: […] theorem (CLT). There are excellent resources on the net for LLN and CLT. For example, this and this are highly recommended readings. This blog will play a complementary role with figures and animations to […]

1 October, 2011 at 10:32 am

Anonymous: Can anyone help me compute the expected waiting time of the first occurrence of a given pattern, say TTHT, in a sequence of coin flips?

Is there a general approach to solving this sort of problem?

Thanks

18 November, 2011 at 2:38 am

Diffusion in Ehrenfest wind-tree model « Disquisitiones Mathematicae: […] of the word "abnormal" comes by comparison with Brownian motion and/or central limit theorem: once we know that the diffusion is "sublinear" (maybe after removing the […]

17 March, 2012 at 5:48 am

Anonymous: Dear Prof. Tao,

Let $S_n$ be the simple symmetric random walk in one dimension.

Assume that $K$ is a large constant; I need a lower bound, not depending on $n$, for the following probability: …

What is the idea behind finding a lower bound for this type of quantity?

Thank you

10 October, 2012 at 3:33 pm

Robert Kohn: Hi, thanks for the insightful presentation. My question is: suppose that the $X_i$ are independent with finite third moments, zero mean, and unit variance. What can we say about the convergence of the third moment of the normalised sum $Z_n$?

In addition, what results of the same kind can we get if the $X_i$ are correlated?

Thanks,

Robert Kohn

20 November, 2012 at 8:14 am

Jack: Regarding Exercise 1: I don't see how the intuition helps. What I thought of was to give a lower bound on the relevant probability, coming from the definition of convergence in probability. Could you elaborate on the intuition? Why would independence help here?

20 November, 2012 at 8:41 am

Terence Tao: Actually, one needs a lower bound on $\mathbf{P}(|Z_n - Z_m| \geq \varepsilon)$ to contradict convergence in probability to an unspecified limit (lower bounding $\mathbf{P}(Z_n \geq \varepsilon)$ merely prevents convergence in probability to zero, or to a non-positive random variable). But if one can quantify some approximate independence between $Z_n$ and $Z_m$ for $n, m$ widely separated, then this, together with the central limit theorem, can supply the desired lower bound.

20 November, 2012 at 2:07 pm

Jack: An attempt: both $Z_n$ and $Z_{2n}$ converge to $N(0,1)$ in distribution by the CLT, so one has control on the distribution of each separately. I still don't see how one can lower bound the probability that they differ.

20 November, 2012 at 2:14 pm

Terence Tao: $Z_n$ and $Z_{2n}$ are still quite correlated, so it is a bit tricky to separate them in this way. Try $Z_n$ and $Z_{Mn}$ instead for some large $M$ (larger than … or so). The point is that these two random variables are almost independent (more precisely, $S_n$ and $S_{Mn} - S_n$ are independent), which makes it difficult for them to stay so close to each other so often.

20 November, 2012 at 2:46 pm

Jack: What puzzles me is the use of "independence" here. As a beginner, I only know that $X$ and $Y$ are independent if and only if $\mathbf{P}(X \in A, Y \in B) = \mathbf{P}(X \in A)\,\mathbf{P}(Y \in B)$, where $A$ and $B$ are Borel sets in the corresponding Borel sigma algebra. Why does this concept have anything to do with the "distance" between these random variables? It's quite intuitive, though: if two things are "independent", then they should not be too "close" to each other. But I don't see the connection. (Eventually I came up with something like …; is this the point?)

20 November, 2012 at 6:24 pm

Terence Tao: If $X$ and $Y$ are independent, then (for instance) $\mathbf{P}(|X - Y| \geq r)$ will be large if $\mathbf{P}(X \geq r)$ and $\mathbf{P}(Y \leq 0)$ are large, thus giving some separation between $X$ and $Y$. (One can also in principle work with expectations, but this requires some hypotheses on absolute integrability that one then has to justify separately.)
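
(A minimal instance of this separation, assuming for illustration that $X, Y \sim N(0,1)$ are independent: then $X - Y \sim N(0,2)$, so

$$\mathbf{P}(|X-Y| \ge 1) = \mathbf{P}(|N(0,1)| \ge 1/\sqrt{2}) \approx 0.48,$$

a constant amount of separation, which is incompatible with $|Z_n - Z_{Mn}|$ being small with probability close to one along a subsequence.)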

26 November, 2012 at 3:13 pm

Jack: Finding a lower bound for … is still not "obvious" to me. I tried …, and the independence between … and … gives …. Since … is chosen such that …, this is very close to the desired estimate. But I don't see a way to bound … and …. Since I haven't used the CLT so far, I suspect something is missing here.

26 November, 2012 at 3:42 pm

Terence Tao: The CLT describes the limiting value of these probabilities in the limit as $n \to \infty$ with $M$ fixed, and similarly for $S_{Mn} - S_n$ (which is basically a rescaled variant of $S_n$).

26 November, 2012 at 4:14 pm

Jack: I found in some book that random variables $X_n$ converging in distribution to $X$ is defined as (and the author also calls it "weak convergence")

$\mathbf{E} f(X_n) \to \mathbf{E} f(X)$ for all bounded, continuous functions $f$. Is it "equivalent" to the definition in [254A Note4 Exercise 18](https://terrytao.wordpress.com/2010/10/02/245a-notes-4-modes-of-convergence/)? But in [245B Note 11](https://terrytao.wordpress.com/2009/02/21/245b-notes-11-the-strong-and-weak-topologies/), weak convergence is convergence in the weak topology. I am totally confused by the terminology.

[I don’t know how to give links like people do in MathOverflow.]

26 November, 2012 at 4:39 pm

Terence Tao: Yes, these notions are all connected to each other; see Exercise 23 of https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ . (Though, strictly speaking, vague convergence of measures – which is equivalent to convergence in distribution – is an example of weak* convergence rather than weak convergence.)

22 March, 2013 at 12:34 am

Anonymous: Hi Professor Tao,

Does the CLT hold if we substitute the hypothesis of finite mean and variance with the hypothesis of a finite second moment, $\mathbf{E}|X|^2 < \infty$?

22 March, 2013 at 9:10 am

Terence TaoBy Holder’s inequality (or the Cauchy-Schwarz inequality), finite second order moment is equivalent to finite mean and variance.

3 April, 2013 at 3:13 am

Anonymous: Thanks a lot for the answer! By the way, does $Z_n := \frac{S_n - n\mu}{\sqrt{n}\,\sigma} \to N(0,1)$ imply that $\frac{S_n}{n} \to N(\mu, \frac{\sigma^2}{n})$? This implication seems quite weird to me, since $n \to \infty$, but in a book about generalized polynomial chaos I found the following sentence: "the numerical average of a sequence of i.i.d. random variables will converge, as $n$ is increased, to a Gaussian distribution $N(\mu, \frac{\sigma^2}{n})$ […]".

3 April, 2013 at 10:30 am

Terence Tao: If one uses a rescaled notion of convergence, then this statement is true, although if one uses unrescaled versions of convergence (e.g. convergence in distribution, vague convergence, total variation convergence, etc.) then the statement is either false or trivially true, depending on exactly which mode of convergence is specified. It is somewhat of an abuse of notation to refer to convergence of random variables without specifying the exact nature of the convergence, although for informal mathematical discussion it is generally permissible to be a bit loose in this regard.
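
(A quick numerical illustration of the rescaled reading of that sentence, a sketch not part of the original exchange: for a fixed large $n$, the empirical mean and standard deviation of $S_n/n$ over many independent trials should match $\mu$ and $\sigma/\sqrt{n}$.)

```python
import numpy as np

# Sketch: for *fixed* n, the sample average S_n/n of iid variables with
# mean mu and variance sigma^2 is approximately N(mu, sigma^2/n); this is
# the (rescaled) sense in which the quoted sentence from the book is true.
rng = np.random.default_rng(0)
n, trials = 1_000, 10_000
mu, sigma = 2.0, 3.0
# iid exponentials (mean 1, variance 1), shifted and scaled to (mu, sigma^2)
x = mu + sigma * (rng.exponential(1.0, size=(trials, n)) - 1.0)
avg = x.mean(axis=1)  # one realisation of S_n/n per trial
print("empirical mean, std of S_n/n:", avg.mean(), avg.std())
print("predicted                   :", mu, sigma / np.sqrt(n))
```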

7 May, 2013 at 3:44 am

Anonymous: (i) In Remark 1 the word "theorem" is missing from "central limit theorem."

(ii) In the equation following the sentence "From the bounded density of $G$ and the rapid decrease of $\eta$ we have…", I think the first term should be an expectation and not a probability.

(iii) Do you use any convention for capitalizing theorems? For example, you write “central limit theorem” in lower case but “Taylor” expansion in upper case.

(iv) What is the diagonalisation argument in “By a diagonalisation argument, we conclude that there exists a sequence…”? Is there a Wikipedia page? Thanks.

[Corrected, thanks. Taylor is a proper noun and is therefore capitalised in English. The diagonalisation argument, originally due to Cantor, refers to any procedure in which one repeatedly extracts sequences with various properties and then passes to a diagonal subsequence with an even better property. Actually, the Arzelà-Ascoli diagonalisation argument is closer in spirit to the one used here than the Cantor diagonalisation argument.]

13 May, 2013 at 12:00 pm

Chernoff bound, entropy bound, and what lies between them | One and One: […] talk about it here; you are invited to read about it on Terry Tao's blog. But even if we simply believe that this integral is close enough […]

2 November, 2015 at 7:06 pm

275A, Notes 4: The central limit theorem | What's new: […] theorem (including some further proofs, such as one based on Stein's method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and […]

28 December, 2015 at 5:32 am

Vladimir: Dear Prof. Tao, I do not get why in the proof of Theorem 7 the variable $t$ is uniformly distributed, even though the independence of the $Z_{n;i}$ is clear to me. Could you help please?

28 December, 2015 at 8:14 am

Terence Tao: $t$ is uniformly distributed by definition. In less probabilistic language, the identity being asserted here is that $\mathbf{E} F'(t) = \int_0^1 F'(t)\, dt = F(1) - F(0)$ when $t$ is drawn uniformly from $[0,1]$, which is of course a special case of the fundamental theorem of calculus.

29 December, 2015 at 2:49 am

Vladimir: Many thanks, now I get it.

27 January, 2016 at 8:50 am

Anonymous: Dear Prof. Tao, in Exercise 8, shouldn't the derivative be evaluated at $0$?

[Exercise reworded – T.]

3 May, 2016 at 2:41 pm

Vivek Kumar Bagaria: Great notes! How does the vector-valued central limit theorem behave if the dimension of the vector scales with the number of samples $n$?

4 May, 2016 at 8:30 am

Terence Tao: This basically comes down to the dependence of the constants in the Berry-Esseen theorem on the dimension. The best bound I know of in this direction comes from this 2003 paper of Bentkus, but there may have been further improvements since then.

5 May, 2016 at 4:11 am

easy: Why is the third moment important in that paper? The CLT is about second moments, right?

5 May, 2016 at 8:12 am

Terence Tao: If one just wants qualitative convergence (with no rate), then second moment conditions suffice, but in order to obtain quantitative bounds one needs some higher moment assumption (to deal with the errors caused by truncation arguments). Also, one can get faster rates of convergence than Berry-Esseen if the original distribution matches third or higher moments with the gaussian, as can be seen from the Lindeberg argument (or from the Fourier argument).
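
(A sketch of where third moments enter the Lindeberg exchange: swapping one summand $X_i/\sqrt{n}$ for a gaussian $G_i/\sqrt{n}$ with matching mean and variance, and Taylor expanding a smooth test function to third order, costs an error of size

$$O\left( \frac{\mathbf{E}|X_i|^3 + \mathbf{E}|G_i|^3}{n^{3/2}} \right)$$

per swap, hence $O(n^{-1/2})$ in total over the $n$ swaps; if the third moments also match, one can expand to fourth order and improve the total error to $O(n^{-1})$.)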

7 November, 2016 at 10:26 pm

Sahiba: Hey, can you help me with Exercise 1? I'm not able to solve it.

8 February, 2017 at 10:10 pm

John Rawls: These notes are immensely useful as always, but I have a small question: how exactly does the diagonalisation argument in Section 1 work? We know that convergence holds for each fixed index and all bounded continuous $f$, but it seems that the rate of convergence depends not only on the index but also on $f$. How then can we choose the diagonal sequence so that we get convergence for all $f$, or at least for all $f$ in some nice subclass? Thanks!

9 February, 2017 at 10:27 pm

Terence Tao: One can work with some countable dense sequence of functions $f_1, f_2, f_3, \dots$ (e.g. indicator functions of half-lines $(-\infty, q]$ with $q$ an enumeration of the rationals will work). For each $k$, we can then find $N_k$ such that the relevant expectation lies within $1/k$ of its limiting value for all $n \geq N_k$ and $i \leq k$; we can also ensure that the $N_k$ are increasing in $k$. Letting $k(n)$ be the inverse of $N_k$ (or more precisely, $k(n)$ is the first $k$ for which $N_{k+1} > n$) we see that the required closeness holds for all $i \leq k(n)$, and by approximating a general $f$ using the $f_i$ we obtain convergence in distribution.
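
(The shape of this construction, in generic notation: if $a_{n,i} \to a_i$ as $n \to \infty$ for each $i$, set

$$N_k := \min\{ N : |a_{n,i} - a_i| \le 1/k \hbox{ for all } n \ge N \hbox{ and } i \le k \},$$

replace $N_k$ with $\max(N_1,\dots,N_k)$ to make the sequence non-decreasing, and let $k(n)$ be the largest $k$ with $N_k \le n$; then $k(n) \to \infty$ and $|a_{n,i} - a_i| \le 1/k(n)$ whenever $i \le k(n)$, which is the diagonal convergence required.)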

14 February, 2017 at 8:32 pm

John Rawls: Very clever, thank you!

16 February, 2017 at 10:07 am

tpfly: Hi Professor Tao,

I'm reading this series of notes and I have a question about the proof of the Berry-Esséen theorem. You said that in order to establish (13) it suffices to establish (15), because of (8). But (8) requires that $\varphi$ be Schwartz, or at least $L^1$ with an $L^1$ Fourier transform, because the Fourier inversion formula is used. However, the function in the proof of Berry-Esséen decreases from $1$ to $0$; is (8) still available for this kind of function?

Thanks!

17 February, 2017 at 5:34 am

tpfly: I found the answer to my question by taking a limit, that is, …, and now the expectation is taken of a Schwartz function.

By the way, I think there are some mistakes in this proof: the Fourier transform … should be …, as there must be a singular point at $0$ (otherwise the integral would be finite); and … should be ….

[New version of proof given – T.]