Consider the sum $S_n := X_1 + \ldots + X_n$ of iid real random variables $X_1, \ldots, X_n \equiv X$ of finite mean $\mu$ and variance $\sigma^2$ for some $\sigma > 0$. Then the sum $S_n$ has mean $n\mu$ and variance $n\sigma^2$, and so (by Chebyshev's inequality) we expect $S_n$ to usually have size $n\mu + O(\sqrt{n}\sigma)$. To put it another way, if we consider the normalised sum

$$Z_n := \frac{S_n - n\mu}{\sqrt{n}\sigma} \qquad (1)$$

then $Z_n$ has been normalised to have mean zero and variance $1$, and is thus usually of size $O(1)$.
In the previous set of notes, we were able to establish various tail bounds on $Z_n$. For instance, from Chebyshev's inequality one has

$$\mathbf{P}(|Z_n| > \lambda) \leq \lambda^{-2}, \qquad (2)$$

and if the original distribution $X$ was bounded or subgaussian, we had the much stronger Chernoff bound

$$\mathbf{P}(|Z_n| > \lambda) \leq C e^{-c\lambda^2} \qquad (3)$$

for some absolute constants $C, c > 0$; in other words, the $Z_n$ are uniformly subgaussian.
Now we look at the distribution of . The fundamental central limit theorem tells us the asymptotic behaviour of this distribution:
Theorem 1 (Central limit theorem) Let $X_1, \ldots, X_n$ be iid real random variables of finite mean $\mu$ and variance $\sigma^2$ for some $\sigma > 0$, and let $Z_n$ be the normalised sum (1). Then as $n \rightarrow \infty$, $Z_n$ converges in distribution to the standard normal distribution $N(0,1)$.
Exercise 2 Show that $Z_n$ does not converge in probability or in the almost sure sense. (Hint: the intuition here is that for two very different values $n_1 \ll n_2$ of $n$, the quantities $Z_{n_1}$ and $Z_{n_2}$ are almost independent of each other, since the bulk of the sum $S_{n_2}$ is determined by those $X_n$ with $n > n_1$. Now make this intuition precise.)
Exercise 3 Use Stirling's formula from Notes 0a to verify the central limit theorem in the case when $X$ is a Bernoulli distribution, taking the values $+1$ and $-1$ only. (This is a variant of Exercise 2 from those notes, or Exercise 2 from Notes 1. It is easy to see that once one does this, one can rescale and handle any other two-valued distribution also.)
Exercise 4 Use Exercise 9 from Notes 1 to verify the central limit theorem in the case when $X$ is gaussian.
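These limiting assertions are easy to probe numerically. The following sketch (illustrative only, and not part of the original notes) simulates the normalised sums $Z_n$ for the signed Bernoulli distribution of Exercise 3 and compares the empirical distribution function against the standard normal one:

```python
import math
import random

def normalised_sum(n, rng):
    """One sample of Z_n = (X_1 + ... + X_n)/sqrt(n) for signed Bernoulli X_i
    (values +1 and -1 with probability 1/2 each, so mean 0 and variance 1)."""
    return sum(rng.choice((-1, 1)) for _ in range(n)) / math.sqrt(n)

def empirical_cdf_at(x, samples):
    return sum(1 for z in samples if z <= x) / len(samples)

def Phi(x):
    """Cumulative distribution function of N(0,1)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

rng = random.Random(0)
samples = [normalised_sum(100, rng) for _ in range(5000)]
for x in (-1.0, 0.0, 1.0):
    print(x, empirical_cdf_at(x, samples), Phi(x))
```

The empirical CDF agrees with $\Phi$ up to sampling error plus a discrepancy of the type quantified later by the Berry-Esséen theorem.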
Note we are only discussing the case of real iid random variables. The case of complex random variables (or more generally, vector-valued random variables) is a little bit more complicated, and will be discussed later in this post.
The central limit theorem (and its variants, which we discuss below) are extremely useful tools in random matrix theory, in particular through the control they give on random walks (which arise naturally from linear functionals of random matrices). But the central limit theorem can also be viewed as a "commutative" analogue of various spectral results in random matrix theory (in particular, we shall see in later lectures that the Wigner semicircle law can be viewed in some sense as a "noncommutative" or "free" version of the central limit theorem). Because of this, the techniques used to prove the central limit theorem can often be adapted to be useful in random matrix theory. For this reason, we shall use these notes to dwell on several different proofs of the central limit theorem, as this provides a convenient way to showcase some of the basic methods that we will encounter again (in a more sophisticated form) when dealing with random matrices.
— 1. Reductions —
We first record some simple reductions one can make regarding the proof of the central limit theorem. Firstly, we observe scale invariance: if the central limit theorem holds for one random variable $X$, then it is easy to see that it also holds for $aX + b$ for any real $a, b$ with $a \neq 0$. Because of this, one can normalise to the case when $X$ has mean $\mu = 0$ and variance $\sigma^2 = 1$, in which case $Z_n$ simplifies to

$$Z_n = \frac{X_1 + \ldots + X_n}{\sqrt{n}}. \qquad (4)$$
The other reduction we can make is truncation: to prove the central limit theorem for arbitrary random variables of finite mean and variance, it suffices to verify the theorem for bounded random variables. To see this, we first need a basic linearity principle:
Exercise 5 (Linearity of convergence) Let $V$ be a finite-dimensional real or complex vector space, let $X_n, Y_n$ be sequences of $V$-valued random variables (not necessarily independent), and let $X, Y$ be another pair of $V$-valued random variables. Let $c_n, d_n$ be scalars converging to $c, d$ respectively.
- If $X_n$ converges in distribution to $X$, and $Y_n$ converges in distribution to $Y$, and at least one of $X, Y$ is deterministic, show that $c_n X_n + d_n Y_n$ converges in distribution to $cX + dY$.
- If $X_n$ converges in probability to $X$, and $Y_n$ converges in probability to $Y$, show that $c_n X_n + d_n Y_n$ converges in probability to $cX + dY$.
- If $X_n$ converges almost surely to $X$, and $Y_n$ converges almost surely to $Y$, show that $c_n X_n + d_n Y_n$ converges almost surely to $cX + dY$.
Show that the first part of the exercise can fail if $X, Y$ are not deterministic.
Now suppose that we have established the central limit theorem for bounded random variables, and want to extend to the unbounded case. Let $X$ be an unbounded random variable, which we can normalise to have mean zero and unit variance. Let $N > 0$ be a truncation parameter depending on $n$ which, as usual, we shall optimise later, and split $X = X_{\leq N} + X_{>N}$ in the usual fashion ($X_{\leq N} := X \mathbf{I}(|X| \leq N)$; $X_{>N} := X \mathbf{I}(|X| > N)$). Thus we have $S_n = S_{n,\leq N} + S_{n,>N}$ as usual.

Let $\mu_{\leq N}, \sigma_{\leq N}^2$ be the mean and variance of the bounded random variable $X_{\leq N}$. As we are assuming that the central limit theorem is already true in the bounded case, we know that if we fix $N$ to be independent of $n$, then

$$\frac{S_{n,\leq N} - n\mu_{\leq N}}{\sqrt{n}\sigma_{\leq N}}$$

converges in distribution to $N(0,1)$. By a diagonalisation argument, we conclude that there exists a sequence $N_n$ going (slowly) to infinity with $n$, such that

$$\frac{S_{n,\leq N_n} - n\mu_{\leq N_n}}{\sqrt{n}\sigma_{\leq N_n}}$$

still converges in distribution to $N(0,1)$.

For such a sequence, we see from dominated convergence that $\sigma_{\leq N_n}$ converges to $\sigma = 1$. As a consequence of this and Exercise 5, we see that

$$\frac{S_{n,\leq N_n} - n\mu_{\leq N_n}}{\sqrt{n}}$$

converges in distribution to $N(0,1)$.

Meanwhile, from dominated convergence again, $\sigma_{>N_n}$ converges to $0$. From this and (2) we see that

$$\frac{S_{n,>N_n} - n\mu_{>N_n}}{\sqrt{n}}$$

converges in distribution to $0$. Finally, from linearity of expectation we have $\mu_{\leq N_n} + \mu_{>N_n} = \mu = 0$. Summing (using Exercise 5), we obtain the claim.
Remark 6 The truncation reduction is not needed for some proofs of the central limit theorem (notably the Fourier-analytic proof), but is very convenient for some of the other proofs that we will give here, and will also be used at several places in later notes.
By applying the scaling reduction after the truncation reduction, we observe that to prove the central limit theorem, it suffices to do so for random variables which are bounded and which have mean zero and unit variance. (Why is it important to perform the reductions in this order?)
— 2. The Fourier method —
Let us now give the standard Fourier-analytic proof of the central limit theorem. Given any real random variable $X$, we introduce the characteristic function $F_X: \mathbf{R} \rightarrow \mathbf{C}$, defined by the formula

$$F_X(t) := \mathbf{E} e^{itX}. \qquad (5)$$

Equivalently, $F_X$ is the Fourier transform of the probability measure $\mu_X$.
Example 7 The signed Bernoulli distribution (taking the values $+1$ and $-1$ with probability $1/2$ each) has characteristic function $F_X(t) = \cos(t)$.
Exercise 8 Show that the normal distribution $N(\mu, \sigma^2)$ has characteristic function $F_X(t) = e^{it\mu} e^{-\sigma^2 t^2/2}$.
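As a quick numerical sanity check (not part of the original notes), the $\mu = 0$, $\sigma = 1$ case of this exercise can be verified by Monte Carlo, estimating $\mathbf{E} e^{itG}$ for a few values of $t$; the imaginary part vanishes by symmetry, so it suffices to track the real part $\mathbf{E}\cos(tG)$:

```python
import math
import random

rng = random.Random(1)
g = [rng.gauss(0.0, 1.0) for _ in range(100000)]

def char_fn(t, samples):
    # Monte Carlo estimate of Re E[e^{itG}] = E[cos(tG)]
    return sum(math.cos(t * x) for x in samples) / len(samples)

for t in (0.5, 1.0, 2.0):
    print(t, char_fn(t, g), math.exp(-t * t / 2))
```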
More generally, for a random variable $X$ taking values in a real vector space $\mathbf{R}^d$, we define the characteristic function $F_X: \mathbf{R}^d \rightarrow \mathbf{C}$ by

$$F_X(t) := \mathbf{E} e^{it \cdot X} \qquad (6)$$

where $\cdot$ denotes the Euclidean inner product on $\mathbf{R}^d$. One can similarly define the characteristic function on complex vector spaces $\mathbf{C}^d$ by using the complex inner product

$$(z_1, \ldots, z_d) \cdot (w_1, \ldots, w_d) := \mathrm{Re}(z_1 \overline{w_1} + \ldots + z_d \overline{w_d})$$

(or equivalently, by identifying $\mathbf{C}^d$ with $\mathbf{R}^{2d}$ in the usual manner.)
More generally, one can define the characteristic function on any finite dimensional real or complex vector space , by identifying with or . (Strictly speaking, one either has to select an inner product on to do this, or else make the characteristic function defined on the dual space instead of on itself; see for instance my notes on the Fourier transform in general locally compact abelian groups. But we will not need to care about this subtlety in our applications.)
The characteristic function is clearly bounded in magnitude by $1$, and equals $1$ at the origin. By the Lebesgue dominated convergence theorem, $F_X$ is continuous in $t$.
Exercise 9 (Riemann-Lebesgue lemma) Show that if $X$ is an absolutely continuous random variable taking values in $\mathbf{R}^d$ or $\mathbf{C}^d$, then $F_X(t) \rightarrow 0$ as $|t| \rightarrow \infty$. Show that the claim can fail when the absolute continuity hypothesis is dropped.

Exercise 10 Show that the characteristic function $F_X$ of a random variable taking values in $\mathbf{R}^d$ or $\mathbf{C}^d$ is in fact uniformly continuous on its domain.
Let $X$ be a real random variable. If we Taylor expand $e^{itX}$ and formally interchange the series and expectation, we arrive at the heuristic identity

$$F_X(t) = \sum_{k=0}^{\infty} \frac{(it)^k}{k!} \mathbf{E} X^k \qquad (7)$$

which thus interprets the characteristic function of a real random variable $X$ as a kind of generating function for the moments. One rigorous version of this identity is as follows.

Exercise 11 (Taylor expansion of characteristic function) Let $X$ be a real random variable with finite $k^{th}$ moment $\mathbf{E}|X|^k < \infty$ for some $k \geq 1$. Show that $F_X$ is $k$ times continuously differentiable, and one has the partial Taylor expansion

$$F_X(t) = \sum_{j=0}^{k} \frac{(it)^j}{j!} \mathbf{E} X^j + o(|t|^k)$$

where $o(|t|^k)$ is a quantity that goes to zero as $t \rightarrow 0$, times $|t|^k$. In particular, we have

$$\frac{d^j}{dt^j} F_X(t)\Big|_{t=0} = i^j \mathbf{E} X^j$$

for all $0 \leq j \leq k$.
Exercise 12 Establish (7) in the case that $X$ is subgaussian, and show that the series converges locally uniformly in $t$.
Note that the characteristic function depends only on the distribution of $X$: if $X \equiv Y$, then $F_X = F_Y$. The converse statement is true also: if $F_X = F_Y$, then $X \equiv Y$. This follows from a more general (and useful) fact, known as Lévy's continuity theorem.
Theorem 13 (Lévy continuity theorem, special case) Let $V$ be a finite-dimensional real or complex vector space, let $X_n$ be a sequence of $V$-valued random variables, and let $X$ be an additional $V$-valued random variable. Then the following statements are equivalent:
- (i) $F_{X_n}$ converges pointwise to $F_X$.
- (ii) $X_n$ converges in distribution to $X$.
Proof: Without loss of generality we may take $V = \mathbf{R}^d$.

The implication of (i) from (ii) is immediate from (6) and the definition of convergence in distribution (see Definition 10 of Notes 0), since the function $x \mapsto e^{it \cdot x}$ is bounded continuous.

Now suppose that (i) holds, and we wish to show that (ii) holds. By Exercise 23(iv) of Notes 0, it suffices to show that

$$\mathbf{E} \varphi(X_n) \rightarrow \mathbf{E} \varphi(X)$$

whenever $\varphi: \mathbf{R}^d \rightarrow \mathbf{R}$ is a continuous, compactly supported function. By approximating $\varphi$ uniformly by Schwartz functions (e.g. using the Stone-Weierstrass theorem), it suffices to show this for Schwartz functions $\varphi$. But then we have the Fourier inversion formula

$$\varphi(x) = \int_{\mathbf{R}^d} \hat{\varphi}(t) e^{it \cdot x}\ dt$$

where

$$\hat{\varphi}(t) := \frac{1}{(2\pi)^d} \int_{\mathbf{R}^d} \varphi(x) e^{-it \cdot x}\ dx$$

is a Schwartz function, and is in particular absolutely integrable (see e.g. these lecture notes of mine). From the Fubini-Tonelli theorem, we thus have

$$\mathbf{E} \varphi(X_n) = \int_{\mathbf{R}^d} \hat{\varphi}(t) F_{X_n}(t)\ dt \qquad (8)$$

and similarly for $X$. The claim now follows from the Lebesgue dominated convergence theorem.
Remark 14 Setting $X_n := Y$ for all $n$, we see in particular the previous claim that $X \equiv Y$ if and only if $F_X = F_Y$. It is instructive to use the above proof as a guide to prove this claim directly.
Exercise 15 (Lévy's continuity theorem, full version) Let $V$ be a finite-dimensional real or complex vector space, and let $X_n$ be a sequence of $V$-valued random variables. Suppose that $F_{X_n}$ converges pointwise to a limit $F$. Show that the following are equivalent:
- (i) $F$ is continuous at $0$.
- (ii) $X_n$ is a tight sequence.
- (iii) $F$ is the characteristic function of a $V$-valued random variable $X$ (possibly after extending the sample space).
- (iv) $X_n$ converges in distribution to some $V$-valued random variable $X$ (possibly after extending the sample space).
Hint: To get from (ii) to the other conclusions, use Prokhorov's theorem and Theorem 13. To get back to (ii) from (i), use (8) for a suitable Schwartz function $\varphi$. The other implications are easy once Theorem 13 is in hand.
Remark 16 Lévy’s continuity theorem is very similar in spirit to Weyl’s criterion in equidistribution theory.
Exercise 17 (Esséen concentration inequality) Let $X$ be a random variable taking values in $\mathbf{R}^d$. Then for any $r > 0$, $\varepsilon > 0$, show that

$$\sup_{x_0 \in \mathbf{R}^d} \mathbf{P}(|X - x_0| \leq r) \leq C \int_{t \in \mathbf{R}^d: |t| \leq \varepsilon} |F_X(t)|\ dt \qquad (9)$$

for some constant $C$ depending only on $r, \varepsilon$ and $d$. (Hint: Use (8) for a suitable Schwartz function $\varphi$.) The left-hand side of (9) is known as the small ball probability of $X$ at radius $r$.
In Fourier analysis, we learn that the Fourier transform is a particularly well-suited tool for studying convolutions. The probability theory analogue of this fact is that characteristic functions are a particularly well-suited tool for studying sums of independent random variables. More precisely, we have
Exercise 18 (Fourier identities) Let $V$ be a finite-dimensional real or complex vector space, and let $X, Y$ be independent random variables taking values in $V$. Then

$$F_{X+Y}(t) = F_X(t) F_Y(t) \qquad (10)$$

for all $t \in V$. Also, for any scalar $c$, one has

$$F_{cX}(t) = F_X(\overline{c} t)$$

and more generally, for any linear transformation $T: V \rightarrow V$, one has

$$F_{TX}(t) = F_X(T^* t).$$

Remark 19 Note that this identity (10), combined with Exercise 8 and Remark 14, gives a quick alternate proof of Exercise 9 from Notes 1.

In particular, in the normalised setting (4), we have the simple relationship

$$F_{Z_n}(t) = F_X(t/\sqrt{n})^n \qquad (11)$$

that describes the characteristic function of $Z_n$ in terms of that of $X$.
We now have enough machinery to give a quick proof of the central limit theorem:
Proof: (Proof of Theorem 1) We may normalise $X$ to have mean zero and variance $1$. By Exercise 11, we thus have

$$F_X(t) = 1 - t^2/2 + o(|t|^2)$$

for sufficiently small $t$, or equivalently

$$F_X(t) = \exp(-t^2/2 + o(|t|^2))$$

for sufficiently small $t$. Applying (11), we conclude that

$$F_{Z_n}(t) \rightarrow \exp(-t^2/2)$$

as $n \rightarrow \infty$ for any fixed $t$. But by Exercise 8, $\exp(-t^2/2)$ is the characteristic function of the normal distribution $N(0,1)$. The claim now follows from the Lévy continuity theorem.
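For the signed Bernoulli distribution this convergence can be watched concretely: there $F_X(t) = \cos(t)$, so by the relationship (11), $F_{Z_n}(t) = \cos(t/\sqrt{n})^n$, which tends to $e^{-t^2/2}$. A quick check (illustrative only, not from the original notes):

```python
import math

def F_Zn(t, n):
    # characteristic function of Z_n for signed Bernoulli X: F_X(t) = cos(t),
    # so F_{Z_n}(t) = F_X(t/sqrt(n))^n = cos(t/sqrt(n))^n
    return math.cos(t / math.sqrt(n)) ** n

for n in (10, 100, 10000):
    print(n, F_Zn(1.0, n), math.exp(-0.5))
```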
Exercise 20 (Vector-valued central limit theorem) Let $\vec{X} = (X_1, \ldots, X_d)$ be a random variable taking values in $\mathbf{R}^d$ with finite second moment. Define the covariance matrix $\Sigma(\vec{X})$ to be the $d \times d$ matrix whose $ij^{th}$ entry is the covariance $\mathbf{E}[(X_i - \mathbf{E}(X_i))(X_j - \mathbf{E}(X_j))]$.
- Show that the covariance matrix is positive semi-definite real symmetric.
- Conversely, given any positive definite real symmetric $d \times d$ matrix $\Sigma$ and $\mu \in \mathbf{R}^d$, show that the normal distribution $N(\mu, \Sigma)$, given by the absolutely continuous measure

$$\frac{1}{((2\pi)^d \det \Sigma)^{1/2}} e^{-(x - \mu) \cdot \Sigma^{-1} (x - \mu)/2}\ dx,$$

has mean $\mu$ and covariance matrix $\Sigma$, and has a characteristic function given by

$$F(t) = e^{i\mu \cdot t} e^{-t \cdot \Sigma t / 2}.$$

How would one define the normal distribution $N(\mu, \Sigma)$ if $\Sigma$ degenerated to be merely positive semi-definite instead of positive definite?
- If $\vec{S}_n := \vec{X}_1 + \ldots + \vec{X}_n$ is the sum of $n$ iid copies of $\vec{X}$, show that $\frac{\vec{S}_n - n\mu}{\sqrt{n}}$ converges in distribution to $N(0, \Sigma(\vec{X}))$, where $\mu$ is the mean of $\vec{X}$.
Exercise 21 (Complex central limit theorem) Let $X$ be a complex random variable of mean $\mu \in \mathbf{C}$, whose real and imaginary parts have variance $1/2$ each and covariance $0$. Let $X_1, \ldots, X_n$ be iid copies of $X$. Show that as $n \rightarrow \infty$, the normalised sums (1) converge in distribution to the standard complex gaussian $N(0,1)_{\mathbf{C}}$.
Exercise 22 (Lindeberg central limit theorem) Let $X_1, X_2, \ldots$ be a sequence of independent (but not necessarily identically distributed) real random variables, normalised to have mean zero and variance one. Assume the (strong) Lindeberg condition

$$\lim_{N \rightarrow \infty} \limsup_{n \rightarrow \infty} \frac{1}{n} \sum_{j=1}^{n} \mathbf{E} |X_{j,>N}|^2 = 0$$

where $X_{j,>N} := X_j \mathbf{I}(|X_j| > N)$ is the truncation of $X_j$ to large values. Show that as $n \rightarrow \infty$, $Z_n := \frac{X_1 + \ldots + X_n}{\sqrt{n}}$ converges in distribution to $N(0,1)$. (Hint: modify the truncation argument.)
A more sophisticated version of the Fourier-analytic method gives a more quantitative form of the central limit theorem, namely the Berry-Esséen theorem.
Theorem 23 (Berry-Esséen theorem) Let $X$ have mean zero, unit variance, and finite third moment. Let $Z_n := (X_1 + \ldots + X_n)/\sqrt{n}$, where $X_1, \ldots, X_n$ are iid copies of $X$. Then we have

$$\mathbf{P}(Z_n \leq a) = \mathbf{P}(G \leq a) + O\left(\frac{1}{\sqrt{n}} \mathbf{E}|X|^3\right) \qquad (12)$$

uniformly for all $a \in \mathbf{R}$, where $G \equiv N(0,1)$, and the implied constant is absolute.
Proof: (Optional) Write $\rho := \mathbf{E}|X|^3$; our task is to show that

$$\mathbf{P}(Z_n \leq a) = \mathbf{P}(G \leq a) + O(\rho/\sqrt{n})$$

for all $a$. We may of course assume that $\rho/\sqrt{n} \leq 1$, as the claim is trivial otherwise.
Let $\varepsilon > 0$ be a small absolute constant to be chosen later. Let $\eta$ be a non-negative Schwartz function with total mass $1$ whose Fourier transform is compactly supported, and let $\psi$ be the smoothed out version of the indicator function $\mathbf{I}_{(-\infty, 0]}$, defined as the convolution of this indicator with an appropriately rescaled copy of $\eta$.

Observe that $\psi$ is decreasing from $1$ to $0$.
We claim that it suffices to show that

$$\mathbf{E} \psi(Z_n - a) = \mathbf{E} \psi(G - a) + O_\varepsilon(\rho/\sqrt{n}) \qquad (13)$$

for every $a$, where the subscript $\varepsilon$ means that the implied constant depends on $\varepsilon$. Indeed, suppose that (13) held. Define

$$\Delta := \sup_a |\mathbf{P}(Z_n \leq a) - \mathbf{P}(G \leq a)| \qquad (14)$$

thus our task is to show that $\Delta = O(\rho/\sqrt{n})$.
Let $a$ be arbitrary, and let $A$ be a large absolute constant to be chosen later. We write
and thus by (13)
Meanwhile, from (14) and an integration by parts we see that
From the bounded density of $G$ and the rapid decrease of $\eta$ we have
Putting all this together, we see that
A similar argument gives a lower bound
and so
Taking suprema over , we obtain
If $A$ is large enough (depending on $\varepsilon$), we can absorb the latter two terms on the right-hand side into the left-hand side. This gives the desired bound $\Delta = O(\rho/\sqrt{n})$.
It remains to establish (13). Applying (8), it suffices to show that
Now we estimate each of the various expressions. Standard Fourier-analytic computations show that
and that
Since the Fourier transform of $\eta$ was compactly supported, it suffices to show that

$$\int_{|t| \leq \varepsilon \sqrt{n}/\rho} |F_{Z_n}(t) - F_G(t)| \frac{dt}{|t|} = O_\varepsilon\left(\frac{\rho}{\sqrt{n}}\right). \qquad (16)$$
From Taylor expansion we have

$$e^{itX/\sqrt{n}} = 1 + \frac{itX}{\sqrt{n}} - \frac{t^2 X^2}{2n} + O\left(\frac{|t|^3 |X|^3}{n^{3/2}}\right)$$

for any $t$; taking expectations and using the definition of $\rho$ we have

$$F_X(t/\sqrt{n}) = 1 - \frac{t^2}{2n} + O\left(\frac{\rho |t|^3}{n^{3/2}}\right)$$

and in particular

$$F_X(t/\sqrt{n}) = \exp\left(-\frac{t^2}{2n} + O\left(\frac{\rho |t|^3}{n^{3/2}}\right)\right)$$

if $|t| \leq \varepsilon \sqrt{n}/\rho$ and $\varepsilon$ is small enough. Applying (11), we conclude that

$$F_{Z_n}(t) = \exp\left(-\frac{t^2}{2} + O\left(\frac{\rho |t|^3}{\sqrt{n}}\right)\right)$$

if $|t| \leq \varepsilon \sqrt{n}/\rho$. Meanwhile, from Exercise 8 we have $F_G(t) = e^{-t^2/2}$. Elementary calculus then gives us

$$|F_{Z_n}(t) - F_G(t)| = O\left(\frac{\rho |t|^3}{\sqrt{n}} e^{-t^2/4}\right)$$

(say) if $\varepsilon$ is small enough. Inserting this bound into (16) we obtain the claim.
Exercise 24 Show that the error terms here are sharp (up to constants) when $X$ is a signed Bernoulli random variable.
— 3. The moment method —
The above Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the "standard" proof given in probability textbooks. However, it relies quite heavily on the Fourier-analytic identities in Exercise 18, which in turn are extremely dependent on both the commutative nature of the situation (as it uses the identity $e^{it(X+Y)} = e^{itX} e^{itY}$) and on the independence of the situation (as it uses identities of the form $\mathbf{E}(e^{itX} e^{itY}) = (\mathbf{E} e^{itX})(\mathbf{E} e^{itY})$). When we turn to random matrix theory, we will often lose (or be forced to modify) one or both of these properties, which often causes the Fourier-analytic methods to fail spectacularly. Because of this, it is also important to look for non-Fourier based methods to prove results such as the central limit theorem. These methods often lead to proofs that are lengthier and more technical than the Fourier proofs, but also tend to be more robust, and in particular can often be extended to random matrix theory situations. Thus both the Fourier and non-Fourier proofs will be of importance in this course.
The most elementary (but still remarkably effective) method available in this regard is the moment method, which we have already used in the previous notes, which seeks to understand the distribution of a random variable $X$ via its moments $\mathbf{E} X^k$. In principle, this method is equivalent to the Fourier method, through the identity (7); but in practice, the moment method proofs tend to look somewhat different than the Fourier-analytic ones, and it is often more apparent how to modify them to non-independent or non-commutative settings.
We first need an analogue of the Lévy continuity theorem. Here we encounter a technical issue: whereas the Fourier phases $x \mapsto e^{itx}$ were bounded, the moment functions $x \mapsto x^k$ become unbounded at infinity. However, one can deal with this issue as long as one has sufficient decay:
Theorem 25 (Moment continuity theorem) Let $X_n$ be a sequence of uniformly subgaussian real random variables, and let $X$ be another subgaussian random variable. Then the following statements are equivalent:
- (i) For every $k = 0, 1, 2, \ldots$, $\mathbf{E} X_n^k$ converges to $\mathbf{E} X^k$.
- (ii) $X_n$ converges in distribution to $X$.
Proof: We first show how (ii) implies (i). Let $N > 0$ be a truncation parameter, and let $\varphi: \mathbf{R} \rightarrow \mathbf{R}$ be a smooth function that equals $1$ on $[-N, N]$ and vanishes outside of $[-2N, 2N]$. Then for any $k$, the convergence in distribution implies that $\mathbf{E} X_n^k \varphi(X_n)$ converges to $\mathbf{E} X^k \varphi(X)$. On the other hand, from the uniform subgaussian hypothesis, one can make $\mathbf{E} X_n^k (1 - \varphi(X_n))$ and $\mathbf{E} X^k (1 - \varphi(X))$ arbitrarily small for fixed $k$ by making $N$ large enough. Summing, and then letting $N$ go to infinity, we obtain (i).
Conversely, suppose (i) is true. From the uniform subgaussian hypothesis, the $X_n$ have $k^{th}$ moment bounded by $(Ck)^{k/2}$ for all $k \geq 1$ and some $C$ independent of $n$ (see Exercise 4 from Notes 0). From Taylor's theorem with remainder (and Stirling's formula, Notes 0a) we conclude

$$F_{X_n}(t) = \sum_{j=0}^{k} \frac{(it)^j}{j!} \mathbf{E} X_n^j + O\left(\frac{(C'|t|)^k}{k^{k/2}}\right)$$

uniformly in $t$ and $n$, for some $C'$ independent of $n$. Similarly for $X$. Taking limits using (i) we see that

$$\limsup_{n \rightarrow \infty} |F_{X_n}(t) - F_X(t)| = O\left(\frac{(C'|t|)^k}{k^{k/2}}\right).$$

Then letting $k \rightarrow \infty$, keeping $t$ fixed, we see that $F_{X_n}$ converges pointwise to $F_X$ for each $t$, and the claim now follows from the Lévy continuity theorem.
Remark 26 One corollary of Theorem 25 is that the distribution of a subgaussian random variable is uniquely determined by its moments (actually, this could already be deduced from Exercise 12 and Remark 14). The situation can fail for distributions with slower tails, for much the same reason that a smooth function is not determined by its derivatives at one point if that function is not analytic.
The Fourier inversion formula provides an easy way to recover the distribution from the characteristic function. Recovering a distribution from its moments is more difficult, and sometimes requires tools such as analytic continuation; this problem is known as the inverse moment problem and will not be discussed here.
To prove the central limit theorem, we know from the truncation method that we may assume without loss of generality that $X$ is bounded (and in particular subgaussian); we may also normalise $X$ to have mean zero and unit variance. From the Chernoff bound (3) we know that the $Z_n$ are uniformly subgaussian; so by Theorem 25, it suffices to show that

$$\mathbf{E} Z_n^k \rightarrow \mathbf{E} G^k$$

as $n \rightarrow \infty$ for all $k = 0, 1, 2, \ldots$, where $G \equiv N(0,1)$ is a standard gaussian variable.
The moments $\mathbf{E} G^k$ are easy to compute:

Exercise 27 Let $k$ be a natural number, and let $G \equiv N(0,1)$. Show that $\mathbf{E} G^k$ vanishes when $k$ is odd, and is equal to

$$\frac{k!}{2^{k/2} (k/2)!}$$

when $k$ is even. (This can either be done directly by using the Gamma function, or by using Exercise 8 and Exercise 12.)
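For reference, these even moments are the double factorials $(k-1)!! = 1, 3, 15, 105, \ldots$; a small script (illustrative only, not part of the original notes) confirms that the two formulas agree:

```python
import math

def gaussian_moment(k):
    """E[G^k] for G ~ N(0,1): zero for odd k, k!/(2^(k/2) (k/2)!) for even k."""
    if k % 2 == 1:
        return 0
    m = k // 2
    return math.factorial(k) // (2 ** m * math.factorial(m))

def double_factorial(j):
    # j!! = j (j-2) (j-4) ... ; empty product (= 1) for j <= 0
    return 1 if j <= 0 else j * double_factorial(j - 2)

print([gaussian_moment(k) for k in range(0, 9)])  # [1, 0, 1, 0, 3, 0, 15, 0, 105]
```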
So now we need to compute $\mathbf{E} Z_n^k$. Using (4) and linearity of expectation, we can expand this as

$$\mathbf{E} Z_n^k = n^{-k/2} \sum_{1 \leq i_1, \ldots, i_k \leq n} \mathbf{E} X_{i_1} \ldots X_{i_k}.$$

To understand this expression, let us first look at some small values of $k$.

- For $k = 0$, this expression is trivially $1$.
- For $k = 1$, this expression is trivially $0$, thanks to the mean zero hypothesis on $X$.
- For $k = 2$, we can split this expression into the diagonal and off-diagonal components:

$$n^{-1} \sum_{1 \leq i \leq n} \mathbf{E} X_i^2 + n^{-1} \sum_{1 \leq i < j \leq n} 2 \mathbf{E} X_i X_j.$$

Each summand in the first sum is $1$, as $X$ has unit variance. Each summand in the second sum is $0$, as the $X_i$ have mean zero and are independent. So the second moment is $\mathbf{E} Z_n^2 = 1$.
- For $k = 3$, we have a similar expansion

$$n^{-3/2} \sum_{1 \leq i \leq n} \mathbf{E} X_i^3 + n^{-3/2} \sum_{1 \leq i < j \leq n} 3 \mathbf{E} X_i^2 X_j + 3 \mathbf{E} X_i X_j^2 + n^{-3/2} \sum_{1 \leq i < j < l \leq n} 6 \mathbf{E} X_i X_j X_l.$$

The summands in the latter two sums vanish because of the (joint) independence and mean zero hypotheses. The summands in the first sum need not vanish, but are $O(1)$, so the first term is $O(n^{-1/2})$, which is asymptotically negligible, so the third moment $\mathbf{E} Z_n^3$ goes to $0$.
- For $k = 4$, the expansion becomes quite complicated:

$$n^{-2} \sum_{1 \leq i \leq n} \mathbf{E} X_i^4 + n^{-2} \sum_{1 \leq i < j \leq n} 4 \mathbf{E} X_i^3 X_j + 4 \mathbf{E} X_i X_j^3 + 6 \mathbf{E} X_i^2 X_j^2 + \ldots$$

Again, most terms vanish, except for the first sum, which is $O(n^{-1})$ and is asymptotically negligible, and the sum $n^{-2} \sum_{1 \leq i < j \leq n} 6 \mathbf{E} X_i^2 X_j^2$, which by the independence and unit variance assumptions works out to $6 n^{-2} \binom{n}{2} = 3 - O(n^{-1})$. Thus the fourth moment $\mathbf{E} Z_n^4$ goes to $3$ (as it should).
Now we tackle the general case. Ordering the indices $i_1 \leq \ldots \leq i_k$ as $j_1 < \ldots < j_r$ for some $1 \leq r \leq k$, with each $j_s$ occurring with multiplicity $a_s \geq 1$ and using elementary enumerative combinatorics, we see that $\mathbf{E} Z_n^k$ is the sum of all terms of the form

$$n^{-k/2} \sum_{1 \leq j_1 < \ldots < j_r \leq n} c_{k, a_1, \ldots, a_r} \mathbf{E} X_{j_1}^{a_1} \ldots X_{j_r}^{a_r} \qquad (17)$$

where $1 \leq r \leq k$, $a_1, \ldots, a_r$ are positive integers adding up to $k$, and $c_{k, a_1, \ldots, a_r}$ is the multinomial coefficient

$$c_{k, a_1, \ldots, a_r} := \frac{k!}{a_1! \ldots a_r!}.$$

The total number of such terms depends only on $k$ (in fact, it is $2^{k-1}$ (exercise!), though we will not need this fact).

As we already saw from the small examples, most of the terms vanish, and many of the other terms are negligible in the limit $n \rightarrow \infty$. Indeed, if any of the $a_s$ are equal to $1$, then every summand in (17) vanishes, by joint independence and the mean zero hypothesis. Thus, we may restrict attention to those expressions (17) for which all the $a_s$ are at least $2$. Since the $a_s$ sum up to $k$, we conclude that $r$ is at most $k/2$.

On the other hand, the total number of summands in (17) is clearly at most $n^r$ (in fact it is $\binom{n}{r}$), and the summands are bounded (for fixed $k$) since $X$ is bounded. Thus, if $r$ is strictly less than $k/2$, then the expression in (17) is $O(n^{r - k/2})$ and goes to zero as $n \rightarrow \infty$. So, asymptotically, the only terms (17) which are still relevant are those for which $r$ is equal to $k/2$. This already shows that $\mathbf{E} Z_n^k$ goes to zero when $k$ is odd. When $k$ is even, the only surviving term in the limit is now when $r = k/2$ and $a_1 = \ldots = a_r = 2$. But then by independence and unit variance, the expectation in (17) is $1$, and so this term is equal to

$$n^{-k/2} \binom{n}{k/2} \frac{k!}{2^{k/2}} = (1 + o(1)) \frac{k!}{2^{k/2} (k/2)!}$$

and the main term $\frac{k!}{2^{k/2} (k/2)!}$ is happily equal to the moment $\mathbf{E} G^k$ as computed in Exercise 27.
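One can watch these asymptotics numerically. For the signed Bernoulli distribution the moments $\mathbf{E} Z_n^k$ can be computed exactly from the binomial distribution, and they indeed approach the gaussian moments $1, 0, 3, 0, 15, \ldots$ (a quick sketch, not part of the original notes):

```python
import math

def bernoulli_Zn_moment(k, n):
    # exact E[Z_n^k] for signed Bernoulli summands: S_n takes the value n - 2j
    # with probability C(n, j)/2^n, and Z_n = S_n/sqrt(n)
    return sum((math.comb(n, j) / 2 ** n) * ((n - 2 * j) / math.sqrt(n)) ** k
               for j in range(n + 1))

for k in (2, 3, 4, 6):
    print(k, bernoulli_Zn_moment(k, 500))
```

(For instance, the exact fourth moment here is $3 - 2/n$, consistent with the $k = 4$ computation above.)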
— 4. The Lindeberg swapping trick —
The moment method proof of the central limit theorem that we just gave consisted of four steps:
- (Truncation and normalisation step) A reduction to the case when $X$ was bounded with zero mean and unit variance.
- (Inverse moment step) A reduction to a computation of asymptotic moments $\lim_{n \rightarrow \infty} \mathbf{E} Z_n^k$.
- (Analysis step) Showing that most terms in the expansion of this asymptotic moment were zero, or went to zero as $n \rightarrow \infty$.
- (Algebra step) Using enumerative combinatorics to compute the remaining terms in the expansion.

In this particular case, the enumerative combinatorics was very classical and easy – it was basically asking for the number of ways one can place $k$ balls in $r$ boxes, so that the $s^{th}$ box contains $a_s$ balls, and the answer is well known to be given by the multinomial coefficient $\frac{k!}{a_1! \ldots a_r!}$. By a small algebraic miracle, this result matched up nicely with the computation of the moments of the gaussian $G$.
However, when we apply the moment method to more advanced problems, the enumerative combinatorics can become more non-trivial, requiring a fair amount of combinatorial and algebraic computation. The algebraic miracle that occurs at the end of the argument can then seem like a very fortunate but inexplicable coincidence, making the argument somehow unsatisfying despite being rigorous.
In a 1922 paper, Lindeberg observed that there was a very simple way to decouple the algebraic miracle from the analytic computations, so that all relevant algebraic identities only need to be verified in the special case of gaussian random variables, in which everything is much easier to compute. This Lindeberg swapping trick (or Lindeberg replacement trick) will be very useful in the later theory of random matrices, so we pause to give it here in the simple context of the central limit theorem.
The basic idea is as follows. We repeat the truncation-and-normalisation and inverse moment steps in the preceding argument. Thus, $X_1, \ldots, X_n$ are iid copies of a bounded real random variable $X$ of mean zero and unit variance, and we wish to show that $\mathbf{E} Z_n^k \rightarrow \mathbf{E} G^k$, where $Z_n := (X_1 + \ldots + X_n)/\sqrt{n}$, and where $k$ is fixed.

Now let $Y_1, \ldots, Y_n$ be iid copies of the gaussian itself: $Y_1, \ldots, Y_n \equiv G \equiv N(0,1)$. Because the sum of independent gaussians is again a gaussian (Exercise 9 from Notes 1), we see that the random variable

$$W_n := \frac{Y_1 + \ldots + Y_n}{\sqrt{n}}$$

already has the same distribution as $G$: $W_n \equiv G$. Thus, it suffices to show that

$$\mathbf{E} Z_n^k = \mathbf{E} W_n^k + o(1).$$
Now we perform the analysis part of the moment method argument again. We can expand $\mathbf{E} Z_n^k$ into terms (17) as before, and discard all terms except for the $a_1 = \ldots = a_r = 2$ term as being $o(1)$. Similarly, we can expand $\mathbf{E} W_n^k$ into very similar terms (but with the $X_j$ replaced by $Y_j$) and again discard all but the $a_1 = \ldots = a_r = 2$ term.

But by hypothesis, the second moments of $X$ and $Y$ match: $\mathbf{E} X^2 = \mathbf{E} Y^2 = 1$. Thus, by joint independence, the $a_1 = \ldots = a_r = 2$ term (17) for $\mathbf{E} Z_n^k$ is exactly equal to that of $\mathbf{E} W_n^k$. And the claim follows.

This is almost exactly the same proof as in the previous section, but note that we did not need to compute the multinomial coefficient $c_{k, a_1, \ldots, a_r}$, nor did we need to verify the miracle that this coefficient matched (up to normalising factors) to the moments of the gaussian. Instead, we used the much more mundane "miracle" that the sum of independent gaussians was again a gaussian.
To put it another way, the Lindeberg replacement trick factors a universal limit theorem, such as the central limit theorem, into two components:
- A universality or invariance result, which shows that the distribution (or other statistics, such as moments) of some random variable $F(X_1, \ldots, X_n)$ is asymptotically unchanged in the limit $n \rightarrow \infty$ if each of the input variables $X_i$ are replaced by a gaussian substitute $Y_i$; and
- The gaussian case, which computes the asymptotic distribution (or other statistic) of $F(Y_1, \ldots, Y_n)$ in the case when $Y_1, \ldots, Y_n$ are all gaussians.

The former type of result tends to be entirely analytic in nature (basically, one just needs to show that all error terms that show up when swapping $X_i$ with $Y_i$ are $o(1)$), while the latter type of result tends to be entirely algebraic in nature (basically, one just needs to exploit the many pleasant algebraic properties of gaussians). This decoupling of the analysis and algebra steps tends to simplify both, both at a technical level and at a conceptual level.
— 5. Individual swapping —
In the above argument, we swapped all the original input variables with gaussians en masse. There is also a variant of the Lindeberg trick in which the swapping is done individually. To illustrate the individual swapping method, let us use it to show the following weak version of the Berry-Esséen theorem:
Theorem 28 (Berry-Esséen theorem, weak form) Let $X$ have mean zero, unit variance, and finite third moment, and let $\varphi: \mathbf{R} \rightarrow \mathbf{R}$ be smooth with uniformly bounded derivatives up to third order. Let $Z_n := (X_1 + \ldots + X_n)/\sqrt{n}$, where $X_1, \ldots, X_n$ are iid copies of $X$. Then we have

$$\mathbf{E} \varphi(Z_n) = \mathbf{E} \varphi(G) + O\left(\frac{1}{\sqrt{n}} \left(\sup_x |\varphi'''(x)|\right) \mathbf{E}|X|^3\right)$$

where $G \equiv N(0,1)$.
Proof: Let $Y_1, \ldots, Y_n$ and $W_n$ be as in the previous section. As $W_n \equiv G$, it suffices to show that

$$\mathbf{E} \varphi(Z_n) - \mathbf{E} \varphi(W_n) = O\left(\frac{1}{\sqrt{n}} \left(\sup_x |\varphi'''(x)|\right) \mathbf{E}|X|^3\right).$$

We telescope this (using linearity of expectation) as

$$\mathbf{E} \varphi(Z_n) - \mathbf{E} \varphi(W_n) = \sum_{i=1}^{n} \mathbf{E} \varphi(Z_{n,i}) - \mathbf{E} \varphi(Z_{n,i-1})$$

where

$$Z_{n,i} := \frac{X_1 + \ldots + X_i + Y_{i+1} + \ldots + Y_n}{\sqrt{n}} \qquad (19)$$

is a partially swapped version of $Z_n$ (thus $Z_{n,n} = Z_n$ and $Z_{n,0} = W_n$). So it will suffice to show that

$$\mathbf{E} \varphi(Z_{n,i}) - \mathbf{E} \varphi(Z_{n,i-1}) = O\left(\frac{1}{n^{3/2}} \left(\sup_x |\varphi'''(x)|\right) \mathbf{E}|X|^3\right)$$

uniformly for $1 \leq i \leq n$.

Write $U_{n,i} := \frac{X_1 + \ldots + X_{i-1} + Y_{i+1} + \ldots + Y_n}{\sqrt{n}}$, so that $Z_{n,i} = U_{n,i} + \frac{X_i}{\sqrt{n}}$ and $Z_{n,i-1} = U_{n,i} + \frac{Y_i}{\sqrt{n}}$. To exploit this, we use Taylor expansion with remainder to write

$$\varphi(Z_{n,i}) = \varphi(U_{n,i}) + \frac{X_i}{\sqrt{n}} \varphi'(U_{n,i}) + \frac{X_i^2}{2n} \varphi''(U_{n,i}) + O\left(\frac{|X_i|^3}{n^{3/2}} \sup_x |\varphi'''(x)|\right)$$

and

$$\varphi(Z_{n,i-1}) = \varphi(U_{n,i}) + \frac{Y_i}{\sqrt{n}} \varphi'(U_{n,i}) + \frac{Y_i^2}{2n} \varphi''(U_{n,i}) + O\left(\frac{|Y_i|^3}{n^{3/2}} \sup_x |\varphi'''(x)|\right)$$

where the implied constants are absolute. Now, by construction, the moments of $X_i$ and $Y_i$ match to second order ($\mathbf{E} X_i = \mathbf{E} Y_i = 0$ and $\mathbf{E} X_i^2 = \mathbf{E} Y_i^2 = 1$), and $X_i, Y_i$ are independent of $U_{n,i}$; thus, taking expectations and subtracting, only the third-order error terms survive, and the claim follows. (Note from Hölder's inequality that $\mathbf{E}|X|^3 \geq (\mathbf{E}|X|^2)^{3/2} = 1$, so the gaussian third moment $\mathbf{E}|Y|^3 = O(1)$ can be absorbed into the $\mathbf{E}|X|^3$ term.)
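The decay in this estimate can be observed directly. The sketch below (illustrative only; the test function and quadrature are my own choices, not from the notes) computes $\mathbf{E}\varphi(Z_n)$ exactly for signed Bernoulli summands and compares it with $\mathbf{E}\varphi(G)$ evaluated by a simple midpoint rule; the discrepancy visibly shrinks as $n$ grows (in this symmetric example the third moments also match, so the decay is in fact faster than $1/\sqrt{n}$):

```python
import math

def phi(x):
    # a smooth bounded test function with bounded derivatives
    return 1.0 / (1.0 + x * x)

def E_phi_Zn(n):
    # exact E[phi(Z_n)] for signed Bernoulli summands:
    # S_n = n - 2j with probability C(n, j)/2^n, and Z_n = S_n/sqrt(n)
    return sum((math.comb(n, j) / 2 ** n) * phi((n - 2 * j) / math.sqrt(n))
               for j in range(n + 1))

def E_phi_G(steps=4000):
    # E[phi(G)] for G ~ N(0,1) via a midpoint rule on [-8, 8]
    h = 16.0 / steps
    total = sum(phi(-8.0 + (j + 0.5) * h) * math.exp(-(-8.0 + (j + 0.5) * h) ** 2 / 2)
                for j in range(steps))
    return total * h / math.sqrt(2 * math.pi)

target = E_phi_G()
errs = {n: abs(E_phi_Zn(n) - target) for n in (10, 100, 1000)}
print(errs)
```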
Remark 29 The above argument relied on Taylor expansion, and the hypothesis that the moments of $X_i$ and $Y_i$ matched to second order. It is not hard to see that if we assume more moments matching (e.g. $\mathbf{E} X^3 = \mathbf{E} G^3$), and more smoothness on $\varphi$, we see that we can improve the factor $\frac{1}{\sqrt{n}}$ on the right-hand side. Thus we see that we expect swapping methods to become more powerful when more moments are matching. We will see this when we discuss the four moment theorem of Van Vu and myself in later lectures, which (very) roughly speaking asserts that the spectral statistics of two random matrices are asymptotically indistinguishable if their coefficients have matching moments to fourth order.
Theorem 28 is easily implied by Theorem 23 and an integration by parts. In the reverse direction, let us see what Theorem 28 tells us about the cumulative distribution function

$$\mathbf{P}(Z_n \leq a)$$

of $Z_n$. For any $\varepsilon > 0$, one can upper bound this expression by

$$\mathbf{E} \varphi(Z_n)$$

where $\varphi$ is a smooth function equal to $1$ on $(-\infty, a]$ that vanishes outside of $(-\infty, a + \varepsilon]$, and has third derivative $O(\varepsilon^{-3})$. By Theorem 28, we thus have

$$\mathbf{P}(Z_n \leq a) \leq \mathbf{E} \varphi(G) + O\left(\frac{1}{\varepsilon^3 \sqrt{n}} \mathbf{E}|X|^3\right).$$

On the other hand, as $G$ has a bounded probability density function, we have

$$\mathbf{E} \varphi(G) = \mathbf{P}(G \leq a) + O(\varepsilon)$$

and so

$$\mathbf{P}(Z_n \leq a) \leq \mathbf{P}(G \leq a) + O(\varepsilon) + O\left(\frac{1}{\varepsilon^3 \sqrt{n}} \mathbf{E}|X|^3\right).$$

A very similar argument gives the matching lower bound, thus

$$\mathbf{P}(Z_n \leq a) = \mathbf{P}(G \leq a) + O(\varepsilon) + O\left(\frac{1}{\varepsilon^3 \sqrt{n}} \mathbf{E}|X|^3\right).$$

Optimising in $\varepsilon$ (choosing $\varepsilon \sim (\mathbf{E}|X|^3 / \sqrt{n})^{1/4}$) we conclude that

$$\mathbf{P}(Z_n \leq a) = \mathbf{P}(G \leq a) + O\left(\left(\frac{\mathbf{E}|X|^3}{\sqrt{n}}\right)^{1/4}\right).$$

Comparing this with Theorem 23 we see that we have lost an exponent, obtaining $(\mathbf{E}|X|^3/\sqrt{n})^{1/4}$ instead of $\mathbf{E}|X|^3/\sqrt{n}$. In our applications to random matrices, this type of loss is acceptable, and so the swapping argument is a reasonable substitute for the Fourier-analytic one in this case. Also, this method is quite robust, and in particular extends well to higher dimensions; we will return to this point in later lectures, but see for instance Appendix D of this paper of myself and Van Vu for an example of a multidimensional Berry-Esséen theorem proven by this method.
On the other hand there is another method that can recover this loss while still avoiding Fourier-analytic techniques; we turn to this topic next.
— 6. Stein’s method —
Stein’s method, introduced by Charles Stein in 1970 (who should not be confused with a number of other eminent mathematicians with this surname, including my advisor), is a powerful method to show convergence in distribution to a special distribution, such as the gaussian. In several recent papers, this method has been used to control several expressions of interest in random matrix theory (e.g. the distribution of moments, or of the Stieltjes transform.) We will not use it much in this course, but this method is of independent interest, so I will briefly discuss (a very special case of) it here.
The probability density function

$$\rho(x) := \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

of the standard normal distribution can be viewed as a solution to the ordinary differential equation

$$\rho'(x) + x \rho(x) = 0.$$

One can take adjoints of this, and conclude (after an integration by parts) that $\rho$ obeys the integral identity

$$\int_{\mathbf{R}} \rho(x) (f'(x) - x f(x))\ dx = 0$$

for any continuously differentiable $f$ with both $f$ and $f'$ bounded (one can relax these assumptions somewhat). To put it another way, if $G \equiv N(0,1)$, then we have

$$\mathbf{E} f'(G) - \mathbf{E} G f(G) = 0 \qquad (22)$$

whenever $f$ is continuously differentiable with both $f$ and $f'$ bounded.

It turns out that the converse is true: if $X$ is a real random variable with the property that

$$\mathbf{E} f'(X) - \mathbf{E} X f(X) = 0$$

whenever $f$ is continuously differentiable with both $f$ and $f'$ bounded, then $X$ is Gaussian. In fact, more is true, in the spirit of Theorem 13 and Theorem 25:
Theorem 30 (Stein continuity theorem) Let $X_n$ be a sequence of real random variables with uniformly bounded second moment, and let $G \equiv N(0,1)$. Then the following are equivalent:
- (i) $\mathbf{E} f'(X_n) - \mathbf{E} X_n f(X_n)$ converges to zero whenever $f$ is continuously differentiable with both $f$ and $f'$ bounded.
- (ii) $X_n$ converges in distribution to $G$.
Proof: To show that (ii) implies (i), it is not difficult to use the uniform bounded second moment hypothesis and a truncation argument to show that $\mathbf{E} f'(X_n) - \mathbf{E} X_n f(X_n)$ converges to $\mathbf{E} f'(G) - \mathbf{E} G f(G)$ when $f$ is continuously differentiable with both $f$ and $f'$ bounded, and the claim then follows from (22).

Now we establish the converse. It suffices to show that

$$\mathbf{E} \varphi(X_n) - \mathbf{E} \varphi(G) \rightarrow 0$$

whenever $\varphi: \mathbf{R} \rightarrow \mathbf{R}$ is a bounded Lipschitz function. We may normalise $\varphi$ to be bounded in magnitude by $1$.
Trivially, the function has zero expectation when one substitutes for the argument , thus
Comparing this with (22), one may thus hope to find a representation of the form
$$\varphi(x) - \mathbf{E}\varphi(G) = f'(x) - x f(x) \qquad (24)$$
for some continuously differentiable $f$ with both $f$ and $f'$ bounded. This is a simple ODE and can be easily solved (by the method of integrating factors) to give a solution $f$, namely
$$f(x) := e^{x^2/2} \int_{-\infty}^x \big( \varphi(y) - \mathbf{E}\varphi(G) \big) e^{-y^2/2}\, dy. \qquad (25)$$
(One could dub $f$ the Stein transform of $\varphi$, although this term does not seem to be in widespread use.) By the fundamental theorem of calculus, $f$ is continuously differentiable and solves (24). Using (23), we may also write $f$ as
$$f(x) = -e^{x^2/2} \int_x^{\infty} \big( \varphi(y) - \mathbf{E}\varphi(G) \big) e^{-y^2/2}\, dy. \qquad (26)$$
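As an aside, the Stein transform is easy to explore numerically. The following sketch (my own illustration, not part of the argument) computes $f(x) = e^{x^2/2}\int_{-\infty}^x (\varphi(y) - \mathbf{E}\varphi(G))\, e^{-y^2/2}\,dy$ by quadrature for the odd test function $\varphi = \arctan$ (so that $\mathbf{E}\varphi(G) = 0$ by symmetry), and checks the defining ODE $f'(x) - x f(x) = \varphi(x) - \mathbf{E}\varphi(G)$ by finite differences:

```python
import math

# Stein transform of phi(x) = atan(x) (for which E phi(G) = 0 by symmetry),
# computed by trapezoidal quadrature on (-inf, x], truncated at lo = -10.
def stein_transform(phi, x, lo=-10.0, n=20000):
    h = (x - lo) / n
    ys = [lo + i * h for i in range(n + 1)]
    vals = [phi(y) * math.exp(-y * y / 2) for y in ys]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return math.exp(x * x / 2) * integral

x, eps = 0.7, 1e-3
fp = (stein_transform(math.atan, x + eps) - stein_transform(math.atan, x - eps)) / (2 * eps)
# f'(x) - x f(x) should equal phi(x) = atan(x) here.
residual = fp - x * stein_transform(math.atan, x) - math.atan(x)
assert abs(residual) < 1e-3
```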
By completing the square, we see that $e^{-y^2/2} = e^{-x^2/2} e^{-x(y-x)} e^{-(y-x)^2/2}$. Inserting this into (25) and using the bounded nature of $\varphi$, we conclude that $f(x) = O(1/|x|)$ for $x \leq -1$; inserting it instead into (26), we have $f(x) = O(1/|x|)$ for $x \geq 1$. Finally, easy estimates give $f(x) = O(1)$ for $-1 \leq x \leq 1$. Thus for all $x$ we have
$$f(x) = O\Big( \frac{1}{1+|x|} \Big),$$
which when inserted back into (24) gives the boundedness of $f'$ (and also of course gives the boundedness of $f$). In fact, if we rewrite (26) as
$$f(x) = -\int_0^{\infty} \big( \varphi(x+s) - \mathbf{E}\varphi(G) \big) e^{-sx} e^{-s^2/2}\, ds,$$
we see on differentiation under the integral sign (and using the Lipschitz nature of $\varphi$) that $f'(x) = O(1/|x|)$ for $x \geq 1$; a similar manipulation (starting from (25)) applies for $x \leq -1$, and we in fact conclude that $f'(x) = O\big( \frac{1}{1+|x|} \big)$ for all $x$.
Applying (24) with $x$ replaced by $X_n$ and taking expectations, we have
$$\mathbf{E} \varphi(X_n) - \mathbf{E} \varphi(G) = \mathbf{E} f'(X_n) - \mathbf{E} X_n f(X_n).$$
By the hypothesis (i), the right-hand side goes to zero, hence the left-hand side does also, and the claim follows.
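To see the equivalence in Theorem 30 at work numerically (an illustration of mine, not from the notes), one can compute the Stein expression $\mathbf{E}\big(X f(X) - f'(X)\big)$ exactly when $X = Z_n$ is the normalised sum of Rademacher ($\pm 1$) signs, using the binomial distribution and the test function $f = \sin$; it visibly decays as $n$ grows:

```python
import math
from math import comb

# Exact Stein discrepancy E[ Z_n f(Z_n) - f'(Z_n) ] with f = sin, f' = cos,
# for Z_n = (X_1 + ... + X_n)/sqrt(n) and X_i = +-1 fair signs.
def stein_discrepancy(n):
    total = 0.0
    for k in range(n + 1):                 # k = number of +1 signs
        x = (2 * k - n) / math.sqrt(n)
        p = comb(n, k) / 2.0 ** n
        total += p * (x * math.sin(x) - math.cos(x))
    return total

d4, d400 = stein_discrepancy(4), stein_discrepancy(400)
assert abs(d400) < abs(d4)   # the discrepancy shrinks as n grows
assert abs(d400) < 1e-2
```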
The above theorem gave only a qualitative result (convergence in distribution), but the proof is quite quantitative, and can in particular be used to give Berry-Esséen type results. To illustrate this, we begin with a strengthening of Theorem 28 that reduces the number of derivatives of $\varphi$ that need to be controlled:
Theorem 31 (Berry-Esséen theorem, less weak form) Let $X$ have mean zero, unit variance, and finite third moment, and let $\varphi$ be smooth, bounded in magnitude by $1$, and Lipschitz. Let $Z_n := (X_1 + \dots + X_n)/\sqrt{n}$, where $X_1,\dots,X_n$ are iid copies of $X$. Then we have
$$\mathbf{E} \varphi(Z_n) - \mathbf{E} \varphi(G) = O\Big( \frac{1}{\sqrt{n}}\, \mathbf{E}|X|^3\, \big( 1 + \|\varphi\|_{\mathrm{Lip}} \big) \Big),$$
where $G \equiv N(0,1)$.
Proof: Let $f$ be the Stein transform (25) of $\varphi$; then by (24) we have
$$\mathbf{E} \varphi(Z_n) - \mathbf{E} \varphi(G) = \mathbf{E} f'(Z_n) - \mathbf{E} Z_n f(Z_n). \qquad (27)$$
We expand
$$\mathbf{E} Z_n f(Z_n) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \mathbf{E}\, X_i f(Z_n).$$
For each $i$, we then split $Z_n = Z_{n;i} + \frac{X_i}{\sqrt{n}}$, where $Z_{n;i} := Z_n - \frac{X_i}{\sqrt{n}}$ (cf. (19)). By the fundamental theorem of calculus, we have
$$\mathbf{E}\, X_i f(Z_n) = \mathbf{E}\, X_i f(Z_{n;i}) + \frac{1}{\sqrt{n}}\, \mathbf{E}\, X_i^2 f'\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big),$$
where $t$ is uniformly distributed in $[0,1]$ and independent of all of the $X_j$. Now observe that $X_i$ and $Z_{n;i}$ are independent, and $X_i$ has mean zero, so the first term on the right-hand side vanishes. Thus
$$\mathbf{E} Z_n f(Z_n) = \frac{1}{n} \sum_{i=1}^n \mathbf{E}\, X_i^2 f'\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big). \qquad (28)$$
Another application of independence gives
$$\mathbf{E} f'(Z_{n;i}) = \mathbf{E}\, X_i^2 f'(Z_{n;i}),$$
so we may rewrite (28) as
$$\mathbf{E} Z_n f(Z_n) - \mathbf{E} f'(Z_n) = \frac{1}{n} \sum_{i=1}^n \mathbf{E}\, X_i^2 \Big( f'\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) - f'(Z_{n;i}) \Big) + \mathbf{E}\big( f'(Z_{n;i}) - f'(Z_n) \big).$$
Recall from the proof of Theorem 30 that $f(x) = O(1/(1+|x|))$ and $f'(x) = O(1/(1+|x|))$. By the product rule, this implies that $x f(x)$ has a Lipschitz constant of $O(1)$. Applying (24) and the Lipschitz nature of $\varphi$, we conclude that $f'$ has a Lipschitz constant of $O(1 + \|\varphi\|_{\mathrm{Lip}})$. Thus we can bound the previous expression as
$$O\Big( \frac{1 + \|\varphi\|_{\mathrm{Lip}}}{n^{3/2}} \sum_{i=1}^n \big( \mathbf{E} |X_i|^3 + \mathbf{E} |X_i| \big) \Big),$$
and the claim follows from Hölder’s inequality.
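The $O(n^{-1/2})$-type decay for Lipschitz test functions can be observed numerically. In this sketch (my own, with an arbitrarily chosen skewed distribution), $X$ takes the value $2$ with probability $0.2$ and $-1/2$ with probability $0.8$, so that $X$ has mean zero, unit variance, and $\mathbf{E} X^3 = 3/2 \neq 0$; with $\varphi = \sin$ (for which $\mathbf{E}\sin(G) = 0$ by symmetry), the error $|\mathbf{E}\sin(Z_n)|$ shrinks as $n$ grows:

```python
import math
from math import comb

# |E sin(Z_n)| computed exactly from the binomial distribution, for the
# skewed variable X in {2, -1/2} with P(X = 2) = 0.2 (mean 0, variance 1).
def smooth_error(n, p=0.2):
    total = 0.0
    for k in range(n + 1):                 # k = number of X_i equal to 2
        x = (2.5 * k - 0.5 * n) / math.sqrt(n)
        total += comb(n, k) * p ** k * (1 - p) ** (n - k) * math.sin(x)
    return abs(total)

e25, e400 = smooth_error(25), smooth_error(400)
assert e400 < e25    # roughly a factor sqrt(400/25) = 4 improvement
assert e400 < 0.02
```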
This improvement already reduces the loss in (20). But one can do better still by pushing the arguments further. Let us illustrate this in the model case when the $X_i$ not only have bounded third moment, but are in fact bounded:
Theorem 32 (Berry-Esséen theorem, bounded case) Let $X$ have mean zero, unit variance, and be bounded by $O(1)$. Let $Z_n := (X_1 + \dots + X_n)/\sqrt{n}$, where $X_1,\dots,X_n$ are iid copies of $X$. Then we have
$$\mathbf{P}(Z_n < a) = \mathbf{P}(G < a) + O\Big( \frac{1}{\sqrt{n}} \Big)$$
whenever $a = O(1)$, where $G \equiv N(0,1)$.
Proof: Write $\varphi := 1_{(-\infty, a)}$; thus we seek to show that
$$\mathbf{E} \varphi(Z_n) - \mathbf{E} \varphi(G) = O\Big( \frac{1}{\sqrt{n}} \Big).$$
Let $f$ be the Stein transform (25) of $\varphi$. $\varphi$ is not continuous, but it is not difficult to see (e.g. by a limiting argument) that we still have the estimates $f(x) = O(1/(1+|x|))$ and $f'(x) = O(1)$ (in a weak sense), and that $x f(x)$ has a Lipschitz norm of $O(1)$ (here we use the hypothesis $a = O(1)$). A similar limiting argument gives
$$\mathbf{E} \varphi(Z_n) - \mathbf{E} \varphi(G) = \mathbf{E} f'(Z_n) - \mathbf{E} Z_n f(Z_n),$$
and by arguing as in the proof of Theorem 31, we can write the right-hand side as
$$\frac{1}{n} \sum_{i=1}^n \mathbf{E}\, f'(Z_n) - \mathbf{E}\, X_i^2\, f'\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big).$$
From (24), $f'$ is equal to $\varphi$, plus a function $g$ with Lipschitz norm $O(1)$. Thus, we can write the above expression as
$$\frac{1}{n} \sum_{i=1}^n \mathbf{E}\, \varphi(Z_n) - \mathbf{E}\, X_i^2\, \varphi\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) + \mathbf{E}\, g(Z_n) - \mathbf{E}\, X_i^2\, g\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big).$$
The $g$ terms cancel to within $O(1/\sqrt{n})$ (due to the independence of $X_i$ and $Z_{n;i}$, and the normalised mean and variance of $X_i$), so we can simplify this as
$$\frac{1}{n} \sum_{i=1}^n \mathbf{E}\, \varphi(Z_n) - \mathbf{E}\, X_i^2\, \varphi\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) + O\Big( \frac{1}{\sqrt{n}} \Big),$$
and so we conclude that
$$\mathbf{E}\, \varphi(Z_n) - \mathbf{E}\, \varphi(G) = \frac{1}{n} \sum_{i=1}^n \mathbf{E}\, \varphi(Z_n) - \mathbf{E}\, X_i^2\, \varphi\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) + O\Big( \frac{1}{\sqrt{n}} \Big).$$
Since $X_i$ and $t$ are bounded, and $\varphi$ is non-increasing, we have
$$\varphi\Big( Z_{n;i} + \frac{C}{\sqrt{n}} \Big) \leq \varphi\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) \leq \varphi\Big( Z_{n;i} - \frac{C}{\sqrt{n}} \Big)$$
for some absolute constant $C$; applying the second inequality and using independence to once again eliminate the $X_i^2$ factor, we see that
$$\mathbf{E}\, X_i^2\, \varphi\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) \leq \mathbf{E}\, \varphi\Big( Z_{n;i} - \frac{C}{\sqrt{n}} \Big),$$
which implies (by another appeal to the non-increasing nature of $\varphi$ and the bounded nature of $X_i$) that
$$\mathbf{E}\, X_i^2\, \varphi\Big( Z_{n;i} + t \frac{X_i}{\sqrt{n}} \Big) \leq \mathbf{E}\, \varphi\Big( Z_n - \frac{2C}{\sqrt{n}} \Big),$$
or in other words that
$$\mathbf{P}(G < a) \leq \mathbf{P}\Big( Z_n < a + \frac{2C}{\sqrt{n}} \Big) + O\Big( \frac{1}{\sqrt{n}} \Big).$$
Similarly, using the lower bound inequalities, one has
$$\mathbf{P}(G < a) \geq \mathbf{P}\Big( Z_n < a - \frac{2C}{\sqrt{n}} \Big) + O\Big( \frac{1}{\sqrt{n}} \Big).$$
Moving $a$ up and down by $\frac{2C}{\sqrt{n}}$, and using the bounded density of $G$, we obtain the claim.
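One can watch the Kolmogorov distance $\sup_a |\mathbf{P}(Z_n < a) - \mathbf{P}(G < a)|$ decay in the simplest bounded case, the Rademacher one, where it is computable exactly (again a numerical illustration of mine rather than part of the proof):

```python
import math
from math import comb

# Exact sup-norm distance between the CDF of the normalised Rademacher sum
# and the standard normal CDF; Berry-Esseen predicts O(n^{-1/2}) decay.
def kolmogorov_distance(n):
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2)))
    cum, worst = 0.0, 0.0
    for k in range(n + 1):
        x = (2 * k - n) / math.sqrt(n)
        p = comb(n, k) / 2.0 ** n
        worst = max(worst, abs(cum - Phi(x)))  # just below the atom at x
        cum += p
        worst = max(worst, abs(cum - Phi(x)))  # at the atom
    return worst

d100, d900 = kolmogorov_distance(100), kolmogorov_distance(900)
assert d900 < d100
assert d900 < 0.03
```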
Actually, one can use Stein’s method to obtain the full Berry-Esséen theorem, but the computations get somewhat technical, requiring an induction on $n$ to deal with the contribution of the exceptionally large values of $a$: see this paper of Barbour and Hall.
— 7. Predecessor comparison —
Suppose one had never heard of the normal distribution, but one still suspected the existence of the central limit theorem – thus, one thought that the distributions of the normalised sums $Z_n$ were converging to something, but was unsure what the limit was. Could one still work out what that limit was?
Certainly in the case of Bernoulli distributions, one could work explicitly using Stirling’s formula (see Exercise 3), and the Fourier-analytic method would also eventually work. Let us now give a third way to (heuristically) derive the normal distribution as the limit of the central limit theorem. The idea is to compare $Z_n$ with its predecessor $Z_{n-1}$, using the recursive formula
$$Z_n = \sqrt{\frac{n-1}{n}}\, Z_{n-1} + \frac{X_n}{\sqrt{n}} \qquad (30)$$
(normalising $X_n$ to have mean zero and unit variance as usual; let us also truncate $X_n$ to be bounded, for simplicity). Let us hypothesise that $Z_{n-1}$ and $Z_n$ have approximately the same distribution; let us also conjecture that this distribution is absolutely continuous, given as $\rho(x)\,dx$ for some smooth $\rho$. (If we secretly knew the central limit theorem, we would know that $\rho(x)$ is in fact $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, but let us pretend that we did not yet know this fact.) Thus, for any test function $\varphi$, we expect
$$\mathbf{E}\, \varphi(Z_{n-1}),\ \mathbf{E}\, \varphi(Z_n) \approx \int_{\mathbf R} \varphi(x)\, \rho(x)\, dx. \qquad (31)$$
Now let us try to combine this with (30). We assume $\varphi$ to be smooth, and Taylor expand to third order:
$$\varphi(Z_n) = \varphi\Big( \sqrt{\frac{n-1}{n}}\, Z_{n-1} \Big) + \frac{X_n}{\sqrt{n}}\, \varphi'\Big( \sqrt{\frac{n-1}{n}}\, Z_{n-1} \Big) + \frac{X_n^2}{2n}\, \varphi''\Big( \sqrt{\frac{n-1}{n}}\, Z_{n-1} \Big) + O\Big( \frac{1}{n^{3/2}} \Big).$$
Taking expectations, and using the independence of $X_n$ and $Z_{n-1}$, together with the normalisations on $X_n$, we obtain
$$\mathbf{E}\, \varphi(Z_n) = \mathbf{E}\, \varphi\Big( \sqrt{\frac{n-1}{n}}\, Z_{n-1} \Big) + \frac{1}{2n}\, \mathbf{E}\, \varphi''\Big( \sqrt{\frac{n-1}{n}}\, Z_{n-1} \Big) + O\Big( \frac{1}{n^{3/2}} \Big).$$
Up to errors of $O(1/n^{3/2})$, one can approximate the second term here by $\frac{1}{2n}\, \mathbf{E}\, \varphi''(Z_{n-1})$. We then insert (31) and are led to the heuristic equation
$$\int_{\mathbf R} \varphi(x)\, \rho(x)\, dx \approx \int_{\mathbf R} \varphi\Big( \sqrt{\frac{n-1}{n}}\, x \Big)\, \rho(x)\, dx + \frac{1}{2n} \int_{\mathbf R} \varphi''(x)\, \rho(x)\, dx.$$
Changing variables for the first term on the right-hand side, and integrating by parts for the second term, we have
$$\int_{\mathbf R} \varphi(x)\, \rho(x)\, dx \approx \int_{\mathbf R} \varphi(x)\, \sqrt{\frac{n}{n-1}}\, \rho\Big( \sqrt{\frac{n}{n-1}}\, x \Big)\, dx + \frac{1}{2n} \int_{\mathbf R} \varphi(x)\, \rho''(x)\, dx.$$
Since $\varphi$ was an arbitrary test function, this suggests the heuristic equation
$$\rho(x) \approx \sqrt{\frac{n}{n-1}}\, \rho\Big( \sqrt{\frac{n}{n-1}}\, x \Big) + \frac{1}{2n}\, \rho''(x).$$
Taylor expansion gives
$$\sqrt{\frac{n}{n-1}}\, \rho\Big( \sqrt{\frac{n}{n-1}}\, x \Big) = \rho(x) + \frac{1}{2n}\, \rho(x) + \frac{1}{2n}\, x\, \rho'(x) + O\Big( \frac{1}{n^2} \Big),$$
which leads us to the heuristic ODE
$$L \rho = 0,$$
where $L$ is the Ornstein-Uhlenbeck operator
$$L F := F'' + (x F)'.$$
Observe that $L\rho$ is the total derivative of $\rho' + x \rho$; integrating from infinity, we thus get
$$\rho'(x) + x\, \rho(x) = 0,$$
which is (21), and can be solved by standard ODE methods as $\rho(x) = C e^{-x^2/2}$ for some constant $C$; the requirement that probability density functions have total mass $1$ then gives the constant $C$ as $\frac{1}{\sqrt{2\pi}}$, as we knew it must.
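As a quick check on the heuristic endpoint (my own sanity check, not part of the derivation), one can confirm numerically that $\rho(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is annihilated by the Ornstein-Uhlenbeck operator $LF = F'' + (xF)'$ and has total mass $1$:

```python
import math

# Check that rho(x) = e^{-x^2/2}/sqrt(2 pi) satisfies L rho = rho'' + (x rho)' = 0
# (via finite differences) and integrates to 1 (via a Riemann sum).
rho = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
h = 1e-4

def L_rho(x):
    second = (rho(x + h) - 2 * rho(x) + rho(x - h)) / h ** 2          # rho''
    drift = ((x + h) * rho(x + h) - (x - h) * rho(x - h)) / (2 * h)   # (x rho)'
    return second + drift

assert all(abs(L_rho(x)) < 1e-5 for x in (-2.0, -0.5, 0.0, 1.3, 3.0))
mass = sum(rho(-8 + 0.001 * i) * 0.001 for i in range(16001))
assert abs(mass - 1.0) < 1e-3
```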
The above argument was not rigorous, but one can make it so with a significant amount of PDE machinery. If we view $n$ (or more precisely, $\log n$) as a time parameter, and view $\rho$ as depending on time, the above computations heuristically lead us eventually to the Fokker-Planck equation for the Ornstein-Uhlenbeck process,
$$\partial_t \rho = L \rho,$$
which is a linear parabolic equation that is fortunate enough to be exactly solvable (indeed, it is not difficult to transform this equation to the linear heat equation by some straightforward changes of variable). Using the spectral theory of the Ornstein-Uhlenbeck operator $L$, one can show that solutions to this equation starting from an arbitrary probability distribution are attracted to the gaussian density function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, which as we saw is the steady state for this equation. The stable nature of this attraction can eventually be used to make the above heuristic analysis rigorous. However, this requires a substantial amount of technical effort (e.g. developing the theory of Sobolev spaces associated to $L$) and will not be attempted here. One can also proceed by relating the Fokker-Planck equation to the associated stochastic process, namely the Ornstein-Uhlenbeck process, but this requires one to first set up stochastic calculus, which we will not do here. (The various Taylor expansion calculations we have performed in these notes, though, are closely related to stochastic calculus tools such as Ito’s lemma.) Stein’s method, discussed above, can also be interpreted as a way of making the above computations rigorous (by not working with the density function $\rho$ directly, but instead testing the random variable $Z_n$ against various test functions $\varphi$).
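The attraction to the gaussian steady state can also be seen by simulating the Ornstein-Uhlenbeck process itself. The following Euler-Maruyama sketch (an illustration under my own choice of parameters; the SDE $dX = -X\,dt + \sqrt{2}\,dW$ has the above Fokker-Planck equation) starts all paths far from equilibrium and relaxes towards $N(0,1)$:

```python
import math
import random

# Euler-Maruyama simulation of dX = -X dt + sqrt(2) dW; the empirical law
# of the paths should relax towards the N(0,1) steady state.
random.seed(1)
dt, steps, paths = 0.01, 500, 2000
xs = [5.0] * paths                      # start far from equilibrium
for _ in range(steps):
    xs = [x - x * dt + math.sqrt(2 * dt) * random.gauss(0.0, 1.0) for x in xs]
mean = sum(xs) / paths
var = sum((x - mean) ** 2 for x in xs) / paths
assert abs(mean) < 0.15                 # 5 e^{-5} ~ 0.03 plus sampling noise
assert abs(var - 1.0) < 0.2
```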
This argument does, though, highlight two ideas which we will see again in later notes when studying random matrices. Firstly, it is profitable to study the distribution of some random object $Z_n$ by comparing it with its predecessor $Z_{n-1}$, which one presumes to have almost the same distribution. Secondly, we see that it may potentially be helpful to approximate (in some weak sense) a discrete process (such as the iteration of the scheme (30)) by a continuous evolution (in this case, a Fokker-Planck equation) which can then be controlled using PDE methods.
87 comments
6 January, 2010 at 2:02 am
Jean
Very nice notes! Thank you!
Just a remark: in http://arxiv.org/abs/0705.1224 and http://arxiv.org/abs/0908.0391, one can find some applications of Stein’s method in random matrix theory.
6 January, 2010 at 7:27 am
Mark Meckes
More applications of Stein’s method in random matrix theory appear here, here, and here.
6 January, 2010 at 2:53 am
karabasov
A couple of very minor typos:
1) in the second-last paragraph of section 3, it should be “to those terms in (17)”
2) Beginning of section 4: “Lindeberg observed that” – the comma between observed and that is not correct.
6 January, 2010 at 3:25 am
karabasov
I think there might be a parsing error around “where L is the rhox” in the last section… or maybe I just don’t get it.
6 January, 2010 at 9:12 am
Giovanni Peccati
Hi Terry,
thank you for these inspiring lecture notes !
Small typo, in the brackets starting 5 lines after formula (30) (If we secretely…) there is a square root missing in the normalizing factor of the Gaussian density.
Best, G
6 January, 2010 at 12:04 pm
Terence Tao
Thanks for the corrections!
Somewhat embarrassingly, I was only dimly aware of the work on applying Stein’s method to random matrices; it seems that the fraction of literature on the subject that I am familiar with is still not fully representative. Thanks for the references!
7 January, 2010 at 5:45 am
Tim vB
Dear Terry,
thanks for this marvelous lecture notes!
I hope the following two remarks are not too pedantic, but both points made me stop for a short while:
1. The part “now we get the Fokker-Planck equation and that can be solved exactly” confused me a bit, because the Fokker-Planck equation can “usually” not be solved exactly – the one for the Ornstein-Uhlenbeck process can. It’s of course clear from the context that you mean exactly this, but I think it would help me if you redundantly write “the above computations heuristically lead us eventually to the Fokker-Planck equation of the Ornstein-Uhlenbeck process”.
2. Wouldn’t most people say the Ornstein-Uhlenbeck process is the solution of a stochastic ordinary differential equation (stochastic ODE) rather than a stochastic partial differential equation?
[Fair enough; I’ve adjusted the notes accordingly.]
15 January, 2010 at 12:31 pm
Steven Heilman
More tedious [potential] corrections:
1. Two Eqs. after Eq. (14): P(…)\leq E(…)
2. Eq (16): square in denominator correct?
3. Theorem 4,5: F should be phi (or vice versa)
4(a). Proof of Theorem 5: several missing parentheses? (or E symbols), e.g.
\displaystyle {\bf E} (\varphi(Z_n) – \varphi(W_n)) = o(1).
This is done quite a lot, so I will assume it was done on purpose.
4(b). telescoping sum, multiply by -1
5. After Eq. (26): tex error for bound on f. I assume you want
{f(x) = O_\varphi( 1/|x|)}
[Corrected, thanks – T.]
17 January, 2010 at 7:28 pm
salazar
I second 1 and 3. For 4, since W_n is a RV, there should be no ambiguity. For 2, I haven’t been able to see how to replace \hat{\eta}.
17 January, 2010 at 8:55 pm
Terence Tao
can be bounded by an absolute constant. (The denominator of was a relic from an earlier version of the argument, but is now redundant due to the restriction to the range .)
29 January, 2010 at 7:16 am
Andreas Naive
Excellent notes!
Just a small fix:
In (30), should be , which slightly simplify the following argument.
[Corrected, thanks – T.]
25 February, 2010 at 10:19 am
PDEbeginner
Dear Prof. Tao,
I have some problems on Ex 12 (Esseen concentration inequaity) and Thm 3:
Ex 12. Let , the value on LHS of (9) should be 1, while that on RHS should be 0. Then the inequality breaks down.
Thm 3. In the paragraph immediately below (14), I do not understand how to apply an integration by parts argument, since one does not know the exact distribution of .
Thanks in advance!
25 February, 2010 at 11:09 am
Terence Tao
Ah, there was a factor of r^d missing in the concentration inequality; it is fixed now.
As for Theorem 3, the point is that the distribution of X is the Stieltjes derivative of the cumulative distribution function of X. If we integrate the latter integral by parts, we see that small perturbations of F in the uniform norm lead to small perturbations of the integral when f has bounded total variation.
25 February, 2010 at 10:20 am
PDEbeginner
Sorry, it is Ex 11.
26 February, 2010 at 2:41 pm
YP
The derivation of Gaussian in Section 7 is amazing! (To me it seems very much in the spirit of physics). Isn’t it strange that to our brain it seems very natural to make this argument as it is, whereas for scientific community you actually need to add days of hard work (in Sobolev spaces etc)?
27 February, 2010 at 6:44 pm
vedadi
Dear Prof. Tao,
We know that for i.i.d mean zero, variance one random variables, converges in distribution to the point mass at zero, and standard normal r.v. for respectively. Do we have convergence in distribution for other values
Thanks
6 March, 2010 at 7:14 pm
254A, Notes 7: The least singular value « What’s new
[…] understand this walk, we apply (a slight variant) of the Berry-Esséen theorem from Notes 2: Exercise 1 Show […]
10 March, 2010 at 10:35 am
PDEbeginner
Dear Prof. Tao,
I finished reading this note. As usual, I have some problems :-)
1. For the moment method, we first apply the Chernoff bound (3), and then we can assume that the variables are uniformly subgaussian. But when proving the Chernoff bound in Notes 1, we have used a certain identity. When we are dealing with random matrices, if we also use Chernoff bounds, then it seems we shall have some trouble.
2. I still do not know how to prove the conclusion in Ex 11. Suppose one is trying to bound the easy probability (with and ):
.
By Fourier transform, we obtain
Clearly, to show that it has the same bound as in the exercise, we need to show that the relevant quantity has that bound. I don’t know how to show this.
3. A small typo: the in Theorem 5 seems to be .
10 March, 2010 at 6:38 pm
Terence Tao
1. For random matrices, one does not apply the Chernoff bound to the matrix directly, but to other scalar expressions related to that matrix, e.g. linear combinations of the matrix entries.
2. One first needs to replace the sharp cutoff to the ball of radius r by a smoother cutoff. This is the method of smoothing sums:
http://www.tricki.org/article/Smoothing_sums
3. Thanks for the correction!
5 May, 2010 at 8:48 pm
Anonymous
These notes are very helpful, thank you very much!
I do not have a strong background in Fourier analysis; could you point me to a reference for the solutions to Exercise 11 and Exercise 15?
A few comments I have:
In Exercise 13, the characteristic function should be $e^{i\mu}t…$ instead of $e^{-i\mu}t$?
In Part 1 of this note, the “N” in the last two singled-out equations should be replaced with “N_n”?
6 May, 2010 at 9:06 am
Terence Tao
Thanks for the corrections!
For Ex11, see Lemma 7.17 of my book “Additive combinatorics” with Van Vu. Ex15 can be found in most graduate texts in probability, including those listed at the link given.
27 July, 2010 at 3:50 pm
Ahmet Arivan
Great notes. I have probably a very stupid question. But in the Stein method section, it is claimed that if $f(t) = O(1/(1+|t|))$ and $f'(t) = O(1)$ then $t f(t)$ is Lipschitz with constant $O(1)$. Is it possible to give a hint on why this is true?
Thanks,
31 July, 2010 at 8:35 am
Terence Tao
Ah yes, that requires an additional argument (which needs the Lipschitz hypothesis on phi). I’ve modified the text accordingly (and in particular weakened the Berry-Esseen type claim, since the Lipschitz property is not present in that case.)
29 October, 2012 at 3:53 pm
Nick Cook
Small thing: I think the last part of the proof of Theorem 6, showing that , with implicit constant linear in the bounded Lipschitz norm of , belongs just after this proof since we aren’t assuming is Lipschitz here. – or leave it in with modification of the phrase “using the Lipschitz nature of ” :)
[Corrected by assuming $\phi$ to be Lipschitz. -T.]
4 November, 2010 at 8:33 am
Qiaochu Yuan
Regarding Section 7: I had always vaguely figured that if were going to converge to a particular distribution, it would have to converge to an eigenfunction of the Fourier transform. Do you know of any heuristic (or rigorous!) derivations of the CLT using this idea?
4 November, 2010 at 9:02 am
Terence Tao
I think the more accurate statement would be that a stable law should have a Fourier transform whose logarithm is an eigenfunction of the scaling operation, since the convolution of with itself needs to be a rescaling of .
16 November, 2010 at 1:04 pm
karabasov
At the beginning of the proof of theorem 1, it seems that the first “for sufficiently small t” is not needed. Only the second is needed.
15 March, 2011 at 6:57 pm
Sujit
Hi Terry,
I am a bit puzzled about Exercise 1. I am thinking of $Z_n$ as a map from $n$ copies of $\Omega$ to $R$. If this is the case, what does it mean to say $Z_n$ converges almost surely? Isn’t the underlying sample space varying as $n$ varies?
The CLT says something about the push forward of the measures from $\Omega^n$ to $R$ and I can understand it. So even if the sample space changes, it doesn’t matter as we are comparing the push forward measures. In the strong LLN case, it still sounds fine to me if I interpret the almost sure convergence as almost sure convergence to the constant map on $\Omega^n$ given by the expectation. But I am having trouble trying to understand what Exercise 1 means.
Thanks
15 March, 2011 at 7:08 pm
Terence Tao
The intent here is to work in a common extension of all these probability spaces, where we have an infinite sequence of iid random variables (and this is also the correct way to interpret the strong law of large numbers).
16 March, 2011 at 1:03 pm
Anonymous
Dear Prof. Tao,
Let be the moment generating function of a non degenerate r.v . Then is always strictly increasing and strictly convex on ?
Thanks
1 October, 2011 at 10:32 am
Anonymous
Can anyone help me to compute the expected waiting time of the first occurrence of the patten, say, TTHT. here
is there a general approach to solve this sort of problems?
Thanks
17 March, 2012 at 5:48 am
Anonymous
Dear Prof. Tao,
Let be the simple symmetric random walk in one dimension.
Assume that K is a large constant and I need a lower bound for the following probability which does not depend on
What is the idea behind finding a lower bound for this type of quantities?
Thank you
10 October, 2012 at 3:33 pm
Robert Kohn
Hi, thanks for insightful presentation. My question is, suppose that the are independent with finite third moments, zero mean and unit variance. What can we say about the convergence of the third moment of the normalized sum ?
In addition, what results of the same kind can we get if the are correlated?
Thanks,
Robert Kohn
20 November, 2012 at 8:14 am
Jack
Regarding Exercise 1: I don’t see how the intuition help. What I thought is to give a lower bound of for some , which is from the definition of convergence in probability. Could you elaborate the intuition? Why would independence help here?
20 November, 2012 at 8:41 am
Terence Tao
Actually, one needs a lower bound on to contradict convergence in probability to an unspecified limit (lower bounding merely prevents convergence in probability to zero, or to a non-positive random variable). But if one can quantify some approximate independence between and for n,m widely separated, then this, together with the central limit theorem, can supply the desired lower bound.
20 November, 2012 at 2:07 pm
Jack
An attempt: both $Z_n$ and $Z_{2n}$ converge to $N(0,1)$ in distribution by the CLT. I still don’t see how one can bound the relevant probability.
20 November, 2012 at 2:14 pm
Terence Tao
and are still quite correlated, so it is a bit tricky to separate them by . Try and instead for some large M (larger than or so). The point is that these two random variables are almost independent (more precisely, and are independent) which makes it difficult for them to stay so close to each other so often.
20 November, 2012 at 2:46 pm
Jack
What puzzles me is the use of “independence” here. As a beginner, I only think that and are independent if and only if where and are Borel sets in the corresponding Borel sigma algebra. Why does this concept have anything to do with the “distance” between these random variables? It’s quite intuitive though: two things are “independent”, then they should not be too “close” to each other. But I don’t see the connection. (Eventually I come up with something like . Is this the point? )
20 November, 2012 at 6:24 pm
Terence Tao
If X and Y are independent, then (for instance) will be large if and are large, thus giving some separation between X and Y. (One can also in principle work with expectations, but this requires some hypotheses on absolute integrability that one then has to justify separately.)
26 November, 2012 at 3:13 pm
Jack
Finding a lower bound for is still not “obvious” to me. I tried . And the independence between and gives . Since is chosen such that , implies that this is very close to the desired estimate. But I don’t see the way to bound and . Since I haven’t use CLT so far, I am wondering there must be lack of something here.
26 November, 2012 at 3:42 pm
Terence Tao
The CLT describes the limiting value of in the limit as with fixed, and similarly for (which is basically a rescaled variant of ).
26 November, 2012 at 4:14 pm
Jack
I found in some book that random variables converging in distribution to $X$ is defined as (and the author also calls it “weak convergence”)
for all bounded, continuous function . Is it “equivalent” to the definition in [254A Note4 Exercise 18](https://terrytao.wordpress.com/2010/10/02/245a-notes-4-modes-of-convergence/)? But in [245B Note 11](https://terrytao.wordpress.com/2009/02/21/245b-notes-11-the-strong-and-weak-topologies/), weak convergence is convergence in weak topology. I am totally confused with the terminologies.
[I don’t know how to give links like people do in MathOverflow.]
26 November, 2012 at 4:39 pm
Terence Tao
Yes, these notions are all connected to each other; see Exercise 23 of https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ . (Though, strictly speaking, vague convergence of measures – which is equivalent to convergence in distribution – is an example of weak* convergence rather than weak convergence.)
22 March, 2013 at 12:34 am
Anonymous
Hi professor Tao,
does the CLT hold if we substitute the hypothesis of finite mean and variance with the hypothesis of finite second order moment, ?
22 March, 2013 at 9:10 am
Terence Tao
By Holder’s inequality (or the Cauchy-Schwarz inequality), finite second order moment is equivalent to finite mean and variance.
3 April, 2013 at 3:13 am
Anonymous
Thanks a lot for the answer! By the way, does $Z_n := \frac{S_n - n\mu}{\sqrt{n}\,\sigma} \to N(0,1)$ imply that $\frac{S_n}{n} \to N(\mu, \frac{\sigma^2}{n})$? This implication seems quite weird to me, since $n \to \infty$, but in a book about generalized polynomial chaos I found the following sentence: “the numerical average of a sequence of i.i.d. random variables will converge, as n is increased, to a Gaussian distribution $N(\mu, \frac{\sigma^2}{n})$ […]”.
3 April, 2013 at 10:30 am
Terence Tao
If one uses a rescaled notion of convergence, then this statement is true, although if one uses unrescaled versions of convergence (e.g. convergence in distribution, vague convergence, total variation convergence, etc.) then the statement is either false or trivially true, depending on exactly which mode of convergence is specified. It is somewhat of an abuse of notation to refer to convergence of random variables without specifying the exact nature of the convergence, although for informal mathematical discussion it is generally permissible to be a bit loose in this regard.
7 May, 2013 at 3:44 am
Anonymous
(i) In Remark 1 the word theorem is missing from “central limit theorem.”
(ii) The equation following the sentence “From the bounded density of $G$ and the rapid decrease of $\eta$ we have…” I think the first term should be an expectation and not probability.
(iii) Do you use any convention for capitalizing theorems? For example, you write “central limit theorem” in lower case but “Taylor” expansion in upper case.
(iv) What is the diagonalisation argument in “By a diagonalisation argument, we conclude that there exists a sequence…”? Is there a Wikipedia page? Thanks.
[Corrected, thanks. Taylor is a proper noun and is therefore capitalised in English. The diagonalisation argument, originally due to Cantor, refers to any procedure in which one repeatedly extracts sequences with various properties and then passes to a diagonal subsequence with an even better property. Actually, the Arzela-Ascoli diagonalisation argument is closer in spirit to the one used here than the Cantor diagonalisation argument.]
28 December, 2015 at 5:32 am
Vladimir
Dear Tao, I do not get why, in the proof of theorem 7, the $t$ is uniformly distributed, even though the independence of the $Z_{n;i}$ is clear to me. Could you help please?
28 December, 2015 at 8:14 am
Terence Tao
$t$ is uniformly distributed by definition. In less probabilistic language, the identity being asserted here is that
$$f(a+b) - f(a) = b \int_0^1 f'(a + t b)\, dt,$$
which is of course a special case of the fundamental theorem of calculus.
29 December, 2015 at 2:49 am
Vladimir
Many thanks, now I get it.
27 January, 2016 at 8:50 am
Anonymous
Dear Prof. Tao, in exercise 8, shouldn’t the derivative be evaluated at 0?
[Exercise reworded – T.]
3 May, 2016 at 2:41 pm
Vivek Kumar Bagaria
Great notes! How does the ‘Vector-valued central limit’ behave if the dimension of the vector scales with the number of samples $n$ ?
4 May, 2016 at 8:30 am
Terence Tao
This basically comes down to the dependence of constants of the Berry-Esseen theorem on dimension. The best bound I know of in this direction comes from this 2003 paper of Bentkus, but there may be further improvements since then.
5 May, 2016 at 4:11 am
easy
why is the third moment important in that paper? CLT is about second moments right?
5 May, 2016 at 8:12 am
Terence Tao
If one just wants qualitative convergence (with no rate), then second moment conditions suffice, but in order to obtain quantitative bounds one needs some higher moment assumption (to deal with the errors caused by truncation arguments). Also one can get faster rates of convergence than Berry-Esseen if the original distribution matches third or higher moments with the gaussian, as can be seen from the Lindeberg argument (or from the Fourier argument).
7 November, 2016 at 10:26 pm
Sahiba
Hey, can you help me in Exercise 1? I’m not able to solve it.
8 February, 2017 at 10:10 pm
John Rawls
These notes are immensely useful as always, but I have a small question: how exactly does the diagonalization argument in Section 1 work? We know that for each fixed and all bounded continuous , but it seems that the rate of convergence depends not only on but also on . How then can we choose so that we get for all , or at least for all in some nice subclass? Thanks!
9 February, 2017 at 10:27 pm
Terence Tao
One can work with some countable dense sequence of functions (e.g. indicator functions of with an enumeration of the rationals will work). For each , we then can find such that lies within of for all and ; we can also ensure that the are increasing in . Letting be the inverse of (or more precisely, is the first for which ) we see that for all , and by approximating a general using the we obtain convergence in distribution.
14 February, 2017 at 8:32 pm
John Rawls
Very clever, thank you!
16 February, 2017 at 10:07 am
tpfly
Hi professor Tao,
I’m reading this series of notes and I have a question about the proof of Berry Esseen Theorem. You said that in order to establish (13) it suffices to establish (15) because of (8). But (8) requires that $\varphi$ is Schwartz, or at least L1 and the fourier transform should be also L1, because the Fourier inverse is used. However the function in the proof of berry esseen is decreasing from $1$ to $0$, is (8) still available with this kind of functions?
Thanks!
17 February, 2017 at 5:34 am
tpfly
I found the answer to my question by taking a limit, so that the expectation is now taken on a Schwartz function.
By the way I think there are some mistakes in this proof, the Fourier transform
should be
as there must be a singular point at 0, otherwise would be finite. And
should be
.
[New version of proof given – T.]
21 December, 2017 at 2:13 am
Nick Cook
A couple of typos in the proof of Theorem 23: In the first display after (15) it should say (with no change to (16)), and above the last display it should say “if “.
[Corrected, thanks – T.]
15 April, 2018 at 12:36 am
Ankit
Hi Prof. Tao,
Thanks for the wonderful notes. These are great for self-studying.
In Eq. (17), shouldn’t there be a summation over different choices of {a_1,\ldots,a_m} as well? cf. with the case of k = 4 where terms are explicitly written.
It doesn’t affect the result in the end though because all of them have to be 2 for the non-vanishing term.
15 April, 2018 at 9:14 am
Terence Tao
One is summing over all choices of and of , yes (see the text immediately before and after (17)).
15 April, 2018 at 11:03 am
Ankit
I see. Thank you for replying.
20 April, 2018 at 8:49 pm
Damodar
Prof. Tao, I would be very glad if you could look at my blog post on ” A full proof of Berry-Esseen inequality in the Central Limit Theorem”.
Description: This theorem gives us the maximum convergence limit of the basin of attraction in the Central limit theorem.
Link: http://www.physicslog.com/proof-berry-essen-theorem/
Thanks,
Damodar
3 June, 2018 at 11:51 am
Anonymous
Dear Prof. Tao, I have a question about the proof using the Fourier method. The proof (around equation (11)) only uses $t$ in a neighborhood of the origin. My question is whether this is equivalent to independence (i.e., does $F_{X,Y}(t,s)=F_X(t)F_Y(s)$ for $(t,s)$ in a neighborhood of the origin imply $F_{X,Y}(t,s)=F_X(t)F_Y(s)$ for all $(t,s)$)? Thanks.
4 June, 2018 at 10:14 pm
Terence Tao
If are bounded (or subgaussian) random variables, then are analytic functions of , and so by analytic continuation any identity that holds for small automatically holds for large . I don’t know what happens, though, if are more heavy-tailed; there may well be (somewhat artificial) counterexamples in this setting.
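[As a numerical aside, not part of the original exchange: for bounded random variables, the factorisation of the joint characteristic function under independence can be checked by simulation at both small and large frequencies. The following minimal Python sketch, with all distributions and names chosen for illustration, does this for an iid ±1 sign and an independent uniform variable.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent bounded random variables (here: +-1 signs and Uniform[0,1]).
X = rng.choice([-1.0, 1.0], size=n)
Y = rng.random(n)

def char_fn(samples, t):
    """Monte Carlo estimate of the characteristic function E[e^{itX}]."""
    return np.mean(np.exp(1j * t * samples))

# Independence gives E[e^{i(tX + sY)}] = E[e^{itX}] E[e^{isY}] for ALL (t, s);
# for bounded variables this also follows by analytic continuation once it is
# known in a neighborhood of the origin.
for t, s in [(0.1, 0.2), (3.0, -5.0), (10.0, 7.0)]:
    joint = np.mean(np.exp(1j * (t * X + s * Y)))
    product = char_fn(X, t) * char_fn(Y, s)
    assert abs(joint - product) < 0.02  # within Monte Carlo error
```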
15 February, 2020 at 1:47 pm
Anonymous
Dear Prof. Tao, do you know if there are quantitative versions of local central limit theorems, akin to the Berry–Esseen inequality? I am particularly interested in bounded integer-valued i.i.d. random variables, assuming the number-theoretic issues are not a problem, i.e. the gcd is 1 (think, e.g., of the uniform distribution on the set of squares of integers in an interval).
9 March, 2022 at 9:49 am
Aditya Guha Roy
Prof. Tao, in Exercise 2, is Kolmogorov’s zero-one law the main thing that works behind the scenes? Passing to a subsequence, it suffices to rule out almost sure convergence of a subsequence. The normalized sums have mean zero and variance one, so the limiting random variable would also have variance one. However, the limit is a tail random variable, so if the subsequence converges in the almost sure sense then the limit must be a constant with probability one, and hence have zero variance, a contradiction which refutes our assumption and proves the claim.
Is this correct in your opinion?
[This is indeed one way to solve the exercise – T.]
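[As a numerical aside on the hint in Exercise 2: for m > n one can compute Cov(Z_n, Z_m) = sqrt(n/m), so normalised sums at very different scales are nearly uncorrelated, which is the near-independence the hint alludes to. The following minimal Python sketch, with sample sizes and names chosen for illustration, estimates this covariance for iid signs.]

```python
import numpy as np

rng = np.random.default_rng(1)
trials, n, m = 20_000, 64, 1_600  # m >> n; sizes chosen for illustration

# iid +-1 signs, so the mean is 0 and the variance is 1; Z_k = S_k / sqrt(k).
X = rng.choice([-1.0, 1.0], size=(trials, m))
Z_n = X[:, :n].sum(axis=1) / np.sqrt(n)
Z_m = X.sum(axis=1) / np.sqrt(m)

# For m > n, Cov(Z_n, Z_m) = n / sqrt(n * m) = sqrt(n / m): only the shared
# initial segment correlates them, so for m >> n they are nearly independent.
emp = np.mean(Z_n * Z_m)  # empirical covariance (both means are zero)
assert abs(emp - np.sqrt(n / m)) < 0.04  # true value here is 0.2
```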
16 May, 2022 at 2:23 pm
Michael Neely
In the “Random Matrix” book, Exercise 1.1.10, what answer did you have in mind? The question may need some fixes; a counterexample seems to be when is uniformly distributed over , , and , so that and have the same distribution but cannot be 1/2, and hence we cannot have .
16 May, 2022 at 5:29 pm
Michael Neely
By the way, this is a Stack Exchange question here (I gave some preliminary comments on the posting of HangY): https://math.stackexchange.com/questions/4450636/distribution-of-a-random-variable-exercise/4450746#4450746
17 May, 2022 at 9:27 am
Terence Tao
Oops, you are right, the exercise is not correct as stated. (On standard probability spaces one can salvage it up to almost sure agreement by using a relative product construction, but this is not the point I wanted to make in this exercise.) I have added an erratum replacing the exercise by one which makes the desired point.
17 May, 2022 at 3:28 pm
Michael Neely
Thanks for the quick reply! However, I don’t think this works. I believe I can construct two independent random variables for , both uniformly distributed over , such that the intersection of their images is the empty set. If I understand your updated question correctly, this is a (counterintuitive) counterexample to your claim.
17 May, 2022 at 5:15 pm
Michael Neely
Never mind; I think my counterexample may be circumvented by your “almost surely” insertion (taken on the probability spaces, rather than on the R space), if we give the differing parts of the two spaces measure zero with respect to their respective measures.