Let {X_1,X_2,\dots} be iid copies of an absolutely integrable real scalar random variable {X}, and form the partial sums {S_n := X_1 + \dots + X_n}. As we saw in the last set of notes, the law of large numbers ensures that the empirical averages {S_n/n} converge (both in probability and almost surely) to a deterministic limit, namely the mean {\mu= {\bf E} X} of the reference variable {X}. Furthermore, under some additional moment hypotheses on the underlying variable {X}, we can obtain square root cancellation for the fluctuation {\frac{S_n}{n} - \mu} of the empirical average from the mean. To simplify the calculations, let us first restrict to the case {\mu=0, \sigma^2=1} of mean zero and variance one, thus

\displaystyle  {\bf E} X = 0

and

\displaystyle  {\bf Var}(X) = {\bf E} X^2 = 1.

Then, as computed in previous notes, the normalised fluctuation {S_n/\sqrt{n}} also has mean zero and variance one:

\displaystyle  {\bf E} \frac{S_n}{\sqrt{n}} = 0

\displaystyle  {\bf Var}(\frac{S_n}{\sqrt{n}}) = {\bf E} (\frac{S_n}{\sqrt{n}})^2 = 1.

This and Chebyshev’s inequality already indicate that the “typical” size of {S_n} is {O(\sqrt{n})}, thus for instance {\frac{S_n}{\sqrt{n} \omega(n)}} goes to zero in probability for any {\omega(n)} that goes to infinity as {n \rightarrow \infty}. If we also have a finite fourth moment {{\bf E} |X|^4 < \infty}, then the calculations of the previous notes also give a fourth moment estimate

\displaystyle  {\bf E} (\frac{S_n}{\sqrt{n}})^4 = 3 + O( \frac{{\bf E} |X|^4}{n} ).

From this and the Paley-Zygmund inequality (Exercise 44 of Notes 1) we also get some lower bound for {\frac{S_n}{\sqrt{n}}} of the form

\displaystyle  {\bf P}( |\frac{S_n}{\sqrt{n}}| \geq \varepsilon ) \geq \varepsilon

for some absolute constant {\varepsilon>0} and for {n} sufficiently large; this indicates in particular that {\frac{S_n \omega(n)}{\sqrt{n}}} does not converge in any reasonable sense to something finite for any {\omega(n)} that goes to infinity.
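
For readers who like to see such normalisations in action, here is a minimal numerical sketch of the above moment computations (assuming a Python environment with numpy available; the choice of the signed Bernoulli distribution for {X} is just for illustration). The empirical mean and variance of {S_n/\sqrt{n}} should be close to {0} and {1}, and the empirical fourth moment should hover near {3}, in accordance with the fourth moment estimate above.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 20000
# X takes the values +1 and -1 with probability 1/2 each (mean zero, variance one)
X = rng.choice([-1.0, 1.0], size=(trials, n))
T = X.sum(axis=1) / np.sqrt(n)          # samples of S_n / sqrt(n)
print("mean        ", T.mean())         # should be close to 0
print("variance    ", (T ** 2).mean())  # should be close to 1
print("4th moment  ", (T ** 4).mean())  # should be close to 3 + O(E|X|^4 / n)
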
The question remains as to what happens to the ratio {S_n/\sqrt{n}} itself, without multiplying or dividing by any factor {\omega(n)}. A first guess would be that these ratios converge in probability or almost surely, but this is unfortunately not the case:

Proposition 1 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable real scalar random variable {X} with mean zero, variance one, and finite fourth moment, and write {S_n := X_1 + \dots + X_n}. Then the random variables {S_n/\sqrt{n}} do not converge in probability or almost surely to any limit, and neither does any subsequence of these random variables.

Proof: Suppose for contradiction that some sequence {S_{n_j}/\sqrt{n_j}} converged in probability or almost surely to a limit {Y}. By passing to a further subsequence we may assume that the convergence is in the almost sure sense. Since all of the {S_{n_j}/\sqrt{n_j}} have mean zero, variance one, and bounded fourth moment, Theorem 25 of Notes 1 implies that the limit {Y} also has mean zero and variance one. On the other hand, {Y} is a tail random variable (modifying any finite number of the {X_i} changes {S_{n_j}} by a bounded amount, which is negligible after dividing by {\sqrt{n_j}}), and is thus almost surely constant by the Kolmogorov zero-one law from Notes 3. Since constants have variance zero, we obtain the required contradiction. \Box

Nevertheless there is an important limit for the ratio {S_n/\sqrt{n}}, which requires one to replace the notions of convergence in probability or almost sure convergence by the weaker concept of convergence in distribution.

Definition 2 (Vague convergence and convergence in distribution) Let {R} be a locally compact Hausdorff topological space with the Borel {\sigma}-algebra. A sequence of finite measures {\mu_n} on {R} is said to converge vaguely to another finite measure {\mu} if one has

\displaystyle  \int_R G(x)\ d\mu_n(x) \rightarrow \int_R G(x)\ d\mu(x)

as {n \rightarrow \infty} for all continuous compactly supported functions {G: R \rightarrow {\bf R}}. (Vague convergence is also known as weak convergence, although strictly speaking the terminology weak-* convergence would be more accurate.) A sequence of random variables {X_n} taking values in {R} is said to converge in distribution (or converge weakly or converge in law) to another random variable {X} if the distributions {\mu_{X_n}} converge vaguely to the distribution {\mu_X}, or equivalently if

\displaystyle  {\bf E}G(X_n) \rightarrow {\bf E} G(X)

as {n \rightarrow \infty} for all continuous compactly supported functions {G: R \rightarrow {\bf R}}.

One could in principle try to extend this definition beyond the locally compact Hausdorff setting, but certain pathologies can occur when doing so (e.g. failure of the Riesz representation theorem), and we will never need to consider vague convergence in spaces that are not locally compact Hausdorff, so we restrict to this setting for simplicity.

Note that the notion of convergence in distribution depends only on the distribution of the random variables involved. One consequence of this is that convergence in distribution does not produce unique limits: if {X_n} converges in distribution to {X}, and {Y} has the same distribution as {X}, then {X_n} also converges in distribution to {Y}. However, limits are unique up to equivalence in distribution (this is a consequence of the Riesz representation theorem, discussed for instance in this blog post). As a consequence of the insensitivity of convergence in distribution to equivalence in distribution, we may also legitimately talk about convergence in distribution of a sequence of random variables {X_n} to another random variable {X} even when all the random variables {X_1,X_2,\dots} and {X} involved are being modeled by different probability spaces (e.g. each {X_n} is modeled by {\Omega_n}, and {X} is modeled by {\Omega}, with no coupling presumed between these spaces). This is in contrast to the stronger notions of convergence in probability or almost sure convergence, which require all the random variables to be modeled by a common probability space. Also, by an abuse of notation, we can say that a sequence {X_n} of random variables converges in distribution to a probability measure {\mu}, when {\mu_{X_n}} converges vaguely to {\mu}. Thus we can talk about a sequence of random variables converging in distribution to a uniform distribution, a gaussian distribution, etc.

From the dominated convergence theorem (available for both convergence in probability and almost sure convergence) we see that convergence in probability or almost sure convergence implies convergence in distribution. The converse is not true, due to the insensitivity of convergence in distribution to equivalence in distribution; for instance, if {X_1,X_2,\dots} are iid copies of a non-deterministic scalar random variable {X}, then the {X_n} trivially converge in distribution to {X}, but will not converge in probability or almost surely (as one can see from the zero-one law). However, there are some partial converses that relate convergence in distribution to convergence in probability; see Exercise 10 below.

Remark 3 The notion of convergence in distribution is somewhat similar to the notion of convergence in the sense of distributions that arises in distribution theory (discussed for instance in this previous blog post), however strictly speaking the two notions of convergence are distinct and should not be confused with each other, despite the very similar names.

The notion of convergence in distribution simplifies in the case of real scalar random variables:

Proposition 4 Let {X_1,X_2,\dots} be a sequence of scalar random variables, and let {X} be another scalar random variable. Then the following are equivalent:

  • (i) {X_n} converges in distribution to {X}.
  • (ii) {F_{X_n}(t)} converges to {F_X(t)} for each continuity point {t} of {F_X} (i.e. for all real numbers {t \in {\bf R}} at which {F_X} is continuous). Here {F_X(t) := {\bf P}(X \leq t)} is the cumulative distribution function of {X}.

Proof: First suppose that {X_n} converges in distribution to {X}, and {F_X} is continuous at {t}. For any {\varepsilon > 0}, one can find a {\delta} such that

\displaystyle  F_X(t) - \varepsilon \leq F_X(t') \leq F_X(t) + \varepsilon

for every {t' \in [t-\delta,t+\delta]}. One can also find an {N} larger than {|t|+\delta} such that {F_X(-N) \leq \varepsilon} and {F_X(N) \geq 1-\varepsilon}. Thus

\displaystyle  {\bf P} (|X| \geq N ) = O(\varepsilon)

and

\displaystyle  {\bf P} (|X - t| \leq \delta ) = O(\varepsilon).

Let {G: {\bf R} \rightarrow [0,1]} be a continuous function supported on {[-2N, t]} that equals {1} on {[-N, t-\delta]}. Then by the above discussion we have

\displaystyle  {\bf E} G(X) = F_X(t) + O(\varepsilon)

and hence

\displaystyle  {\bf E} G(X_n) = F_X(t) + O(\varepsilon)

for large enough {n}. In particular

\displaystyle  {\bf P}( X_n \leq t ) \geq F_X(t) - O(\varepsilon).

A similar argument, replacing {G} with a continuous function supported on {[t+\delta/2,2N]} that equals {1} on {[t+\delta,N]}, gives

\displaystyle  {\bf P}( X_n > t ) \geq 1 - F_X(t) - O(\varepsilon)

for {n} large enough. Putting the two estimates together gives

\displaystyle  F_{X_n}(t) = F_X(t) + O(\varepsilon)

for {n} large enough; sending {\varepsilon \rightarrow 0}, we obtain the claim.
Conversely, suppose that {F_{X_n}(t)} converges to {F_X(t)} at every continuity point {t} of {F_X}. Let {G: {\bf R} \rightarrow {\bf R}} be a continuous compactly supported function; then it is uniformly continuous. As {F_X} is monotone increasing, it can only have countably many points of discontinuity. From these two facts one can find, for any {\varepsilon>0}, a simple function {G_\varepsilon(t) = \sum_{i=1}^m c_i 1_{(t_i,t_{i+1}]}} for some {t_1 < \dots < t_{m+1}} that are points of continuity of {F_X}, and real numbers {c_i}, such that {|G(t) - G_\varepsilon(t)| \leq \varepsilon} for all {t}. Thus

\displaystyle  {\bf E} G(X_n) = {\bf E} G_\varepsilon(X_n) + O(\varepsilon)

\displaystyle  = \sum_{i=1}^m c_i(F_{X_n}(t_{i+1}) - F_{X_n}(t_i)) + O(\varepsilon).

Similarly for {X_n} replaced by {X}. Subtracting and taking limit superior, we conclude that

\displaystyle  \limsup_{n \rightarrow \infty} |{\bf E} G(X_n) - {\bf E} G(X)| = O(\varepsilon),

and on sending {\varepsilon \rightarrow 0}, we obtain that {X_n} converges in distribution to {X} as claimed. \Box
The restriction to continuity points {t} is necessary. Consider for instance the deterministic random variables {X_n := 1/n}; then {X_n} converges almost surely (and hence in distribution) to {0}, but {F_{X_n}(0) = 0} does not converge to {F_X(0)=1}.

Example 5 For any natural number {n}, let {X_n} be a discrete random variable drawn uniformly from the finite set {\{0/n, 1/n, \dots, (n-1)/n\}}, and let {X} be the continuous random variable drawn uniformly from {[0,1]}. Then {X_n} converges in distribution to {X}. Thus we see that a continuous random variable can emerge as the limit of discrete random variables.
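
Example 5 can also be checked directly against the criterion in Proposition 4: every real number is a continuity point of the cumulative distribution function of the uniform distribution on {[0,1]}, and both cumulative distribution functions are explicit. A small sketch (assuming numpy; the grid of test points is an arbitrary choice):

import numpy as np

def F_discrete(t, n):
    # CDF of the uniform distribution on {0/n, 1/n, ..., (n-1)/n}
    return np.clip(np.floor(n * t) + 1, 0, n) / n

def F_uniform(t):
    # CDF of the uniform distribution on [0, 1]
    return np.clip(t, 0.0, 1.0)

ts = np.linspace(0.05, 0.95, 19)
for n in (10, 100, 1000):
    print(n, np.max(np.abs(F_discrete(ts, n) - F_uniform(ts))))  # the gap shrinks as n grows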

Example 6 For any natural number {n}, let {X_n} be a continuous random variable drawn uniformly from {[0,1/n]}, then {X_n} converges in distribution to the deterministic real number {0}. Thus we see that discrete (or even deterministic) random variables can emerge as the limit of continuous random variables.

Exercise 7 (Portmanteau theorem) Show that the properties (i) and (ii) in Proposition 4 are also equivalent to the following three statements:

  • (iii) One has {\limsup_{n \rightarrow \infty} {\bf P}( X_n \in K ) \leq {\bf P}(X \in K)} for all closed sets {K \subset {\bf R}}.
  • (iv) One has {\liminf_{n \rightarrow \infty} {\bf P}( X_n \in U ) \geq {\bf P}(X \in U)} for all open sets {U \subset {\bf R}}.
  • (v) For any Borel set {E \subset {\bf R}} whose topological boundary {\partial E} is such that {{\bf P}(X \in \partial E) = 0}, one has {\lim_{n \rightarrow \infty} {\bf P}(X_n \in E) = {\bf P}(X \in E)}.

(Note: to prove this theorem, you may wish to invoke Urysohn’s lemma. To deduce (iii) from (i), you may wish to start with the case of compact {K}.)

We can now state the famous central limit theorem:

Theorem 8 (Central limit theorem) Let {X_1,X_2,\dots} be iid copies of a scalar random variable {X} of finite mean {\mu := {\bf E} X} and finite non-zero variance {\sigma^2 := {\bf Var}(X)}. Let {S_n := X_1 + \dots + X_n}. Then the random variables {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} converge in distribution to a random variable with the standard normal distribution {N(0,1)} (that is to say, a random variable with probability density function {x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}}). Thus, by abuse of notation

\displaystyle  \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \rightarrow N(0,1).

In the normalised case {\mu=0, \sigma^2=1} when {X} has mean zero and unit variance, this simplifies to

\displaystyle  \frac{S_n}{\sqrt{n}} \rightarrow N(0,1).

Using Proposition 4 (and the fact that the cumulative distribution function associated to {N(0,1)} is continuous), the central limit theorem is equivalent to asserting that

\displaystyle  {\bf P}( \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq t ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx

as {n \rightarrow \infty} for any {t \in {\bf R}}, or equivalently that

\displaystyle  {\bf P}( a \leq \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx.

Informally, one can think of the central limit theorem as asserting that {S_n} approximately behaves like it has distribution {N( n \mu, n \sigma^2 )} for large {n}, where {N(\mu,\sigma^2)} is the normal distribution with mean {\mu} and variance {\sigma^2}, that is to say the distribution with probability density function {x \mapsto \frac{1}{\sqrt{2\pi} \sigma} e^{-(x-\mu)^2/2\sigma^2}}. The integrals {\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx} can be written in terms of the error function {\hbox{erf}} as {\frac{1}{2} + \frac{1}{2} \hbox{erf}(t/\sqrt{2})}.
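
As a numerical sanity check of this formulation (a rough Monte Carlo sketch, assuming numpy; the exponential distribution below is just one convenient non-symmetric choice of {X}, with {\mu = \sigma = 1}), one can compare the empirical distribution of {\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-\mu)} against {\frac{1}{2} + \frac{1}{2} \hbox{erf}(t/\sqrt{2})}:

import math
import numpy as np

def Phi(t):
    # standard normal CDF, written via the error function as in the text
    return 0.5 + 0.5 * math.erf(t / math.sqrt(2))

rng = np.random.default_rng(1)
n, trials = 256, 20000
mu, sigma = 1.0, 1.0                       # Exp(1) has mean 1 and variance 1
S = rng.exponential(scale=1.0, size=(trials, n)).sum(axis=1)
Z = math.sqrt(n) / sigma * (S / n - mu)    # normalised fluctuation
for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, np.mean(Z <= t), Phi(t))      # agreement improves as n grows
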
The central limit theorem is a basic example of the universality phenomenon in probability – many statistics involving a large system of many independent (or weakly dependent) variables (such as the normalised sums {\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-\mu)}) end up having a universal asymptotic limit (in this case, the normal distribution), regardless of the precise makeup of the underlying random variable {X} that comprised that system. Indeed, the universality of the normal distribution is such that it arises in many other contexts than the fluctuation of iid random variables; the central limit theorem is merely the first place in probability theory where it makes a prominent appearance.

We will give several proofs of the central limit theorem in these notes; each of these proofs has its advantages and disadvantages, and each can be extended to prove many further results beyond the central limit theorem. We first give Lindeberg’s proof of the central limit theorem, based on exchanging (or swapping) each component {X_1,\dots,X_n} of the sum {S_n} in turn. This proof gives an accessible explanation as to why there should be a universal limit for the central limit theorem; one then computes directly with gaussians to verify that it is the normal distribution which is the universal limit. Our second proof is the most popular one taught in probability texts, namely the Fourier-analytic proof based around the concept of the characteristic function {t \mapsto {\bf E} e^{itX}} of a real random variable {X}. Thanks to the powerful identities and other results of Fourier analysis, this gives a quite short and direct proof of the central limit theorem, although the arguments may seem rather magical to readers who are not already familiar with Fourier methods. Finally, we give a proof based on the moment method, in the spirit of the arguments in the previous notes; this argument is more combinatorial, but is straightforward and is particularly robust, in particular being well equipped to handle some dependencies between components; we will illustrate this by proving the Erdos-Kac law in number theory by this method. Some further discussion of the central limit theorem (including some further proofs, such as one based on Stein’s method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and large deviation inequalities, will be discussed in the next (and final) set of notes.

The following exercise illustrates the power of the central limit theorem, by establishing combinatorial estimates which would otherwise require the use of Stirling’s formula to establish.

Exercise 9 (De Moivre-Laplace theorem) Let {X} be a Bernoulli random variable, taking values in {\{0,1\}} with {{\bf P}(X=0)={\bf P}(X=1)=1/2}, thus {X} has mean {1/2} and variance {1/4}. Let {X_1,X_2,\dots} be iid copies of {X}, and write {S_n := X_1+\dots+X_n}.

  • (i) Show that {S_n} takes values in {\{0,\dots,n\}} with {{\bf P}(S_n=i) = \frac{1}{2^n} \binom{n}{i}}. (This is an example of a binomial distribution.)
  • (ii) Assume Stirling’s formula

    \displaystyle  n! = (1+o(1)) \sqrt{2\pi n} n^n e^{-n} \ \ \ \ \ (1)

    where {o(1)} is a function of {n} that goes to zero as {n \rightarrow \infty}. (A proof of this formula may be found in this previous blog post.) Using this formula, and without using the central limit theorem, show that

    \displaystyle  {\bf P}( a \leq 2\sqrt{n} (\frac{S_n}{n} - \frac{1}{2}) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx

    as {n \rightarrow \infty} for any fixed real numbers {a<b}.

The above special case of the central limit theorem was first established by de Moivre and Laplace.
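
One can also check the de Moivre-Laplace statement directly from the exact binomial probabilities, without any sampling (a quick sketch using only the Python standard library): the event {a \leq 2\sqrt{n} (\frac{S_n}{n} - \frac{1}{2}) \leq b} is precisely the event that {S_n} lies in the interval {[\frac{n}{2} + \frac{a \sqrt{n}}{2}, \frac{n}{2} + \frac{b \sqrt{n}}{2}]}.

import math

def binomial_window(n, a, b):
    # P( a <= 2*sqrt(n)*(S_n/n - 1/2) <= b ) for S_n ~ Binomial(n, 1/2)
    lo = max(0, math.ceil(n / 2 + a * math.sqrt(n) / 2))
    hi = min(n, math.floor(n / 2 + b * math.sqrt(n) / 2))
    return sum(math.comb(n, i) / 2 ** n for i in range(lo, hi + 1))

def gaussian_window(a, b):
    # (1/sqrt(2*pi)) * integral from a to b of exp(-x^2/2) dx, via the error function
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

a, b = -1.0, 2.0
for n in (10, 100, 1000, 10000):
    print(n, binomial_window(n, a, b), gaussian_window(a, b))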

We close this section with some basic facts about convergence of distribution that will be useful in the sequel.

Exercise 10 Let {X_1,X_2,\dots}, {Y_1,Y_2,\dots} be sequences of real random variables, and let {X,Y} be further real random variables.

  • (i) If {X} is deterministic, show that {X_n} converges in distribution to {X} if and only if {X_n} converges in probability to {X}.
  • (ii) Suppose that {X_n} is independent of {Y_n} for each {n}, and {X} independent of {Y}. Show that {X_n+iY_n} converges in distribution to {X+iY} if and only if {X_n} converges in distribution to {X} and {Y_n} converges in distribution to {Y}. (The shortest way to prove this is by invoking the Stone-Weierstrass theorem, but one can also proceed by proving some version of Proposition 4.) What happens if the independence hypothesis is dropped?
  • (iii) If {X_n} converges in distribution to {X}, show that for every {\varepsilon>0} there exists {K>0} such that {{\bf P}( |X_n| \geq K ) < \varepsilon} for all sufficiently large {n}. (That is to say, {X_n} is a tight sequence of random variables.)
  • (iv) Show that {X_n} converges in distribution to {X} if and only if, after extending the probability space model if necessary, one can find copies {Z_1,Z_2,\dots} and {Z} of {X_1,X_2,\dots} and {X} respectively such that {Z_n} converges almost surely to {Z}. (Hint: use the Skorohod representation, Exercise 29 of Notes 0.)
  • (v) If {X_1,X_2,\dots} converges in distribution to {X}, and {F: {\bf R} \rightarrow {\bf R}} is continuous, show that {F(X_1),F(X_2),\dots} converges in distribution to {F(X)}. Generalise this claim to the case when {X} takes values in an arbitrary locally compact Hausdorff space.
  • (vi) (Slutsky’s theorem) If {X_n} converges in distribution to {X}, and {Y_n} converges in probability to a deterministic limit {Y}, show that {X_n+Y_n} converges in distribution to {X+Y}, and {X_n Y_n} converges in distribution to {XY}. (Hint: either use (iv), or else use (iii) to control some error terms.) This statement combines particularly well with (i). What happens if {Y} is not assumed to be deterministic?
  • (vii) (Fatou lemma) If {G: {\bf R} \rightarrow [0,+\infty)} is continuous, and {X_n} converges in distribution to {X}, show that {\liminf_{n \rightarrow \infty} {\bf E} G(X_n) \geq {\bf E} G(X)}.
  • (viii) (Bounded convergence) If {G: {\bf R} \rightarrow {\bf R}} is continuous and bounded, and {X_n} converges in distribution to {X}, show that {\lim_{n \rightarrow \infty} {\bf E} G(X_n) = {\bf E} G(X)}.
  • (ix) (Dominated convergence) If {X_n} converges in distribution to {X}, and there is an absolutely integrable {Y} such that {|X_n| \leq Y} almost surely for all {n}, show that {\lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X}.

For future reference we also mention (but will not prove) Prokhorov’s theorem that gives a partial converse to part (iii) of the above exercise:

Theorem 11 (Prokhorov’s theorem) Let {X_1,X_2,\dots} be a sequence of real random variables which is tight (that is, for every {\varepsilon>0} there exists {K>0} such that {{\bf P}(|X_n| \geq K) < \varepsilon} for all sufficiently large {n}). Then there exists a subsequence {X_{n_j}} which converges in distribution to some random variable {X} (which may possibly be modeled by a different probability space model than the {X_1,X_2,\dots}.)

The proof of this theorem relies on the Riesz representation theorem, and is beyond the scope of this course; but see for instance Exercise 29 of this previous blog post. (See also the closely related Helly selection theorem, covered in Exercise 30 of the same post.)

— 1. The Lindeberg approach to the central limit theorem —

We now give the Lindeberg argument establishing the central limit theorem. The proof splits into two unrelated components. The first component is to establish the central limit theorem for a single choice of underlying random variable {X}. The second component is to show that the limiting distribution of {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} is universal in the sense that it does not depend the choice of underlying random variable. Putting the two components together gives Theorem 8.

We begin with the first component of the argument. One could use the Bernoulli distribution from Exercise 9 as the choice of underlying random variable, but a simpler choice of distribution (in the sense that no appeal to Stirling’s formula is required) is the normal distribution {N(\mu,\sigma^2)} itself. The key computation is:

Lemma 12 (Sum of independent Gaussians) Let {X_1, X_2} be independent real random variables with normal distributions {N(\mu_1,\sigma_1^2)}, {N(\mu_2,\sigma_2^2)} respectively for some {\mu_1,\mu_2 \in {\bf R}} and {\sigma_1,\sigma_2 > 0}. Then {X_1+X_2} has the normal distribution {N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)}.

This is of course consistent with the additivity of mean and variance for independent random variables, given that random variables with the distribution {N(\mu,\sigma^2)} have mean {\mu} and variance {\sigma^2}.

Proof: By subtracting {\mu_1} and {\mu_2} from {X_1,X_2} respectively, we may normalise {\mu_1=\mu_2=0}; by dividing through by {\sqrt{\sigma_1^2+\sigma_2^2}} we may also normalise {\sigma_1^2+\sigma_2^2=1}. Thus

\displaystyle  {\bf P}( X_1 \in E_1 ) = \frac{1}{\sqrt{2\pi} \sigma_1} \int_{E_1} e^{-x_1^2 / 2\sigma_1^2}\ dx_1

and

\displaystyle  {\bf P}( X_2 \in E_2 ) = \frac{1}{\sqrt{2\pi} \sigma_2} \int_{E_2} e^{-x_2^2 / 2\sigma_2^2}\ dx_2

for any Borel sets {E_1,E_2 \subset {\bf R}}. As {X_1,X_2} are independent, this implies that

\displaystyle  {\bf P}( (X_1,X_2) \in E ) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_E e^{-x_1^2/2\sigma_1^2 - x_2^2/2\sigma_2^2}\ dx_1 dx_2

for any Borel set {E \subset {\bf R}^2} (this follows from the uniqueness of product measure, or equivalently one can use the monotone class lemma starting from the case when {E} is a finite boolean combination of product sets {E_1 \times E_2}). In particular, we have

\displaystyle  {\bf P}( X_1 + X_2 \leq t ) = \frac{1}{2\pi \sigma_1 \sigma_2} \int_{{\bf R}^2} 1_{x_1+x_2 \leq t} e^{-x_1^2/2\sigma_1^2 - x_2^2/2\sigma_2^2}\ dx_1 dx_2

for any {t \in {\bf R}}. Making the change of variables {x = x_1 + x_2} (and using the Fubini-Tonelli theorem as necessary) we can write the right-hand side as

\displaystyle  \frac{1}{2\pi \sigma_1 \sigma_2} \int_{-\infty}^t \int_{\bf R} e^{-x_1^2/2\sigma_1^2 - (x-x_1)^2/2\sigma_2^2}\ dx_1 dx. \ \ \ \ \ (2)

We can complete the square using {\sigma_1^2+\sigma_2^2=1} to write (after some routine algebra)

\displaystyle  e^{-x_1^2/2\sigma_1^2 - (x-x_1)^2/2\sigma_2^2} = e^{-x^2 / 2} e^{-(x_1 - x \sigma_1^2)^2 / 2 \sigma_1^2 \sigma_2^2}

so on using the identity {\frac{1}{\sqrt{2\pi} \sigma} \int_{\bf R} e^{-(x_1-\mu)^2/2\sigma^2}\ dx_1 = 1} for any {\mu \in {\bf R}} and {\sigma>0}, we can write (2) as

\displaystyle  \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2 / 2}\ dx

and so {X_1+X_2} has the cumulative distribution function of {N(0, 1)}, giving the claim. \Box
In the next section we give an alternate proof of the above lemma using the machinery of characteristic functions. A more geometric argument can be given as follows. With the same normalisations as in the above proof, we can write {\sigma_1 = \cos \theta} and {\sigma_2 = \sin \theta} for some {0 < \theta < \pi/2}. Then we can write {X_1 = \cos \theta Y_1} and {X_2 = \sin \theta Y_2} where {Y_1,Y_2} are iid copies of {N(0,1)}. But the joint probability density function {(x_1,x_2) \mapsto \frac{1}{2\pi} e^{-(x_1^2+x_2^2)/2}} of {(Y_1,Y_2)} is rotation invariant, so {X_1+X_2 = \cos \theta Y_1 + \sin \theta Y_2} has the same distribution as {Y_1}, and the claim follows.

From the above lemma we see that if {X_1,X_2,\dots} are iid copies of a normal distribution {N(\mu,\sigma^2)} of mean {\mu} and variance {\sigma^2}, then {S_n = X_1 + \dots + X_n} has distribution {N(n \mu, n \sigma^2)}, and hence {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} has distribution exactly equal to {N(0,1)}. Thus the central limit theorem is clearly true in this case.

Exercise 13 (Probabilistic interpretation of convolution) Let {f, g: {\bf R} \rightarrow [0,+\infty]} be measurable functions with {\int_{\bf R} f(x)\ dx = \int_{\bf R} g(x)\ dx = 1}. Define the convolution {f*g} of {f} and {g} to be

\displaystyle  f*g(x) := \int_{\bf R} f(y) g(x-y)\ dy.

Show that if {X, Y} are independent real random variables with probability density functions {f, g} respectively, then {X+Y} has probability density function {f*g}.
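
Exercise 13 also gives a way to test Lemma 12 numerically, by discretising the convolution integral (a rough sketch, assuming numpy; the grid and the parameters {\sigma_1, \sigma_2} are arbitrary choices subject to {\sigma_1^2+\sigma_2^2=1}):

import numpy as np

def gaussian_pdf(x, sigma):
    return np.exp(-x ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

s1, s2 = 0.6, 0.8                            # note s1^2 + s2^2 = 1
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
f, g = gaussian_pdf(x, s1), gaussian_pdf(x, s2)
conv = np.convolve(f, g, mode="same") * dx   # Riemann sum approximation of f*g
target = gaussian_pdf(x, np.sqrt(s1 ** 2 + s2 ** 2))
print(np.max(np.abs(conv - target)))         # small discretisation error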

Now we turn to the general case of the central limit theorem. By subtracting {\mu} from {X} (and from each of the {X_i}) we may normalise {\mu = 0}; by dividing {X} (and each of the {X_i}) by {\sigma} we may also normalise {\sigma=1}. Thus {{\bf E} X = 0, {\bf E} X^2 = 1}, and our task is to show that

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N) \ \ \ \ \ (3)


as {n \rightarrow \infty}, for any continuous compactly supported function {G: {\bf R} \rightarrow {\bf R}}, where {N} is a random variable distributed according to {N(0,1)} (possibly modeled by a different probability space than the original {X_1,X_2,\dots}). Since any continuous compactly supported function can be approximated uniformly by smooth compactly supported functions (as can be seen from the Weierstrass or Stone-Weierstrass theorems), it suffices to show this for smooth compactly supported {G}.
Let {Y_1,Y_2,\dots} be iid copies of {N}; by extending the probability space used to model {X_1,X_2,\dots} (using Proposition 26 from Notes 2), we can model the {Y_1,Y_2,\dots} and {X_1,X_2,\dots} by a common model, in such a way that the combined collection {(X_n,Y_n)_{n=1}^\infty} of random variables are jointly independent. As we have already proved the central limit theorem in the normally distributed case, we already have

\displaystyle  {\bf E} G( \frac{Y_1+\dots+Y_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N) \ \ \ \ \ (4)


as {n \rightarrow \infty} (indeed we even have equality here). So it suffices to show that

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{Y_1+\dots+Y_n}{\sqrt{n}} ) \rightarrow 0 \ \ \ \ \ (5)

as {n \rightarrow \infty}.
We first establish this claim under the additional simplifying assumption of a finite third moment: {{\bf E} |X|^3 < \infty}. Rather than swap all of the {X_1,\dots,X_n} with all of the {Y_1,\dots,Y_n}, let us just swap the final {X_n} to a {Y_n}, that is to say let us consider the expression

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{X_1+\dots+X_{n-1}+Y_n}{\sqrt{n}} ).

Writing {Z := \frac{X_1 + \dots + X_{n-1}}{\sqrt{n}}}, we can write this as

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} X_n ) - {\bf E} G( Z + \frac{1}{\sqrt{n}} Y_n ).

To compute this expression we use Taylor expansion. As {G} is smooth and compactly supported, the first three derivatives of {G} are bounded, leading to the Taylor approximation

\displaystyle  G( Z + \frac{1}{\sqrt{n}} X_n ) = G(Z) + \frac{1}{\sqrt{n}} G'(Z) X_n + \frac{1}{2n} G''(Z) X_n^2 + O( \frac{1}{n^{3/2}} |X_n|^3 )

where the implied constant depends on {G}. Taking expectations, we conclude that

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} X_n ) = {\bf E} G(Z) + \frac{1}{\sqrt{n}} {\bf E} G'(Z) X_n + \frac{1}{2n} {\bf E} G''(Z) X_n^2

\displaystyle + O( \frac{1}{n^{3/2}} {\bf E} |X|^3 ).

Now for a key point: as the random variable {Z} only depends on {X_1,\dots,X_{n-1}}, it is independent of {X_n}, and so we can decouple the expectations to obtain

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} X_n ) = {\bf E} G(Z) + \frac{1}{\sqrt{n}} {\bf E} G'(Z) {\bf E} X_n + \frac{1}{2n} {\bf E} G''(Z) {\bf E} X_n^2

\displaystyle + O( \frac{1}{n^{3/2}} {\bf E} |X|^3 ).

The same considerations apply after swapping {X_n} with {Y_n} (which also has a bounded third moment):

\displaystyle  {\bf E} G( Z + \frac{1}{\sqrt{n}} Y_n ) = {\bf E} G(Z) + \frac{1}{\sqrt{n}} {\bf E} G'(Z) {\bf E} Y_n + \frac{1}{2n} {\bf E} G''(Z) {\bf E} Y_n^2

\displaystyle + O( \frac{1}{n^{3/2}} ).

But by hypothesis, {X_n} and {Y_n} have matching moments to second order: {{\bf E} X_n = {\bf E} Y_n = 0} and {{\bf E} X_n^2 = {\bf E} Y_n^2 = 1}. Thus on subtraction we have

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{X_1+\dots+X_{n-1}+Y_n}{\sqrt{n}} )

\displaystyle = O( \frac{1}{n^{3/2}} (1 + {\bf E} |X|^3) ).

A similar argument (permuting the indices, and replacing some of the {X_i} with {Y_i}) gives

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_i+Y_{i+1}+\dots+Y_n}{\sqrt{n}} )

\displaystyle - {\bf E} G( \frac{X_1+\dots+X_{i-1}+Y_i+\dots+Y_n}{\sqrt{n}} )

\displaystyle  = O( \frac{1}{n^{3/2}} (1 + {\bf E} |X|^3) )

for all {1 \leq i \leq n}. Summing the telescoping series, we conclude that

\displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}} ) - {\bf E} G( \frac{Y_1+\dots+Y_n}{\sqrt{n}} ) = O( \frac{1}{n^{1/2}} (1 + {\bf E} |X|^3))

which gives (5). Note how it was important to Taylor expand to at least third order to obtain a total error bound that went to zero, which explains why it is the first two moments {{\bf E} X, {\bf E} X^2} of {X} (or equivalently, the mean and variance) that play such a decisive role in the central limit theorem.
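
The following rough numerical sketch (assuming numpy; the test function {G} below is an arbitrary smooth bounded choice rather than a compactly supported one, and the centred exponential distribution is an arbitrary non-symmetric choice of {X}) illustrates the conclusion of this argument: the discrepancy between {{\bf E} G(S_n/\sqrt{n})} and {{\bf E} G(N)} shrinks as {n} grows, roughly at the {O(n^{-1/2})} rate suggested by the telescoping estimate.

import numpy as np

def G(x):
    # a smooth bounded test function; E G(N) = exp(-1/3)/sqrt(3) in closed form
    return np.exp(-(x - 1.0) ** 2)

exact = np.exp(-1.0 / 3.0) / np.sqrt(3.0)
rng = np.random.default_rng(2)
trials = 200000
for n in (4, 16, 64):
    # X = Exp(1) - 1 has mean zero, unit variance, and finite third moment
    Z = (rng.exponential(size=(trials, n)) - 1.0).sum(axis=1) / np.sqrt(n)
    print(n, abs(np.mean(G(Z)) - exact))    # decays roughly like n^(-1/2)
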
Now we remove the hypothesis of finite third moment. As in the previous set of notes, we use the truncation method, taking advantage of the “room” inherent in the {\frac{1}{n^{1/2}}} factor in the error term of the above analysis. For technical reasons we have to modify the usual truncation slightly to preserve the mean zero condition. Let {\varepsilon > 0}. We split {X_i = X_{i,\leq} + X_{i,>}}, where {X_{i,\leq} := X_i 1_{|X_i| \leq \varepsilon \sqrt{n}} - \mu_n} and {X_{i,>} := X_i 1_{|X_i| > \varepsilon \sqrt{n}} + \mu_n} with {\mu_n := {\bf E} X 1_{|X| \leq \varepsilon \sqrt{n}}}; we split {X = X_{\leq} + X_{>}} respectively (we are suppressing the dependence of {X_{i,\leq}, X_{i,>}, X_{\leq}, X_{>}} on {\varepsilon} and {n} to simplify the notation). The random variables {X_\leq} (and {X_{i,\leq}}) are chosen to have mean zero, but their variance is not quite one. However, from dominated convergence, the quantity {\mu_n := {\bf E} X 1_{|X| \leq \varepsilon \sqrt{n}}} converges to {0}, and the variance {\sigma_n^2 := {\bf Var}(X_{\leq}) = {\bf E} |X|^2 1_{|X| \leq \varepsilon \sqrt{n}} - \mu_n^2} converges to {1}, as {n \rightarrow \infty}.

Let {Y'_1,\dots,Y'_n} be iid copies of {N(0,\sigma_n^2)}. The previous arguments then give

\displaystyle  {\bf E} G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) - {\bf E} G( \frac{Y'_1+\dots+Y'_n}{\sqrt{n}} ) = O( \frac{1}{n^{1/2}} (1 + {\bf E} |X_\leq|^3))

for {n} large enough. Since {|X_\leq| \leq \varepsilon \sqrt{n} + |\mu_n| = O( \varepsilon \sqrt{n} )} for {n} large enough, we can bound

\displaystyle  {\bf E} |X_\leq|^3 = O( \varepsilon n^{1/2} {\bf E} |X_\leq|^2 ) = O( \varepsilon n^{1/2} )

and thus

\displaystyle  {\bf E} G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) - {\bf E} G( \frac{Y'_1+\dots+Y'_n}{\sqrt{n}} ) = O( \varepsilon )

for large enough {n}. By Lemma 12, {\frac{Y'_1+\dots+Y'_n}{\sqrt{n}}} is distributed as {N(0,\sigma_n^2)}, and hence

\displaystyle  {\bf E} G( \frac{Y'_1+\dots+Y'_n}{\sqrt{n}} ) \rightarrow {\bf E} G(N)

as {n \rightarrow \infty}. We conclude that

\displaystyle  {\bf E} G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) = {\bf E} G(N) + O( \varepsilon ).

Next, we consider the error term

\displaystyle  \frac{X_{1,>}+\dots+X_{n,>}}{\sqrt{n}}.

The variable {X_>} has mean zero, and by dominated convergence, the variance of {X_>} goes to zero as {n \rightarrow \infty}. The above error term is also mean zero and has the same variance as {X_>}. In particular, from Cauchy-Schwarz we have

\displaystyle  {\bf E} |\frac{X_{1,>}+\dots+X_{n,>}}{\sqrt{n}}| \rightarrow 0.

As {G} is smooth and compactly supported, it is Lipschitz continuous, and hence

\displaystyle  G( \frac{X_{1}+\dots+X_{n}}{\sqrt{n}} ) = G( \frac{X_{1,\leq}+\dots+X_{n,\leq}}{\sqrt{n}} ) + O( |\frac{X_{1,>}+\dots+X_{n,>}}{\sqrt{n}}| ).

Taking expectations, and then combining all these estimates, we conclude that

\displaystyle  {\bf E} G( \frac{X_{1}+\dots+X_{n}}{\sqrt{n}} ) = {\bf E} G(N) + O( \varepsilon )

for {n} sufficiently large; since {\varepsilon > 0} was arbitrary, we obtain (3) as required. This concludes the proof of Theorem 8.
The above argument can be generalised to a central limit theorem for certain triangular arrays, known as the Lindeberg central limit theorem:

Exercise 14 (Lindeberg central limit theorem) Let {k_n} be a sequence of natural numbers going to infinity in {n}. For each natural number {n}, let {X_{1,n},\dots,X_{k_n,n}} be jointly independent real random variables of mean zero and finite variance. (We do not require the random variables {(X_{1,n},\dots,X_{k_n,n})} to be jointly independent in {n}, or even to be modeled by a common probability space.) Let {\sigma_n} be defined by

\displaystyle  \sigma_n^2 := \sum_{i=1}^{k_n} {\bf Var}(X_{i,n})

and assume that {\sigma_n>0} for all {n}.

  • (i) If one assumes the Lindeberg condition that

    \displaystyle  \frac{1}{\sigma_n^2} \sum_{i=1}^{k_n} {\bf E}( |X_{i,n}|^2 1_{|X_{i,n}| > \varepsilon \sigma_n} ) \rightarrow 0

    as {n \rightarrow \infty} for any {\varepsilon > 0}, then show that the random variables {\frac{X_{1,n}+\dots+X_{k_n,n}}{\sigma_n}} converge in distribution to a random variable with the normal distribution {N(0,1)}.

  • (ii) Show that the Lindeberg condition implies the Feller condition

    \displaystyle  \frac{1}{\sigma_n^2} \max_{1 \leq i \leq k_n} {\bf E} |X_{i,n}|^2 \rightarrow 0

    as {n \rightarrow \infty}.

Note that Theorem 8 (after normalising to the mean zero case {\mu=0}) corresponds to the special case {k_n = n} and {X_{i,n} := X_i} (or, if one wishes, {X_{i,n} := X_i/\sqrt{n}}) of the Lindeberg central limit theorem. It was shown by Feller that if one is in the situation of the above exercise and the Feller condition holds, then the Lindeberg condition is necessary as well as sufficient for {\frac{X_{1,n}+\dots+X_{k_n,n}}{\sigma_n}} to converge in distribution to a random variable with normal distribution {N(0,1)}; the combined result is sometimes known as the Lindeberg-Feller theorem.

Exercise 15 (Weak Berry-Esséen theorem) Let {X_1,\dots,X_n} be iid copies of a real random variable {X} of mean zero, unit variance, and finite third moment.

  • (i) Show that

    \displaystyle  {\bf E} G( \frac{X_1+\dots+X_n}{\sqrt{n}}) = {\bf E} G(N) + O( n^{-1/2} (\sup_{x \in {\bf R}} |G'''(x)|) {\bf E} |X|^3 )

    whenever {G} is three times continuously differentiable and compactly supported, with {N} distributed as {N(0,1)} and the implied constant in the {O()} notation absolute.

  • (ii) Show that

    \displaystyle  {\bf P} ( \frac{X_1+\dots+X_n}{\sqrt{n}} \leq t ) = {\bf P}( N \leq t) + O( ( n^{-1/2} {\bf E} |X|^3 )^{1/4} )

    for any {t \in {\bf R}}, with the implied constant absolute.

We will strengthen the conclusion of this theorem in Theorem 37 below.

Remark 16 The Lindeberg exchange method explains why the limiting distribution of statistics such as {\frac{X_1+\dots+X_n}{\sqrt{n}}} depend primarily on the first two moments of the component random variables {X_i}, if there is a suitable amount of independence between the {X_i}. It turns out that there is an analogous application of the Lindeberg method in random matrix theory, which (very roughly speaking) asserts that appropriate statistics of random matrices such as {(X_{ij})_{1 \leq i, j \leq n}} depend primarily on the first four moments of the matrix components {X_{ij}}, if there is a suitable amount of independence between the {X_{ij}}. See for instance this survey article of Van Vu and myself for more discussion of this. The Lindeberg method also suggests that the more moments of {X} one assumes to match with the Gaussian variable {N}, the faster the rate of convergence (because one can use higher order Taylor expansions).

We now use the Lindeberg central limit theorem to obtain the converse direction of the Kolmogorov three-series theorem (Exercise 29 of Notes 3).

Exercise 17 (Kolmogorov three-series theorem, converse direction) Let {X_1,X_2,\dots} be a sequence of jointly independent real random variables, with the property that the series {\sum_{n=1}^\infty X_n} is almost surely convergent (i.e., the partial sums are almost surely convergent), and let {A>0}.

  • (i) Show that {\sum_{n=1}^\infty {\bf P}( |X_n| > A )} is finite. (Hint: argue by contradiction and use the second Borel-Cantelli lemma.)
  • (ii) Show that {\sum_{n=1}^\infty {\bf Var}( X_n 1_{|X_n| \leq A} )} is finite. (Hint: first use (i) and the Borel-Cantelli lemma to reduce to the case where {|X_n| \leq A} almost surely. If {\sum_{n=1}^\infty {\bf Var}(X_n)} is infinite, use Exercise 14 to show that {\sum_{n=1}^N (X_n-{\bf E}X_n) / \sqrt{\sum_{n=1}^N {\bf Var}(X_n)}} converges in distribution (as {N \rightarrow \infty}) to a standard normal distribution, and use this to contradict the almost sure convergence of {\sum_{n=1}^\infty X_n}.)
  • (iii) Show that the series {\sum_{n=1}^\infty {\bf E} X_n 1_{|X_n| \leq A}} is convergent. (Hint: reduce as before to the case where {|X_n| \leq A} almost surely, and apply the forward direction of the three-series theorem to {\sum_n X_n - {\bf E} X_n}.)

— 2. The Fourier-analytic approach to the central limit theorem —

Let us now give the standard Fourier-analytic proof of the central limit theorem. Given any real random variable {X}, we introduce the characteristic function {\varphi_X: {\bf R} \rightarrow {\bf C}}, defined by the formula

\displaystyle  \varphi_X(t) := {\bf E} e^{itX}. \ \ \ \ \ (6)


Equivalently, {\varphi_X} is the Fourier transform of the probability measure {\mu_X}. One should caution that the term “characteristic function” has several other unrelated meanings in mathematics; particularly confusingly, in real analysis “characteristic function” is used to denote what in probability one would call an “indicator function”. Note that no moment hypotheses are required to define the characteristic function, because the random variable {e^{itX}} is bounded even when {X} is not absolutely integrable.

Example 18 The signed Bernoulli distribution, which takes the values {+1} and {-1} with probabilities of {1/2} each, has characteristic function {\phi(t) = \cos(t)}.

Most of the standard random variables in probability have characteristic functions that are quite simple and explicit. For the purposes of proving the central limit theorem, the most important such explicit form of the characteristic function is of the normal distribution:

Exercise 19 Show that the normal distribution {N(\mu,\sigma^2)} has characteristic function {\phi(t) = e^{it\mu} e^{-\sigma^2 t^2/2}}.

We record the explicit characteristic functions of some other standard distributions:

Exercise 20 Let {\lambda > 0}, and let {X} be a Poisson random variable with intensity {\lambda}, thus {X} takes values in the non-negative integers with {{\bf P}(X=k) = e^{-\lambda} \frac{\lambda^k}{k!}}. Show that {\varphi_X(t) = \exp( \lambda(e^{it}-1) )} for all {t \in {\bf R}}.

Exercise 21 Let {X} be uniformly distributed in some interval {[a,b]}. Show that {\varphi_X(t) = \frac{e^{itb}-e^{ita}}{it(b-a)}} for all non-zero {t}.

Exercise 22 Let {x_0 \in {\bf R}} and {\gamma > 0}, and let {X} be a Cauchy random variable with parameters {x_0,\gamma}, which means that {X} is a real random variable with probability density function {\frac{\gamma}{\pi ((x-x_0)^2+\gamma^2)}}. Show that {\varphi_X(t) = e^{ix_0 t} e^{-\gamma |t|}} for all {t \in {\bf R}}.
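
These closed forms are easy to test numerically by comparing the empirical average of {e^{itX}} over many samples against the corresponding formulas (a minimal Monte Carlo sketch, assuming numpy; the parameter values and the test frequency {t} are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
m = 200000
t = 1.3                                     # an arbitrary test frequency

cases = {
    "normal N(2,4)":         (rng.normal(2.0, 2.0, m),            np.exp(2j * t - 4 * t ** 2 / 2)),
    "Poisson(5)":            (rng.poisson(5.0, m),                 np.exp(5 * (np.exp(1j * t) - 1))),
    "Cauchy(x0=1, gamma=2)": (1.0 + 2.0 * rng.standard_cauchy(m),  np.exp(1j * t - 2 * abs(t))),
}
for name, (xs, exact) in cases.items():
    empirical = np.mean(np.exp(1j * t * xs))   # Monte Carlo estimate of E e^{itX}
    print(name, empirical, exact)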

The characteristic function is clearly bounded in magnitude by {1}, and equals {1} at the origin. By the dominated convergence theorem, {\varphi_X} is continuous in {t}.

Exercise 23 (Riemann-Lebesgue lemma) Show that if {X} is a real random variable that has an absolutely integrable probability density function {f}, then {\varphi_X(t) \rightarrow 0} as {t \rightarrow \infty}. (Hint: first show the claim when {f} is a finite linear combination of indicator functions of intervals, then for the general case show that {f} can be approximated in {L^1} norm by such finite linear combinations.) Note from Example 18 that the claim can fail if {X} does not have a probability density function. (In Fourier analysis, this fact is known as the Riemann-Lebesgue lemma.)

Exercise 24 Show that the characteristic function {\varphi_X} of a real random variable {X} is in fact uniformly continuous on its domain.

Let {X} be a real random variable. If we Taylor expand {e^{itX}} and formally interchange the series and expectation, we arrive at the heuristic identity

\displaystyle  \varphi_X(t) = \sum_{k=0}^\infty \frac{(it)^k}{k!} {\bf E} X^k \ \ \ \ \ (7)


which thus interprets the characteristic function of a real random variable {X} as a kind of generating function for the moments. One rigorous version of this identity is as follows.

Exercise 25 (Taylor expansion of characteristic function) Let {X} be a real random variable with finite {k^{th}} moment for some {k \geq 1}. Show that {\varphi_X} is {k} times continuously differentiable, with

\displaystyle  \frac{d^j}{dt^j} \varphi_X(t) = i^j {\bf E} X^j e^{itX}

for all {0 \leq j \leq k}. Conclude in particular the partial Taylor expansion

\displaystyle  \varphi_X(t) = \sum_{j=0}^k \frac{(it)^j}{j!} {\bf E} X^j + o( |t|^k )

where {o(|t|^k)} is a quantity that goes to zero as {t \rightarrow 0}, times {|t|^k}.

Exercise 26 Let {X} be a real random variable, and assume that it is subgaussian in the sense that there exist constants {C, c > 0} such that

\displaystyle  {\bf P}( |X| \geq t ) \leq C e^{-ct^2}

for all {t > 0}. (Thus for instance a bounded random variable is subgaussian, as is any gaussian random variable.) Rigorously establish (7) in this case, and show that the series converges locally uniformly in {t}.

Note that the characteristic function depends only on the distribution of {X}: if {X} and {Y} are equal in distribution, then {\varphi_X = \varphi_Y}. The converse statement is true also: if {\varphi_X=\varphi_Y}, then {X} and {Y} are equal in distribution. This follows from a more general (and useful) fact, known as Lévy’s continuity theorem.

Theorem 27 (Lévy continuity theorem, special case) Let {X_n} be a sequence of real random variables, and let {X} be an additional real random variable. Then the following statements are equivalent:

  • (i) {\varphi_{X_n}} converges pointwise to {\varphi_X}.
  • (ii) {X_n} converges in distribution to {X}.

Proof: The implication of (i) from (ii) is immediate from (6) and Exercise 10(viii).

Now suppose that (i) holds, and we wish to show that (ii) holds. We need to show that

\displaystyle  {\bf E} G(X_n) \rightarrow {\bf E} G(X)

whenever {G: {\bf R} \rightarrow {\bf R}} is a continuous, compactly supported function. As in the Lindeberg argument, it suffices to prove this when {G} is smooth and compactly supported, in particular {G} is a Schwartz function (infinitely differentiable, with all derivatives rapidly decreasing). But then we have the Fourier inversion formula

\displaystyle  G(X_n) = \int_{{\bf R}} \hat G(t) e^{it \cdot X_n}\ dt

where

\displaystyle  \hat G(t) := \frac{1}{2\pi} \int_{{\bf R}} G(x) e^{-it \cdot x}\ dx

is Schwartz, and is in particular absolutely integrable (see e.g. these lecture notes of mine). From the Fubini-Tonelli theorem, we thus have

\displaystyle  {\bf E} G(X_n) = \int_{{\bf R}} \hat G(t) \varphi_{X_n}(t)\ dt \ \ \ \ \ (8)

and similarly for {X}. The claim now follows from the Lebesgue dominated convergence theorem. \Box

Remark 28 Setting {X_n := Y} for all {n}, we see in particular the previous claim that {\varphi_X = \varphi_Y} if and only if {X}, {Y} have the same distribution. It is instructive to use the above proof as a guide to prove this claim directly.

There is one subtlety with the Lévy continuity theorem: it is possible for a sequence {\varphi_{X_n}} of characteristic functions to converge pointwise, but for the limit to not be the characteristic function of any random variable, in which case {X_n} will not converge in distribution. For instance, if {X_n \equiv N(0,n)}, then {\varphi_{X_n}(t) = e^{-t^2/2n}} converges pointwise to {1_{t=0}} for any {t}, but this is clearly not the characteristic function of any random variable (as characteristic functions are continuous). However, this lack of continuity is the only obstruction:

Exercise 29 (Lévy’s continuity theorem, full version) Let {X_n} be a sequence of real valued random variables. Suppose that {\varphi_{X_n}} converges pointwise to a limit {\varphi}. Show that the following are equivalent:

  • (i) {\varphi} is continuous at {0}.
  • (ii) {X_n} is a tight sequence (as in Exercise 10(iii)).
  • (iii) {\varphi} is the characteristic function of a real valued random variable {X} (possibly after extending the sample space).
  • (iv) {X_n} converges in distribution to some real valued random variable {X} (possibly after extending the sample space).

Hint: To get from (ii) to the other conclusions, use Theorem 11 and Theorem 27. To get back to (ii) from (i), use (8) for a suitable Schwartz function {G}. The other implications are easy once Theorem 27 is in hand.

Remark 30 Lévy’s continuity theorem is very similar in spirit to Weyl’s criterion in equidistribution theory.

Exercise 31 (Esséen concentration inequality) Let {X} be a random variable taking values in {{\bf R}}. Show that for any {r>0} and {\varepsilon > 0}, one has

\displaystyle  \sup_{x_0 \in {\bf R}} {\bf P}( |X - x_0| \leq r ) \leq C_{\varepsilon} r \int_{t \in {\bf R}: |t| \leq \varepsilon/r} |\varphi_X(t)|\ dt \ \ \ \ \ (9)

for some constant {C_{\varepsilon}} depending only on {\varepsilon}. (Hint: Use (8) for a suitable Schwartz function {G}.) The left-hand side of (9) (as well as higher dimensional analogues of this quantity) is known as the small ball probability of {X} at radius {r}.

In Fourier analysis, we learn that the Fourier transform is a particularly well-suited tool for studying convolutions. The probability theory analogue of this fact is that characteristic functions are a particularly well-suited tool for studying sums of independent random variables. More precisely, we have

Exercise 32 (Fourier identities) Let {X, Y} be independent real random variables. Then

\displaystyle  \varphi_{X+Y}(t) = \varphi_X(t) \varphi_Y(t) \ \ \ \ \ (10)

for all {t \in {\bf R}}. Also, for any real scalar {c}, one has

\displaystyle  \varphi_{cX}(t) = \varphi_X(ct)

and more generally, for a random variable {\vec X} taking values in {{\bf R}^d} (with the characteristic function defined as later in this section) and any linear transformation {T: {\bf R}^d \rightarrow {\bf R}^d}, one has

\displaystyle  \varphi_{T\vec X}(t) = \varphi_{\vec X}(T^*t).

Remark 33 Note that this identity (10), combined with Exercise 19 and Remark 28, gives a quick alternate proof of Lemma 12.

In particular, we have the simple relationship

\displaystyle  \varphi_{S_n/\sqrt{n}}(t) = \varphi_X( t / \sqrt{n} )^n \ \ \ \ \ (11)


that describes the characteristic function of {S_n/\sqrt{n}} in terms of that of {X}.
We now have enough machinery to give a quick proof of the central limit theorem:

Proof: (Fourier proof of Theorem 8) We may normalise {X} to have mean zero and variance {1}. By Exercise 25, we thus have

\displaystyle  \varphi_X(t) = 1 - t^2/2 + o(|t|^2)

for sufficiently small {t}, or equivalently

\displaystyle  \varphi_X(t) = \exp( - t^2/2 + o(|t|^2) )

for sufficiently small {t}. Applying (11), we conclude that

\displaystyle  \varphi_{S_n/\sqrt{n}}(t) \rightarrow \exp( - t^2/2 )

as {n \rightarrow \infty} for any fixed {t}. But by Exercise 19, {\exp(-t^2/2)} is the characteristic function of the normal distribution {N(0,1)}. The claim now follows from the Lévy continuity theorem. \Box
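
For instance, for the signed Bernoulli variable of Example 18 one has {\varphi_X(t) = \cos(t)}, and (11) reduces the convergence of characteristic functions to the elementary limit {\cos(t/\sqrt{n})^n \rightarrow e^{-t^2/2}}, which can be checked directly (a tiny sketch using only the Python standard library; the frequency {t} is an arbitrary choice):

import math

t = 1.7                                      # an arbitrary fixed frequency
for n in (10, 100, 1000, 10000, 100000):
    lhs = math.cos(t / math.sqrt(n)) ** n    # characteristic function of S_n/sqrt(n) at t
    print(n, lhs, math.exp(-t ** 2 / 2))
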
The above machinery extends without difficulty to vector-valued random variables {\vec X = (X_1,\dots,X_d)} taking values in {{\bf R}^d}. The analogue of the characteristic function is then the function {\varphi_{\vec X}: {\bf R}^d \rightarrow {\bf C}} defined by

\displaystyle  \varphi_{\vec X}(t_1,\dots,t_d) := {\bf E} e^{i (t_1 X_1 + \dots + t_d X_d)}.

We leave the routine extension of the above results and proofs to the higher dimensional case to the reader. Most interesting is what happens to the central limit theorem:

Exercise 34 (Vector-valued central limit theorem) Let {\vec X=(X_1,\ldots,X_d)} be a random variable taking values in {{\bf R}^d} with finite second moment. Define the covariance matrix {\Sigma(\vec X)} to be the {d \times d} matrix {\Sigma} whose {ij^{th}} entry is the covariance {{\bf E} (X_i - {\bf E}(X_i)) (X_j - {\bf E}(X_j))}.

  • Show that the covariance matrix is positive semi-definite real symmetric.
  • Conversely, given any positive definite real symmetric {d \times d} matrix {\Sigma} and {\mu \in {\bf R}^d}, show that the multivariate normal distribution {N(\mu,\Sigma)_{{\bf R}^d}}, given by the absolutely continuous measure

    \displaystyle  \frac{1}{((2\pi)^d \det \Sigma)^{1/2}} e^{- (x-\mu) \cdot \Sigma^{-1} (x-\mu) / 2}\ dx,

    has mean {\mu} and covariance matrix {\Sigma}, and has a characteristic function given by

    \displaystyle  \varphi_{N(\mu,\Sigma)_{{\bf R}^d}}(t) = e^{i \mu \cdot t} e^{- t \cdot \Sigma t / 2}.

    How would one define the normal distribution {N(\mu,\Sigma)_{{\bf R}^d}} if {\Sigma} degenerated to be merely positive semi-definite instead of positive definite?

  • If {\vec S_n := \vec X_1+\ldots + \vec X_n} is the sum of {n} iid copies of {\vec X}, and {\mu := {\bf E} \vec X}, show that {\frac{\vec S_n - n \mu}{\sqrt{n}}} converges in distribution to {N(0,\Sigma(\vec X))_{{\bf R}^d}}. (For this exercise, you may assume without proof that the Lévy continuity theorem extends to {{\bf R}^d}.)
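
As a quick numerical sanity check of the last part of this exercise (a rough sketch, assuming numpy; the two-dimensional distribution below is an arbitrary example with correlated coordinates), the sample covariance of {\frac{\vec S_n - n\mu}{\sqrt{n}}} should be close to {\Sigma(\vec X)}; of course this only tests the covariance structure, not the full Gaussian limit.

import numpy as np

rng = np.random.default_rng(4)
n, trials = 100, 20000
# an arbitrary 2-dimensional example: X = (U, U + V) with U, V iid uniform on [0, 1]
U = rng.random((trials, n))
V = rng.random((trials, n))
S1 = U.sum(axis=1)                               # first coordinate of S_n
S2 = (U + V).sum(axis=1)                         # second coordinate of S_n
Z = np.stack([S1 - n * 0.5, S2 - n * 1.0], axis=1) / np.sqrt(n)
Sigma = np.array([[1 / 12, 1 / 12], [1 / 12, 2 / 12]])   # covariance matrix of X
print(np.cov(Z, rowvar=False))                   # should be close to Sigma
print(Sigma)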

Exercise 35 (Complex central limit theorem) Let {X} be a complex random variable of mean {\mu \in {\bf C}}, whose real and imaginary parts have variance {\sigma^2/2} and covariance {0}. Let {X_1,\ldots,X_n \equiv X} be iid copies of {X}. Show that as {n \rightarrow \infty}, the normalised sums {\frac{\sqrt{n}}{\sigma} (\frac{X_1+\dots+X_n}{n} - \mu)} converge in distribution to the standard complex gaussian {N(0,1)_{\bf C}}, defined as the probability measure {\gamma} on {{\bf C}} with

\displaystyle  \gamma(S) := \frac{1}{\pi} \int_{S} e^{-|z|^2}\ dz

for Borel {S \subset {\bf C}}, where {dz} is Lebesgue measure on {{\bf C}} (identified with {{\bf R}^2} in the usual fashion).

Exercise 36 Use characteristic functions and the truncation argument to give an alternate proof of the Lindeberg central limit theorem (Exercise 14).

A more sophisticated version of the Fourier-analytic method gives a more quantitative form of the central limit theorem, namely the Berry-Esséen theorem.

Theorem 37 (Berry-Esséen theorem) Let {X} have mean zero and unit variance. Let {S_n := X_1+\ldots+X_n}, where {X_1,\ldots,X_n} are iid copies of {X}. Then we have

\displaystyle  {\bf P}( \frac{S_n}{\sqrt{n}} < a ) = {\bf P}( N < a ) + O( \frac{1}{\sqrt{n}} ({\bf E} |X|^3)) \ \ \ \ \ (12)

uniformly for all {a \in {\bf R}}, where {N} has the distribution of {N(0,1)}, and the implied constant is absolute.

Proof: (Optional) Write {\varepsilon := {\bf E} |X|^3/\sqrt{n}}; our task is to show that

\displaystyle  {\bf P}( \frac{S_n}{\sqrt{n}} < a ) = {\bf P}( N < a ) + O( \varepsilon ) 	\ \ \ \ \ (13)


for all {a}. We may of course assume that {\varepsilon < 1}, as the claim is trivial otherwise. (In particular, {X} has finite third moment.)
Let {c > 0} be a small absolute constant to be chosen later. Let {\eta: {\bf R} \rightarrow {\bf R}} be a non-negative Schwartz function with total mass {1} whose Fourier transform is supported in {[-c,c]}; such an {\eta} can be constructed by taking the inverse Fourier transform of a smooth function supported on {[-c/2,c/2]} and equal to one at the origin, multiplying that transform by its complex conjugate to make it non-negative, and then normalising to have total mass one.

Let {\psi: {\bf R} \rightarrow {\bf R}} be the smoothed out version of {1_{(-\infty,0]}}, defined as

\displaystyle  \psi(x) := \int_{\bf R} 1_{(-\infty,0]}(x - \varepsilon y) \eta(y)\ dy

\displaystyle  = \int_{x/\varepsilon}^\infty \eta(y)\ dy.

Observe that {\psi} is decreasing from {1} to {0}. As {\eta} is rapidly decreasing and has total mass one, we also have the bound

\displaystyle  \psi(x) = 1_{(-\infty,0]}(x) + O_\eta( (1+|x|/\varepsilon)^{-100} ) \ \ \ \ \ (14)

(say) for any {x}, where subscript indicates that the implied constant depends on {\eta}.
We claim that it suffices to show that

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a) = {\bf E} \psi(N - a ) + O_\eta(\varepsilon) \ \ \ \ \ (15)


for every {a}, where the subscript means that the implied constant depends on {\eta}. Indeed, suppose that (15) held. Replacing {a} by {a + C \varepsilon} and {a - C\varepsilon} for some large absolute constant {C} and subtracting, we have

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a - C \varepsilon) - {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a + C \varepsilon) = {\bf E} \psi(N - a - C \varepsilon) - {\bf E} \psi(N - a + C \varepsilon)

\displaystyle  + O_\eta(\varepsilon)

for any {a}. From (14) we see that the function {t \mapsto \psi(t - a - C \varepsilon) - \psi(t - a + C \varepsilon )} is bounded by {O_{\eta,C}( (1 + |t-a|/\varepsilon)^{-100} )}, and hence by the bounded probability density of {N}

\displaystyle  {\bf E} \psi(N - a - C \varepsilon) - {\bf E} \psi(N - a + C \varepsilon) = O_{\eta,C}(\varepsilon).

Also, {\psi(t - a - C \varepsilon) - \psi(t - a + C \varepsilon )} is non-negative, and for {C} large enough, it is bounded from below by (say) {1/2} on {[a,a+\varepsilon]}. We conclude (after choosing {C} appropriately) that

\displaystyle  {\bf P} ( \frac{S_n}{\sqrt{n}} \in [a,a+\varepsilon] ) = O_\eta(\varepsilon) \ \ \ \ \ (16)

for all real {a}. This implies that

\displaystyle  {\bf E} (1 + |\frac{S_n}{\sqrt{n}} - a|/\varepsilon)^{-100} = O_\eta(\varepsilon),

as can be seen by covering the real line by intervals {[a+j\varepsilon, a+(j+1)\varepsilon)} and applying (16) to each such interval. From (14) we conclude that

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}} - a) = {\bf E} 1_{(-\infty,0]}(\frac{S_n}{\sqrt{n}} - a) + O_\eta(\varepsilon).

A similar argument gives

\displaystyle  {\bf E} \psi(N - a) = {\bf E} 1_{(-\infty,0]}(N - a) + O_\eta(\varepsilon),

and (13) then follows from (15).
It remains to establish (15). We write

\displaystyle  \psi(x) = \lim_{M \rightarrow \infty} \int_{x/\varepsilon}^{x/\varepsilon + M} \eta(y)\ dy

(with the expression in the limit being uniformly bounded) and

\displaystyle  \eta(y) = \int_{\bf R} \hat \eta(t) e^{ity}\ dt

to conclude (after applying Fubini’s theorem) that

\displaystyle  \psi(x) = \lim_{M \rightarrow \infty} \int_{-c}^c \hat \eta(t) e^{itx/\varepsilon} \frac{e^{itM}-1}{it}\ dt

using the compact support of {\hat \eta}. Hence

\displaystyle  \psi(\frac{S_n}{\sqrt{n}}-a) - \psi(N-a) = \lim_{M \rightarrow \infty} \int_{-c}^c \hat \eta(t) (e^{it\frac{S_n}{\sqrt{n}}/\varepsilon} - e^{itN/\varepsilon}) \frac{e^{itM}-1}{it} e^{-ita/\varepsilon}\ dt.

By the fundamental theorem of calculus we have {e^{it\frac{S_n}{\sqrt{n}}/\varepsilon} - e^{itN/\varepsilon} = O( |t| (|\frac{S_n}{\sqrt{n}}| + |N|) /\varepsilon)}. This factor of {|t|} cancels a similar factor on the denominator to make the expression inside the limit dominated by an absolutely integrable random variable. Thus by the dominated convergence theorem

\displaystyle  {\bf E} \psi(\frac{S_n}{\sqrt{n}}-a) - {\bf E} \psi(N-a) = \lim_{M \rightarrow \infty} \int_{-c}^c \hat \eta(t) (\varphi_{S_n/\sqrt{n}}(t/\varepsilon)-\varphi_N(t/\varepsilon)) \frac{e^{itM}-1}{it} e^{-ita/\varepsilon}\ dt

and so it suffices to show that

\displaystyle  \int_{-c}^c \hat \eta(t) (\varphi_{S_n/\sqrt{n}}(t/\varepsilon) - \varphi_N(t/\varepsilon)) \frac{e^{itM}-1}{it} e^{-ita/\varepsilon}\ dt = O_\eta(\varepsilon)

uniformly in {M}. Applying the triangle inequality and the compact support of {\hat \eta}, it suffices to show that

\displaystyle  \int_{|t| \leq c} |\varphi_{S_n/\sqrt{n}}(t/\varepsilon) - \varphi_{N}(t/\varepsilon)| \frac{dt}{|t|} = O(\varepsilon). \ \ \ \ \ (17)

From Taylor expansion we have

\displaystyle  e^{itX} = 1 + it X - \frac{t^2}{2} X^2 + O( |t|^3 |X|^3 )

for any {t}; taking expectations (and using the mean zero and unit variance hypotheses, together with the definition of {\varepsilon}) we have

\displaystyle  \varphi_X(t) = 1 - t^2/2 + O( \varepsilon \sqrt{n} |t|^3 )

and in particular

\displaystyle  \varphi_X(t) = \exp( - t^2/2 + O( \varepsilon \sqrt{n} |t|^3 ) )

if {|t| \leq c / (\varepsilon \sqrt{n})} and {c} is small enough. Applying (11), we conclude that

\displaystyle  \varphi_{S_n/\sqrt{n}}(t) = \exp(-t^2/2 + O( \varepsilon |t|^3 ))

if {|t| \leq c/\varepsilon}. Meanwhile, from Exercise 19 we have {\varphi_N(t) = \exp(-t^2/2)}. Elementary calculus then gives us

\displaystyle  |\varphi_{S_n/\sqrt{n}}(t) - \varphi_N(t)| \leq O( \varepsilon |t|^3 \exp( - t^2/4 ) )

(say) if {c} is small enough. Inserting this bound into (17) we obtain the claim. \Box

Exercise 38 Show that the error terms in Theorem 37 are sharp (up to constants) when {X} is a signed Bernoulli random variable, or more precisely when {X} takes values in {\{-1,+1\}} with probability {1/2} for each.
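
As a numerical companion to this exercise (an illustration only), one can compute the distribution of {S_n/\sqrt{n}} exactly in the signed Bernoulli case and measure its Kolmogorov distance to {N}; the rescaled distance {\sqrt{n} \sup_a |{\bf P}(\frac{S_n}{\sqrt{n}} \leq a) - {\bf P}(N \leq a)|} stays roughly constant, consistent with an error term of size comparable to {1/\sqrt{n}}:

import numpy as np
from scipy.stats import binom, norm

# For X = +-1 with probability 1/2 each, S_n = 2*B - n with B ~ Binomial(n, 1/2),
# so the distribution of S_n/sqrt(n) can be computed exactly and compared with N.
for n in [10, 100, 1000, 10000]:
    s = np.arange(-n, n + 1, 2)                    # possible values of S_n
    F = binom.cdf((s + n) // 2, n, 0.5)            # P(S_n <= s)
    Phi = norm.cdf(s / np.sqrt(n))
    F_left = np.concatenate(([0.0], F[:-1]))       # value of the CDF just below each jump
    dist = max(np.abs(F - Phi).max(), np.abs(F_left - Phi).max())
    print(n, dist, dist * np.sqrt(n))              # the last column stays roughly constant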

Exercise 39 Let {X_n} be a sequence of real random variables which converge in distribution to a real random variable {X}, and let {Y_n} be a sequence of real random variables which converge in distribution to a real random variable {Y}. Suppose that, for each {n}, {X_n} and {Y_n} are independent, and suppose also that {X} and {Y} are independent. Show that {X_n+Y_n} converges in distribution to {X+Y}. (Hint: use the Lévy continuity theorem.)

The following exercise shows that the central limit theorem fails when the variance is infinite.

Exercise 40 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable random variable {X} of mean zero.

  • (i) In this part we assume that {X} is symmetric, which means that {X} and {-X} have the same distribution. Show that for any {t > 0} and {M > 0},

    \displaystyle  {\bf P}( X_1 + \dots + X_n \geq t )

    \displaystyle \geq \frac{1}{2} {\bf P}( X_1 1_{|X_1| \leq M} + \dots + X_n 1_{|X_n| \leq M} \geq t ).

    (Hint: relate both sides of this inequality to the probability of the event that {X_1 1_{|X_1| \leq M} + \dots + X_n 1_{|X_n| \leq M} \geq t} and {X_1 1_{|X_1| > M} + \dots + X_n 1_{|X_n| > M} \geq 0}, using the symmetry of the situation.)

  • (ii) If {X} is symmetric and {\frac{X_1+\dots+X_n}{\sqrt{n}}} converges in distribution to a real random variable {Y}, show that {X} has finite variance. (Hint: if this is not the case, then {X 1_{|X| \leq M}} will have arbitrarily large variance as {M} increases. On the other hand, {{\bf P}( Y \geq u )} can be made arbitrarily small by taking {u} large enough. For a large threshold {M}, apply (i) (with {t = u \sqrt{n}}) to obtain a contradiction.)
  • (iii) Generalise (ii) by removing the hypothesis that {X} is symmetric. (Hint: apply the symmetrisation trick of replacing {X} by {X-X'}, where {X'} is an independent copy of {X}, and use the previous exercise. One may need to utilise some truncation arguments to show that {X-X'} has infinite variance whenever {X} has infinite variance.)
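
As a numerical sanity check of the phenomenon in this exercise (a sketch only, with an arbitrary choice of heavy-tailed law), take {X} to be a symmetrised Pareto variable with tail exponent {3/2}, which is absolutely integrable with mean zero but has infinite variance; the typical size of {S_n/\sqrt{n}} then grows with {n} instead of settling down to a limiting distribution:

import numpy as np

rng = np.random.default_rng(0)
alpha = 1.5                                   # tail exponent: E|X| finite, E X^2 infinite
trials = 500
for n in [10**2, 10**3, 10**4]:
    u = 1.0 - rng.random((trials, n))                                     # uniform samples in (0, 1]
    x = rng.choice([-1.0, 1.0], size=(trials, n)) * u ** (-1.0 / alpha)   # symmetric Pareto samples
    z = np.abs(x.sum(axis=1)) / np.sqrt(n)
    print(n, np.median(z))                    # grows roughly like n^{1/alpha - 1/2}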

Exercise 41

  • (i) If {X} is a real random variable of mean zero and variance {\sigma^2}, and {t} is a real number, show that

    \displaystyle  \varphi_X(t) = 1 + O( \sigma^2 t^2 )

    and that

    \displaystyle  \varphi_X(t) = 1 - \frac{1}{2} \sigma^2 t^2 + O( t^2 {\bf E} \min( |X|^2, |t| |X|^3 ) ).

    (Hint: first establish the Taylor bounds {e^{itX} = 1 + itX + O(t^2 |X|^2)} and {e^{itX} = 1 + itX - \frac{1}{2} t^2 X^2 + O( |t|^3 |X|^3 )}.)

  • (ii) Establish the pointwise inequality

    \displaystyle  |z_1 \dots z_n - z'_1 \dots z'_n| \leq \sum_{i=1}^n |z_i - z'_i|

    whenever {z_1,\dots,z_n,z'_1,\dots,z'_n} are complex numbers in the disk {\{ z \in {\bf C}: |z| \leq 1 \}}.

  • (iii) Suppose that for each {n}, {X_{1,n},\dots,X_{n,n}} are jointly independent real random variables of mean zero and finite variance, obeying the uniform bound

    \displaystyle  |X_{i,n}| \leq \varepsilon_n \sqrt{n}

    for all {i=1,\dots,n} and some {\varepsilon_n} going to zero as {n \rightarrow \infty}, and also obeying the variance bound

    \displaystyle  \frac{1}{n} \sum_{i=1}^n {\bf Var}( X_{i,n} ) \rightarrow \sigma^2

    as {n \rightarrow \infty} for some {0 < \sigma < \infty}. If {S_n := X_{1,n} + \dots + X_{n,n}}, use (i) and (ii) to show that

    \displaystyle  \varphi_{S_n/\sqrt{n}}(t) \rightarrow e^{-\sigma^2 t^2/2}

    as {n \rightarrow \infty} for any given real {t}.

  • (iv) Use (iii) and a truncation argument to give an alternate proof of the Lindeberg central limit theorem (Theorem 14). (Note: one has to address the issue that truncating a random variable may alter its mean slightly.)
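
To see the conclusion of part (iii) numerically in the simplest iid case (an illustrative sketch: {X_{i,n}} iid uniform on {[-\sqrt{3},\sqrt{3}]}, so the variables are bounded with unit variance, and the standard identity {\varphi_{S_n/\sqrt{n}}(t) = \varphi_X(t/\sqrt{n})^n} for iid summands gives the characteristic function in closed form):

import numpy as np

# For X uniform on [-sqrt(3), sqrt(3)] one has phi_X(t) = sin(sqrt(3) t)/(sqrt(3) t),
# so the characteristic function of S_n/sqrt(n) is phi_X(t/sqrt(n))^n.
t = np.array([0.5, 1.0, 2.0, 4.0])
for n in [10, 100, 1000, 10000]:
    s = np.sqrt(3) * t / np.sqrt(n)
    phi_n = (np.sin(s) / s) ** n
    print(n, np.max(np.abs(phi_n - np.exp(-t**2 / 2))))   # tends to zero as n grows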

— 3. The moment method —

The above Fourier-analytic proof of the central limit theorem is one of the quickest (and slickest) proofs available for this theorem, and is accordingly the “standard” proof given in probability textbooks. However, it relies quite heavily on the Fourier-analytic identities in Exercise 32, which in turn are extremely dependent on both the commutative nature of the situation (as it uses the identity {e^{A+B}=e^A e^B}) and on the independence of the situation (as it uses identities of the form {{\bf E}(e^A e^B) = ({\bf E} e^A)({\bf E} e^B)}). As one or both of these factors can be absent when trying to generalise this theorem, it is also important to look for non-Fourier based methods to prove results such as the central limit theorem. These methods often lead to proofs that are lengthier and more technical than the Fourier proofs, but also tend to be more robust.

The most elementary (but still remarkably effective) method available in this regard is the moment method, which we have already used in previous notes. In principle, this method is equivalent to the Fourier method, through the identity (7); but in practice, the moment method proofs tend to look somewhat different than the Fourier-analytic ones, and it is often more apparent how to modify them to non-independent or non-commutative settings.

We first need an analogue of the Lévy continuity theorem. Here we encounter a technical issue: whereas the Fourier phases {x \mapsto e^{itx}} were bounded, the moment functions {x \mapsto x^k} become unbounded at infinity. However, one can deal with this issue as long as one has sufficient decay:

Exercise 42 (Subgaussian random variables) Let {X} be a real random variable. Show that the following statements are equivalent:

  • (i) There exist {C,c>0} such that {{\bf P}(|X| \geq t) \leq C e^{-ct^2}} for all {t>0}.
  • (ii) There exists {C'>0} such that {{\bf E} |X|^k \leq (C'k)^{k/2}} for all {k \geq 1}.
  • (iii) There exist {C'',c''>0} such that {{\bf E} e^{tX} \leq C'' e^{c'' t^2}} for all {t \in {\bf R}}.

Furthermore, show that if (i) holds for some {C,c}, then (ii) holds for {C'} depending only on {C,c}, and similarly for any of the other implications. Variables obeying (i), (ii), or (iii) are called subgaussian. The function {t \mapsto {\bf E} e^{tX}} is known as the moment generating function of {X}; it is of course closely related to the characteristic function of {X}.
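
As a concrete instance of (ii) (an illustration only): for the standard gaussian one has the classical formula {{\bf E}|N|^k = 2^{k/2}\Gamma(\frac{k+1}{2})/\sqrt{\pi}}, and the normalised quantity {({\bf E}|N|^k)^{1/k}/\sqrt{k}} stays bounded, as the following check shows:

import numpy as np
from scipy.special import gamma

# E|N|^k = 2^{k/2} Gamma((k+1)/2) / sqrt(pi) for the standard gaussian N;
# the ratio below staying bounded is the content of condition (ii).
for k in [2, 4, 8, 16, 32, 64]:
    mk = 2 ** (k / 2) * gamma((k + 1) / 2) / np.sqrt(np.pi)
    print(k, mk ** (1.0 / k) / np.sqrt(k))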

Exercise 43 Use the truncation method to show that in order to prove the central limit theorem (Theorem 8), it suffices to do so in the case when the underlying random variable {X} is bounded (and in particular subgaussian).

Once we restrict to the subgaussian case, we have an analogue of the Lévy continuity theorem:

Theorem 44 (Moment continuity theorem) Let {X_n} be a sequence of real random variables, and let {X} be a subgaussian random variable. Suppose that for every {k=0,1,2,\dots}, {{\bf E} X_n^k} converges pointwise to {{\bf E} X^k}. Then {X_n} converges in distribution to {X}.

Proof: Let {k = 0,1,2,\dots} be arbitrary. By Exercise 42, we have

\displaystyle  {\bf E} |X|^{k+1} \leq (C (k+1))^{(k+1)/2}

for some {C} independent of {k}. In particular,

\displaystyle  {\bf E} |X_n|^{k+1} \leq 2(C (k+1))^{(k+1)/2}

when {n} is sufficiently large depending on {k}. From Taylor’s theorem with remainder (and Stirling’s formula (1)) we conclude

\displaystyle  \varphi_{X_n}(t) = \sum_{j=0}^k \frac{(it)^j}{j!} {\bf E} X_n^j + O( (C(k+1))^{-(k+1)/2} |t|^{k+1} )

uniformly in {t}, for {n} sufficiently large. Similarly for {X}. Taking limits using the hypothesis that {{\bf E} X_n^j} converges to {{\bf E} X^j} for each {j}, we see that

\displaystyle  \limsup_{n \rightarrow \infty} |\varphi_{X_n}(t) - \varphi_X(t)| = O( (C(k+1))^{-(k+1)/2} |t|^{k+1} ).

Then letting {k \rightarrow \infty}, keeping {t} fixed, we see that {\varphi_{X_n}(t)} converges pointwise to {\varphi_X(t)} for each {t}, and the claim now follows from the Lévy continuity theorem. \Box
We remark that in place of Stirling’s formula, one can use the cruder lower bound {n! \geq n^n/e^n}, which is immediate from the observation that {n^n/n!} occurs in the Taylor expansion of {e^n}.

Remark 45 One corollary of Theorem 44 is that the distribution of a subgaussian random variable is uniquely determined by its moments (actually, this could already be deduced from Exercise 26 and Remark 28). This uniqueness can fail for distributions with slower tails, for much the same reason that a smooth function is not determined by its derivatives at one point if that function is not analytic.
The Fourier inversion formula provides an easy way to recover the distribution from the characteristic function. Recovering a distribution from its moments is more difficult, and sometimes requires tools such as analytic continuation; this problem is known as the inverse moment problem and will not be discussed here.
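
A classical example of this failure (Heyde's log-normal example, not discussed further in these notes) is that perturbing the log-normal density by a factor {1 + \sin(2\pi \log x)} leaves every integer moment unchanged; the following numerical check illustrates this in logarithmic coordinates:

import numpy as np

# In log coordinates, the k-th moments of the log-normal density and of its perturbation
# by (1 + sin(2 pi u)) are the two Riemann sums below; they agree because
# int e^{ku} phi(u) sin(2 pi u) du = exp((k^2 - 4 pi^2)/2) sin(2 pi k) = 0 for integer k.
u = np.linspace(-15.0, 15.0, 600001)
du = u[1] - u[0]
phi = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
for k in range(1, 5):
    m_lognormal = np.sum(np.exp(k * u) * phi) * du
    m_perturbed = np.sum(np.exp(k * u) * phi * (1 + np.sin(2 * np.pi * u))) * du
    print(k, m_lognormal, m_perturbed)       # the two k-th moments agree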

Exercise 46 (Converse direction of moment continuity theorem) Let {X_n} be a sequence of uniformly subgaussian random variables (thus there exist {C,c>0} such that {{\bf P}(|X_n| \geq t) \leq C e^{-ct^2}} for all {t>0} and all {n}), and suppose {X_n} converges in distribution to a limit {X}. Show that for any {k=0,1,2,\dots}, {{\bf E} X_n^k} converges pointwise to {{\bf E} X^k}.

We now give the moment method proof of the central limit theorem. As discussed above we may assume without loss of generality that {X} is bounded (and in particular subgaussian); we may also normalise {X} to have mean zero and unit variance. By Theorem 44, it suffices to show that

\displaystyle  {\bf E} S_n^k / n^{k/2} \rightarrow {\bf E} N^k

for all {k=0,1,2,\ldots}, where {N \equiv N(0,1)_{\bf R}} is a standard gaussian variable.
The moments {{\bf E} N^k} were already computed in Exercise 37 of Notes 1. So now we need to compute {{\bf E} Z_n^k}, where we abbreviate {Z_n := S_n/\sqrt{n}}, so that {{\bf E} Z_n^k = {\bf E} S_n^k / n^{k/2}}. Using linearity of expectation, we can expand this as

\displaystyle  {\bf E} S_n^k / n^{k/2} = n^{-k/2} \sum_{1 \leq i_1,\ldots,i_k \leq n} {\bf E} X_{i_1} \ldots X_{i_k}.

To understand this expression, let us first look at some small values of {k}.

  • For {k=0}, this expression is trivially {1}.
  • For {k=1}, this expression is trivially {0}, thanks to the mean zero hypothesis on {X}.
  • For {k=2}, we can split this expression into the diagonal and off-diagonal components:

    \displaystyle  n^{-1} \sum_{1 \leq i \leq n} {\bf E} X_i^2 + n^{-1} \sum_{1 \leq i < j \leq n} {\bf E} 2 X_i X_j.

    Each summand in the first sum is {1}, as {X} has unit variance. Each summand in the second sum is {0}, as the {X_i} have mean zero and are independent. So the second moment {{\bf E} Z_n^2} is {1}.

  • For {k=3}, we have a similar expansion

    \displaystyle  n^{-3/2} \sum_{1 \leq i \leq n} {\bf E} X_i^3 + n^{-3/2} \sum_{1 \leq i < j \leq n} {\bf E} ( 3 X_i^2 X_j + 3 X_i X_j^2 )

    \displaystyle  + n^{-3/2} \sum_{1 \leq i < j < k \leq n} {\bf E} 6 X_i X_j X_k.

    The summands in the latter two sums vanish because of the (joint) independence and mean zero hypotheses. The summands in the first sum need not vanish, but are {O(1)}, so the first term is {O(n^{-1/2})}, which is asymptotically negligible, so the third moment {{\bf E} Z_n^3} goes to {0}.

  • For {k=4}, the expansion becomes quite complicated:

    \displaystyle  n^{-2} \sum_{1 \leq i \leq n} {\bf E} X_i^4 + n^{-2} \sum_{1 \leq i < j \leq n} {\bf E} ( 4 X_i^3 X_j + 6 X_i^2 X_j^2 + 4 X_i X_j^3 )

    \displaystyle  + n^{-2} \sum_{1 \leq i < j < k \leq n} {\bf E} ( 12 X_i^2 X_j X_k + 12 X_i X_j^2 X_k + 12 X_i X_j X_k^2 )

    \displaystyle  + n^{-2} \sum_{1 \leq i < j < k < l \leq n} {\bf E} 24 X_i X_j X_k X_l.

    Again, most terms vanish, except for the first sum, which is {O( n^{-1} )} and is asymptotically negligible, and the sum {n^{-2} \sum_{1 \leq i < j \leq n} {\bf E} 6 X_i^2 X_j^2}, which by the independence and unit variance assumptions works out to {n^{-2} 6 \binom{n}{2} = 3 + o(1)}. Thus the fourth moment {{\bf E} Z_n^4} goes to {3} (as it should).
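
These small-{k} computations can also be confirmed empirically (a quick Monte Carlo sketch, included purely as an illustration, using signed Bernoulli steps so that {S_n} can be sampled from a binomial distribution):

import numpy as np

# Sample S_n/sqrt(n) for X = +-1 with probability 1/2 each (so S_n = 2B - n with
# B binomial) and compare the first few empirical moments with 0, 1, 0, 3.
rng = np.random.default_rng(1)
n, trials = 1000, 10**6
z = (2.0 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)
for k in range(1, 5):
    print(k, (z ** k).mean())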

Now we tackle the general case. Ordering the distinct values of the indices {i_1,\ldots,i_k} as {j_1 < \ldots < j_m} for some {1 \leq m \leq k}, with each {j_r} occurring with multiplicity {a_r \geq 1}, and using elementary enumerative combinatorics, we see that {{\bf E} S_n^k / n^{k/2}} is the sum of all terms of the form

\displaystyle  n^{-k/2} \sum_{1 \leq j_1 < \ldots < j_m \leq n} c_{k,a_1,\ldots,a_m} {\bf E} X_{j_1}^{a_1} \ldots X_{j_m}^{a_m} \ \ \ \ \ (18)


where {1 \leq m \leq k}, {a_1,\ldots,a_m} are positive integers adding up to {k}, and {c_{k,a_1,\ldots,a_m}} is the multinomial coefficient

\displaystyle  c_{k,a_1,\ldots,a_m} := \frac{k!}{a_1! \ldots a_m!}.

The total number of such terms depends only on {k} (in fact, it is {2^{k-1}} (exercise!), though we will not need this fact).
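
For the curious, the count {2^{k-1}} can be verified by a small enumeration (a sketch only, using the correspondence between compositions of {k} and subsets of the {k-1} gaps between {k} unit blocks):

from itertools import combinations

# Enumerate the compositions of k (ordered tuples of positive integers summing to k)
# by choosing which of the k-1 gaps between k unit blocks to cut.
def compositions(k):
    result = []
    for r in range(k):
        for cuts in combinations(range(1, k), r):
            bounds = (0,) + cuts + (k,)
            result.append(tuple(b - a for a, b in zip(bounds[:-1], bounds[1:])))
    return result

for k in range(1, 8):
    print(k, len(compositions(k)), 2 ** (k - 1))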

As we already saw from the small {k} examples, most of the terms vanish, and many of the other terms are negligible in the limit {n \rightarrow \infty}. Indeed, if any of the {a_r} are equal to {1}, then every summand in (18) vanishes, by joint independence and the mean zero hypothesis. Thus, we may restrict attention to those expressions (18) for which all the {a_r} are at least {2}. Since the {a_r} sum up to {k}, we conclude that {m} is at most {k/2}.

On the other hand, the total number of summands in (18) is clearly at most {n^m} (in fact it is {\binom{n}{m}}), and the summands are bounded (for fixed {k}) since {X} is bounded. Thus, if {m} is strictly less than {k/2}, then the expression in (18) is {O( n^{m-k/2} )} and goes to zero as {n \rightarrow \infty}. So, asymptotically, the only terms (18) which are still relevant are those for which {m} is equal to {k/2}. This already shows that {{\bf E} Z_n^k} goes to zero when {k} is odd. When {k} is even, the only surviving term in the limit is now when {m=k/2} and {a_1=\ldots=a_m = 2}. But then by independence and unit variance, the expectation in (18) is {1}, and so this term is equal to

\displaystyle  n^{-k/2} \binom{n}{m} c_{k,2,\ldots,2} = \frac{1}{(k/2)!} \frac{k!}{2^{k/2}} + o(1),

and the main term is happily equal to the moment {{\bf E} N^k} as computed in Exercise 37 of Notes 1. (One could also appeal to Lemma 12 here, specialising to the case when {X} is normally distributed, to explain this coincidence.) This concludes the proof of the central limit theorem.
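
As a quick arithmetic check (illustrative only) that the main term {\frac{k!}{2^{k/2} (k/2)!}} matches the gaussian moments, recall that for even {k} one has {{\bf E} N^k = (k-1)!! = (k-1)(k-3)\cdots 1}; the two expressions agree:

from math import factorial

def double_factorial(m):
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

# k!/(2^{k/2} (k/2)!) versus (k-1)!! for even k: the two columns agree.
for k in range(2, 13, 2):
    print(k, factorial(k) // (2 ** (k // 2) * factorial(k // 2)), double_factorial(k - 1))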

Exercise 47 (Chernoff bound) Let {X_1,\dots,X_n} be iid copies of a real random variable {X} of mean zero and unit variance, which is subgaussian in the sense of Exercise 42. Write {S_n := X_1+\dots+X_n}.

  • (i) Show that there exists {c''>0} such that {{\bf E} e^{tX} \leq e^{c'' t^2}} for all {t \in {\bf R}}. Conclude that {{\bf E} e^{tS_n/\sqrt{n}} \leq e^{c'' t^2}} for all {t \in {\bf R}}. (Hint: the first claim follows directly from Exercise 42 when {|t| \geq 1}; for {|t| < 1}, use the Taylor approximation {e^{tX} = 1 + tX + O( t^2 (e^{2X} + e^{-2X}) )}.)
  • (ii) Conclude the Chernoff bound

    \displaystyle  {\bf P}( |\frac{S_n}{\sqrt{n}}| \geq \lambda ) \leq C e^{-c \lambda^2}

    for some {C,c>0}, all {\lambda>0}, and all {n \geq 1}.
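
A quick simulation of the bound in (ii) (a sketch only, using signed Bernoulli steps; in this special case Hoeffding's inequality gives the explicit bound {{\bf P}(|\frac{S_n}{\sqrt{n}}| \geq \lambda) \leq 2e^{-\lambda^2/2}}, printed alongside the empirical frequencies):

import numpy as np

rng = np.random.default_rng(2)
n, trials = 1000, 10**6
z = (2.0 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)   # samples of S_n/sqrt(n)
for lam in [1.0, 2.0, 3.0, 4.0]:
    print(lam, (np.abs(z) >= lam).mean(), 2 * np.exp(-lam ** 2 / 2))   # empirical tail vs Hoeffding bound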

Exercise 48 (Erdös-Kac theorem) For any natural number {x \geq 100}, let {n} be a natural number drawn uniformly at random from the natural numbers {\{1,\dots,x\}}, and let {\omega(n)} denote the number of distinct prime factors of {n}.

  • (i) Show that for any {k=0,1,2,\dots}, one has

    \displaystyle {\bf E} (\frac{\omega(n)-\log\log x}{\sqrt{\log\log x}})^k \rightarrow 0

    as {x \rightarrow \infty} if {k} is odd, and

    \displaystyle {\bf E} (\frac{\omega(n)-\log\log x}{\sqrt{\log\log x}})^k \rightarrow \frac{k!}{2^{k/2} (k/2)!}

    as {x \rightarrow \infty} if {k} is even. (Hint: adapt the arguments in Exercise 16 of Notes 3, estimating {\omega(n)-\log\log x} by {\sum_{p \leq x^{1/10k}} (1_{p|n} - \frac{1}{p})}, using Mertens’ theorem and induction on {k} to deal with lower order errors, and treating the random variables {1_{p|n} - \frac{1}{p}} as being approximately independent and approximately of mean zero.)

  • (ii) Establish the Erdös-Kac theorem

    \displaystyle  \frac{1}{x} | \{ n \leq x: a \leq \frac{\omega(n)-\log\log x}{\sqrt{\log\log x}} \leq b \}| \rightarrow \frac{1}{\sqrt{2\pi}} \int_a^b e^{-y^2/2}\ dy

    as {x \rightarrow \infty} for any fixed {a < b}.

Informally, the Erdös-Kac theorem asserts that {\omega(n)} behaves like {N( \log\log n, \log\log n)} for “random” {n}. Note that this refines the Hardy-Ramanujan theorem (Exercise 16 of Notes 3).
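
One can get a rough feel for this numerically (a sketch only; the {\log\log x} normalisation converges extremely slowly, so at {x = 10^6} the agreement with the gaussian moments {0, 1, 0, 3} is only approximate) by sieving {\omega(n)} for all {n \leq x}:

import numpy as np

# Sieve the number of distinct prime factors omega(n) for n <= x, then look at the
# first few moments of (omega(n) - log log x)/sqrt(log log x) for a uniform random n <= x.
x = 10**6
omega = np.zeros(x + 1, dtype=np.int32)
for p in range(2, x + 1):
    if omega[p] == 0:                 # p has no smaller prime factor, so p is prime
        omega[p::p] += 1
z = (omega[1:] - np.log(np.log(x))) / np.sqrt(np.log(np.log(x)))
for k in [1, 2, 3, 4]:
    print(k, (z ** k).mean())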