In the previous set of notes we established the central limit theorem, which we formulate here as follows:

Theorem 1 (Central limit theorem) Let ${X_1,X_2,X_3,\dots}$ be iid copies of a real random variable ${X}$ of mean ${\mu}$ and variance ${0 < \sigma^2 < \infty}$, and write ${S_n := X_1 + \dots + X_n}$. Then, for any fixed ${a < b}$, we have

$\displaystyle {\bf P}( a \leq \frac{S_n - n \mu}{\sqrt{n} \sigma} \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\ dt \ \ \ \ \ (1)$

as ${n \rightarrow \infty}$.
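As a quick illustrative aside (a numerical sketch, not part of the formal development), one can test (1) by Monte Carlo simulation; here ${X}$ is taken to be a fair die roll, so that ${\mu = 7/2}$ and ${\sigma^2 = 35/12}$:

```python
# Monte Carlo sanity check of the central limit theorem (1): sample the
# normalised sum for X uniform on {1,...,6}, and compare the empirical
# probability of landing in [a,b] with the Gaussian integral.
import math
import random

random.seed(0)
n, trials = 200, 5000
mu, sigma = 3.5, math.sqrt(35 / 12)
a, b = -1.0, 1.0

hits = 0
for _ in range(trials):
    s = sum(random.randint(1, 6) for _ in range(n))
    z = (s - n * mu) / (math.sqrt(n) * sigma)
    if a <= z <= b:
        hits += 1

empirical = hits / trials
# Gaussian probability of [a,b], expressed via the error function
gaussian = 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))
print(empirical, gaussian)
```

Already for ${n}$ in the hundreds, the two quantities agree to within Monte Carlo error.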

This is however not the end of the matter; there are many variants, refinements, and generalisations of the central limit theorem, and the purpose of this set of notes is to present a small sample of these variants.

First of all, the above theorem does not quantify the rate of convergence in (1). We have already addressed this issue to some extent with the Berry-Esséen theorem, which roughly speaking gives a convergence rate of ${O(1/\sqrt{n})}$ uniformly in ${a,b}$ if we assume that ${X}$ has finite third moment. However there are still some quantitative versions of (1) which are not addressed by the Berry-Esséen theorem. For instance one may be interested in bounding the large deviation probabilities

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) \ \ \ \ \ (2)$

in the setting where ${\lambda}$ grows with ${n}$. Chebyshev’s inequality gives an upper bound of ${1/\lambda^2}$ for this quantity, but one can often do much better than this in practice. For instance, the central limit theorem (1) suggests that this probability should be bounded by something like ${O( e^{-\lambda^2/2})}$; however, this theorem only kicks in when ${n}$ is very large compared with ${\lambda}$. For instance, if one uses the Berry-Esséen theorem, one would need ${n}$ as large as ${e^{\lambda^2}}$ or so to reach the desired bound of ${O( e^{-\lambda^2/2})}$, even under the assumption of finite third moment. Basically, the issue is that convergence-in-distribution results, such as the central limit theorem, only really control the typical behaviour of statistics such as ${\frac{S_n-n \mu}{\sqrt{n} \sigma}}$; they are much less effective at controlling the very rare outlier events in which the statistic strays far from its typical behaviour. Fortunately, there are large deviation inequalities (or concentration of measure inequalities) that do provide exponential type bounds for quantities such as (2), which are valid for both small and large values of ${n}$. A basic example of this is the Chernoff bound that made an appearance in Exercise 47 of Notes 4; here we give some further basic inequalities of this type, including versions of the Bennett and Hoeffding inequalities.

In the other direction, we can also look at the fine scale behaviour of the sums ${S_n}$ by trying to control probabilities such as

$\displaystyle {\bf P}( a \leq S_n \leq a+h ) \ \ \ \ \ (3)$

where ${h}$ is now bounded (but ${a}$ can grow with ${n}$). The central limit theorem predicts that this quantity should be roughly ${\frac{h}{\sqrt{2\pi n} \sigma} e^{-(a-n\mu)^2 / 2n \sigma^2}}$, but even if one is able to invoke the Berry-Esséen theorem, one cannot quite see this main term because it is dominated by the error term ${O(1/n^{1/2})}$ in Berry-Esséen. There is good reason for this: if for instance ${X}$ takes integer values, then ${S_n}$ also takes integer values, and ${{\bf P}( a \leq S_n \leq a+h )}$ can vanish when ${h}$ is less than ${1}$ and ${a}$ is slightly larger than an integer. However, this turns out to essentially be the only obstruction; if ${X}$ does not lie in a lattice such as ${{\bf Z}}$, then we can establish a local limit theorem controlling (3), and when ${X}$ does take values in a lattice like ${{\bf Z}}$, there is a discrete local limit theorem that controls probabilities such as ${{\bf P}(S_n = m)}$. Both of these limit theorems will be proven by the Fourier-analytic method used in the previous set of notes.

We also discuss other limit theorems in which the limiting distribution is something other than the normal distribution. Perhaps the most common example of these theorems is the Poisson limit theorem, in which one sums a large number of indicator variables (or approximate indicator variables), each of which is rarely non-zero, but which collectively add up to a random variable of medium-sized mean. In this case, it turns out that the limiting distribution should be a Poisson random variable; this again is an easy application of the Fourier method. Finally, we briefly discuss limit theorems for stable laws other than the normal distribution, which are suitable for summing random variables of infinite variance, such as the Cauchy distribution.

Finally, we mention a very important class of generalisations to the CLT (and to the variants of the CLT discussed in this post), in which the hypothesis of joint independence between the variables ${X_1,\dots,X_n}$ is relaxed, for instance one could assume only that the ${X_1,\dots,X_n}$ form a martingale. Many (though not all) of the proofs of the CLT extend to these more general settings, and this turns out to be important for many applications in which one does not expect joint independence. However, we will not discuss these generalisations in this course, as they are better suited for subsequent courses in this series when the theory of martingales, conditional expectation, and related tools are developed.

— 1. Large deviation inequalities —

We now look at some upper bounds for the large deviation probability (2). To get some intuition as to what kinds of bounds one can expect, we first consider some examples. First suppose that ${X \equiv N(0,1)}$ has the standard normal distribution. Then ${\mu=0}$, ${\sigma^2=1}$, and ${S_n}$ has the distribution of ${N(0,n)}$, so that ${\frac{S_n - n \mu}{\sqrt{n} \sigma}}$ has the distribution of ${N(0,1)}$. We thus have

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) = \frac{2}{\sqrt{2\pi}} \int_\lambda^\infty e^{-t^2/2}\ dt$

which on using the inequality ${e^{-t^2/2} \leq e^{-\lambda^2/2} e^{-\lambda (t-\lambda)}}$ leads to the bound

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) \leq \lambda^{-1} e^{-\lambda^2/2}. \ \ \ \ \ (4)$
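This bound is easy to verify numerically (an illustrative sketch: the left-hand side equals ${\mathrm{erfc}(\lambda/\sqrt{2})}$, by a standard identity for the Gaussian tail):

```python
# Check of the Gaussian tail bound (4): the exact two-sided tail
# (2/sqrt(2*pi)) * int_lambda^infty e^{-t^2/2} dt equals erfc(lambda/sqrt(2)),
# which should be dominated by lambda^{-1} e^{-lambda^2/2}.
import math

pairs = []
for lam in [1.0, 2.0, 3.0, 5.0]:
    exact = math.erfc(lam / math.sqrt(2))
    bound = math.exp(-lam * lam / 2) / lam
    pairs.append((lam, exact, bound))
    print(lam, exact, bound)
```

Note that the bound is already fairly tight for moderately large ${\lambda}$.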

Next, we consider the example when ${X}$ is a Bernoulli random variable drawn uniformly at random from ${\{0,1\}}$. Then ${\mu=1/2}$, ${\sigma^2=1/4}$, and ${S_n}$ has the standard binomial distribution on ${\{0,\dots,n\}}$, thus ${{\bf P}(S_n=i) = \binom{n}{i}/2^n}$. By symmetry, we then have

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) = 2 \sum_{i \geq \frac{n}{2} + \frac{\lambda \sqrt{n}}{2}} \binom{n}{i} / 2^n.$

We recall Stirling’s formula, which we write crudely as

$\displaystyle n! = n^n e^{-n + o(n)} = \exp( n \log n - n + o(n) )$

as ${n \rightarrow \infty}$, where ${o(n)}$ denotes a quantity with ${o(n)/n \rightarrow 0}$ as ${n \rightarrow \infty}$ (and similarly for other uses of the ${o()}$ notation in the sequel). If ${i = \alpha n}$ and ${\alpha}$ is bounded away from zero and one, we then have the asymptotic

$\displaystyle \binom{n}{i} = \exp( h(\alpha) n + o(n) )$

where ${h(\alpha)}$ is the entropy function

$\displaystyle h(\alpha) := \alpha \log \frac{1}{\alpha} + (1-\alpha) \log \frac{1}{1-\alpha}$

(compare with Exercise 17 of Notes 3). One can check that ${h(\alpha)}$ is decreasing for ${1/2 \leq \alpha \leq 1}$, and so one can compute that

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \theta \sqrt{n} ) = \exp( (h(\frac{1+\theta}{2}) - \log 2 + o(1)) n )$

as ${n \rightarrow \infty}$ for any fixed ${0 < \theta < 1}$. To compare this with (4), observe from Taylor expansion that

$\displaystyle h(\frac{1+\theta}{2}) = \log 2 - \frac{\theta^2}{2} + o(\theta^2)$

as ${\theta \rightarrow 0}$; thus the large deviation exponent ${(h(\frac{1+\theta}{2}) - \log 2) n}$ is approximately ${-\theta^2 n / 2 = -\lambda^2/2}$ with ${\lambda := \theta \sqrt{n}}$, consistent with the gaussian tail bound (4).
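One can confirm numerically that ${h(\frac{1+\theta}{2})}$ equals ${\log 2}$ minus a quadratic correction of ${\theta^2/2}$ for small ${\theta}$ (a quick sketch, not part of the notes):

```python
# The entropy function h(alpha) near alpha = 1/2: the discrepancy between
# h((1+theta)/2) and log 2 - theta^2/2 should be o(theta^2).
import math

def h(alpha):
    return alpha * math.log(1 / alpha) + (1 - alpha) * math.log(1 / (1 - alpha))

errors = []
for theta in [0.2, 0.1, 0.05]:
    err = abs(h((1 + theta) / 2) - (math.log(2) - theta ** 2 / 2))
    errors.append(err / theta ** 2)  # should tend to zero with theta
    print(theta, err / theta ** 2)
```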

Finally, consider the example where ${X}$ takes values in ${\{0,1\}}$ with ${{\bf P}(X=0)=1-p}$ and ${{\bf P}(X=1)=p}$ for some small ${0 < p < 1/2}$, thus ${\mu = p}$ and ${\sigma^2 = p - p^2 \approx p}$. We have ${{\bf P}(S_n = n) = p^n}$, and hence

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) = p^n = \exp( - n \log \frac{1}{p} )$

with

$\displaystyle \lambda := \sqrt{n} \frac{\sqrt{1-p}}{\sqrt{p}} \approx \sqrt{n} / \sqrt{p}.$

Here, we see that the large deviation probability ${\exp( - n \log \frac{1}{p} )}$ is somewhat larger than the gaussian prediction of ${\exp( - c \lambda^2 )}$. Instead, the exponent ${n \log \frac{1}{p}}$ is approximately related to ${\lambda}$ and ${\sigma}$ by the formula

$\displaystyle n \log \frac{1}{p} \approx \lambda \sqrt{n} \sigma \log \frac{\lambda}{\sqrt{n} \sigma}.$

We now give a general large deviations inequality that is consistent with the above examples.

Proposition 2 (Cheap Bennett inequality) Let ${a > 0}$, and let ${X_1,\dots,X_n}$ be independent random variables, each of which takes values in an interval of length at most ${a}$. Write ${S_n := X_1+\dots+X_n}$, and write ${\mu^{(n)}}$ for the mean of ${S_n}$. Let ${\sigma^{(n)} > 0}$ be such that ${S_n}$ has variance at most ${(\sigma^{(n)})^2}$. Then for any ${\lambda > 0}$, we have

$\displaystyle {\bf P}( |\frac{S_n - \mu^{(n)}}{\sigma^{(n)}}| \geq \lambda ) \leq 2 \exp( - c \min( \lambda^2, \frac{\lambda \sigma^{(n)}}{a} \log \frac{a \lambda}{\sigma^{(n)}} ) ) \ \ \ \ \ (5)$

for some absolute constant ${c>0}$.

There is a more precise form of this inequality, known as Bennett’s inequality, but we will not prove it here.

The first term in the minimum dominates when ${\lambda \ll \frac{\sigma^{(n)}}{a}}$, and the second term dominates when ${\lambda \gg \frac{\sigma^{(n)}}{a}}$. Sometimes it is convenient to weaken the estimate by discarding the logarithmic factor, leading to

$\displaystyle {\bf P}( |\frac{S_n - \mu^{(n)}}{\sigma^{(n)}}| \geq \lambda ) \leq 2 \exp( - c \min( \lambda^2, \frac{\lambda \sigma^{(n)}}{a} ) )$

(possibly with a slightly different choice of ${c}$); thus we have Gaussian type large deviation estimates for ${\lambda}$ as large as ${\sigma^{(n)}/a}$, and (slightly better than) exponential decay after that.

In the case when ${X_1,\dots,X_n}$ are iid copies of a random variable ${X}$ of mean ${\mu}$ and variance ${\sigma^2}$ taking values in an interval of length ${1}$, we have ${\mu^{(n)} = n \mu}$ and ${(\sigma^{(n)})^2 \leq n \sigma^2}$, and the above inequality simplifies slightly to

$\displaystyle {\bf P}( |\frac{S_n - n\mu}{\sqrt{n} \sigma}| \geq \lambda ) \leq 2 \exp( - c \min( \lambda^2, \lambda \sqrt{n} \sigma \log \frac{\lambda}{\sqrt{n} \sigma} ) ).$
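To get a feel for this bound in a concrete case, the following sketch compares exact binomial tails for the Bernoulli example from earlier against the right-hand side, with the purely illustrative choice ${c = 0.1}$ (the proposition only guarantees *some* absolute constant ${c}$):

```python
# Exact binomial tails versus the cheap Bennett bound, for X_i iid
# Bernoulli(1/2): here a = 1 and S_n has standard deviation sqrt(n)/2.
# The constant c = 0.1 is an illustrative guess, not an optimal one, and
# we crudely clamp the logarithm below by 1 since the logarithmic factor
# only matters in the regime lambda >> sqrt(n)*sigma.
import math

n = 100
sigma_n = math.sqrt(n) / 2  # standard deviation of S_n

def exact_tail(lam):
    thr = lam * sigma_n
    return sum(math.comb(n, i) for i in range(n + 1)
               if abs(i - n / 2) >= thr) / 2 ** n

c = 0.1
results = []
for lam in [1.0, 2.0, 3.0, 4.0]:
    exponent = min(lam ** 2,
                   lam * sigma_n * max(math.log(lam / sigma_n), 1.0))
    bound = 2 * math.exp(-c * exponent)
    results.append((exact_tail(lam), bound))
    print(lam, exact_tail(lam), bound)
```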

Proof: We first begin with some quick reductions. Firstly, by dividing the ${X_i}$ (and ${S_n}$, ${\mu^{(n)}}$, and ${\sigma^{(n)}}$) by ${a}$, we may normalise ${a=1}$; by subtracting the mean from each of the ${X_i}$, we may assume that the ${X_i}$ have mean zero, so that ${\mu^{(n)}=0}$ as well. We also write ${\sigma_i^2}$ for the variance of each ${X_i}$, so that ${(\sigma^{(n)})^2 \geq \sum_{i=1}^n \sigma_i^2}$. Our task is to show that

$\displaystyle {\bf P}( |S_n| \geq \lambda \sigma^{(n)} ) \leq 2 \exp( - c \min( \lambda^2, \lambda \sigma^{(n)} \log \frac{\lambda}{\sigma^{(n)}} ) )$

for all ${\lambda > 0}$. We will just prove the upper tail bound

$\displaystyle {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - c \min( \lambda^2, \lambda \sigma^{(n)} \log \frac{\lambda}{\sigma^{(n)}} ) );$

the lower tail bound then follows by replacing all ${X_i}$ with their negations ${-X_i}$, and the claim then follows by summing the two estimates.

We use the “exponential moment method”, previously seen in proving the Chernoff bound (Exercise 47 of Notes 4), in which one uses the exponential moment generating function ${t \mapsto {\bf E} e^{tS_n}}$ of ${S_n}$. On the one hand, from Markov’s inequality one has

$\displaystyle {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - t \lambda \sigma^{(n)}) {\bf E} e^{t S_n}$

for any parameter ${t > 0}$. On the other hand, from the joint independence of the ${X_i}$ one has

$\displaystyle {\bf E} e^{t S_n} = \prod_{i=1}^n {\bf E} e^{tX_i}.$

Since the ${X_i}$ take values in an interval of length at most ${1}$ and have mean zero, we have ${X_i = O(1)}$ and so ${tX_i = O(t)}$. This leads to the Taylor expansion

$\displaystyle e^{tX_i} = 1 + t X_i + O(e^{O(t)} t^2 X_i^2 )$

so on taking expectations we have

$\displaystyle {\bf E} e^{t X_i} = 1 + O( e^{O(t)} t^2 \sigma_i^2 )$

$\displaystyle = \exp( O( e^{O(t)} t^2 \sigma_i^2 ) ).$

Putting all this together, we conclude that

$\displaystyle {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - t \lambda \sigma^{(n)} + O(e^{O(t)} t^2 (\sigma^{(n)})^2) ).$

If ${\lambda \leq 10 \sigma^{(n)}}$ (say), one can then set ${t}$ to be a small multiple of ${\lambda / \sigma^{(n)}}$ to obtain a bound of the form

$\displaystyle {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - c \lambda^2 ).$

If instead ${\lambda > 10 \sigma^{(n)}}$, one can set ${t}$ to be a small multiple of ${\log \frac{\lambda}{\sigma^{(n)}}}$ to obtain a bound of the form

$\displaystyle {\bf P}( S_n \geq \lambda \sigma^{(n)} ) \leq \exp( - c \lambda \sigma^{(n)} \log \frac{\lambda}{\sigma^{(n)}} ).$

In either case, the claim follows. $\Box$
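To see the exponential moment method in action, one can carry out the optimisation in ${t}$ numerically in a concrete case (an illustrative sketch for sums of independent random signs, where ${{\bf E} e^{tX_i} = \cosh t}$):

```python
# Exponential moment method for S_n a sum of n independent random signs:
# E e^{tS_n} = (cosh t)^n, so by Markov's inequality
#   P(S_n >= lambda*sqrt(n)) <= inf_{t>0} e^{-t*lambda*sqrt(n)} (cosh t)^n.
# A crude grid search over t (done in the log domain to avoid overflow)
# already gives Gaussian-type decay, far better than Chebyshev's 1/lambda^2.
import math

n = 1000
sqrt_n = math.sqrt(n)
results = []
for lam in [2.0, 4.0, 6.0]:
    log_best = min(-t * lam * sqrt_n + n * math.log(math.cosh(t))
                   for t in [k / 1000 for k in range(1, 2001)])
    best = math.exp(log_best)
    chebyshev = 1 / lam ** 2
    results.append((best, chebyshev))
    print(lam, best, chebyshev)
```

The optimal choice here is ${t = \tanh^{-1}(\lambda/\sqrt{n}) \approx \lambda/\sqrt{n}}$, which recovers a bound of roughly ${e^{-\lambda^2/2}}$.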

The following variant of the above proposition is also useful, in which we get a simpler bound at the expense of worsening the quantity ${\sigma^{(n)}}$ slightly:

Proposition 3 (Cheap Hoeffding inequality) Let ${X_1,\dots,X_n}$ be independent random variables, with each ${X_i}$ taking values in an interval ${[a_i,b_i]}$ with ${b_i > a_i}$. Write ${S_n := X_1+\dots+X_n}$, and write ${\mu^{(n)}}$ for the mean of ${S_n}$, and write

$\displaystyle (\sigma^{(n)})^2 := \sum_{i=1}^n (b_i-a_i)^2.$

Then for any ${\lambda > 0}$, we have

$\displaystyle {\bf P}( |\frac{S_n - \mu^{(n)}}{\sigma^{(n)}}| \geq \lambda ) \leq 2 \exp( - c \lambda^2 ) \ \ \ \ \ (6)$

for some absolute constant ${c>0}$.

In fact one can take ${c=2}$, a fact known as Hoeffding’s inequality; see Exercise 6 below for a special case of this.

Proof: We again normalise the ${X_i}$ to have mean zero, so that ${\mu^{(n)}=0}$. We then have ${|X_i| \leq b_i-a_i}$ for each ${i}$, so by Taylor expansion we have for any real ${t}$ that

$\displaystyle e^{tX_i} =1 + tX_i + O( t^2 (b_i-a_i)^2 \exp( O( t (b_i-a_i) ) ) )$

and thus

$\displaystyle {\bf E} e^{tX_i} = 1 + O( t^2 (b_i-a_i)^2 \exp( O( t (b_i-a_i) ) ) )$

$\displaystyle = \exp( O( t^2 (b_i-a_i)^2 ) ).$

Multiplying in ${i}$, we then have

$\displaystyle {\bf E} e^{tS_n} = \exp( O( t^2 (\sigma^{(n)})^2 ) )$

and one can now repeat the previous arguments (but without the ${e^{O(t)}}$ factor to deal with). $\Box$

Remark 4 In the above examples, the underlying random variable ${X}$ was assumed to either be restricted to an interval, or to be subgaussian. This type of hypothesis is necessary if one wishes to have estimates on (2) that are similarly subgaussian. For instance, suppose ${X}$ has a zeta distribution

$\displaystyle {\bf P}(X=m) = \frac{1}{\zeta(s)} \frac{1}{m^s}$

for some ${s>1}$ and all natural numbers ${m}$, where ${\zeta(s) := \sum_{m=1}^\infty \frac{1}{m^s}}$ is the Riemann zeta function. One can check that this distribution has finite mean and variance for ${s > 3}$. On the other hand, since we trivially have ${S_n \geq X_1}$, we have the crude lower bound

$\displaystyle {\bf P}(S_n \geq m) \geq {\bf P}(X = m) = \frac{1}{\zeta(s)} \frac{1}{m^s}$

which shows that in this case the expression (2) only decays at a polynomial rate in ${\lambda}$ rather than an exponential or subgaussian rate.

Exercise 5 (Khintchine inequality) Let ${\varepsilon_1,\dots,\varepsilon_n}$ be iid copies of a Bernoulli random variable ${\varepsilon}$ drawn uniformly from ${\{-1,+1\}}$.

• (i) For any non-negative reals ${a_1,\dots,a_n}$ and any ${0 < p < \infty}$, show that

$\displaystyle {\bf E} |\sum_{i=1}^n \varepsilon_i a_i|^p \leq C_p (\sum_{i=1}^n |a_i|^2)^{p/2}$

for some constant ${C_p}$ depending only on ${p}$. When ${p=2}$, show that one can take ${C_2=1}$ and equality holds.

• (ii) With the hypotheses in (i), obtain the matching lower bound

$\displaystyle {\bf E} |\sum_{i=1}^n \varepsilon_i a_i|^p \geq c_p (\sum_{i=1}^n |a_i|^2)^{p/2}$

for some ${c_p>0}$ depending only on ${p}$. (Hint: use (i) and Hölder’s inequality.)

• (iii) For any ${0 < p < \infty}$ and any functions ${f_1,\dots,f_n \in L^p(X)}$ on a measure space ${X = (X, {\mathcal X},\mu)}$, show that

$\displaystyle {\bf E} \| \sum_{i=1}^n \varepsilon_i f_i \|_{L^p(X)}^p \leq C_p \| (\sum_{i=1}^n |f_i|^2)^{1/2} \|_{L^p(X)}^p$

and

$\displaystyle {\bf E} \| \sum_{i=1}^n \varepsilon_i f_i \|_{L^p(X)}^p \geq c_p \| (\sum_{i=1}^n |f_i|^2)^{1/2} \|_{L^p(X)}^p$

with the same constants ${c_p, C_p}$ as in (i), (ii). When ${p=2}$, show that one can take ${c_2 = C_2=1}$ and equality holds.

• (iv) (Marcinkiewicz-Zygmund theorem) The Khintchine inequality is very useful in real analysis; we give one example here. Let ${X, Y}$ be measure spaces, let ${1 < p < \infty}$, and suppose ${T: L^p(X) \rightarrow L^p(Y)}$ is a linear operator obeying the bound

$\displaystyle \|Tf\|_{L^p(Y)} \leq A \|f\|_{L^p(X)}$

for all ${f \in L^p(X)}$ and some finite ${A}$. Show that for any finite sequence ${f_1,\dots,f_n \in L^p(X)}$, one has the bound

$\displaystyle \| (\sum_{i=1}^n |Tf_i|^2)^{1/2} \|_{L^p(Y)} \leq C'_p A \| (\sum_{i=1}^n |f_i|^2)^{1/2} \|_{L^p(X)}$

for some constant ${C'_p}$ depending only on ${p}$. (Hint: test ${T}$ against a random sum ${\sum_{i=1}^n \varepsilon_i f_i}$.)

• (v) By using gaussian sums in place of random signs, show that one can take the constant ${C'_p}$ in (iv) to be one. (For simplicity, let us take the functions in ${L^p(X), L^p(Y)}$ to be real valued.)

In this set of notes we have not focused on getting explicit constants in the large deviation inequalities, but it is not too difficult to do so with a little extra work. We give just one example here:

Exercise 6 Let ${\varepsilon_1,\dots,\varepsilon_n}$ be iid copies of a Bernoulli random variable ${\varepsilon}$ drawn uniformly from ${\{-1,+1\}}$.

• (i) Show that ${{\bf E} e^{t \varepsilon} \leq \exp( t^2/2 )}$ for any real ${t}$. (Hint: expand both sides as an infinite Taylor series in ${t}$.)
• (ii) Show that for any real numbers ${a_1,\dots,a_n}$ and any ${\lambda > 0}$, we have

$\displaystyle {\bf P}( |\sum_{i=1}^n \varepsilon_i a_i| > \lambda (\sum_{i=1}^n |a_i|^2)^{1/2} ) \leq 2 e^{-\lambda^2/2}.$

(Note that this is consistent with (6) with ${c=2}$.)
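As a numerical sanity check of (ii) (an illustrative sketch, taking all coefficients ${a_i}$ equal to ${1}$), the left-hand side can be computed exactly from binomial coefficients:

```python
# Exact check of the subgaussian tail bound in Exercise 6(ii) when all
# a_i = 1: then sum_i eps_i a_i is a sum of n random signs, which takes
# the value 2k - n when exactly k of the signs are +1.
import math

n = 30
results = []
for lam in [0.5, 1.0, 2.0, 3.0]:
    thr = lam * math.sqrt(n)
    prob = sum(math.comb(n, k) for k in range(n + 1)
               if abs(2 * k - n) > thr) / 2 ** n
    bound = 2 * math.exp(-lam ** 2 / 2)
    results.append((prob, bound))
    print(lam, prob, bound)
```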

There are many further large deviation inequalities beyond the ones presented here. For instance, the Azuma-Hoeffding inequality gives Hoeffding-type bounds when the random variables ${X_1,\dots,X_n}$ are not assumed to be jointly independent, but are instead required to form a martingale. Concentration of measure inequalities such as McDiarmid’s inequality handle the situation in which the sum ${S_n = X_1 + \dots + X_n}$ is replaced by a more nonlinear function ${F(X_1,\dots,X_n)}$ of the input variables ${X_1,\dots,X_n}$. There are also a number of refinements of the Chernoff estimate from the previous notes, that are collectively referred to as “Chernoff bounds”. The Bernstein inequalities handle situations in which the underlying random variable ${X}$ is not bounded, but enjoys good moment bounds. See this previous blog post for these inequalities and some further discussion. Last, but certainly not least, there is an extensively developed theory of large deviations which is focused on the precise exponent in the exponential decay rate for tail probabilities such as (2) when ${\lambda}$ is very large (of the order of ${\sqrt{n}}$); there is also a complementary theory of moderate deviations that gives precise estimates in the regime where ${\lambda}$ is much larger than one, but much less than ${\sqrt{n}}$, for which we generally expect gaussian behaviour rather than exponentially decaying bounds. These topics are beyond the scope of this course.

— 2. Local limit theorems —

Let ${X_1,X_2,\dots}$ be iid copies of a random variable ${X}$ of mean ${\mu}$ and variance ${\sigma^2}$, and write ${S_n := X_1 + \dots + X_n}$. On the one hand, the central limit theorem tells us that ${S_n}$ should behave like the normal distribution ${N(n\mu, n\sigma^2)}$, which has probability density function ${x \mapsto \frac{1}{\sqrt{2\pi n} \sigma} e^{-(x-n\mu)^2 / 2n \sigma^2}}$. On the other hand, if ${X}$ is discrete, then ${S_n}$ must also be discrete. For instance, if ${X}$ takes values in the integers ${{\bf Z}}$, then ${S_n}$ takes values in the integers as well. In this case, we would expect (much as we expect a Riemann sum to approximate an integral) the probability distribution of ${S_n}$ to behave like the probability density function predicted by the central limit theorem, thus we expect

$\displaystyle {\bf P}( S_n = m) \approx \frac{1}{\sqrt{2\pi n} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2} \ \ \ \ \ (7)$

for integer ${m}$. This is not a direct consequence of the central limit theorem (which does not distinguish between continuous or discrete random variables ${X}$), and in any case is not true in some cases: if ${X}$ is restricted to an infinite subprogression ${a + q {\bf Z}}$ of ${{\bf Z}}$ for some ${q>1}$ and integer ${a}$, then ${S_n}$ is similarly restricted to the infinite subprogression ${na + q{\bf Z}}$, so that (7) totally fails when ${m}$ is outside of ${na+q{\bf Z}}$ (and when ${m}$ does lie in ${na + q{\bf Z}}$, one would now expect the left-hand side of (7) to be about ${q}$ times larger than the right-hand side, to keep the total probability close to ${1}$). However, this turns out to be the only obstruction:

Theorem 7 (Discrete local limit theorem) Let ${X_1,X_2,\dots}$ be iid copies of an integer-valued random variable ${X}$ of mean ${\mu}$ and variance ${\sigma^2}$. Suppose furthermore that there is no infinite subprogression ${a+q{\bf Z}}$ of ${{\bf Z}}$ with ${q>1}$ for which ${X}$ takes values almost surely in ${a+q{\bf Z}}$. Then one has

$\displaystyle {\bf P}( S_n = m) = \frac{1}{\sqrt{2\pi n} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2} + o(1/n^{1/2})$

for all ${n \geq 1}$ and all integers ${m}$, where the error term ${o(1/n^{1/2})}$ is uniform in ${m}$. In other words, we have

$\displaystyle \sup_{m \in {\bf Z}} n^{1/2} | {\bf P}( S_n = m) - \frac{1}{\sqrt{2\pi n} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2}| \rightarrow 0$

as ${n \rightarrow \infty}$.

Note for comparison that the Berry-Esséen theorem (writing ${{\bf P}( S_n = m)}$ as, say, ${{\bf P}( m-1/2 \leq S_n \leq m+1/2 )}$) would give (assuming finite third moment) an error term of ${O(1/n^{1/2})}$ instead of ${o(1/n^{1/2})}$, which would overwhelm the main term which is also of size ${O(1/n^{1/2})}$.
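Before giving the proof, here is a numerical illustration (a sketch, not part of the notes) with ${X}$ the Bernoulli variable from earlier (${\mu = 1/2}$, ${\sigma = 1/2}$), for which ${{\bf P}(S_n = m) = \binom{n}{m}/2^n}$ exactly:

```python
# Discrete local limit theorem for Bernoulli(1/2): compare the exact
# probabilities C(n,m)/2^n with the Gaussian prediction, and check that
# sqrt(n) times the sup error over m decreases as n grows.
import math

mu, sigma = 0.5, 0.5
sup_errs = []
for n in [100, 400, 1600]:
    sup_err = max(
        abs(math.comb(n, m) / 2 ** n
            - math.exp(-(m - n * mu) ** 2 / (2 * n * sigma ** 2))
            / (math.sqrt(2 * math.pi * n) * sigma))
        for m in range(n + 1))
    sup_errs.append(math.sqrt(n) * sup_err)
    print(n, math.sqrt(n) * sup_err)
```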

Proof: Unlike previous arguments, we do not have the luxury here of using an affine change of variables to normalise to mean zero and variance one, as this would disrupt the hypothesis that ${X}$ takes values in ${{\bf Z}}$.

Fix ${n}$ and ${m}$. Since ${S_n}$ and ${m}$ are integers, we have the Fourier identity

$\displaystyle 1_{S_n = m} = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{itS_n} e^{-itm}\ dt,$

which upon taking expectations and using Fubini’s theorem gives

$\displaystyle {\bf P}(S_n = m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \varphi_{S_n}(t) e^{-itm}\ dt$

where ${\varphi_{S_n}(t) := {\bf E} e^{itS_n} }$ is the characteristic function of ${S_n}$. Expanding ${S_n = X_1 + \dots + X_n}$ and noting that ${X_1,\dots,X_n}$ are iid copies of ${X}$, we have

$\displaystyle \varphi_{S_n}(t) = \varphi_X(t)^n = e^{in \mu t} \varphi_{X-\mu}(t)^n.$

It will be convenient to make the change of variables ${t = x/\sqrt{n}}$, to obtain

$\displaystyle \sqrt{n} {\bf P}(S_n = m) = \frac{1}{2\pi} \int_{-\pi \sqrt{n}}^{\pi \sqrt{n}} \varphi_{X-\mu}(x/\sqrt{n})^n e^{i\mu x \sqrt{n}} e^{-ixm/\sqrt{n}}\ dx.$

As in the Fourier-analytic proof of the central limit theorem, we have

$\displaystyle \varphi_{X-\mu}(t) = 1 - \frac{\sigma^2}{2} t^2 + o(t^2) \ \ \ \ \ (8)$

as ${t \rightarrow 0}$, so by Taylor expansion we have

$\displaystyle \varphi_{X-\mu}(x/\sqrt{n})^n \rightarrow e^{-\sigma^2 x^2 / 2} \ \ \ \ \ (9)$

as ${n \rightarrow \infty}$ for any fixed ${x}$. This suggests (but does not yet prove) that

$\displaystyle \sqrt{n} {\bf P}(S_n = m) \rightarrow \frac{1}{2\pi} \int_{\bf R} e^{i \mu x \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx.$

A standard Fourier-analytic calculation gives

$\displaystyle \frac{1}{2\pi} \int_{\bf R} e^{i \mu x \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx = \frac{1}{\sqrt{2\pi} \sigma} e^{-(m-n\mu)^2 / 2n \sigma^2}$

so it will now suffice to establish that

$\displaystyle \int_{-\pi \sqrt{n}}^{\pi \sqrt{n}} e^{i \mu x \sqrt{n}} \varphi_{X-\mu}(x/\sqrt{n})^n e^{-ixm/\sqrt{n}}\ dx$

$\displaystyle = \int_{\bf R} e^{i \mu x \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx + o(1)$

uniformly in ${m}$. From dominated convergence we have

$\displaystyle \int_{|x| \geq \pi \sqrt{n}} e^{-\sigma^2 x^2/2} e^{-ixm/\sqrt{n}}\ dx = o(1)$

so by the triangle inequality, it suffices to show that

$\displaystyle \int_{-\pi \sqrt{n}}^{\pi \sqrt{n}} |\varphi_{X-\mu}(x/\sqrt{n})^n - e^{-\sigma^2 x^2/2}| dx = o(1).$

This will follow from (9) and the dominated convergence theorem, as soon as we can dominate the integrands ${|\varphi_{X-\mu}(x/\sqrt{n})^n - e^{-\sigma^2 x^2/2}| 1_{|x| \leq \pi \sqrt{n}}}$ by an absolutely integrable function.

From (8), there is an ${\varepsilon > 0}$ such that

$\displaystyle |\varphi_{X-\mu}(t)| \leq 1 - \frac{\sigma^2}{4} t^2 \leq \exp( - \sigma^2 t^2 / 4)$

for all ${|t| \leq \varepsilon}$, and hence

$\displaystyle |\varphi_{X-\mu}(x/\sqrt{n})^n| \leq \exp( - \sigma^2 x^2 / 4 )$

for ${|x| \leq \varepsilon \sqrt{n}}$. This gives the required domination in the region ${|x| \leq \varepsilon \sqrt{n}}$, so it remains to handle the region ${\varepsilon \sqrt{n} < |x| \leq \pi \sqrt{n}}$.

From the triangle inequality we have ${|\varphi_{X-\mu}(t)| \leq 1}$ for all ${t}$. Actually we have the stronger bound ${|\varphi_{X-\mu}(t)| < 1}$ for ${0 < |t| \leq \pi}$. Indeed, if ${|\varphi_{X-\mu}(t)| = 1}$ for some such ${t}$, this would imply that ${e^{it(X-\mu)}}$ is a deterministic constant, which means that ${t(X-\mu)}$ takes values in ${a + 2\pi {\bf Z}}$ for some real ${a}$, which implies that ${X}$ takes values in ${\mu + \frac{a}{t} + \frac{2\pi}{t} {\bf Z}}$; since ${X}$ also takes values in ${{\bf Z}}$, this would place ${X}$ either in a singleton set or in an infinite subprogression of ${{\bf Z}}$ (depending on whether ${\frac{2\pi}{t}}$ and ${\mu + \frac{a}{t}}$ are rational), a contradiction. As ${\varphi_{X-\mu}}$ is continuous and the region ${\varepsilon \leq |t| \leq \pi}$ is compact, there exists ${0 < c < 1}$ such that ${|\varphi_{X-\mu}(t)| \leq c}$ for all ${\varepsilon \leq |t| \leq \pi}$. This allows us to dominate ${|\varphi_{X-\mu}(x/\sqrt{n})^n - e^{-\sigma^2 x^2/2}|}$ by ${c^n + e^{-\sigma^2 \varepsilon^2 n / 2}}$ for ${\varepsilon \sqrt{n} \leq |x| \leq \pi \sqrt{n}}$, which is in turn bounded by ${\exp( - \delta x^2 )}$ for some ${\delta > 0}$ independent of ${n}$, giving the required domination. $\Box$

Of course, if the random variable ${X}$ in Theorem 7 did take values almost surely in some subprogression ${a + q{\bf Z}}$, then either ${X}$ is almost surely constant, or there is a minimal progression ${a+q{\bf Z}}$ with this property (since ${q}$ must divide the difference of any two integers that ${X}$ attains with positive probability). One can then make the affine change of variables ${X = a + q X'}$ (modifying ${\mu}$ and ${\sigma^2}$ appropriately) and apply the above theorem to obtain a similar local limit theorem, which we will not write here. For instance, if ${X}$ is the uniform distribution on ${\{-1,1\}}$, then this argument gives

$\displaystyle {\bf P}( S_n = m) = \frac{2}{\sqrt{2\pi n}} e^{-m^2 / 2n} + o(1/n^{1/2})$

when ${m}$ is an integer of the same parity as ${n}$ (of course, ${{\bf P}(S_n=m)}$ will vanish otherwise). A further affine change of variables handles the case when ${X}$ is not integer valued, but takes values in some other lattice ${a+q{\bf Z}}$, where ${q}$ and ${a}$ are now real-valued.
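Concretely, for ${X}$ uniform on ${\{-1,1\}}$ one can compare the exact probabilities ${{\bf P}(S_n = m) = \binom{n}{(n+m)/2}/2^n}$ (for ${m}$ of the same parity as ${n}$) against the prediction ${\frac{2}{\sqrt{2\pi n}} e^{-m^2/2n}}$ (a quick numerical sketch):

```python
# Local limit theorem for sums of random signs: S_n = m (with m and n of
# equal parity) exactly when (n+m)/2 of the n signs are +1.
import math

n = 500
worst = 0.0
for m in range(-n, n + 1, 2):
    exact = math.comb(n, (n + m) // 2) / 2 ** n
    approx = 2 / math.sqrt(2 * math.pi * n) * math.exp(-m * m / (2 * n))
    worst = max(worst, abs(exact - approx))
print(math.sqrt(n) * worst)  # should be small compared to 1
```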

We can complement these local limit theorems with the following result that handles the non-lattice case:

Theorem 8 (Continuous local limit theorem) Let ${X_1,X_2,\dots}$ be iid copies of a real-valued random variable ${X}$ of mean ${\mu}$ and variance ${\sigma^2}$. Suppose furthermore that there is no infinite progression ${a+q{\bf Z}}$ with ${a,q}$ real for which ${X}$ takes values almost surely in ${a+q{\bf Z}}$. Then for any ${h > 0}$, one has

$\displaystyle {\bf P}( a \leq S_n \leq a+h ) = \frac{h}{\sqrt{2\pi n} \sigma} e^{-(a-n\mu)^2 / 2n \sigma^2} + o(1/n^{1/2})$

for all ${n \geq 1}$ and all ${a \in {\bf R}}$, where the error term ${o(1/n^{1/2})}$ is uniform in ${a}$ (but may depend on ${h}$).

Equivalently, if we let ${N_n}$ be a normal random variable with mean ${n\mu}$ and variance ${n\sigma^2}$, then

$\displaystyle {\bf P}( a \leq S_n \leq a+h ) = {\bf P}( a \leq N_n \leq a+h ) + o(1/n^{1/2})$

uniformly in ${a}$. Again, this can be compared with the Berry-Esséen theorem, which (assuming finite third moment) has an error term of ${O(1/n^{1/2})}$ which is uniform in both ${a}$ and ${h}$.
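Before turning to the proof, a Monte Carlo illustration (a rough sketch, taking ${X}$ exponentially distributed with ${\mu = \sigma = 1}$, which is certainly not supported on any lattice):

```python
# Monte Carlo check of the continuous local limit theorem with
# X ~ Exponential(1) (mu = sigma = 1): estimate P(a <= S_n <= a+h) near
# the centre a = n*mu and compare with the Gaussian prediction.
import math
import random

random.seed(0)
n, trials, h = 100, 50000, 1.0
a = float(n)  # centre of the distribution, a = n*mu

hits = sum(1 for _ in range(trials)
           if a <= sum(random.expovariate(1.0) for _ in range(n)) <= a + h)
empirical = hits / trials
predicted = h / math.sqrt(2 * math.pi * n) * math.exp(-(a - n) ** 2 / (2 * n))
print(empirical, predicted)
```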

Proof: Unlike the discrete case, we have the luxury here of normalising ${\mu=0}$ and ${\sigma^2=1}$, and we shall now do so.

We first observe that it will suffice to show that

$\displaystyle {\bf E} G(S_n-a) = {\bf E} G(N_n-a) + o(1/n^{1/2}), \ \ \ \ \ (10)$

whenever ${G: {\bf R} \rightarrow {\bf C}}$ is a Schwartz function whose Fourier transform

$\displaystyle \hat G(t) := \frac{1}{2\pi} \int_{\bf R} G(x) e^{-itx}\ dx$

is compactly supported, and where the ${o(1/n^{1/2})}$ error can depend on ${G}$ but is uniform in ${a}$. Indeed, if this bound (10) holds, then (after replacing ${G}$ by ${|G|^2}$ to make it positive on some interval, and then rescaling) we obtain a bound of the form

$\displaystyle {\bf P}( a \leq S_n \leq a+h ) \leq O(h / n^{1/2} ) + o(1/n^{1/2}) \ \ \ \ \ (11)$

for any ${h>0}$ and ${a \in {\bf R}}$, where the ${o(1/n^{1/2})}$ term can depend on ${h}$ but not on ${a}$. Then, by convolving ${1_{[0,h]}}$ by an approximation to the identity of some width ${h'}$ much smaller than ${h}$ with compactly supported Fourier transform, applying (10) to the resulting function, and using (11) to control the error between that function and ${1_{[0,h]}}$, we see that

$\displaystyle {\bf P}( a \leq S_n \leq a+h ) = {\bf P}( a \leq N_n \leq a+h ) + O(h'/n^{1/2}) + o(1/n^{1/2})$

uniformly in ${a}$, where the ${o(1/n^{1/2})}$ term can depend on ${h}$ and ${h'}$ but is uniform in ${a}$. Letting ${h'}$ tend slowly to zero, we obtain the claim.

It remains to establish (10). We adapt the argument from the discrete case. By the Fourier inversion formula we may write

$\displaystyle G(x) = \int_{\bf R} \hat G(t) e^{itx}\ dt.$

By Fubini’s theorem as before, we thus have

$\displaystyle {\bf E} G(S_n-a) = \int_{\bf R} \hat G(t) \varphi_{S_n}(t) e^{-ita}\ dt$

and similarly

$\displaystyle {\bf E} G(N_n-a) = \int_{\bf R} \hat G(t) \varphi_{N_n}(t) e^{-ita}\ dt$

so it suffices by the triangle inequality and the boundedness and compact support of ${\hat G}$ to show that

$\displaystyle \int_{|t| \leq A} |\hat G(t)| |\varphi_{S_n}(t) - \varphi_{N_n}(t)|\ dt = o(1/n^{1/2})$

for any fixed ${A}$ (where the ${o(1/n^{1/2})}$ term can depend on ${A}$). We have ${\varphi_{S_n}(t) = \varphi_X(t)^n}$ and ${\varphi_{N_n}(t) = e^{-n t^2 / 2}}$, so by making the change of variables ${x = \sqrt{n} t}$, we now need to show that

$\displaystyle \int_{|x| \leq A \sqrt{n}} |\hat G(x/\sqrt{n})| |\varphi_X(x/\sqrt{n})^n - e^{-x^2/2}|\ dx \rightarrow 0$

as ${n \rightarrow \infty}$. But this follows from the argument used to handle the discrete case. $\Box$

— 3. The Poisson central limit theorem —

The central limit theorem (after normalising the random variables to have mean zero) studies the fluctuations of sums ${\frac{S_n}{\sqrt{n}} = \frac{X_1}{\sqrt{n}} + \dots + \frac{X_n}{\sqrt{n}}}$ where each individual term ${\frac{X_i}{\sqrt{n}}}$ is quite small (typically of size ${O(1/\sqrt{n})}$). Now we consider a variant situation, in which one considers a sum ${S_n = X_1 + \dots + X_n}$ of random variables which are usually zero, but occasionally equal to a larger value such as ${1}$. (This situation arises in many real-life situations when compiling aggregate statistics on rare events, e.g. the number of car crashes in a short period of time.) In these cases, one can get a different distribution than the gaussian distribution, namely a Poisson distribution with some intensity ${\lambda}$ – that is to say, a random variable ${X}$ taking values in the non-negative integers ${0,1,2,\dots}$ with probability distribution

$\displaystyle {\bf P}(X = n) = \frac{\lambda^n}{n!} e^{-\lambda}.$

One can check that this distribution has mean ${\lambda}$ and variance ${\lambda}$.
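The mean and variance computation can be sanity-checked numerically (an illustration only; the helper name `poisson_moments` is ad hoc), truncating the series at a cutoff beyond which the terms are negligible:

```python
import math

# Check numerically that a Poisson variable of intensity lam has
# mean lam and variance lam, truncating the series at a large cutoff.
def poisson_moments(lam, cutoff=100):
    p = math.exp(-lam)          # P(X = 0) = e^{-lam}
    mean = second = 0.0
    for n in range(cutoff):
        mean += n * p
        second += n * n * p
        p *= lam / (n + 1)      # P(X = n+1) = P(X = n) * lam/(n+1)
    return mean, second - mean ** 2

m, v = poisson_moments(3.5)
print(m, v)  # both approximately 3.5
```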

Theorem 9 (Poisson central limit theorem) Let ${(X_{j,n})_{1 \leq j \leq n}}$ be a triangular array of real random variables, where for each ${n}$, the variables ${X_{1,n},\dots,X_{n,n}}$ are jointly independent. Assume furthermore that

• (i) (${X_{j,n}}$ mostly ${0,1}$) One has ${\sum_{j=1}^n {\bf P}( X_{j,n} \not \in \{0,1\} ) \rightarrow 0}$ as ${n \rightarrow \infty}$.
• (ii) (${X_{j,n}}$ rarely ${1}$) One has ${\sup_{1 \leq j \leq n} {\bf P}(X_{j,n} = 1 ) \rightarrow 0}$ as ${n \rightarrow \infty}$.
• (iii) (Convergent expectation) One has ${\sum_{j=1}^n {\bf P}(X_{j,n} = 1 ) \rightarrow \lambda}$ as ${n \rightarrow \infty}$ for some ${0 < \lambda < \infty}$.

Then the random variables ${S_n := X_{1,n} + \dots + X_{n,n}}$ converge in distribution to a Poisson random variable of intensity ${\lambda}$.

Proof: From hypothesis (i) and the union bound, we see that all of the ${X_{1,n},\dots,X_{n,n}}$ lie in ${\{0,1\}}$ with probability ${1-o(1)}$ as ${n \rightarrow \infty}$. Thus, if we replace each ${X_{j,n}}$ by the restriction ${X_{j,n} 1_{X_{j,n} \in \{0,1\}}}$, the random variable ${S_n}$ is only modified on an event of probability ${o(1)}$, which does not affect distributional limits (Slutsky’s theorem). Hence, we may assume without loss of generality that the ${X_{j,n}}$ take values in ${\{0,1\}}$.

By Exercise 20 of Notes 4, a Poisson random variable of intensity ${\lambda}$ has characteristic function ${t \mapsto \exp( \lambda(e^{it}-1) )}$. Applying the Lévy convergence theorem (Theorem 27 of Notes 4), we conclude that it suffices to show that

$\displaystyle \varphi_{S_n}(t) \rightarrow \exp( \lambda(e^{it}-1) )$

as ${n \rightarrow \infty}$ for any fixed ${t}$.

Fix ${t}$. By the independence of the ${X_{j,n}}$, we may write

$\displaystyle \varphi_{S_n}(t) = \prod_{j=1}^n \varphi_{X_{j,n}}(t).$

Since ${X_{j,n}}$ only takes on the values ${0}$ and ${1}$, we can write

$\displaystyle \varphi_{X_{j,n}}(t) = 1 - p_{j,n} + p_{j,n} e^{it}$

where ${p_{j,n} := {\bf P}( X_{j,n} = 1 )}$. By hypothesis (ii), we have ${p_{j,n} = o(1)}$, so by using a branch of the complex logarithm that is analytic near ${1}$, we can write

$\displaystyle \varphi_{S_n}(t) = \exp( \sum_{j=1}^n \log( 1 - p_{j,n} + p_{j,n} e^{it} ) ).$

By Taylor expansion we have

$\displaystyle \log( 1 - p_{j,n} + p_{j,n} e^{it} ) = p_{j,n} (e^{it}-1) + O( p_{j,n}^2 )$

and hence by (ii), (iii)

$\displaystyle \sum_{j=1}^n \log( 1 - p_{j,n} + p_{j,n} e^{it} ) = \lambda (e^{it}-1) + o(1)$

as ${n \rightarrow \infty}$, and the claim follows. $\Box$

Exercise 10 Establish the conclusion of Theorem 9 directly from explicit computation of the probabilities ${{\bf P}(S_n = k)}$ in the case when each ${X_{j,n}}$ takes values in ${\{0,1\}}$ with ${{\bf P}(X_{j,n}=1) = \lambda/n}$ for some fixed ${\lambda > 0}$.
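In the situation of this exercise, ${S_n}$ has the binomial distribution with parameters ${n}$ and ${\lambda/n}$, and one can watch the binomial probabilities approach the Poisson ones numerically (an illustration, not a solution; the helper names are ad hoc):

```python
import math

# With each X_{j,n} Bernoulli of parameter lam/n, the sum S_n is
# Binomial(n, lam/n); its probabilities approach the Poisson(lam) ones.
def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return lam ** k / math.factorial(k) * math.exp(-lam)

lam = 2.0
for n in [10, 100, 10000]:
    err = max(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
              for k in range(10))
    print(n, err)  # the discrepancy shrinks roughly like 1/n
```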

The Poisson central limit theorem can be viewed as a degenerate limit of the central limit theorem, as seen by the next two exercises.

Exercise 11 Suppose we replace the hypothesis (iii) in Theorem 9 with the alternative hypothesis that the quantities ${\lambda_n := \sum_{j=1}^n {\bf P}(X_{j,n} = 1)}$ go to infinity as ${n \rightarrow \infty}$, while leaving hypotheses (i) and (ii) unchanged. Show that ${(S_n - \lambda_n) /\sqrt{\lambda_n}}$ converges in distribution to the normal distribution ${N(0,1)}$.

Exercise 12 For each ${\lambda>0}$, let ${P_\lambda}$ be a Poisson random variable with intensity ${\lambda}$. Show that as ${\lambda \rightarrow \infty}$, the random variables ${(P_\lambda - \lambda)/\sqrt{\lambda}}$ converge in distribution to the normal distribution ${N(0,1)}$. Discuss how this is consistent with Theorem 9 and the previous exercise.
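The normal approximation in this exercise can be probed numerically by comparing the Poisson cumulative distribution function at ${\lambda + x \sqrt{\lambda}}$ with the standard normal CDF ${\Phi(x)}$ (a sketch only; the helper names are ad hoc, and ${\lambda}$ is kept below ${\sim 700}$ so that ${e^{-\lambda}}$ does not underflow in double precision):

```python
import math

# Compare P(P_lam <= lam + x*sqrt(lam)) with the standard normal CDF Phi(x).
def poisson_cdf(lam, m):
    p, total = math.exp(-lam), 0.0
    for n in range(m + 1):
        total += p
        p *= lam / (n + 1)      # next Poisson probability
    return total

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x = 1.0
for lam in [10.0, 100.0, 500.0]:
    approx = poisson_cdf(lam, int(lam + x * math.sqrt(lam)))
    print(lam, approx, Phi(x))
```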

— 4. Stable laws —

Let ${X}$ be a real random variable. We say that ${X}$ has a stable law or a stable distribution if for any positive reals ${a, b}$, there exists a positive real ${c}$ and a real ${d}$ such that ${a X_1 + b X_2 \equiv c X + d}$ whenever ${X_1,X_2}$ are iid copies of ${X}$. In terms of the characteristic function ${\varphi_X}$ of ${X}$, we see that ${X}$ has a stable law if for any positive reals ${a,b}$, there exist a positive real ${c}$ and a real ${d}$ for which we have the functional equation

$\displaystyle \varphi_X(at) \varphi_X(bt) = \varphi_X(ct) e^{idt}$

for all real ${t}$.

For instance, a normally distributed variable ${X = N(\mu,\sigma^2)}$ is stable thanks to Lemma 12 of Notes 4; one can also see this from the characteristic function ${\varphi_X(t) = e^{i \mu t - \sigma^2 t^2 / 2}}$. A Cauchy distribution ${X}$, with probability density ${\frac{1}{\pi} \frac{\gamma}{(x-x_0)^2 + \gamma^2}}$, can also be seen to be stable, as is most easily seen from the characteristic function ${\varphi_X(t) = e^{ix_0 t - \gamma |t|}}$. As a more degenerate example, any deterministic random variable ${X = x_0}$ is stable. It is possible (though somewhat tedious) to completely classify all the stable distributions; see for instance the Wikipedia entry on these laws for the full classification.
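For the Cauchy distribution, the functional equation above can be verified exactly: a short computation with ${\varphi_X(t) = e^{ix_0 t - \gamma|t|}}$ shows that ${aX_1 + bX_2 \equiv (a+b) X}$, i.e. one can take ${c = a+b}$ and ${d = 0}$. A quick machine check of this identity (an illustration only; the helper name `phi` and the parameter values are ad hoc):

```python
import cmath

# Cauchy characteristic function phi(t) = exp(i*x0*t - gamma*|t|);
# check the stability equation phi(a t) phi(b t) = phi(c t) e^{i d t}
# with c = a + b and d = 0.
def phi(t, x0=1.0, gamma=2.0):
    return cmath.exp(1j * x0 * t - gamma * abs(t))

a, b = 0.7, 1.8
c, d = a + b, 0.0
for t in [-3.0, -0.5, 0.1, 2.0]:
    lhs = phi(a * t) * phi(b * t)
    rhs = phi(c * t) * cmath.exp(1j * d * t)
    assert abs(lhs - rhs) < 1e-12
print("Cauchy stability verified with c = a + b, d = 0")
```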

If ${X}$ is stable, and ${X_1, X_2, \dots}$ are iid copies of ${X}$, then by iterating the stable law hypothesis we see that the sums ${S_n := X_1 + \dots + X_n}$ are all equal in distribution to some affine rescaling ${a_n X + b_n}$ of ${X}$. For instance, we have ${X_1 + X_2 \equiv cX + d}$ for some ${c,d}$, and a routine induction then shows that

$\displaystyle X_1 + \dots + X_{2^k} \equiv c^k X + (2^{k-1} + 2^{k-2} c + \dots + c^{k-1}) d$

for all natural numbers ${k}$, since each doubling ${S_{2^k} \equiv c^k X + e_k \mapsto S_{2^{k+1}} \equiv c^{k+1} X + e_{k+1}}$ replaces the shift ${e_k}$ by ${2 e_k + c^k d}$. In particular, the random variables ${\frac{S_n - b_n}{a_n}}$ all have the same distribution as ${X}$.
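One can sanity-check the shift recursion ${e_{k+1} = 2 e_k + c^k d}$, ${e_1 = d}$, in the gaussian case ${X = N(\mu,\sigma^2)}$: there ${X_1 + X_2 \equiv \sqrt{2} X + (2-\sqrt{2})\mu}$, so ${c = \sqrt{2}}$ and ${d = (2-\sqrt{2})\mu}$, while directly ${S_{2^k} = N(2^k \mu, 2^k \sigma^2)}$, forcing the shift to equal ${\mu(2^k - 2^{k/2})}$. A short check (illustration only; variable names are ad hoc):

```python
import math

# Gaussian sanity check of S_{2^k} == c^k X + e_k with
# c = sqrt(2), d = (2 - sqrt(2)) mu, and e_{k+1} = 2 e_k + c^k d.
# Directly S_{2^k} = N(2^k mu, 2^k sigma^2), so e_k = mu (2^k - 2^{k/2}).
mu = 1.3
c = math.sqrt(2)
d = (2 - c) * mu
e = d  # e_1, corresponding to S_2
for k in range(1, 20):
    direct = mu * (2 ** k - 2 ** (k / 2))
    assert abs(e - direct) < 1e-6 * (1 + abs(direct))
    e = 2 * e + c ** k * d  # advance to e_{k+1}
print("gaussian shift matches for k = 1..19")
```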

More generally, given two real random variables ${X}$ and ${Y}$, we say that ${X}$ is in the basin of attraction for ${Y}$ if, whenever ${X_1,X_2,\dots}$ are iid copies of ${X}$ and ${S_n := X_1 + \dots + X_n}$, there exist constants ${a_n > 0}$ and ${b_n}$ such that ${\frac{S_n - b_n}{a_n}}$ converges in distribution to ${Y}$. Thus, any stable law is in its own basin of attraction, while the central limit theorem asserts that any random variable of finite variance is in the basin of attraction of a normal distribution. One can check that every random variable lies in the basin of attraction of a deterministic random variable such as ${0}$, simply by letting ${a_n}$ go to infinity rapidly enough. To avoid this degenerate case, we now restrict to laws that are non-degenerate, in the sense that they are not almost surely constant. Then we have the following useful technical lemma:

Proposition 13 (Convergence of types) Let ${X_n}$ be a sequence of real random variables converging in distribution to a non-degenerate limit ${X}$. Let ${a_n>0}$ and ${b_n}$ be real numbers such that ${Y_n = a_n X_n + b_n}$ converges in distribution to a non-degenerate limit ${Y}$. Then ${a_n}$ and ${b_n}$ converge to some finite limits ${a,b}$ respectively, and ${aX + b \equiv Y}$.

Proof: Suppose first that ${a_n}$ goes to zero. The sequence ${X_n}$ converges in distribution, hence is tight, hence ${Y_n - b_n = a_n X_n}$ converges in probability to zero. In particular, if ${Y'_n}$ is an independent copy of ${Y_n}$, then ${Y_n - Y'_n}$ converges in probability to zero; but ${Y_n - Y'_n}$ also converges in distribution to ${Y-Y'}$ where ${Y'}$ is an independent copy of ${Y}$, and ${Y-Y'}$ is not almost surely zero since ${Y}$ is non-degenerate. This is a contradiction. Similarly if ${a_n}$ has any subsequence that goes to zero. We conclude that ${a_n}$ is bounded away from zero. Rewriting ${Y_n = a_n X_n + b_n}$ as ${X_n = a_n^{-1} Y_n - a_n^{-1} b_n}$ and reversing the roles of ${X_n}$ and ${Y_n}$, we conclude also that ${a_n^{-1}}$ is bounded away from zero, thus ${a_n}$ is bounded.

Since ${X_n}$ is tight and ${a_n}$ is bounded, ${a_n X_n}$ is tight; since ${Y_n}$ is also tight, this implies that ${Y_n - a_n X_n = b_n}$ is tight, that is to say ${b_n}$ is bounded.

Let ${(a,b)}$ be a limit point of the ${(a_n,b_n)}$. By Slutsky’s theorem, a subsequence of the ${Y_n = a_n X_n + b_n}$ then converges in distribution to ${aX + b}$, thus ${Y \equiv aX+b}$. If the limit point ${(a,b)}$ is unique then we are done, so suppose there are two limit points ${(a,b)}$, ${(a',b')}$. Thus ${aX+b \equiv a'X+b'}$, which on rearranging gives ${X \equiv cX + d}$ for some ${c>0}$ and real ${d}$ with ${(c,d) \neq (1,0)}$.

If ${c=1}$ then on iteration we have ${X \equiv X + dn}$ for any natural number ${n}$, which clearly leads to a contradiction as ${n \rightarrow \infty}$ since ${d \neq 0}$. If ${c < 1}$ then iteration gives ${X \equiv c^n X + d \frac{c^n-1}{c-1}}$ for any natural number ${n}$, which on passing to the limit in distribution as ${n \rightarrow \infty}$ shows that ${X}$ is almost surely equal to the constant ${\frac{d}{1-c}}$, contradicting non-degeneracy. If ${c>1}$ then we rewrite ${X \equiv cX + d}$ as ${X \equiv c^{-1} X - c^{-1} d}$ and again obtain a contradiction, and the claim follows. $\Box$

One can use this proposition to verify that basins of attraction of genuinely distinct laws are disjoint:

Exercise 14 Let ${Y}$ and ${Y'}$ be non-degenerate real random variables. Suppose that a random variable ${X}$ lies in the basin of attraction of both ${Y}$ and ${Y'}$. Then there exist ${a>0}$ and real ${b}$ such that ${Y' \equiv aY+b}$.

If ${X}$ lies in the basin of attraction for a non-degenerate law ${Y}$, then ${\frac{S_n - b_n}{a_n}}$ converges in distribution to ${Y}$; since ${S_{2n}}$ is equal in distribution to the sum of two iid copies of ${S_n}$, we see that ${\frac{S_{2n}-2b_n}{a_n}}$ converges in distribution to the sum ${Y_1+Y_2}$ of two iid copies of ${Y}$. On the other hand, ${\frac{S_{2n}-b_{2n}}{a_{2n}}}$ converges in distribution to ${Y}$. Using Proposition 13 we conclude that ${Y_1+Y_2 \equiv cY + d}$ for some ${c>0}$ and real ${d}$. One can go further and conclude that ${Y}$ in fact has a stable law; see the following exercise. Thus stable laws are the only laws that have a non-empty basin of attraction.

Exercise 15 Let ${X}$ lie in the basin of attraction for a non-degenerate law ${Y}$.

• (i) Show that for any iid copies ${Y_1,\dots,Y_k}$ of ${Y}$, there exist unique ${c_k>0}$ and ${d_k \in {\bf R}}$ such that ${Y_1 + \dots + Y_k \equiv c_k Y + d_k}$. Also show that ${c_k Y_1 + c_l Y_2 \equiv c_{k+l} Y + d_{k+l} - d_k - d_l}$ for all natural numbers ${k,l}$.
• (ii) Show that the ${c_k}$ are strictly increasing, with ${c_{kl} = c_k c_l}$ for all natural numbers ${k,l}$. (Hint: study the absolute value ${|\varphi_Y(t)|}$ of the characteristic function, using the non-degeneracy of ${Y}$ to ensure that this absolute value is usually strictly less than one.) Also show that ${l d_k + c_k d_l = d_{kl}}$ for all natural numbers ${k,l}$.
• (iii) Show that there exists ${\alpha > 0}$ such that ${c_k = k^\alpha}$ for all ${k}$. (Hint: first show that ${\frac{\log c_k}{\log k}}$ is a Cauchy sequence in ${k}$.)
• (iv) If ${\alpha = 1}$, and ${Y_1,Y_2}$ are iid copies of ${Y}$, show that ${\frac{k}{k+l} Y_1 + \frac{l}{k+l} Y_2 \equiv Y + \theta_{k,l}}$ for all natural numbers ${k,l}$ and some bounded real ${\theta_{k,l}}$. Then show that ${Y}$ has a stable law in this case.
• (v) If ${\alpha \neq 1}$, show that ${d_k = \mu ( k - k^\alpha )}$ for some real ${\mu}$ and all ${k}$. Then show that ${Y}$ has a stable law in this case.

Exercise 16 (Classification of stable laws) Let ${Y}$ be a non-degenerate stable law; then ${Y}$ lies in its own basin of attraction, and one can define ${c_k, d_k, \alpha}$ as in the preceding exercise.

• (i) If ${\alpha \neq 1}$, and ${\mu}$ is as in part (v) of the preceding exercise, show that ${\varphi_Y(t)^k = \varphi_Y(k^\alpha t) e^{i \mu (k - k^\alpha) t}}$ for all ${t}$ and ${k}$. Then show that

$\displaystyle \varphi_Y(t) = \exp( i t \mu - |ct|^\alpha (1 - i \beta \hbox{sgn}(t) ) )$

for some real ${c, \beta}$. (One can use the identity ${\varphi_Y(-t) = \overline{\varphi_Y(t)}}$ to restrict attention to the case of positive ${t}$.)

• (ii) Now suppose ${\alpha = 1}$. Show that ${d_{k+l} = d_k + d_l + O(k+l)}$ for all ${k,l}$ (where the implied constant in the ${O()}$ notation is allowed to depend on ${Y}$). Conclude that ${d_k = O(k \log k)}$ for all ${k}$.
• (iii) We continue to assume ${\alpha = 1}$. Show that ${d_k = -\beta k \log k}$ for some real number ${\beta}$. (Hint: first show this when ${k}$ is a power of a fixed natural number ${k_0 > 1}$ (with ${\beta}$ possibly depending on ${k_0}$). Then use the estimates from part (ii) to show that ${\beta}$ does not actually depend on ${k_0}$. One may need to invoke the Dirichlet approximation theorem to show that for any given ${k_0,k_1 > 1}$, one can find a power of ${k_0}$ that is somewhat close to a power of ${k_1}$.)
• (iv) We continue to assume ${\alpha=1}$. Show that ${\varphi_Y(t)^k = \varphi_Y(kt) e^{-i \beta k (\log k) t}}$ for all ${t}$ and ${k}$. Then show that

$\displaystyle \varphi_Y(t) = \exp(it \mu - |ct| (1 - i \beta \hbox{sgn}(t) \log |t| ) )$

for all ${t}$ and some real ${c}$.

It is also possible to determine which choices of parameters ${\mu,c,\alpha,\beta}$ are actually achievable by some random variable ${Y}$, but we will not do so here.

It is possible to associate a central limit theorem to each stable law, which precisely determines its basin of attraction. We will not do this in full generality, but will just illustrate the situation for the Cauchy distribution.

Exercise 17 Let ${X}$ be a real random variable which is symmetric (that is, ${X}$ has the same distribution as ${-X}$) and obeys the distribution identity

$\displaystyle {\bf P}( |X| \geq x ) = L(x) \frac{2}{\pi x}$

for all ${x>0}$, where ${L: (0,+\infty) \rightarrow (0,+\infty)}$ is a function converging to ${1}$ at infinity, ${L(x) \rightarrow 1}$ as ${x \rightarrow \infty}$ (so in particular ${L}$ is slowly varying in the sense that ${L(cx)/L(x) \rightarrow 1}$ as ${x \rightarrow \infty}$ for all ${c>0}$).

• (i) Show that

$\displaystyle {\bf E}( e^{itX} ) = 1 - |t| + o(|t|)$

as ${t \rightarrow 0}$, where ${o(|t|)}$ denotes a quantity such that ${o(|t|)/|t| \rightarrow 0}$ as ${t \rightarrow 0}$. (You may need to establish the identity ${\int_0^\infty \frac{1 - \cos(x)}{x^2}\ dx = \frac{\pi}{2}}$, which can be done by contour integration.)

• (ii) Let ${X_1,X_2,\dots}$ be iid copies of ${X}$. Show that ${\frac{X_1+\dots+X_n}{n}}$ converges in distribution to a copy of the standard Cauchy distribution (i.e., to a random variable with probability density function ${x \mapsto \frac{1}{\pi} \frac{1}{1+x^2}}$).
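The conclusion of part (ii) can be simulated in the simplest instance ${L \equiv 1}$, i.e. when ${X}$ is exactly standard Cauchy: then the average ${\frac{X_1+\dots+X_n}{n}}$ is again standard Cauchy, and the empirical CDF of the averages should match ${\frac{1}{2} + \frac{\arctan x}{\pi}}$. A quick simulation (an illustration only, not a solution to the exercise; the sampler and parameter choices are ad hoc):

```python
import math, random

# Averages of iid standard Cauchy samples are again standard Cauchy;
# compare the empirical CDF of the averages with 1/2 + arctan(x)/pi.
random.seed(0)

def cauchy():
    # inverse-CDF sampling of the standard Cauchy distribution
    return math.tan(math.pi * (random.random() - 0.5))

n, trials = 20, 10000
means = [sum(cauchy() for _ in range(n)) / n for _ in range(trials)]
for x in [-1.0, 0.0, 1.0]:
    empirical = sum(m <= x for m in means) / trials
    exact = 0.5 + math.atan(x) / math.pi
    print(x, empirical, exact)
```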