One of the major activities in probability theory is studying the various statistics that can be produced from a complex system with many components. One of the simplest possible systems one can consider is a finite sequence {X_1,\dots,X_n} or an infinite sequence {X_1,X_2,\dots} of jointly independent scalar random variables, with the case when the {X_i} are also identically distributed (i.e. the {X_i} are iid) being a model case of particular interest. (In some cases one may consider a triangular array {(X_{n,i})_{1 \leq i \leq n}} of scalar random variables, rather than a finite or infinite sequence.) There are many statistics of such sequences that one can study, but one of the most basic such statistics are the partial sums

\displaystyle S_n := X_1 + \dots + X_n.

The first fundamental result about these sums is the law of large numbers (or LLN for short), which comes in two formulations, weak (WLLN) and strong (SLLN). To state these laws, we first must define the notion of convergence in probability.

Definition 1 Let {X_n} be a sequence of random variables taking values in a separable metric space {R = (R,d)} (e.g. the {X_n} could be scalar random variables, taking values in {{\bf R}} or {{\bf C}}), and let {X} be another random variable taking values in {R}. We say that {X_n} converges in probability to {X} if, for every radius {\varepsilon > 0}, one has {{\bf P}( d(X_n,X) > \varepsilon ) \rightarrow 0} as {n \rightarrow \infty}. Thus, if {X_n, X} are scalar, we have {X_n} converging to {X} in probability if {{\bf P}( |X_n-X| > \varepsilon ) \rightarrow 0} as {n \rightarrow \infty} for any given {\varepsilon > 0}.

The measure-theoretic analogue of convergence in probability is convergence in measure.

It is instructive to compare the notion of convergence in probability with almost sure convergence. it is easy to see that {X_n} converges almost surely to {X} if and only if, for every radius {\varepsilon > 0}, one has {{\bf P}( \bigvee_{n \geq N} (d(X_n,X)>\varepsilon) ) \rightarrow 0} as {N \rightarrow \infty}; thus, roughly speaking, convergence in probability is good for controlling how a single random variable {X_n} is close to its putative limiting value {X}, while almost sure convergence is good for controlling how the entire tail {(X_n)_{n \geq N}} of a sequence of random variables is close to its putative limit {X}.

We have the following easy relationships between convergence in probability and almost sure convergence:

Exercise 2 Let {X_n} be a sequence of scalar random variables, and let {X} be another scalar random variable.

  • (i) If {X_n \rightarrow X} almost surely, show that {X_n \rightarrow X} in probability. Give a counterexample to show that the converse does not necessarily hold.
  • (ii) Suppose that {\sum_n {\bf P}( |X_n-X| > \varepsilon ) < \infty} for all {\varepsilon > 0}. Show that {X_n \rightarrow X} almost surely. Give a counterexample to show that the converse does not necessarily hold.
  • (iii) If {X_n \rightarrow X} in probability, show that there is a subsequence {X_{n_j}} of the {X_n} such that {X_{n_j} \rightarrow X} almost surely.
  • (iv) If {X_n,X} are absolutely integrable and {{\bf E} |X_n-X| \rightarrow 0} as {n \rightarrow \infty}, show that {X_n \rightarrow X} in probability. Give a counterexample to show that the converse does not necessarily hold.
  • (v) (Urysohn subsequence principle) Suppose that every subsequence {X_{n_j}} of {X_n} has a further subsequence {X_{n_{j_k}}} that converges to {X} in probability. Show that {X_n} also converges to {X} in probability.
  • (vi) Does the Urysohn subsequence principle still hold if “in probability” is replaced with “almost surely” throughout?
  • (vii) If {X_n} converges in probability to {X}, and {F: {\bf R} \rightarrow {\bf R}} or {F: {\bf C} \rightarrow {\bf C}} is continuous, show that {F(X_n)} converges in probability to {F(X)}. More generally, if for each {i=1,\dots,k}, {X^{(i)}_n} is a sequence of scalar random variables that converge in probability to {X^{(i)}}, and {F: {\bf R}^k \rightarrow {\bf R}} or {F: {\bf C}^k \rightarrow {\bf C}} is continuous, show that {F(X^{(1)}_n,\dots,X^{(k)}_n)} converges in probability to {F(X^{(1)},\dots,X^{(k)})}. (Thus, for instance, if {X_n} and {Y_n} converge in probability to {X} and {Y} respectively, then {X_n + Y_n} and {X_n Y_n} converge in probability to {X+Y} and {XY} respectively.
  • (viii) (Fatou’s lemma for convergence in probability) If {X_n} are non-negative and converge in probability to {X}, show that {{\bf E} X \leq \liminf_{n \rightarrow \infty} {\bf E} X_n}.
  • (ix) (Dominated convergence in probability) If {X_n} converge in probability to {X}, and one almost surely has {|X_n| \leq Y} for all {n} and some absolutely integrable {Y}, show that {{\bf E} X_n} converges to {{\bf E} X}.

Exercise 3 Let {X_1,X_2,\dots} be a sequence of scalar random variables converging in probability to another random variable {X}.

  • (i) Suppose that there is a random variable {Y} which is independent of {X_i} for each individual {i}. Show that {Y} is also independent of {X}.
  • (ii) Suppose that the {X_1,X_2,\dots} are jointly independent. Show that {X} is almost surely constant (i.e. there is a deterministic scalar {c} such that {X=c} almost surely).

We can now state the weak and strong law of large numbers, in the model case of iid random variables.

Theorem 4 (Law of large numbers, model case) Let {X_1, X_2, \dots} be an iid sequence of copies of an absolutely integrable random variable {X} (thus the {X_i} are independent and all have the same distribution as {X}). Write {\mu := {\bf E} X}, and for each natural number {n}, let {S_n} denote the random variable {S_n := X_1 + \dots + X_n}.

  • (i) (Weak law of large numbers) The random variables {S_n/n} converge in probability to {\mu}.
  • (ii) (Strong law of large numbers) The random variables {S_n/n} converge almost surely to {\mu}.

Informally: if {X_1,\dots,X_n} are iid with mean {\mu}, then {X_1 + \dots + X_n \approx \mu n} for {n} large. Clearly the strong law of large numbers implies the weak law, but the weak law is easier to prove (and has somewhat better quantitative estimates). There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, or when the random variable {X} is not absolutely integrable, or if one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants below the fold.

It is instructive to compare the law of large numbers with what one can obtain from the Kolmogorov zero-one law, discussed in Notes 2. Observe that if the {X_n} are real-valued, then the limit superior {\limsup_{n \rightarrow \infty} S_n/n} and {\liminf_{n \rightarrow \infty} S_n/n} are tail random variables in the sense that they are not affected if one changes finitely many of the {X_n}; in particular, events such as {\limsup_{n \rightarrow \infty} S_n/n > t} are tail events for any {t \in {\bf R}}. From this and the zero-one law we see that there must exist deterministic quantities {-\infty \leq \mu_- \leq \mu_+ \leq +\infty} such that {\limsup_{n \rightarrow \infty} S_n/n = \mu_+} and {\liminf_{n \rightarrow \infty} S_n/n = \mu_-} almost surely. The strong law of large numbers can then be viewed as the assertion that {\mu_- = \mu_+ = \mu} when {X} is absolutely integrable. On the other hand, the zero-one law argument does not require absolute integrability (and one can replace the denominator {n} by other functions of {n} that go to infinity as {n \rightarrow \infty}).

The law of large numbers asserts, roughly speaking, that the theoretical expectation {\mu} of a random variable {X} can be approximated by taking a large number of independent samples {X_1,\dots,X_n} of {X} and then forming the empirical mean {S_n/n = \frac{X_1+\dots+X_n}{n}}. This ability to approximate the theoretical statistics of a probability distribution through empirical data is one of the basic starting points for mathematical statistics, though this is not the focus of the course here. The tendency of statistics such as {S_n/n} to cluster closely around their mean value {\mu} is the simplest instance of the concentration of measure phenomenon, which is of tremendous significance not only within probability, but also in applications of probability to disciplines such as statistics, theoretical computer science, combinatorics, random matrix theory and high dimensional geometry. We will not discuss these topics much in this course, but see this previous blog post for some further discussion.

There are several ways to prove the law of large numbers (in both forms). One basic strategy is to use the moment method – controlling statistics such as {S_n/n} by computing moments such as the mean {{\bf E} S_n/n}, variance {{\bf E} |S_n/n - {\bf E} S_n/n|^2}, or higher moments such as {{\bf E} |S_n/n - {\bf E} S_n/n|^k} for {k = 4, 6, \dots}. The joint independence of the {X_i} make such moments fairly easy to compute, requiring only some elementary combinatorics. A direct application of the moment method typically requires one to make a finite moment assumption such as {{\bf E} |X|^k < \infty}, but as we shall see, one can reduce fairly easily to this case by a truncation argument.

For the strong law of large numbers, one can also use methods relating to the theory of martingales, such as stopping time arguments and maximal inequalities; we present some classical arguments of Kolmogorov in this regard.

— 1. The moment method —

We begin by using the moment method to establish both the strong and weak law of large numbers for sums of iid random variables, under additional moment hypotheses.

We first make a very simple observation: in order to prove the weak or strong law of large numbers for complex variables, it suffices to do so for real variables, as the complex case follows from the real case after taking real and imaginary parts. Thus we shall restrict attention henceforth to real random variables, in order to avoid some unnecessarily complications involving complex conjugation.

Let {X_1,X_2,X_3,\dots} be a sequence of iid copies of a scalar random variable {X}, and define the partial sums {S_n := X_1 + \dots + X_n}. Suppose that {X} is absolutely integrable, with expectation (or mean) {{\bf E} X = \mu}. Then we can use linearity of expectation to compute the expectation (or first moment) of {S_n}:

\displaystyle {\bf E} S_n = {\bf E} X_1 + \dots + {\bf E} X_n

\displaystyle = n \mu.

In particular, the expectation of {S_n/n} is {\mu}. This looks consistent with the strong and weak law of large numbers, but does not immediately imply these laws. However, thanks to Markov’s inequality, we do at least get the following very weak bound

\displaystyle {\bf P}( S_n \geq \lambda \mu n ) \leq \frac{1}{\lambda} \ \ \ \ \ (1)


for any {\lambda > 0}, in the case that {X} is unsigned and absolutely integrable. Thus, in the unsigned case at least, we see that {S_n/n} usually doesn’t get much larger than the mean {\mu}. We will refer to (1) as a first moment bound on {S_n}, as it was obtained primarily through a computation of the first moment {{\bf E} S_n} of {S_n}.

Now we turn to second moment bounds on {S_n}, obtained through computations of second moments such as {{\bf E} |S_n|^2} or {{\bf Var}(S_n) = {\bf E} |S_n - {\bf E} S_n|^2}. It will be convenient to normalise the mean {\mu} to equal zero, by replacing each {X_i} with {X_i - \mu} (and {X} by {X - \mu}), so that {S_n} gets replaced by {S_n - \mu n} (and {S_n/n} by {S_n/n-\mu}). With this normalisation, we see that to prove the strong or weak law of large numbers, it suffices to do so in the mean zero case {\mu=0}. (On the other hand, if {X} is unsigned, then normalising {X} in this fashion will almost certainly destroy the unsigned property, so it is not always desirable to perform this normalisation.)

Suppose that {X} has finite second moment (i.e. {{\bf E} |X|^2 < \infty}, that is to say {X} is square-integrable) and has been normalised to have mean zero. We write the variance {{\bf Var}(X) = {\bf E} |X|^2} as {\sigma^2}. The first moment calculation then shows that {S_n} has mean zero. Now we compute the variance of {S_n}, which in the mean zero case is simply {{\bf E} |S_n|^2}; note from the triangle inequality that this quantity is finite. By linearity of expectation, we have

\displaystyle {\bf Var}(S_n) = {\bf E} |X_1 + \dots + X_n|^2

\displaystyle = \sum_{1 \leq i,j \leq n} {\bf E} X_i X_j.

(All expressions here are absolutely integrable thanks to the Cauchy-Schwarz inequality.) If {i=j}, then the term {{\bf E} X_i X_j} is equal to {{\bf E} |X|^2 = \sigma^2}. If {i \neq j}, then by hypothesis {X_i} and {X_j} are independent and mean zero, and thus

\displaystyle {\bf E} X_i X_j = ({\bf E} X_i) ({\bf E} X_j) = 0.

Putting all this together, we obtain

\displaystyle {\bf Var}(S_n) = n \sigma^2

or equivalently

\displaystyle {\bf Var}(S_n/n) = \frac{1}{n} \sigma^2.

This bound was established in the mean zero case, but it is clear that it also holds in general, since subtracting a constant from a random variable does not affect its variance. Thus we see that while {S_n/n} has the same mean {\mu} as {X}, it has a much smaller variance: {\sigma^2/n} in place of {\sigma^2}. This is the first demonstration of the concentration of measure effect that comes from combining many independent random variables {X_1,\dots,X_n}. (At the opposite extreme to the independent case, suppose we took {X_1,\dots,X_n} to all be exactly the same random variable: {X_1 = \dots = X_n = X}. Then {S_n/n = X} has exactly the same mean and variance as {X}. Decorrelating the {X_1,\dots,X_n} does not affect the mean of {S_n}, but produces significant cancellations that reduce the variance.)

If we insert this variance bound into Chebyshev’s inequality, we obtain the bound

\displaystyle {\bf P}( |\frac{S_n}{n}-\mu| \geq \varepsilon ) \leq \frac{1}{n} \frac{\sigma^2}{\varepsilon^2} \ \ \ \ \ (2)


for any natural number {n} and {\varepsilon > 0}, whenever {X} has mean {\mu} and a finite variance {\sigma^2 < \infty}. The right-hand side goes to zero as {n \rightarrow \infty} for fixed {\varepsilon}, so we have in fact established the weak law of large numbers in the case that {X} has finite variance.

Note that (2) implies that

\displaystyle \frac{\sqrt{n}}{\omega(n)} |\frac{S_n}{n}-\mu|

converges to zero in probability whenever {\omega(n)} is a function of {n} that goes to infinity as {n \rightarrow \infty}. Thus for instance

\displaystyle \frac{\sqrt{n}}{\log n} |\frac{S_n}{n}-\mu| \rightarrow 0

in probability. Informally, this means that {S_n/n} tends to stray by not much more than {\sqrt{n}} from {\mu} typically. This intuition will be reinforced in the next set of notes when we study the central limit theorem and related results such as the Chernoff inequality. (It is also supported by the law of the iterated logarithm, which we will probably not be able to get to in this set of notes.)

One can hope to use (2) and the Borel-Cantelli lemma (Exercise 2(ii)) to also obtain the strong law of large numbers in the second moment case, but unfortunately the quantities {\frac{1}{n} \frac{\sigma^2}{\varepsilon^2}} are not summable in {n}. To resolve this issue, we will go to higher moments than the second moment. One could calculate third moments such as {{\bf E} S_n^3}, but this turns out to not convey too much information (unless {X_n} is unsigned) because of the signed nature of {S_n^3}; the expression {{\bf E} |S_n|^3} would in principle convey more usable information, but is difficult to compute as {|S_n|^3} is not a polynomial combination of the {X_i}. Instead, we move on to the fourth moment. Again, we normalise {X} to have mean {\mu=0}, and now assume a finite fourth moment {{\bf E} |X|^4 < \infty} (which, by the Hölder or Jensen inequalities, implies that all lower moments such as {{\bf E} |X|^2} are finite). We again use {\sigma^2 = {\bf E} |X|^2} to denote the variance of {X}. We can expand

\displaystyle {\bf E} |S_n|^4 = {\bf E} |X_1 + \dots + X_n|^4

\displaystyle = \sum_{1 \leq i,j,k,l \leq n} {\bf E} X_i X_j X_k X_l.

Note that all expectations here are absolutely integrable by Hölder’s inequality and the hypothesis of finite fourth moment. The correlation {{\bf E} X_i X_j X_k X_l} looks complicated, but fortunately it simplifies greatly in most cases. Suppose for instance that {i} is distinct from {j,k,l}, then {X_i} is independent of {(X_j,X_k,X_l)} (even if some of the {j,k,l} are equal to each other) and so

\displaystyle {\bf E} X_i X_j X_k X_l = ({\bf E} X_i) ({\bf E} X_j X_k X_l) = 0

since {{\bf E} X_i = {\bf E} X = 0}. Similarly for permutations. This leaves only a few quadruples {(i,j,k,l)} for which {{\bf E} X_i X_j X_k X_l} could be non-zero: the three cases {i=j \neq k=l}, {i = k \neq j = l}, {i=l \neq j=k} where each of the indices {i,j,k,l} is paired up with exactly one other index; and the diagonal case {i=j=k=l}. If for instance {i=j \neq k=l}, then

\displaystyle {\bf E} X_i X_j X_k X_l = {\bf E} X_i^2 X_k^2

\displaystyle = ({\bf E} X_i^2) ({\bf E} X_k^2)

\displaystyle = \sigma^2 \times \sigma^2.

Similarly for the cases {i=k \neq j=l} and {i=l \neq j=k}, which gives a total contribution of {3 n(n-1) \sigma^4} to {{\bf E} |S_n|^4}. Finally, when {i=j=k=l}, then {{\bf E} X_i X_j X_k X_l = {\bf E} X^4}, and there are {n} contributions of this form to {{\bf E} |S_n|^4}. We conclude that

\displaystyle {\bf E} |S_n|^4 = 3n(n-1) \sigma^4 + n {\bf E} |X|^4

and hence by Markov’s inequality

\displaystyle {\bf P} ( |\frac{S_n}{n}| > \varepsilon ) \leq \frac{3n(n-1) \sigma^4 + n {\bf E} |X|^4}{\varepsilon^4 n^4}

for any {\varepsilon>0}. If we remove the normalisation {\mu=0}, we conclude that

\displaystyle {\bf P} ( |\frac{S_n}{n}-\mu| > \varepsilon ) \leq \frac{3n(n-1) \sigma^4 + n {\bf E} |X-\mu|^4}{\varepsilon^4 n^4}

The right-hand side decays like {O(1/n^2)}, which is now summable in {n} (in contrast to (2)). Thus we may now apply the Borel-Cantelli lemma and conclude the strong law of large numbers in the case when one has bounded fourth moment {{\bf E} |X|^4 < \infty}.

One can of course continue to compute higher and higher moments of {S_n} (assuming suitable finite moment hypotheses on {X}), though as one can already see from the fourth moment calculation, the computations become increasingly combinatorial in nature. We will pursue this analysis more in the next set of notes, when we discuss the central limit theorem. For now, we turn to some applications and variants of the moment method (many of which are taken from Durrett’s book).

We begin with two quick applications of the weak law of large numbers to topics outside of probability. We first give an explicit version of the Weierstrass approximation theorem, which asserts that continuous functions on (say) the unit interval {[0,1]} can be approximated by polynomials.

Proposition 5 (Approximation by Bernstein polynomials) Let {f: [0,1] \rightarrow {\bf R}} be a continuous function. Then the Bernstein polynomials

\displaystyle f_n(x) := \sum_{i=0}^n \binom{n}{i} x^i (1-x)^{n-i} f(\frac{i}{n})

converges uniformly to {f} as {n \rightarrow \infty}.

Proof: We first establish the pointwise bound {f_n(p) \rightarrow f(p)} for each {0 \leq p \leq 1}. Fix {0 \leq p \leq 1}, and let {X_1,X_2,X_3,\dots} be iid Bernoulli variables (i.e. variables taking values in {\{0,1\}}) with each {X_i} equal to {1} with probability {p}. The mean of the {X_i} is clearly {p}, and the variance is bounded crudely by {1} (in fact it is {p(1-p)}), so if we set {S_n := X_1 + \dots + X_n} then by the weak law of large numbers for random variables of finite second moment, we see that {S_n/n} converges in probability to {p} (note that {S_n/n} is clearly dominated by {1}). By the dominated convergence theorem in probability, we conclude that {{\bf E} f(S_n/n)} converges to {f(p)}. But from direct computation we see that {S_n} takes values in {\{0, 1/n, \dots, n/n\}}, with each {i/n} being attained with probability {\binom{n}{i} p^i (1-p)^{n-i}} (i.e. {S_n} as a binomial distribution), and so from the definition of the Bernstein polynomials {f_n} we see that {{\bf E} f(S_n/n) = f_n(p)}. This concludes the pointwise convergence claim.

To establish the uniform convergence, we use the proof of the weak law of large numbers, rather than the statement of that law, to get the desired uniformity in the parameter {p}. For a given {p}, we see from (2) that

\displaystyle {\bf P}( |\frac{S_n}{n}-p| \geq \delta ) \leq \frac{1}{n} \frac{p(1-p)}{\delta^2}

for any {\delta > 0}. On the other hand, as {f} is continuous on {[0,1]}, it is uniformly continuous, and so for any {\varepsilon>0} there exists a {\delta>0} such that {|f(x)-f(y)| < \varepsilon} whenever {x,y \in [0,1]} with {|x-y| < \delta}. For such an {\varepsilon} and {\delta}, we conclude that

\displaystyle {\bf P}( |f(\frac{S_n}{n})-f(p)| \geq \varepsilon ) \leq \frac{1}{n} \frac{p(1-p)}{\delta^2}.

On the other hand, being continuous on {[0,1]}, {f} must be bounded in magnitude by some bound {M}, so that {|f(\frac{S_n}{n})-f(p)| \leq 2M}. This leads to the upper bound

\displaystyle {\bf E} |f(\frac{S_n}{n})-f(p)| \leq \varepsilon + \frac{2M}{n} \frac{p(1-p)}{\delta^2}

and thus by the triangle inequality and the identity {{\bf E} f(S_n/n) = f_n(p)}

\displaystyle |f_n(p) - f(p)| \leq \varepsilon + \frac{2M}{n} \frac{p(1-p)}{\delta^2}.

Since {p(1-p)} is bounded by (say) {1}, and {\varepsilon} can be made arbitrarily small, we conclude that {f_n} converges uniformly to {f} as required. \Box

The other application of the weak law of large numbers is to the geometry of high-dimensional cubes, giving the rather unintuitive conclusion that most of the volume of the high-dimensional cube is contained in a thin annulus.

Proposition 6 Let {\varepsilon > 0}. Then, for sufficiently large {n}, a proportion of at least {1-\varepsilon} of the cube {[-1,1]^n} (by {n}-dimensional Lebesgue measure) is contained in the annulus {\{ x \in {\bf R}^n: (1-\varepsilon) \sqrt{n/3} \leq |x| \leq (1+\varepsilon) \sqrt{n/3} \}}.

This proposition already indicates that high-dimensional geometry can behave in a manner quite differently from what one might naively expect from low-dimensional geometric intuition; one needs to develop a rather distinct high-dimensional geometric intuition before one can accurately make predictions in large dimensions.

Proof: Let {X_1,X_2,\dots} be iid random variables drawn uniformly from {[-1,1]}. Then the random vector {(X_1,\dots,X_n)} is uniformly distributed on the cube {[-1,1]^n}. The variables {X_1^2,X_2^2,\dots} are also iid, and (by the change of variables formula) have mean

\displaystyle \int_0^1 x^2\ dx = \frac{1}{3}.

Hence, by the weak law of large numbers, the quantity {\frac{X_1^2+\dots+X_n^2}{n}} converges in probability to {\frac{1}{3}}, so in particular the probability

\displaystyle {\bf P}( \sqrt{X_1^2+\dots+X_n^2} - \sqrt{n/3}| > \varepsilon \sqrt{n})

goes to zero as {n} goes to infinity. But this quantity is precisely the proportion of {[-1,1]^n} that lies outside the annulus {\{ x \in {\bf R}^n: (1-\varepsilon) \sqrt{n/3} \leq |x| \leq (1+\varepsilon) \sqrt{n/3} \}}, and the claim follows. \Box

The first and second moment method are very general, and apply to sums {S_n = X_1 + \dots + X_n} of random variables {X_1,\dots,X_n} that do not need to be identically distributed, or even independent (although the bounds can get weaker and more complicated if one strays too far from these hypotheses). For instance, it is clear from linearity of expectation that {S_n} has mean

\displaystyle {\bf E} S_n = {\bf E} X_1 + \dots + {\bf E} X_n \ \ \ \ \ (3)


(assuming of course that {X_1,\dots,X_n} are absolutely integrable) and variance

\displaystyle {\bf Var}(S_n) = {\bf Var}(X_1) + \dots + {\bf Var}(X_n) + \sum_{1 \leq i,j \leq n: i \neq j} \hbox{Cov}(X_i,X_j)

(assuming now that {X_1,\dots,X_n} are square-integrable). (For the latter claim, it is convenient, as before, to first normalise each of the {X_i} to have mean zero.) If the {X_1,\dots,X_n} are pairwise independent in addition to being square-integrable, then all the covariances vanish, and we obtain additivity of the variance:

\displaystyle {\bf Var}(S_n) = {\bf Var}(X_1) + \dots + {\bf Var}(X_n). \ \ \ \ \ (4)


Remark 7 Viewing the variance as the square of the standard deviation, the identity (4) can be interpreted as a rigorous instantiation of the following informal principle of square root cancellation: if one has a sum {S_n = X_1 + \dots + X_n} of random (or pseudorandom) variables that “oscillate” in the sense that their mean is either zero or close to zero, and each {X_i} has an expected magnitude of about {a_i} (in the sense that a statistic such as the standard deviation of {X_i} is comparable to {a_i}), and the {X_i} “behave independently”, then the sum {S_n} is expected to have a magnitude of about {(\sum_{i=1}^n |a_i|^2)^{1/2}}. Thus for instance a sum of {n} unbiased signs {X_i \in \{-1,+1\}} would be expected to have magnitude about {\sqrt{n}} if the {X_i} do not exhibit strong correlations with each other. This principle turns out to be remarkably broadly applicable (at least as a heuristic, if not as a rigorous argument), even in situations for which no randomness is evident (e.g. in considering the type of exponential sums that occur in analytic number theory). We will see some further instantiations of this principle in later notes.

These identities, together with Chebyshev’s inequality, already gives some useful control on many statistics, including some which are not obviously of the form of a sum of independent random variables. A classic example of this is the coupon collector problem, which we formulate as follows. Let {N} be a natural number, and let {Y_1, Y_2, \dots} be an infinite sequence of “coupons”, which are iid and uniformly distributed from the finite set {\{1,\dots,N\}}. Let {T_N} denote the first time at which one has collected all {N} different types of coupons, thus {T_N} is the first natural number for which the set {\{Y_1,\dots,Y_{T_N}\}} attains the maximal cardinality of {N} (or {T_N=\infty} if no such natural number exists, though it is easy to see that this is a null event, indeed note from the strong law of large numbers that almost surely one will collect infinitely many of each coupon over time). The question is then to describe the behaviour of {T_N} as {N} gets large.

At first glance, {T_N} does not seem to be easily describable as the sum of many independent random variables. However, if one looks at it the right way, one can see such a structure emerge (and much of the art of probability is in finding useful and different ways of thinking of the same random variable). Namely, for each {i=0,\dots,N}, let {T_{i,N}} denote the first time one has collected {i} coupons out of {N}, thus {T_{i,N}} is the first non-negative integer such that {\{Y_1,\dots,Y_{T_{i,N}}\}} has cardinality {i}, with

\displaystyle 0 = T_{0,N} < T_{1,N} < \dots < T_{N,N} = T_N.

If we then write {X_i := T_{i,N} - T_{i-1,N}} for {i=1,\dots,n}, then the {X_i} take values in the natural numbers, we have the telescoping sum

\displaystyle T_N = X_1 + X_2 + \dots + X_N.

Remarkably, the random variables {X_1,\dots,X_N} have a simple structure:

Proposition 8 The random variables {X_1,\dots,X_N} are jointly independent, and each {X_i} has a geometric distribution with parameter {\frac{N-i+1}{N}}, in the sense that

\displaystyle {\bf P}(X_i = j) = (1 - \frac{N-i+1}{N})^{j-1} \frac{N-i+1}{N}

for {j=1,2,\dots} and {i=1,\dots,N}.

The joint independence of the {X_1,\dots,X_N} reflects the “Markovian” or “memoryless” nature of a certain process relating to the coupon collector problem, and can be easily established once one has understood the concept of conditional expectation, but the further exploration of these concepts will have to be deferred to the course after this one (which I will not be teaching or writing notes for). But as the coupon collecting problem is so simple, we shall proceed instead by direct computation.

Proof: It suffices to show that

\displaystyle {\bf P}(X_1 = j_1 \wedge \dots \wedge X_N = j_N) = \prod_{i=1}^N (1 - \frac{N-i+1}{N})^{j_i-1} \frac{N-i+1}{N}

\displaystyle = N^{-(j_1+\dots+j_n)} \prod_{i=1}^N (N-i+1) (i-1)^{j_i-1}

for any choice of natural numbers {j_1,\dots,j_N}. In order for the event {X_1 = j_1 \wedge \dots \wedge X_n = j_n} to hold, the first coupon {Y_1} can be arbitrary, but the coupons {Y_2,\dots,Y_{j_1}} have to be equal to {Y_1}; then {Y_{j_1+1}} must be from one the remaining {n-1} elements of {\{1,\dots,N\}} not equal to {Y_1}, and {Y_{j_1+2},\dots,Y_{j_1+j_2}} must be from the two-element set {\{Y_1,Y_{j_1+1}\}}; and so on and so forth up to {Y_{j_1+\dots+j_n}}. The claim then follows from a routine application of elementary combinatorics to count all the possible values for the tuple {(Y_1,\dots,Y_{j_1+\dots+j_n})} of the above form and dividing by the total number {N^{j_1+\dots+j_n}} of such tuples. \Box

Exercise 9 Show that if {X} is a geometric distribution with parameter {p} for some {0 < p \leq 1} (thus {{\bf P}(X=j) = (1-p)^{j-1} p} for all {j}) then {X} has mean {\frac{1}{p}} and variance {\frac{1-p}{p^2}}.

From the above proposition and exercise as well as (3), (4) we see that

\displaystyle {\bf E} T_N = \sum_{i=1}^N \frac{N}{N-i+1}


\displaystyle {\bf Var} T_N = \sum_{i=1}^N (\frac{N}{N-i+1})^2 \frac{i-1}{N}.

From the integral test (and crudely bounding {(i-1)/N = O(1)}) one can thus obtain the bounds

\displaystyle {\bf E} T_N = N \log N + O(N)


\displaystyle {\bf Var} T_N = O(N^2)

where we use the usual asymptotic notation of denoting by {O(X)} any quantity bounded in magnitude by a constant multiple {CX} of {X}. From Chebyshev’s inequality we thus see that

\displaystyle {\bf P}( |T_N - N \log N| \geq \lambda N ) = O( \lambda^{-2} )

for any {\lambda > 0} (note the bound is trivial unless {\lambda} is large). This implies in particular that {T_N/N \log N} converges to {1} in probability as {N \rightarrow \infty} (assuming that our underlying probability space can model a separate coupon collector problem for each choice of {N}). Thus, roughly speaking, we see that one expects to take about {N \log N} units of time to collect all {N} coupons.

Another application of the weak and strong law of large numbers (even with the moment hypotheses currently imposed on these laws) is a converse to the Borel-Cantelli lemma in the jointly independent case:

Exercise 10 (Second Borel-Cantelli lemma) Let {E_1,E_2,\dots} be a sequence of jointly independent events. If {\sum_{n=1}^\infty {\bf P}(E_n) = \infty}, show that almost surely an infinite number of the {E_n} hold simultaneously. (Hint: compute the mean and variance of {S_n= \sum_{i=1}^n 1_{E_i}}. One can also compute the fourth moment if desired, but it is not necessary to do so for this result.)

One application of the second Borel-Cantelli lemma has the colourful name of the “infinite monkey theorem“:

Exercise 11 (Infinite monkey theorem) Let {X_1,X_2,\dots} be iid random variables drawn uniformly from a finite alphabet {A}. Show that almost surely, every finite word {a_1 \dots a_k} of letters {a_1,\dots,a_k} in the alphabet appears infinitely often in the string {X_1 X_2 X_3 \dots}.

In the usual formulation of the weak or strong law of large numbers, we draw the sums {S_n} from a single infinite sequence {X_1,X_2,\dots} of iid random variables. One can generalise the situation slightly by working instead with sums from rows of a triangular array, which are jointly independent within rows but not necessarily across rows:

Exercise 12 (Triangular arrays) Let {(X_{i,n})_{i,n \in {\bf N}: i \leq n}} be a triangular array of scalar random variables {X_{i,n}}, such that for each {n}, the row {X_{1,n},\dots,X_{n,n}} is a collection of independent random variables. For each {n}, we form the partial sums {S_n = X_{1,n} + \dots + X_{n,n}}.

  • (i) (Weak law) If all the {X_{i,n}} have mean {\mu} and {\sup_{i,n} {\bf E} |X_{i,n}|^2 < \infty}, show that {S_n/n} converges in probability to {\mu}.
  • (ii) (Strong law) If all the {X_{i,n}} have mean {\mu} and {\sup_{i,n} {\bf E} |X_{i,n}|^4 < \infty}, show that {S_n/n} converges almost surely to {\mu}.

Note that the weak and strong law of large numbers established previously corresponds to the case {X_{i,n} = X_i} when the triangular array collapses to a single sequence {X_1,X_2,\dots} of iid random variables.

We now illustrate the use of moment method and law of large number methods to two important examples of random structures, namely random graphs and random matrices.

Exercise 13 For a natural number {n} and a parameter {0 \leq p \leq 1}, define an Erdös-Renyi graph on {n} vertices with parameter {p} to be a random graph {(V,E)} on a (deterministic) vertex set {V} of {n} vertices (thus {(V,E)} is a random variable taking values in the discrete space of all {2^{\binom{n}{2}}} possible graphs one can place on {V}) such that the events {\{i,j\} \in E} for unordered pairs {\{i,j\}} in {V} are jointly independent and each occur with probability {p}.

For each {n}, let {(V_n,E_n)} be an Erdös-Renyi graph on {n} vertices with parameter {1/2} (we do not require the graphs to be independent of each other).

  • (i) If {|E_n|} is the number of edges in {(V_n,E_n)}, show that {|E_n|/\binom{n}{2}} converges almost surely to {1/2}. (Hint: use Exercise 12.)
  • (ii) If {|T_n|} is the number of triangles in {(V_n,E_n)} (i.e. the set of unordered triples {\{i,j,k\}} in {V_n} such that {\{i,j\}, \{i,k\}, \{j,k\} \in E_n}), show that {|T_n|/\binom{n}{3}} converges in probability to {1/8}. (Note: there is not quite enough joint independence here to directly apply the law of large numbers, however the second moment method still works nicely.)
  • (iii) Show in fact that {|T_n|/\binom{n}{3}} converges almost surely to {1/8}. (Note: in contrast with the situation with the strong law of large numbers, the fourth moment does not need to be computed here.)

Exercise 14 For each {n}, let {A_n = (a_{ij,n})_{1 \leq i,j \leq n}} be a random {n \times n} matrix (i.e. a random variable taking values in the space {{\bf R}^{n \times n}} or {{\bf C}^{n \times n}} of {n \times n} matrices) such that the entries {a_{ij,n}} of {A_n} are jointly independent in {i,j} and take values in {\{-1,+1\}} with a probability of {1/2} each. (Such matrices are known as random sign matrices.) We do not assume any independence for the sequence {A_1,A_2,\dots}.

  • (i) Show that the random variables {\hbox{tr} A_n A_n^* / n^2} are deterministically equal to {1}, where {A_n^*} denotes the adjoint (which, in this case, is also the transpose) of {A_n} and {\hbox{tr}} denotes the trace (sum of the diagonal entries) of a matrix.
  • (ii) Show that for any natural number {k}, the quantities {{\bf E} \hbox{tr} (A_n A_n^*)^k / n^{k+1}} are bounded uniformly in {n} (i.e. they are bounded by a quantity {C_k} that can depend on {k} but not on {n}). (You may wish to first work with simple cases like {k=2} or {k=3} to gain intuition.)
  • (iii) If {\|A_n\|_{op}} denotes the operator norm of {A_n}, and {\varepsilon > 0}, show that {\|A_n\|_{op} / n^{1/2+\varepsilon}} converges almost surely to zero, and that {\|A_n\|_{op} / n^{1/2-\varepsilon}} diverges almost surely to infinity. (Hint: use the spectral theorem to relate {\|A_n\|_{op}} with the quantities {\hbox{tr} (A_n A_n^*)^k}.)

One can obtain much sharper information on quantities such as the operator norm of a random matrix; see this previous blog post for further discussion.

Exercise 15 The Cramér random model for the primes is a random subset {{\mathcal P}} of the natural numbers with {1 \not \in {\mathcal P}}, {2 \in {\mathcal P}}, and the events {n \in {\mathcal P}} for {n=3,4,\dots} being jointly independent with {{\bf P}(n \in {\mathcal P}) = \frac{1}{\log n}} (the restriction to {n \geq 3} is to ensure that {\frac{1}{\log n}} is less than {1}). It is a simple, yet reasonably convincing, probabilistic model for the primes {\{2,3,5,7,\dots\}}, which can be used to provide heuristic confirmations for many conjectures in analytic number theory. (It can be refined to give what are believed to be more accurate predictions; see this previous blog post for further discussion.)

  • (i) (Probabilistic prime number theorem) Prove that almost surely, the quantity {\frac{1}{x/\log x} |\{ n \leq x: n \in {\mathcal P}\}|} converges to one as {x \rightarrow \infty}.
  • (ii) (Probabilistic Riemann hypothesis) Show that if {\varepsilon > 0}, then the quantity

    \displaystyle \frac{1}{x^{1/2+\varepsilon}} ( |\{ n \leq x: n \in {\mathcal P} \}| - \int_2^x \frac{dt}{\log t} )

    converges almost surely to zero as {x \rightarrow \infty}.

  • (iii) (Probabilistic twin prime conjecture) Show that almost surely, there are an infinite number of elements {p} of {{\mathcal P}} such that {p+2} also lies in {{\mathcal P}}.
  • (iv) (Probabilistic Goldbach conjecture) Show that almost surely, all but finitely many natural numbers {n} are expressible as the sum of two elements of {{\mathcal P}}.

Probabilistic methods are not only useful for getting heuristic predictions about the primes; they can also give rigorous results about the primes. We give one basic example, namely a probabilistic proof (due to Turán) of a theorem of Hardy and Ramanujan, which roughly speaking asserts that a typical large number {n} has about {\log\log n} distinct prime factors.

Exercise 16 (Hardy-Ramanujan theorem) Let {x \geq 100} be a natural number (so in particular {\log \log x \geq 1}), and let {n} be a natural number drawn uniformly at random from {1} to {x}. Assume Mertens’ theorem

\displaystyle \sum_{p \leq x} \frac{1}{p} =\log\log x + O(1)

for all {x \geq 100}, where the sum is over primes up to {x}.

  • (i) Show that the random variable {\sum_{p \leq x^{1/10}} 1_{p|n}} (where {1_{p|n}} is {1} when {p} divides {n} and {0} otherwise, and the sum is over primes up to {x^{1/10}}) has mean {\log\log x + O(1)} and variance {O( \log\log x)}. (Hint: compute (up to reasonable error) the means, variances and covariances of the random variables {1_{p|n}}.)
  • (ii) If {\omega(n)} denotes the number of distinct prime factors of {n}, show that {\frac{\omega(n)}{\log\log n}} converges to {1} in probability as {x \rightarrow \infty}. (Hint: first show that {\omega(n) = \sum_{p \leq x^{1/10}} 1_{p|n} + O(1))}.) More precisely, show that

    \displaystyle \frac{\omega(n) - \log\log n}{g(n) \sqrt{\log\log n}}

    converges in probability to zero, whenever {g: {\bf N} \rightarrow {\bf R}} is any function such that {g(n)} goes to infinity as {n \rightarrow \infty}.

Exercise 17 (Shannon entropy) Let {A} be a finite non-empty set of some cardinality {|A|}, and let {X} be a random variable taking values in {A}. Define the Shannon entropy {{\bf H}(X)} to be the quantity

\displaystyle {\bf H}(X) := \sum_{x \in A} {\bf P}(X = x) \log \frac{1}{{\bf P}(X=x)}

with the convention that {0 \log \frac{1}{0}=0}. (In some texts, the logarithm to base {2} is used instead of the natural logarithm {\log}.)

  • (i) Show that {0 \leq {\bf H}(X) \leq \log |A|}. (Hint: use Jensen’s inequality.) Determine when the equality {{\bf H}(X) = \log |A|} holds.
  • (ii) Let {\varepsilon > 0} and {n} be a natural number. Let {X_1,\dots,X_n} be {n} iid copies of {X}, thus {\vec X := (X_1,\dots,X_n)} is a random variable taking values in {A^n}, and the distribution {\mu_{\vec X}} is a probability measure on {A^n}. Let {\Omega \subset A^n} denote the set

    \displaystyle \Omega := \{ \vec x \in A^n: \exp(-(1+\varepsilon) n {\bf H}(X)) \leq \mu_{\vec X}(\{\vec x\})

    \displaystyle \leq \exp(-(1-\varepsilon) n {\bf H}(X)) \}.

    Show that if {n} is sufficiently large, then

    \displaystyle {\bf P}( \vec X \in \Omega) \geq 1-\varepsilon


    \displaystyle \exp((1-2\varepsilon) n {\bf H}(X)) \leq |\Omega|

    \displaystyle \leq \exp((1+2\varepsilon) n {\bf H}(X)).

    (Hint: use the weak law of large numbers to understand the number of times each element {x} of {A} occurs in {\vec X}.)

Thus, roughly speaking, while {\vec X} in principle takes values in all of {A^n}, in practice it is concentrated in a set of size about {\exp( n {\bf H}(X) )}, and is roughly uniformly distributed on that set. This is the beginning of the microstate interpretation of Shannon entropy, but we will not develop the theory of Shannon entropy further in this course.

— 2. Truncation —

The weak and strong laws of large numbers have been proven under additional moment assumptions (of finite second moment and finite fourth moment respectively). To remove these assumptions, we use the simple but effective truncation method, decomposing a general scalar random variable {X} into a truncated component such as {X 1_{|X| \leq M}} for some suitable threshold {M}, and a tail {X 1_{|X|>M}}. The main term {X 1_{|X| \leq M}} is bounded and thus has all moments finite. The tail {X 1_{|X| > M}} will not have much better moment properties than the original variable {X}, but one can still hope to make it “small” in various ways. There is a tradeoff regarding the selection of the truncation parameter {M}: if {M} is too large, then the truncated component {X 1_{|X| \leq M}} has poor estimates, but if {M} is too small then the tail {X 1_{|X|>M}} causes trouble.

Let’s first see how this works with the weak law of large numbers. As before, we assume we have an iid sequence {X_1,X_2,\dots} copies of a real absolutely integrable {X} with no additional finite moment hypotheses. We write {S_n = X_1 + \dots + X_n}. At present, we cannot take variances of the {X_i} or {S_n} and so the second moment method is not directly available. But we now perform a truncation; it turns out that a good choice of threshold here is {n}, thus we write {X_i = X_{i,\leq n} + X_{i, > n}} where {X_{i, \leq n} := X_i 1_{|X_i| \leq n}} and {X_{i,>n} := X_i 1_{|X_i| > n}}, and then similarly decompose {S_n = S_{n,\leq} + S_{n,>}} where

\displaystyle S_{n,\leq} := X_{1,\leq n} + \dots + X_{n,\leq n}


\displaystyle S_{n,>} := X_{1,>n} + \dots + X_{n,>n}.

Write {\mu := {\bf E} X}. We wish to show that

\displaystyle {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \rightarrow 0 \ \ \ \ \ (5)


as {n \rightarrow \infty} for any given {\varepsilon > 0} (not depending on {n}). By the triangle inequality, we can split

\displaystyle {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \leq {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) + {\bf P}( S_{n,>} \neq 0 ).

We begin by studying {S_{n,\leq}/n}.

The random variables {X_{1,\leq n},\dots,X_{n,\leq n}} are iid with mean {\mu_{\leq n} := {\bf E} X 1_{|X| \leq n}} and variance at most

\displaystyle {\bf Var}(X 1_{|X| \leq n}) \leq {\bf E} (X 1_{|X| \leq n})^2

\displaystyle = {\bf E} |X|^2 1_{|X| \leq n}.

Thus, {S_{n,\leq}/n} has mean {\mu_{\leq n}} and variance at most {\frac{1}{n} {\bf E} |X|^2 1_{|X| \leq n}}. By dominated convergence, {\mu_{\leq n} \rightarrow \mu} as {n \rightarrow \infty}, so for sufficiently large {n} we can bound

\displaystyle {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) \leq {\bf P}( |S_{n,\leq}/n - \mu_{\leq n}| \geq \varepsilon/2 )

and hence by the Chebyshev inequality

\displaystyle {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) \leq \frac{1}{(\varepsilon/2)^2} {\bf E} \frac{|X|^2}{n} 1_{|X| \leq n}.

Observe that {\frac{|X|^2}{n} 1_{|X| \leq n}} is bounded surely by the absolutely integrable {|X|}, and goes to zero as {n \rightarrow \infty}, so by dominated convergence we conclude that

\displaystyle {\bf P}( |S_{n,\leq}/n - \mu| \geq \varepsilon ) \rightarrow 0

as {n \rightarrow \infty} (keeping {\varepsilon} fixed).

To handle {S_{n,>}}, we observe that each {X_{i,>n}} is only non-zero with probability {{\bf P}(|X| > n)}, and hence by subadditivity

\displaystyle {\bf P}( |S_{n,>}| > 0 ) \leq n {\bf P}(|X|>n)

\displaystyle = {\bf E} n 1_{|X|>n}.

By dominated convergence again, {{\bf E} n 1_{|X|>n} \rightarrow 0} as {n \rightarrow \infty}, and thus

\displaystyle {\bf P}( S_{n,>} \neq 0 ) \rightarrow 0.

Putting all this together, we conclude (5) as required. This concludes the proof of the weak law of large numbers (in the iid case) for arbitrary absolutely integrable {X}. For future reference, we observe that the above arguments give the bound

\displaystyle {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \leq \frac{1}{(\varepsilon/2)^2} {\bf E} \frac{|X|^2}{n} 1_{|X| \leq n} + {\bf E} n 1_{|X|>n} \ \ \ \ \ (6)


whenever {n} is sufficiently large depending on {\varepsilon}.

Due to the reliance on the dominated convergence theorem, the above argument does not provide any uniform rate of decay in (5). Indeed there is no such uniform rate. Consider for instance the sum {S_n = X_1 + \dots + X_n} where {X_1,\dots,X_n} are iid random variables that equal {n} with probability {1/n} and {0} with probability {1-1/n}. Then the {X_i} are unsigned and all have mean {1}, but {S_n} vanishes with probability {(1-1/n)^n}, which converges to {1/e} as {n \rightarrow \infty}. Thus we see that the probability that {S_n/n} stays a distance {1} from the mean value of {1} is bounded away from zero. This is not inconsistent with the weak law of large numbers, because the underlying random variable {X} depends on {n} in this example. However, it rules out an estimate of the form

\displaystyle {\bf P}( |S_n/n - \mu| \geq \varepsilon ) \leq c_{\varepsilon,M}(n)

that holds uniformly whenever {X} obeys the bound {{\bf E} |X| \leq M}, and {c_{\varepsilon,M}(n)} is a quantity that goes to zero as {n \rightarrow \infty} for a fixed choice of {\varepsilon,M}. (Contrast this with (2), which does provide such a uniform bound if one also assumes a bound on the second moment {{\bf E} |X|^2}.)

One can ask what happens to the {S_n} when the underlying random variable {X} is not absolutely integrable. In the unsigned case, we have

Exercise 18 Let {X_1,X_2,\dots} be iid copies of an unsigned random variable {X} with infinite mean, and write {S_n := X_1 + \dots + X_n}. Show that {S_n/n} diverges to infinity in probability, in the sense that {{\bf P}( S_n/n \geq M ) \rightarrow 1} as {n \rightarrow \infty} for any fixed {M < \infty}.

The above exercise shows that {S_n} grows faster than {n} (in probability, at least) when {X} is unsigned with infinite mean, but this does not completely settle the question of the precise rate at which {S_n} does grow. We will not answer this question in full generality here, but content ourselves with analysing a classic example of the unsigned infinite mean setting, namely the Saint Petersburg paradox. The paradox can be formulated as follows. Suppose we have a lottery whose payout {X} takes taking values in the powers of two {2,2^2,2^3,\dots} with

\displaystyle {\bf P}( X = 2^i ) = 2^{-i}

for {i=1,2,\dots}. The question is to work out what is the “fair” or “breakeven” price to pay for a lottery ticket. If one plays this lottery {n} times, the total payout is {S_n = X_1 + \dots + X_n} where {X_1,X_2,\dots} are independent copies of {X}, so the question boils down to asking what one expects the value of {S_n/n} to be. If {X} were absolutely integrable, the strong or weak law of large numbers would indicate that {{\bf E} X} is the fair price to pay, but in this case we have

\displaystyle {\bf E} X = \sum_{i=1}^\infty 2^i \times 2^{-i} = \sum_{i=1}^\infty 1 = \infty.

This suggests, paradoxically, that any finite price for this lottery, no matter how high, would be a bargain!

To clarify this paradox, we need to get a better understanding of the random variable {S_n/n}. For a given {n}, we let {M} be a truncation parameter to be chosen later, and split {X_i = X_{i,\leq M} + X_{i,>M}} where {X_{i,\leq M} := X_i 1_{X_i \leq M}} and {X_{i,>M} := X_i 1_{X_i>M}} as before (we no longer need the absolute value signs here as all random variables are unsigned). Since the {X_i} take values in powers of two, we may as well also set {M=2^m} to be a power of two. We split {S_n = S_{n,\leq} + S_{n,>}} where

\displaystyle S_{n,\leq} = X_{1,\leq M} + \dots + X_{n,\leq M}


\displaystyle S_{n,>} = X_{1,> M} + \dots + X_{n,>M}.

The random variable {X 1_{X \leq M}} can be computed to have mean

\displaystyle {\bf E} X 1_{X \leq M} = \sum_{i=1}^m 2^i \times 2^{-i} = m

and we can upper bound the variance by

\displaystyle {\bf Var}(X 1_{X \leq M}) \leq {\bf E} (X 1_{X \leq M})^2

\displaystyle = \sum_{i=1}^m 2^{2i} 2^{-i}

\displaystyle \leq 2^{m+1}

and hence {S_{n,\leq}/n} has mean {m} and variance at most {2^{m+1}/n}. By Chebyshev’s inequality, we thus have

\displaystyle {\bf P}( |S_{n,\leq}/n - m| \geq \lambda ) \leq \frac{2^{m+1}}{n \lambda^2}

for any {\lambda > 0}.

Now we turn to {S_{n,>}}. We cannot use the first or second moment methods here because the {X_{i,>M}} are not absolutely integrable. However, we can instead use the following “zeroth moment method” argument. Observe that the random variable {X 1_{X>M}} is only nonzero with probability {2^{-m}} (that is to say, the “zeroth moment” {{\bf E} (X 1_{X>M})^0} is {2^{-m}}, using the convention {0^0=0}). Thus {S_{n,>}} is nonzero with probability at most {n 2^{-m}}. We conclude that

\displaystyle {\bf P}( |S_{n}/n - m| \geq \lambda ) \leq \frac{2^{m+1}}{n \lambda^2} + n 2^{-m}

This bound is valid for any natural number {m} and any {\lambda > 0}. Of course for this bound to be useful, we want to select parameters so that the right-hand side is somewhat small. If we pick for instance {m} to be the integer part of {\log_2 n + \frac{1}{2} \log_2 \log_2 n}, and {\lambda} to be {\sqrt{\log_2 n}}, we see that

\displaystyle {\bf P}( |S_n/n - \log_2 n - \frac{1}{2}\log_2 \log_2 n| \geq \sqrt{\log_2 n} ) = O( \frac{1}{\log^{1/2} n} )

which (for large {n}) implies that

\displaystyle {\bf P}( |\frac{S_n}{n \log_2 n} - 1| \geq \frac{2}{\sqrt{\log_2 n}} ) = O( \frac{1}{\log^{1/2} n} ).

In particular, we see that {S_n / (n \log_2 n)} converges in probability to one. This suggests that the fair price to pay for the Saint Petersburg lottery is a function of the number {n} of tickets one wishes to play, and should be approximately {\log_2 n} when {n} is large. In particular, the lottery is indeed worth paying out any finite cost {M}, but one needs to buy about {2^M} tickets before one breaks even!

In contrast to the absolutely integrable case, in which the weak law can be upgraded to the strong law, there is no strong law for the Saint Petersburg paradox:

Exercise 19 With the notation as in the above analysis of the Saint Petersburg paradox, show that {\frac{S_n}{n \log_2 n}} is almost surely unbounded. (Hint: it suffices to show that {X_n / n \log_2 n} is almost surely unbounded. For this, use the second Borel-Cantelli lemma.)

The following exercise can be viewed as a continuous analogue of the Saint Petersburg paradox.

Exercise 20 A real random variable {X} is said to have a standard Cauchy distribution if it has the probability density function {x \mapsto \frac{1}{\pi} \frac{1}{1+x^2}}.

  • (i) Verify that standard Cauchy distributions exist (this boils down to checking that the integral of the probability density function is {1}).
  • (ii) Show that a real random variable with the standard Cauchy distribution is not absolutely integrable.
  • (iii) If {X_1,X_2,\dots} are iid copies of a random variable {X} with the standard Cauchy distribution, show that {\frac{|X_1|+\dots+|X_n|}{n \log n}} converges in probability to {\frac{2}{\pi}} but is almost surely unbounded.

Exercise 21 (Weak law of large numbers for triangular arrays) Let {(X_{i,n})_{i,n \in {\bf N}: 1 \leq i \leq n}} be a triangular array of random variables, with the variables {X_{1,n},\dots,X_{n,n}} jointly independent for each {n}. Let {M_n} be a sequence going to infinity, and write {X_{i,n,\leq} := X_{i,n} 1_{|X_{i,n}| \leq M_n}} and {\mu_n := \sum_{i=1}^n {\bf E} X_{i,n,\leq}}. Assume that

\displaystyle \sum_{i=1}^n {\bf P}( |X_{i,n}| > M_n ) \rightarrow 0


\displaystyle \frac{1}{M_n^2} \sum_{i=1}^n {\bf E} |X_{i,n,\leq}|^2 \rightarrow 0

as {n \rightarrow \infty}. Show that

\displaystyle \frac{X_{1,n}+\dots+X_{n,n} - \mu_n}{M_n} \rightarrow 0

in probability.

Now we turn to establishing the strong law of large numbers in full generality. A first attempt would be to apply the Borel-Cantelli lemma to the bound (6). However, the decay rates for quantities such as {{\bf P}( |S_n/n - \mu| > \varepsilon )} are far too weak to be absolutely summable, in large part due to the reliance on the dominated convergence theorem. To get around this we follow some arguments of Etemadi. We first need to make a few preliminary reductions, aimed at “sparsifying” the set of times {n} that one needs to control. It is here that we will genuinely use the fact that the averages {S_n} are being drawn from a single sequence {X_1,X_2,\dots} of random variables, rather than from a triangular array.

We turn to the details. In previous arguments it was convenient to normalise the underlying random variable {X} to have mean zero. Here we will use a different reduction, namely to the case when {X} is unsigned; the strong law for real absolutely integrable {X} clearly follows from the unsigned case by expressing {X} as the difference of two unsigned absolutely integrable variables {\max(X,0)} and {\max(-X,0)} (and splitting {X_n} similarly).

Henceforth we assume {X} (and hence the {X_n}) to be unsigned. Crucially, this now implies that the partial sums {S_n} are monotone: {S_n \leq S_{n+1}}. While this does not quite imply any monotonicity on the sequence {S_n/n}, it does make it significantly easier to show that it converges. The key point is as follows.

Lemma 22 Let {0 \leq S_1 \leq S_2 \leq \dots} be an increasing sequence, and let {\mu} be a real number. Suppose that for any {\delta > 0}, the sequence {S_{n_{j,\delta}}/n_{j,\delta}} converges to {\mu} as {j \rightarrow \infty}, where {n_{j,\delta} := \lfloor (1+\delta)^j \rfloor}. Then {S_n/n} converges to {\mu}.

Proof: Let {0 < \delta < 1/2}. For any {n}, let {j} be the index such that {n_{j,\delta} \leq n < n_{j+1,\delta}}. Then we have

\displaystyle S_{n_{j,\delta}} \leq S_n \leq S_{n_{j+1,\delta}}

and (for sufficiently large {n})

\displaystyle (1+2\delta) n_{j,\delta} \geq n \geq \frac{1}{1+2\delta} n_{j+1,\delta}

and thus

\displaystyle \frac{1}{1+2\delta} \frac{S_{n_{j,\delta}}}{n_{j,\delta}} \leq \frac{S_n}{n} \leq (1+2\delta) \frac{S_{n_{j+1,\delta}}}{n_{j+1,\delta}}.

Taking limit inferior and superior, we conclude that

\displaystyle \frac{1}{1+2\delta} \mu \leq \liminf_{n \rightarrow \infty} \frac{S_n}{n} \leq \limsup_{n \rightarrow \infty} \frac{S_n}{n} \leq (1+2\delta) \mu,

and then sending {\delta \rightarrow 0} we obtain the claim. \Box

An inspection of the above argument shows that we only need to verify the hypothesis for a countable sequence of {\delta} (e.g. {\delta=1/m} for natural number {m}). Thus, to show that {S_n/n} converges to {\mu} almost surely, it suffices to show that for any {\delta>0}, one has {S_{n_{j,\delta}}/n_{j,\delta} \rightarrow \mu} almost surely as {j \rightarrow \infty}.

Fix {\delta > 0}. The point is that the “lacunary” sequence {n_{j,\delta}} is much sparser than the sequence of natural numbers {n}, and one now will lose a lot less from the Borel-Cantelli argument. Indeed, for any {\varepsilon > 0}, we can apply (6) to conclude that

\displaystyle {\bf P}( |S_{n_{j,\delta}}/n_{j,\delta} - \mu| \geq \varepsilon ) \leq \frac{1}{(\varepsilon/2)^2} {\bf E} \frac{|X|^2}{n_{j,\delta}} 1_{|X| \leq n_{j,\delta}} + {\bf E} n_{j,\delta} 1_{|X|>n_{j,\delta}}

whenever {j} is sufficiently large depending on {\delta,\varepsilon}. Thus, by the Borel-Cantelli lemma, it will suffice to show that the sums

\displaystyle \sum_{j=1}^\infty {\bf E} \frac{|X|^2}{n_{j,\delta}} 1_{|X| \leq n_{j,\delta}}


\displaystyle \sum_{j=1}^\infty {\bf E} n_{j,\delta} 1_{|X|>n_{j,\delta}}

are finite. Using the monotone convergence theorem to interchange the sum and expectation, it thus suffices to show the pointwise estimates

\displaystyle \sum_{j=1}^\infty \frac{|X|^2}{n_{j,\delta}} 1_{|X| \leq n_{j,\delta}} \leq C_\varepsilon |X|


\displaystyle \sum_{j=1}^\infty n_{j,\delta} 1_{|X|>n_{j,\delta}} \leq C_\varepsilon |X|

for some {C_\varepsilon}. But this follows from the geometric series formula (the first sum is over the elements of the sequence {n_{j,\delta}} that are greater than or equal to {|X|}, while the latter is over those that are less than {|X|}). This proves the strong law of large numbers for arbitrary absolutely integrable iid {X_1,X_2,\dots}.

We remark that by carefully inspecting the above proof of the strong law of large numbers, we see that the hypothesis of joint independence of the {X_n} can be relaxed to pairwise independence.

The next exercise shows how one can use the strong law of large numbers to approximate the cumulative distribution function of a random variable {X} by an empirical cumulative distribution function.

Exercise 23 Let {X_1,X_2,\dots} be iid copies of a real random variable {X}.

  • (i) Show that for every real number {t}, one has almost surely that

    \displaystyle \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \}| \rightarrow {\bf P}( X \leq t )


    \displaystyle \frac{1}{n} | \{ 1 \leq i \leq n: X_i < t \}| \rightarrow {\bf P}( X < t )

    as {n \rightarrow \infty}.

  • (ii) Establish the Glivenko-Cantelli theorem: almost surely, one has

    \displaystyle \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \}| \rightarrow {\bf P}( X \leq t )

    uniformly in {t} as {n \rightarrow \infty}. (Hint: For any natural number {m}, let {f_m(x)} denote the largest integer multiple of {1/m} less than {x}. Show first that {f_m( \frac{1}{n} | \{ 1 \leq i \leq n: X_i \leq t \}| )} is within {O(1/m)} of {f_m( {\bf P}( X \leq t ) )} for all {t} when {n} is sufficiently large.)

Exercise 24 (Lack of strong law for triangular arrays) Let {X} be a random variable taking values in the natural numbers with {{\bf P}(X=n) = \frac{1}{\zeta(3)} \frac{1}{n^3}}, where {\zeta(3) := \sum_{n=1}^\infty \frac{1}{n^3}} (this is an example of a zeta distribution).

  • (i) Show that {X} is absolutely integrable.
  • (ii) Let {(X_{i,n})_{i,n \in {\bf N}: 1 \leq i \leq n}} be jointly independent copies of {X}. Show that the random variables {\frac{X_{1,n}+\dots+X_{n,n}}{n}} are almost surely unbounded. (Hint: for any constant {A}, show that {\frac{X_{1,n}+\dots+X_{n,n}}{n} > A} occurs with probability at least {\varepsilon/n} for some {\varepsilon > 0} depending on {A}. Then use the second Borel-Cantelli lemma.)

— 3. The Kolmogorov maximal inequality —

Let {X_1,X_2,\dots} be a sequence of jointly independent square-integrable real random variables of mean zero; we do not assume the {X_i} to be identically distributed. As usual, we form the sums {S_n := X_1 + \dots + X_n}, then {S_n} has mean zero and variance

\displaystyle {\bf Var}(S_n) = {\bf E} S_n^2 = {\bf Var}(X_1)+\dots+{\bf Var}(X_n). \ \ \ \ \ (7)


From Chebyshev’s inequality, we thus have

\displaystyle {\bf P}( |S_n| \geq t ) \leq \frac{{\bf Var}(X_1)+\dots+{\bf Var}(X_n)}{t^2}

for any {0 < t < \infty} and natural number {n}. Perhaps surprisingly, we have the following improvement to this bound, known as the Kolmogorov maximal inequality:

Theorem 25 (Kolmogorov maximal inequality) With the notation and hypotheses as above, we have

\displaystyle {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq t ) \leq \frac{{\bf Var}(X_1)+\dots+{\bf Var}(X_n)}{t^2}

Proof: For each {i}, let {E_i} be the event that {|S_i| \geq t}, but that {|S_j| < t} for all {1 \leq j < i}. It is clear that the event {\sup_{1 \leq i \leq n} |S_i| \geq t} is the disjunction of the disjoint events {E_1,\dots,E_n}, thus

\displaystyle {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq t ) = \sum_{i=1}^n {\bf P}(E_i).

On the event {E_i}, we have {1 \leq \frac{1}{t^2} S_i^2}, and hence

\displaystyle {\bf P}(E_i) \leq \frac{1}{t^2} {\bf E} S_i^2 1_{E_i}.

We will shortly prove the inequality

\displaystyle {\bf E} S_i^2 1_{E_i} \leq {\bf E} S_n^2 1_{E_i} \ \ \ \ \ (8)


for all {1 \leq i \leq n}. Assuming this inequality for the moment, we can put together all the above estimates, using the disjointness of the {E_i}, to conclude that

\displaystyle {\bf P}( \sup_{1 \leq i \leq n} |S_i| \geq t ) \leq \frac{1}{t^2} {\bf E} S_n^2

and the claim follows from (7).

It remains to prove (8). Since

\displaystyle S_n^2 = S_i^2 + (S_n - S_i)^2 + 2 (S_n-S_i) S_i

\displaystyle \geq S_i^2 + 2 (S_n-S_i) S_i

we have

\displaystyle {\bf E} S_n^2 1_{E_i} \geq {\bf E} S_i^2 1_{E_i} + 2 {\bf E} S_i 1_{E_i} (S_n-S_i).

But note that the random variable {S_i 1_{E_i}} is completely determined by {X_1,\dots,X_i}, while {S_n-S_i} is completely determined by the {X_{i+1},\dots,X_n}. Thus {S_i 1_{E_i}} and {S_n-S_i} are independent. Since {S_n-S_i} also has mean zero, we have

\displaystyle {\bf E} S_i 1_{E_i} (S_n-S_i) = 0

and the claim (8) follows. \Box

An inspection of the above proof reveals that the key ingredient is the lack of correlation between past and future – a variable such as {S_i 1_{E_i}}, which is determined by the portion of the sequence {X_1,X_2,\dots} to the “past” (and present) of {i}, is uncorrelated with a variable such as {S_n-S_i} that depends only on the “future” of {i}. One can formalise such a lack of correlation through the concept of a martingale, which will be covered in later courses in this sequence but which is beyond the scope of these notes. The use of the first time {i} at which {|X_i|} exceeds or attains the threshold {t} is a simple example of a stopping time, which will be a heavily used concept in the theory of martingales (and is also used extensively in harmonic analysis, which is also greatly interested in establishing maximal inequalities).

The Kolmogorov maximal inequality gives the following variant of the strong law of large numbers.

Theorem 26 (Convergence of random series) Let {X_1,X_2,\dots} be jointly independent square-integrable real random variables of mean zero with

\displaystyle \sum_{i=1}^\infty {\bf Var}(X_i) < \infty.

Then the series {\sum_{i=1}^\infty X_i} is almost surely convergent (i.e., the partial sums converge almost surely).

Proof: From the Kolmogorov maximal inequality and continuity from below we have

\displaystyle {\bf P}( \sup_i |S_i| \geq t ) \leq \frac{\sum_{i=1}^\infty {\bf Var}(X_i)}{t^2}. \ \ \ \ \ (9)


This is already enough to show that the partial sums {S_n = \sum_{i=1}^n X_i} are almost surely bounded in {n}, but this isn’t quite enough to establish conditional convergence. To finish the job, we apply (9) with {X_1,X_2,\dots} replaced by the shifted sequence {X_{n+1}, X_{n+2}, \dots} for a natural number {n} to conclude that

\displaystyle {\bf P}( \sup_{i > n} |S_i - S_n| \geq t ) \leq \frac{\sum_{i=n+1}^\infty {\bf Var}(X_i)}{t^2}.

Sending {n} to infinity using continuity from above, we conclude that

\displaystyle {\bf P}( \sup_{i > n} |S_i - S_n| \geq t \forall n) = 0

for all {t>0}; applying this for all rational {t>0}, we conclude that {S_1,S_2,\dots} is almost surely a Cauchy sequence, and the claim follows. \Box

We can use this result together with some elementary manipulation of sums to give the following alternative proof of the strong law of large numbers.

Theorem 27 (Strong law of large numbers) Let {X_1,X_2,\dots} be iid copies of an absolutely integrable variable {X} of mean {{\bf E} X = \mu}, and let {S_n := X_1+\dots+X_n}. Then {S_n/n} converges almost surely to {\mu}.

Proof: We may normalise {X} to be real valued with {\mu=0}. Note that that

\displaystyle \sum_{n=1}^\infty {\bf P}( |X_n| > n ) = \sum_{n=1}^\infty {\bf P}(|X| > n)

\displaystyle = {\bf E} \sum_{n=1}^\infty 1_{|X| > n}

\displaystyle \leq {\bf E} |X|

\displaystyle < \infty

and hence by the Borel-Cantelli lemma we almost surely have {|X_n| \leq n} for all but finitely many {n}. Thus if we write {X'_n := X_n 1_{|X_n| \leq n}} and {S'_n := X'_1+\dots+X'_n}, then the difference between {S_n/n} and {S'_n/n} almost surely goes to zero as {n \rightarrow \infty}. Thus it will suffice to show that {S'_n/n} goes to zero almost surely.

The random variables {X'_1,X'_2,\dots} are still jointly independent, but are not quite mean zero. However, the normalised random variables {Y_n := \frac{X'_n - {\bf E} X'_n}{n}} are of mean zero and

\displaystyle \sum_{i=1}^\infty {\bf Var}(Y_i) = \sum_{i=1}^\infty {\bf Var}(X'_i) / i^2

\displaystyle \leq \sum_{i=1}^\infty {\bf E} |X'_i|^2 / i^2

\displaystyle = \sum_{i=1}^\infty {\bf E} |X|^2 1_{|X| \leq i} / i^2

\displaystyle = {\bf E} |X|^2 \sum_{i > |X|} \frac{1}{i^2}

\displaystyle = {\bf E} O( |X| )

\displaystyle < \infty

so by Theorem 26 we see that the sum {\sum_{i=1}^\infty Y_i} is almost surely convergent.

Write {T_n := Y_1+\dots+Y_n}, thus the sequence {T_n} is almost surely a Cauchy sequence. From the identity

\displaystyle \frac{1}{n} \sum_{i=1}^n i Y_i = \frac{1}{n} \sum_{i=0}^{n-1} (T_n - T_i)

we conclude that the sequence {\frac{1}{n} \sum_{i=1}^n i Y_i} almost surely converges to zero, that is to say

\displaystyle \frac{S'_n}{n} - \frac{\sum_{i=1}^n {\bf E} X'_i}{n} \rightarrow 0

almost surely. On the other hand, we have

\displaystyle {\bf E} X'_i = {\bf E} X 1_{|X| \leq i}

\displaystyle = - {\bf E} X 1_{|X| > i}

and hence by dominated convergence

\displaystyle {\bf E} X'_i \rightarrow 0

as {i \rightarrow \infty}, which implies that

\displaystyle \frac{\sum_{i=1}^n {\bf E} X'_i}{n} \rightarrow 0

as {n \rightarrow \infty}. By the triangle inequality, we conclude that {S'_n/n \rightarrow 0} almost surely, as required. \Box

Exercise 28 (Kronecker lemma) Let {\sum_{n=1}^\infty Y_n} be a convergent series of real numbers {Y_n}, and let {0 < b_1 \leq b_2 \leq \dots} be a sequence tending to infinity. Show that {\frac{1}{b_n} \sum_{i=1}^n b_i Y_i} converges to zero as {n \rightarrow \infty}. (This is known as Kronecker’s lemma; the special case {b_i=i} was implicitly used in the above argument.)

Exercise 29 (Kolmogorov three-series theorem, one direction) Let {X_1,X_2,\dots} be a sequence of jointly independent real random variables, and let {A>0}. Suppose that the two series {\sum_{n=1}^\infty {\bf P}(|X_n| > A)} and {\sum_{n=1}^\infty {\bf Var}( X_n 1_{|X_n| \leq A} )} are absolutely convergent, and the third series {\sum_{n=1}^\infty {\bf E} X_n 1_{|X_n| \leq A}} is convergent. Show that the series {\sum_{n=1}^\infty X_n} is almost surely convergent. (The converse claim is also true, and will be discussed in later notes; the two claims are known collectively as the Kolmogorov three-series theorem.)

One advantage that the maximal inequality approach to the strong law of large numbers has over the moment method approach is that it tends to offer superior bounds on the (almost sure) rate of convergence. We content ourselves with just one example of this:

Exercise 30 (Cheap law of iterated logarithm) Let {X_1,X_2,\dots} be a sequence of jointly independent real random variables of mean zero and bounded variance (thus {\sup_n {\bf E} X_n^2 < \infty}). Write {S_n := X_1+\dots+X_n}. Show that {\frac{S_n}{n^{1/2} \log^{1/2+\varepsilon} n}} converges almost surely to zero as {n \rightarrow \infty} for any given {\varepsilon>0}. (Hint: use Theorem 26 and the Kronecker lemma for a suitable weighted sum of the {X_n}.) There is a more precise version of this fact known as the law of the iterated logarithm, which is beyond the scope of these notes.

The exercises below will be moved to a more appropriate location later, but are currently placed here in order to not disrupt existing numbering.

Exercise 31 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable random variable {X} with mean {\mu}. Show that the averages {\frac{S_n}{n} = \frac{X_1+\dots+X_n}{n}} converge in {L^1} to {\mu}, that is to say that

\displaystyle {\bf E} |\frac{S_n}{n}-\mu| \rightarrow 0

as {n \rightarrow \infty}.

Exercise 32 A scalar random variable {X} is said to be in weak {L^1} if one has

\displaystyle \sup_{t>0} t {\bf P}(|X| \geq t) < \infty.

Thus Markov’s inequality tells us that every absolutely integrable random variable is in weak {L^1}, but the converse is not true (e.g. random variables with the Cauchy distribution are weak {L^1} but not absolutely integrable). Show that if {X_1,X_2,\dots} are iid copies of an unsigned weak {L^1} random variable, then there exists quantities {a_n \rightarrow \infty} such that {S_n/a_n} converges in probability to {1}, where {S_n = X_1 + \dots + X_n}. (Thus: there is a weak law of large numbers for weak {L^1} random variables, and a strong law for strong {L^1} (i.e. absolutely integrable) random variables.)