Let X be a real-valued random variable, and let X_1, X_2, X_3, ... be an infinite sequence of independent and identically distributed copies of X. Let \overline{X}_n := \frac{1}{n}(X_1 + \ldots + X_n) be the empirical averages of this sequence. A fundamental theorem in probability theory is the law of large numbers, which comes in both a weak and a strong form:

Weak law of large numbers. Suppose that the first moment {\Bbb E} |X| of X is finite. Then \overline{X}_n converges in probability to {\Bbb E} X, thus \lim_{n \to \infty} {\Bbb P}( |\overline{X}_n - {\Bbb E} X| \geq \varepsilon ) = 0 for every \varepsilon > 0.

Strong law of large numbers. Suppose that the first moment {\Bbb E} |X| of X is finite. Then \overline{X}_n converges almost surely to {\Bbb E} X, thus {\Bbb P}( \lim_{n \to \infty} \overline{X}_n = {\Bbb E} X ) = 1.

[The concepts of convergence in probability and almost sure convergence in probability theory are specialisations of the concepts of convergence in measure and pointwise convergence almost everywhere in measure theory.]

(If one strengthens the first moment assumption to that of finiteness of the second moment {\Bbb E}|X|^2, then we of course have a more precise statement than the (weak) law of large numbers, namely the central limit theorem, but I will not discuss that theorem here.  With even more hypotheses on X, one similarly has more precise versions of the strong law of large numbers, such as the Chernoff inequality, which I will again not discuss here.)

The weak law is easy to prove, but the strong law (which of course implies the weak law, by Egoroff’s theorem) is more subtle, and in fact the proof of this law (assuming just finiteness of the first moment) usually only appears in advanced graduate texts. So I thought I would present a proof here of both laws, which proceeds by the standard techniques of the moment method and truncation. The emphasis in this exposition will be on motivation and methods rather than brevity and strength of results; there do exist proofs of the strong law in the literature that have been compressed down to the size of one page or less, but this is not my goal here.

— The moment method —

The moment method seeks to control the tail probabilities of a random variable (i.e. the probability that it fluctuates far from its mean) by means of moments, and in particular the zeroth, first or second moment. The reason that this method is so effective is because the first few moments can often be computed rather precisely. The first moment method usually employs Markov’s inequality

\displaystyle {\Bbb P}( |X| \geq \lambda ) \leq \frac{1}{\lambda} {\Bbb E} |X| (1)

(which follows by taking expectations of the pointwise inequality \lambda I(|X| \geq \lambda) \leq |X|), whereas the second moment method employs some version of Chebyshev’s inequality, such as

\displaystyle {\Bbb P}( |X| \geq \lambda ) \leq \frac{1}{\lambda^2} {\Bbb E} |X|^2 (2)

(note that (2) is just (1) applied to the random variable |X|^2 and to the threshold \lambda^2).

Generally speaking, to compute the first moment one usually employs linearity of expectation

\displaystyle {\Bbb E} X_1 + \ldots + X_n = {\Bbb E} X_1 + \ldots + {\Bbb E} X_n,

whereas to compute the second moment one also needs to understand covariances (which are particularly simple if one assumes pairwise independence), thanks to identities such as

\displaystyle {\Bbb E} (X_1 + \ldots + X_n)^2 = {\Bbb E} X_1^2 + \ldots + {\Bbb E} X_n^2 + 2 \sum_{1 \leq i < j \leq n} X_i X_j

or the normalised variant

\displaystyle {\bf Var}(X_1+\ldots+X_n) = {\bf Var}(X_1) + \ldots + {\bf Var}(X_n)

\displaystyle + 2 \sum_{1 \leq i < j \leq n} {\bf Cov}(X_i,X_j). (3)

Higher moments can in principle give more precise information, but often require stronger assumptions on the objects being studied, such as joint independence.

Here is a basic application of the first moment method:

Borel-Cantelli lemma. Let E_1, E_2, E_3, \ldots be a sequence of events such that \sum_{n=1}^\infty {\Bbb P}(E_n) is finite. Then almost surely, only finitely many of the events E_n are true.

Proof. Let I(E_n) denote the indicator function of the event E_n. Our task is to show that \sum_{n=1}^\infty I(E_n) is almost surely finite. But by linearity of expectation, the expectation of this random variable is \sum_{n=1}^\infty {\Bbb P}(E_n), which is finite by hypothesis. By Markov’s inequality (1) we conclude that

\displaystyle {\Bbb P}( \sum_{n=1}^\infty I(E_n) \geq \lambda ) \leq \frac{1}{\lambda} \sum_{n=1}^\infty {\Bbb P}(E_n).

Letting \lambda \to \infty we obtain the claim. \Box

Returning to the law of large numbers, the first moment method gives the following tail bound:

Lemma 1. (First moment tail bound) If {\Bbb E}|X| is finite, then

\displaystyle {\Bbb P}( |\overline{X}_n| \geq \lambda ) \leq \frac{{\Bbb E}|X|}{\lambda}.

Proof. By the triangle inequality, |\overline{X}_n| \leq \overline{|X|}_n. By linearity of expectation, the expectation of \overline{|X|}_n is {\Bbb E}|X|. The claim now follows from Markov’s inequality. \Box

Lemma 1 is not strong enough by itself to prove the law of large numbers in either weak or strong form – in particular, it does not show any improvement as n gets large – but it will be useful to handle one of the error terms in those proofs.

We can get stronger bounds than Lemma 1 – in particular, bounds which improve with n – at the expense of stronger assumptions on X.

Lemma 2. (Second moment tail bound) If {\Bbb E}|X|^2 is finite, then

\displaystyle {\Bbb P}( |\overline{X}_n - {\Bbb E}(X)| \geq \lambda ) \leq \frac{ {\Bbb E}|X - {\Bbb E}(X)|^2 }{n \lambda^2}.

Proof. A standard computation, exploiting (3) and the pairwise independence of the X_i, shows that the variance {\Bbb E} |\overline{X}_n - {\Bbb E}(X)|^2 of the empirical averages \overline{X}_n is equal to \frac{1}{n} times the variance {\Bbb E} |X - {\Bbb E}(X)|^2 of the original variable X. The claim now follows from Chebyshev’s inequality (2). \Box

In the opposite direction, there is the zeroth moment method, more commonly known as the union bound

\displaystyle {\Bbb P}( E_1 \vee \ldots \vee E_n ) \leq \sum_{j=1}^n {\Bbb P}(E_j)

or equivalently (to explain the terminology “zeroth moment”)

\displaystyle {\Bbb E} (X_1 + \ldots + X_n)^0 \leq {\Bbb E} X_1^0 + \ldots + X_n^0

for any non-negative random variables X_1,\ldots,X_n \geq 0. Applying this to the empirical means, we obtain the zeroth moment tail estimate

{\Bbb P} (\overline{X}_n \neq 0) \leq n {\Bbb P}(X \neq 0). (4)

Just as the second moment bound (Lemma 2) is only useful when one has good control on the second moment (or variance) of X, the zeroth moment tail estimate (3) is only useful when we have good control on the zeroth moment {\Bbb E} |X|^0 = {\Bbb P}(X \neq 0), i.e. when X is mostly zero.

— Truncation —

The second moment tail bound (Lemma 2) already gives the weak law of large numbers in the case when X has finite second moment (or equivalently, finite variance). In general, if all one knows about X is that it has finite first moment, then we cannot conclude that X has finite second moment. However, we can perform a truncation

\displaystyle X = X_{\leq N} + X_{>N} (5)

of X at any desired threshold N, where X_{\leq N} := X I(|X| \leq N) and X_{>N} := X I(|X| > N). The first term X_{\leq N} has finite second moment; indeed we clearly have

\displaystyle {\Bbb E} |X_{\leq N}|^2 \leq N {\Bbb E} |X|

and hence also we have finite variance

\displaystyle {\Bbb E} |X_{\leq N} - {\Bbb E} X_{\leq N}|^2 \leq N {\Bbb E} |X|. (6)

The second term X_{>N} may have infinite second moment, but its first moment is well controlled. Indeed, by the monotone convergence theorem, we have

\displaystyle {\Bbb E} |X_{>N}| \to 0 \hbox{ as } N \to \infty. (7)

By the triangle inequality, we conclude that the first term X_{\leq N} has expectation close to {\Bbb E} X:

\displaystyle {\Bbb E} X_{\leq N} \to {\Bbb E}(X) \hbox{ as } N \to \infty. (8)

These are all the tools we need to prove the weak law of large numbers:

Proof of weak law. Let \varepsilon > 0. It suffices to show that whenever n is sufficiently large depending on \varepsilon, that \overline{X}_n = {\Bbb E} X + O(\varepsilon) with probability 1-O(\varepsilon).

From (7), (8), we can find a threshold N (depending on \varepsilon) such that {\Bbb E} |X_{\geq N}| = O(\varepsilon^2) and {\Bbb E} X_{<N} = {\Bbb E} X + O(\varepsilon). Now we use (5) to split

\displaystyle \overline{X}_n = (\overline{X_{\geq N}})_n +(\overline{X_{< N}})_n.

From the first moment tail bound (Lemma 1), we know that (\overline{X_{\geq N}})_n = O(\varepsilon) with probability 1 - O(\varepsilon). From the second moment tail bound (Lemma 2) and (6), we know that (\overline{X_{< N}})_n = {\Bbb E} X_{<N} + O(\varepsilon) = {\Bbb E} X + O(\varepsilon) with probability 1-O(\varepsilon) if n is sufficiently large depending on N and \varepsilon. The claim follows. \Box

— The strong law —

The strong law can be proven by pushing the above methods a bit further, and using a few more tricks.

The first trick is to observe that to prove the strong law, it suffices to do so for non-negative random variables X \geq 0. Indeed, this follows immediately from the simple fact that any random variable X with finite first moment can be expressed as the difference of two non-negative random variables \max(X,0), \max(-X,0) of finite first moment.

Once X is non-negative, we see that the empirical averages \overline{X}_n cannot decrease too quickly in n. In particular we observe that

\displaystyle \overline{X}_m \leq (1+O(\varepsilon)) \overline{X}_n whenever (1-\varepsilon) n \leq m \leq n. (9)

Because of this quasimonotonicity, we can sparsify the set of n for which we need to prove the strong law. More precisely, it suffices to show

Strong law of large numbers, reduced version. Let X be a non-negative random variable with {\Bbb E} X < \infty, and let 1 \leq n_1\leq n_2\leq n_3\leq\ldots be a sequence of integers which is lacunary in the sense that n_{j+1}/n_j > c for some c>1 and all sufficiently large j. Then \overline{X}_{n_j} converges almost surely to {\Bbb E} X.

Indeed, if we could prove the reduced version, then on applying that version to the lacunary sequence n_j := \lfloor (1 + \varepsilon)^j\rfloor and using (9) we would see that almost surely the empirical means \overline{X}_n cannot deviate by more than a multiplicative error of 1+O(\varepsilon) from the mean {\Bbb E} X. Setting \varepsilon := 1/m for m=1,2,3,\ldots (and using the fact that a countable intersection of almost sure events remains almost sure) we obtain the full strong law.

[This sparsification trick is philosophically related to the dyadic pigeonhole principle philosophy; see an old short story of myself on this latter topic. One could easily sparsify further, so that the lacunarity constant c is large instead of small, but this turns out not to help us too much in what follows.]

Now that we have sparsified the sequence, it becomes economical to apply the Borel-Cantelli lemma. Indeed, by many applications of that lemma we see that it suffices to show that

\displaystyle \sum_{j=1}^\infty {\Bbb P}( \overline{X}_{n_j} \neq {\Bbb E}(X) + O( \varepsilon ) ) < \infty (10)

for non-negative X of finite first moment, any lacunary sequence 1 \leq n_1 \leq n_2 \leq \ldots and any \varepsilon > 0. [This is a slight abuse of the O() notation, but it should be clear what is meant by this.]

[If we did not first sparsify the sequence, the Borel-Cantelli lemma would have been too expensive to apply; see Remark 2 below. Generally speaking, Borel-Cantelli is only worth applying when one expects the events E_n to be fairly “disjoint” or “independent” of each other; in the non-lacunary case, the events E_n change very slowly in n, which makes the lemma very inefficient. We will not see how lacunarity is exploited until the punchline at the very end of the proof, but certainly there is no harm in taking advantage of this “free” reduction to the lacunary case now, even if it is not immediately clear how it will be exploited.]

At this point we go back and apply the methods that already worked to give the weak law. Namely, to estimate each of the tail probabilities {\Bbb P}( \overline{X}_{n_j} \neq {\Bbb E}(X) + O(\varepsilon) ), we perform a truncation (5) at some threshold N_j. It is not immediately obvious what truncation to perform, so we adopt the usual strategy of leaving N_j unspecified for now and optimising in this parameter later.

We should at least pick N_j large enough so that {\Bbb E} X_{< N_j} = {\Bbb E} X + O(\varepsilon). From the second moment tail estimate (Lemma 2) we conclude that (\overline{X_{< N_j}})_{n_j} is also equal to {\Bbb E} X + O( \varepsilon ) with probability 1-O( \frac{1}{\varepsilon n_j} {\Bbb E} |X_{\leq N_j}|^2 ). One could attempt to simplify this expression using (6), but this turns out to be a little wasteful, so let us hold off on that for now. However, (6) does strongly suggest that we want to take N_j to be something like n_j, which is worth keeping in mind in what follows.

Now we look at the contribution of X_{\geq N_j}. One could use the first moment tail estimate (Lemma 1), but it turns out that the first moment {\Bbb E} X_{> N_j} decays too slowly in j to be of much use (recall that we are expecting N_j to be like the lacunary sequence n_j); the root problem here is that the decay (7) coming from the monotone convergence theorem is ineffective (one could effectivise this using the finite convergence principle, but this turns out to give very poor results here).

But there is one last card to play, which is the zeroth moment method tail estimate (4). As mentioned earlier, this bound is lousy in general – but is very good when X is mostly zero, which is precisely the situation with X_{>N_j}. and in particular we see that (\overline{X_{>N_j}})_{n_j} is zero with probability 1 - O( n_j {\Bbb P}(X > N_j) ).

Putting this all together, we see that

\displaystyle {\Bbb P}( \overline{X}_{n_j} \neq {\Bbb E}(X) + O( \varepsilon ) ) \leq O( \frac{1}{\varepsilon n_j} {\Bbb E} |X_{\leq N_j}|^2 ) + O( n_j {\Bbb P}(X > N_j) ).

Summing this in j, we see that we will be done as soon as we figure out how to choose N_j so that

\displaystyle \sum_{j=1}^\infty \frac{1}{n_j} {\Bbb E} |X_{\leq N_j}|^2 (11)

and

\displaystyle \sum_{j=1}^\infty n_j {\Bbb P}(X > N_j) (12)

are both finite. (As usual, we have a tradeoff: making the N_j larger makes (12) easier to establish at the expense of (11), and vice versa when making N_j smaller.)

Based on the discussion earlier, it is natural to try setting N_j := n_j. Happily, this choice works cleanly; the lacunary nature of n_j ensures (basically from the geometric series formula) that we have the pointwise estimates

\displaystyle \sum_{j=1}^\infty \frac{1}{n_j} |X_{\leq n_j}|^2 = O( X )

and

\displaystyle \sum_{j=1}^\infty n_j I( X \geq n_j ) = O( X )

(where the implied constant here depends on the sequence n_1, n_2, \ldots, and in particular on the lacunarity constant c). The claims (10), (11) then follow from one last application of linearity of expectation, giving the strong law of large numbers.

Remark 1. The above proof in fact shows that the strong law of large numbers holds even if one only assumes pairwise independence of the X_n, rather than joint independence. \diamond

Remark 2. It is essential that the random variables X_1,X_2,\ldots are “recycled” from one empirical average \overline{X}_n to the next, in order to get the crucial quasimonotonicity property (9). If instead we took completely independent averages \overline{X}_n = \frac{1}{n} (X_{n,1} + \ldots + X_{n,n} ), where the X_{i,j} are all iid, then the strong law of large numbers in fact breaks down with just a first moment assumption. (For a counterexample, consider a random variable X which equals 2^m / m^2 with probability 2^{-m} for m=1,2,3,\ldots; this random variable (barely) has finite first moment, but for n \sim 2^m/m^2, we see that \overline{X}_n deviates by at least absolute constant from its mean with probability \gg 1/m^2. As the empirical means \overline{X}_n for n \sim 2^m/m^2 are now jointly independent, the probability that one of them deviates significantly is now extremely close to 1 (super-exponentially close in m, in fact), leading to the total failure of the strong law in this setting.) Of course, if one restricts attention to a lacunary sequence of n then the above proof goes through in the independent case (since the Borel-Cantelli lemma is insensitive to this independence). By exploiting the joint independence further (e.g. by using Chernoff’s inequality) one can also get the strong law for independent empirical means for the full sequence n under second moment bounds. \diamond

Remark 3. From the perspective of interpolation theory, one can view the above argument as an interpolation argument, establishing an L^1 estimate (10) by interpolating between an L^2 estimate (Lemma 2) and the L^0 estimate (4). \diamond

Remark 4. By viewing the sequence X_1,X_2,\ldots as a stationary process, and thus as a special case of a measure-preserving system one can view the weak and strong law of large numbers as special cases of the mean and pointwise ergodic theorems respectively (see Exercise 9 from 254A Lecture 8 and Theorem 2 from 254A Lecture 9).  \diamond

[Update, Jul 19: some corrections.]

[Update, Jul 20: Connections with ergodic theory discussed.]