Let X be a real-valued random variable, and let be an infinite sequence of independent and identically distributed copies of X. Let be the empirical averages of this sequence. A fundamental theorem in probability theory is the law of large numbers, which comes in both a weak and a strong form:
Strong law of large numbers. Suppose that the first moment of X is finite. Then converges almost surely to , thus .
[The concepts of convergence in probability and almost sure convergence in probability theory are specialisations of the concepts of convergence in measure and pointwise convergence almost everywhere in measure theory.]
(If one strengthens the first moment assumption to that of finiteness of the second moment , then we of course have a more precise statement than the (weak) law of large numbers, namely the central limit theorem, but I will not discuss that theorem here. With even more hypotheses on X, one similarly has more precise versions of the strong law of large numbers, such as the Chernoff inequality, which I will again not discuss here.)
The weak law is easy to prove, but the strong law (which of course implies the weak law, by Egoroff’s theorem) is more subtle, and in fact the proof of this law (assuming just finiteness of the first moment) usually only appears in advanced graduate texts. So I thought I would present a proof here of both laws, which proceeds by the standard techniques of the moment method and truncation. The emphasis in this exposition will be on motivation and methods rather than brevity and strength of results; there do exist proofs of the strong law in the literature that have been compressed down to the size of one page or less, but this is not my goal here.
— The moment method —
The moment method seeks to control the tail probabilities of a random variable (i.e. the probability that it fluctuates far from its mean) by means of moments, and in particular the zeroth, first or second moment. The reason that this method is so effective is because the first few moments can often be computed rather precisely. The first moment method usually employs Markov’s inequality
(which follows by taking expectations of the pointwise inequality ), whereas the second moment method employs some version of Chebyshev’s inequality, such as
(note that (2) is just (1) applied to the random variable and to the threshold ).
Generally speaking, to compute the first moment one usually employs linearity of expectation
or the normalised variant
Higher moments can in principle give more precise information, but often require stronger assumptions on the objects being studied, such as joint independence.
Here is a basic application of the first moment method:
Borel-Cantelli lemma. Let be a sequence of events such that is finite. Then almost surely, only finitely many of the events are true.
Proof. Let denote the indicator function of the event . Our task is to show that is almost surely finite. But by linearity of expectation, the expectation of this random variable is , which is finite by hypothesis. By Markov’s inequality (1) we conclude that
Letting we obtain the claim.
Returning to the law of large numbers, the first moment method gives the following tail bound:
Lemma 1. (First moment tail bound) If is finite, then
Proof. By the triangle inequality, . By linearity of expectation, the expectation of is . The claim now follows from Markov’s inequality.
Lemma 1 is not strong enough by itself to prove the law of large numbers in either weak or strong form – in particular, it does not show any improvement as n gets large – but it will be useful to handle one of the error terms in those proofs.
We can get stronger bounds than Lemma 1 – in particular, bounds which improve with n – at the expense of stronger assumptions on X.
Lemma 2. (Second moment tail bound) If is finite, then
Proof. A standard computation, exploiting (3) and the pairwise independence of the , shows that the variance of the empirical averages is equal to times the variance of the original variable X. The claim now follows from Chebyshev’s inequality (2).
In the opposite direction, there is the zeroth moment method, more commonly known as the union bound
or equivalently (to explain the terminology “zeroth moment”)
for any non-negative random variables . Applying this to the empirical means, we obtain the zeroth moment tail estimate
Just as the second moment bound (Lemma 2) is only useful when one has good control on the second moment (or variance) of X, the zeroth moment tail estimate (3) is only useful when we have good control on the zeroth moment , i.e. when X is mostly zero.
— Truncation —
The second moment tail bound (Lemma 2) already gives the weak law of large numbers in the case when X has finite second moment (or equivalently, finite variance). In general, if all one knows about X is that it has finite first moment, then we cannot conclude that X has finite second moment. However, we can perform a truncation
of X at any desired threshold N, where and . The first term has finite second moment; indeed we clearly have
and hence also we have finite variance
The second term may have infinite second moment, but its first moment is well controlled. Indeed, by the monotone convergence theorem, we have
By the triangle inequality, we conclude that the first term has expectation close to :
These are all the tools we need to prove the weak law of large numbers:
Proof of weak law. Let . It suffices to show that whenever n is sufficiently large depending on , that with probability .
From (7), (8), we can find a threshold N (depending on ) such that and . Now we use (5) to split
From the first moment tail bound (Lemma 1), we know that with probability . From the second moment tail bound (Lemma 2) and (6), we know that with probability if n is sufficiently large depending on N and . The claim follows.
— The strong law —
The strong law can be proven by pushing the above methods a bit further, and using a few more tricks.
The first trick is to observe that to prove the strong law, it suffices to do so for non-negative random variables . Indeed, this follows immediately from the simple fact that any random variable X with finite first moment can be expressed as the difference of two non-negative random variables of finite first moment.
Once X is non-negative, we see that the empirical averages cannot decrease too quickly in n. In particular we observe that
whenever . (9)
Because of this quasimonotonicity, we can sparsify the set of n for which we need to prove the strong law. More precisely, it suffices to show
Strong law of large numbers, reduced version. Let be a non-negative random variable with , and let be a sequence of integers which is lacunary in the sense that for some and all sufficiently large j. Then converges almost surely to .
Indeed, if we could prove the reduced version, then on applying that version to the lacunary sequence and using (9) we would see that almost surely the empirical means cannot deviate by more than a multiplicative error of from the mean . Setting for (and using the fact that a countable intersection of almost sure events remains almost sure) we obtain the full strong law.
[This sparsification trick is philosophically related to the dyadic pigeonhole principle philosophy; see an old short story of myself on this latter topic. One could easily sparsify further, so that the lacunarity constant c is large instead of small, but this turns out not to help us too much in what follows.]
Now that we have sparsified the sequence, it becomes economical to apply the Borel-Cantelli lemma. Indeed, by many applications of that lemma we see that it suffices to show that
for non-negative X of finite first moment, any lacunary sequence and any . [This is a slight abuse of the O() notation, but it should be clear what is meant by this.]
[If we did not first sparsify the sequence, the Borel-Cantelli lemma would have been too expensive to apply; see Remark 2 below. Generally speaking, Borel-Cantelli is only worth applying when one expects the events to be fairly “disjoint” or “independent” of each other; in the non-lacunary case, the events change very slowly in n, which makes the lemma very inefficient. We will not see how lacunarity is exploited until the punchline at the very end of the proof, but certainly there is no harm in taking advantage of this “free” reduction to the lacunary case now, even if it is not immediately clear how it will be exploited.]
At this point we go back and apply the methods that already worked to give the weak law. Namely, to estimate each of the tail probabilities , we perform a truncation (5) at some threshold . It is not immediately obvious what truncation to perform, so we adopt the usual strategy of leaving unspecified for now and optimising in this parameter later.
We should at least pick large enough so that . From the second moment tail estimate (Lemma 2) we conclude that is also equal to with probability . One could attempt to simplify this expression using (6), but this turns out to be a little wasteful, so let us hold off on that for now. However, (6) does strongly suggest that we want to take to be something like , which is worth keeping in mind in what follows.
Now we look at the contribution of . One could use the first moment tail estimate (Lemma 1), but it turns out that the first moment decays too slowly in j to be of much use (recall that we are expecting to be like the lacunary sequence ); the root problem here is that the decay (7) coming from the monotone convergence theorem is ineffective (one could effectivise this using the finite convergence principle, but this turns out to give very poor results here).
But there is one last card to play, which is the zeroth moment method tail estimate (4). As mentioned earlier, this bound is lousy in general – but is very good when X is mostly zero, which is precisely the situation with . and in particular we see that is zero with probability .
Putting this all together, we see that
Summing this in j, we see that we will be done as soon as we figure out how to choose so that
are both finite. (As usual, we have a tradeoff: making the larger makes (12) easier to establish at the expense of (11), and vice versa when making smaller.)
Based on the discussion earlier, it is natural to try setting . Happily, this choice works cleanly; the lacunary nature of ensures (basically from the geometric series formula) that we have the pointwise estimates
(where the implied constant here depends on the sequence , and in particular on the lacunarity constant c). The claims (10), (11) then follow from one last application of linearity of expectation, giving the strong law of large numbers.
Remark 1. The above proof in fact shows that the strong law of large numbers holds even if one only assumes pairwise independence of the , rather than joint independence.
Remark 2. It is essential that the random variables are “recycled” from one empirical average to the next, in order to get the crucial quasimonotonicity property (9). If instead we took completely independent averages , where the are all iid, then the strong law of large numbers in fact breaks down with just a first moment assumption. (For a counterexample, consider a random variable X which equals with probability for ; this random variable (barely) has finite first moment, but for , we see that deviates by at least absolute constant from its mean with probability . As the empirical means for are now jointly independent, the probability that one of them deviates significantly is now extremely close to 1 (super-exponentially close in , in fact), leading to the total failure of the strong law in this setting.) Of course, if one restricts attention to a lacunary sequence of n then the above proof goes through in the independent case (since the Borel-Cantelli lemma is insensitive to this independence). By exploiting the joint independence further (e.g. by using Chernoff’s inequality) one can also get the strong law for independent empirical means for the full sequence n under second moment bounds.
Remark 3. From the perspective of interpolation theory, one can view the above argument as an interpolation argument, establishing an estimate (10) by interpolating between an estimate (Lemma 2) and the estimate (4).
Remark 4. By viewing the sequence as a stationary process, and thus as a special case of a measure-preserving system one can view the weak and strong law of large numbers as special cases of the mean and pointwise ergodic theorems respectively (see Exercise 9 from 254A Lecture 8 and Theorem 2 from 254A Lecture 9).
[Update, Jul 19: some corrections.]
[Update, Jul 20: Connections with ergodic theory discussed.]