Let X be a real-valued random variable, and let be an infinite sequence of independent and identically distributed copies of X. Let be the empirical averages of this sequence. A fundamental theorem in probability theory is the law of large numbers, which comes in both a weak and a strong form:

Weak law of large numbers.Suppose that the first moment of X is finite. Then converges in probability to , thus for every .

Strong law of large numbers.Suppose that the first moment of X is finite. Then converges almost surely to , thus .

[The concepts of convergence in probability and almost sure convergence in probability theory are specialisations of the concepts of convergence in measure and pointwise convergence almost everywhere in measure theory.]

(If one strengthens the first moment assumption to that of finiteness of the second moment , then we of course have a more precise statement than the (weak) law of large numbers, namely the central limit theorem, but I will not discuss that theorem here. With even more hypotheses on X, one similarly has more precise versions of the strong law of large numbers, such as the Chernoff inequality, which I will again not discuss here.)

The weak law is easy to prove, but the strong law (which of course implies the weak law, by Egoroff’s theorem) is more subtle, and in fact the proof of this law (assuming just finiteness of the first moment) usually only appears in advanced graduate texts. So I thought I would present a proof here of both laws, which proceeds by the standard techniques of the moment method and truncation. The emphasis in this exposition will be on motivation and methods rather than brevity and strength of results; there do exist proofs of the strong law in the literature that have been compressed down to the size of one page or less, but this is not my goal here.

— The moment method —

The moment method seeks to control the tail probabilities of a random variable (i.e. the probability that it fluctuates far from its mean) by means of moments, and in particular the zeroth, first or second moment. The reason that this method is so effective is because the first few moments can often be computed rather precisely. The first moment method usually employs Markov’s inequality

(1)

(which follows by taking expectations of the pointwise inequality ), whereas the second moment method employs some version of Chebyshev’s inequality, such as

(2)

(note that (2) is just (1) applied to the random variable and to the threshold ).

Generally speaking, to compute the first moment one usually employs linearity of expectation

,

whereas to compute the second moment one also needs to understand covariances (which are particularly simple if one assumes pairwise independence), thanks to identities such as

or the normalised variant

. (3)

Higher moments can in principle give more precise information, but often require stronger assumptions on the objects being studied, such as joint independence.

Here is a basic application of the first moment method:

Borel-Cantelli lemma.Let be a sequence of events such that is finite. Then almost surely, only finitely many of the events are true.

**Proof.** Let denote the indicator function of the event . Our task is to show that is almost surely finite. But by linearity of expectation, the expectation of this random variable is , which is finite by hypothesis. By Markov’s inequality (1) we conclude that

.

Letting we obtain the claim.

Returning to the law of large numbers, the first moment method gives the following tail bound:

Lemma 1.(First moment tail bound) If is finite, then.

**Proof.** By the triangle inequality, . By linearity of expectation, the expectation of is . The claim now follows from Markov’s inequality.

Lemma 1 is not strong enough by itself to prove the law of large numbers in either weak or strong form – in particular, it does not show any improvement as n gets large – but it will be useful to handle one of the error terms in those proofs.

We can get stronger bounds than Lemma 1 – in particular, bounds which improve with n – at the expense of stronger assumptions on X.

Lemma 2.(Second moment tail bound) If is finite, then.

**Proof.** A standard computation, exploiting (3) and the pairwise independence of the , shows that the variance of the empirical averages is equal to times the variance of the original variable X. The claim now follows from Chebyshev’s inequality (2).

In the opposite direction, there is the *zeroth moment method*, more commonly known as the union bound

or equivalently (to explain the terminology “zeroth moment”)

for any non-negative random variables . Applying this to the empirical means, we obtain the *zeroth moment tail estimate*

. (4)

Just as the second moment bound (Lemma 2) is only useful when one has good control on the second moment (or variance) of X, the zeroth moment tail estimate (3) is only useful when we have good control on the zeroth moment , i.e. when X is mostly zero.

— Truncation —

The second moment tail bound (Lemma 2) already gives the weak law of large numbers in the case when X has finite second moment (or equivalently, finite variance). In general, if all one knows about X is that it has finite first moment, then we cannot conclude that X has finite second moment. However, we can perform a truncation

(5)

of X at any desired threshold N, where and . The first term has finite second moment; indeed we clearly have

and hence also we have finite variance

. (6)

The second term may have infinite second moment, but its first moment is well controlled. Indeed, by the monotone convergence theorem, we have

. (7)

By the triangle inequality, we conclude that the first term has expectation close to :

. (8)

These are all the tools we need to prove the weak law of large numbers:

**Proof of weak law. ** Let . It suffices to show that whenever n is sufficiently large depending on , that with probability .

From (7), (8), we can find a threshold N (depending on ) such that and . Now we use (5) to split

.

From the first moment tail bound (Lemma 1), we know that with probability . From the second moment tail bound (Lemma 2) and (6), we know that with probability if n is sufficiently large depending on N and . The claim follows.

— The strong law —

The strong law can be proven by pushing the above methods a bit further, and using a few more tricks.

The first trick is to observe that to prove the strong law, it suffices to do so for non-negative random variables . Indeed, this follows immediately from the simple fact that any random variable X with finite first moment can be expressed as the difference of two non-negative random variables of finite first moment.

Once X is non-negative, we see that the empirical averages cannot decrease too quickly in n. In particular we observe that

whenever . (9)

Because of this quasimonotonicity, we can sparsify the set of n for which we need to prove the strong law. More precisely, it suffices to show

Strong law of large numbers, reduced version.Let be a non-negative random variable with , and let be a sequence of integers which islacunaryin the sense that for some and all sufficiently large j. Then converges almost surely to .

Indeed, if we could prove the reduced version, then on applying that version to the lacunary sequence and using (9) we would see that almost surely the empirical means cannot deviate by more than a multiplicative error of from the mean . Setting for (and using the fact that a countable intersection of almost sure events remains almost sure) we obtain the full strong law.

[This sparsification trick is philosophically related to the dyadic pigeonhole principle philosophy; see an old short story of myself on this latter topic. One could easily sparsify further, so that the lacunarity constant c is large instead of small, but this turns out not to help us too much in what follows.]

Now that we have sparsified the sequence, it becomes economical to apply the Borel-Cantelli lemma. Indeed, by many applications of that lemma we see that it suffices to show that

(10)

for non-negative X of finite first moment, any lacunary sequence and any . [This is a slight abuse of the O() notation, but it should be clear what is meant by this.]

[If we did not first sparsify the sequence, the Borel-Cantelli lemma would have been too expensive to apply; see Remark 2 below. Generally speaking, Borel-Cantelli is only worth applying when one expects the events to be fairly “disjoint” or “independent” of each other; in the non-lacunary case, the events change very slowly in n, which makes the lemma very inefficient. We will not see how lacunarity is exploited until the punchline at the very end of the proof, but certainly there is no harm in taking advantage of this “free” reduction to the lacunary case now, even if it is not immediately clear how it will be exploited.]

At this point we go back and apply the methods that already worked to give the weak law. Namely, to estimate each of the tail probabilities , we perform a truncation (5) at some threshold . It is not immediately obvious what truncation to perform, so we adopt the usual strategy of leaving unspecified for now and optimising in this parameter later.

We should at least pick large enough so that . From the second moment tail estimate (Lemma 2) we conclude that is also equal to with probability . One could attempt to simplify this expression using (6), but this turns out to be a little wasteful, so let us hold off on that for now. However, (6) does strongly suggest that we want to take to be something like , which is worth keeping in mind in what follows.

Now we look at the contribution of . One could use the first moment tail estimate (Lemma 1), but it turns out that the first moment decays too slowly in j to be of much use (recall that we are expecting to be like the lacunary sequence ); the root problem here is that the decay (7) coming from the monotone convergence theorem is *ineffective *(one could effectivise this using the finite convergence principle, but this turns out to give very poor results here).

But there is one last card to play, which is the zeroth moment method tail estimate (4). As mentioned earlier, this bound is lousy in general – but is very good when X is mostly zero, which is precisely the situation with . and in particular we see that is zero with probability .

Putting this all together, we see that

Summing this in j, we see that we will be done as soon as we figure out how to choose so that

(11)

and

(12)

are both finite. (As usual, we have a tradeoff: making the larger makes (12) easier to establish at the expense of (11), and vice versa when making smaller.)

Based on the discussion earlier, it is natural to try setting . Happily, this choice works cleanly; the lacunary nature of ensures (basically from the geometric series formula) that we have the pointwise estimates

and

(where the implied constant here depends on the sequence , and in particular on the lacunarity constant c). The claims (10), (11) then follow from one last application of linearity of expectation, giving the strong law of large numbers.

**Remark 1.** The above proof in fact shows that the strong law of large numbers holds even if one only assumes pairwise independence of the , rather than joint independence.

**Remark 2.** It is essential that the random variables are “recycled” from one empirical average to the next, in order to get the crucial quasimonotonicity property (9). If instead we took completely independent averages , where the are all iid, then the strong law of large numbers in fact breaks down with just a first moment assumption. (For a counterexample, consider a random variable X which equals with probability for ; this random variable (barely) has finite first moment, but for , we see that deviates by at least absolute constant from its mean with probability . As the empirical means for are now jointly independent, the probability that one of them deviates significantly is now extremely close to 1 (super-exponentially close in , in fact), leading to the total failure of the strong law in this setting.) Of course, if one restricts attention to a lacunary sequence of n then the above proof goes through in the independent case (since the Borel-Cantelli lemma is insensitive to this independence). By exploiting the joint independence further (e.g. by using Chernoff’s inequality) one can also get the strong law for independent empirical means for the full sequence n under second moment bounds.

**Remark 3.** From the perspective of interpolation theory, one can view the above argument as an interpolation argument, establishing an estimate (10) by interpolating between an estimate (Lemma 2) and the estimate (4).

**Remark 4.** By viewing the sequence as a stationary process, and thus as a special case of a measure-preserving system one can view the weak and strong law of large numbers as special cases of the mean and pointwise ergodic theorems respectively (see Exercise 9 from 254A Lecture 8 and Theorem 2 from 254A Lecture 9).

[*Update*, Jul 19: some corrections.]

[*Update*, Jul 20: Connections with ergodic theory discussed.]

## 88 comments

Comments feed for this article

19 June, 2008 at 7:35 am

Markov’s Inequality « Justin Domke’s Weblog[…] June 19, 2008 While looking at Tao’s post on the law of large numbers, I claimed to Alap that Markov’s inequality was […]

19 June, 2008 at 3:25 pm

toomuchcoffeemanA very clear and educative post: it’s been a long time since I was made to learn about weak and strong LLN, but I don’t recall the proofs seeming so well-motivated at the time.

One question: you remark that

(If one strengthens the first moment assumption to that of finiteness of the second moment , then we of course have a more precise statement, namely the central limit theorem, but I will not discuss that theorem here.)I’m a bit confused: do you mean that the CLT can be viewed as a more precise version of the WLLN or the SLLLN?

19 June, 2008 at 3:45 pm

Terence TaoDear toomuchcoffeeman,

I’ve clarified the text to reflect the fact that the CLT is a sharper version of the weak LLN. (There is no almost sure (or “strong”) version of the CLT, as the limit in the CLT is random rather than deterministic. One could argue though that Berry-Esseen quantitative version of the CLT is somewhat analogous to the quantitative versions of the LLN (such as (9)) that underlie the strong LLN.)

19 June, 2008 at 6:55 pm

nobelHubelProf. Tao, shouldn’t the inequality in the indicator function below equation (1) be reversed?

19 June, 2008 at 7:01 pm

Terence TaoThanks for the correction!

19 June, 2008 at 10:29 pm

Joshua BatsonProfessor Tao,

Isn’t good control of the zeroth moment attained when P(X !=0) is low, that is, when X is mostly 0?

20 June, 2008 at 2:00 am

athreyaDear Professor Tao,

In 1995 Mike Keane presented a (much) easier proof of the SLLN. The idea is essentially enclosed in the article:

M. Keane, The essence of large numbers, Algorithms, Fractals, and Dynamics (Okayama/Kyoto, 1992), 125-129, Plenum, New York, 1995.

It is a hands on proof that works with running average omega by omega.

best wishes

Siva

20 June, 2008 at 7:52 am

Terence TaoDear Joshua: Thanks for the correction!

Dear Siva: Thanks for the article, which I found at

http://repos.project.cwi.nl:8888/cwi_repository/docs/I/02/2627A.pdf

It appears to reinterpret the strong law of large numbers as a special case of the pointwise ergodic theorem for measure-preserving systems (or for stationary processes, which are much the same thing). But it seems the author has restricted himself to bounded random variables (such as those only taking the values 0 and 1); most of the difficulty in the strong law, as in the proof above, comes from the fact that X can be unbounded (which is why we need to mess around with truncations, as well as the use of the first or zeroth moment method to deal with the large values of X).

20 June, 2008 at 11:47 am

PROFESSOR CONSTANTINO DE SOUSAVERY GOOD.

CUMPLIMENTS

PROFESSOR CONSTANTINO DE SOUSA

20 June, 2008 at 4:38 pm

AnonymousThere is some philosophy related to the laws of large numbers which I don’t really understand, and which I’d be happy to have explained to me. In theory, we are speaking about a sequence of random variables on the same common probability space. In practice, we think of instantiating a random variable in a sequence of trials. This is a subtle point: each trial is itself a random variable, distributed exactly as the original one, and leaving in an identical, but separate and independent, probability space. When we toss a die times, there are possible outcomes. This means, we are no longer in our original six-element space, but in its cartesian power, with points. That is, tossing a die times is described by the probability space which is the product of copies of the space, corresponding to tossing just one die. So, our actually leave in

differentspaces.How does this agree with the theoretical assumption that the ‘s are defined on the same space? To make this more specific: say, what is the “practical meaning” of pointwise convergence in the Strong law? When we think of trials, the variables live in distinct spaces; on what space they pointwise converge?

On this occasion: the proof of Borel-Cantelli does not need Markov and

indicators. What would it mean that an infinite number of are

true with positive probability? The existence of an event , with

, such that is contained in infinitely many of the events . But then the probability of each of these events is at least as large as that of ; hence the series, having infinitely many terms bounded away from 0, diverges.

20 June, 2008 at 5:29 pm

Terence TaoDear anonymous,

In elementary (finitary) probability theory, the sample space (or probability space) is often defined in textbooks as simply the set of all possible outcomes, as this is the simplest choice of space to work in for most finitary applications. When it comes to infinitary probability theory, though, it is better to take a more flexible and abstract viewpoint: the sample space is now allowed to be an abstract set, and each outcome corresponds to a separate event inside that set. For instance, if one is studying the flip of a single coin, the sample space could be a two-element set {Heads, Tails}, but it doesn’t

haveto have just two elements; it could be a much larger set, partitioned into two disjoint subsets, the “Heads” event and the “Tails” event. For instance, the sample space could be the unit interval [0,1] with Lebesgue measure, and the Heads and Tails events could be the intervals [0,1/2) and (1/2,1] respectively.For the purposes of probability theory, the exact size of the sample space does not matter; the only thing that matters is the algebra (or more precisely, -algebra) of events and the probability (or measure) which is assigned to each event; the actual points in the sample space are in fact largely irrelevant to probability theory. (See also the notion of equivalence of measure spaces, as defined for instance in Lecture 11 of my 254A course. Equivalent probability spaces may have very different cardinalities, but are indistinguishable from each other for the purposes of doing probability theory.)

To construct a suitable sample space to hold an infinite collection of random variables, such as an infinite sequence of independent die rolls, one can take an inverse limit of the sample spaces associated to finite sequences of die rolls. One could also resort to more

ad hocdevices, such as taking the sample space to be the unit interval with Lebesgue measure (which is the standard sample space for selecting a random variable x uniformly at random from the unit interval) and then defining the value of the i^th die roll to be the i^th digit of x base 6, plus one. In this space, the law of large numbers has this interpretation: when one selects a number at random from the unit interval, then almost surely, its base 6 digits are uniformly distributed amongst 0,1,2,3,4,5.Incidentally, your argument for the Borel-Cantelli lemma does not quite work; the assertion “with positive probability, an infinite number of the events are true” is

notthe same as “an infinite number of the events are simultaneously true with positive probability”, because at different points in the sample space, a different (but still infinite) collection of infinite events could be true. For instance, to go back to the infinite die roll example, let be the event that the n^th die roll is a 1. Then any infinite collection of events has an intersection that has zero probability; for instance, the probability that all the odd die rolls are all 1 is zero. Equivalently, as you say, no event A of positive probability is contained in infinitely many of the . Nevertheless, it is true almost surely that infinitely many of the are true; if one rolls an infinite number of dice, one will almost surely roll an infinite number of ones. But the exact set of rolls that yield these ones vary from outcome to outcome; indeed, there are an uncountable number of possible sets of rolls, which is why one can represent an event of probability 1 as the union of events of probability 0. (This is, incidentally, one way to prove that the reals are uncountable, though certainly not the most direct way, as it requires one to first construct Lebesgue measure.)20 June, 2008 at 6:40 pm

AnonymousGot both your points, many thanks!

21 June, 2008 at 7:28 pm

AnonymousIt’s me again – the very same Anonymous.

> Incidentally, your argument for the Borel-Cantelli lemma does not quite > work

How about this? Fix and find such

that . For or more of the

events to occur, at least one of must

occur, and the probability of this is less than . We are

done!

21 June, 2008 at 7:58 pm

Terence TaoYes, this works too; in the notation of the above post, this is a “zeroth moment method” argument rather than a “first moment method” argument. (The first moment method argument gives a more precise upper bound on the probability that at least k events are true for any given k, though.)

24 June, 2008 at 4:00 am

Giovanni PeccatiDear Terence,

thank you for this beautiful website!

Just a quick note: there exists indeed a series of asymptotic results for partial sums of iid random variables that are known in the literature as “Almost Sure Central Limit Theorems”. This kind of theorems involve some (weighted) “logarithmic” versions of the empirical measures associated with the underlying partial sums. They hold almost surely, and they have a Gaussian probability measure as a limit object. A paper containing several general results in this direction is the following

http://www.springerlink.com/content/v3u54005214j853n/

Best!

Giovanni Peccati

24 June, 2008 at 8:54 am

toomuchcoffeemanDear Giovanni,

As the person who (implicitly) asked if there were versions of the CLT akin to the strong law of large numbers: thank you for your post and the link (the results you describe, involving empirical measures, are what I was thinking of, although I didn’t know how to express it).

27 June, 2008 at 11:49 am

YasharDear Terry,

I also share Anonymous’s philosophical/practical concern about the usefulness of the SLLN. Let me clarify: although I understand the mathematical meaning of almost sure pointwise convergence for functions in some abstract measure space, I don’t think SLLN has much value for the ‘usual’ application of the LLN, namely in relating the ’empirical’ (i.e. trial-wise — or timewise in ergodic theory) averages to expectation values calculated in probability theory. This is so because in ‘reality’ we are only dealing with FINITELY many (say iid) samples from a probability distribution. In this setting, pointwise convergence is meaningless as you need the whole infinite series to be able to talk about convergence, but convergence in probability still makes sense (especially, if we interpret the probability of deviation in WLLN, in the Bayesian degree-of-belief sense as opposed to a frequentist sense).

However, I admit that the ‘usual’ applications may not comprise all useful applications. Your pretty example, where you map the infinite dice-rolling sequence to [0,1], is certainly one such application. However, any generalization of the intuition gained there to cases where the probability space (and thus the set of ‘digits’ of your, now, super-real numbers) are uncountable are beyond me. But I’m aware that these problems are pretty much the very reason why measure theory was invented and that e.g. Kolmogorov’s theorem resolves the issue of constructing the infinite cartesian product of uncountable spaces (am I right?).

Maybe all of these concerns are too ‘philosophical’, and somewhat smell of the ancient problems with infinity, uncountability, etc, which you might find boring. But i’ll be glad to read your views about them if you don’t. More importantly, I’ll be grateful if you could point out some (important) applications of SLLN, whether ‘usual’ in the above sense or otherwise, where WLLN is useless. I’d rather the applications be in some sense intrinsically probabilistic rather than purely ‘geometric’/functional-analytic though.

29 June, 2008 at 9:56 am

Terence TaoDear Yashar,

Of course, any mathematical statement involving infinity will not have any direct application to the finitary situations encountered in the physical world, but any “qualitative” or “infinitary” statement in mathematics often tends to have “quantitative” or “finitary” counterparts which do have such an application, or at least gives some useful intuition on how to view these finitary contexts. (See for instance my blog post on this issue.)

Returning specifically to the question of finitary interpretations of the SLLN, these basically have to do with the situation in which one is simultaneously considering multiple averages of a single series of empirical samples, as opposed to considering just a single such average (which is basically the situation covered by the WLLN). For instance, if one had some random intensity field of grayscale pixels, and wanted to compare the average intensities at 10 x 10 blocks, 100 x 100 blocks, and 1000 x 1000 blocks, then the SLLN suggests that these intensities would be likely to be

simultaneouslyclose to the average intensity. (The WLLN only suggests that each of these spatial averages areindividuallylikely to be close to the average intensity, but does not preclude the possibility that when one considers multiple such spatial averages at once, that a few outlying spatial averages will deviate from the average intensity. In my example with only three different averages, there isn’t much difference here, as the union bound only loses a factor of three at most for the failure probability, but the SLLN begins to show its strength over the WLLN when one is considering a very large number of averages at once.)2 August, 2008 at 9:01 am

Random matrices: Universality of ESDs and the circular law « What’s new[…] in probability rather than almost sure convergence; this is in complete analogy with the weak and strong law of large numbers, and in fact this law is used in the proof.) In a previous paper, we had established the same […]

10 October, 2008 at 3:38 pm

Small samples, and the margin of error « What’s new[…] greater generality than the setting discussed here. (It is closely related to the arguments in my previous post on the law of large numbers.) The main mathematical result we need is Theorem. Let X be a finite set, let A be a subset of X, […]

14 October, 2008 at 12:31 pm

Non-measurable sets via non-standard analysis « What’s new[…] idea was to let E be a “random” subset of . If one (non-rigorously) applies the law of large numbers, one expects E to have “density” 1/2 with respect to every subinterval of , which would […]

9 December, 2008 at 11:40 am

AnonymousProf. Tao,

If, for example, S=\sum_{i=1}^{N} X_i, but N is a random variable taking values for a fairly large integer to infinity. Since S can be rewritten as

S= N (\bar X_N), is it safe to say that S \approx N E(N).

27 December, 2008 at 8:52 am

Tricks Wiki: Use basic examples to calibrate exponents « What’s new[…] G on N vertices, in which each edge has an independent probability of of lying in G. By the law of large numbers, we expect the edge density of such a random graph to be close to on the average. On the other […]

4 January, 2009 at 2:15 pm

KWhat is a good book to learn probability theory from if I already know measure theory (at the level of Rudin’s Real and Complex Analysis)?

4 January, 2009 at 3:34 pm

Terence TaoDear K,

I guess it depends on what you plan to use probability for. I myself tend to use probability for combinatorial purposes, and so I actually don’t use the measure-theoretic foundations of probability so heavily (in many combinatorial applications, one can in fact just work with discrete random variables). For such applications, Alon-Spencer’s “the probabilistic method” is very nice.

For the formal measure-theoretic foundations of probability, I have occasionally used Kallenberg’s “Foundations of probability theory”, which is quite thorough.

12 January, 2009 at 7:34 pm

AnonymousDear Professor Tao,

Would you know of a good book/resource where one can look up all the LLNs that exist under varying assumptions.

Thank you.

9 February, 2009 at 8:48 pm

WTH is the Law of Large Numbers? « What The Haven[…] a more involved commentary, see the famous blog by […]

17 February, 2009 at 9:35 pm

DucDear Professor Tao,

Would it be still sufficient if X_i are not identical?

Is the law still hold if we trade off the identical and independent of those X’s for sup E(X_i^2) bounded and X’s uncorrelated?

I’m reading the book Hilbert Space Methods in Probability and Statistical Inference (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471592811.html) and wonder if we could prove it by using vectors in Hilbert space.

Thank you

10 March, 2009 at 7:45 pm

VivekHow does the Weak law of large numbers follow from strong law by dominated convergence theorem. I always thought it was Egoroff’s theorem but I think you have a more simple proof Dr Tao.

Vivek

10 March, 2009 at 7:59 pm

Terence TaoDear Vivek,

You are right, Egoroff’s theorem is a better citation here. (Dominated convergence works directly when the random variables are bounded, and for unbounded random variables one can apply dominated convergence to the level sets to get the strong law from the weak, but one may as well just jump straight to Egoroff in the latter case.)

28 April, 2009 at 10:04 pm

lutfuDear Prof. Tao,

If we have i.i.d rvs with mean zero, then what is the a.s limit of for different values of a less than 1?

30 April, 2009 at 9:50 am

Terence TaoDear lutfu,

Assuming the random variables have bounded second moment, the limit will be almost surely zero for , converge (in a distributional sense) to a normal distribution for , and diverge to infinity almost surely for , all thanks to the (strong) law of large numbers. For heavy-tailed random variables with infinite second moment, the situation is going to be more complicated, but can be worked out for any specific distribution by a variety of tools (e.g. Fourier analysis).

30 April, 2009 at 1:39 pm

lutfuDear Prof Tao,

Thank you for your help. but I am wondering can we explicitly characterize the a.s. limit of for any real number a , just assuming the existence of first moment?

I know we have law of iterated logarithm thm but it assumes existence of the second moment.

thanks

8 October, 2009 at 2:58 pm

AjitBeing much more into “discrete” math, I often have great difficulty trying to understand concepts from continuous math. One thing I’ve never really understood is why the strong law is stronger than the weak law. I’ve read the statements of the laws and read about the meaning of almost sure convergence and convergence in probability, but I don’t have a good feel for this. I checked the wikipedia entry on this topic, but it’s not clear as to how the strong law prevents you (with probability 1) from having infinitely many sample sequences whose average differs from the true mean. Can you help me?

8 October, 2009 at 7:46 pm

Terence TaoImagine a table in which the rows are all the possible points in the sample space (this is a continuum of rows, but never mind this), and the columns are the number n of trials, and there is a check mark whenever the empirical mean deviates significantly from the actual mean . The weak law asserts that the density of check marks in each column goes to zero as one moves off to the right. The strong law asserts that almost all of the rows have only finitely many checkmarks. A double counting argument (or the Lebesgue dominated convergence theorem) then shows that the latter implies the former.

26 April, 2014 at 1:39 am

nicoooGreat explanation. Thanks a lot !

15 October, 2009 at 8:41 pm

XiaoshengDear Prof.:

Do you have a proof of the SLLN without the assumption that E(X^2) exists?

15 October, 2009 at 9:52 pm

Terence TaoThe proof given in the article uses only finiteness of the first moment.

20 October, 2009 at 6:24 pm

XiaoshengProf Tao:

I’m not quite sure about the truncation method. You used a threshold N to divide the random variable X into two parts Y and Z. Are these two still random variables? I’m just wondering what is XI(X<N) and why does Xn bar equal the sum of Yn bar plus Zn bar?

20 October, 2009 at 6:43 pm

Terence TaoI(X<N) is the indicator function of the event X<N. Thus, is the random variable which equals X when X is less than N, and equals 0 otherwise.

20 October, 2009 at 8:16 pm

XiaoshengYeah, but then it seems that the “sum” of the empirical average of the two parts doesn’t equal the empirical average of X. At least the probabilities on both sides don’t equal… I think there should also be terms describing situations where X1>N>X2 and so forth, but that will spoil the proof. I’m sorry if I misunderstand your meaning.

By the way, it’s such a pitty that I missed your lecture on IMO 2009. Actually I was on the Chinese team in 2008 and didn’t sit for the test again this year. I’m now a freshman at Yale and hope I can have chance to hear your lecture someday.

Best,

Xiaoshengempirical average of X. At least the probabilities on both sides don’t equal… I think there

20 October, 2009 at 8:59 pm

Terence TaoThere is no and here; there is just a single random variable X, which one decomposes as , simply because the events and are complementary and so and add up to 1. The two random variables on the right-hand side are not independent, of course, but this doesn’t matter.

(It may help if one goes back to the foundations of probability theory, and views X as a measurable function on the underlying probability space.)

21 October, 2009 at 7:56 am

XiaoshengProf Tao:

I’m sorry for disturbing you, but I just wonder why does the empirical average of X equal the sum of the the two empirical averages of the decomposed parts XI(X>=N) and XI(X=N or not for each index j, and then we need bionomial coefficients to add these conditions up. But it seems that in this article, only two conditions–X1~Xn simultaneously less than N or simultaneously greater than or equal to N– is considered.

21 October, 2009 at 8:04 am

XiaoshengIt’s pretty strange that my sentence is again missing.Sorry for that. I just want to say that we need to consider Xj>=N for each index j.

21 October, 2009 at 8:48 am

Terence TaoTo apply linearity of expectation (the equation after (2) in the post), no decomposition into cases is required. (This is part of the reason why linearity of expectation is such a powerful method; it also does not require any independence assumptions on the summands.)

21 October, 2009 at 2:54 pm

Vivek KaulI understand that strong law of large number is proved with rigor under the mathematical framework of measure theory but how well does real number system with its problems model things in nature? Do you believe mathematics is just a tautology like Russel’s student Wittgenstein, or there is a deeper meaning to mathematical proofs. Do you think the only purpose of mathematics is to establish logical equivalences between statements, so that we get to a statement that is empirically verifiable? How much can be rely on Mathematical proofs in measure theory and does the Godel’ s incompleteness theorem deny

I would request you to give some insight into these questions as most mathematicians I met know much more advanced mathematics than me, but they do not answer my questions clearly. Hence I lose motivation to pursue mathematics.

21 October, 2009 at 4:02 pm

Terence TaoI find these sorts of metamathematical questions to have surprisingly little impact on the practice and understanding of both maths and science. Nevertheless, I can try to answer some of your questions.

At sufficiently microscopic scales (e.g. below the Planck length), the physical world may well have a different nature than that of the continuum, but this is not incompatible with the empirically observed fact that continuum models work extremely well at macroscopic scales (we have plenty of mathematical examples of discrete systems, for instance, which asymptotically behave like the continuum at large scales). In any event, the mathematical consistency of such continuum models has nothing to do with whether they actually model physical reality or not; the former is a question in pure mathematics, and the latter is a question in physics.

The real line, for instance, can be formally constructed in, say, the ZFC model of set theory, or in the framework of Euclidean geometry (using, say, Hilbert’s axiomatisation of that theory). Having a physical model (or visual model, etc.) of the real line is very useful for providing intuition and to serve as an analogy to the idealised mathematical concept, but is not essential.

It is true that mathematical theorems are, ultimately, tautologies, but they are very non-obvious and useful tautologies nonetheless. For instance, any theorem about three-dimensional Euclidean space is a tautology which does not directly have any physical significance; however, if one then combines such tautologies with the non-mathematical

belief that physical space can be modeled (at human scales) by , then suddenly these theorems yield useful physical consequences that can be verified empirically. Because of things like this, I find the distinction between tautologies and non-tautologies to not be particularly useful in practice.

In any event, theorems and proofs are not the fundamental objective of mathematics, despite the possible appearance to the contrary; what we really seek is the

understandingof various mathematical phenomena, including those phenomena which model analogous real-world phenomena. Formal theorems and proofs are an excellent way to objectively capture and quantify this understanding (and to eliminate all sorts of misunderstanding), but they are a means to an end only.Regarding the incompleteness theorem: it is true that this theorem tells us that we should not be able to establish the consistency of formal mathematical systems such as ZFC (say), so any theorem proven using ZFC is then “only” known to be correct conditionally on the consistency of ZFC.

This is still nevertheless provides an extraordinarily high degree of certainty, and in particular far more than one would expect from any real-world application of a mathematical model. Plus, in practice, most useful theorems in mathematics, once they are understood properly, can be phrased and proven in much weaker systems than ZFC, so the consistency of ZFC itself is usually irrelevant. This is not quite the same thing as having absolute certainty in the correctness of the theorem, but it is more than good enough for almost all purposes outside of philosophy and metamathematics.

21 October, 2009 at 4:32 pm

Vivek KaulThank you for your very informative answer. I generally had these confusions because many mathematics of the bygone era were Platonists who believed more in the reality of mathematical objects than physical things. I read a quote, ” Most current mathematicians if questioned are formalists(finding useful tautologies) but they act as if they are Platonists”. Is there some truth to that?

21 October, 2009 at 3:42 pm

HarshaDear Prof. Tao,

Thank you for your exposition of the proof – it has been very useful to me in my class!

I do have a general question – Suppose we know that $X_i$’s are IID, and $E \| X_i \| < \infty$. Clearly the SLLN gives us convergence of $\frac{S_n}{n}$ to $EX_i$ a.s. Can we also say something about $max_{1 \leq i \leq n} \frac{S_i}{n}$?

21 October, 2009 at 5:08 pm

Terence TaoI am not sure what you are looking for precisely; for instance, if the are non-negative, then is just . In the mean zero, finite variance case, perhaps the law of the iterated logarithm may answer your question.

27 October, 2009 at 11:16 am

FALAHADear Prof. Tao,

I know the defintion of random fields but I don`t know the definition of harmonicale isotropic random fields, do u know it, please?

Another question please, the law of large numbers was not proved largely for random fields, so do u know in which cases that it is not proved?

26 November, 2009 at 5:32 pm

WillJProf. Tao writes:

“Imagine a table in which the rows are all the possible points in the sample space (this is a continuum of rows, but never mind this), and the columns are the number n of trials, and there is a check mark whenever the empirical mean \overline{X}_n deviates significantly from the actual mean {\Bbb E} X. The weak law asserts that the density of check marks in each column goes to zero as one moves off to the right. The strong law asserts that almost all of the rows have only finitely many checkmarks. A double counting argument (or the Lebesgue dominated convergence theorem) then shows that the latter implies the former.”I don’t understand. Each row represents an outcome in the sample space. Each column represents a trial of the experiment. Right? (We assume, for purposes of the illustration, that our sequence of iid random variables is as follows: for any given outcome, all of the r.v.s assign the same number to that outcome; the different r.v.s correspond to the different trials. – Right?) Now where do we put the checkmarks? I can think of two possibilities of what you mean: 1) We imagine repeatedly running the experiment. Each trial gives us an outcome. If trial i gives us outcome j, then put a dot in (row j, column i) in our matrix. Now we take a look at what we have, and at each dot that we put in our matrix, we calculate the empirical mean up to that point (based on where that dot is and where the previous dots are), and if it’s significantly different from the actual mean, then we put a check-mark over that dot. If not, leave the dot be. (We don’t have to ever actually put dots; I just mentioned them for clarification.) But then a column can’t have more than one check-mark, so that must not be what you’re saying. Perhaps, then, you mean: 2) For EVERY cell in our matrix (every [row j, column i] combination), we take a look, do the calculations, and put a check-mark in that cell if, at that point, the empirical mean is significantly different from the actual mean. But that doesn’t make any sense, because the value of the empirical mean at any given cell depends on what the outcomes were in the previous trials, i.e. there’s no unique empirical mean that can be assigned to each cell in our matrix.

So I’m quite confused! The reason I’m interested: I’m trying to understand what exactly the SLOLN says (I think I already understand what the WLOLN says). My main source of confusion is: With regard to the event “the iid r.v.s converge to the constant r.v. always equal to the actual mean” (note that the SLOLN says that this event has probability 1), what is the sample space that this event is taken from? I think this is the issue the Anonymous guy was getting at, and I think your dice example (involving base-6 representations of numbers in [0,1]) was very nice (although I don’t understand your general, “non-ad-hoc” that preceded it:

“To construct a suitable sample space to hold an infinite collection of random variables, such as an infinite sequence of independent die rolls, one can take an inverse limit of the sample spaces associated to finite sequences of die rolls.”– I have some intuition as to what that might mean, and I can try reading up on “inverse limits” (which I never heard of until just now), but maybe you can elaborate on that).I would really, really like to understand your 2-dimensional table idea, as I think it very nicely captures what I’m getting at, if I could just understand it!

(Oh, and many thanks!)

26 November, 2009 at 9:58 pm

Terence TaoEach row represents an infinite sequence of trials. Each column just indexes an integer . A checkmark is placed at row and at column if the empirical mean deviates significantly from the mean .

Suppose for sake of concreteness that the represent the rolls of a six-sided die. Thus, one can imagine filling out the first n entries of one row by rolling a die n times, and recording the running means that one encounters in this process. To get the first n entries of another row, one has to make n fresh rolls.

Alternatively, as mentioned earlier, one can index the rows continuously by a real number x from 0 to 1, and use the base 6 representation to generate the samples that one is averaging over (let’s say that the dice are numbered from 0 to 5 rather than from 1 to 6 to make this representation work cleanly).

27 November, 2009 at 12:00 am

WillJAh, that makes perfect sense. Wahoo! Thank you so much!

3 January, 2010 at 10:21 pm

254A, Notes 1: Concentration of measure « What’s new[…] more sophisticated variant of this argument (which I gave in this earlier blog post, which also has some further discussion and details) gives Theorem 5 (Strong law of large numbers) […]

5 January, 2010 at 4:20 pm

254A, Notes 0: A review of probability theory « What’s new[…] more sophisticated variant of this argument (which I gave in this earlier blog post, which also has some further discussion and details) gives Theorem 5 (Strong law of large numbers) […]

6 January, 2010 at 3:49 pm

pavelDear Prof. Tao,

regarding the following passage:

Happily, this choice works cleanly; … we have the pointwise estimates

,

doesn’t one rather need O(1) to infer

?

It is obtained in essentially the same way, replacing the estimates by , where a is the first index with .

[Oops! Corrected, thanks – T.]20 January, 2010 at 1:58 pm

AnonymousPavel, can you please post as to how we obtain the results.

3 April, 2010 at 9:17 pm

HansDear Prof. Tao:

Could you please illuminate in more detail the two statements in Remark 2:

1. “this random variable (barely) has finite first moment, but for , we see that deviates by at least absolute constant from its mean with probability .”

2. “As the empirical means for are now jointly independent, the probability that one of them deviates significantly is now extremely close to 1 (super-exponentially close in , in fact)”

Thank you very much in advance!

15 May, 2010 at 12:37 pm

The Khinchine Law of Large Numbers | The Longboat and the Otter[…] https://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/ […]

19 August, 2010 at 7:09 am

Russell LyonsHi, Terry.

I just happened across this post. I have two small comments:

1. For the proof of the Borel-Cantelli lemma, one can simply observe that since the expectation of the sum of indicators is finite, so is the sum itself a.s. For some reason, it has been more popular to do it your way (i.e., using Markov’s inequality).

2. For those who are interested in refinements of the interpolation method that apply to weakly dependent random variables under a second moment condition, I can suggest an old paper of mine: Strong laws of large numbers for weakly correlated random variables, Mich. Math. J. 35, No. 3 (1988), 353–359 (http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.mmj/1029003816).

Best,

Russ

28 August, 2011 at 4:34 am

lkozmaThank you for the clear and informative post!

The formula for linearity of expectation shouldn’t have parentheses on the left side?

28 August, 2011 at 6:06 am

The law of large numbers and the central limit theorem | Nair Research Notes[…] central limit theorem (CLT). There are excellent resources on the net for LLN and CLT. For example, this and this are highly recommended readings. This blog will play a complementary with figures and […]

7 November, 2011 at 6:02 am

Note on the weak and strong laws of large numbers | Mathematics and me-themed antics.[…] my development was different from that in the book. My notes essentially follow the proof given by Terry Tao, so you may want to refer to his notes as well. Like this:LikeBe the first to like this post. […]

21 January, 2012 at 3:00 pm

abcIt seems like the proof of the pointwise ergodic theorem is easier than this proof! Inspecting that proof, it uses the the added generality of measure-preserving systems to deal with functions that you could not talk about (naturally) in the setting of the strong law of large numbers.

4 February, 2013 at 9:05 pm

Iosif PinelisI think this result and proof are very nice. There is a vague commonality between Keane’s and Tao’s proof: both exploit, to an extent, the fact that the sample mean varies little, in a sense, with . It is also nice that Tao’s result is not contained in the ergodic theorem. Indeed, one can easily see that the pairwise independence does not imply the stationarity. E.g., let be defined as , respectively, where the ‘s are independent Rademacher random variables, each taking each of the values with probability 1/2. Define similarly, based on ; etc. Then the ‘s are pairwise independent. However, , whereas . So, the sequence is not stationary.

[LaTeX code corrected; the problem was the lack of a space between “latex” and the LaTeX code. Also, the curly braces were unnecessary. -T]5 February, 2013 at 8:01 pm

Iosif PinelisThank you for the help with LaTeX. A couple more points here:

(i) Of course, I wanted to say “the pairwise independence does not imply the stationarity, even if the ‘s are identically distributed”, rather than just “the pairwise independence does not imply the stationarity”.

(ii) Tao’s result, with the pairwise independence rather than with the complete independence, can be extended in a standard manner to the case when the ‘s take values in an arbitrary separable Banach space (say).

9 March, 2013 at 7:36 pm

Laws of large numbers and Birkhoff’s ergodic theorem | Vaughn Climenhaga's Math Blog[…] including a different proof of the weak law than the one above, can be found on Terry Tao’s blog), we observe that the strong law of large numbers can be viewed as a special case of the Birkhoff […]

8 May, 2013 at 5:59 am

Mate WierdlHi Terry! Maybe a few exercises could be added based on the following remarks:

If we assume finite second moment then only “multiplicativity” is needed. By multiplicativity, I mean . This translates to orthogonality if the expectations are $\latex 0$.

An application would be Weyl’s result: if is a sequence of integers then for almost all , the sequence is uniformly distributed . For this, we take .

With this mutiplicativity assumption, one can simply get results for random variables with varying expectations: Consider nonnegative random variables with varying expectation, but then we would divide by the expectation of the sum of the first $n$ random variables instead of to get the right normalization. For example, the expectations can go to arbitrary slowly as long as . But the expectations can also go to ! They cannot increase arbitrary fast: something like for seems to be the limit for a strong law, while for norm (weak) convergence, one can get arbitrary close to , that is we can have .

I think these are good exercises for using and testing the limits of this "lacunary subsequence" trick.

4 October, 2013 at 4:46 pm

PaulMISSING PARENTHESIS:

Ex1+,…+xn = E(x1) +…+E(xn)

should have the parenthesis on the left side of equation thus:

E(x1+…+xn) = E(x1)+…+E(xn)

5 February, 2014 at 7:46 am

Law of large numbers for dependent but uncorrelated random variables | Vaughn Climenhaga's Math Blog[…] book “Probability and Measure”, where it is Theorem 22.1. There is also a discussion on Terry Tao’s blog, which emphasises the role of the main tools in this proof: the moment method and […]

10 February, 2014 at 9:30 am

TomaszCould someone provide some reference to a full proof of the “strong law for independent empirical means for the full sequence under second moment bounds” mentioned in Remark 2? Or at least could someone provide a more detailed explanation of the proof outlined in this remark, in particular what Chernoff’s inequality should be used and how?

28 May, 2014 at 6:38 am

AnonymousHi Dear Professor Tao.

I have a question. If you are an infinite number of random variables. You can still use the strong law of large numbers? please more explain for me.

best regards

5 June, 2015 at 8:49 pm

Prasenjit Karmakar“Indeed, this follows immediately from the simple fact that any random variable X with finite first moment can be expressed as the difference of two non-negative random variables $\max(X,0), \max(-X,0)$ of finite first moment.”

To express a random variable as the difference of two, I think finite moment is not needed, it is needed when we express expectation as the difference of two.

Thanks.

1 July, 2015 at 11:06 am

Law of Large Numbers | Kepler lounge[…] that the sample average converges in probability toward the expected value. There is also a Strong version which I won’t discuss further. But, both versions are of great importance in science as […]

4 December, 2015 at 12:08 pm

Possible explanation for a Bell experiment?[…] be taken as an axiom? The Law Of Large Numbers: https://en.wikipedia.org/wiki/Law_of_large_numbers https://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/ QM explains what's going on in Bell. Why you want to come up with another I cant quite […]

9 February, 2016 at 4:03 am

Error : what is n in quantum mechanics | Physics Forums - The Fusion of Science and Community[…] But as the number of systems in the ensemble tends to infinity the law of large numbers applies: https://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/ Don't worry about the proof if you don't know the math to follow it. Simply grasp what […]

22 July, 2016 at 8:24 am

Helmholtz: El Origen y Significado de los Axiomas Geométricos (1) | Divergiendo - «Nebar defiqueit mor danguot iuit»[…] 13. Tao, T. (2008, 18th June). The strong law of large numbers. [Weblog]. Retrieved 17 July 2016, from https://terrytao.wordpress.com/2008/06/18/the-strong-law-of-large-numbers/#comment-41939↩ […]

6 October, 2016 at 8:10 pm

chaosHi prof Tao,

You wrote: “The concepts of convergence in probability and almost sure convergence in probability theory are specialisations of the concepts of convergence in measure and pointwise convergence almost everywhere in measure theory.”

In measure theory, is there any different between “pointwise convergence almost everywhere” and “convergence almost everywhere”?

[No – T.]13 November, 2016 at 10:44 am

AnonymousYou might want to fix (3). The formula flows out.

[Corrected, thanks – T.]19 March, 2017 at 10:36 pm

NafisahNine years later, I’m finding this very helpful. Thank you

27 April, 2017 at 1:54 pm

ntthungDear Prof Tao,

One thing that’s been bothering me for a while: why is the absolute value sign used in the statement of the theorems? Why do we need E(|X|) instead of just E(X)?

Thank you.

28 April, 2017 at 9:12 am

Terence TaoOne cannot even define the expectation of a signed random variable unless one first knows that it is absolutely integrable (that is, ), since expectation is defined using Lebesgue integration.

28 April, 2017 at 2:07 pm

ntthungThank you very much. I was taught statistics in a slightly less formal way (without measure theory), but I guess it’s not too late to start.

28 April, 2017 at 5:08 pm

Dikran KaragueuzianHere is another perspective on why we need . Suppose that we defined by, say, , where f is the pdf of the random variable . Then, for the Cauchy random variable, we would have . However, if are iid Cauchy random variables, the mean is also a Cauchy random variable. (Use characteristic functions to prove this.) Thus does not converge almost surely to zero.

So if we tried to get around requiring , the strong law of large numbers, as stated, would fail.

3 July, 2017 at 2:04 pm

Robert WebberProfessor Tao,

Thanks for a great post!

Could you please explain the last claim of remark 2 in more detail? In Counterexamples in Probability by Romano & Siegel (pg. 113), there is a nice example of a triangular array of IID random variables that have moments of order p when p 2. But the p = 2 case is trickier and I could come up with neither proof nor counterexample. What did you mean when you said the Chernoff bounds work for the p = 2 case as well?

Much thanks!

3 July, 2017 at 2:08 pm

Robert WebberI apologize for the typo. In Counterexamples in Probability and Statistics (pg 113), there is an example of random variables that have moments of order p where p 2, I can see how Chernoff bounds plus a truncation argument will lead to a SLLN. I a curious about the p=2 case.

6 July, 2017 at 8:00 pm

AnonymousI just want to comment that this is almost surely the politest comments section on the internet. Beautiful website.