In Notes 0, we introduced the notion of a measure space $\Omega = (\Omega, \mathcal{F}, \mu)$, which includes as a special case the notion of a probability space. By selecting one such probability space $(\Omega, \mathcal{F}, \mu)$ as a sample space, one obtains a model for random events and random variables, with random events $E$ being modeled by measurable sets $E_\Omega$ in $\mathcal{F}$, and random variables $X$ taking values in a measurable space $R$ being modeled by measurable functions $X_\Omega: \Omega \to R$. We then defined some basic operations on these random events and variables:
- Given events $E, F$, we defined the conjunction $E \wedge F$, the disjunction $E \vee F$, and the complement $\overline{E}$. For countable families $E_1, E_2, \dots$ of events, we similarly defined $\bigwedge_{n=1}^\infty E_n$ and $\bigvee_{n=1}^\infty E_n$. We also defined the empty event $\emptyset$ and the sure event $\overline{\emptyset}$, and what it meant for two events to be equal.
- Given random variables $X_1, \dots, X_n$ in ranges $R_1, \dots, R_n$ respectively, and a measurable function $F: R_1 \times \dots \times R_n \to S$, we defined the random variable $F(X_1, \dots, X_n)$ in range $S$. (As the special case $n = 0$ of this, every deterministic element $s$ of $S$ was also a random variable taking values in $S$.) Given a relation $P: R_1 \times \dots \times R_n \to \{\hbox{true}, \hbox{false}\}$, we similarly defined the event $P(X_1, \dots, X_n)$. Conversely, given an event $E$, we defined the indicator random variable $1_E$. Finally, we defined what it meant for two random variables to be equal.
- Given an event $E$, we defined its probability $\mathbf{P}(E)$.
These operations obey various axioms; for instance, the boolean operations on events obey the axioms of a Boolean algebra, and the probability function obeys the Kolmogorov axioms. However, we will not focus on the axiomatic approach to probability theory here, instead basing the foundations of probability theory on the sample space models as discussed in Notes 0. (But see this previous post for a treatment of one such axiomatic approach.)
It turns out that almost all of the other operations on random events and variables we need can be constructed in terms of the above basic operations. In particular, this allows one to safely extend the sample space in probability theory whenever needed, provided one uses an extension that respects the above basic operations; this is an important operation when one needs to add new sources of randomness to an existing system of events and random variables, or to couple together two separate such systems into a joint system that extends both of the original systems. We gave a simple example of such an extension in the previous notes, but now we give a more formal definition:
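Definition 1 Suppose that we are using a probability space $\Omega = (\Omega, \mathcal{F}, \mu)$ as the model for a collection of events and random variables. An extension of this probability space is a probability space $\Omega' = (\Omega', \mathcal{F}', \mu')$, together with a measurable map $\pi: \Omega' \to \Omega$ (sometimes called the factor map) which is probability-preserving in the sense that
$$ \mu'( \pi^{-1}(E) ) = \mu(E) \qquad (1) $$
for all $E \in \mathcal{F}$. (Caution: this does not imply that $\mu(\pi(F)) = \mu'(F)$ for all $F \in \mathcal{F}'$; why not?)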
An event $E$ which is modeled by a measurable subset $E_\Omega$ in the sample space $\Omega$ will be modeled by the measurable set $E_{\Omega'} := \pi^{-1}(E_\Omega)$ in the extended sample space $\Omega'$. Similarly, a random variable $X$ taking values in some range $R$ that is modeled by a measurable function $X_\Omega: \Omega \to R$ in $\Omega$ will be modeled instead by the measurable function $X_{\Omega'} := X_\Omega \circ \pi$ in $\Omega'$. We also allow the extension $\Omega'$ to model additional events and random variables that were not modeled by the original sample space $\Omega$ (indeed, this is one of the main reasons why we perform extensions in probability in the first place).
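Thus, for instance, the sample space $\Omega'$ in Example 3 of the previous post is an extension of the sample space $\Omega$ in that example, with the factor map $\pi: \Omega' \to \Omega$ given by the first coordinate projection $\pi(i,j) := i$. One can verify that all of the basic operations on events and random variables listed above are unaffected by the above extension (with one caveat, see remark below). For instance, the conjunction $E \wedge F$ of two events can be defined via the original model $\Omega$ by the formula
$$ (E \wedge F)_\Omega := E_\Omega \cap F_\Omega $$
or via the extension $\Omega'$ via the formula
$$ (E \wedge F)_{\Omega'} := E_{\Omega'} \cap F_{\Omega'}. $$
The two definitions are consistent with each other, thanks to the obvious set-theoretic identity
$$ \pi^{-1}( E_\Omega \cap F_\Omega ) = \pi^{-1}(E_\Omega) \cap \pi^{-1}(F_\Omega). $$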
Similarly, the assumption (1) is precisely what is needed to ensure that the probability of an event remains unchanged when one replaces a sample space model with an extension. We leave the verification of preservation of the other basic operations described above under extension as exercises to the reader.
Remark 2 There is one minor exception to this general rule if we do not impose the additional requirement that the factor map $\pi$ is surjective. Namely, for non-surjective $\pi$, it can become possible that two events $E, F$ are unequal in the original sample space model, but become equal in the extension (and similarly for random variables), although the converse never happens (events that are equal in the original sample space always remain equal in the extension). For instance, let $\Omega$ be the discrete probability space $\{a,b\}$ with $p_a = 1$ and $p_b = 0$, and let $\Omega'$ be the discrete probability space $\{a'\}$ with $p'_{a'} = 1$, and non-surjective factor map $\pi: \Omega' \to \Omega$ defined by $\pi(a') := a$. Then the event modeled by $\{b\}$ in $\Omega$ is distinct from the empty event when viewed in $\Omega$, but becomes equal to that event when viewed in $\Omega'$. Thus we see that extending the sample space by a non-surjective factor map can identify previously distinct events together (though of course, being probability preserving, this can only happen if those two events were already almost surely equal anyway). This turns out to be fairly harmless though; while it is nice to know if two given events are equal, or if they differ by a non-null event, it is almost never useful to know that two events are unequal if they are already almost surely equal. Alternatively, one can add the additional requirement of surjectivity in the definition of an extension, which is also a fairly harmless constraint to impose (this is what I chose to do in this previous set of notes).
Roughly speaking, one can define probability theory as the study of those properties of random events and random variables that are model-independent in the sense that they are preserved by extensions. For instance, the cardinality $|E_\Omega|$ of the model $E_\Omega$ of an event $E$ is not a concept within the scope of probability theory, as it is not preserved by extensions: continuing Example 3 from Notes 0, the event $E$ that a die roll is even is modeled by a set $\{2,4,6\}$ of cardinality $3$ in the original sample space model $\Omega$, but by a set $\{2,4,6\} \times \{1,2,3,4,5,6\}$ of cardinality $18$ in the extension. Thus it does not make sense in the context of probability theory to refer to the “cardinality of an event $E$”.
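As a minimal sketch of the above discussion, the following Python snippet checks the probability-preserving property (1), the consistency of conjunction, and the failure of cardinality to be preserved. It assumes (this is an illustration, not a quotation of Notes 0) that the original model is a single fair die $\{1,\dots,6\}$ and the extension is a pair of fair dice with the first coordinate projection as factor map; all names in the code are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed original model: one fair die. Assumed extension: two fair dice.
omega = list(range(1, 7))
mu = {i: Fraction(1, 6) for i in omega}
omega_ext = list(product(omega, repeat=2))
mu_ext = {w: Fraction(1, 36) for w in omega_ext}

def factor_map(w):
    """First coordinate projection pi(i, j) = i."""
    return w[0]

def pullback(event):
    """Model of an event in the extension: the preimage under the factor map."""
    return {w for w in omega_ext if factor_map(w) in event}

def prob(measure, event):
    return sum(measure[w] for w in event)

even = {2, 4, 6}
at_least_five = {5, 6}

# Property (1): the probability of an event is the same in both models.
preserved = prob(mu_ext, pullback(even)) == prob(mu, even)
# Conjunction computed in either model gives consistent answers.
conjunction_consistent = (
    pullback(even & at_least_five) == pullback(even) & pullback(at_least_five)
)
# Cardinality of the model is NOT preserved by the extension.
cardinalities = (len(even), len(pullback(even)))
```

Here the probabilities agree in both models while the cardinalities of the two models of the same event differ, which is the sense in which cardinality is not a probabilistic concept.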
On the other hand, the supremum $\sup_n X_n$ of a collection of random variables $X_n$ in the extended real line $[-\infty,+\infty]$ is a valid probabilistic concept. This can be seen by manually verifying that this operation is preserved under extension of the sample space, but one can also see this by defining the supremum in terms of existing basic operations. Indeed, note from Exercise 24 of Notes 0 that a random variable $X$ in the extended real line is completely specified by the threshold events $(X \leq t)$ for $t \in \mathbf{R}$; in particular, two such random variables $X, Y$ are equal if and only if the events $(X \leq t)$ and $(Y \leq t)$ are surely equal for all $t$. From the identity
$$ (\sup_n X_n \leq t) = \bigwedge_{n=1}^\infty (X_n \leq t) $$
we thus see that one can completely specify $\sup_n X_n$ in terms of $X_n$ using only the basic operations provided in the above list (and in particular using the countable conjunction $\bigwedge_{n=1}^\infty$). Of course, the same considerations hold if one replaces supremum by infimum, limit superior, limit inferior, or (if it exists) the limit.
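The threshold-event description of the supremum can be sanity-checked on a toy model. The sketch below is only an illustration under simplifying assumptions: the sample space is a finite set, the family of random variables is finite (so the countable conjunction becomes a finite intersection), and the particular functions are arbitrary hypothetical choices.

```python
# Toy sample space and a finite family of real random variables (hypothetical choices).
omega = range(12)
family = [lambda w: w % 3, lambda w: (w * w) % 5, lambda w: 2 - w % 2]

def sup_of_family(w):
    """Pointwise supremum of the family at the outcome w."""
    return max(X(w) for X in family)

def threshold_event(X, t):
    """Model of the event (X <= t) as a subset of the sample space."""
    return {w for w in omega if X(w) <= t}

thresholds = [-1, 0, 0.5, 1, 2, 3, 4]

# (sup_n X_n <= t) should coincide with the conjunction of the events (X_n <= t).
identity_holds = all(
    threshold_event(sup_of_family, t)
    == set.intersection(*(threshold_event(X, t) for X in family))
    for t in thresholds
)
```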
In this set of notes, we will define some further important operations on scalar random variables, in particular the expectation of these variables. In the sample space models, expectation corresponds to the notion of integration on a measure space. As we will need to use both expectation and integration in this course, we will thus begin by quickly reviewing the basics of integration on a measure space, although we will then translate the key results of this theory into probabilistic language.
As the finer details of the Lebesgue integral construction are not the core focus of this probability course, some of the details of this construction will be left to exercises. See also Chapter 1 of Durrett, or these previous blog notes, for a more detailed treatment.
— 1. Integration on measure spaces —
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let $f$ be a measurable function on $\Omega$, taking values either in the reals $\mathbf{R}$, the non-negative extended reals $[0,+\infty]$, the extended reals $[-\infty,+\infty]$, or the complex numbers $\mathbf{C}$. We would like to define the integral
$$ \int_\Omega f\ d\mu \qquad (2) $$
of $f$ on $\Omega$. (One could make the integration variable explicit, e.g. by writing $\int_\Omega f(\omega)\ d\mu(\omega)$, but we will usually not do so here.) When integrating a reasonably nice function (e.g. a continuous function) on a reasonably nice domain (e.g. a box in $\mathbf{R}^n$), the Riemann integral that one learns about in undergraduate calculus classes suffices for this task; however, for the purposes of probability theory, we need the much more general notion of a Lebesgue integral in order to properly define (2) for the spaces $\Omega$ and functions $f$ we will need to study.
Not every measurable function can be integrated by the Lebesgue integral. There are two key classes of functions for which the integral exists and is well behaved:
- Unsigned measurable functions $f: \Omega \to [0,+\infty]$, that take values in the non-negative extended reals $[0,+\infty]$; and
- Absolutely integrable functions $f: \Omega \to \mathbf{R}$ or $f: \Omega \to \mathbf{C}$, which are scalar measurable functions whose absolute value has a finite integral: $\int_\Omega |f|\ d\mu < \infty$. (Sometimes we also allow absolutely integrable functions to attain an infinite value $\infty$, so long as they only do so on a set of measure zero.)
One could in principle extend the Lebesgue integral to slightly more general classes of functions, e.g. to sums of absolutely integrable functions and unsigned functions. However, the above two classes already suffice for most applications (and as a general rule of thumb, it is dangerous to apply the Lebesgue integral to functions that are not unsigned or absolutely integrable, unless you really know what you are doing).
We will construct the Lebesgue integral in the following four stages. First, we will define the Lebesgue integral just for unsigned simple functions – unsigned measurable functions that take on only finitely many values. Then, by a limiting procedure, we extend the Lebesgue integral to unsigned functions. After that, by decomposing a real absolutely integrable function into unsigned components, we extend the integral to real absolutely integrable functions. Finally, by taking real and imaginary parts, we extend to complex absolutely integrable functions. (This is not the only order in which one could perform this construction; for instance, in Durrett, one first constructs integration of bounded functions on finite measure support before passing to arbitrary unsigned functions.)
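First consider an unsigned simple function $f: \Omega \to [0,+\infty]$, thus $f$ is measurable and takes only finitely many values. Then we can express $f$ as a finite linear combination (in $[0,+\infty]$) of indicator functions. Indeed, if we enumerate the values that $f$ takes as $a_1, \dots, a_n \in [0,+\infty]$ (avoiding repetitions) and set $E_i := \{ \omega \in \Omega: f(\omega) = a_i \}$ for $i = 1, \dots, n$, then it is clear that
$$ f = \sum_{i=1}^n a_i 1_{E_i}. $$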
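(It should be noted at this point that the operations of addition and multiplication on $[0,+\infty]$ are defined by setting $+\infty + a = a + (+\infty) = +\infty$ for all $a \in [0,+\infty]$, and $a \cdot (+\infty) = (+\infty) \cdot a = +\infty$ for all positive $a \in (0,+\infty]$, but that $0 \cdot (+\infty) = (+\infty) \cdot 0$ is defined to equal $0$. To put it another way, multiplication is defined to be continuous from below, rather than from above: $a \cdot b = \lim_{x \to a^-, y \to b^-} x \cdot y$. One can verify that the commutative, associative, and distributive laws continue to hold on $[0,+\infty]$, but we caution that the cancellation laws do not hold when $+\infty$ is involved.)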
Conversely, given any coefficients $a_1, \dots, a_n \in [0,+\infty]$ (not necessarily distinct) and measurable sets $E_1, \dots, E_n$ in $\mathcal{F}$ (not necessarily disjoint), the sum $\sum_{i=1}^n a_i 1_{E_i}$ is an unsigned simple function.
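A single simple function can be decomposed in multiple ways as a linear combination of indicator functions. For instance, on the real line $\mathbf{R}$, the function $2 \times 1_{[0,1)} + 1 \times 1_{[1,3)}$ can also be written as $1 \times 1_{[0,1)} + 1 \times 1_{[0,3)}$ or as $2 \times 1_{[0,1)} + 1 \times 1_{[1,2)} + 1 \times 1_{[2,3)}$. However, there is an invariant of all these decompositions: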
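Exercise 3 Suppose that an unsigned simple function $f$ has two representations as the linear combination of indicator functions:
$$ f = \sum_{i=1}^k a_i 1_{E_i} = \sum_{j=1}^{k'} a'_j 1_{E'_j}, $$
where $k, k' \geq 0$ are nonnegative integers, $a_1, \dots, a_k, a'_1, \dots, a'_{k'}$ lie in $[0,+\infty]$, and $E_1, \dots, E_k, E'_1, \dots, E'_{k'}$ are measurable sets. Show that
$$ \sum_{i=1}^k a_i \mu(E_i) = \sum_{j=1}^{k'} a'_j \mu(E'_j). $$
(Hint: first handle the special case where the $E_i$ are all disjoint and non-empty, and each of the $E'_j$ is expressible as the union of some subcollection of the $E_i$. Then handle the general case by considering the atoms of the finite boolean algebra generated by the $E_i$ and the $E'_j$.)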
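We capture this invariant by introducing the simple integral $\operatorname{Simp} \int_\Omega f\ d\mu$ of an unsigned simple function $f: \Omega \to [0,+\infty]$ by the formula
$$ \operatorname{Simp} \int_\Omega f\ d\mu := \sum_{i=1}^n a_i \mu(E_i) $$
whenever $f$ admits a decomposition $f = \sum_{i=1}^n a_i 1_{E_i}$. The above exercise is then precisely the assertion that the simple integral is well-defined as an element of $[0,+\infty]$.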
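To illustrate the invariance asserted by Exercise 3, here is a small sketch that computes $\sum_i a_i \mu(E_i)$ for three different decompositions of the same step function $2 \times 1_{[0,1)} + 1 \times 1_{[1,3)}$ on the real line. It assumes Lebesgue measure, so that a half-open interval $[l,r)$ has measure $r - l$; the helper names are hypothetical.

```python
from fractions import Fraction

def simple_integral(terms):
    """Sum of a_i * m(E_i), where each term is (a_i, (l, r)) with E_i = [l, r)."""
    return sum(Fraction(a) * (Fraction(r) - Fraction(l)) for a, (l, r) in terms)

def evaluate(terms, x):
    """Value at x of the simple function sum_i a_i 1_{E_i}."""
    return sum(a for a, (l, r) in terms if l <= x < r)

# Three decompositions of the same unsigned simple function.
d1 = [(2, (0, 1)), (1, (1, 3))]
d2 = [(1, (0, 1)), (1, (0, 3))]
d3 = [(2, (0, 1)), (1, (1, 2)), (1, (2, 3))]

sample_points = [Fraction(k, 4) for k in range(-2, 15)]
same_function = all(
    evaluate(d1, x) == evaluate(d2, x) == evaluate(d3, x) for x in sample_points
)
integrals = [simple_integral(d) for d in (d1, d2, d3)]
```

All three decompositions give the same value, as the exercise predicts; of course this checks one example and is no substitute for the proof.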
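Exercise 4 Let $f, g: \Omega \to [0,+\infty]$ be unsigned simple functions, and let $c \in [0,+\infty]$.
- (i) (Linearity) Show that
$$ \operatorname{Simp} \int_\Omega f + g\ d\mu = \operatorname{Simp} \int_\Omega f\ d\mu + \operatorname{Simp} \int_\Omega g\ d\mu $$
and
$$ \operatorname{Simp} \int_\Omega c f\ d\mu = c \operatorname{Simp} \int_\Omega f\ d\mu. $$
- (ii) Show that if $f$ and $g$ are equal almost everywhere, then
$$ \operatorname{Simp} \int_\Omega f\ d\mu = \operatorname{Simp} \int_\Omega g\ d\mu. $$
- (iii) Show that $\operatorname{Simp} \int_\Omega f\ d\mu \geq 0$, with equality if and only if $f$ is zero almost everywhere.
- (iv) (Monotonicity) If $f \leq g$ almost everywhere, show that $\operatorname{Simp} \int_\Omega f\ d\mu \leq \operatorname{Simp} \int_\Omega g\ d\mu$.
- (v) (Markov inequality) Show that $\mu( \{ \omega \in \Omega: f(\omega) \geq t \} ) \leq \frac{1}{t} \operatorname{Simp} \int_\Omega f\ d\mu$ for any $0 < t < \infty$.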
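Now we extend from unsigned simple functions to more general unsigned functions. If $f: \Omega \to [0,+\infty]$ is an unsigned measurable function, we define the unsigned integral $\int_\Omega f\ d\mu$ as
$$ \int_\Omega f\ d\mu := \sup_{0 \leq g \leq f;\ g \hbox{ simple}} \operatorname{Simp} \int_\Omega g\ d\mu \qquad (3) $$
where the supremum is over all unsigned simple functions $g$ such that $0 \leq g(\omega) \leq f(\omega)$ for all $\omega \in \Omega$.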
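To get a feel for approximation from below by simple functions, the following sketch takes $f(x) = x^2$ on $[0,1]$ with Lebesgue measure (an illustrative choice, not an example from these notes) and uses the dyadic simple functions $g_k := 2^{-k} \lfloor 2^k f \rfloor \leq f$. Their simple integrals increase towards $1/3$, the value one expects from calculus. The code assumes that an interval has Lebesgue measure equal to its length, and it only exhibits one increasing family of lower approximants rather than the full supremum in the definition.

```python
import math

def lower_integral(k):
    """Simple integral of g_k = 2^-k * floor(2^k * x^2) over [0, 1].

    g_k equals j * 2^-k on the set {x : j * 2^-k <= x^2 < (j + 1) * 2^-k},
    which is the interval [sqrt(j * 2^-k), sqrt((j + 1) * 2^-k)).
    """
    step = 2.0 ** -k
    total = 0.0
    for j in range(2 ** k):
        left = math.sqrt(j * step)
        right = math.sqrt((j + 1) * step)
        total += j * step * (right - left)
    return total

approximations = {k: lower_integral(k) for k in (2, 5, 10)}
```

Since $0 \leq f - g_k < 2^{-k}$ on $[0,1)$, each approximation lies within $2^{-k}$ below $1/3$.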
Many of the properties of the simple integral carry over to the unsigned integral easily:
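Exercise 5 Let $f, g: \Omega \to [0,+\infty]$ be unsigned functions, and let $c \in [0,+\infty]$.
- (i) (Superadditivity) Show that
$$ \int_\Omega f + g\ d\mu \geq \int_\Omega f\ d\mu + \int_\Omega g\ d\mu $$
and
$$ \int_\Omega c f\ d\mu = c \int_\Omega f\ d\mu. $$
- (ii) Show that if $f$ and $g$ are equal almost everywhere, then
$$ \int_\Omega f\ d\mu = \int_\Omega g\ d\mu. $$
- (iii) Show that $\int_\Omega f\ d\mu \geq 0$, with equality if and only if $f$ is zero almost everywhere.
- (iv) (Monotonicity) If $f \leq g$ almost everywhere, show that $\int_\Omega f\ d\mu \leq \int_\Omega g\ d\mu$.
- (v) (Markov inequality) Show that $\mu( \{ \omega \in \Omega: f(\omega) \geq t \} ) \leq \frac{1}{t} \int_\Omega f\ d\mu$ for any $0 < t < \infty$. In particular, if $\int_\Omega f\ d\mu < \infty$, then $f$ is finite almost everywhere.
- (vi) (Compatibility with simple integral) If $f$ is simple, show that $\int_\Omega f\ d\mu = \operatorname{Simp} \int_\Omega f\ d\mu$.
- (vii) (Compatibility with measure) For any measurable set $E$, show that $\int_\Omega 1_E\ d\mu = \mu(E)$.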
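Exercise 6 If $(\Omega, (p_\omega)_{\omega \in \Omega})$ is a discrete probability space (with the associated probability measure $\mu$), and $f: \Omega \to [0,+\infty]$ is a function, show that
$$ \int_\Omega f\ d\mu = \sum_{\omega \in \Omega} f(\omega) p_\omega. $$
(Note that the condition $\sum_{\omega \in \Omega} p_\omega = 1$ in the definition of a discrete probability space is not required to prove this identity.)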
The observant reader will notice that the linearity property of simple functions has been weakened to superadditivity. This can be traced back to a breakdown of symmetry in the definition (3); the unsigned integral of $f$ is defined via approximation from below, but not from above. Indeed the opposite claim
$$ \int_\Omega f\ d\mu \stackrel{?}{=} \inf_{f \leq g;\ g \hbox{ simple}} \operatorname{Simp} \int_\Omega g\ d\mu \qquad (4) $$
can fail. For a counterexample, take $\Omega$ to be the discrete probability space $\{1, 2, 3, \dots\}$ with probabilities $p_n := 2^{-n}$, and let $f: \Omega \to [0,+\infty]$ be the function $f(n) := n$. By Exercise 6 we have $\int_\Omega f\ d\mu = \sum_{n=1}^\infty n 2^{-n} = 2$. On the other hand, any simple function $g$ with $g \geq f$ must equal $+\infty$ on a set of positive measure (why?) and so the right-hand side of (4) is infinite. However, one can get around this difficulty under some further assumptions on $f$, and thus recover full linearity for the unsigned integral:
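Exercise 7 (Linearity of the unsigned integral) Let $(\Omega, \mathcal{F}, \mu)$ be a measure space.
- (i) Let $f: \Omega \to [0,+\infty]$ be an unsigned measurable function which is both bounded (i.e., there is a finite $M$ such that $f(\omega) \leq M$ for all $\omega \in \Omega$) and has finite measure support (i.e., there is a measurable set $E$ with $\mu(E) < \infty$ such that $f(\omega) = 0$ for all $\omega \in \Omega \backslash E$). Show that (4) holds for this function $f$.
- (ii) Establish the additivity property
$$ \int_\Omega f + g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu $$
whenever $f, g: \Omega \to [0,+\infty]$ are unsigned measurable functions that are bounded with finite measure support.
- (iii) Show that
$$ \int_\Omega \min(f, n)\ d\mu \to \int_\Omega f\ d\mu $$
as $n \to \infty$ whenever $f: \Omega \to [0,+\infty]$ is unsigned measurable.
- (iv) Using (iii), extend (ii) to the case where $f, g$ are unsigned measurable functions with finite measure support, but are not necessarily bounded.
- (v) Show that
$$ \int_\Omega f 1_{f \geq 1/n}\ d\mu \to \int_\Omega f\ d\mu $$
as $n \to \infty$ whenever $f: \Omega \to [0,+\infty]$ is unsigned measurable.
- (vi) Using (iii) and (v), show that (ii) holds for any unsigned measurable $f, g$ (which are not necessarily bounded or of finite measure support).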
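The asymmetry between approximation from below and from above can be seen concretely in the counterexample discussed before Exercise 7, namely the discrete probability space $\{1,2,3,\dots\}$ with $p_n = 2^{-n}$ and $f(n) = n$. The sketch below uses the truncations $g_N := f 1_{\{1,\dots,N\}}$ (a hypothetical choice of lower simple approximants) and checks in exact arithmetic that their simple integrals equal $2 - (N+2) 2^{-N}$, which increases to $2$. No finite computation can exhibit the failure from above, since any simple $g \geq f$ takes only finitely many values and hence must be $+\infty$ on all sufficiently large $n$.

```python
from fractions import Fraction

def lower_simple_integral(N):
    """Simple integral of g_N(n) = n for n <= N and 0 otherwise, so 0 <= g_N <= f."""
    return sum(Fraction(n, 2 ** n) for n in range(1, N + 1))

def closed_form(N):
    """Closed form 2 - (N + 2) / 2^N of the partial sum of n * 2^-n."""
    return 2 - Fraction(N + 2, 2 ** N)

levels = (1, 2, 5, 10, 40)
lower_values = [lower_simple_integral(N) for N in levels]
matches_closed_form = all(
    lower_simple_integral(N) == closed_form(N) for N in levels
)
```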
Next, we apply the integral to absolutely integrable functions. We call a scalar function $f: \Omega \to \mathbf{R}$ or $f: \Omega \to \mathbf{C}$ absolutely integrable if it is measurable and the unsigned integral $\int_\Omega |f|\ d\mu$ is finite. A real-valued absolutely integrable function $f$ can be expressed as the difference $f = f_1 - f_2$ of two unsigned absolutely integrable functions $f_1, f_2$; indeed, one can check that the choice $f_1 := \max(f, 0)$ and $f_2 := \max(-f, 0)$ works for this. Conversely, any difference $f_1 - f_2$ of unsigned absolutely integrable functions $f_1, f_2$ is absolutely integrable (this follows from the triangle inequality $|f_1 - f_2| \leq |f_1| + |f_2|$). A single absolutely integrable function $f$ may be written as a difference $f_1 - f_2$ of unsigned absolutely integrable functions in more than one way, for instance we might have
$$ f = f_1 - f_2 = g_1 - g_2 $$
for unsigned absolutely integrable functions $f_1, f_2, g_1, g_2$. But when this happens, we can rearrange to obtain
$$ f_1 + g_2 = g_1 + f_2 $$
and thus by linearity of the unsigned integral
$$ \int_\Omega f_1\ d\mu + \int_\Omega g_2\ d\mu = \int_\Omega g_1\ d\mu + \int_\Omega f_2\ d\mu. $$
By the absolute integrability of $f_1, f_2, g_1, g_2$, all the integrals are finite, so we may rearrange this identity as
$$ \int_\Omega f_1\ d\mu - \int_\Omega f_2\ d\mu = \int_\Omega g_1\ d\mu - \int_\Omega g_2\ d\mu. $$
This allows us to define the Lebesgue integral $\int_\Omega f\ d\mu \in \mathbf{R}$ of a real-valued absolutely integrable function $f$ to be the expression
$$ \int_\Omega f\ d\mu := \int_\Omega f_1\ d\mu - \int_\Omega f_2\ d\mu $$
for any given decomposition $f = f_1 - f_2$ of $f$ as the difference of two unsigned absolutely integrable functions. Note that if $f$ is both unsigned and absolutely integrable, then the unsigned integral and the Lebesgue integral of $f$ agree (as can be seen by using the decomposition $f = f - 0$), and so there is no ambiguity in using the same notation $\int_\Omega f\ d\mu$ to denote both integrals. (By the same token, we may now drop the modifier $\operatorname{Simp}$ from the simple integral of a simple unsigned $f$, which we may now also denote by $\int_\Omega f\ d\mu$.)
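The independence of the Lebesgue integral from the choice of decomposition can be checked by hand on a toy example. The sketch below assumes a four-point measure space with equal weights and an arbitrary real function on it (all hypothetical choices), and compares the canonical decomposition into positive and negative parts with a second decomposition obtained by adding the same unsigned function to both pieces.

```python
from fractions import Fraction

# Hypothetical finite measure space with four points of mass 1/4 each.
weights = {w: Fraction(1, 4) for w in ("a", "b", "c", "d")}
f = {"a": Fraction(3), "b": Fraction(-2), "c": Fraction(0), "d": Fraction(-5, 2)}

def unsigned_integral(g):
    """Integral of an unsigned function on the finite space: sum of g * weight."""
    assert all(v >= 0 for v in g.values())
    return sum(g[w] * weights[w] for w in weights)

# Canonical decomposition f = max(f, 0) - max(-f, 0).
f_plus = {w: max(v, 0) for w, v in f.items()}
f_minus = {w: max(-v, 0) for w, v in f.items()}

# A second decomposition f = g1 - g2, adding the same unsigned h to both parts.
h = {"a": Fraction(1), "b": Fraction(7), "c": Fraction(1, 2), "d": Fraction(0)}
g1 = {w: f_plus[w] + h[w] for w in weights}
g2 = {w: f_minus[w] + h[w] for w in weights}

integral_canonical = unsigned_integral(f_plus) - unsigned_integral(f_minus)
integral_alternative = unsigned_integral(g1) - unsigned_integral(g2)
```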
The Lebesgue integral also enjoys good linearity properties:
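Exercise 8 Let $f, g: \Omega \to \mathbf{R}$ be real-valued absolutely integrable functions, and let $c \in \mathbf{R}$.
- (i) (Linearity) Show that $f + g$ and $cf$ are also real-valued absolutely integrable functions, with
$$ \int_\Omega f + g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu $$
and
$$ \int_\Omega c f\ d\mu = c \int_\Omega f\ d\mu. $$
(For the second relation, one may wish to first treat the special cases $c > 0$ and $c = -1$.)
- (ii) Show that if $f$ and $g$ are equal almost everywhere, then
$$ \int_\Omega f\ d\mu = \int_\Omega g\ d\mu. $$
- (iii) Show that $\int_\Omega |f|\ d\mu \geq 0$, with equality if and only if $f$ is zero almost everywhere.
- (iv) (Monotonicity) If $f \leq g$ almost everywhere, show that $\int_\Omega f\ d\mu \leq \int_\Omega g\ d\mu$.
- (v) (Markov inequality) Show that $\mu( \{ \omega \in \Omega: |f(\omega)| \geq t \} ) \leq \frac{1}{t} \int_\Omega |f|\ d\mu$ for any $0 < t < \infty$.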
Because of part (ii) of the above exercise, we can extend the Lebesgue integral to real-valued absolutely integrable functions that are only defined and real-valued almost everywhere, rather than everywhere. In particular, we can apply the Lebesgue integral to functions that are sometimes infinite, so long as they are only infinite on a set of measure zero, and the function is absolutely integrable everywhere else.
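Finally, we extend to complex-valued functions. If $f: \Omega \to \mathbf{C}$ is absolutely integrable, observe that the real and imaginary parts $\mathrm{Re} f, \mathrm{Im} f: \Omega \to \mathbf{R}$ are also absolutely integrable (because $|\mathrm{Re} f|, |\mathrm{Im} f| \leq |f|$). We then define the (complex) Lebesgue integral $\int_\Omega f\ d\mu \in \mathbf{C}$ of $f$ in terms of the real Lebesgue integral by the formula
$$ \int_\Omega f\ d\mu := \int_\Omega \mathrm{Re} f\ d\mu + i \int_\Omega \mathrm{Im} f\ d\mu. $$
Clearly, if $f$ is real-valued and absolutely integrable, then the real Lebesgue integral and the complex Lebesgue integral of $f$ coincide, so it does not create ambiguity to use the same symbol $\int_\Omega f\ d\mu$ for both concepts. It is routine to extend the linearity properties of the real Lebesgue integral to its complex counterpart: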
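Exercise 9 Let $f, g: \Omega \to \mathbf{C}$ be complex-valued absolutely integrable functions, and let $c \in \mathbf{C}$.
- (i) (Linearity) Show that $f + g$ and $cf$ are also complex-valued absolutely integrable functions, with
$$ \int_\Omega f + g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu $$
and
$$ \int_\Omega c f\ d\mu = c \int_\Omega f\ d\mu. $$
(For the second relation, one may wish to first treat the special cases $c \in \mathbf{R}$ and $c = i$.)
- (ii) Show that if $f$ and $g$ are equal almost everywhere, then
$$ \int_\Omega f\ d\mu = \int_\Omega g\ d\mu. $$
- (iii) Show that $\int_\Omega |f|\ d\mu \geq 0$, with equality if and only if $f$ is zero almost everywhere.
- (iv) (Markov inequality) Show that $\mu( \{ \omega \in \Omega: |f(\omega)| \geq t \} ) \leq \frac{1}{t} \int_\Omega |f|\ d\mu$ for any $0 < t < \infty$.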
We record a simple, but incredibly fundamental, inequality concerning the Lebesgue integral:
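Lemma 10 (Triangle inequality) If $f: \Omega \to \mathbf{C}$ is a complex-valued absolutely integrable function, then
$$ \left| \int_\Omega f\ d\mu \right| \leq \int_\Omega |f|\ d\mu. $$
Proof: We have
$$ \mathrm{Re} \int_\Omega f\ d\mu = \int_\Omega \mathrm{Re} f\ d\mu \leq \int_\Omega |f|\ d\mu. $$
This looks weaker than what we want to prove, but we can “amplify” this inequality to the full strength triangle inequality as follows. Replacing $f$ by $e^{i\theta} f$ for any real $\theta$, we have
$$ \mathrm{Re} \left( e^{i\theta} \int_\Omega f\ d\mu \right) \leq \int_\Omega |f|\ d\mu. $$
Since we can choose the phase $e^{i\theta}$ to make the expression $e^{i\theta} \int_\Omega f\ d\mu$ equal to $|\int_\Omega f\ d\mu|$, the claim follows.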
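The amplification step in the above proof can be illustrated numerically. The sketch below works on a hypothetical three-point measure space with arbitrary complex values, checks the weak bound on the real part of the rotated integral for a range of phases, and then picks the phase that rotates the integral onto the positive real axis, which recovers the full triangle inequality.

```python
import cmath
import math

# Hypothetical three-point measure space and complex-valued function.
weights = [0.2, 0.5, 0.3]
values = [1 + 2j, -3 + 0.5j, 2 - 1j]

integral = sum(w * z for w, z in zip(weights, values))
integral_of_abs = sum(w * abs(z) for w, z in zip(weights, values))

# Weak inequality: Re(e^{i theta} * integral) <= integral of |f| for every phase.
thetas = [2 * math.pi * k / 360 for k in range(360)]
weak_bound_holds = all(
    (cmath.exp(1j * theta) * integral).real <= integral_of_abs for theta in thetas
)

# Amplification: choose the phase that makes the rotated integral real and nonnegative.
best_theta = -cmath.phase(integral)
amplified = (cmath.exp(1j * best_theta) * integral).real
```

With the optimal phase, the left-hand side of the weak inequality becomes exactly the absolute value of the integral (up to floating point error).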
Finally, we observe that the Lebesgue integral extends the Riemann integral, which is particularly useful when it comes to actually computing some of these integrals:
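Exercise 11 If $f: [a,b] \to \mathbf{R}$ is a Riemann integrable function on a compact interval $[a,b]$, show that $f$ is also absolutely integrable, and that the Lebesgue integral $\int_{[a,b]} f\ dm$ (with $m$ Lebesgue measure restricted to $[a,b]$) coincides with the Riemann integral $\int_a^b f(x)\ dx$. Similarly if $f$ is Riemann integrable on a box $[a_1,b_1] \times \dots \times [a_n,b_n]$.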
— 2. Expectation of random variables —
We now translate the above notions of integration on measure spaces to the probabilistic setting.
A random variable $X$ taking values in the unsigned extended real line $[0,+\infty]$ is said to be simple if it takes on at most finitely many values. Equivalently, $X$ can be expressed as a finite unsigned linear combination
$$ X = \sum_{i=1}^n a_i 1_{E_i} $$
of indicator random variables, where $a_1, \dots, a_n \in [0,+\infty]$ are unsigned and $E_1, \dots, E_n$ are events. We then define the simple expectation $\operatorname{Simp} \mathbf{E} X \in [0,+\infty]$ of $X$ to be the quantity
$$ \operatorname{Simp} \mathbf{E} X := \sum_{i=1}^n a_i \mathbf{P}(E_i), $$
and one checks that this definition is independent of the choice of decomposition of $X$ into indicator random variables. Observe that if we model the random variable $X$ using a probability space $\Omega$, then the simple expectation of $X$ is precisely the simple integral of the corresponding unsigned simple function $X_\Omega$.
Next, given an arbitrary unsigned random variable $X$ taking values in $[0,+\infty]$, one defines its (unsigned) expectation $\mathbf{E} X \in [0,+\infty]$ as
$$ \mathbf{E} X := \sup_{0 \leq Y \leq X} \operatorname{Simp} \mathbf{E} Y $$
where $Y$ ranges over all simple unsigned random variables such that $0 \leq Y \leq X$ is surely true. This extends the simple expectation (thus $\mathbf{E} X = \operatorname{Simp} \mathbf{E} X$ for all simple unsigned $X$), and in terms of a probability space model $\Omega$, the expectation $\mathbf{E} X$ is precisely the unsigned integral of $X_\Omega$. The expectation of a random variable is also often referred to as the mean, particularly in applications connected to statistics. In some literature $\mathbf{E} X$ is also called the expected value of $X$, but this is a somewhat misleading term as often one expects $X$ to deviate above or below $\mathbf{E} X$.
A scalar random variable $X$ is said to be absolutely integrable if $\mathbf{E} |X| < \infty$, thus for instance any bounded random variable is absolutely integrable. If $X$ is real-valued and absolutely integrable, we define its expectation $\mathbf{E} X$ by the formula
$$ \mathbf{E} X := \mathbf{E} X_1 - \mathbf{E} X_2 $$
where $X = X_1 - X_2$ is any representation of $X$ as the difference of unsigned absolutely integrable random variables $X_1, X_2$; one can check that this definition does not depend on the choice of representation and is thus well-defined. For complex-valued absolutely integrable $X$, we then define
$$ \mathbf{E} X := \mathbf{E}\, \mathrm{Re} X + i\, \mathbf{E}\, \mathrm{Im} X. $$
In all of these cases, the expectation of $X$ is equal to the integral of the representation $X_\Omega$ of $X$ in any probability space model; in the case that $\Omega$ is given by a discrete probability model, one can check that this definition of expectation agrees with the one given in Notes 0. Using the former fact, we can translate the properties of integration already established to the probabilistic setting:
Proposition 12
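- (i) (Unsigned linearity) If $X, Y$ are unsigned random variables, and $c$ is a deterministic unsigned quantity, then $\mathbf{E}(X + Y) = \mathbf{E} X + \mathbf{E} Y$ and $\mathbf{E}\, cX = c\, \mathbf{E} X$. (Note that these identities hold even when $X, Y$ are not absolutely integrable.)
- (ii) (Complex linearity) If $X, Y$ are absolutely integrable random variables, and $c$ is a deterministic complex quantity, then $X + Y$ and $cX$ are also absolutely integrable, with $\mathbf{E}(X + Y) = \mathbf{E} X + \mathbf{E} Y$ and $\mathbf{E}\, cX = c\, \mathbf{E} X$.
- (iii) (Compatibility with probability) If $E$ is an event, then $\mathbf{E} 1_E = \mathbf{P}(E)$. In particular, $\mathbf{E} 1 = 1$.
- (iv) (Almost sure equivalence) If $X, Y$ are unsigned (resp. absolutely integrable) and $X = Y$ almost surely, then $\mathbf{E} X = \mathbf{E} Y$.
- (v) If $X$ is unsigned or absolutely integrable, then $\mathbf{E} |X| \geq 0$, with equality if and only if $X = 0$ almost surely.
- (vi) (Monotonicity) If $X, Y$ are unsigned or real-valued absolutely integrable, and $X \leq Y$ almost surely, then $\mathbf{E} X \leq \mathbf{E} Y$.
- (vii) (Markov inequality) If $X$ is unsigned or absolutely integrable, then $\mathbf{P}(|X| \geq t) \leq \frac{1}{t} \mathbf{E} |X|$ for any deterministic $t > 0$.
- (viii) (Triangle inequality) If $X$ is absolutely integrable, then $|\mathbf{E} X| \leq \mathbf{E} |X|$.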
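As a quick sanity check of the Markov and triangle inequalities in this proposition, the sketch below takes a real random variable with a finite law (the specific values and probabilities are hypothetical) and verifies both inequalities in exact rational arithmetic.

```python
from fractions import Fraction

# Hypothetical law of a real random variable X: value -> probability.
law = {
    Fraction(-4): Fraction(1, 10),
    Fraction(0): Fraction(2, 5),
    Fraction(1): Fraction(3, 10),
    Fraction(6): Fraction(1, 5),
}
assert sum(law.values()) == 1

def expectation(g):
    """E g(X) for a random variable with finite law."""
    return sum(g(x) * p for x, p in law.items())

mean = expectation(lambda x: x)
mean_abs = expectation(abs)

def tail(t):
    """P(|X| >= t)."""
    return sum(p for x, p in law.items() if abs(x) >= t)

thresholds = [Fraction(1, 2), Fraction(1), Fraction(4), Fraction(5), Fraction(6)]
markov_holds = all(tail(t) <= mean_abs / t for t in thresholds)
triangle_holds = abs(mean) <= mean_abs
```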
As before, we can use part (iv) to define expectation of scalar random variables that are only defined and finite almost surely, rather than surely.
Note that we have built the notion of expectation (and of related notions, such as absolute integrability) out of notions that were already probabilistic in nature, in the sense that they were unaffected if one replaced the underlying probabilistic model with an extension. Therefore, the notion of expectation is automatically probabilistic in the same sense. Because of this, we will be easily able to manipulate expectations of random variables without having to explicitly mention an underlying probability space $\Omega$, and so one will now see such spaces fade from view starting from this point in the course.
— 3. Exchanging limits with integrals or expectations —
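When performing analysis on measure spaces, it is important to know if one can interchange a limit with an integral:
$$ \lim_{n \to \infty} \int_\Omega f_n\ d\mu \stackrel{?}{=} \int_\Omega \lim_{n \to \infty} f_n\ d\mu. $$
Similarly, in probability theory, we often wish to interchange a limit with an expectation:
$$ \lim_{n \to \infty} \mathbf{E} X_n \stackrel{?}{=} \mathbf{E} \lim_{n \to \infty} X_n. $$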
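Of course, one needs the integrands or random variables to be either unsigned or absolutely integrable, and the limits to be well-defined to have any hope of doing this. Naively, one could hope that limits and integrals could always be exchanged when the expressions involved are well-defined, but this is unfortunately not the case. In the case of integration on, say, the real line $\mathbf{R}$ using Lebesgue measure $m$, we already see four key examples:
- (Moving bump example) Take $f_n := 1_{[n,n+1]}$. Then $\lim_{n \to \infty} \int_{\mathbf{R}} f_n\ dm = 1$, but $\int_{\mathbf{R}} \lim_{n \to \infty} f_n\ dm = 0$.
- (Spreading bump example) Take $f_n := \frac{1}{n} 1_{[0,n]}$. Then $\lim_{n \to \infty} \int_{\mathbf{R}} f_n\ dm = 1$, but $\int_{\mathbf{R}} \lim_{n \to \infty} f_n\ dm = 0$.
- (Concentrating bump example) Take $f_n := n 1_{[0,1/n]}$. Then $\lim_{n \to \infty} \int_{\mathbf{R}} f_n\ dm = 1$, but $\int_{\mathbf{R}} \lim_{n \to \infty} f_n\ dm = 0$.
- (Receding infinity example) Take $f_n := 1_{[n,\infty)}$. Then $\lim_{n \to \infty} \int_{\mathbf{R}} f_n\ dm = \infty$, but $\int_{\mathbf{R}} \lim_{n \to \infty} f_n\ dm = 0$.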
In all these examples, the limit of the integral exceeds the integral of the limit; by replacing $f_n$ with $-f_n$ in the first three examples (which involve absolutely integrable functions) one can also build examples where the limit of the integral is less than the integral of the limit. Most of these examples rely on the infinite measure of the real line and thus do not directly have probabilistic analogues, but the concentrating bump example involves functions that are all supported on the unit interval $[0,1]$ and thus also poses a problem in the probabilistic setting.
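The four examples can be tabulated with a short sketch. It encodes each $f_n$ as a height times the indicator of an interval (using the standard forms of the moving, spreading, concentrating and receding examples, which is an assumption about the intended functions), computes the integral as height times length, and evaluates $f_n$ at a few fixed points to see the pointwise decay. The function names are hypothetical.

```python
import math
from fractions import Fraction

def bump(kind, n):
    """Return (height, left, right) with f_n = height * 1_[left, right]."""
    if kind == "moving":
        return Fraction(1), Fraction(n), Fraction(n + 1)
    if kind == "spreading":
        return Fraction(1, n), Fraction(0), Fraction(n)
    if kind == "concentrating":
        return Fraction(n), Fraction(0), Fraction(1, n)
    if kind == "receding":
        return Fraction(1), Fraction(n), math.inf
    raise ValueError(kind)

def integral(kind, n):
    """Lebesgue integral of f_n: height times the length of the interval."""
    height, left, right = bump(kind, n)
    if right == math.inf:
        return math.inf
    return height * (right - left)

def value(kind, n, x):
    """Pointwise value f_n(x)."""
    height, left, right = bump(kind, n)
    return height if left <= x <= right else Fraction(0)

kinds = ("moving", "spreading", "concentrating", "receding")
points = [Fraction(1, 3), Fraction(5), Fraction(77, 2)]
n = 1000
integrals = {kind: integral(kind, n) for kind in kinds}
largest_values = {kind: max(value(kind, n, x) for x in points) for kind in kinds}
```

At each fixed point $x > 0$ the values $f_n(x)$ are eventually at most $1/n$, while the integrals stay equal to $1$ (or $+\infty$) for every $n$. The point $x = 0$ is deliberately excluded, since the concentrating bump diverges there, but it has measure zero.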
Nevertheless, there are three important cases in which we can relate the limit (or, in the case of Fatou’s lemma, the limit inferior) of the integral to the integral of the limit (or limit inferior). Informally, they are:
- (Fatou’s lemma) For unsigned $f_n$, the integral of the limit inferior cannot exceed the limit inferior of the integral. “Limits (or more precisely, limits inferior) can destroy (unsigned) mass, but cannot create it.”
- (Monotone convergence theorem) For unsigned monotone increasing $f_n$, the limit of the integral equals the integral of the limit.
- (Dominated convergence theorem) For $f_n$ that are uniformly dominated by an absolutely integrable function, the limit of the integral equals the integral of the limit.
These three results then have analogues for convergence of random variables. We will also mention a fourth useful tool in that setting, which allows one to exchange limits and expectations when one controls a higher moment. There are a few more such general results allowing limits to be exchanged with integrals or expectations, but my advice would be to work out such exchanges by hand rather than blindly cite (possibly incorrectly) an additional convergence theorem beyond the four mentioned above, as this is safer and will help strengthen one’s intuition on the situation.
We now state and prove these results more explicitly.
Lemma 13 (Fatou’s lemma) Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let $f_1, f_2, \dots: \Omega \to [0,+\infty]$ be a sequence of unsigned measurable functions. Then
$$ \int_\Omega \liminf_{n \to \infty} f_n\ d\mu \leq \liminf_{n \to \infty} \int_\Omega f_n\ d\mu. $$
An equivalent form of this lemma is that if one has
$$ \int_\Omega f_n\ d\mu \leq M $$
for some $M$ and all sufficiently large $n$, then one has
$$ \int_\Omega \liminf_{n \to \infty} f_n\ d\mu \leq M $$
as well. That is to say, if the original unsigned functions $f_n$ eventually have “mass” less than or equal to $M$, then the limit (inferior) $\liminf_{n \to \infty} f_n$ also has “mass” less than or equal to $M$. The limit may have substantially less mass, as the four examples above show, but it can never have more mass (asymptotically) than the functions that comprise the limit. Of course, one can replace limit inferior by limit in the left or right hand side if one knows that the relevant limit actually exists (but one cannot replace limit inferior by limit superior if one does not already have convergence, see Example 15 below). On the other hand, it is essential that the $f_n$ are unsigned for Fatou’s lemma to work, as can be seen by negating one of the first three key examples mentioned above.
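Proof: By definition of the unsigned integral, it suffices to show that
$$ \int_\Omega g\ d\mu \leq \liminf_{n \to \infty} \int_\Omega f_n\ d\mu $$
whenever $g$ is an unsigned simple function with $g \leq \liminf_{n \to \infty} f_n$. At present, $g$ is allowed to take the infinite value $+\infty$, but it suffices to establish this claim for $g$ that only take finite values, since the claim then follows for possibly infinite-valued $g$ by applying the claim with $g$ replaced by $\min(g, N)$ and then letting $N$ go to infinity.
Multiplying by $1 - \varepsilon$, it thus suffices to show that
$$ (1 - \varepsilon) \int_\Omega g\ d\mu \leq \liminf_{n \to \infty} \int_\Omega f_n\ d\mu $$
for any $0 < \varepsilon < 1$ and any unsigned finite-valued simple $g$ as above.
We can write $g$ as the sum $g = \sum_{i=1}^k a_i 1_{E_i}$ for some strictly positive finite $a_i$ and disjoint $E_i$; we allow the measures $\mu(E_i)$ to be infinite. On each $E_i$, we have $\liminf_{n \to \infty} f_n \geq a_i > (1 - \varepsilon) a_i$. Thus, if we define
$$ E_{i,N} := \{ \omega \in E_i: f_n(\omega) \geq (1 - \varepsilon) a_i \hbox{ for all } n \geq N \} $$
then the $E_{i,N}$ increase to $E_i$ as $N \to \infty$: $\bigcup_{N=1}^\infty E_{i,N} = E_i$. By continuity from below (Exercise 23 of Notes 0), we thus have
$$ \mu(E_{i,N}) \to \mu(E_i) $$
as $N \to \infty$. Since
$$ f_N \geq \sum_{i=1}^k (1 - \varepsilon) a_i 1_{E_{i,N}} $$
we conclude upon integration that
$$ \int_\Omega f_N\ d\mu \geq \sum_{i=1}^k (1 - \varepsilon) a_i \mu(E_{i,N}) $$
and thus on taking limit inferior
$$ \liminf_{N \to \infty} \int_\Omega f_N\ d\mu \geq \sum_{i=1}^k (1 - \varepsilon) a_i \mu(E_i). $$
But the right-hand side is $(1 - \varepsilon) \int_\Omega g\ d\mu$, and the claim follows.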
Of course, Fatou’s lemma may be phrased probabilistically:
Lemma 14 (Fatou’s lemma for random variables) Let $X_1, X_2, \dots$ be a sequence of unsigned random variables. Then
$$ \mathbf{E} \liminf_{n \to \infty} X_n \leq \liminf_{n \to \infty} \mathbf{E} X_n. $$
As a corollary, if $X_1, X_2, \dots$ are unsigned and converge almost surely to a random variable $X$, then
$$ \mathbf{E} X \leq \liminf_{n \to \infty} \mathbf{E} X_n. $$
Example 15 We now give an example to show that limit inferior cannot be replaced with limit superior in Fatou’s lemma. Let $X$ be drawn uniformly at random from $[0,1]$, and for each $n$, let $X_n$ be the $n^{th}$ binary digit of $X$, thus $X_n = 1$ when $2^n X$ has odd integer part, and $X_n = 0$ otherwise. (There is some ambiguity with the binary expansion when $X$ has a terminating binary expansion, but this event almost surely does not occur and can thus be safely ignored.) One has $\mathbf{E} X_n = \frac{1}{2}$ for all $n$ (why?). It is then easy to see that $\liminf_{n \to \infty} X_n$ is almost surely $0$ (which is consistent with Fatou’s lemma) but $\limsup_{n \to \infty} X_n$ is almost surely $1$ (so Fatou’s lemma fails if one replaces limit inferior with limit superior).
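The binary digit example can be explored with exact arithmetic. The sketch below assumes the $n^{th}$ digit is given by the parity of the integer part of $2^n x$; it computes the expectation of the digit by summing over the $2^n$ dyadic intervals on which the digit is constant, and lists the digits of the single sample point $x = 1/3$, which alternate forever. One sample point does not prove the almost sure statement, but it shows the typical behaviour of limit inferior $0$ and limit superior $1$.

```python
import math
from fractions import Fraction

def digit(x, n):
    """n-th binary digit of x in [0, 1): parity of the integer part of 2^n * x."""
    return math.floor(x * 2 ** n) % 2

def expectation_of_digit(n):
    """E X_n for uniform X: the digit is k % 2 on [k / 2^n, (k + 1) / 2^n)."""
    return sum(Fraction(k % 2, 2 ** n) for k in range(2 ** n))

expectations = [expectation_of_digit(n) for n in range(1, 11)]

# Digits of the sample point 1/3 = 0.010101... in binary.
x = Fraction(1, 3)
digits = [digit(x, n) for n in range(1, 21)]
```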
Next, we establish the monotone convergence theorem.
Theorem 16 (Monotone convergence theorem) Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and let $f_1, f_2, \dots: \Omega \to [0,+\infty]$ be a sequence of unsigned measurable functions which is monotone increasing, thus $f_n(\omega) \leq f_m(\omega)$ for all $n \leq m$ and $\omega \in \Omega$. Then
$$ \int_\Omega \lim_{n \to \infty} f_n\ d\mu = \lim_{n \to \infty} \int_\Omega f_n\ d\mu. $$
Note that the limits exist on both sides because monotone sequences always have limits. Indeed the limit in either side is equal to the supremum. The receding infinity example shows that it is important that the functions here are monotone increasing rather than monotone decreasing. We also observe that it is enough for the $f_n$ to be increasing almost everywhere rather than everywhere, since one can then modify the $f_n$ on a set of measure zero to be increasing everywhere, which does not affect the integrals on either side of this theorem.
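Proof: From Fatou’s lemma we already have
$$ \int_\Omega \lim_{n \to \infty} f_n\ d\mu \leq \lim_{n \to \infty} \int_\Omega f_n\ d\mu. $$
On the other hand, from monotonicity we see that
$$ \int_\Omega f_m\ d\mu \leq \int_\Omega \lim_{n \to \infty} f_n\ d\mu $$
for any natural number $m$, and on taking limits as $m \to \infty$ we obtain the claim.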
Note that continuity from below for measures (Exercise 23.3 of Notes 0) can be viewed as the special case of the monotone convergence theorem when the functions $f_n$ are all indicator functions.
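An important corollary of the monotone convergence theorem is that one can freely interchange infinite sums with integrals for unsigned functions, that is to say
$$ \int_\Omega \sum_{n=1}^\infty f_n\ d\mu = \sum_{n=1}^\infty \int_\Omega f_n\ d\mu $$
for any unsigned $f_1, f_2, \dots: \Omega \to [0,+\infty]$ (not necessarily monotone). Indeed, to see this one simply applies the monotone convergence theorem to the partial sums $F_N := \sum_{n=1}^N f_n$.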
We of course can translate this into the probabilistic context:
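Theorem 17 (Monotone convergence theorem for random variables) Let $0 \leq X_1 \leq X_2 \leq \dots$ be a monotone non-decreasing sequence of unsigned random variables. Then
$$ \mathbf{E} \lim_{n \to \infty} X_n = \lim_{n \to \infty} \mathbf{E} X_n. $$
Similarly, for any unsigned random variables $Y_1, Y_2, \dots$, we have
$$ \mathbf{E} \sum_{n=1}^\infty Y_n = \sum_{n=1}^\infty \mathbf{E} Y_n. $$
Again, it is sufficient for the $X_n$ to be non-decreasing almost surely. We note a basic but important corollary of this theorem, namely the (first) Borel-Cantelli lemma: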
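Lemma 18 (Borel-Cantelli lemma) Let $E_1, E_2, \dots$ be a sequence of events with $\sum_{n=1}^\infty \mathbf{P}(E_n) < \infty$. Then almost surely, at most finitely many of the events $E_n$ hold; that is to say, one has $\sum_{n=1}^\infty 1_{E_n} < \infty$ almost surely.
Proof: From the monotone convergence theorem, we have
$$ \mathbf{E} \sum_{n=1}^\infty 1_{E_n} = \sum_{n=1}^\infty \mathbf{E} 1_{E_n} = \sum_{n=1}^\infty \mathbf{P}(E_n) < \infty. $$
By Markov’s inequality, this implies that $\sum_{n=1}^\infty 1_{E_n}$ is almost surely finite, as required.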
As the above proof shows, the Borel-Cantelli lemma is almost a triviality if one has the machinery of expectation (or integration); but it is remarkably hard to prove the lemma without that machinery, and it is an instructive exercise to attempt to do so.
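For a concrete instance of the Borel-Cantelli lemma, one can take a single uniform random variable $U$ on $[0,1]$ and the events $E_n := (U < 1/n^2)$, so that $\sum_n \mathbf{P}(E_n) = \sum_n 1/n^2 < \infty$. This family of events is a hypothetical choice made for illustration. The sketch below fixes an outcome $u$ and lists which of the first $N$ events occur; the list stops growing once $n^2 \geq 1/u$, so only finitely many events occur for every $u > 0$.

```python
from fractions import Fraction

def occurring_events(u, N):
    """Indices n <= N for which the event E_n = (U < 1 / n^2) holds at the outcome U = u."""
    return [n for n in range(1, N + 1) if u < Fraction(1, n * n)]

def probability_sum(N):
    """Partial sum of P(E_n) = 1 / n^2 for n <= N."""
    return sum(Fraction(1, n * n) for n in range(1, N + 1))

u = Fraction(1, 50)
short_run = occurring_events(u, 100)
long_run = occurring_events(u, 10_000)
partial_sums = [probability_sum(N) for N in (10, 100, 1000)]
```

The partial sums of the probabilities stay below $2$, in line with the hypothesis of the lemma, and extending the horizon $N$ does not add any further occurring events.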
We will develop a partial converse to the above lemma (the “second” Borel-Cantelli lemma) in a subsequent set of notes. For now, we give a crude converse in which we assume not only that the $\mathbf{P}(E_n)$ sum to infinity, but that they are in fact uniformly bounded from below:
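Exercise 19 Let $E_1, E_2, \dots$ be a sequence of events with $\inf_n \mathbf{P}(E_n) > 0$. Show that with positive probability, an infinite number of the $E_n$ hold; that is to say, $\mathbf{P}( \sum_{n=1}^\infty 1_{E_n} = \infty ) > 0$. (Hint: if $\mathbf{P}(E_n) \geq \delta > 0$ for all $n$, establish the lower bound $\mathbf{P}( \sum_{n \leq N} 1_{E_n} \geq \delta N / 2 ) \geq \delta/2$ for all $N$. Alternatively, one can apply Fatou’s lemma to the random variables $1_{\overline{E_n}}$.)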
Exercise 20 Let be a sequence such that . Show that there exist a sequence of events modeled by some probability space , such that for all , and such that almost surely infinitely many of the occur. Thus we see that the hypothesis in the Borel-Cantelli lemma cannot be relaxed.
Finally, we give the dominated convergence theorem.
Theorem 21 (Dominated convergence theorem) Let $(\Omega, {\mathcal F}, \mu)$ be a measure space, and let $f_1, f_2, \dots: \Omega \to {\bf C}$ be measurable functions which converge pointwise to some limit. Suppose that there is an unsigned absolutely integrable function $g: \Omega \to [0,+\infty]$ which dominates the $f_n$ in the sense that $|f_n(\omega)| \leq g(\omega)$ for all $n$ and all $\omega \in \Omega$. Then
$$\int_\Omega \lim_{n \to \infty} f_n\ d\mu = \lim_{n \to \infty} \int_\Omega f_n\ d\mu.$$
In particular, the limit on the right-hand side exists.
Again, it will suffice for $g$ to dominate each $f_n$ almost everywhere rather than everywhere, as one can upgrade this to everywhere domination by modifying each $f_n$ on a set of measure zero. Similarly, pointwise convergence can be replaced with pointwise convergence almost everywhere. The domination of each $f_n$ by a single function $g$ implies that the integrals $\int_\Omega |f_n|\ d\mu$ are uniformly bounded in $n$, but this latter condition is not sufficient by itself to guarantee interchangeability of the limit and integral, as can be seen by the first three examples given at the start of this section.
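Proof: By splitting into real and imaginary parts, we may assume without loss of generality that the $f_n$ are real-valued. As $g$ is absolutely integrable, it is finite almost everywhere; after modification on a set of measure zero we may assume it is finite everywhere. Let $f$ denote the pointwise limit of the $f_n$. From Fatou’s lemma applied to the unsigned functions $g - f_n$ and $g + f_n$, we have
$$\int_\Omega g - f\ d\mu \leq \liminf_{n \to \infty} \int_\Omega g - f_n\ d\mu$$
and
$$\int_\Omega g + f\ d\mu \leq \liminf_{n \to \infty} \int_\Omega g + f_n\ d\mu.$$
Rearranging this (taking crucial advantage of the finite nature of $\int_\Omega g\ d\mu$, and hence of $\int_\Omega f\ d\mu$ and $\int_\Omega f_n\ d\mu$), we conclude that
$$\limsup_{n \to \infty} \int_\Omega f_n\ d\mu \leq \int_\Omega f\ d\mu \leq \liminf_{n \to \infty} \int_\Omega f_n\ d\mu$$
and the claim follows.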
Remark 22 Amusingly, one can use the dominated convergence theorem to give an (extremely indirect) proof of the divergence of the harmonic series $\sum_{n=1}^\infty \frac{1}{n}$. For, if that series were convergent, then the function $\sum_{n=1}^\infty \frac{1}{n} 1_{[n-1,n]}$ would be absolutely integrable, and the spreading bump example described above would contradict the dominated convergence theorem. (Expert challenge: see if you can deconstruct the above argument enough to lower bound the rate of divergence of the harmonic series $\sum_{n=1}^\infty \frac{1}{n}$.)
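The link between the spreading bump and the harmonic series can be made concrete with a small exact computation. This is a sketch only, and it assumes the spreading bump is the sequence $f_n = \frac{1}{n} 1_{[0,n]}$ on the real line; the function names are hypothetical. Each $f_n$ has integral $1$ while $f_n \to 0$ pointwise, and the smallest function dominating all the $f_n$ equals $\frac{1}{k}$ on $(k-1,k]$, so its integral over $[0,N]$ is the $N$-th harmonic number.

```python
from fractions import Fraction

# Assumption: the "spreading bump" is f_n = (1/n) * 1_{[0, n]}.

def bump(n, x):
    """Value of f_n at the point x."""
    return Fraction(1, n) if 0 <= x <= n else Fraction(0)

def bump_integral(n):
    """Integral of f_n over the real line: height 1/n times width n."""
    return Fraction(1, n) * n

def envelope_integral(n_max):
    """Integral over [0, n_max] of sup_n f_n. On (k-1, k] the supremum is 1/k
    (attained by f_k), so the integral is the harmonic number H_{n_max}."""
    return sum(Fraction(1, k) for k in range(1, n_max + 1))

if __name__ == "__main__":
    print(bump_integral(1000), bump(1000, Fraction(5, 2)))
    print([float(envelope_integral(10 ** j)) for j in range(1, 5)])
```

Since the integrals of the $f_n$ do not converge to the integral of the zero limit, no absolutely integrable function can dominate every $f_n$, which is consistent with the unbounded growth of the envelope integrals printed above.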
We again translate the above theorem to the probabilistic context:
Theorem 23 (Dominated convergence theorem for random variables) Let $X_1, X_2, \dots$ be scalar random variables which converge almost surely to a limit $X$. Suppose there is an unsigned absolutely integrable random variable $Y$ such that $|X_n| \leq Y$ almost surely for each $n$. Then
$$\lim_{n \to \infty} {\bf E} X_n = {\bf E} X.$$
As a corollary of the dominated convergence theorem for random variables we have the bounded convergence theorem: if $X_1, X_2, \dots$ are scalar random variables that converge almost surely to a limit $X$, and are almost surely bounded in magnitude by a uniform constant $M$, then we have
$$\lim_{n \to \infty} {\bf E} X_n = {\bf E} X.$$
(In Durrett, the bounded convergence theorem is proven first, and then used to establish Fatou’s lemma and the dominated and monotone convergence theorems. The order in which one establishes these results, which are all closely related to each other, is largely a matter of personal taste.) A closely related corollary (which can also be established directly) is that if $X_n$ are scalar absolutely integrable random variables that converge uniformly to $X$ (thus, for each $\varepsilon > 0$ there is $N$ such that $|X_n - X| \leq \varepsilon$ is surely true for all $n \geq N$), then ${\bf E} X_n$ converges to ${\bf E} X$.
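A further corollary of the dominated convergence theorem is that one has the identity
$${\bf E} \sum_{n=1}^\infty Y_n = \sum_{n=1}^\infty {\bf E} Y_n$$
whenever $Y_1, Y_2, \dots$ are scalar random variables with $\sum_{n=1}^\infty |Y_n|$ absolutely integrable (or equivalently, that $\sum_{n=1}^\infty {\bf E} |Y_n|$ is finite).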
Another useful variant of the dominated convergence theorem is
Theorem 24 (Convergence for random variables with bounded moment) Let $X_1, X_2, \dots$ be scalar random variables which converge almost surely to a limit $X$. Suppose there is $\varepsilon > 0$ and $M > 0$ such that ${\bf E} |X_n|^{1+\varepsilon} \leq M$ for all $n$. Then
$$\lim_{n \to \infty} {\bf E} X_n = {\bf E} X.$$
This theorem fails for $\varepsilon = 0$, as the concentrating bump example shows. The case $\varepsilon = 1$ (that is to say, bounded second moment ${\bf E} |X_n|^2$) is already quite useful. The intuition here is that concentrating bumps are in some sense the only obstruction to interchanging limits and expectations, and these can be eliminated by hypotheses such as a bounded higher moment hypothesis or a domination hypothesis.
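Proof: By taking real and imaginary parts we may assume that the $X_n$ (and hence $X$) are real-valued. For any natural number $m$, let $X_n^{(m)} := \max(\min(X_n, m), -m)$ denote the truncation of $X_n$ to the interval $[-m,m]$, and similarly define $X^{(m)} := \max(\min(X, m), -m)$. Then $X_n^{(m)}$ converges pointwise to $X^{(m)}$, and hence by the bounded convergence theorem
$$\lim_{n \to \infty} {\bf E} X_n^{(m)} = {\bf E} X^{(m)}.$$
On the other hand, we have
$$|X_n - X_n^{(m)}| \leq m^{-\varepsilon} |X_n|^{1+\varepsilon}$$
(why?) and thus on taking expectations and using the triangle inequality
$${\bf E} X_n = {\bf E} X_n^{(m)} + O( m^{-\varepsilon} M )$$
where we are using the asymptotic notation $O(A)$ to denote a quantity bounded in magnitude by $CA$ for an absolute constant $C$. Also, from Fatou’s lemma we have
$${\bf E} |X|^{1+\varepsilon} \leq M$$
so we similarly have
$${\bf E} X = {\bf E} X^{(m)} + O( m^{-\varepsilon} M ).$$
Putting all this together, we see that
$$\liminf_{n \to \infty} {\bf E} X_n, \limsup_{n \to \infty} {\bf E} X_n = {\bf E} X + O( m^{-\varepsilon} M ).$$
Sending $m \to \infty$, we obtain the claim.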
Remark 25 The essential point about the condition ${\bf E} |X_n|^{1+\varepsilon} \leq M$ was that the function $x \mapsto x^{1+\varepsilon}$ grew faster than linearly as $x \to \infty$. One could accomplish the same result with any other function with this property, e.g. a hypothesis such as ${\bf E} |X_n| \log |X_n| \leq M$ would also suffice. The most natural general condition to impose here is that of uniform integrability, which encompasses the hypotheses already mentioned, but we will not focus on this condition here.
Exercise 26 (Scheffé’s lemma) Let $X_1, X_2, \dots$ be a sequence of absolutely integrable scalar random variables that converge almost surely to another absolutely integrable scalar random variable $X$. Suppose also that ${\bf E} |X_n|$ converges to ${\bf E} |X|$ as $n \to \infty$. Show that ${\bf E} |X - X_n|$ converges to zero as $n \to \infty$. (Hint: there are several ways to prove this result, known as Scheffé’s lemma. One is to split $X_n$ into two components $X_n = X_{n,1} + X_{n,2}$, such that $X_{n,1}$ is dominated by $|X|$ but converges almost surely to $X$, and $X_{n,2}$ is such that $|X_n| = |X_{n,1}| + |X_{n,2}|$. Then apply the dominated convergence theorem.)
— 4. The distribution of a random variable —
We have seen that the expectation of a random variable is a special case of the more general notion of Lebesgue integration on a measure space. There is however another way to think of expectation as a special case of integration, which is particularly convenient for computing expectations. We first need the following definition.
Definition 27 Let $X$ be a random variable taking values in a measurable space $R = (R, {\mathcal B})$. The distribution of $X$ (also known as the law of $X$) is the probability measure $\mu_X$ on $R$ defined by the formula
$$\mu_X(S) := {\bf P}( X \in S )$$
for all measurable sets $S \in {\mathcal B}$; one easily sees from the Kolmogorov axioms that this is indeed a probability measure.
In the language of measure theory, the distribution $\mu_X$ on $R$ is the push-forward of the probability measure ${\bf P}$ on the sample space $\Omega$ by the model $X_\Omega: \Omega \to R$ of $X$ on that sample space.
Example 28 If $X$ only takes on at most countably many values (and if every point in $R$ is measurable), then the distribution $\mu_X$ is the discrete measure that assigns each point $a$ in the range of $X$ a measure of ${\bf P}(X = a)$.
Example 29 If $X$ is a real random variable with cumulative distribution function $F$, then $\mu_X$ is the Lebesgue-Stieltjes measure associated to $F$. For instance, if $X$ is drawn uniformly at random from $[0,1]$, then $\mu_X$ is Lebesgue measure restricted to $[0,1]$. In particular, two scalar variables are equal in distribution if and only if they have the same cumulative distribution function.
Example 30 If $X$ and $Y$ are the results of two separate rolls of a fair die (as in Example 3 of Notes 0), then $X$ and $Y$ are equal in distribution, but are not equal as random variables.
Remark 31 In the converse direction, given a probability measure $\mu$ on a measurable space $(R, {\mathcal B})$, one can always build a probability space model and a random variable $X$ represented by that model whose distribution is $\mu$. Indeed, one can perform the “tautological” construction of defining the probability space model to be $\Omega := (R, {\mathcal B}, \mu)$, and $X: \Omega \to R$ to be the identity function $X(\omega) := \omega$, and then one easily checks that $\mu_X = \mu$. Compare with Corollaries 26 and 29 of Notes 0. Furthermore, one can view this tautological model as a “base” model for random variables of distribution $\mu$ as follows. Suppose one has a random variable $X$ of distribution $\mu$ which is modeled by some other probability space $\Omega' = (\Omega', {\mathcal F}', \mu')$, thus $X_{\Omega'}: \Omega' \to R$ is a measurable function such that
$$\mu(S) = {\bf P}( X \in S ) = \mu'( \{ \omega' \in \Omega': X_{\Omega'}(\omega') \in S \} )$$
for all $S \in {\mathcal B}$. Then one can view the probability space $\Omega'$ as an extension of the tautological probability space $\Omega$ using $X_{\Omega'}$ as the factor map.
We say that two random variables $X, Y$ are equal in distribution, and write $X \stackrel{d}{=} Y$, if they have the same law: $\mu_X = \mu_Y$, that is to say ${\bf P}(X \in S) = {\bf P}(Y \in S)$ for any measurable set $S$ in the range. This definition makes sense even when $X, Y$ are defined on different sample spaces. Roughly speaking, the distribution captures the “size” and “shape” of the random variable, but not its “location” or how it relates to other random variables. We also say that $Y$ is a copy of $X$ if they are equal in distribution. For instance, the two dice rolls in Example 3 of Notes 0 are copies of each other.
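The distinction between equality in distribution and equality as random variables can be checked by hand on a finite sample space. The sketch below is illustrative only: it assumes the two dice rolls are modeled on the 36-point sample space of ordered pairs with uniform probability, and the names `law`, `first_roll`, `second_roll` and `prob_equal` are hypothetical helpers, not notation from the notes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Assumed model: sample points are ordered pairs of faces, each of probability 1/36.
OMEGA = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def law(variable, omega=OMEGA):
    """Push-forward of the probability measure: maps each value a to P(variable = a)."""
    mu = defaultdict(Fraction)
    for w, p in omega.items():
        mu[variable(w)] += p
    return dict(mu)

def first_roll(w):
    return w[0]

def second_roll(w):
    return w[1]

def prob_equal(x, y, omega=OMEGA):
    """P(X = Y); this depends on the joint model, not only on the two laws."""
    return sum(p for w, p in omega.items() if x(w) == y(w))

if __name__ == "__main__":
    print(law(first_roll) == law(second_roll))   # same law
    print(prob_equal(first_roll, second_roll))   # but not almost surely equal
```

The two rolls have identical laws, yet they agree only with probability $1/6$, so they are copies of each other without being equal as random variables.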
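Theorem 32 (Change of variables formula) Let $X$ be a random variable taking values in a measurable space $R = (R, {\mathcal B})$. Let $f: R \to {\bf R}$ or $f: R \to {\bf C}$ be a measurable scalar function (giving ${\bf R}$ or ${\bf C}$ the Borel $\sigma$-algebra of course) such that either $f \geq 0$, or that ${\bf E} |f(X)| < \infty$. Then
$${\bf E} f(X) = \int_R f(x)\ d\mu_X(x).$$
Thus for instance, if $X$ is a real random variable, then
$${\bf E} |X| = \int_{\bf R} |x|\ d\mu_X(x)$$
and more generally
$${\bf E} |X|^p = \int_{\bf R} |x|^p\ d\mu_X(x)$$
for all $0 < p < \infty$; furthermore, if $X$ is unsigned or absolutely integrable, one has
$${\bf E} X = \int_{\bf R} x\ d\mu_X(x).$$
The point here is that the integration is not over some unspecified sample space $\Omega$, but over a very explicit domain, namely the reals; we have “changed variables” to integrate over ${\bf R}$ instead of over $\Omega$, with the distribution $\mu_X$ representing the “Jacobian” factor that typically shows up in such change of variables formulae.
If $X$ is a scalar variable that only takes on at most countably many values $x_1, x_2, \dots$, the change of variables formula tells us that
$${\bf E} X = \sum_i x_i {\bf P}( X = x_i )$$
if $X$ is unsigned or absolutely integrable.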
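Proof: First suppose that $f$ is unsigned and only takes on a finite number $a_1, \dots, a_n$ of values. Then
$$f(X) = \sum_{i=1}^n a_i 1_{f(X) = a_i}$$
and hence
$${\bf E} f(X) = \sum_{i=1}^n a_i {\bf P}( f(X) = a_i ) = \sum_{i=1}^n a_i \mu_X( \{ x \in R: f(x) = a_i \} ) = \int_R f(x)\ d\mu_X(x)$$
as required.
Next, suppose that $f$ is unsigned but can take on infinitely many values. We can express $f$ as the monotone increasing limit of functions $f_n$ that only take a finite number of values; for instance we can define $f_n$ to be $f$ rounded down to the largest multiple of $2^{-n}$ not exceeding either $n$ or $f$. By the preceding computation, we have
$${\bf E} f_n(X) = \int_R f_n(x)\ d\mu_X(x)$$
and on taking limits as $n \to \infty$ using the monotone convergence theorem we obtain the claim in this case.
Now suppose that $f$ is real-valued with ${\bf E} |f(X)| < \infty$. We write $f = f_1 - f_2$ where $f_1 := \max(f, 0)$ and $f_2 := \max(-f, 0)$, then we have ${\bf E} f_1(X), {\bf E} f_2(X) < \infty$ and
$${\bf E} f_i(X) = \int_R f_i(x)\ d\mu_X(x)$$
for $i = 1, 2$. Subtracting these two identities, we obtain the claim.
Finally, the case of complex-valued $f$ with ${\bf E} |f(X)| < \infty$ follows from the real-valued case by taking real and imaginary parts.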
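Example 33 Let $X$ be drawn from the uniform distribution on $[0,1]$, then
$${\bf E} f(X) = \int_0^1 f(x)\ dx$$
for any Riemann integrable $f$; thus for instance
$${\bf E} X^p = \frac{1}{p+1}$$
for any $0 < p < \infty$.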
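In the discrete setting, the change of variables formula says that an expectation can be computed either as a sum over the sample space or as a sum over the values of the random variable weighted by its distribution. The following minimal sketch checks this exactly; the choice of sample space (two fair dice), the random variable (their sum) and all function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumed model: two fair dice, uniform on the 36 ordered pairs.
OMEGA = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}

def total(w):
    """The random variable X = sum of the two rolls."""
    return w[0] + w[1]

def expectation_on_sample_space(f, x, omega):
    """E f(X), computed as a sum over sample points."""
    return sum(p * f(x(w)) for w, p in omega.items())

def distribution(x, omega):
    """The law of X: value -> P(X = value)."""
    mu = {}
    for w, p in omega.items():
        mu[x(w)] = mu.get(x(w), Fraction(0)) + p
    return mu

def expectation_via_law(f, mu):
    """E f(X), computed as the integral of f against the law of X."""
    return sum(f(v) * m for v, m in mu.items())

if __name__ == "__main__":
    square = lambda v: v * v
    print(expectation_on_sample_space(square, total, OMEGA))
    print(expectation_via_law(square, distribution(total, OMEGA)))
```

The second computation only needs the 11 possible values of the sum and their probabilities, and never refers to the underlying sample space.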
Remark 34 An alternate way to prove the change of variables formula is to observe that the formula is obviously true when one uses the tautological model $(R, {\mathcal B}, \mu_X)$ for $X$, and then the claim follows from the model-independence of expectation and the observation from Remark 31 that any other model for $X$ is an extension of the tautological model.
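Exercise 35 Let $f: {\bf R} \to [0,+\infty]$ be a measurable function with $\int_{\bf R} f(x)\ dx = 1$. If one defines $m_f(E)$ for any Borel subset $E$ of ${\bf R}$ by the formula
$$m_f(E) := \int_E f(x)\ dx,$$
show that $m_f$ is a probability measure on ${\bf R}$ with Stieltjes measure function $F(t) := \int_{-\infty}^t f(x)\ dx$. If $X$ is a real random variable with probability distribution $m_f$ (in which case we call $X$ a random variable with an absolutely continuous distribution, and $f$ the probability density function (PDF) of $X$), show that
$${\bf E} G(X) = \int_{\bf R} G(x) f(x)\ dx$$
when either $G: {\bf R} \to [0,+\infty]$ is an unsigned measurable function, or $G: {\bf R} \to {\bf C}$ is measurable with $G(X)$ absolutely integrable (or equivalently, that $\int_{\bf R} |G(x)| f(x)\ dx < \infty$).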
Exercise 36 Let $X$ be a real random variable with the probability density function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ of the standard normal distribution. Establish the Stein identity
$${\bf E} X F(X) = {\bf E} F'(X)$$
whenever $F: {\bf R} \to {\bf R}$ is a continuously differentiable function with $F$ and $F'$ both of polynomial growth (i.e., there exist constants $C, A > 0$ such that $|F(x)|, |F'(x)| \leq C (1+|x|)^A$ for all $x \in {\bf R}$). There is a robust converse to this identity which underpins the basis of Stein’s method, discussed in this previous blog post. Use this identity recursively to establish the identities
$${\bf E} X^k = 0$$
when $k$ is an odd natural number and
$${\bf E} X^k = \frac{k!}{2^{k/2} (k/2)!}$$
when $k$ is an even natural number. (This quantity is also known as the double factorial $(k-1)!!$ of $k-1$.)
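As a sanity check on the closed form (not a substitute for the proof), one can run the recursion that the Stein identity suggests. The sketch below assumes that applying the identity to $F(x) = x^{k-1}$ gives ${\bf E} X^k = (k-1) {\bf E} X^{k-2}$, with base cases ${\bf E} X^0 = 1$ and ${\bf E} X = 0$; the function names are hypothetical.

```python
from math import factorial

def gaussian_moment(k):
    """E X^k for a standard normal X, via the assumed recursion
    E X^k = (k - 1) * E X^(k-2) with E X^0 = 1 and E X^1 = 0."""
    if k == 0:
        return 1
    if k == 1:
        return 0
    return (k - 1) * gaussian_moment(k - 2)

def closed_form(k):
    """0 for odd k, and k! / (2^(k/2) (k/2)!) for even k."""
    if k % 2 == 1:
        return 0
    return factorial(k) // (2 ** (k // 2) * factorial(k // 2))

if __name__ == "__main__":
    for k in range(0, 11):
        print(k, gaussian_moment(k), closed_form(k))
```

The two columns agree for every $k$ tried; for even $k$ they list the products $1, 1, 3, 15, 105, 945, \dots$ of the odd numbers up to $k-1$.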
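Exercise 37 Let $X$ be a real random variable with cumulative distribution function $F_X(x) := {\bf P}(X \leq x)$. Show that
$${\bf E} e^{tX} = \int_{\bf R} (1 - F_X(x)) t e^{tx}\ dx$$
for all $t > 0$. If $X$ is nonnegative, show that
$${\bf E} X^p = \int_0^\infty (1 - F_X(x)) p x^{p-1}\ dx$$
for all $p > 0$.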
— 5. Some basic inequalities —
The change of variables formula allows us, in principle at least, to compute the expectation of a scalar random variable as an integral. In very simple situations, for instance when $X$ has one of the standard distributions (e.g. uniform, Gaussian, binomial, etc.), this allows us to compute such expectations exactly. However, once one gets to more complicated situations, one usually does not expect to be able to evaluate the required integrals in closed form. In such situations, it is often more useful to have some general inequalities concerning expectation, rather than identities.
We therefore record here for future reference some basic inequalities concerning expectation that we will need in the sequel. We have already seen the triangle inequality
$$|{\bf E} X| \leq {\bf E} |X|$$
for absolutely integrable $X$, and the Markov inequality
$${\bf P}( |X| \geq t ) \leq \frac{1}{t} {\bf E} |X|$$
for arbitrary scalar $X$ and $t > 0$ (note the inequality is trivial if $X$ is not absolutely integrable). Applying the triangle inequality to the difference $X - Y$ of two absolutely integrable random variables $X, Y$, we obtain the variant
$$|{\bf E} X - {\bf E} Y| \leq {\bf E} |X - Y|.$$
Thus, for instance, if $X_n$ is a sequence of absolutely integrable scalar random variables which converges in $L^1$ to another absolutely integrable random variable $X$, in the sense that ${\bf E} |X_n - X| \to 0$ as $n \to \infty$, then ${\bf E} X_n \to {\bf E} X$ as $n \to \infty$.
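Similarly, applying the Markov inequality to the quantity $|X - {\bf E} X|^2$ we obtain the important Chebyshev inequality
$${\bf P}( |X - {\bf E} X| \geq t ) \leq \frac{1}{t^2} {\bf Var}(X)$$
for absolutely integrable $X$ and $t > 0$, where the variance ${\bf Var}(X)$ of $X$ is defined as
$${\bf Var}(X) := {\bf E} |X - {\bf E} X|^2.$$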
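To get a feel for how much room the Markov and Chebyshev inequalities leave, one can compute both sides exactly for a small discrete distribution. This is a hedged sketch: the fair die and every helper name below are illustrative assumptions, and the “gap” functions simply return the right-hand side minus the left-hand side of each inequality.

```python
from fractions import Fraction

# Illustrative law: a fair six-sided die, stored as value -> probability.
DIE = {v: Fraction(1, 6) for v in range(1, 7)}

def expectation(f, law):
    return sum(p * f(v) for v, p in law.items())

def probability(event, law):
    return sum(p for v, p in law.items() if event(v))

def variance(law):
    mean = expectation(lambda v: v, law)
    return expectation(lambda v: (v - mean) ** 2, law)

def markov_gap(t, law):
    """E|X| / t - P(|X| >= t); nonnegative by Markov's inequality."""
    return expectation(abs, law) / t - probability(lambda v: abs(v) >= t, law)

def chebyshev_gap(t, law):
    """Var(X) / t^2 - P(|X - EX| >= t); nonnegative by Chebyshev's inequality."""
    mean = expectation(lambda v: v, law)
    return variance(law) / t ** 2 - probability(lambda v: abs(v - mean) >= t, law)

if __name__ == "__main__":
    print(variance(DIE), markov_gap(5, DIE), chebyshev_gap(2, DIE))
```

Both gaps are strictly positive here, which reflects the fact that these inequalities are rarely sharp for a fixed distribution even though they cannot be improved in general.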
Next, we record
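Lemma 38 (Jensen’s inequality) If $f: {\bf R} \to {\bf R}$ is a convex function, and $X$ is a real random variable with $X$ and $f(X)$ both absolutely integrable, then
$$f( {\bf E} X ) \leq {\bf E} f(X).$$
Proof: Let $x_0$ be a real number. Being convex, the graph of $f$ must be supported by some line at $(x_0, f(x_0))$, that is to say there exists a slope $c$ (depending on $x_0$) such that $f(x) \geq f(x_0) + c(x - x_0)$ for all $x \in {\bf R}$. (If $f$ is differentiable at $x_0$, one can take $c$ to be the derivative of $f$ at $x_0$, but one always has a supporting line even in the non-differentiable case.) In particular
$$f(X) \geq f(x_0) + c(X - x_0).$$
Taking expectations and using linearity of expectation, we conclude
$${\bf E} f(X) \geq f(x_0) + c({\bf E} X - x_0)$$
and the claim follows from setting $x_0 := {\bf E} X$.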
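Exercise 39 (Complex Jensen inequality) Let $f: {\bf C} \to {\bf R}$ be a convex function (thus $f((1-t) z + t w) \leq (1-t) f(z) + t f(w)$ for all complex $z, w$ and all $0 \leq t \leq 1$), and let $X$ be a complex random variable with $X$ and $f(X)$ both absolutely integrable. Show that
$$f( {\bf E} X ) \leq {\bf E} f(X).$$
Note that the triangle inequality is the special case of Jensen’s inequality (or the complex Jensen’s inequality, if $X$ is complex-valued) corresponding to the convex function $f(x) := |x|$ on ${\bf R}$ (or $f(z) := |z|$ on ${\bf C}$). Another useful example is
$$|{\bf E} X|^2 \leq {\bf E} |X|^2.$$
Applying Jensen’s inequality to the convex function $f(x) := e^x$ and the random variable $\log X$ for some $X > 0$, we obtain the arithmetic mean-geometric mean inequality
$$\exp( {\bf E} \log X ) \leq {\bf E} X$$
assuming that $X$ and $\log X$ are absolutely integrable.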
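As a related application of convexity, observe from the convexity of the function $x \mapsto e^x$ that
$$e^{tx + (1-t)y} \leq t e^x + (1-t) e^y$$
for any $0 \leq t \leq 1$ and $x, y \in {\bf R}$. This implies in particular Young’s inequality
$$|X| |Y| \leq \frac{1}{p} |X|^p + \frac{1}{q} |Y|^q$$
for any scalar $X, Y$ and any exponents $1 < p, q < \infty$ with $\frac{1}{p} + \frac{1}{q} = 1$; note that this inequality is also trivially true if one or both of $|X|, |Y|$ are infinite. Taking expectations, we conclude that
$${\bf E} |X| |Y| \leq \frac{1}{p} {\bf E} |X|^p + \frac{1}{q} {\bf E} |Y|^q$$
if $X, Y$ are scalar random variables and $1 < p, q < \infty$ are deterministic exponents with $\frac{1}{p} + \frac{1}{q} = 1$. In particular, if $|X|^p, |Y|^q$ are absolutely integrable, then so is $XY$, and
$$|{\bf E} XY| \leq \frac{1}{p} {\bf E} |X|^p + \frac{1}{q} {\bf E} |Y|^q.$$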
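We can amplify this inequality as follows. Multiplying $X$ by some $0 < \lambda < \infty$ and dividing $Y$ by the same $\lambda$, we conclude that
$$|{\bf E} XY| \leq \frac{\lambda^p}{p} {\bf E} |X|^p + \frac{\lambda^{-q}}{q} {\bf E} |Y|^q;$$
optimising the right-hand side in $\lambda$, we obtain (after some algebra, and after disposing of some edge cases when $X$ or $Y$ is almost surely zero) the important Hölder inequality
$$|{\bf E} XY| \leq \|X\|_p \|Y\|_q$$
where we use the notation
$$\|X\|_p := ({\bf E} |X|^p)^{1/p}$$
for $0 < p < \infty$. Using the convention
$$\|X\|_\infty := \inf \{ M: |X| \leq M \hbox{ a.s.} \}$$
(thus $\|X\|_\infty$ is the essential supremum of $|X|$), we also see from the triangle inequality that the Hölder inequality applies in the boundary case when one of $p, q$ is allowed to be $\infty$ (so that the other is equal to $1$):
$$|{\bf E} XY| \leq \|X\|_1 \|Y\|_\infty.$$
The case $p = q = 2$ is the important Cauchy-Schwarz inequality
$$|{\bf E} XY| \leq \|X\|_2 \|Y\|_2,$$
valid whenever $X, Y$ are square-integrable in the sense that ${\bf E} |X|^2, {\bf E} |Y|^2$ are finite.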
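The Hölder and Cauchy-Schwarz inequalities are easy to test numerically on a finite probability space. The sketch below is illustrative only: the random variables are represented as lists of values on a four-point sample space with assumed probabilities, the function names are hypothetical, and floating point arithmetic is used, so comparisons are made up to a small tolerance.

```python
def lp_norm(values, probs, p):
    """(E |X|^p)^(1/p) for a random variable on a finite sample space."""
    return sum(w * abs(v) ** p for v, w in zip(values, probs)) ** (1.0 / p)

def mean_product(xs, ys, probs):
    """E XY on the same finite sample space."""
    return sum(w * x * y for x, y, w in zip(xs, ys, probs))

def holder_gap(xs, ys, probs, p):
    """||X||_p * ||Y||_q - |E XY| with 1/p + 1/q = 1 (requires p > 1);
    Hölder's inequality asserts that this is nonnegative."""
    q = p / (p - 1.0)
    return lp_norm(xs, probs, p) * lp_norm(ys, probs, q) - abs(mean_product(xs, ys, probs))

if __name__ == "__main__":
    probs = [0.25, 0.25, 0.25, 0.25]   # assumed uniform four-point space
    xs = [1.0, -2.0, 3.0, 0.5]
    ys = [2.0, 1.0, -1.0, 4.0]
    for p in (1.5, 2.0, 3.0, 10.0):
        print(p, holder_gap(xs, ys, probs, p))
    # Cauchy-Schwarz is an equality when Y = X.
    print(holder_gap(xs, xs, probs, 2.0))
```

The gap is positive for generic inputs and vanishes (up to rounding) in the Cauchy-Schwarz case when the two random variables coincide.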
Exercise 40 Show that the expressions $\|X\|_p$ are non-decreasing in $p$ for $p \in (0,+\infty]$. In particular, if $\|X\|_p$ is finite for some $p$, then it is automatically finite for all smaller values of $p$.
Exercise 41 For any square-integrable $X$, show that
$${\bf P}( X \neq 0 ) \geq \frac{ ({\bf E} |X|)^2 }{ {\bf E} |X|^2 }.$$
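Exercise 42 If $1 < p < \infty$ and $X, Y$ are scalar random variables with $\|X\|_p, \|Y\|_p < \infty$, use Hölder’s inequality to establish that
$${\bf E} |X| |X+Y|^{p-1} \leq \|X\|_p \|X+Y\|_p^{p-1}$$
and
$${\bf E} |Y| |X+Y|^{p-1} \leq \|Y\|_p \|X+Y\|_p^{p-1}$$
and then conclude the Minkowski inequality
$$\|X+Y\|_p \leq \|X\|_p + \|Y\|_p.$$
Show that this inequality is also valid at the endpoint cases $p = 1$ and $p = \infty$.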
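Exercise 43 If $X$ is non-negative and square-integrable, and $0 \leq \theta \leq 1$, establish the Paley-Zygmund inequality
$${\bf P}( X > \theta {\bf E} X ) \geq (1-\theta)^2 \frac{ ({\bf E} X)^2 }{ {\bf E} |X|^2 }.$$
(Hint: use the Cauchy-Schwarz inequality to upper bound ${\bf E} X 1_{X > \theta {\bf E} X}$ in terms of ${\bf E} |X|^2$ and ${\bf P}( X > \theta {\bf E} X )$.)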
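Exercise 44 Let $X$ be a non-negative random variable that is almost surely bounded but not identically zero. Show that
$${\bf P}( X \geq \frac{1}{2} {\bf E} X ) \geq \frac{1}{2} \frac{ {\bf E} X }{ \|X\|_\infty }.$$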
3 October, 2015 at 7:47 pm
pauljung
Great notes so far. You have a stray * appearing as f*(omega) in the first bullet of Exercise 7. In the same exercise, second bullet, I think you want equality instead of greater or equal. In Proposition 12 (i), do you not need that X and Y are integrable?
[Corrected, thanks – T.]
3 October, 2015 at 8:02 pm
pauljung
By the way, I like your statement about Probability Theory being the study of random events which are model-independent. M. Loeve’s Vol. I (pg 173) has a similar take on this view.
3 October, 2015 at 9:46 pm
Anonymous
Great notes! One small correction: in the proof of theorem 16, you should have LHS >= RHS in the second displayed equation.
[Corrected, thanks – T.]
3 October, 2015 at 11:56 pm
pauljung
Ignore my comment about Proposition 12, I didn’t realize unsigned there meant nonnegative.
4 October, 2015 at 4:12 am
Anonymous
Professor – it looks like your informal heuristic for Fatou is backwards.
[Looks like it is in the right direction to me – could you elaborate? -T.]
4 October, 2015 at 9:32 am
Anonymous
Sure – in your informal heuristic bullet list you state:
“For unsigned {f_n}, the limit inferior of the integral cannot exceed the integral of the limit inferior”
but the inequality in Fatou actually goes the other way.
[Got it now. Corrected, thanks -T.]
4 October, 2015 at 4:37 am
David Gonzales
Exercise 9(iv) talks about monotonicity of complex functions $f$ and $g$, which isn’t defined so should probably be changed.
[Oops, that should have been deleted, thanks – T.]
4 October, 2015 at 6:55 am
Jennifer
Thanks for providing these great notes for self-study! Do you by any chance also share pdf versions of them?
[One should be able to print to PDF (with headers and sidebar removed) from the “Print” feature on your browser. -T]
5 October, 2015 at 6:44 pm
obryant
The very first bullet has a typo: wrong notation for the sure event. Thanks for the notes, and also for the pre-work that surely went into creating them.
[Actually, I am using $\overline{\emptyset}$ (the complement of the empty event) to denote the sure event. -T.]
7 October, 2015 at 6:33 am
Not A Music Expert
What kind of music do you listen to?
7 October, 2015 at 9:52 am
John Mangual
I am going to go out on a limb here and say a lot of these complications arise from the non-compactness of ? Markov’s inequality is essentially the pigeonhole principle. Can the same be said of Chebyshev inequality? Or even Hölder inequality?
12 October, 2015 at 11:41 am
Ryan McNeive
Thanks (as always) for these great notes!
I wanted to ask about a typo. In the proof of Fatou’s lemma, on the RHS of the last three equations, surely the sum should be from 1 to n rather than 1 to N? Unless I am very confused.
[One can use the symbol in place of here (thus replacing , by respectively) if desired. I chose not to do so as this makes the definition of a bit confusing unless one also changes the symbol appearing there to some other symbol. -T.]
[Added, Oct 14: oh, I see the problem now, I had used n for two unrelated things. Fixed now, thanks – T.]
14 October, 2015 at 11:30 am
Sam
Above exercise 3, isn’t there a mistake in {1 \times 1_{[0,2)} + 1 \times 1_{[1,3)}? Perhaps it should be {1 \times 1_{[0,1)} + 1 \times 1_{[0,3)}?
[Corrected, thanks – T.]
14 October, 2015 at 12:07 pm
Sam
I think there is still a typo in the second indicator function: instead of .
[Corrected, thanks -T.]
17 October, 2015 at 1:22 pm
Anonymous
Should the second centered equation in Theorem 23, be ? The 2 comes from considering the events such that and for these ,
17 October, 2015 at 7:13 pm
Terence Tao
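The factor of 2 is unnecessary; one has the pointwise bound $|X_n - X_n^{(m)}| \leq m^{-\varepsilon} |X_n|^{1+\varepsilon}$ when $|X_n| > m$ (checking the cases $X_n > m$ and $X_n < -m$ separately) and $|X_n - X_n^{(m)}| = 0$ otherwise.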
18 October, 2015 at 4:51 am
L.
In the proof of the Fatou’s lemma, why do we need the “modification”? There is at least one point that I am not sure: since the are allowed to be , we should have on (instead of strict inequality in the note). But we know that on , . Now if we define , then we have , which is equal to . The are still increasing to .
18 October, 2015 at 10:51 am
Terence Tao
Ah, I did not treat the case when some of the were infinite; I’ve fixed the proof to address this.
One needs strict inequality in the bound to ensure that for sufficiently large n. The bound does not ensure that for even a single choice of (for instance, the could be increasing and converge to in the limit).
19 October, 2015 at 10:58 am
L.
Oh, I see the point. Thanks very much for your comments.
2 November, 2015 at 7:06 pm
275A, Notes 4: The central limit theorem | What's new
[…] this and the Paley-Zygmund inequality (Exercise 39 of Notes 1) we also get some lower bound for of the […]
20 November, 2015 at 12:41 pm
Anonymous
Is Chebyshev inequality (8) optimal in the sense that if the first two moments of a random variable are given with a threshold , then (8) can be made arbitrarily close to equality by an appropriate selection of a probability distribution for (depending of course on the given expectation and variance of and the given threshold ) ?
20 November, 2015 at 1:50 pm
Terence Tao
Yes, this can be seen by experimenting with Bernoulli type random variables (e.g. ones which attain a threshold with some probability , with probability , and with probability , for various choices of parameters ).
20 November, 2015 at 5:54 pm
Anonymous
Thank you! Let me add some comments:
1. If has mean and variance , without loss of generality we may assume that and if the standard deviation and the threshold are given, consider the two cases:
(i) : In this case the RHS of (8) is greater than 1, so (8) is trivial
but not(!) optimal.
(ii) : In this case (using your suggestion) choose and denote (which is less than 1) . If attain with probability , with probability , and with probability , then has zero mean and (from this choice of ) variance – as required. Moreover, (8) will be .
Hence (by choosing arbitrarily close to ) Chebyshev inequality (8) can be made arbitrarily close to equality (i.e. (8) is optimal in this case.)
2. Interestingly, this idea can be applied also for the one-sided chebyshev inequality (Chebyshev-Cantelli inequality):
Where and .
Choose and define
.
If attain with probability and
with probability , it follows that
and
– as required.
Moreover, the one-sided Chebyshev inequality will be
Hence (by choosing arbitrarily close to ) the one-sided Chebyshev inequality is also optimal.
21 November, 2015 at 4:51 pm
Anonymous
It is interesting to observe that for case (i) above (for which Chebyshev inequality (8) is non-optimal), the optimal bound on the LHS of (8) is the trivial bound 1. To see that, we use a simplified version of the proof of case (ii) above, in which the random variable attain with probability and with probability . This gives (as required) and as the probability in the LHS of (8).
25 November, 2015 at 6:40 am
Anonymous
Hi Terry, there’s a small typo at the end of the proof dominated convergence theorem (Theorem 21). The final equation is missing integral signs.
[Corrected, thanks – T.]
28 November, 2015 at 3:26 pm
Anonymous
Just before Exercise 3, there is “can also be written as {1 \times 1_{[0,1)} + 1 \times 1_{[1,3)}}”, but the last subscript is probably meant to be
[Corrected, thanks -T.]
28 November, 2015 at 4:21 pm
Anonymous
‘real-valuked’ to ‘real-valued’ (it’s somewhere in the middle)
[Corrected, thanks -T.]
28 November, 2015 at 4:24 pm
Anonymous
Forgotten subscript:
\displaystyle {\bf E} f_i(X) = \int{\bf R} f_i(x)\ d\mu_X(x)
should be:
\displaystyle {\bf E} f_i(X) = \int_{\bf R} f_i(x)\ d\mu_X(x)
(just after valuked)
[Corrected, thanks -T.]
30 December, 2015 at 9:53 am
Anonymous
Below Remark 31, “… for any measurable set {R} in the range…” should read “… any measurable set {S}..”
[Corrected, thanks – T.]
6 January, 2016 at 5:12 am
Anonymous
Just a small typo, I guess. In the proof of Theorem 32, when the general case is considered and the functions f_n are considered, there is a missing expectation before letting n go to infinity. Should be
.
[Corrected, thanks – T.]
6 January, 2016 at 7:48 am
Anonymous
In exercise 37 I get the integrals with instead of just in both cases. Maybe there is a typo here.
[Corrected, thanks – T.]
4 August, 2016 at 5:01 pm
Sebastien Zany
Thanks for these notes!
Is entropy not a probabilistic concept? If not, then how should we think about it?
6 August, 2016 at 12:50 pm
Terence Tao
There are several different notions of entropy used in mathematics, not all of which are directly tied to probability. But the Shannon entropy of a discrete random variable $X$ is a probabilistic concept, since it can be expressed in terms of probabilities as ${\bf H}(X) = \sum_x {\bf P}(X = x) \log \frac{1}{{\bf P}(X = x)}$ (with the convention that $0 \log \frac{1}{0} = 0$).
15 November, 2016 at 6:16 am
Anonymous
I’m a little confused about the summary at the very beginning of this note. In Notes 0, a measure space was introduced to model some randomness (an “abstract” probability space). And this abstract probability space is a triple which satisfies Kolmogrov’s three axioms. The difference between the concrete model and the abstract probability space that models is that
(1) is a measure space where is a (concrete) -algebra on ;
(2)while in , it does not have to be a measure space and the “event space” is an abstract -algebra and hence do not have to be a collection of subsets of .
Shouldn’t we have first so that we can talk about we models using a model ? Why do you define the “conjunction”, “disjunction” and the “probability” of an event after a model is given? (Why is this not a circular argument?) Shouldn’t all these be defined first?
Are you in fact abstracting thing out from a concrete model to get an abstract probability space , like we can get an abstract real space out of a concrete real space , and one can actually define an abstract probability space without given any model in the first place?
15 November, 2016 at 10:10 am
Terence Tao
Yes, if one wants to, one can define abstract probability spaces first without reference to concrete measure spaces and only then talk about their representations by concrete measure spaces. This is discussed in Section 4 of Notes 0. But from a pedagogical point of view this is undesirable as this delays the point in the course where one actually does probability, rather than foundations of probability. This is one reason why most texts compress the foundations by working entirely with concrete probability spaces and not abstract ones (somewhat similarly to how in a first undergraduate linear algebra class, vectors are often _defined_ to be rows or columns of numbers, rather than as elements of an abstract vector space, in order to get on with the linear algebra rather than the foundations of linear algebra).
For the purposes of actually doing probability, the only relevant features of the concept of an abstract probability space are that (a) it can be modeled by at least one concrete probability space, and (b) the basic probabilistic notions (e.g. boolean operations, probability of an event, whether two events are equal etc.) are independent of the model. The existence of such a concept can be guaranteed by the formal definition of an abstract probability space as an abstract sigma algebra equipped with an abstract probability measure (or by the other alternative definitions of this concept in Section 4 of Notes 0). But one could also proceed by defining concrete probability spaces first, and defining an abstraction of that space to be anything isomorphic to the events of that space together with their probabilities and boolean operations; this would suffice for most practical purposes, and is basically the approach taken in Notes 0 before Section 4.
Incidentally, while the 1933 text of Kolmogorov does use a concrete sigma algebra (or, in the language of that era, a field of sets) to model the event space, it is clear from his writing that he would have used an abstraction of this if was available at the time (e.g. on page 1 of your linked translation he writes “what the elements of this set represent is of no importance”). One reason for this is that the correct axiomatisation of an abstract sigma algebra was not identified and justified until the work of Loomis and Sikorski in 1947, which was many years after the original text of Kolmogorov.
15 November, 2016 at 6:31 am
Anonymous
Right after Definition 27, I think you mean “the distribution on is the push-forward of the probability measure .
[Corrected, thanks – T.]