In Notes 0, we introduced the notion of a measure space {\Omega = (\Omega, {\mathcal F}, \mu)}, which includes as a special case the notion of a probability space. By selecting one such probability space {(\Omega,{\mathcal F},\mu)} as a sample space, one obtains a model for random events and random variables, with random events {E} being modeled by measurable sets {E_\Omega} in {{\mathcal F}}, and random variables {X} taking values in a measurable space {R} being modeled by measurable functions {X_\Omega: \Omega \rightarrow R}. We then defined some basic operations on these random events and variables:

  • Given events {E,F}, we defined the conjunction {E \wedge F}, the disjunction {E \vee F}, and the complement {\overline{E}}. For countable families {E_1,E_2,\dots} of events, we similarly defined {\bigwedge_{n=1}^\infty E_n} and {\bigvee_{n=1}^\infty E_n}. We also defined the empty event {\emptyset} and the sure event {\overline{\emptyset}}, and what it meant for two events to be equal.
  • Given random variables {X_1,\dots,X_n} in ranges {R_1,\dots,R_n} respectively, and a measurable function {F: R_1 \times \dots \times R_n \rightarrow S}, we defined the random variable {F(X_1,\dots,X_n)} in range {S}. (As the special case {n=0} of this, every deterministic element {s} of {S} was also a random variable taking values in {S}.) Given a relation {P: R_1 \times \dots \times R_n \rightarrow \{\hbox{true}, \hbox{false}\}}, we similarly defined the event {P(X_1,\dots,X_n)}. Conversely, given an event {E}, we defined the indicator random variable {1_E}. Finally, we defined what it meant for two random variables to be equal.
  • Given an event {E}, we defined its probability {{\bf P}(E)}.

These operations obey various axioms; for instance, the boolean operations on events obey the axioms of a Boolean algebra, and the probabilility function {E \mapsto {\bf P}(E)} obeys the Kolmogorov axioms. However, we will not focus on the axiomatic approach to probability theory here, instead basing the foundations of probability theory on the sample space models as discussed in Notes 0. (But see this previous post for a treatment of one such axiomatic approach.)

It turns out that almost all of the other operations on random events and variables we need can be constructed in terms of the above basic operations. In particular, this allows one to safely extend the sample space in probability theory whenever needed, provided one uses an extension that respects the above basic operations; this is an important operation when one needs to add new sources of randomness to an existing system of events and random variables, or to couple together two separate such systems into a joint system that extends both of the original systems. We gave a simple example of such an extension in the previous notes, but now we give a more formal definition:

Definition 1 Suppose that we are using a probability space {\Omega = (\Omega, {\mathcal F}, \mu)} as the model for a collection of events and random variables. An extension of this probability space is a probability space {\Omega' = (\Omega', {\mathcal F}', \mu')}, together with a measurable map {\pi: \Omega' \rightarrow \Omega} (sometimes called the factor map) which is probability-preserving in the sense that

\displaystyle  \mu'( \pi^{-1}(E) ) = \mu(E) \ \ \ \ \ (1)

for all {E \in {\mathcal F}}. (Caution: this does not imply that {\mu(\pi(F)) = \mu'(F)} for all {F \in {\mathcal F}'} – why not?)

An event {E} which is modeled by a measurable subset {E_\Omega} in the sample space {\Omega}, will be modeled by the measurable set {E_{\Omega'} := \pi^{-1}(E_\Omega)} in the extended sample space {\Omega'}. Similarly, a random variable {X} taking values in some range {R} that is modeled by a measurable function {X_\Omega: \Omega \rightarrow R} in {\Omega}, will be modeled instead by the measurable function {X_{\Omega'} := X_\Omega \circ \pi} in {\Omega'}. We also allow the extension {\Omega'} to model additional events and random variables that were not modeled by the original sample space {\Omega} (indeed, this is one of the main reasons why we perform extensions in probability in the first place).

Thus, for instance, the sample space {\Omega'} in Example 3 of the previous post is an extension of the sample space {\Omega} in that example, with the factor map {\pi: \Omega' \rightarrow \Omega} given by the first coordinate projection {\pi(i,j) := i}. One can verify that all of the basic operations on events and random variables listed above are unaffected by the above extension (with one caveat, see remark below). For instance, the conjunction {E \wedge F} of two events can be defined via the original model {\Omega} by the formula

\displaystyle  (E \wedge F)_\Omega := E_\Omega \cap F_\Omega

or via the extension {\Omega'} via the formula

\displaystyle  (E \wedge F)_{\Omega'} := E_{\Omega'} \cap F_{\Omega'}.

The two definitions are consistent with each other, thanks to the obvious set-theoretic identity

\displaystyle  \pi^{-1}( E_\Omega \cap F_\Omega ) = \pi^{-1}(E_\Omega) \cap \pi^{-1}(F_\Omega).

Similarly, the assumption (1) is precisely what is needed to ensure that the probability {\mathop{\bf P}(E)} of an event remains unchanged when one replaces a sample space model with an extension. We leave the verification of preservation of the other basic operations described above under extension as exercises to the reader.

Remark 2 There is one minor exception to this general rule if we do not impose the additional requirement that the factor map {\pi} is surjective. Namely, for non-surjective {\pi}, it can become possible that two events {E, F} are unequal in the original sample space model, but become equal in the extension (and similarly for random variables), although the converse never happens (events that are equal in the original sample space always remain equal in the extension). For instance, let {\Omega} be the discrete probability space {\{a,b\}} with {p_a=1} and {p_b=0}, and let {\Omega'} be the discrete probability space {\{ a'\}} with {p'_{a'}=1}, and non-surjective factor map {\pi: \Omega' \rightarrow \Omega} defined by {\pi(a') := a}. Then the event modeled by {\{b\}} in {\Omega} is distinct from the empty event when viewed in {\Omega}, but becomes equal to that event when viewed in {\Omega'}. Thus we see that extending the sample space by a non-surjective factor map can identify previously distinct events together (though of course, being probability preserving, this can only happen if those two events were already almost surely equal anyway). This turns out to be fairly harmless though; while it is nice to know if two given events are equal, or if they differ by a non-null event, it is almost never useful to know that two events are unequal if they are already almost surely equal. Alternatively, one can add the additional requirement of surjectivity in the definition of an extension, which is also a fairly harmless constraint to impose (this is what I chose to do in this previous set of notes).

Roughly speaking, one can define probability theory as the study of those properties of random events and random variables that are model-independent in the sense that they are preserved by extensions. For instance, the cardinality {|E_\Omega|} of the model {E_\Omega} of an event {E} is not a concept within the scope of probability theory, as it is not preserved by extensions: continuing Example 3 from Notes 0, the event {E} that a die roll {X} is even is modeled by a set {E_\Omega = \{2,4,6\}} of cardinality {3} in the original sample space model {\Omega}, but by a set {E_{\Omega'} = \{2,4,6\} \times \{1,2,3,4,5,6\}} of cardinality {18} in the extension. Thus it does not make sense in the context of probability theory to refer to the “cardinality of an event {E}“.

On the other hand, the supremum {\sup_n X_n} of a collection of random variables {X_n} in the extended real line {[-\infty,+\infty]} is a valid probabilistic concept. This can be seen by manually verifying that this operation is preserved under extension of the sample space, but one can also see this by defining the supremum in terms of existing basic operations. Indeed, note from Exercise 24 of Notes 0 that a random variable {X} in the extended real line is completely specified by the threshold events {(X \leq t)} for {t \in {\bf R}}; in particular, two such random variables {X,Y} are equal if and only if the events {(X \leq t)} and {(Y \leq t)} are surely equal for all {t}. From the identity

\displaystyle  (\sup_n X_n \leq t) = \bigwedge_{n=1}^\infty (X_n \leq t)

we thus see that one can completely specify {\sup_n X_n} in terms of {X_n} using only the basic operations provided in the above list (and in particular using the countable conjunction {\bigwedge_{n=1}^\infty}.) Of course, the same considerations hold if one replaces supremum, by infimum, limit superior, limit inferior, or (if it exists) the limit.

In this set of notes, we will define some further important operations on scalar random variables, in particular the expectation of these variables. In the sample space models, expectation corresponds to the notion of integration on a measure space. As we will need to use both expectation and integration in this course, we will thus begin by quickly reviewing the basics of integration on a measure space, although we will then translate the key results of this theory into probabilistic language.

As the finer details of the Lebesgue integral construction are not the core focus of this probability course, some of the details of this construction will be left to exercises. See also Chapter 1 of Durrett, or these previous blog notes, for a more detailed treatment.

— 1. Integration on measure spaces —

Let {(\Omega, {\mathcal F}, \mu)} be a measure space, and let {f} be a measurable function on {\Omega}, taking values either in the reals {{\bf R}}, the non-negative extended reals {[0,+\infty]}, the extended reals {[-\infty,+\infty]}, or the complex numbers {{\bf C}}. We would like to define the integral

\displaystyle  \int_\Omega f\ d\mu \ \ \ \ \ (2)

of {f} on {\Omega}. (One could make the integration variable explicit, e.g. by writing {\int_\Omega f(\omega)\ d\mu(\omega)}, but we will usually not do so here.) When integrating a reasonably nice function (e.g. a continuous function) on a reasonably nice domain (e.g. a box in {{\bf R}^n}), the Riemann integral that one learns about in undergraduate calculus classes suffices for this task; however, for the purposes of probability theory, we need the much more general notion of a Lebesgue integral in order to properly define (2) for the spaces {\Omega} and functions {f} we will need to study.

Not every measurable function can be integrated by the Lebesgue integral. There are two key classes of functions for which the integral exists and is well behaved:

  • Unsigned measurable functions {f: \Omega \rightarrow [0,+\infty]}, that take values in the non-negative extended reals {[0,+\infty]}; and
  • Absolutely integrable functions {f: \Omega \rightarrow {\bf R}} or {f: \Omega \rightarrow {\bf C}}, which are scalar measurable functions whose absolute value {|f|} has a finite integral: {\int_\Omega |f|\ d\mu < \infty}. (Sometimes we also allow absolutely integrable functions to attain an infinite value {\infty}, so long as they only do so on a set of measure zero.)

One could in principle extend the Lebesgue integral to slightly more general classes of functions, e.g. to sums of absolutely integrable functions and unsigned functions. However, the above two classes already suffice for most applications (and as a general rule of thumb, it is dangerous to apply the Lebesgue integral to functions that are not unsigned or absolutely integrable, unless you really know what you are doing).

We will construct the Lebesgue integral in the following four stages. First, we will define the Lebesgue integral just for unsigned simple functions – unsigned measurable functions that take on only finitely many values. Then, by a limiting procedure, we extend the Lebesgue integral to unsigned functions. After that, by decomposing a real absolutely integrable function into unsigned components, we extend the integral to real absolutely integrable functions. Finally, by taking real and imaginary parts, we extend to complex absolutely integrable functions. (This is not the only order in which one could perform this construction; for instance, in Durrett, one first constructs integration of bounded functions on finite measure support before passing to arbitrary unsigned functions.)

First consider an unsigned simple function {f: \Omega \rightarrow [0,+\infty]}, thus {f} is measurable and only takes values at a finite number of values. Then we can express {f} as a finite linear combination (in {[0,+\infty]}) of indicator functions. Indeed, if we enumerate the values that {f} takes as {a_1,\dots,a_n \in [0,+\infty]} (avoiding repetitions) and setting {E_i := \{ \omega \in \Omega: f(\omega) = a_i \}} for {i=1,\dots,n}, then it is clear that

\displaystyle  f = \sum_{i=1}^n a_i 1_{E_i}.

(It should be noted at this point that the operations of addition and multiplication on {[0,+\infty]} are defined by setting {+\infty + a = a + \infty = +\infty} for all {a \in [0,+\infty]}, and {a \cdot +\infty = +\infty \cdot a} for all positive {a \in (0,+\infty]}, but that {0 \cdot +\infty = +\infty \cdot 0} is defined to equal {0}. To put it another way, multiplication is defined to be continuous from below, rather than from above: {a \cdot b = \lim_{x \rightarrow a^-, y \rightarrow b^-} x \cdot y}. One can verify that the commutative, associative, and distributive laws continue to hold on {[0,+\infty]}, but we caution that the cancellation laws do not hold when {+\infty} is involved.)

Conversely, given any coefficients {a_1,\dots,a_n \in [0,+\infty]} (not necessarily distinct) and measurable sets {E_1,\dots,E_n} in {{\mathcal F}} (not necessarily disjoint), the sum {\sum_{i=1}^n a_i 1_{E_i}} is an unsigned simple function.

A single simple function can be decomposed in multiple ways as a linear combination of unsigned simple functions. For instance, on the real line {{\bf R}}, the function {2 \times 1_{[0,1)} + 1 \times 1_{[1,3)}} can also be written as {1 \times 1_{[0,1)} + 1 \times 1_{[0,3)}} or as {2 \times 1_{[0,1)} + 1 \times 1_{[1,2)} + 1 \times 1_{[2,3)}}. However, there is an invariant of all these decompositions:

Exercise 3 Suppose that an unsigned simple function {f} has two representations as the linear combination of indicator functions:

\displaystyle  f = \sum_{i=1}^n a_i 1_{E_i} = \sum_{j=1}^m b_j 1_{F_j},

where {n,m} are nonnegative integers, {a_1,\dots,a_n,b_1,\dots,b_m} lie in {[0,+\infty]}, and {E_1,\dots,E_n,F_1,\dots,F_m} are measurable sets. Show that

\displaystyle  \sum_{i=1}^n a_i \mu(E_i) = \sum_{j=1}^m b_j \mu(F_j).

(Hint: first handle the special case where the {F_j} are all disjoint and non-empty, and each of the {E_i} is expressible as the union of some subcollection of the {F_j}. Then handle the general case by considering the atoms of the finite boolean algebra generated by {E_i} and {F_j}.)

We capture this invariant by introducing the simple integral {\hbox{Simp} \int_{\Omega} f\ d\mu} of an unsigned simple function by the formula

\displaystyle  \hbox{Simp} \int_\Omega f\ d\mu := \sum_{i=1}^n a_i \mu(E_i)

whenever {f} admits a decomposition {f = \sum_{i=1}^n a_i 1_{E_i}}. The above exercise is then precisely the assertion that the simple integral is well-defined as an element of {[0,+\infty]}.

Exercise 4 Let {f, g: \Omega \rightarrow [0,+\infty]} be unsigned simple functions, and let {c \in [0,+\infty]}.

  • (i) (Linearity) Show that

    \displaystyle  \hbox{Simp} \int_\Omega f+g\ d\mu = \hbox{Simp} \int_\Omega f\ d\mu + \hbox{Simp} \int_\Omega g\ d\mu

    and

    \displaystyle  \hbox{Simp} \int_\Omega cf\ d\mu = c \hbox{Simp} \int_\Omega f\ d\mu.

  • (ii) Show that if {f} and {g} are equal almost everywhere, then

    \displaystyle  \hbox{Simp} \int_\Omega f\ d\mu = \hbox{Simp} \int_\Omega g\ d\mu.

  • (iii) Show that {\hbox{Simp} \int_\Omega f\ d\mu \geq 0}, with equality if and only if {f} is zero almost everywhere.
  • (iv) (Monotonicity) If {f \leq g} almost everywhere, show that {\hbox{Simp} \int_\Omega f\ d\mu \leq \hbox{Simp} \int_\Omega g\ d\mu}.
  • (v) (Markov inequality) Show that {\mu( \{ \omega: f(\omega) \geq t \} ) \leq \frac{1}{t} \hbox{Simp} \int_\Omega f\ d\mu} for any {0 < t < \infty}.

Now we extend from unsigned simple functions to more general unsigned functions. If {f: \Omega \rightarrow [0,+\infty]} is an unsigned measurable function, we define the unsigned integral {\int_\Omega f\ d\mu} as

\displaystyle  \int_\Omega f\ d\mu = \sup_{0 \leq g \leq f} \hbox{Simp} \int_\Omega g\ d\mu \ \ \ \ \ (3)

where the supremum is over all unsigned simple functions such that {0 \leq g(\omega) \leq f(\omega)} for all {\omega \in \Omega}.

Many of the properties of the simple integral carry over to the unsigned integral easily:

Exercise 5 Let {f, g: \Omega \rightarrow [0,+\infty]} be unsigned functions, and let {c \in [0,+\infty]}.

  • (i) (Superadditivity) Show that

    \displaystyle  \int_\Omega f+g\ d\mu \geq \int_\Omega f\ d\mu + \int_\Omega g\ d\mu

    and

    \displaystyle  \int_\Omega cf\ d\mu = c \int_\Omega f\ d\mu.

  • (ii) Show that if {f} and {g} are equal almost everywhere, then

    \displaystyle  \int_\Omega f\ d\mu = \int_\Omega g\ d\mu.

  • (iii) Show that {\int_\Omega f\ d\mu \geq 0}, with equality if and only if {f} is zero almost everywhere.
  • (iv) (Monotonicity) If {f \leq g} almost everywhere, show that {\int_\Omega f\ d\mu \leq \int_\Omega g\ d\mu}.
  • (v) (Markov inequality) Show that {\mu( \{ \omega: f(\omega) \geq t \} ) \leq \frac{1}{t} \int_\Omega f\ d\mu} for any {0 < t < \infty}. In particular, if {\int_\Omega f\ d\mu < \infty}, then {f} is finite almost everywhere.
  • (vi) (Compatibility with simple integral) If {f} is simple, show that {\int_\Omega f\ d\mu = \hbox{Simp} \int_\Omega f\ d\mu}.
  • (vii) (Compatibility with measure) For any measurable set {E}, show that {\int_\Omega 1_E\ d\mu = \mu(E)}.

Exercise 6 If {(\Omega, (p_\omega)_{\omega \in \Omega})} is a discrete probability space (with the associated probability measure {\mu}), and {f: \Omega \rightarrow [0,+\infty]} is a function, show that

\displaystyle  \int_\Omega f\ d\mu = \sum_{\omega \in\Omega} f(\omega) p_\omega.

(Note that the condition {\sum_{\omega \in \Omega} p_\omega = 1} in the definition of a discrete probability space is not required to prove this identity.)

The observant reader will notice that the linearity property of simple functions has been weakened to superadditivity. This can be traced back to a breakdown of symmetry in the definition (3); the unsigned simple integral of {f} is defined via approximation from below, but not from above. Indeed the opposite claim

\displaystyle  \int_\Omega f\ d\mu \stackrel{?}{=} \inf_{g \geq f} \hbox{Simp} \int_\Omega g\ d\mu \ \ \ \ \ (4)

can fail. For a counterexample, take {\Omega} to be the discrete probability space {\{1,2,3,\dots\}} with probabilities {p_n := 2^{-n}}, and let {f: \Omega \rightarrow [0,+\infty]} be the function {f(n)=n}. By Exercise 6 we have {\int_\Omega f\ d\mu = \sum_{n=1}^\infty n2^{-n} = 2}. On the other hand, any simple function {g} with {g \geq f} must equal {+\infty} on a set of positive measure (why?) and so the right-hand side of (4) can be infinite. However, one can get around this difficulty under some further assumptions on {f}, and thus recover full linearity for the unsigned integral:

Exercise 7 (Linearity of the unsigned integral) Let {(\Omega, {\mathcal F}, \mu)} be a measure space.

  • (i) Let {f: \Omega \rightarrow [0,+\infty]} be an unsigned measurable function which is both bounded (i.e., there is a finite {M} such that {|f(\omega)| \leq M} for all {\omega \in \Omega}) and has finite measure support (i.e., there is a measurable set {E} with {\mu(E) < \infty} such that {f(\omega)=0} for all {\omega \in \Omega \backslash E}). Show that (4) holds for this function {f}.
  • (ii) Establish the additivity property

    \displaystyle  \int_\Omega f+g\ d\mu =\int_\Omega f\ d\mu + \int_\Omega g\ d\mu

    whenever {f, g: \Omega \rightarrow [0,+\infty]} are unsigned measurable functions that are bounded with finite measure support.

  • (iii) Show that

    \displaystyle  \int_\Omega \min(f,n)\ d\mu \rightarrow \int_\Omega f\ d\mu

    as {n \rightarrow \infty} whenever {f: \Omega \rightarrow [0,+\infty]} is unsigned measurable.

  • (iv) Using (iii), extend (ii) to the case where {f,g} are unsigned measurable functions with finite measure support, but are not necessarily bounded.
  • (v) Show that

    \displaystyle  \int_\Omega f 1_{f \geq 1/n}\ d\mu \rightarrow \int_\Omega f\ d\mu

    as {n \rightarrow \infty} whenever {f: \Omega \rightarrow [0,+\infty]} is unsigned measurable.

  • (vi) Using (iii) and (v), show that (ii) holds for any unsigned measurable {f,g} (which are not necessarily bounded or of finite measure support).

Next, we apply the integral to absolutely integrable functions. We call a scalar function {f: \Omega \rightarrow {\bf R}} or {f: \Omega \rightarrow {\bf C}} absolutely integrable if it is measurable and the unsigned integral {\int_\Omega |f|\ d\mu} is finite. A real-valued absolutely integrable function {f} can be expressed as the difference {f = f_1 - f_2} of two unsigned absolutely integrable functions {f_1,f_2}; indeed, one can check that the choice {f_1 := \max(f,0)} and {f_2 := \max(-f,0)} work for this. Conversely, any difference {f_1 - f_2} of unsigned absolutely integrable functions {f_1,f_2} is absolutely integrable (this follows from the triangle inequality {|f_1-f_2| \leq |f_1|+|f_2|}). A single absolutely integrable function {f} may be written as a difference {f_1-f_2} of unsigned absolutely integrable functions in more than one way, for instance we might have

\displaystyle  f = f_1 - f_2 = g_1 - g_2

for unsigned absolutely integrable functions {f_1,f_2,g_1,g_2}. But when this happens, we can rearrange to obtain

\displaystyle  f_1 + g_2 = g_1 + f_2

and thus by linearity of the unsigned integral

\displaystyle  \int_\Omega f_1\ d\mu + \int_\Omega g_2\ d\mu = \int_\Omega g_1\ d\mu + \int_\Omega f_2\ d\mu.

By the absolute integrability of {f_1,f_2,g_1,g_2}, all the integrals are finite, so we may rearrange this identity as

\displaystyle  \int_\Omega f_1\ d\mu - \int_\Omega f_2\ d\mu = \int_\Omega g_1\ d\mu - \int_\Omega g_2\ d\mu.

This allows us to define the Lebesgue integral {\int_\Omega f\ d\mu \in {\bf R}} of a real-valued absolutely integrable function {f} to be the expression

\displaystyle  \int_\Omega f\ d\mu := \int_\Omega f_1\ d\mu - \int_\Omega f_2\ d\mu

for any given decomposition {f = f_1-f_2} of {f} as the difference of two unsigned absolutely integrable functions. Note that if {f} is both unsigned and absolutely integrable, then the unsigned integral and the Lebesgue integral of {f} agree (as can be seen by using the decomposition {f = f - 0}), and so there is no ambiguity in using the same notation {\int_\Omega f\ d\mu} to denote both integrals. (By the same token, we may now drop the modifier {\hbox{Simp}} from the simple integral of a simple unsigned {f}, which we may now also denote by {\int_\Omega f\ d\mu}.)

The Lebesgue integral also enjoys good linearity properties:

Exercise 8 Let {f, g: \Omega \rightarrow {\bf R}} be real-valued absolutely integrable functions, and let {c \in {\bf R}}.

  • (i) (Linearity) Show that {f+g} and {cf} are also real-valued absolutely integrable functions, with

    \displaystyle \int_\Omega f+g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu

    and

    \displaystyle  \int_\Omega cf\ d\mu = c \int_\Omega f\ d\mu.

    (For the second relation, one may wish to first treat the special cases {c>0} and {c=-1}.)

  • (ii) Show that if {f} and {g} are equal almost everywhere, then

    \displaystyle  \int_\Omega f\ d\mu = \int_\Omega g\ d\mu.

  • (iii) Show that {\int_\Omega |f|\ d\mu \geq 0}, with equality if and only if {f} is zero almost everywhere.
  • (iv) (Monotonicity) If {f \leq g} almost everywhere, show that {\int_\Omega f\ d\mu \leq\int_\Omega g\ d\mu}.
  • (v) (Markov inequality) Show that {\mu( \{ \omega: |f(\omega)| \geq t \} ) \leq \frac{1}{t} \int_\Omega |f|\ d\mu} for any {0 < t < \infty}.

Because of part (iii) of the above exercise, we can extend the Lebesgue integral to real-valued absolutely integrable functions that are only defined and real-valued almost everywhere, rather than everywhere. In particular, we can apply the Lebesgue integral to functions that are sometimes infinite, so long as they are only infinite on a set of measure zero, and the function is absolutely integrable everywhere else.

Finally, we extend to complex-valued functions. If {f: \Omega \rightarrow {\bf C}} is absolutely integrable, observe that the real and imaginary parts {\hbox{Re} f, \hbox{Im} f: \Omega \rightarrow {\bf C}} are also absolutely integrable (because {|\hbox{Re} f|, |\hbox{Im} f| \leq |f|}). We then define the (complex) Lebesgue integral {\int_\Omega f\ d\mu \in {\bf C}} of {f} in terms of the real Lebesgue integral by the formula

\displaystyle  \int_\Omega f\ d\mu := \int_\Omega \hbox{Re}(f)\ d\mu + i \int_\Omega \hbox{Im}(f)\ d\mu.

Clearly, if {f} is real-valued and absolutely integrable, then the real Lebesgue integral and the complex Lebesgue integral of {f} coincide, so it does not create ambiguity to use the same symbol {\int_\Omega f\ d\mu} for both concepts. It is routine to extend the linearity properties of the real Lebesgue integral to its complex counterpart:

Exercise 9 Let {f, g: \Omega \rightarrow {\bf C}} be complex-valued absolutely integrable functions, and let {c \in {\bf C}}.

  • (i) (Linearity) Show that {f+g} and {cf} are also complex-valued absolutely integrable functions, with

    \displaystyle \int_\Omega f+g\ d\mu = \int_\Omega f\ d\mu + \int_\Omega g\ d\mu

    and

    \displaystyle  \int_\Omega cf\ d\mu = c \int_\Omega f\ d\mu.

    (For the second relation, one may wish to first treat the special cases {c \in {\bf R}} and {c = i}.)

  • (ii) Show that if {f} and {g} are equal almost everywhere, then

    \displaystyle  \int_\Omega f\ d\mu = \int_\Omega g\ d\mu.

  • (iii) Show that {\int_\Omega |f|\ d\mu \geq 0}, with equality if and only if {f} is zero almost everywhere.
  • (iv) (Markov inequality) Show that {\mu( \{ \omega: |f(\omega)| \geq t \} ) \leq \frac{1}{t} \int_\Omega |f|\ d\mu} for any {0 < t < \infty}.

We record a simple, but incredibly fundamental, inequality concerning the Lebesgue integral:

Lemma 10 (Triangle inequality) If {f: \Omega \rightarrow {\bf C}} is a complex-valued absolutely integrable function, then

\displaystyle  |\int_\Omega f\ d\mu| \leq \int_\Omega |f|\ d\mu.

Proof: We have

\displaystyle  \hbox{Re} \int_\Omega f\ d\mu = \int_\Omega \hbox{Re} f\ d\mu

\displaystyle  \leq \int_\Omega |f|\ d\mu.

This looks weaker than what we want to prove, but we can “amplify” this inequality to the full strength triangle inequality as follows. Replacing {f} by {e^{i\theta} f} for any real {\theta}, we have

\displaystyle  \hbox{Re} e^{i\theta} \int_\Omega f\ d\mu \leq \int_\Omega |f|\ d\mu.

Since we can choose the phase {e^{i\theta}} to make the expression {e^{i\theta} \int_\Omega f\ d\mu} equal to {|\int_\Omega f\ d\mu|}, the claim follows. \Box

Finally, we observe that the Lebesgue integral extends the Riemann integral, which is particularly useful when it comes to actually computing some of these integrals:

Exercise 11 If {f: [a,b] \rightarrow {\bf C}} is a Riemann integrable function on a compact interval {[a,b]}, show that {f} is also absolutely integrable, and that the Lebesgue integral {\int_{[a,b]} f\ dm} (with {m} Lebesgue measure restricted to {[a,b]}) coincides with the Riemann integral {\int_a^b f(x)\ dx}. Similarly if {f} is Riemann integrable on a box {[a_1,b_1] \times \dots \times [a_n,b_n]}.

— 2. Expectation of random variables —

We now translate the above notions of integration on measure spaces to the probabilistic setting.

A random variable {X} taking values in the unsigned extended real line {[0,+\infty]} is said to be simple if it takes on at most finitely many values. Equivalently, {X} can be expressed as a finite unsigned linear combination

\displaystyle  X = \sum_{i=1}^n a_i 1_{E_i}

of indicator random variables, where {a_1,\dots,a_n \in [0,+\infty]} are unsigned and {E_i} are events. We then define the simple expectation {\hbox{Simp} {\bf E} X \in [0,+\infty]} of {X} to be the quantity

\displaystyle  \hbox{Simp} {\bf E} X := \sum_{i=1}^n a_i {\bf P}(E_i),

and checks that this definition is independent of the choice of decomposition of {X} into indicator functions. Observe that if we model the random variable {X} using a probability space {\Omega}, then the simple expectation of {X} is precisely the simple integral of the corresponding unsigned simple function {X_\Omega}.

Next, given an arbitrary unsigned random variable {X} taking values in {[0,+\infty]}, one defines its (unsigned) expectation {{\bf E} X \in [0,+\infty]} as

\displaystyle  {\bf E} X := \sup_{0\leq Y \leq X} \hbox{Simp} {\bf E} Y

where {Y} ranges over all simple unsigned random variables such that {0 \leq Y \leq X} is surely true. This extends the simple expectation (thus {{\bf E} X = \hbox{Simp} {\bf E} X} for all simple unsigned {X}), and in terms of a probability space model {\Omega}, the expectation {{\bf E} X} is precisely the unsigned integral of {X_\Omega}. The expectation of a random variable is also often referred to as the mean, particularly in applications connected to statistics. In some literature {{\bf E} X} is also called the expected value of {X}, but this is a somewhat misleading term as often one expects {X} to deviate above or below {{\bf E} X}.

A scalar random variable {X} is said to be absolutely integrable if {{\bf E} |X| < \infty}, thus for instance any bounded random variable is absolutely integrable. If {X} is real-valued and absolutely integrable, we define its expectation by the formula

\displaystyle  {\bf E} X := {\bf E} X_1 - {\bf E} X_2

where {X = X_1 - X_2} is any representation of {X} as the difference of unsigned absolutely integrable random variables {X_1,X_2}; one can check that this definition does not depend on the choice of representation and is thus well-defined. For complex-valued absolutely integrable {X}, we then define

\displaystyle  {\bf E} X := {\bf E} \hbox{Re} X + i {\bf E} \hbox{Im} X.

In all of these cases, the expectation of {X} is equal to the integral of the representation {X_\Omega} of {X} in any probability space model; in the case that {\Omega} is given by a discrete probability model, one can check that this definition of expectation agrees with the one given in Notes 0. Using the former fact, we can translate the properties of integration already established to the probabilistic setting:

Proposition 12

  • (i) (Unsigned linearity) If {X,Y} are unsigned random variables, and {c} is a deterministic unsigned quantity, then {{\bf E}(X+Y) = {\bf E} X + {\bf E} Y} and {{\bf E} cX = c {\bf E} X}. (Note that these identities hold even when {X,Y} are not absolutely integrable.)
  • (ii) (Complex linearity) If {X,Y} are absolutely integrable random variables, and {c} is a deterministic complex quantity, then {X+Y} and {cX} are also absolutely integrable, with {{\bf E}(X+Y) = {\bf E} X + {\bf E} Y} and {{\bf E} cX = c {\bf E} X}.
  • (iii) (Compatibility with probability) If {E} is an event, then {{\bf E} 1_E = {\bf P}(E)}. In particular, {{\bf E} 1 = 1}.
  • (iv) (Almost sure equivalence) If {X, Y} are unsigned (resp. absolutely integrable) and {X=Y} almost surely, then {{\bf E} X = {\bf E} Y}.
  • (v) If {X} is unsigned or absolutely integrable, then {{\bf E} |X| \geq 0}, with equality if and only if {X=0} almost surely.
  • (vi) (Monotonicity) If {X,Y} are unsigned or real-valued absolutely integrable, and {X \leq Y} almost surely, then {{\bf E} X \leq {\bf E} Y}.
  • (vii) (Markov inequality) If {X} is unsigned or absolutely integrable, then {{\bf P}(|X| \geq t) \leq \frac{1}{t} {\bf E} |X|} for any deterministic {t>0}.
  • (viii) (Triangle inequality) If {X} is absolutely integrable, then {|{\bf E} X| \leq {\bf E} |X|}.

As before, we can use part (iv) to define expectation of scalar random variables {X} that are only defined and finite almost surely, rather than surely.

Note that we have built the notion of expectation (and of related notions, such as absolute integrability) out of notions that were already probabilistic in nature, in the sense that they were unaffected if one replaced the underlying probabilistic model with an extension. Therefore, the notion of expectation is automatically probabilistic in the same sense. Because of this, we will be easily able to manipulate expectations of random variables without having to explicitly mention an underlying probability space {\Omega}, and so one will now see such spaces fade from view starting from this point in the course.

— 3. Exchanging limits with integrals or expectations —

When performing analysis on measure spaces, it is important to know if one can interchange a limit with an integral:

\displaystyle  \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu \stackrel{?}{=} \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.

Similarly, in probability theory, we often wish to interchange a limit with an expectation:

\displaystyle  {\bf E} \lim_{n \rightarrow \infty} X_n \stackrel{?}{=} \lim_{n \rightarrow \infty} {\bf E} X_n.

Of course, one needs the integrands or random variables to be either unsigned or absolutely integrable, and the limits to be well-defined to have any hope of doing this. Naively, one could hope that limits and integrals could always be exchanged when the expressions involved are well-defined, but this is unfortunately not the case. In the case of integration on, say, the real line {{\bf R}} using Lebesgue measure {m}, we already see four key examples:

  • (Moving bump example) Take {f_n := 1_{[n,n+1]}}. Then {\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}, but {\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = 1}.
  • (Spreading bump example) Take {f_n := \frac{1}{n} 1_{[0,n]}}. Then {\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}, but {\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = 1}.
  • (Concentrating bump example) Take {f_n := n 1_{[0,1/n]}}. Then {\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}, but {\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = 1}.
  • (Receding infinity example) Take {f_n := 1_{[n,\infty)}}. Then {\int_{\bf R} \lim_{n \rightarrow \infty} f_n\ dm = 0}, but {\lim_{n \rightarrow \infty} \int_{\bf R} f_n\ dm = +\infty}.

In all these examples, the limit of the integral exceeds the integral of the limit; by replacing {f_n} with {-f_n} in the first three examples (which involve absolutely integrable functions) one can also build examples where the limit of the integral is less than the integral of the limit. Most of these examples rely on the infinite measure of the real line and thus do not directly have probabilistic analogues, but the concentrating bump example involves functions that are all supported on the unit interval {[0,1]} and thus also poses a problem in the probabilistic setting.

Nevertheless, there are three important cases in which we can relate the limit (or, in the case of Fatou’s lemma, the limit inferior) of the integral to the integral of the limit (or limit inferior). Informally, they are:

  • (Fatou’s lemma) For unsigned {f_n}, the integral of the limit inferior cannot exceed the limit inferior of the integral. “Limits (or more precisely, limits inferior) can destroy (unsigned) mass, but cannot create it.”
  • (Monotone convergence theorem) For unsigned monotone increasing {f_n}, the limit of the integral equals the integral of the limit.
  • (Dominated convergence theorem) For {f_n} that are uniformly dominated by an absolutely integrable function, the limit of the integral equals the integral of the limit.

These three results then have analogues for convergence of random variables. We will also mention a fourth useful tool in that setting, which allows one to exchange limits and expectations when one controls a higher moment. There are a few more such general results allowing limits to be exchanged with integrals or expectations, but my advice would be to work out such exchanges by hand rather than blindly cite (possibly incorrectly) an additional convergence theorem beyond the four mentioned above, as this is safer and will help strengthen one’s intuition on the situation.

We now state and prove these results more explicitly.

Lemma 13 (Fatou’s lemma) Let {(\Omega, {\mathcal F}, \mu)} be a measure space, and let {f_1, f_2, \dots: \Omega \rightarrow [0,+\infty]} be a sequence of unsigned measurable functions. Then

\displaystyle  \int_\Omega \liminf_{n \rightarrow \infty} f_n\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.

An equivalent form of this lemma is that if one has

\displaystyle  \int_\Omega f_n\ d\mu \leq M

for some {M} and all sufficiently large {n}, then one has

\displaystyle  \int_\Omega \liminf_{n \rightarrow \infty} f_n\ d\mu \leq M

as well. That is to say, if the original unsigned functions {f_n} eventually have “mass” less than or equal to {M}, then the limit (inferior) {\liminf_{n \rightarrow \infty} f_n} also has “mass” less than or equal to {M}. The limit may have substantially less mass, as the four examples above show, but it can never have more mass (asymptotically) than the functions that comprise the limit. Of course, one can replace limit inferior by limit in the left or right hand side if one knows that the relevant limit actually exists (but one cannot replace limit inferior by limit superior if one does not already have convergence, see Example 15 below). On the other hand, it is essential that the {f_n} are unsigned for Fatou’s lemma to work, as can be seen by negating one of the first three key examples mentioned above.

Proof: By definition of the unsigned integral, it suffices to show that

\displaystyle  \int_\Omega g\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu

whenever {g} is an unsigned simple function with {g \leq \liminf_{n \rightarrow \infty} f_n}. At present, {g} is allowed to take the infinite {+\infty}, but it suffices to establish this claim for {g} that only take finite values, since the claim then follows for possibly infinite-valued {g} by applying the claim with {g} replaced by {\min(g,N)} and then letting {N} go to infinity.

Multiplying by {1-\varepsilon}, it thus suffices to show that

\displaystyle  (1-\varepsilon) \int_\Omega g\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega f_n\ d\mu

for any {0 < \varepsilon < 1} and any unsigned {g} as above.

We can write {g} as the sum {g = \sum_{i=1}^k a_i 1_{E_i}} for some strictly positive finite {a_i} and disjoint {E_i}; we allow the {a_i} and the measures {\mu(E_i)} to be infinite. On each {E_i}, we have {\liminf_{n \rightarrow \infty} f_n > (1-\varepsilon) a_i}. Thus, if we define

\displaystyle  E_{i,N} := \{ \omega \in E_i: f_n(\omega) \geq (1-\varepsilon) a_i \hbox{ for all } n \geq N \}

then the {E_{i,N}} increase to {E_i} as {N \rightarrow \infty}: {\bigcup_{N=1}^\infty E_{i,N} = E_i}. By continuity from below (Exercise 23 of Notes 0), we thus have

\displaystyle  \mu(E_{i,N}) \rightarrow \mu(E_i)

as {N \rightarrow \infty}. Since

\displaystyle  f_N \geq \sum_{i=1}^N (1-\varepsilon) a_i 1_{E_{i,N}}

we conclude upon integration that

\displaystyle  \int_\Omega f_N\ d\mu \geq \sum_{i=1}^k (1-\varepsilon) a_i \mu( E_{i,N} )

and thus on taking limit inferior

\displaystyle  \liminf_{N \rightarrow \infty} \int_\Omega f_N\ d\mu \geq \sum_{i=1}^k (1-\varepsilon) a_i \mu( E_i).

But the right-hand side is {(1-\varepsilon) \int_\Omega g\ d\mu}, and the claim follows. \Box

Of course, Fatou’s lemma may be phrased probabilistically:

Lemma 14 (Fatou’s lemma for random variables) Let {X_1,X_2,\dots} be a sequence of unsigned random variables. Then

\displaystyle  {\bf E} \liminf_{n \rightarrow \infty} X_n \leq \liminf_{n \rightarrow \infty} {\bf E} X_n.

As a corollary, if {X_1,X_2,\dots} are unsigned and converge almost surely to a random variable {Y}, then

\displaystyle  {\bf E} Y \leq \liminf_{n \rightarrow \infty} {\bf E} X_n.

Example 15 We now give an example to show that limit inferior cannot be replaced with limit superior in Fatou’s lemma. Let {X} be drawn uniformly at random from {[0,1]}, and for each {n}, let {X_n} be the {n^{th}} binary digit of {X}, thus {X_n = 1} when {2^n X} has odd integer part, and {X_n =0} otherwise. (There is some ambiguity with the binary expansion when {X} is a terminating binary decimal, but this event almost surely does not occur and can thus be safely ignored.) One has {{\bf E} X_n = \frac{1}{2}} for all {n} (why?). It is then easy to see that {\liminf_{n \rightarrow \infty} X_n} is almost surely {0} (which is consistent with Fatou’s lemma) but {\limsup_{n \rightarrow \infty} X_n} is almost surely {1} (so Fatou’s lemma fails if one replaces limit inferior with limit superior).

Next, we establish the monotone convergence theorem.

Theorem 16 (Monotone convergence theorem) Let {(\Omega, {\mathcal F}, \mu)} be a measure space, and let {f_1, f_2, \dots: \Omega \rightarrow [0,+\infty]} be a sequence of unsigned measurable functions which is monotone increasing, thus {f_n(\omega) \leq f_{n+1}(\omega)} for all {n} and {\omega \in \Omega}. Then

\displaystyle  \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu = \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.

Note that the limits exist on both sides because monotone sequences always have limits. Indeed the limit in either side is equal to the supremum. The receding infinity example shows that it is important that the functions here are monotone increasing rather than monotone decreasing. We also observe that it is enough for the {f_n} to be increasing almost everywhere rather than everywhere, since one can then modify the {f_n} on a set of measure zero to be increasing everywhere, which does not affect the integrals on either side of this theorem.

Proof: From Fatou’s lemma we already have

\displaystyle  \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu \leq \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.

On the other hand, from monotonicity we see that

\displaystyle  \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu \geq \int_\Omega f_m\ d\mu

for any natural number {m}, and on taking limits as {m \rightarrow \infty} we obtain the claim. \Box

Note that continuity from below for measures (Exercise 23.3 of Notes 0 can be viewed as the special case of the monotone convergence theorem when the functions {f_n} are all indicator functions.)

An important corollary of the monotone convergence theorem is that one can freely interchange infinite sums with integrals for unsigned functions, that is to say

\displaystyle  \int_\Omega \sum_{n=1}^\infty g_n\ d\mu = \sum_{n=1}^\infty \int_\Omega g_n\ d\mu

for any unsigned {g_1,g_2,\dots: \Omega \rightarrow [0,+\infty]} (not necessarily monotone). Indeed, to see this one simply applies the monotone convergence theorem to the partial sums {f_N := \sum_{n=1}^N g_n}.

We of course can translate this into the probabilistic context:

Theorem 17 (Monotone convergence theorem for random variables) Let {0 \leq X_1 \leq X_2 \leq \dots} be a monotone non-decreasing sequence of unsigned random variables. Then

\displaystyle  {\bf E} \lim_{n \rightarrow \infty} X_n = \lim_{n \rightarrow \infty} {\bf E} X_n.

Similarly, for any unsigned random variables {Y_1,Y_2,\dots}, we have

\displaystyle  {\bf E} \sum_{n=1}^\infty Y_n = \sum_{n=1}^\infty {\bf E} Y_n.

Again, it is sufficient for the {X_n} to be non-decreasing almost surely. We note a basic but important corollary of this theorem, namely the (first) Borel-Cantelli lemma:

Lemma 18 (Borel-Cantelli lemma) Let {E_1,E_2,\dots} be a sequence of events with {\sum_n {\bf P}(E_n) < \infty}. Then almost surely, at most finitely many of the events {E_n} hold; that is to say, one has {\sum_n 1_{E_n} < \infty} almost surely.

Proof: From the monotone convergence theorem, we have

\displaystyle  {\bf E} \sum_n 1_{E_n} = \sum_n {\bf E} 1_{E_n} = \sum_n {\bf P}(E_n) < \infty.

By Markov’s inequality, this implies that {\sum_n 1_{E_n}} is almost surely finite, as required. \Box

As the above proof shows, the Borel-Cantelli lemma is almost a triviality if one has the machinery of expectation (or integration); but it is remarkably hard to prove the lemma without that machinery, and it is an instructive exercise to attempt to do so.

We will develop a partial converse to the above lemma (the “second” Borel-Cantelli lemma) in a subsequent set of notes. For now, we give a crude converse in which we assume not only that the {{\bf P}(E_n)} sum to infinity, but they are in fact uniformly bounded from below:

Exercise 19 Let {E_1,E_2,\dots} be a sequence of events with {\inf_n {\bf P}(E_n) > 0}. Show that with positive probability, an infinite number of the {E_n} hold; that is to say, {\mathop{\bf P}( \sum_n 1_{E_n} = \infty ) > 0}. (Hint: if {{\bf P}(E_n) \geq \delta > 0} for all {n}, establish the lower bound {\mathop{\bf P}( \sum_{n \leq N} 1_{E_n} \geq \delta N/2 ) \geq \delta/2} for all {N}. Alternatively, one can apply Fatou’s lemma to the random variables {1_{\overline{E_n}}}.)

Exercise 20 Let {p_1,p_2,\dots \in [0,1]} be a sequence such that {\sum_{n=1}^\infty p_n = +\infty}. Show that there exist a sequence of events {E_1,E_2,\dots} modeled by some probability space {\Omega}, such that {{\bf P}(E_n)=p_n} for all {n}, and such that almost surely infinitely many of the {E_n} occur. Thus we see that the hypothesis {\sum_{n=1}^\infty {\bf P}(E_n) < \infty} in the Borel-Cantelli lemma cannot be relaxed.

Finally, we give the dominated convergence theorem.

Theorem 21 (Dominated convergence theorem) Let {(\Omega, {\mathcal F}, \mu)} be a measure space, and let {f_1, f_2, \dots: \Omega \rightarrow {\bf C}} be measurable functions which converge pointwise to some limit. Suppose that there is an unsigned absolutely integrable function {g: \Omega \rightarrow [0,+\infty]} which dominates the {f_n} in the sense that {|f_n(\omega)| \leq |g(\omega)|} for all {n} and all {\omega}. Then

\displaystyle  \int_\Omega \lim_{n \rightarrow \infty} f_n\ d\mu = \lim_{n \rightarrow \infty} \int_\Omega f_n\ d\mu.

In particular, the limit on the right-hand side exists.

Again, it will suffice for {g} to dominate each {f_n} almost everywhere rather than everywhere, as one can upgrade this to everywhere domination by modifying each {f_n} on a set of measure zero. Similarly, pointwise convergence can be replaced with pointwise convergence almost everywhere. The domination of each {f_n} by a single function {g} implies that the integrals {\int_\Omega |f_n|\ d\mu} are uniformly bounded in {n}, but this latter condition is not sufficient by itself to guarantee interchangeability of the limit and integral, as can be seen by the first three examples given at the start of this section.

Proof: By splitting into real and imaginary parts, we may assume without loss of generality that the {f_n} are real-valued. As {g} is absolutely integrable, it is finite almost everywhere; after modification on a set of measure zero we may assume it is finite everywhere. Let {f} denote the pointwise limit of the {f_n}. From Fatou’s lemma applied to the unsigned functions {g-f_n} and {g+f_n}, we have

\displaystyle  \int_\Omega g-f\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega g-f_n\ d\mu

and

\displaystyle  \int_\Omega g+f\ d\mu \leq \liminf_{n \rightarrow \infty} \int_\Omega g+f_n\ d\mu

Rearranging this (taking crucial advantage of the finite nature of the {\int_\Omega g\ d\mu}, and hence {\int_\Omega f\ d\mu} and {\int_\Omega f_n\ d\mu}), we conclude that

\displaystyle  \limsup_{n \rightarrow \infty} \int_\Omega f_n\ d\mu \leq \int_\Omega f\ d\mu \leq \liminf_{n \rightarrow\infty} \int_\Omega f_n\ d\mu

and the claim follows. \Box

Remark 22 Amusingly, one can use the dominated convergence theorem to give an (extremely indirect) proof of the divergence of the harmonic series {\sum_{n=1}^\infty \frac{1}{n}}. For, if that series was convergent, then the function {\sum_{n=1}^\infty \frac{1}{n} 1_{[n-1,n]}} would be absolutely integrable, and the spreading bump example described above would contradict the dominated convergence theorem. (Expert challenge: see if you can deconstruct the above argument enough to lower bound the rate of divergence of the harmonic series {\sum_{n=1}^\infty \frac{1}{n}}.)

We again translate the above theorem to the probabilistic context:

Theorem 23 (Dominated convergence theorem for random variables) Let {X_1,X_2,\dots} be scalar random variables which converge almost surely to a limit {X}. Suppose there is an unsigned absolutely integrable random variable {Y} such that {|X_n| \leq Y} almost surely for each {n}. Then

\displaystyle  \lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X.

As a corollary of the dominated convergence theorem for random variables we have the bounded convergence theorem: if {X_1,X_2,\dots} are scalar random variables that converge almost surely to a limit {X}, and are almost surely bounded in magnitude by a uniform constant {M}, then we have

\displaystyle  \lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X.

(In Durrett, the bounded convergence theorem is proven first, and then used to establish Fatou’s theorem and the dominated and monotone convergence theorems. The order in which one establishes these results – which are all closely related to each other – is largely a matter of personal taste.) A closely related corollary (which can also be established directly is that if {X_n} are scalar absolutely integrable random variables that converge uniformly to {X} (thus, for each {\varepsilon>0} there is {N>0} such that {|X_n-X| \leq \varepsilon} is surely true for all {n \geq N}), then {{\bf E} X_n} converges to {{\bf E} X}.

A further corollary of the dominated convergence theorem is that one has the identity

\displaystyle  {\bf E} \sum_n Y_n = \sum_n {\bf E} Y_n

whenever {Y_n} are scalar random variables with {\sum_n |Y_n|} absolutely integrable (or equivalently, that {\sum_n {\bf E} |Y_n|} is finite).

Another useful variant of the dominated convergence theorem is

Theorem 24 (Convergence for random variables with bounded moment) Let {X_1,X_2,\dots} be scalar random variables which converge almost surely to a limit {X}. Suppose there is {\varepsilon>0} and {M>0} such that {{\bf E} |X_n|^{1+\varepsilon} \leq M} for all {n}. Then

\displaystyle  \lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X.

This theorem fails for {\varepsilon=0}, as the concentrating bump example shows. The case {\varepsilon=1} (that is to say, bounded second moment {{\bf E} |X_n|^2}) is already quite useful. The intuition here is that concentrating bumps are in some sense the only obstruction to interchanging limits and expectations, and these can be eliminated by hypotheses such as a bounded higher moment hypothesis or a domination hypothesis.

Proof: By taking real and imaginary parts we may assume that the {X_n} (and hence {X}) are real-valued. For any natural number {m}, let {X_n^{[m]}} denote the truncation {X_n^{[m]} := \max(\min(X_n,m),-m)} of {X_n} to the interval {[-m,m]}, and similarly define {X^{[m]} := \max(\min(X,m),-m)}. Then {X_n^{[m]}} converges pointwise to {X^{[m]}}, and hence by the bounded convergence theorem

\displaystyle  \lim_{n \rightarrow \infty} {\bf E} X_n^{[m]} = {\bf E} X^{[m]}.

On the other hand, we have

\displaystyle  |X_n - X_n^{[m]}| \leq m^{-\varepsilon} |X_n|^{1+\varepsilon}

(why?) and thus on taking expectations and using the triangle inequality

\displaystyle  {\bf E} X_n = {\bf E} X_n^{[m]} + O( m^{-\varepsilon} M )

where we are using the asymptotic notation {O(X)} to denote a quantity bounded in magnitude by {CX} for an absolute constant {C}. Also, from Fatou’s lemma we have

\displaystyle  {\bf E} |X|^{1+\varepsilon} \leq M

so we similarly have

\displaystyle  {\bf E} X = {\bf E} X^{[m]} + O( m^{-\varepsilon} M )

Putting all this together, we see that

\displaystyle  \liminf_{n \rightarrow \infty} {\bf E} X_n, \limsup_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X + O( m^{-\varepsilon} M ).

Sending {m \rightarrow \infty}, we obtain the claim. \Box

Remark 25 The essential point about the condition {{\bf E} |X_n|^{1+\varepsilon}} was that the function {x \mapsto x^{1+\varepsilon}} grew faster than linearly as {x \rightarrow \infty}. One could accomplish the same result with any other function with this property, e.g. a hypothesis such as {{\bf E} |X_n| \log |X_n| \leq M} would also suffice. The most natural general condition to impose here is that of uniform integrability, which encompasses the hypotheses already mentioned, but we will not focus on this condition here.

Exercise 26 (Scheffé’s lemma) Let {X_1,X_2,\dots} be a sequence of absolutely integrable scalar random variables that converge almost surely to another absolutely integrable scalar random variable {X}. Suppose also that {{\bf E} |X_n|} converges to {{\bf E} |X|} as {n \rightarrow \infty}. Show that {{\bf E} |X-X_n|} converges to zero as {n \rightarrow \infty}. (Hint: there are several ways to prove this result, known as Scheffe’s lemma. One is to split {X_n} into two components {X_n = X_{n,1} + X_{n,2}}, such that {X_{n,1}} is dominated by {|X|} but converges almost surely to {X}, and {X_{n,2}} is such that {|X_n| = |X_{n,1}| + |X_{n,2}|}. Then apply the dominated convergence theorem.)

— 4. The distribution of a random variable —

We have seen that the expectation of a random variable is a special case of the more general notion of Lebesgue integration on a measure space. There is however another way to think of expectation as a special case of integration, which is particularly convenient for computing expectations. We first need the following definition.

Definition 27 Let {X} be a random variable taking values in a measurable space {R = (R, {\mathcal B})}. The distribution of {X} (also known as the law of {X}) is the probability measure {\mu_X} on {R} defined by the formula

\displaystyle  \mu_X(S) := {\bf P}( X \in S )

for all measurable sets {S \in {\mathcal B}}; one easily sees from the Kolmogorov axioms that this is indeed a probability measure.

In the language of measure theory, the distribution {\mu_X} on {S} is the push-forward of the probability measure {{\bf P}} on the sample space {\Omega} by the model {X_\Omega: \Omega \rightarrow R} of {X} on that sample space.

Example 28 If {X} only takes on at most countably many values (and if every point in {R} is measurable), then the distribution {\mu_X} is the discrete measure that assigns each point {a} in the range of {X} a measure of {{\bf P}(X=a)}.

Example 29 If {X} is a real random variable with cumulative distribution function {F}, then {\mu_X} is the Lebesgue-Stieltjes measure associated to {F}. For instance, if {X} is drawn uniformly at random from {[0,1]}, then {\mu_X} is Lebesgue measure restricted to {[0,1]}. In particular, two scalar variables are equal in distribution if and only if they have the same cumulative distribution function.

Example 30 If {X} and {Y} are the results of two separate rolls of a fair die (as in Example 3 of Notes 0), then {X} and {Y} are equal in distribution, but are not equal as random variables.

Remark 31 In the converse direction, given a probability measure {\mu} on a measurable space {(R, {\mathcal B})}, one can always build a probability space model and a random variable {X} represented by that model whose distribution is {\mu}. Indeed, one can perform the “tautological” construction of defining the probability space model to be {\Omega := (R, {\mathcal B}, \mu)}, and {X: \Omega \rightarrow R} to be the identity function {X(\omega) := \omega}, and then one easily checks that {\mu_X = \mu}. Compare with Corollaries 26 and 29 of Notes 0. Furthermore, one can view this tautological model as a “base” model for random variables of distribution {\mu} as follows. Suppose one has a random variable {X} of distribution {\mu} which is modeled by some other probability space {\Omega' := (\Omega', {\mathcal F}', \mu')}, thus {X_{\Omega'}: \Omega' \rightarrow R} is a measurable function such that

\displaystyle \mu(S) = {\bf P}(X \in S) = \mu'( \{ \omega' \in \Omega': X_{\Omega'}(\omega') \in S \})

for all {S \in {\mathcal B}}. Then one can view the probability space {\Omega'} as an extension of the tautological probability space {\Omega = (R, {\mathcal B}, \mu)} using {X_{\Omega'}} as the factor map.

We say that two random variables {X,Y} are equal in distribution, and write {X \stackrel{d}{=} Y}, if they have the same law: {\mu_X = \mu_Y}, that is to say {{\bf P}(X \in S) = {\bf P}(Y \in S)} for any measurable set {S} in the range. This definition makes sense even when {X,Y} are defined on different sample spaces. Roughly speaking, the distribution captures the “size” and “shape” of the random variable, but not its “location” or how it relates to other random variables. We also say that {Y} is a copy of {X} if they are equal in distribution. For instance, the two dice rolls in Example 3 of Notes 0 are copies of each other.

Theorem 32 (Change of variables formula) Let {X} be a random variable taking values in a measurable space {R = (R, {\mathcal B})}. Let {f: R \rightarrow {\bf R}} or {f: R \rightarrow {\bf C}} be a measurable scalar function (giving {{\bf R}} or {{\bf C}} the Borel {\sigma}-algebra of course) such that either {f \geq 0}, or that {{\bf E} |f(X)| < \infty}. Then

\displaystyle  {\bf E} f(X) = \int_R f(x)\ d\mu_X(x).

Thus for instance, if {X} is a real random variable, then

\displaystyle  {\bf E} |X| = \int_{\bf R} |x|\ d\mu_X(x),

and more generally

\displaystyle  {\bf E} |X|^p = \int_{\bf R} |x|^p\ d\mu_X(x)

for all {0 < p < \infty}; furthermore, if {X} is unsigned or absolutely integrable, one has

\displaystyle  {\bf E} X = \int_{\bf R} x\ d\mu_X(x).

The point here is that the integration is not over some unspecified sample space {\Omega}, but over a very explicit domain, namely the reals; we have “changed variables” to integrate over {{\bf R}} instead over {\Omega}, with the distribution {\mu_X} representing the “Jacobian” factor that typically shows up in such change of variables formulae.

If {X} is a scalar variable that only takes on at most countably many values {x_1,x_2,\dots}, the change of variables formula tells us that

\displaystyle  {\bf E} X = \sum_i x_i {\bf P}(X=x_i)

if {X} is unsigned or absolutely integrable.

Proof: First suppose that {f} is unsigned and only takes on a finite number {a_1,\dots,a_n} of values. Then

\displaystyle  f(X) = \sum_{i=1}^n a_i 1_{f(X)=a_i}

and hence

\displaystyle  {\bf E} f(X) = \sum_{i=1}^n a_i {\bf P}(f(X)=a_i)

\displaystyle  = \sum_{i=1}^n a_i \mu_X( \{ x: f(x) = a_i\} )

\displaystyle  = \int_R \sum_{i=1}^n a_i 1_{f(x)=a_i}\ d\mu_X(x)

\displaystyle  = \int_R f(x)\ d\mu_X(x)

as required.

Next, suppose that {f} is unsigned but can take on infinitely many values. We can express {f} as the monotone increasing limit of functions {f_n} that only take a finite number of values; for instance we can define {f_n(x)} to be {f(x)} rounded down to the largest multiple of {1/n} less than both {n} and {f(x)}. By the preceding computation, we have

\displaystyle  {\bf E} f_n(X) = \int_R f_n(x)\ d\mu_X(x),

and on taking limits as {n \rightarrow \infty} using the monotone convergence theorem we obtain the claim in this case.

Now suppose that {f} is real-valued with {{\bf E} |f(X)| < \infty}. We write {f = f_1 - f_2} where {f_1 := \max(f,0)} and {f_2 := \min(f,0)}, then we have {{\bf E} f_1(X), {\bf E} f_2(X) < \infty} and

\displaystyle  {\bf E} f_i(X) = \int_{\bf R} f_i(x)\ d\mu_X(x)

for {i=1,2}. Subtracting these two identities together, we obtain the claim.

Finally, the case of complex-valued {f} with {{\bf E} |f(X)| < \infty} follows from the real-valued case by taking real and imaginary parts. \Box

Example 33 Let {X} be the uniform distribution on {[0,1]}, then

\displaystyle  {\bf E} f(X) = \int_0^1 f(x)\ dx

for any Riemann integrable {f}; thus for instance

\displaystyle  {\bf E} X^p = \frac{1}{p+1}

for any {0 < p < \infty}.

Remark 34 An alternate way to prove the change of variables formula is to observe that the formula is obviously true when one uses the tautological model {(R, {\mathcal B}, \mu_X)} for {X}, and then the claim follows from the model-independence of expectation and the observation from Remark 31 that any other model for {X} is an extension of the tautological model.

Exercise 35 Let {f: {\bf R} \rightarrow [0,+\infty]} be a measurable function with {\int_{\bf R} f(x)\ dx = 1}. If one defines {m_f(E)} for any Borel subset {E} of {{\bf R}} by the formula

\displaystyle  m_f(E) := \int_E f(x)\ dx,

show that {m_f} is a probability measure on {{\bf R}} with Stieltjes measure function {F(t) = \int_{-\infty}^t f(x)\ dx}. If {X} is a real random variable with probability distribution {m_f} (in which case we call {X} a random variable with an absolutely continuous distribution, and {f} the probability density function (PDF) of {X}), show that

\displaystyle  \mathop{\bf E} G(X) = \int_{\bf R} G(x) f(x)\ dx

when either {G: {\bf R} \rightarrow [0,+\infty]} is an unsigned measurable function, or {G: {\bf R} \rightarrow {\bf C}} is measurable with {G(X)} absolutely integrable (or equivalently, that {\int_{\bf R} |G(x)| f(x)\ dx < \infty}.

Exercise 36 Let {X} be a real random variable with the probability density function {x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}} of the standard normal distribution. Establish the Stein identity

\displaystyle  {\bf E} X F(X) = {\bf E} F'(X)

whenever {F: {\bf R} \rightarrow {\bf R}} is a continuously differentiable function with {F} and {F'} both of polynomial growth (i.e., there exist constants {C, A > 0} such that {|F(t)|, |F'(t)| \leq C (1+|t|)^A} for all {t \in {\bf R}}). There is a robust converse to this identity which underpins the basis of Stein’s method, discussed in this previous blog post. Use this identity recursively to establish the identities

\displaystyle  {\bf E} X^k = 0

when {k} is an odd natural number and

\displaystyle  {\bf E} X^k = (k-1) (k-3) \dots 1 = \frac{k!}{2^{k/2} (k/2)!}

when {k} is an even natural number. (This quantity is also known as the double factorial {(k-1)!!} of {k-1}.)

Exercise 37 Let {X} be a real random variable with cumulative distribution function {F_X(xt) := {\bf P}(X \leq x)}. Show that

\displaystyle  {\bf E} e^{tX} = \int_{\bf R} (1-F_X(x)) t e^{tx}\ dx

for all {t > 0}. If {X} is nonnegative, show that

\displaystyle  {\bf E} X^p = \int_0^\infty (1-F_X(x)) p x^{p-1}\ dx

for all {p>0}.

— 5. Some basic inequalities —

The change of variables formula allows us, in principle at least, to compute the expectation {{\bf E} X} of a scalar random variable as an integral. In very simple situations, for instance when {X} has one of the standard distributions (e.g. uniform, gaussian, binomial, etc.), this allows us to compute such expectations exactly. However, once one gets to more complicated situations, one usually does not expect to be able to evaluate the required integrals in closed form. In such situations, it is often more useful to have some general inequalities concerning expectation, rather than identities.

We therefore record here for future reference some basic inequalities concerning expectation that we will need in the sequel. We have already seen the triangle inequality

\displaystyle  |{\bf E} X| \leq {\bf E} |X| \ \ \ \ \ (5)

for absolutely integrable {X}, and the Markov inequality

\displaystyle  {\bf P}(|X| \geq t) \leq \frac{1}{t} {\bf E} |X| \ \ \ \ \ (6)

for arbitrary scalar {X} and {t>0} (note the inequality is trivial if {X} is not absolutely integrable). Applying the triangle inequality to the difference {X-Y} of two absolutely integrable random variables {X,Y}, we obtain the variant

\displaystyle  |{\bf E} X - {\bf E} Y| \leq {\bf E} |X-Y|. \ \ \ \ \ (7)

Thus, for instance, if {X_n} is a sequence of absolutely integrable scalar random variables which converges in {L^1} to another absolutely integrable random variable {X}, in the sense that {{\bf E} |X_n-X| \rightarrow 0} as {n \rightarrow \infty}, then {{\bf E} X_n \rightarrow {\bf E} X} as {n \rightarrow \infty}.

Similarly, applying the Markov inequality to the quantity {|X - {\bf E} X|^2} we obtain the important Chebyshev inequality

\displaystyle  {\bf P}(|X - {\bf E} X| \geq t) \leq \frac{1}{t^2} \hbox{Var}(X) \ \ \ \ \ (8)

for absolutely integrable {X} and {t>0}, where the Variance {\hbox{Var}(X)} of {X} is defined as

\displaystyle  \hbox{Var}(X) := {\bf E}( |X - {\bf E} X|^2 ).

Next, we record

Lemma 38 (Jensen’s inequality) If {f: {\bf R} \rightarrow {\bf R}} is a convex function, {X} is a real random variable with {X} and {f(X)} both absolutely integrable, then

\displaystyle  f( {\bf E} X ) \leq {\bf E} f(X).

Proof: Let {x_0} be a real number. Being convex, the graph of {f} must be supported by some line at {(x_0,f(x_0))}, that is to say there exists a slope {c} (depending on {x_0}) such that {f(x) \ge f(x_0) + c(x-x_0)} for all {x \in {\bf R}}. (If {f} is differentiable at {x_0}, one can take {c} to be the derivative of {f} at {x_0}, but one always has a supporting line even in the non-differentiable case.) In particular

\displaystyle  f(X) \geq f(x_0) + c(X-x_0).

Taking expectations and using linearity of expectation, we conclude

\displaystyle  {\bf E} f(X) \geq f(x_0) + c({\bf E} X - x_0 )

and the claim follows from setting {x_0 := {\bf E} X}. \Box

Exercise 39 (Complex Jensen inequality) Let {f: {\bf C} \rightarrow {\bf R}} be a convex function (thus {f((1-t)z+tw) \leq (1-t)f(z) + tf(w)} for all complex {z,w} and all {0 \leq t \leq 1}, and let {X} be a complex random variable with {X} and {f(X)} both absolutely integrable. Show that

\displaystyle  f( {\bf E} X ) \leq {\bf E} f(X).

Note that the triangle inequality {|{\bf E} X| \leq {\bf E} |X|} is the special case of Jensen’s inequality (or the complex Jensen’s inequality, if {X} is complex-valued) corresponding to the convex function {f(x) := |x|} on {{\bf R}} (or {f(z) := |z|} on {{\bf C}}). Another useful example is

\displaystyle  |{\bf E} X|^2 \leq {\bf E} |X|^2.

Applying Jensen’s inequality to the convex function {f: x \mapsto e^x} and the random variable {X = \log Y} for some {Y>0}, we obtain the arithmetic mean-geometric mean inequality

\displaystyle  \exp( {\bf E} \log Y ) \leq {\bf E} Y

assuming that {Y>0} and {Y,\log Y} are absolutely integrable.

As a related application of convexity, observe from the convexity of the function {x \mapsto e^x} that

\displaystyle  e^{tx + (1-t)y} \leq t e^x + (1-t) e^y

for any {0 \leq t \leq 1} and {x,y \in {\bf R}}. This implies in particular Young’s inequality

\displaystyle  |X| |Y| \leq \frac{1}{p} |X|^p + \frac{1}{q} |Y|^q

for any scalar {X,Y} and any exponents {1 < p,q < \infty} with {\frac{1}{p} + \frac{1}{q}=1}; note that this inequality is also trivially true if one or both of {|X|, |Y|} are infinite. Taking expectations, we conclude that

\displaystyle  {\bf E} |X| |Y| \leq \frac{1}{p} {\bf E} |X|^p + \frac{1}{q} {\bf E} |Y|^q

if {X,Y} are scalar random variabels and {1 < p, q < \infty} are deterministic exponents with {\frac{1}{p} + \frac{1}{q}=1}. In particular, if {|X|^p, |Y|^q} are absolutely integrable, then so is {XY}, and

\displaystyle  |{\bf E} X Y| \leq \frac{1}{p} {\bf E} |X|^p + \frac{1}{q} {\bf E} |Y|^q.

We can amplify this inequality as follows. Multiplying {X} by some {0 < \lambda < \infty} and dividing {Y} by the same {\lambda}, we conclude that

\displaystyle  |{\bf E} X Y| \leq \frac{\lambda^p}{p} {\bf E} |X|^p + \frac{\lambda^{-q}}{q} {\bf E} |Y|^q;

optimising the right-hand side in {\lambda}, we obtain (after some algebra, and after disposing of some edge cases when {X} or {Y} is almost surely zero) the important Hölder inequality

\displaystyle  |{\bf E} X Y| \leq ({\bf E} |X|^p)^{1/p} ({\bf E} |Y|^q)^{1/q} \ \ \ \ \ (9)

which we can write as

\displaystyle  |{\bf E} X Y| \leq \| X\|_p \|Y\|_q

where we use the notation

\displaystyle  \|X\|_p := ({\bf E} |X|^p)^{1/p}

for {0 < p < \infty}. Using the convention

\displaystyle  \|X\|_\infty := \inf \{ M: |X| \leq M \hbox{ a. s.} \}

(thus {\|X\|_\infty} is the essential supremum of {X}), we also see from the triangle inequality that the Hölder inequality applies in the boundary case when one of {p,q} is allowed to be {\infty} (so that the other is equal to {1}):

\displaystyle  |{\bf E} X Y| \leq \| X\|_1 \|Y\|_\infty, \|X\|_\infty \|Y\|_1.

The case {p=q=2} is the important Cauchy-Schwarz inequality

\displaystyle  |{\bf E} X Y| \leq \| X\|_2 \|Y\|_2, \ \ \ \ \ (10)

valid whenever {X,Y} are square-integrable in the sense that {\|X\|_2, \|Y\|_2} are finite.

Exercise 40 Show that the expressions {\|X\|_p} are non-decreasing in {p} for {p \in (0,+\infty]}. In particular, if {\|X\|_p} is finite for some {p}, then it is automatically finite for all smaller values of {p}.

Exercise 41 For any square-integrable {X}, show that

\displaystyle  {\bf P}(X \neq 0) \geq \frac{ ({\bf E} |X|)^2}{{\bf E}(|X|^2)}.

Exercise 42 If {1 < p < \infty} and {X,Y} are scalar random variables with {\|X\|_p, \|Y\|_p < \infty}, use Hölder's inequality to establish that

\displaystyle  {\bf E} |X| |X+Y|^{p-1} \leq \|X\|_p \|X+Y\|_p^{p-1}

and

\displaystyle  {\bf E} |Y| |X+Y|^{p-1} \leq \|X\|_p \|X+Y\|_p^{p-1}

and then conclude the Minkowski inequality

\displaystyle  \|X+Y\|_p \leq \|X\|_p + \|Y\|_p.

Show that this inequality is also valid at the endpoint cases {p=1} and {p=\infty}.

Exercise 43 If {X} is non-negative and square-integrable, and {0 \leq \theta \leq 1}, establish the Paley-Zygmund inequality

\displaystyle  {\bf P}( X > \theta {\bf E}(X) ) \geq (1-\theta)^2 \frac{({\bf E} X)^2}{{\bf E}(|X|^2)}.

(Hint: use the Cauchy-Schwarz inequality to upper bound {{\bf E} X 1_{X > \theta {\bf E} X}} in terms of {{\bf E} |X|^2} and {{\bf P}( X > \theta {\bf E}(X) )}.)

Exercise 44 Let {X} be a non-negative random variable that is almost surely bounded but not identically zero, show that

\displaystyle  \mathop{\bf P}( X \geq \frac{1}{2} \mathop{\bf E} X) \geq \frac{1}{2} \frac{\mathop{\bf E} X}{\|X\|_\infty}.