In the previous set of notes, we constructed the measure-theoretic notion of the Lebesgue integral, and used this to set up the probabilistic notion of expectation on a rigorous footing. In this set of notes, we will similarly construct the measure-theoretic concept of a product measure (restricting to the case of probability measures to avoid unnecessary technicalities), and use this to set up the probabilistic notion of independence on a rigorous footing. (To quote Durrett: “measure theory ends and probability theory begins with the definition of independence.”) We will be able to take virtually any collection of random variables (or probability distributions) and couple them together to be independent via the product measure construction, though for infinite products there is the slight technicality (a requirement of the Kolmogorov extension theorem) that the random variables need to range in standard Borel spaces. This is not the only way to couple together such random variables, but it is the simplest and the easiest to compute with in practice, as we shall see in the next few sets of notes.
— 1. Product measures —
for any Borel sets . This is in fact true (see Exercise 4 below), and is part of a more general phenomenon, which we phrase here in the case of probability measures:
for all . Furthermore, we have the following two facts:
- (Tonelli theorem) If is measurable, then for each , the function is measurable on , and the function is measurable on . Similarly, for each , the function is measurable on and is measurable on . Finally, we have
- (Fubini theorem) If is absolutely integrable, then for -almost every , the function is absolutely integrable on , and the function is absolutely integrable on . Similarly, for -almost every , the function is absolutely integrable on and is absolutely integrable on . Finally, we have
The Fubini and Tonelli theorems are often used together (so much so that one may refer to them as a single theorem, the Fubini-Tonelli theorem, often also just referred to as Fubini’s theorem in the literature). For instance, given an absolutely integrable function and an absolutely integrable function , the Tonelli theorem tells us that the tensor product defined by
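Our proof of Theorem 1 will be based on the monotone class lemma that allows one to conveniently generate a $\sigma$-algebra from a Boolean algebra. (In Durrett, the closely related $\pi$-$\lambda$ theorem is used in place of the monotone class lemma.) Define a monotone class in a set $\Omega$ to be a collection $\mathcal{F}$ of subsets of $\Omega$ with the following two closure properties:
- If $E_1 \subset E_2 \subset \dots$ are a countable increasing sequence of sets in $\mathcal{F}$, then $\bigcup_{n=1}^\infty E_n \in \mathcal{F}$.
- If $E_1 \supset E_2 \supset \dots$ are a countable decreasing sequence of sets in $\mathcal{F}$, then $\bigcap_{n=1}^\infty E_n \in \mathcal{F}$.
Thus for instance any $\sigma$-algebra is a monotone class, but not conversely. Nevertheless, there is a key way in which monotone classes “behave like” $\sigma$-algebras:
Lemma 2 (Monotone class lemma) Let $\mathcal{A}$ be a Boolean algebra on $\Omega$. Then $\langle \mathcal{A} \rangle$, the $\sigma$-algebra generated by $\mathcal{A}$, is the smallest monotone class that contains $\mathcal{A}$.
Proof: Let $\mathcal{F}$ be the intersection of all the monotone classes that contain $\mathcal{A}$. Since $\langle \mathcal{A} \rangle$ is clearly one such class, $\mathcal{F}$ is a subset of $\langle \mathcal{A} \rangle$. Our task is then to show that $\mathcal{F}$ contains $\langle \mathcal{A} \rangle$.
It is also clear that $\mathcal{F}$ is a monotone class that contains $\mathcal{A}$. By replacing all the elements of $\mathcal{F}$ with their complements, we see that $\mathcal{F}$ is necessarily closed under complements.
For any $E \in \mathcal{A}$, consider the set $\mathcal{C}_E$ of all sets $F \in \mathcal{F}$ such that $F \backslash E$, $E \backslash F$, $F \cap E$, and $\Omega \backslash (E \cup F)$ all lie in $\mathcal{F}$. It is clear that $\mathcal{C}_E$ contains $\mathcal{A}$; since $\mathcal{F}$ is a monotone class, we see that $\mathcal{C}_E$ is also. By definition of $\mathcal{F}$, we conclude that $\mathcal{C}_E = \mathcal{F}$ for all $E \in \mathcal{A}$.
Next, let $\mathcal{D}$ be the set of all $E \in \mathcal{F}$ such that $F \backslash E$, $E \backslash F$, $F \cap E$, and $\Omega \backslash (E \cup F)$ all lie in $\mathcal{F}$ for all $F \in \mathcal{F}$. By the previous discussion, we see that $\mathcal{D}$ contains $\mathcal{A}$. One also easily verifies that $\mathcal{D}$ is a monotone class. By definition of $\mathcal{F}$, we conclude that $\mathcal{D} = \mathcal{F}$. Since $\mathcal{F}$ is also closed under complements, this implies that $\mathcal{F}$ is closed with respect to finite unions. Since this class also contains $\mathcal{A}$, which contains $\emptyset$, we conclude that $\mathcal{F}$ is a Boolean algebra. Since $\mathcal{F}$ is also closed under increasing countable unions, we conclude that it is closed under arbitrary countable unions, and is thus a $\sigma$-algebra. As it contains $\mathcal{A}$, it must also contain $\langle \mathcal{A} \rangle$.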
We now begin the proof of Theorem 1. We begin with the uniqueness claim. Suppose that we have two measures on that are product measures of and in the sense that
for all and . If we then set to be the collection of all such that , then contains all sets of the form with and . In fact contains the collection of all sets that are “elementary” in the sense that they are of the form for finite and for , since such sets can be easily decomposed into a finite union of disjoint products , at which point the claim follows from (4) and finite additivity. But is a Boolean algebra that generates as a -algebra, and from continuity from above and below we see that is a monotone class. By the monotone class lemma, we conclude that is all of , and hence . This gives uniqueness. Now we prove existence. We first claim that for any measurable set , the sets are measurable in . Indeed, the claim is obvious for sets that are “elementary” in the sense that they belong to the Boolean algebra defined previously, and the collection of all such sets is a monotone class, so the claim follows from the monotone class lemma. A similar argument (relying on monotone or dominated convergence) shows that the function
is measurable in for all . Thus, for any , we can define the quantity by
A routine application of the monotone convergence theorem verifies that is a countably additive measure; one easily checks that (2) holds for all , and in particular is a probability measure.
By construction, we see that the identity
holds (with all functions integrated being measurable) whenever is an indicator function with . By linearity of integration, the same identity holds (again with all functions measurable) when is an unsigned simple function. Since any unsigned measurable function can be expressed as the monotone non-decreasing limit of unsigned simple functions (for instance, one can round down to the largest multiple of that is less than and ), the above identity also holds for unsigned measurable by the monotone convergence theorem. Applying this fact to the absolute value of an absolutely integrable function , we conclude for such functions that
which by Markov’s inequality implies that
for -almost every . In other words, the function is absolutely integrable on for -almost every . By monotonicity we conclude that
and hence the function is absolutely integrable. Hence it makes sense to ask whether the identity
holds for absolutely integrable , as both sides are well-defined. We have already established this claim when is unsigned and absolutely integrable; by subtraction this implies the claim for real-valued absolutely integrable , and by taking real and imaginary parts we obtain the claim for complex-valued absolutely integrable .
We may reverse the roles of and , and define instead by the formula
By the previously proved uniqueness of product measure, we see that this defines the same product measure as previously. Repeating the previous arguments we obtain all the above claims with the roles of and reversed. This gives all the claims required for Theorem 1.
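For finite sample spaces, the product construction and the Tonelli identity can be checked by direct computation. The following is a minimal sketch under that assumption; the two probability measures `mu`, `nu` and the unsigned function `f` are hypothetical choices made purely for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability spaces: outcome -> probability.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
nu = {0: Fraction(1, 4), 1: Fraction(3, 4)}

# Product measure on X x Y, determined by its values on single points.
mu_x_nu = {(x, y): mu[x] * nu[y] for x, y in product(mu, nu)}

def measure(m, event):
    """Measure of a finite event (a set of outcomes) under m."""
    return sum((m[w] for w in event), Fraction(0))

def f(x, y):
    """An arbitrary unsigned function on X x Y (illustrative only)."""
    return (ord(x) - ord("a") + 1) * (y + 2)

# Rectangle identity: (mu x nu)(E x F) = mu(E) nu(F).
E, F = {"a", "c"}, {1}
rect = measure(mu_x_nu, set(product(E, F)))

# Tonelli: the double integral equals both iterated integrals.
double = sum(f(x, y) * p for (x, y), p in mu_x_nu.items())
x_outer = sum(mu[x] * sum(nu[y] * f(x, y) for y in nu) for x in mu)
y_outer = sum(nu[y] * sum(mu[x] * f(x, y) for x in mu) for y in nu)
```

Exact rational arithmetic is used so that the identities hold as equalities rather than up to rounding error.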
One can extend the product construction easily to finite products:
whenever for . Furthermore, show that
for any partition (after making the obvious identification between and ). Thus for instance one has the associativity property
for any probability spaces for .
By writing as products of pairs of probability spaces in many different ways, one can obtain a higher-dimensional analogue of the Fubini and Tonelli theorems; we leave the precise statement of such a theorem to the interested reader.
for measurable functions that are not unsigned, are usually only justified when is absolutely integrable on , or equivalently (by the Tonelli theorem) the function is absolutely integrable on (or that is absolutely integrable on . Without this joint absolute integrability (and without any unsigned property on ), the identity (5) can fail even if both sides are well-defined. For instance, let be the unit interval , and let be the uniform probability measure on this interval, and set
One can check that both sides of (5) are well-defined, but that the left-hand side is and the right-hand side is . Of course, this function is neither unsigned nor jointly absolutely integrable, so this counterexample does not violate either of the Fubini or Tonelli theorems. Thus one should take care to only interchange integrals when the integrands are known to be either unsigned or jointly absolutely integrable, or if one has another way to rigorously justify the exchange of integrals.
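To see numerically how the interchange of integrals can fail without joint absolute integrability, here is a minimal sketch. It does not use the function from the text above; instead it assumes a different classical example on the unit square, $f(x,y) = (x^2 - y^2)/(x^2 + y^2)^2$, whose inner integrals have closed forms. The helper names `inner_dy`, `inner_dx` and `midpoint` are hypothetical.

```python
import math

def inner_dy(x):
    """Exact value of the integral of f(x, y) dy over [0, 1], for x > 0.

    Here f(x, y) = (x^2 - y^2) / (x^2 + y^2)^2, and y / (x^2 + y^2)
    is an antiderivative of f in y, so the integral is 1 / (1 + x^2).
    """
    return 1.0 / (1.0 + x * x)

def inner_dx(y):
    """Exact value of the integral of f(x, y) dx over [0, 1], for y > 0.

    By the antisymmetry f(x, y) = -f(y, x), this is -1 / (1 + y^2).
    """
    return -1.0 / (1.0 + y * y)

def midpoint(g, n=100_000):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    h = 1.0 / n
    return h * sum(g((k + 0.5) * h) for k in range(n))

dy_then_dx = midpoint(inner_dy)   # approximates  pi/4
dx_then_dy = midpoint(inner_dx)   # approximates -pi/4
```

The two iterated integrals differ in sign. This is consistent with the Fubini and Tonelli theorems, since this $f$ is neither unsigned nor absolutely integrable on the square.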
The above theory extends from probability spaces to finite measure spaces, and more generally to measure spaces that are $\sigma$-finite, that is to say they are expressible as the countable union of sets of finite measure. (With a bit of care, some portions of product measure theory are even extendible to non-$\sigma$-finite settings, though I urge caution in applying these results blindly in that case.) We will not give the details of these generalisations here, but content ourselves with one example:
Exercise 4 Establish (4) for all Borel sets . (Hint: can be viewed as the disjoint union of a countable sequence of sets of measure .)
Remark 5 When doing real analysis (as opposed to probability), it is convenient to complete the Borel $\sigma$-algebra $\mathcal{B}[\mathbf{R}^n]$ on spaces such as $\mathbf{R}^n$, to form the larger Lebesgue $\sigma$-algebra $\mathcal{L}[\mathbf{R}^n]$, defined as the collection of all subsets $E$ in $\mathbf{R}^n$ that differ from a Borel set $F$ in $\mathbf{R}^n$ by a sub-null set, in the sense that $E \Delta F \subset G$ for some Borel subset $G$ of $\mathbf{R}^n$ of zero Lebesgue measure. There are analogues of the Fubini and Tonelli theorems for such complete $\sigma$-algebras; see these previous lecture notes for details. However one should be cautioned that the product $\mathcal{L}[\mathbf{R}^n] \times \mathcal{L}[\mathbf{R}^m]$ of Lebesgue $\sigma$-algebras is not the Lebesgue $\sigma$-algebra $\mathcal{L}[\mathbf{R}^{n+m}]$, but is instead an intermediate $\sigma$-algebra between $\mathcal{B}[\mathbf{R}^{n+m}]$ and $\mathcal{L}[\mathbf{R}^{n+m}]$, which causes some additional small complications. For instance, if $f: \mathbf{R}^{n+m} \to [0,+\infty]$ is Lebesgue measurable, then the functions $y \mapsto f(x,y)$ can only be found to be Lebesgue measurable on $\mathbf{R}^m$ for almost every $x \in \mathbf{R}^n$, rather than for all $x \in \mathbf{R}^n$. We will not dwell on these subtleties further here, as we will rarely have any need to complete the $\sigma$-algebras used in probability theory.
It is also important in probability theory applications to form the product of an infinite number of probability spaces for , where can be infinite or even uncountable. Recall from Notes 0 that the product -algebra on is defined to be the -algebra generated by the sets for and , where is the usual coordinate projection. Equivalently, if we define an elementary set to be a subset of of the form , where is a finite subset of , is the obvious projection map to , and is a measurable set in , then can be defined as the -algebra generated by the collection of elementary sets. (Elementary sets are the measure-theoretic analogue of cylinder sets in point set topology.) For future reference we note the useful fact that is a Boolean algebra.
We define a product measure to be a probability measure on the measurable space which extends all of the finite products in the sense that
for all finite subsets of and all in , where . If this product measure exists, it is unique:
Exercise 6 Show that for any collection of probability spaces for , there is at most one product measure . (Hint: adapt the uniqueness argument in Theorem 1 that used the monotone class lemma.)
Exercise 7 Let be probability measures on , and let be their Stieltjes measure functions. Show that is the unique probability measure on whose Stieltjes measure function is the tensor product of .
In the case of finite , the finite product constructed in Exercise 3 is clearly the unique product. But for infinite , the construction of product measure is a more nontrivial issue. We can generalise the problem as follows:
for all finite and ?
Again, one has uniqueness:
The extension problem is trivial for finite , but for infinite there are unfortunately examples where the probability measure fails to exist. However, there is one key case in which we can build the extension, thanks to the Kolmogorov extension theorem. Call a measurable space standard Borel if it is isomorphic as a measurable space to a Borel subset of the unit interval with the Borel $\sigma$-algebra, that is to say there is a bijection from to a Borel subset of such that and are both measurable. (In Durrett, such spaces are called nice spaces.) Note that one can easily replace by other standard spaces such as if desired, since these spaces are isomorphic as measurable spaces (why?).
Theorem 10 (Kolmogorov extension theorem) Let the situation be as in Problem 8. If all the measurable spaces are standard Borel, then there exists a probability measure solving the extension problem (which is then unique, thanks to Exercise 9).
The proof of this theorem is lengthy and is deferred to the next (optional) section. Specialising to the product case, we conclude
Corollary 11 Let be a collection of probability spaces with standard Borel. Then there exists a product measure (which is then unique, thanks to Exercise 6).
Of course, to use this theorem we would like to have a large supply of standard Borel spaces. Here is one tool that often suffices:
Lemma 12 Let $(X,d)$ be a complete separable metric space, and let $\Omega$ be a Borel subset of $X$. Then $\Omega$ (with the Borel $\sigma$-algebra) is standard Borel.
Proof: Let us call two topological spaces Borel isomorphic if their corresponding Borel structures are isomorphic as measurable spaces. Using the binary expansion, we see that $[0,1]$ is Borel isomorphic to $\{0,1\}^{\mathbf{N}}$ (the countable number of points that have two binary expansions can be easily permuted to obtain a genuine isomorphism). Similarly $[0,1]^{\mathbf{N}}$ is Borel isomorphic to $(\{0,1\}^{\mathbf{N}})^{\mathbf{N}}$. Since $\mathbf{N} \times \mathbf{N}$ is in bijection with $\mathbf{N}$, we conclude that $[0,1]^{\mathbf{N}}$ is Borel isomorphic to $[0,1]$. Thus it will suffice to show that every complete separable metric space $(X,d)$ is Borel isomorphic to a Borel subset of $[0,1]^{\mathbf{N}}$. But if we let $q_1, q_2, \dots$ be a countable dense subset in $X$, the map
$$x \mapsto \left( \frac{d(x,q_i)}{1 + d(x,q_i)} \right)_{i \in \mathbf{N}}$$
can easily be seen to be a homeomorphism between $X$ and a subset of $[0,1]^{\mathbf{N}}$, which is completely metrisable and hence Borel (in fact it is a $G_\delta$ set – the countable intersection of open sets – why?). The claim follows.
Exercise 13 (Kolmogorov extension theorem, alternate form) For each natural number , let be a probability measure on with the property that
for and any box in , where we identify with in the usual manner. Show that there exists a unique probability measure on (with the product -algebra, or equivalently the Borel -algebra on the product topology) such that
for all and Borel sets .
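The compatibility hypothesis in this exercise can be illustrated in a discrete setting that is not a product measure. The following is a minimal sketch, assuming a hypothetical two-state Markov chain with state space $\{0,1\} \subset \mathbf{R}$; the initial law `init` and transition matrix `trans` are invented for illustration. It checks that the $(n+1)$-step path law, summed over the last coordinate, recovers the $n$-step path law.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-state Markov chain: initial law and transition matrix.
init = [Fraction(1, 3), Fraction(2, 3)]
trans = [[Fraction(3, 4), Fraction(1, 4)],
         [Fraction(1, 2), Fraction(1, 2)]]

def mu_n(path):
    """Probability that the first len(path) steps equal path (len >= 1)."""
    p = init[path[0]]
    for a, b in zip(path, path[1:]):
        p *= trans[a][b]
    return p

def consistent(n):
    """Check mu_{n+1}(B x {0,1}) = mu_n(B) for every singleton B in {0,1}^n."""
    return all(
        mu_n(path + (0,)) + mu_n(path + (1,)) == mu_n(path)
        for path in product((0, 1), repeat=n)
    )

checks = [consistent(n) for n in range(1, 8)]
```

Checking singletons suffices here because every subset of $\{0,1\}^n$ is a finite disjoint union of singletons. The extension theorem then supplies a single measure on infinite paths with these finite-dimensional marginals.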
— 2. Proof of the Kolmogorov extension theorem (optional) —
We now prove Theorem 10. By the definition of a standard Borel space, we may assume without loss of generality that each is a Borel subset of with the Borel -algebra, and then by extending each to we may in fact assume without loss of generality that each is simply with the Borel -algebra. Thus each for finite is a probability measure on the cube .
We will exploit the regularity properties of such measures:
and the outer regularity property
Hint: use the monotone class lemma.
Another way of stating the above exercise is that finite Borel measures on the cube are automatically Radon measures. In fact there is nothing particularly special about the unit cube here; the claim holds for any compact separable metric spaces. Radon measures are often used in real analysis (see e.g. these lecture notes) but we will not develop their theory further here.
Observe that one can define the elementary measure of any elementary set in by defining
for any finite and any Borel . This definition is well-defined thanks to the compatibility hypothesis (6). From the finite additivity of the it is easy to see that is a finitely additive probability measure on the Boolean algebra of elementary sets.
We would like to extend to a countably additive probability measure on . The standard approach to do this is via the Carathéodory extension theorem in measure theory (or the closely related Hahn-Kolmogorov theorem); this approach is presented in these previous lecture notes, and a similar approach is taken in Durrett. Here, we will try to avoid developing the Carathéodory extension theorem, and instead take a more direct approach similar to the direct construction of Lebesgue measure, given for instance in these previous lecture notes.
Given any subset (not necessarily Borel), we define its outer measure to be the quantity
where we say that is an open elementary cover of if each is an open elementary set, and . Some properties of this outer measure are easily established:
- (i) Show that .
- (ii) (Monotonicity) Show that if then .
- (iii) (Countable subadditivity) For any countable sequence of subsets of , show that . In particular (from part (i)) we have the finite subadditivity for all .
- (iv) (Elementary sets) If is an elementary set, show that . (Hint: first establish the claim when is compact, relying heavily on the regularity properties of the provided by Exercise 14, then extend to the general case by further heavy reliance on regularity.) In particular, we have .
- (v) (Approximation) Show that if , then for any there exists an elementary set such that . (Hint: use the monotone class lemma. When dealing with an increasing sequence of measurable sets obeying the required property, approximate these sets by an increasing sequence of elementary sets , and use the finite additivity of elementary measure and the fact that bounded monotone sequences converge.)
From part (v) of the above exercise, we see that every can be viewed as a “limit” of a sequence of elementary sets such that . From parts (iii), (iv) we see that the sequence is a Cauchy sequence and thus converges to a limit, which we denote ; one can check from further application of (iii), (iv) that this quantity does not depend on the specific choice of . (Indeed, from subadditivity we see that .) From definition we see that extends (thus for any elementary set ), and from the above exercise one checks that is countably additive. Thus is a probability measure with the desired properties, and the proof of the Kolmogorov extension theorem is complete.
— 3. Independence —
Using the notion of product measure, we can now quickly define the notion of independence:
Definition 16 A collection $(X_i)_{i \in A}$ of random variables $X_i$ (each of which takes values in some measurable space $R_i$) is said to be jointly independent, if the distribution of $(X_i)_{i \in A}$ is the product of the distributions of the $X_i$. Or equivalently (after expanding all the definitions), we have
$$\mathbf{P}\Big( \bigwedge_{i \in J} (X_i \in S_i) \Big) = \prod_{i \in J} \mathbf{P}(X_i \in S_i)$$
for all finite $J \subset A$ and all measurable subsets $S_i$ of $R_i$. We say that two random variables $X, Y$ are independent (or that $X$ is independent of $Y$) if the pair $(X,Y)$ is jointly independent.
It is worth reiterating that unless otherwise specified, all random variables under consideration are being modeled by a single probability space. The notion of independence between random variables does not make sense if the random variables are only being modeled by separate probability spaces; they have to be coupled together into a single probability space before independence becomes a meaningful notion.
Independence is a non-trivial notion only when one has two or more random variables; by chasing through the definitions we see that any collection of zero or one variables is automatically jointly independent.
Example 17 If we let $(X,Y)$ be drawn uniformly from a product $E \times F$ of two Borel sets $E, F$ in $\mathbf{R}$ of positive finite Lebesgue measure, then $X$ and $Y$ are independent. However, if $(X,Y)$ is drawn uniformly from another shape (e.g. a parallelogram), then one usually does not expect to have independence.
As a special case of the above definition, a finite family of random variables taking values in is jointly independent if one has
for all measurable in for .
Suppose that is a family of independent random variables, with each taking values in . From Exercise 3 we see that
whenever are disjoint finite subsets of , is the tuple , and is a measurable subset of . In particular, we see that the tuples are also jointly independent. This implies in turn that are jointly independent for any measurable functions . Thus, for instance, if are jointly independent random variables taking values in respectively, then and are independent for any measurable and . In particular, if two scalar random variables are jointly independent of a third random variable (i.e. the triple are jointly independent), then combinations such as or are also independent of .
We remark that there is a quantitative version of the above facts used in information theory, known as the data processing inequality, but this is beyond the scope of this course.
if and are either both unsigned, or both absolutely integrable. We caution however that the converse is not true: just because two random variables happen to obey (8) does not necessarily mean that they are independent; instead, we say merely that they are uncorrelated, which is a weaker statement.
More generally, if and are random variables taking values in ranges respectively, then
for any scalar functions on respectively, provided that and are either both unsigned, or both absolutely integrable. This is the property of and which is equivalent to independence (as can be seen by specialising to those that take values in ): thus for instance independence of two unsigned random variables entails not only (8), but , , etc. A similar statement holds for the joint independence of larger numbers of random variables. It is this ability to easily decouple expectations of independent random variables that makes independent variables particularly easy to compute with in probability.
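The decoupling of expectations can be verified exactly for discrete random variables. The following is a minimal sketch, assuming two hypothetical finite laws `law_x` and `law_y` coupled independently through their product measure; the test functions passed to `factorises` are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Hypothetical laws of two discrete random variables X and Y.
law_x = {-1: Fraction(1, 4), 0: Fraction(1, 4), 2: Fraction(1, 2)}
law_y = {1: Fraction(2, 3), 3: Fraction(1, 3)}

# Coupling X and Y independently: the joint law is the product measure.
joint = {(x, y): law_x[x] * law_y[y] for x, y in product(law_x, law_y)}

def expect(law, h):
    """Expectation of h under a finite law (outcome -> probability)."""
    return sum((p * h(w) for w, p in law.items()), Fraction(0))

def factorises(f, g):
    """Check E[f(X) g(Y)] = E[f(X)] E[g(Y)] under the product coupling."""
    lhs = expect(joint, lambda w: f(w[0]) * g(w[1]))
    return lhs == expect(law_x, f) * expect(law_y, g)

results = [
    factorises(lambda x: x, lambda y: y),
    factorises(lambda x: x * x, lambda y: y),
    factorises(lambda x: x, lambda y: y * y),
    factorises(lambda x: int(x > 0), lambda y: int(y == 3)),
]
```

The last test uses indicator functions, which is the special case that recovers the defining identity of independence.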
Exercise 18 Show that a random variable is independent of itself (i.e. and are independent) if and only if is almost surely equal to a constant.
Exercise 19 Show that a constant (deterministic) random variable is independent of any other random variable.
Exercise 20 Let be discrete random variables (i.e. they take values in at most countable spaces equipped with the discrete sigma-algebra). Show that are jointly independent if and only if one has
for all .
Exercise 21 Let be real scalar random variables. Show that are jointly independent if and only if one has
for all .
The following exercise demonstrates that probabilistic independence is analogous to linear independence:
Exercise 22 Let be a finite-dimensional vector space over a finite field , and let be a random variable drawn uniformly at random from . Let be a non-degenerate bilinear form on , and let be non-zero vectors in . Show that the random variables are jointly independent if and only if the vectors are linearly independent.
Exercise 23 Give an example of three random variables which are pairwise independent (that is, any two of are independent of each other), but not jointly independent. (Hint: one can use the preceding exercise.)
Another analogy is with orthogonality:
Exercise 24 Let be a random variable taking values in with the Gaussian distribution, in the sense that
(where denotes the Euclidean norm on ), and let be vectors in . Show that the random variables (with denoting the Euclidean inner product) are jointly independent if and only if the are pairwise orthogonal.
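We say that a family $(E_i)_{i \in A}$ of events are jointly independent if their indicator random variables $(1_{E_i})_{i \in A}$ are jointly independent. Undoing the definitions, this is equivalent to requiring that
$$\mathbf{P}\Big( \bigwedge_{i \in J} E_i \wedge \bigwedge_{i \in J'} \overline{E_i} \Big) = \prod_{i \in J} \mathbf{P}(E_i) \prod_{i \in J'} (1 - \mathbf{P}(E_i))$$
for all disjoint finite subsets $J, J'$ of $A$. This condition is complicated, but simplifies in the case of just two events:
Exercise 25
- (i) Show that two events $E, F$ are independent if and only if $\mathbf{P}(E \wedge F) = \mathbf{P}(E) \mathbf{P}(F)$.
- (ii) If $E, F, G$ are events, show that the condition $\mathbf{P}(E \wedge F \wedge G) = \mathbf{P}(E) \mathbf{P}(F) \mathbf{P}(G)$ is necessary, but not sufficient, to ensure that $E, F, G$ are jointly independent.
- (iii) Give an example of three events $E, F, G$ that are pairwise independent, but not jointly independent.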
Because of the product measure construction, it is easy to insert independent sources of randomness into an existing randomness model by extending that model, thus giving a more useful version of Corollaries 27 and 31 of Notes 0:
Proposition 26 Suppose one has a collection of events and random variables modeled by some probability space , and let be a probability measure on a measurable space . Then there exists an extension of the probability space , and a random variable modeled by taking values in , such that has distribution and is independent of all random variables that were previously modeled by .
More generally, given a finite collection of probability spaces on measurable spaces , there exists an extension of and random variables modeled by taking values in for each , such that each has distribution and and are jointly independent for any random variable that was previously modeled by .
If the are all standard Borel spaces, then one can also take to be infinite (even if is uncountable).
Proof: For the first part, we define the extension to be the product of with the probability space , with factor map defined by , and with modeled by . It is then routine to verify all the claimed properties. The other parts of the proposition are proven similarly, using Corollary 11 for the final part.
Using this proposition, for instance, one can start with a given random variable $X$ and create an independent copy $X'$ of that variable, which has the same distribution as $X$ but is independent of $X$, by extending the probability model. Indeed one can create any finite number of independent copies, or even an infinite number if $X$ takes values in a standard Borel space (in particular, one can do this if $X$ is a scalar random variable). A finite or infinite sequence $X_1, X_2, \dots$ of random variables that are jointly independent and all have the same distribution is said to be an independent and identically distributed (or iid for short) sequence of random variables. The above proposition allows us to easily generate such sequences by extending the sample space as necessary.
Exercise 27 Let be random variables that are independent and identically distributed copies of the Bernoulli random variable with expectation , that is to say the are jointly independent with for all .
- (i) Show that the random variable is uniformly distributed on the unit interval .
- (ii) Show that the random variable has the distribution of Cantor measure (constructed for instance in Example 1.2.4 of Durrett).
Note that part (i) of this exercise provides a means to construct Lebesgue measure on the unit interval (although, when one unpacks the construction, it is actually not too different from the standard construction, as given for instance in this previous set of notes).
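A finite truncation of this construction can be examined exactly. The following is a minimal sketch, assuming the random variable in part (i) is the binary expansion $\sum_n 2^{-n} \epsilon_n$ of fair independent bits, truncated to the first $N$ bits; the names `truncated_sum`, `law` and `cdf` are hypothetical. It is an illustration of the dyadic structure, not a proof of the exercise.

```python
from fractions import Fraction
from itertools import product
from collections import Counter

def truncated_sum(bits):
    """The partial sum  sum_{n=1}^{N} bits[n-1] * 2^{-n}  as an exact fraction."""
    return sum((Fraction(b, 2 ** n) for n, b in enumerate(bits, start=1)),
               Fraction(0))

N = 10
# Each of the 2^N bit strings has probability 2^{-N} under N fair, independent bits.
counts = Counter(truncated_sum(bits) for bits in product((0, 1), repeat=N))

# Law of the truncated sum: value -> probability.
law = {v: Fraction(c, 2 ** N) for v, c in counts.items()}

def cdf(t):
    """P(truncated sum <= t)."""
    return sum((p for v, p in law.items() if v <= t), Fraction(0))
```

The truncated sum is uniform on the dyadic grid $\{k/2^N : 0 \leq k < 2^N\}$, so its distribution function stays within $2^{-N}$ of that of the uniform distribution on the unit interval.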
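Given two square integrable real random variables $X, Y$, the covariance $\mathrm{Cov}(X,Y)$ between the two is defined by the formula
$$\mathrm{Cov}(X,Y) := \mathbf{E}\big( (X - \mathbf{E} X)(Y - \mathbf{E} Y) \big).$$
The covariance is well-defined thanks to the Cauchy-Schwarz inequality, and it is not difficult to see that one has the alternate formula
$$\mathrm{Cov}(X,Y) = \mathbf{E}(XY) - (\mathbf{E} X)(\mathbf{E} Y)$$
for the covariance. Note that the variance is a special case of the covariance: $\mathbf{Var}(X) = \mathrm{Cov}(X,X)$.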
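The agreement of the two covariance formulas can be confirmed on a small example. The following is a minimal sketch, assuming a hypothetical joint law `joint` for a pair of discrete real random variables; exact rational arithmetic is used so that the two expressions can be compared for equality.

```python
from fractions import Fraction

# Hypothetical joint law of a pair (X, Y): (x, y) -> probability.
joint = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 3),
    (2, 3): Fraction(1, 6),
}

def expect(h):
    """Expectation of h(X, Y) under the joint law."""
    return sum((p * h(x, y) for (x, y), p in joint.items()), Fraction(0))

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# Covariance as the expected product of the centred variables ...
cov_centred = expect(lambda x, y: (x - mean_x) * (y - mean_y))
# ... and via the alternate formula E[XY] - E[X] E[Y].
cov_alt = expect(lambda x, y: x * y) - mean_x * mean_y

# The variance is the covariance of X with itself.
var_x = expect(lambda x, y: (x - mean_x) ** 2)
cov_xx = expect(lambda x, y: x * x) - mean_x * mean_x
```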
From construction we see that if are independent square integrable variables, then the covariance vanishes. The converse is not true:
Exercise 28 Give an example of two square-integrable real random variables $X, Y$ which have vanishing covariance $\mathrm{Cov}(X,Y) = 0$, but are not independent.
However, there is one key case in which the converse does hold, namely that of gaussian random vectors.
Exercise 29 A random vector taking values in is said to be a gaussian random vector if there exists and an positive definite real symmetric matrix such that
for all Borel sets (where we identify elements of with column vectors). The distribution of is called a multivariate normal distribution.
- (i) If is a gaussian random vector with the indicated parameters , show that and for . In particular . Thus we see that the parameters of a gaussian random variable can be recovered from the mean and covariances.
- (ii) If is a gaussian random vector and , show that and are independent if and only if the covariance vanishes. Furthermore, show that are jointly independent if and only if all the covariances for vanish. In particular, for gaussian random vectors, joint independence is equivalent to pairwise independence. (Contrast this with Exercise 23.)
- (iii) Give an example of two real random variables , each of which is gaussian, and for which , but such that and are not independent. (Hint: take to be the product of with a random sign.) Why does this not contradict (ii)?
We have discussed independence of random variables, and independence of events. It is also possible to define a notion of independence of -algebras. More precisely, define a -algebra of events to be a collection of events that contains the empty event, is closed under Boolean operations (in particular, under complements ) and under countable conjunctions and countable disjunctions. Each such -algebra, when using a probability space model , is modeled by a -algebra of measurable sets in , which behaves under an extension in the obvious pullback fashion:
A random variable taking values in some range is said to be measurable with respect to a $\sigma$-algebra of events if the event lies in for every measurable subset of ; in terms of a probabilistic model , is measurable with respect to if and only if is measurable with respect to . Note that every random variable generates a $\sigma$-algebra of events, defined to be the collection of all events of the form for a measurable subset of ; this is the smallest $\sigma$-algebra with respect to which is measurable. More generally, given any collection of random variables, one can define the $\sigma$-algebra to be the smallest $\sigma$-algebra of events with respect to which all of the are measurable; in terms of a model , we have
where is the range of . Similarly, any collection of events generates a -algebra of events , defined as the smallest -algebra of events that contains all of the ; with respect to a model , one has
Definition 30 A collection of -algebras of events are said to be jointly independent if, whenever is a random variable measurable with respect to for , the tuple is jointly independent. Equivalently, is jointly independent if and only if one has
whenever is a finite subset of and for (why is this equivalent?).
Thus, for instance, and are independent -algebras of events if and only if one has
for all and , that is to say that all the events in are independent of all the events in .
The above notion generalises the notion of independence for random variables:
Exercise 31 If are a collection of random variables, show that are jointly independent random variables if and only if are jointly independent -algebras.
Exercise 32 Let be a sequence of random variables. Show that are jointly independent if and only if is independent of for all natural numbers .
Suppose one has a sequence of random variables (such a sequence can be referred to as a discrete stochastic process). For each natural number , we can define the $\sigma$-algebras , as the smallest $\sigma$-algebra that makes all of the for measurable; for instance, this $\sigma$-algebra contains any event that is definable in terms of measurable relations of finitely many of the , together with countable Boolean operations on such events. These $\sigma$-algebras are clearly decreasing in . We can define the tail $\sigma$-algebra to be the intersection of all these $\sigma$-algebras, that is to say consists of those events which lie in for every . For instance, if the are scalar random variables that converge almost surely to a limit , then we see that (after modification on a null set) is measurable with respect to the tail $\sigma$-algebra .
We have the remarkable Kolmogorov 0-1 law that says that the tail -algebra of a sequence of independent random variables is essentially trivial:
Theorem 33 (Kolmogorov zero-one law) Let be a sequence of jointly independent random variables. Then every event in the tail -algebra has probability equal to either or .
As a corollary of the zero-one law, note that any real scalar tail random variable will be almost surely constant (because, for each rational , the event is either almost surely true or almost surely false). Similarly for tail random variables taking values in or .
Example 34 Let be a sequence of jointly independent random variables in (not necessarily identically distributed). The random variable is measurable in the tail algebra, and hence must be almost surely constant, thus there exists such that almost surely. Similarly there exists such that . Thus, either we have and the converge almost surely to a deterministic limit, or and the almost surely do not converge. What cannot happen is (for instance) that converges with probability , and diverges with probability ; the zero-one law forces the only available probabilities of tail events to be zero or one.
Proof: Since are jointly independent, the -algebra is independent of for any . In particular, is independent of . Since the -algebra is generated by the for , a simple application of the monotone class lemma then shows that is also independent of . But contains , hence is independent of itself. But the only events that are independent of themselves have probability or , and the claim follows.
Note that the zero-one law gives no guidance as to which of the two probabilities actually occurs for a given tail event. This usually cannot be determined from such “soft” tools as the zero-one law; instead one often has to work with more “hard” estimates, in particular in explicit inequalities for the probabilities of various events that approximate the given tail event. On the other hand, the proof technique used to prove the Kolmogorov zero-one law is quite general, and is often adapted to prove other zero-one laws in the probability literature.
The zero-one law suggests that many asymptotic statistics of random variables will almost surely have deterministic values. We will see specific examples of this in the next few notes, when we discuss the law of large numbers and the central limit theorem.