254A, Notes 0: A review of probability theory

1 January, 2010 in 254A - random matrices, math.CA, math.PR | Tags: almost sure convergence, conditioning, convergence in probability, independence, moments, random variables | by Terence Tao

In preparation for my upcoming course on random matrices, I am briefly reviewing some relevant foundational aspects of probability theory, as well as setting up basic probabilistic notation that we will be using in later posts. This is quite basic material for a graduate course, and somewhat pedantic in nature, but given how heavily we will be relying on probability theory in this course, it seemed appropriate to take some time to go through these issues carefully.

We will certainly not attempt to cover all aspects of probability theory in this review. Aside from the utter foundations, we will be focusing primarily on those probabilistic concepts and operations that are useful for bounding the distribution of random variables, and on ensuring convergence of such variables as one sends a parameter ${n}$ off to infinity.

We will assume familiarity with the foundations of measure theory; see for instance these earlier lecture notes of mine for a quick review of that topic. This is also not intended to be a first introduction to probability theory, but is instead a revisiting of these topics from a graduate-level perspective (and in particular, after one has understood the foundations of measure theory). Indeed, I suspect it will be almost impossible to follow this course without already having a firm grasp of undergraduate probability theory.

— 1. Foundations —

At a purely formal level, one could call probability theory the study of measure spaces with total measure one, but that would be like calling number theory the study of strings of digits which terminate. At a practical level, the opposite is true: just as number theorists study concepts (e.g. primality) that have the same meaning in every numeral system that models the natural numbers, we shall see that probability theorists study concepts (e.g. independence) that have the same meaning in every measure space that models a family of events or random variables. And indeed, just as the natural numbers can be defined abstractly without reference to any numeral system (e.g. by the Peano axioms), core concepts of probability theory, such as random variables, can also be defined abstractly, without explicit mention of a measure space; we will return to this point when we discuss free probability later in this course.

For now, though, we shall stick to the standard measure-theoretic approach to probability theory. In this approach, we assume the presence of an ambient sample space ${\Omega}$ , which intuitively is supposed to describe all the possible outcomes of all the sources of randomness that one is studying. Mathematically, this sample space is a probability space ${\Omega = (\Omega,{\mathcal B},{\bf P})}$ – a set ${\Omega}$ , together with a ${\sigma}$ -algebra ${{\mathcal B}}$ of subsets of ${\Omega}$ (the elements of which we will identify with the probabilistic concept of an event), and a probability measure ${{\bf P}}$ on the space of events, i.e. an assignment ${E \mapsto {\bf P}(E)}$ of a real number in ${[0,1]}$ to every event ${E}$ (known as the probability of that event), such that the whole space ${\Omega}$ has probability ${1}$ , and such that ${{\bf P}}$ is countably additive.

Elements of the sample space ${\Omega}$ will be denoted ${\omega}$ . However, for reasons that will be explained shortly, we will try to avoid actually referring to such elements unless absolutely required to.

If we were studying just a single random process, e.g. rolling a single die, then one could choose a very simple sample space – in this case, one could choose the finite set ${\{1,\ldots,6\}}$ , with the discrete ${\sigma}$ -algebra ${2^{\{1,\ldots,6\}} := \{ A: A \subset \{1,\ldots,6\}\}}$ and the uniform probability measure. But if one later wanted to also study additional random processes (e.g. supposing one later wanted to roll a second die, and then add the two resulting rolls), one would have to change the sample space (e.g. to change it now to the product space ${\{1,\ldots,6\} \times \{1,\ldots,6\}}$ ). If one was particularly well organised, one could in principle work out in advance all of the random variables one would ever want or need, and then specify the sample space accordingly, before doing any actual probability theory. In practice, though, it is far more convenient to add new sources of randomness on the fly, if and when they are needed, and extend the sample space as necessary.

This point is often glossed over in introductory probability texts, so let us spend a little time on it. We say that one probability space ${(\Omega',{\mathcal B}', {\bf P}')}$ extends another ${(\Omega,{\mathcal B}, {\bf P})}$ if there is a surjective map ${\pi: \Omega' \rightarrow \Omega}$ which is measurable (i.e. ${\pi^{-1}(E) \in {\mathcal B}'}$ for every ${E \in {\mathcal B}}$ ) and probability preserving (i.e. ${{\bf P}'(\pi^{-1}(E)) = {\bf P}(E)}$ for every ${E \in {\mathcal B}}$ ). (Strictly speaking, it is the pair ${((\Omega',{\mathcal B}', {\bf P}'), \pi)}$ which is the extension of ${(\Omega,{\mathcal B}, {\bf P})}$ , not just the space ${(\Omega',{\mathcal B}', {\bf P}')}$ , but let us abuse notation slightly here.) By definition, every event ${E}$ in the original probability space is canonically identified with an event ${\pi^{-1}(E)}$ of the same probability in the extension.

Example 1 As mentioned earlier, the sample space ${\{1,\ldots,6\}}$ , that models the roll of a single die, can be extended to the sample space ${\{1,\ldots,6\} \times \{1,\ldots,6\}}$ that models the roll of the original die together with a new die, with the projection map ${\pi: \{1,\ldots,6\} \times \{1,\ldots,6\} \rightarrow \{1,\ldots,6\}}$ being given by ${\pi(x,y) := x}$ .

Another example of an extension map is that of a permutation – for instance, replacing the sample space ${\{1,\ldots,6\}}$ by the isomorphic space ${\{ a,\ldots,f\}}$ by mapping ${a}$ to ${1}$ , etc. This extension is not actually adding any new sources of randomness, but is merely reorganising the existing randomness present in the sample space.

In order to have the freedom to perform extensions every time we need to introduce a new source of randomness, we will try to adhere to the following important dogma: probability theory is only “allowed” to study concepts and perform operations which are preserved with respect to extension of the underlying sample space. (This is analogous to how differential geometry is only “allowed” to study concepts and perform operations that are preserved with respect to coordinate change, or how graph theory is only “allowed” to study concepts and perform operations that are preserved with respect to relabeling of the vertices, etc.) As long as one is adhering strictly to this dogma, one can insert as many new sources of randomness (or reorganise existing sources of randomness) as one pleases; but if one deviates from this dogma and uses specific properties of a single sample space, then one has left the category of probability theory and must now take care when doing any subsequent operation that could alter that sample space. This dogma is an important aspect of the probabilistic way of thinking, much as the insistence on studying concepts and performing operations that are invariant with respect to coordinate changes or other symmetries is an important aspect of the modern geometric way of thinking. With this probabilistic viewpoint, we shall soon see the sample space essentially disappear from view altogether, after a few foundational issues are dispensed with.

Let’s give some simple examples of what is and what is not a probabilistic concept or operation. The probability ${{\bf P}(E)}$ of an event is a probabilistic concept; it is preserved under extensions. Similarly, boolean operations on events such as union, intersection, and complement are also preserved under extensions and are thus also probabilistic operations. The emptiness or non-emptiness of an event ${E}$ is also probabilistic, as is the equality or non-equality of two events ${E,F}$ (note how it was important here that we demanded the map ${\pi}$ to be surjective in the definition of an extension). On the other hand, the cardinality of an event is not a probabilistic concept; for instance, the event that the roll of a given die gives ${4}$ has cardinality one in the sample space ${\{1,\ldots,6\}}$ , but has cardinality six in the sample space ${\{1,\ldots,6\} \times \{1,\ldots,6\}}$ when the values of an additional die are used to extend the sample space. Thus, in the probabilistic way of thinking, one should avoid thinking about events as having cardinality, except to the extent that they are either empty or non-empty.
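The die example can be checked mechanically. The following is only a minimal sketch in Python, assuming finite sample spaces with the uniform measure; the helper names `prob`, `pi` and `pullback` are introduced here purely for illustration. It verifies that probabilities and boolean operations survive the extension ${(x,y) \mapsto x}$ , while cardinality does not.

```python
from fractions import Fraction
from itertools import product

# Original sample space: one die, uniform measure.
omega = list(range(1, 7))
# Extended sample space: the original die together with a new die.
omega_ext = list(product(range(1, 7), repeat=2))

def prob(event, space):
    """Uniform probability of an event (a set of outcomes) in a finite space."""
    return Fraction(len(event), len(space))

def pi(outcome):
    """Projection (x, y) -> x from the extended space back to the original."""
    return outcome[0]

def pullback(event):
    """The event pi^{-1}(E) in the extended space."""
    return {w for w in omega_ext if pi(w) in event}

E = {4}          # "the die shows 4"
F = {2, 4, 6}    # "the die is even"

for event in (E, F, set(), set(omega)):
    # Probability is preserved under the extension...
    assert prob(pullback(event), omega_ext) == prob(event, omega)

# ...as are boolean operations...
assert pullback(E | F) == pullback(E) | pullback(F)
assert pullback(E & F) == pullback(E) & pullback(F)

# ...but cardinality is not.
print(len(E), len(pullback(E)))  # 1 6
```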

Indeed, once one is no longer working at the foundational level, it is best to try to suppress the fact that events are being modeled as sets altogether. To assist in this, we will choose notation that avoids explicit use of set theoretic notation. For instance, the union of two events ${E, F}$ will be denoted ${E \vee F}$ rather than ${E \cup F}$ , and will often be referred to by a phrase such as “the event that at least one of ${E}$ or ${F}$ holds”. Similarly, the intersection ${E \cap F}$ will instead be denoted ${E \wedge F}$ , or “the event that ${E}$ and ${F}$ both hold”, and the complement ${\Omega \backslash E}$ will instead be denoted ${\overline{E}}$ , or “the event that ${E}$ does not hold” or “the event that ${E}$ fails”. In particular the sure event ${\Omega}$ can now be referred to without any explicit mention of the sample space as ${\overline{\emptyset}}$ . We will continue to use the subset notation ${E \subset F}$ (since the notation ${E \leq F}$ may cause confusion), but refer to this statement as “ ${E}$ is contained in ${F}$ ” or “ ${E}$ implies ${F}$ ” or “ ${E}$ holds only if ${F}$ holds” rather than “ ${E}$ is a subset of ${F}$ ”, again to downplay the role of set theory in modeling these events.

We record the trivial but fundamental union bound

$\displaystyle {\bf P}( \bigvee_i E_i ) \leq \sum_i {\bf P}(E_i) \ \ \ \ \ (1)$

for any finite or countably infinite collection of events ${E_i}$ . Taking complements, we see that if each event ${E_i}$ fails with probability at most ${\epsilon_i}$ , then the joint event ${\bigwedge_i E_i}$ fails with probability at most ${\sum_i \epsilon_i}$ . Thus, if one wants to ensure that all the events ${E_i}$ hold at once with a reasonable probability, one can try to do this by showing that the failure rate of the individual ${E_i}$ is small compared to the number of events one is controlling. This is a reasonably efficient strategy so long as one expects the events ${E_i}$ to be genuinely “different” from each other; if there are plenty of repetitions, then the union bound is poor (consider for instance the extreme case when ${E_i}$ does not even depend on ${i}$ ).
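To see both regimes concretely, here is a small exact computation on the two-dice sample space. It is only a sketch: the three events are arbitrary illustrative choices, and `prob` is a hypothetical helper. The bound is reasonably tight when the events are genuinely different, and loses a factor of three when the same event is repeated three times.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice

def prob(predicate):
    """Exact probability of the event {w : predicate(w)} under the uniform measure."""
    return Fraction(sum(1 for w in omega if predicate(w)), len(omega))

# Three genuinely different events.
events = [
    lambda w: w[0] == 6,         # first die shows 6
    lambda w: w[1] == 6,         # second die shows 6
    lambda w: w[0] + w[1] == 7,  # the rolls sum to 7
]
p_union = prob(lambda w: any(e(w) for e in events))
bound = sum(prob(e) for e in events)
print(p_union, bound)  # 5/12 1/2

# Extreme case: the same event repeated three times.
repeated = [events[0]] * 3
p_union_rep = prob(lambda w: any(e(w) for e in repeated))
bound_rep = sum(prob(e) for e in repeated)
print(p_union_rep, bound_rep)  # 1/6 1/2
```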

We will sometimes refer to use of the union bound to bound probabilities as the zeroth moment method, to contrast it with the first moment method, second moment method, exponential moment method, and Fourier moment methods for bounding probabilities that we will encounter later in this course.

Let us formalise some specific cases of the union bound that we will use frequently in the course. In most of this course, there will be an integer parameter ${n}$ , which will often be going off to infinity, and upon which most other quantities will depend; for instance, we will often be considering the spectral properties of ${n \times n}$ random matrices.

Definition 1 (Asymptotic notation) We use ${X = O(Y)}$ , ${Y = \Omega(X)}$ , ${X \ll Y}$ , or ${Y \gg X}$ to denote the estimate ${|X| \leq CY}$ for some ${C}$ independent of ${n}$ and all ${n \geq C}$ . If we need ${C}$ to depend on a parameter, e.g. ${C = C_k}$ , we will indicate this by subscripts, e.g. ${X = O_k(Y)}$ . We write ${X=o(Y)}$ if ${|X| \leq c(n) Y}$ for some ${c}$ that goes to zero as ${n \rightarrow \infty}$ . We write ${X \sim Y}$ or ${X = \Theta(Y)}$ if ${X \ll Y \ll X}$ .

Given an event ${E = E_n}$ depending on such a parameter ${n}$ , we have five notions (in decreasing order of confidence) that an event is likely to hold:

An event ${E}$ holds surely (or is true) if it is equal to the sure event ${\overline{\emptyset}}$ .
An event ${E}$ holds almost surely (or with full probability) if it occurs with probability ${1}$ : ${{\bf P}(E) = 1}$ .
An event ${E}$ holds with overwhelming probability if, for every fixed ${A>0}$ , it holds with probability ${1-O_A(n^{-A})}$ (i.e. one has ${{\bf P}(E) \geq 1 - C_A n^{-A}}$ for some ${C_A}$ independent of ${n}$ ).
An event ${E}$ holds with high probability if it holds with probability ${1-O(n^{-c})}$ for some ${c>0}$ independent of ${n}$ (i.e. one has ${{\bf P}(E) \geq 1 - C n^{-c}}$ for some ${C}$ independent of ${n}$ ).
An event ${E}$ holds asymptotically almost surely if it holds with probability ${1-o(1)}$ , thus the probability of success goes to ${1}$ in the limit ${n \rightarrow \infty}$ .

Of course, all of these notions are probabilistic notions.

Given a family of events ${E_\alpha}$ depending on some parameter ${\alpha}$ , we say that each event in the family holds with overwhelming probability uniformly in ${\alpha}$ if the constant ${C_A}$ in the definition of overwhelming probability is independent of ${\alpha}$ ; one can similarly define uniformity in the concepts of holding with high probability or asymptotic almost sure probability.

From the union bound (1) we immediately have

Lemma 2 (Union bound)

If ${E_\alpha}$ is an arbitrary family of events that each hold surely, then ${\bigwedge_\alpha E_\alpha}$ holds surely.

If ${E_\alpha}$ is an at most countable family of events that each hold almost surely, then ${\bigwedge_\alpha E_\alpha}$ holds almost surely.

If ${E_\alpha}$ is a family of events of polynomial cardinality (i.e. cardinality ${O(n^{O(1)})}$ ) which hold with uniformly overwhelming probability, then ${\bigwedge_\alpha E_\alpha}$ holds with overwhelming probability.

If ${E_\alpha}$ is a family of events of sub-polynomial cardinality (i.e. cardinality ${O(n^{o(1)})}$ ) which hold with uniformly high probability, then ${\bigwedge_\alpha E_\alpha}$ holds with high probability. (In particular, the cardinality can be polylogarithmic in size, ${O(\log^{O(1)} n)}$ .)

If ${E_\alpha}$ is a family of events of uniformly bounded cardinality (i.e. cardinality ${O(1)}$ ) which each hold asymptotically almost surely, then ${\bigwedge_\alpha E_\alpha}$ holds asymptotically almost surely. (Note that uniformity of asymptotic almost sureness is automatic when the cardinality is bounded.)
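
To illustrate how these claims follow from (1), here is a sketch of the polynomial cardinality case; the constants ${C}$ and ${k}$ are names introduced only for this illustration. Suppose the family has cardinality at most ${C n^k}$ , and that for every fixed ${B>0}$ each ${E_\alpha}$ fails with probability at most ${C_B n^{-B}}$ , with ${C_B}$ independent of ${\alpha}$ . Given ${A>0}$ , applying (1) to the complements ${\overline{E_\alpha}}$ with ${B := A+k}$ gives

$\displaystyle {\bf P}( \overline{\bigwedge_\alpha E_\alpha} ) \leq \sum_\alpha {\bf P}( \overline{E_\alpha} ) \leq C n^k \cdot C_{A+k} n^{-(A+k)} = C C_{A+k} n^{-A},$

which is of the form ${O_A(n^{-A})}$ . The other cases are similar: what matters each time is that the number of events is small compared to the reciprocal of the individual failure probability.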

Note how as the certainty of an event gets stronger, the number of times one can apply the union bound increases. In particular, holding with overwhelming probability is practically as good as holding surely or almost surely in many of our applications (except when one has to deal with the entropy of an ${n}$ -dimensional system, which can be exponentially large, and will thus require a certain amount of caution).

— 2. Random variables —

An event ${E}$ can be in just one of two states: the event can hold or fail, with some probability assigned to each. But we will usually need to consider the more general class of random variables which can be in multiple states.

Definition 3 (Random variable) Let ${R = (R,{\mathcal R})}$ be a measurable space (i.e. a set ${R}$ , equipped with a ${\sigma}$ -algebra ${{\mathcal R}}$ of subsets of ${R}$ ). A random variable taking values in ${R}$ (or an ${R}$ -valued random variable) is a measurable map ${X}$ from the sample space to ${R}$ , i.e. a function ${X: \Omega \rightarrow R}$ such that ${X^{-1}(S)}$ is an event for every ${S \in {\mathcal R}}$ .

As the notion of a random variable involves the sample space, one has to pause to check that it is invariant under extensions before one can assert that it is a probabilistic concept. But this is clear: if ${X: \Omega \rightarrow R}$ is a random variable, and ${\pi: \Omega' \rightarrow \Omega}$ is an extension of ${\Omega}$ , then ${X' := X \circ \pi}$ is also a random variable, which generates the same events in the sense that ${(X')^{-1}(S) = \pi^{-1}( X^{-1}(S) )}$ for every ${S \in {\mathcal R}}$ .

At this point let us make the convenient convention (which we have in fact been implicitly using already) that an event is identified with the predicate which is true on the event set and false outside of the event set. Thus for instance the event ${X^{-1}(S)}$ could be identified with the predicate “ ${X \in S}$ ”; this is preferable to the set-theoretic notation ${\{ \omega \in \Omega: X(\omega) \in S \}}$ , as it does not require explicit reference to the sample space and is thus more obviously a probabilistic notion. We will often omit the quotes when it is safe to do so, for instance ${{\bf P}(X \in S)}$ is shorthand for ${{\bf P}(``X \in S'')}$ .

Remark 1 On occasion, we will have to deal with almost surely defined random variables, which are only defined on a subset ${\Omega'}$ of ${\Omega}$ of full probability. However, much as measure theory and integration theory is largely unaffected by modification on sets of measure zero, many probabilistic concepts, in particular probability, distribution, and expectation, are similarly unaffected by modification on events of probability zero. Thus, a lack of definedness on an event of probability zero will usually not cause difficulty, so long as there are at most countably many such events in which one of the probabilistic objects being studied is undefined. In such cases, one can usually resolve such issues by setting a random variable to some arbitrary value (e.g. ${0}$ ) whenever it would otherwise be undefined.

We observe a few key subclasses and examples of random variables:

Discrete random variables, in which ${{\mathcal R} = 2^R}$ is the discrete ${\sigma}$ -algebra, and ${R}$ is at most countable. Typical examples of ${R}$ include a countable subset of the reals or complexes, such as the natural numbers or integers. If ${R=\{0,1\}}$ , we say that the random variable is Boolean, while if ${R}$ is just a singleton set ${\{c\}}$ we say that the random variable is deterministic, and (by abuse of notation) we identify this random variable with ${c}$ itself. Note that a Boolean random variable is nothing more than an indicator function ${{\bf I}(E)}$ of an event ${E}$ , where ${E}$ is the event that the boolean function equals ${1}$ .
Real-valued random variables, in which ${R}$ is the real line and ${{\mathcal R}}$ is the Borel ${\sigma}$ -algebra, generated by the open sets of ${R}$ . Thus for any real-valued random variable ${X}$ and any interval ${I}$ , we have the events “ ${X \in I}$ “. In particular, we have the upper tail event “ ${X \geq \lambda}$ ” and lower tail event “ ${X \leq \lambda}$ ” for any threshold ${\lambda}$ . (We also consider the events “ ${X > \lambda}$ ” and “ ${X < \lambda}$ ” to be tail events; in practice, there is very little distinction between the two.)
Complex random variables, whose range is the complex plane with the Borel ${\sigma}$ -algebra. A typical event associated to a complex random variable ${X}$ is the small ball event “ ${|X-z| < r}$ ” for some complex number ${z}$ and some (small) radius ${r>0}$ . We refer to real and complex random variables collectively as scalar random variables.
Given a ${R}$ -valued random variable ${X}$ , and a measurable map ${f: R \rightarrow R'}$ , the ${R'}$ -valued random variable ${f(X)}$ is indeed a random variable, and the operation of converting ${X}$ to ${f(X)}$ is preserved under extension of the sample space and is thus probabilistic. This variable ${f(X)}$ can also be defined without reference to the sample space as the unique random variable for which the identity
$\displaystyle ``f(X) \in S'' = ``X \in f^{-1}(S)''$

holds for all ${R'}$ -measurable sets ${S}$ .
Given two random variables ${X_1}$ and ${X_2}$ taking values in ${R_1, R_2}$ respectively, one can form the joint random variable ${(X_1,X_2)}$ with range ${R_1 \times R_2}$ with the product ${\sigma}$ -algebra, by setting ${(X_1,X_2)(\omega) := (X_1(\omega),X_2(\omega))}$ for every ${\omega \in \Omega}$ . One easily verifies that this is indeed a random variable, and that the operation of taking a joint random variable is a probabilistic operation. This variable can also be defined without reference to the sample space as the unique random variable for which one has ${\pi_1(X_1,X_2)=X_1}$ and ${\pi_2(X_1,X_2)=X_2}$ , where ${\pi_1: (x_1,x_2) \mapsto x_1}$ and ${\pi_2: (x_1,x_2) \mapsto x_2}$ are the usual projection maps from ${R_1 \times R_2}$ to ${R_1, R_2}$ respectively. One can similarly define the joint random variable ${(X_\alpha)_{\alpha \in A}}$ for any family of random variables ${X_\alpha}$ in various ranges ${R_\alpha}$ (note here that the set ${A}$ of labels can be infinite or even uncountable).
Combining the previous two constructions, given any measurable binary operation ${f: R_1 \times R_2 \rightarrow R'}$ and random variables ${X_1, X_2}$ taking values in ${R_1, R_2}$ respectively, one can form the ${R'}$ -valued random variable ${f(X_1,X_2) := f((X_1,X_2))}$ , and this is a probabilistic operation. Thus for instance one can add or multiply together scalar random variables, and similarly for the matrix-valued random variables that we will consider shortly. Similarly for ternary and higher order operations. A technical issue arises if one wants to perform an operation (such as division of two scalar random variables) which is not defined everywhere (e.g. division when the denominator is zero): in such cases, one has to adjoin an additional “undefined” symbol ${\top}$ to the output range ${R'}$ . In practice, this will not be a problem as long as all random variables concerned are defined (i.e. avoid ${\top}$ ) almost surely.
Vector-valued random variables, which take values in a finite-dimensional vector space such as ${{\bf R}^n}$ or ${{\bf C}^n}$ with the Borel ${\sigma}$ -algebra. One can view a vector-valued random variable ${X = (X_1,\ldots,X_n)}$ as the joint random variable of its scalar component random variables ${X_1,\ldots,X_n}$ .
Matrix-valued random variables or random matrices, which take values in a space ${M_{n \times p}({\bf R})}$ or ${M_{n \times p}({\bf C})}$ of ${n \times p}$ real or complex-valued matrices, again with the Borel ${\sigma}$ -algebra, where ${n, p \geq 1}$ are integers (usually we will focus on the square case ${n=p}$ ). Note here that the shape ${n \times p}$ of the matrix is deterministic; we will not consider in this course matrices whose shapes are themselves random variables. One can view a matrix-valued random variable ${X = (X_{ij})_{1 \leq i \leq n; 1 \leq j \leq p}}$ as the joint random variable of its scalar components ${X_{ij}}$ . One can apply all the usual matrix operations (e.g. sum, product, determinant, trace, inverse, etc.) on random matrices to get a random variable with the appropriate range, though in some cases (e.g with inverse) one has to adjoin the undefined symbol ${\top}$ as mentioned earlier.
Point processes, which take values in the space ${{\mathfrak N}(S)}$ of subsets ${A}$ of a space ${S}$ (or more precisely, on the space of multisets of ${S}$ , or even more precisely still as integer-valued locally finite measures on ${S}$ ), with the ${\sigma}$ -algebra being generated by the counting functions ${|A \cap B|}$ for all precompact measurable sets ${B}$ . Thus, if ${X}$ is a point process in ${S}$ , and ${B}$ is a precompact measurable set, then the counting function ${|X \cap B|}$ is a discrete random variable in ${\{0,1,2,\ldots\} \cup \{+\infty\}}$ . For us, the key example of a point process comes from taking the spectrum ${\{ \lambda_1,\ldots,\lambda_n\}}$ of eigenvalues (counting multiplicity) of a random ${n \times n}$ matrix ${M_n}$ . I discuss point processes further in this previous blog post. We will return to point processes (and define them more formally) later in this course.
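
The joint random variable and binary operation constructions in the list above can be sketched on a finite sample space as follows. This is only an illustration under stated assumptions: random variables are modelled as plain Python functions on the two-dice space, and the names `joint`, `apply`, `safe_div` and the sentinel `UNDEFINED` (standing in for the adjoined undefined symbol) are hypothetical. The denominator here is deliberately chosen to vanish with positive probability, so that the undefined symbol is actually attained.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # sample space: two fair dice

UNDEFINED = "undefined"  # stand-in for the adjoined "undefined" symbol

# Random variables are functions on the sample space.
X1 = lambda w: w[0]
X2 = lambda w: w[1] - 3  # takes the value 0 when the second die shows 3

def joint(*variables):
    """Joint random variable w -> (X_1(w), ..., X_k(w))."""
    return lambda w: tuple(X(w) for X in variables)

def apply(f, X):
    """The random variable f(X), i.e. the composition of f with X."""
    return lambda w: f(X(w))

def safe_div(pair):
    """Division, extended by UNDEFINED where the denominator vanishes."""
    num, den = pair
    return Fraction(num, den) if den != 0 else UNDEFINED

def prob(predicate):
    """Exact probability under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if predicate(w)), len(omega))

Q = apply(safe_div, joint(X1, X2))  # the random variable X1 / X2

print(prob(lambda w: Q(w) == UNDEFINED))  # 1/6
print(prob(lambda w: Q(w) == 2))          # 1/12
```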

Remark 2 A pedantic point: strictly speaking, one has to include the range ${R = (R,{\mathcal R})}$ of a random variable ${X}$ as part of that variable (thus one should really be referring to the pair ${(X,R)}$ rather than ${X}$ ). This leads to the annoying conclusion that, technically, boolean random variables are not integer-valued, integer-valued random variables are not real-valued, and real-valued random variables are not complex-valued. To avoid this issue we shall abuse notation very slightly and identify any random variable ${X = (X,R)}$ with any coextension ${(X,R')}$ of that random variable to a larger range space ${R' \supset R}$ (assuming of course that the ${\sigma}$ -algebras are compatible). Thus, for instance, a real-valued random variable which happens to only take a countable number of values will now be considered a discrete random variable also.

Given a random variable ${X}$ taking values in some range ${R}$ , we define the distribution ${\mu_X}$ of ${X}$ to be the probability measure on the measurable space ${R = (R,{\mathcal R})}$ defined by the formula

$\displaystyle \mu_X(S) := {\bf P}(X \in S), \ \ \ \ \ (2)$

thus ${\mu_X}$ is the pushforward ${X_* {\bf P}}$ of the sample space probability measure ${{\bf P}}$ by ${X}$ . This is easily seen to be a probability measure, and is also a probabilistic concept. The probability measure ${\mu_X}$ is also known as the law for ${X}$ .
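
As a concrete instance of (2), the following sketch computes the distribution of the sum of two dice as a pushforward, and checks that it is unchanged when the sample space is extended by a third die. The helper `distribution` is a name introduced for illustration, and the sample spaces are assumed finite with the uniform measure.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice, uniform measure

def distribution(X, space):
    """Pushforward of the uniform measure on `space` under X, as a dict x -> mu_X({x})."""
    counts = Counter(X(w) for w in space)
    return {x: Fraction(c, len(space)) for x, c in counts.items()}

S = lambda w: w[0] + w[1]  # the sum of the two rolls
mu_S = distribution(S, omega)
print(mu_S[7])             # 1/6
print(sum(mu_S.values()))  # 1

# Extending the sample space by a third die does not change the law of S.
omega_ext = list(product(range(1, 7), repeat=3))
S_ext = lambda w: S(w[:2])  # S composed with the projection (x, y, z) -> (x, y)
assert distribution(S_ext, omega_ext) == mu_S
```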

We write ${X \equiv Y}$ for ${\mu_X=\mu_Y}$ ; we also abuse notation slightly by writing ${X \equiv \mu_X}$ .

We have seen that every random variable generates a probability distribution ${\mu_X}$ . The converse is also true:

Lemma 4 (Creating a random variable with a specified distribution) Let ${\mu}$ be a probability measure on a measurable space ${R = (R,{\mathcal R})}$ . Then (after extending the sample space ${\Omega}$ if necessary) there exists an ${R}$ -valued random variable ${X}$ with distribution ${\mu}$ .

Proof: Extend ${\Omega}$ to ${\Omega \times R}$ by using the obvious projection map ${(\omega,r) \mapsto \omega}$ from ${\Omega \times R}$ back to ${\Omega}$ , and extending the probability measure ${{\bf P}}$ on ${\Omega}$ to the product measure ${{\bf P} \times \mu}$ on ${\Omega \times R}$ . The random variable ${X(\omega,r) := r}$ then has distribution ${\mu}$ . $\Box$
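
The construction in this proof is easy to carry out by hand in the finite setting. Here is a minimal sketch, assuming a single fair die as the existing sample space and an arbitrary illustrative target measure `mu` on a three-element set; the helper `law` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Existing sample space: one fair die, as a dict outcome -> probability.
P = {w: Fraction(1, 6) for w in range(1, 7)}

# Target distribution mu on R = {"a", "b", "c"} (an arbitrary illustrative choice).
mu = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

# Extended space Omega x R with the product measure P x mu.
P_ext = {(w, r): P[w] * mu[r] for w, r in product(P, mu)}

X = lambda wr: wr[1]   # the new random variable X(w, r) := r
pi = lambda wr: wr[0]  # projection back to the original sample space

def law(Y, measure):
    """Distribution of Y under a finite measure given as a dict point -> mass."""
    out = {}
    for point, p in measure.items():
        out[Y(point)] = out.get(Y(point), 0) + p
    return out

assert law(X, P_ext) == mu  # X has distribution mu
assert law(pi, P_ext) == P  # pi is probability preserving
```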

In the case of discrete random variables, ${\mu_X}$ is the discrete probability measure

$\displaystyle \mu_X(S) = \sum_{x \in S} p_x \ \ \ \ \ (3)$

where ${p_x := {\bf P}(X=x)}$ are non-negative real numbers that add up to ${1}$ . To put it another way, the distribution of a discrete random variable can be expressed as the sum of Dirac masses:

$\displaystyle \mu_X = \sum_{x \in R} p_x \delta_x. \ \ \ \ \ (4)$

We list some important examples of discrete distributions:

Dirac distributions ${\delta_{x_0}}$ , in which ${p_x = 1}$ for ${x=x_0}$ and ${p_x=0}$ otherwise;
discrete uniform distributions, in which ${R}$ is finite and ${p_x = 1/|R|}$ for all ${x \in R}$ ;
(Unsigned) Bernoulli distributions, in which ${R=\{0,1\}}$ , ${p_1=p}$ , and ${p_0=1-p}$ for some parameter ${0 \leq p \leq 1}$ ;
The signed Bernoulli distribution, in which ${R = \{-1,+1\}}$ and ${p_{+1}=p_{-1} = 1/2}$ ;
Lazy signed Bernoulli distributions, in which ${R = \{-1,0,+1\}}$ , ${p_{+1}=p_{-1}=\mu/2}$ , and ${p_0 = 1-\mu}$ for some parameter ${0 \leq \mu \leq 1}$ ;
Geometric distributions, in which ${R = \{0,1,2,\ldots\}}$ and ${p_k = (1-p)^k p}$ for all natural numbers ${k}$ and some parameter ${0 < p \leq 1}$ ; and
Poisson distributions, in which ${R = \{0,1,2,\ldots\}}$ and ${p_k = \frac{\lambda^k e^{-\lambda}}{k!}}$ for all natural numbers ${k}$ and some parameter ${\lambda > 0}$ .
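
As a numerical sanity check that the weights in some of these examples do add up to ${1}$ , one can sum them directly. This is only a sketch: the parameter values and the truncation point `N` for the infinite sums are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math

def geometric_pmf(k, p):
    """p_k = (1-p)^k p on {0, 1, 2, ...}."""
    return (1 - p) ** k * p

def poisson_pmf(k, lam):
    """p_k = lam^k e^{-lam} / k! on {0, 1, 2, ...}."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def lazy_signed_bernoulli_pmf(x, mu):
    """p_{+1} = p_{-1} = mu/2 and p_0 = 1 - mu on {-1, 0, +1}."""
    return {-1: mu / 2, 0: 1 - mu, 1: mu / 2}[x]

N = 100  # truncation point for the infinite sums; the neglected tails are tiny here
geo_total = sum(geometric_pmf(k, 0.3) for k in range(N))
poi_total = sum(poisson_pmf(k, 4.0) for k in range(N))
lazy_total = sum(lazy_signed_bernoulli_pmf(x, 0.25) for x in (-1, 0, 1))

print(abs(geo_total - 1) < 1e-12, abs(poi_total - 1) < 1e-12, lazy_total == 1.0)
```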

Now we turn to non-discrete random variables ${X}$ taking values in some range ${R}$ . We say that a random variable is continuous if ${{\bf P}(X=x)=0}$ for all ${x \in R}$ (here we assume that all points are measurable). If ${R}$ is already equipped with some reference measure ${dm}$ (e.g. Lebesgue measure in the case of scalar, vector, or matrix-valued random variables), we say that the random variable is absolutely continuous if ${{\bf P}(X \in S)=0}$ for all null sets ${S}$ in ${R}$ . By the Radon-Nikodym theorem, we can thus find a non-negative, absolutely integrable function ${f \in L^1(R, dm)}$ with ${\int_R f\ dm = 1}$ such that

$\displaystyle \mu_X(S) = \int_S f\ dm \ \ \ \ \ (5)$

for all measurable sets ${S \subset R}$ . More succinctly, one has

$\displaystyle d\mu_X = f\ dm. \ \ \ \ \ (6)$

We call ${f}$ the probability density function of the probability distribution ${\mu_X}$ (and thus, of the random variable ${X}$ ). As usual in measure theory, this function is only defined up to almost everywhere equivalence, but this will not cause any difficulties.

In the case of real-valued random variables ${X}$ , the distribution ${\mu_X}$ can also be described in terms of the cumulative distribution function

$\displaystyle F_X(x) := {\bf P}(X \leq x) = \mu_X( (-\infty,x] ). \ \ \ \ \ (7)$

Indeed, ${\mu_X}$ is the Lebesgue-Stieltjes measure of ${F_X}$ , and (in the absolutely continuous case) the derivative of ${F_X}$ exists and is equal to the probability density function almost everywhere. We will not use the cumulative distribution function much in this course, although we will be very interested in bounding tail events such as ${{\bf P}(X > \lambda)}$ or ${{\bf P}(X < \lambda)}$ .

We give some basic examples of absolutely continuous scalar distributions:

uniform distributions, in which ${f := \frac{1}{m(I)} 1_I}$ for some subset ${I}$ of the reals or complexes of finite non-zero measure, e.g. an interval ${[a,b]}$ in the real line, or a disk in the complex plane.
The real normal distribution ${N(\mu,\sigma^2) = N(\mu,\sigma^2)_{\bf R}}$ of mean ${\mu \in {\bf R}}$ and variance ${\sigma^2 > 0}$ , given by the density function ${f(x) := \frac{1}{\sqrt{2\pi \sigma^2}} \exp( - (x-\mu)^2/2\sigma^2 )}$ for ${x \in {\bf R}}$ . We isolate in particular the standard (real) normal distribution ${N(0,1)}$ . Random variables with normal distributions are known as gaussian random variables.
The complex normal distribution ${N(\mu,\sigma^2)_{\bf C}}$ of mean ${\mu \in {\bf C}}$ and variance ${\sigma^2 > 0}$ , given by the density function ${f(z) := \frac{1}{\pi \sigma^2} \exp( - |z-\mu|^2/\sigma^2 )}$ . Again, we isolate the standard complex normal distribution ${N(0,1)_{\bf C}}$ .
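
One can confirm numerically that the real normal density above integrates to ${1}$ and has the stated mean and variance. The sketch below uses a plain midpoint rule on a wide truncated interval; the specific parameters, the truncation at twenty standard deviations, and the helper names are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of the real normal distribution N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def midpoint_integral(g, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

mu, sigma2 = 1.5, 4.0
a, b = mu - 40.0, mu + 40.0  # twenty standard deviations either side

total = midpoint_integral(lambda x: normal_pdf(x, mu, sigma2), a, b)
mean = midpoint_integral(lambda x: x * normal_pdf(x, mu, sigma2), a, b)
var = midpoint_integral(lambda x: (x - mu) ** 2 * normal_pdf(x, mu, sigma2), a, b)

print(round(total, 6), round(mean, 6), round(var, 6))
```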

Later on, we will encounter several more scalar distributions of relevance to random matrix theory, such as the semicircular law or Marcenko-Pastur law. We will also of course encounter many matrix distributions (also known as matrix ensembles) as well as point processes.

Given an unsigned random variable ${X}$ (i.e. a random variable taking values in ${[0,+\infty]}$ ), one can define the expectation or mean ${{\bf E} X}$ as the unsigned integral

$\displaystyle {\bf E} X := \int_0^\infty x\ d\mu_X(x), \ \ \ \ \ (8)$

which by the Fubini-Tonelli theorem can also be rewritten as

$\displaystyle {\bf E} X = \int_0^\infty {\bf P}(X \geq \lambda)\ d\lambda. \ \ \ \ \ (9)$
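
For an integer-valued unsigned variable, the function ${\lambda \mapsto {\bf P}(X \geq \lambda)}$ is constant on each interval ${(k-1,k]}$ , so (9) reduces to the sum of ${{\bf P}(X \geq k)}$ over ${k=1,2,\ldots}$ . The following sketch checks this against (8) for the sum of two dice; the helper `prob` is a name introduced for illustration.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
X = lambda w: w[0] + w[1]  # an unsigned, integer-valued random variable

def prob(predicate):
    """Exact probability under the uniform measure on two dice."""
    return Fraction(sum(1 for w in omega if predicate(w)), len(omega))

# Definition (8): integrate x against the distribution of X.
mean_direct = sum(x * prob(lambda w: X(w) == x) for x in range(2, 13))

# Formula (9), specialised to integer values: sum the upper tail probabilities.
mean_tail = sum(prob(lambda w: X(w) >= k) for k in range(1, 13))

print(mean_direct, mean_tail)  # 7 7
```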

The expectation of an unsigned variable also lies in ${[0,+\infty]}$ . If ${X}$ is a scalar random variable (which is allowed to take the value ${\infty}$ ) for which ${{\bf E}|X| < \infty}$ , we say that ${X}$ is absolutely integrable, in which case we can define its expectation as

$\displaystyle {\bf E} X := \int_{\bf R} x\ d\mu_X(x) \ \ \ \ \ (10)$

in the real case, or

$\displaystyle {\bf E} X := \int_{\bf C} z\ d\mu_X(z) \ \ \ \ \ (11)$

in the complex case. Similarly for vector-valued random variables (note that in finite dimensions, all norms are equivalent, so the precise choice of norm used to define ${|X|}$ is not relevant here). If ${X = (X_1,\ldots,X_n)}$ is a vector-valued random variable, then ${X}$ is absolutely integrable if and only if the components ${X_i}$ are all absolutely integrable, in which case one has ${{\bf E} X = ({\bf E} X_1, \ldots, {\bf E} X_n)}$ .

A deterministic scalar random variable ${c}$ is its own mean. An indicator function ${{\bf I}(E)}$ has mean ${\mathop{\bf P}(E)}$ . An unsigned Bernoulli variable (as defined previously) has mean ${p}$ , while a signed or lazy signed Bernoulli variable has mean ${0}$ . A real or complex gaussian variable with distribution ${N(\mu,\sigma^2)}$ has mean ${\mu}$ . A Poisson random variable has mean ${\lambda}$ ; a geometric random variable has mean ${p}$ . A uniformly distributed variable on an interval ${[a,b] \subset {\bf R}}$ has mean ${\frac{a+b}{2}}$ .

A fundamentally important property of expectation is that it is linear: if ${X_1,\ldots,X_k}$ are absolutely integrable scalar random variables and ${c_1,\ldots,c_k}$ are finite scalars, then ${c_1 X_1 + \ldots + c_k X_k}$ is also absolutely integrable and

$\displaystyle {\bf E} c_1 X_1 + \ldots + c_k X_k = c_1 {\bf E} X_1 + \ldots + c_k {\bf E} X_k. \ \ \ \ \ (12)$

By the Fubini-Tonelli theorem, the same result also applies to infinite sums ${\sum_{i=1}^\infty c_i X_i}$ provided that ${\sum_{i=1}^\infty |c_i| {\bf E}|X_i|}$ is finite.

We will use linearity of expectation so frequently in the sequel that we will often omit an explicit reference to it when it is being used. It is important to note that linearity of expectation requires no assumptions of independence or dependence amongst the individual random variables ${X_i}$ ; this is what makes this property of expectation so powerful.
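
To illustrate that linearity needs no independence, the following sketch checks (12) for two variables that are as dependent as possible, one being a function of the other. The die-roll sample space and the names `omega`, `expect`, `X`, `Y` are hypothetical choices made for this example only.

```python
from fractions import Fraction

# Illustrative sample space (an assumption): outcomes w of a fair six-sided die.
omega = range(1, 7)
prob = Fraction(1, 6)

def expect(f):
    """Expectation of the random variable w -> f(w) under the uniform measure."""
    return sum(f(w) * prob for w in omega)

X = lambda w: w        # the die roll itself
Y = lambda w: w * w    # determined by X, hence certainly not independent of X

# Linearity (12) with c_1 = 2, c_2 = -3.
lhs = expect(lambda w: 2 * X(w) - 3 * Y(w))
rhs = 2 * expect(X) - 3 * expect(Y)

print(lhs, rhs)
```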

In the unsigned (or real absolutely integrable) case, expectation is also monotone: if ${X \leq Y}$ is true for some unsigned or real absolutely integrable ${X, Y}$ , then ${{\bf E} X \leq {\bf E} Y}$ . Again, we will usually use this basic property without explicitly mentioning it in the sequel.

For an unsigned random variable, we have the obvious but very useful Markov inequality

$\displaystyle {\bf P}(X \geq \lambda) \leq \frac{1}{\lambda} {\bf E} X \ \ \ \ \ (13)$

for any ${\lambda > 0}$ , as can be seen by taking expectations of the inequality ${\lambda {\bf I}(X \geq \lambda) \leq X}$ . For signed random variables, Markov’s inequality becomes

$\displaystyle {\bf P}(|X| \geq \lambda) \leq \frac{1}{\lambda} {\bf E} |X| \ \ \ \ \ (14)$

Another fact related to Markov’s inequality is that if ${X}$ is an unsigned or real absolutely integrable random variable, then ${X \geq {\bf E} X}$ must hold with positive probability, and ${X \leq {\bf E} X}$ must also hold with positive probability. Use of these facts or (13), (14), combined with monotonicity and linearity of expectation, is collectively referred to as the first moment method. This method tends to be particularly easy to use (as one does not need to understand dependence or independence), but by the same token often gives sub-optimal results (as one is not exploiting any independence in the system).
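
The following sketch checks Markov's inequality (13), and the two positive-probability facts, on a small example. The fair die and the names `dist`, `mean`, `tail` are assumptions made purely for illustration.

```python
from fractions import Fraction

# Illustrative unsigned random variable (an assumption): a fair six-sided die.
dist = {k: Fraction(1, 6) for k in range(1, 7)}
mean = sum(x * p for x, p in dist.items())

def tail(lam):
    """P(X >= lam), computed exactly."""
    return sum(p for x, p in dist.items() if x >= lam)

# Markov's inequality (13): P(X >= lam) <= E X / lam for each lam > 0 tested.
markov_ok = all(tail(lam) <= mean / lam for lam in range(1, 8))

# X >= E X and X <= E X each hold with positive probability.
p_above = tail(mean)
p_below = sum(p for x, p in dist.items() if x <= mean)

print(markov_ok, p_above, p_below)
```

Note how weak the bound is for small ${\lambda}$ : it only carries information once ${\lambda}$ exceeds the mean.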

Exercise 1 (Borel-Cantelli lemma) Let ${E_1,E_2,\ldots}$ be a sequence of events such that ${\sum_i {\bf P}(E_i) < \infty}$ . Show that almost surely, at most finitely many of the events ${E_i}$ occur at once. State and prove a result to the effect that the condition ${\sum_i {\bf P}(E_i) < \infty}$ cannot be weakened.

If ${X}$ is an absolutely integrable or unsigned scalar random variable, and ${F}$ is a measurable function from the scalars to the unsigned extended reals ${[0,+\infty]}$ , then one has the change of variables formula

$\displaystyle {\bf E} F(X) = \int_{\bf R} F(x)\ d\mu_X(x) \ \ \ \ \ (15)$

when ${X}$ is real-valued and

$\displaystyle {\bf E} F(X) = \int_{\bf C} F(z)\ d\mu_X(z) \ \ \ \ \ (16)$

when ${X}$ is complex-valued. The same formula applies to signed or complex ${F}$ if it is known that ${|F(X)|}$ is absolutely integrable. Important examples of expressions such as ${{\bf E} F(X)}$ are moments

$\displaystyle {\bf E} |X|^k \ \ \ \ \ (17)$

for various ${k \geq 1}$ (particularly ${k=1,2,4}$ ), exponential moments

$\displaystyle {\bf E} e^{tX} \ \ \ \ \ (18)$

for real ${t}$ , ${X}$ , and Fourier moments (or the characteristic function)

$\displaystyle {\bf E} e^{itX} \ \ \ \ \ (19)$

for real ${t, X}$ , or

$\displaystyle {\bf E} e^{it\cdot X} \ \ \ \ \ (20)$

for complex or vector-valued ${t, X}$ , where ${\cdot}$ denotes a real inner product. We shall also occasionally encounter the resolvents

$\displaystyle {\bf E} \frac{1}{X-z} \ \ \ \ \ (21)$

for complex ${z}$ , though one has to be careful now with the absolute convergence of this random variable. Similarly, we shall also occasionally encounter negative moments ${{\bf E} |X|^{-k}}$ of ${X}$ , particularly for ${k=2}$ . We also sometimes use the zeroth moment ${{\bf E} |X|^0 = {\bf P}(X \neq 0)}$ , where we take the somewhat unusual convention that ${x^0 := \lim_{k \rightarrow 0^+} x^k}$ for non-negative ${x}$ , thus ${x^0 := 1}$ for ${x>0}$ and ${0^0 := 0}$ . Thus, for instance, the union bound (1) can be rewritten (for finitely many ${i}$ , at least) as

$\displaystyle {\bf E} |\sum_i c_i X_i|^0 \leq \sum_i |c_i|^0 {\bf E} |X_i|^0 \ \ \ \ \ (22)$

for any scalar random variables ${X_i}$ and scalars ${c_i}$ (compare with (12)).

It will be important to know if a scalar random variable ${X}$ is “usually bounded”. We have several ways of quantifying this, in decreasing order of strength:

${X}$ is surely bounded if there exists an ${M>0}$ such that ${|X| \leq M}$ surely.
${X}$ is almost surely bounded if there exists an ${M>0}$ such that ${|X| \leq M}$ almost surely.
${X}$ is subgaussian if there exist ${C, c > 0}$ such that ${{\bf P}(|X| \geq \lambda) \leq C \exp( - c \lambda^2 )}$ for all ${\lambda > 0}$ .
${X}$ has sub-exponential tail if there exist ${C, c, a > 0}$ such that ${{\bf P}(|X| \geq \lambda) \leq C \exp( - c \lambda^a )}$ for all ${\lambda > 0}$ .
${X}$ has finite ${k^{th}}$ moment for some ${k \geq 1}$ if there exists ${C}$ such that ${{\bf E}|X|^k \leq C}$ .
${X}$ is absolutely integrable if ${{\bf E}|X| < \infty}$ .
${X}$ is almost surely finite if ${|X| < \infty}$ almost surely.

Exercise 2 Show that these properties genuinely are in decreasing order of strength, i.e. that each property on the list implies the next.

Exercise 3 Show that each of these properties is closed under vector space operations, thus for instance if ${X, Y}$ have sub-exponential tail, show that ${X+Y}$ and ${cX}$ also have sub-exponential tail for any scalar ${c}$ .

The various species of Bernoulli random variable are surely bounded, and any random variable which is uniformly distributed in a bounded set is almost surely bounded. Gaussians are of course subgaussian, while the Poisson and geometric distributions merely have sub-exponential tail. Cauchy distributions are typical examples of heavy-tailed distributions which are almost surely finite, but do not have all moments finite (indeed, the Cauchy distribution does not even have finite first moment).

If we have a family of scalar random variables ${X_\alpha}$ depending on a parameter ${\alpha}$ , we say that the ${X_\alpha}$ are uniformly surely bounded (resp. uniformly almost surely bounded, uniformly subgaussian, have uniform sub-exponential tails, or uniformly bounded ${k^{th}}$ moment) if the relevant parameters ${M, C, c, a}$ in the above definitions can be chosen to be independent of ${\alpha}$ .

Fix ${k \geq 1}$ . If ${X}$ has finite ${k^{th}}$ moment, say ${{\bf E}|X|^k \leq C}$ , then from Markov’s inequality (14) one has

$\displaystyle {\bf P}(|X| \geq \lambda) \leq C \lambda^{-k}, \ \ \ \ \ (23)$

thus we see that the higher the moments that we control, the faster the tail decay is. From the dominated convergence theorem we also have the variant

$\displaystyle \lim_{\lambda \rightarrow \infty} \lambda^k {\bf P}(|X| \geq \lambda) = 0. \ \ \ \ \ (24)$

However, this result is qualitative or ineffective rather than quantitative because it provides no rate of convergence of ${\lambda^k {\bf P}(|X| \geq \lambda)}$ to zero. Indeed, it is easy to construct a family ${X_\alpha}$ of random variables of uniformly bounded ${k^{th}}$ moment, but for which the quantities ${\lambda^k {\bf P}(|X_\alpha| \geq \lambda)}$ do not converge uniformly to zero (e.g. take ${X_m}$ to be ${m}$ times the indicator of an event of probability ${m^{-k}}$ for ${m=1,2,\ldots}$ ). Because of this issue, we will often have to strengthen the property of having a uniformly bounded moment, to that of obtaining a uniformly quantitative control on the decay in (24) for a family ${X_\alpha}$ of random variables; we will see examples of this in later lectures. However, this technicality does not arise in the important model case of identically distributed random variables, since in this case we trivially have uniformity in the decay rate of (24).
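
The counterexample just described can be verified by direct computation. In the sketch below, the choice ${k=2}$ and the function names `kth_moment` and `scaled_tail` are hypothetical; the variable ${X_m}$ is ${m}$ times the indicator of an event of probability ${m^{-k}}$ , as in the text.

```python
from fractions import Fraction

k = 2  # illustrative choice of moment (an assumption)

def kth_moment(m):
    """E|X_m|^k, where X_m = m * indicator(event of probability m^{-k})."""
    return Fraction(m) ** k * Fraction(1, m ** k)

def scaled_tail(m, lam):
    """lam^k * P(|X_m| >= lam) for lam > 0; X_m only takes the values 0 and m."""
    p = Fraction(1, m ** k) if lam <= m else Fraction(0)
    return Fraction(lam) ** k * p

# The k-th moments are uniformly bounded (in fact all equal to 1) ...
moments = [kth_moment(m) for m in range(1, 50)]

# ... and for each fixed m the quantity in (24) vanishes once lam > m ...
vanishes_for_fixed_m = scaled_tail(5, 6) == 0

# ... but for every lam some member of the family (namely m = lam) still
# gives the value 1, so the convergence in (24) is not uniform in m.
diagonal = [scaled_tail(lam, lam) for lam in range(1, 50)]

print(set(moments), vanishes_for_fixed_m, set(diagonal))
```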

We observe some consequences of (23):

Lemma 5 Let ${X = X_n}$ be a scalar random variable depending on a parameter ${n}$ .

If ${|X_n|}$ has uniformly bounded expectation, then for any ${\epsilon > 0}$ independent of ${n}$ , we have ${|X_n| = O(n^\epsilon)}$ with high probability.

If ${X_n}$ has uniformly bounded ${k^{th}}$ moment, then for any ${A > 0}$ , we have ${|X_n| = O(n^{A/k})}$ with probability ${1 - O( n^{-A} )}$ .

If ${X_n}$ has uniform sub-exponential tails, then we have ${|X_n| = O(\log^{O(1)} n)}$ with overwhelming probability.
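
For instance, the second claim is a direct substitution into (23): if ${{\bf E}|X_n|^k \leq C}$ with ${C}$ independent of ${n}$ , then setting ${\lambda := n^{A/k}}$ gives

$\displaystyle {\bf P}(|X_n| \geq n^{A/k}) \leq C (n^{A/k})^{-k} = C n^{-A},$

so that ${|X_n| < n^{A/k}}$ outside of an event of probability ${O(n^{-A})}$ . The first claim is the special case ${k=1}$ , ${A=\epsilon}$ of this computation (here we are assuming the usual convention that "with high probability" means with probability ${1-O(n^{-c})}$ for some ${c>0}$ independent of ${n}$ ).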

Exercise 4 Show that a real-valued random variable ${X}$ is subgaussian if and only if there exist ${C>0}$ such that ${{\bf E} e^{t X} \leq C\exp( C t^2 )}$ for all real ${t}$ , and if and only if there exists ${C > 0}$ such that ${{\bf E} |X|^k \leq (Ck)^{k/2}}$ for all ${k \geq 1}$ .

Exercise 5 Show that a real-valued random variable ${X}$ has subexponential tails if and only if there exist ${C>0}$ such that ${{\bf E} |X|^k \leq \exp( C k^C )}$ for all positive integers ${k}$ .

Once the second moment of a scalar random variable is finite, one can define the variance

$\displaystyle {\bf Var}(X) := {\bf E} |X-{\bf E}(X)|^2. \ \ \ \ \ (25)$

From Markov’s inequality we thus have Chebyshev’s inequality

$\displaystyle {\bf P}(|X - {\bf E}(X)| \geq \lambda) \leq \frac{{\bf Var}(X)}{\lambda^2}. \ \ \ \ \ (26)$

Upper bounds on ${{\bf P}(|X - {\bf E}(X)| \geq \lambda)}$ for ${\lambda}$ large are known as large deviation inequalities. Chebyshev’s inequality gives a simple but still useful large deviation inequality, which becomes non-trivial once ${\lambda}$ exceeds the standard deviation ${{\bf Var}(X)^{1/2}}$ of the random variable. The use of Chebyshev’s inequality, combined with a computation of variances, is known as the second moment method.
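
Here is a small numerical check of Chebyshev's inequality (26), which follows from applying (13) to the unsigned variable ${|X - {\bf E}(X)|^2}$ with threshold ${\lambda^2}$ . The fair die and the names `dist`, `mean`, `var`, `deviation_prob` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Illustrative random variable (an assumption): a fair six-sided die.
dist = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(x * p for x, p in dist.items())
var = sum((x - mean) ** 2 * p for x, p in dist.items())  # definition (25)

def deviation_prob(lam):
    """P(|X - E X| >= lam), computed exactly."""
    return sum(p for x, p in dist.items() if abs(x - mean) >= lam)

# Chebyshev's inequality (26) at a few thresholds; the bound var / lam^2
# is below 1 only when lam exceeds the standard deviation.
thresholds = [1, 2, Fraction(5, 2), 3]
chebyshev_ok = all(deviation_prob(lam) <= var / lam ** 2 for lam in thresholds)

print(mean, var, chebyshev_ok)
```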

Exercise 6 (Scaling of mean and variance) If ${X}$ is a scalar random variable of finite mean and variance, and ${a, b}$ are scalars, show that ${{\bf E}(a+bX) = a + b {\bf E}(X)}$ and ${{\bf Var}(a+bX) = |b|^2 {\bf Var}(X)}$ . In particular, if ${X}$ has non-zero variance, then there exist scalars ${a,b}$ such that ${a+bX}$ has mean zero and variance one.

Exercise 7 We say that a real number ${{\bf M}(X)}$ is a median of a real-valued random variable ${X}$ if ${{\bf P}(X > {\bf M}(X)), {\bf P}(X < {\bf M}(X)) \leq 1/2}$ .

Show that a median always exists, and if ${X}$ is absolutely continuous with strictly positive density function, then the median is unique.

If ${X}$ has finite second moment, show that ${{\bf M}(X) = {\bf E}(X) + O( {\bf Var}(X)^{1/2} )}$ for any median ${{\bf M}(X)}$ .

If ${X}$ is subgaussian (or has sub-exponential tails with exponent ${a>1}$ ), then from dominated convergence we have the Taylor expansion

$\displaystyle {\bf E} e^{tX} =1+ \sum_{k=1}^\infty \frac{t^k}{k!} {\bf E} X^k \ \ \ \ \ (27)$

for any real or complex ${t}$ , thus relating the exponential and Fourier moments with the ${k^{th}}$ moments.

— 3. Independence —

When studying the behaviour of a single random variable ${X}$ , the distribution ${\mu_X}$ captures all the probabilistic information one wants to know about ${X}$ . The following exercise is one way of making this statement rigorous:

Exercise 8 Let ${X}$ , ${X'}$ be random variables (on sample spaces ${\Omega,\Omega'}$ respectively) taking values in a range ${R}$ , such that ${X \equiv X'}$ . Show that after extending the spaces ${\Omega, \Omega'}$ , the two random variables ${X, X'}$ are isomorphic, in the sense that there exists a probability space isomorphism ${\pi: \Omega \rightarrow \Omega'}$ (i.e. an invertible extension map whose inverse is also an extension map) such that ${X = X' \circ \pi}$ .

However, once one studies families ${(X_\alpha)_{\alpha \in A}}$ of random variables ${X_\alpha}$ taking values in measurable spaces ${R_\alpha}$ (on a single sample space ${\Omega}$ ), the distributions of the individual variables ${X_\alpha}$ are no longer sufficient to describe all the probabilistic statistics of interest; the joint distribution of the variables (i.e. the distribution of the tuple ${(X_\alpha)_{\alpha \in A}}$ , which can be viewed as a single random variable taking values in the product measurable space ${\prod_{\alpha \in A} R_\alpha}$ ) also becomes relevant.

Example 2 Let ${(X_1, X_2)}$ be drawn uniformly at random from the set ${\{ (-1,-1), (-1,+1), (+1,-1), (+1,+1) \}}$ . Then the random variables ${X_1}$ , ${X_2}$ , and ${-X_1}$ all individually have the same distribution, namely the signed Bernoulli distribution. However the pairs ${(X_1,X_2)}$ , ${(X_1,X_1)}$ , and ${(X_1,-X_1)}$ all have different joint distributions: the first pair, by definition, is uniformly distributed in ${\{ (-1,-1), (-1,+1), (+1,-1), (+1,+1) \}}$ , while the second pair is uniformly distributed in ${\{(-1,-1),(+1,+1)\}}$ , and the third pair is uniformly distributed in ${\{ (-1,+1), (+1,-1)\}}$ . Thus, for instance, if one is told that ${X, Y}$ are two random variables with the Bernoulli distribution, and asked to compute the probability that ${X=Y}$ , there is insufficient information to solve the problem; if ${(X,Y)}$ were distributed as ${(X_1,X_2)}$ , then the probability would be ${1/2}$ , while if ${(X,Y)}$ were distributed as ${(X_1,X_1)}$ , the probability would be ${1}$ , and if ${(X,Y)}$ were distributed as ${(X_1,-X_1)}$ , the probability would be ${0}$ . Thus one sees that one needs the joint distribution, and not just the individual distributions, to obtain a unique answer to the question.
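
Example 2 is small enough to enumerate explicitly. The sketch below (with hypothetical helper names `law` and `agree`) confirms that the three variables share one distribution while the three pairs give three different answers for the probability of agreement.

```python
from fractions import Fraction
from itertools import product

# Sample space of Example 2: (X1, X2) drawn uniformly from {-1,+1}^2.
omega = list(product([-1, +1], repeat=2))
prob = Fraction(1, len(omega))

def law(f):
    """Distribution of the random variable w -> f(w), as {value: probability}."""
    out = {}
    for w in omega:
        out[f(w)] = out.get(f(w), 0) + prob
    return out

X1 = lambda w: w[0]
X2 = lambda w: w[1]
negX1 = lambda w: -w[0]

# The individual distributions coincide (signed Bernoulli) ...
same_marginals = law(X1) == law(X2) == law(negX1)

# ... but the probability that the two components agree depends on the pair.
pairs = {"(X1,X2)": (X1, X2), "(X1,X1)": (X1, X1), "(X1,-X1)": (X1, negX1)}
agree = {name: sum(prob for w in omega if a(w) == b(w))
         for name, (a, b) in pairs.items()}

print(same_marginals, agree)
```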

There is however an important special class of families of random variables in which the joint distribution is determined by the individual distributions.

Definition 6 (Joint independence) A family ${(X_\alpha)_{\alpha \in A}}$ of random variables (which may be finite, countably infinite, or uncountably infinite) is said to be jointly independent if the distribution of ${(X_\alpha)_{\alpha \in A}}$ is the product measure of the distributions of the individual ${X_\alpha}$ .

A family ${(X_\alpha)_{\alpha \in A}}$ is said to be pairwise independent if the pairs ${(X_\alpha,X_\beta)}$ are jointly independent for all distinct ${\alpha,\beta \in A}$ . More generally, ${(X_\alpha)_{\alpha \in A}}$ is said to be ${k}$ -wise independent if ${(X_{\alpha_1},\ldots,X_{\alpha_{k'}})}$ are jointly independent for all ${1 \leq k' \leq k}$ and all distinct ${\alpha_1,\ldots,\alpha_{k'} \in A}$ .

We also say that ${X}$ is independent of ${Y}$ if ${(X,Y)}$ are jointly independent.

A family of events ${(E_\alpha)_{\alpha \in A}}$ is said to be jointly independent if their indicators ${({\bf I}(E_\alpha))_{\alpha \in A}}$ are jointly independent. Similarly for pairwise independence and ${k}$ -wise independence.

From the theory of product measure, we have the following equivalent formulation of joint independence:

Exercise 9 Let ${(X_\alpha)_{\alpha \in A}}$ be a family of random variables, with each ${X_\alpha}$ taking values in a measurable space ${R_\alpha}$ .

Show that the ${(X_\alpha)_{\alpha \in A}}$ are jointly independent if and only if for every collection of distinct elements ${\alpha_1,\ldots,\alpha_{k'}}$ of ${A}$ , and all measurable subsets ${E_i \subset R_{\alpha_i}}$ for ${1 \leq i \leq k'}$ , one has
$\displaystyle \mathop{\bf P}( X_{\alpha_i} \in E_i \hbox{ for all } 1 \leq i \leq k' ) = \prod_{i=1}^{k'} \mathop{\bf P}( X_{\alpha_i} \in E_i ).$

Show that the necessary and sufficient condition for ${(X_\alpha)_{\alpha \in A}}$ to be ${k}$ -wise independent is the same, except that ${k'}$ is constrained to be at most ${k}$ .

In particular, a finite family ${(X_1,\ldots,X_k)}$ of random variables ${X_i}$ , ${1 \leq i \leq k}$ taking values in measurable spaces ${R_i}$ is jointly independent if and only if

$\displaystyle \mathop{\bf P}( X_i \in E_i \hbox{ for all } 1 \leq i \leq k ) = \prod_{i=1}^k \mathop{\bf P}( X_i \in E_i )$

for all measurable ${E_i \subset R_i}$ .

If the ${X_\alpha}$ are discrete random variables, one can take the ${E_i}$ to be singleton sets in the above discussion.

From the above exercise we see that joint independence implies ${k}$ -wise independence for any ${k}$ , and that joint independence is preserved under permuting, relabeling, or eliminating some or all of the ${X_\alpha}$ . A single random variable is automatically jointly independent, and so ${1}$ -wise independence is vacuously true; pairwise independence is the first nontrivial notion of independence in this hierarchy.

Example 3 Let ${{\bf F}_2}$ be the field of two elements, let ${V \subset {\bf F}_2^3}$ be the subspace of triples ${(x_1,x_2,x_3) \in {\bf F}_2^3}$ with ${x_1+x_2+x_3=0}$ , and let ${(X_1,X_2,X_3)}$ be drawn uniformly at random from ${V}$ . Then ${(X_1,X_2,X_3)}$ are pairwise independent, but not jointly independent. In particular, ${X_3}$ is independent of each of ${X_1,X_2}$ separately, but is not independent of ${(X_1,X_2)}$ .
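
The claims in Example 3 can be checked by brute force, using the product formula from Exercise 9 restricted to singleton events (which suffices for discrete variables). The helper names `P` and `independent` below are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

# V = triples in F_2^3 with x1 + x2 + x3 = 0, carrying the uniform distribution.
V = [x for x in product([0, 1], repeat=3) if sum(x) % 2 == 0]
prob = Fraction(1, len(V))

def P(event):
    """Probability of an event, given as a predicate on points of V."""
    return sum(prob for x in V if event(x))

def independent(indices):
    """Test the product formula for the chosen coordinates on all singleton events."""
    for values in product([0, 1], repeat=len(indices)):
        joint = P(lambda x: all(x[i] == v for i, v in zip(indices, values)))
        expected = Fraction(1)
        for i, v in zip(indices, values):
            expected *= P(lambda x, i=i, v=v: x[i] == v)
        if joint != expected:
            return False
    return True

pairwise = all(independent(pair) for pair in combinations(range(3), 2))
jointly = independent((0, 1, 2))

print(pairwise, jointly)
```

The joint test fails already at the point ${(0,0,0)}$ , which has probability ${1/4}$ under the uniform distribution on ${V}$ rather than the ${1/8}$ that the product measure would assign.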

Exercise 10 This exercise generalises the above example. Let ${{\bf F}}$ be a finite field, and let ${V}$ be a subspace of ${{\bf F}^n}$ for some finite ${n}$ . Let ${(X_1,\ldots,X_n)}$ be drawn uniformly at random from ${V}$ . Suppose that ${V}$ is not contained in any coordinate hyperplane in ${{\bf F}^n}$ .

Show that each ${X_i}$ , ${1 \leq i \leq n}$ is uniformly distributed in ${{\bf F}}$ .

Show that for any ${k \geq 2}$ , ${(X_1,\ldots,X_n)}$ is ${k}$ -wise independent if and only if ${V}$ is not contained in any hyperplane which is definable using at most ${k}$ of the coordinate variables.

Show that ${(X_1,\ldots,X_n)}$ is jointly independent if and only if ${V = {\bf F}^n}$ .

Informally, we thus see that imposing constraints between ${k}$ variables at a time can destroy ${k}$ -wise independence, while leaving lower-order independence unaffected.

Exercise 11 Let ${V \subset {\bf F}_2^3}$ be the subspace of triples ${(x_1,x_2,x_3) \in {\bf F}_2^3}$ with ${x_1+x_2=0}$ , and let ${(X_1,X_2,X_3)}$ be drawn uniformly at random from ${V}$ . Then ${X_3}$ is independent of ${(X_1,X_2)}$ (and in particular, is independent of ${X_1}$ and ${X_2}$ separately), but ${X_1, X_2}$ are not independent of each other.

Exercise 12 We say that one random variable ${Y}$ (with values in ${R_Y}$ ) is determined by another random variable ${X}$ (with values in ${R_X}$ ) if there exists a (deterministic) function ${f: R_X \rightarrow R_Y}$ such that ${Y=f(X)}$ is surely true (i.e. ${Y(\omega) = f(X(\omega))}$ for all ${\omega \in \Omega}$ ). Show that if ${(X_\alpha)_{\alpha \in A}}$ is a family of jointly independent random variables, and ${(Y_\beta)_{\beta \in B}}$ is a family such that each ${Y_\beta}$ is determined by some subfamily ${(X_\alpha)_{\alpha \in A_\beta}}$ of the ${(X_\alpha)_{\alpha \in A}}$ , with the ${A_\beta}$ disjoint as ${\beta}$ varies, then the ${(Y_\beta)_{\beta \in B}}$ are jointly independent also.

Exercise 13 (Determinism vs. independence) Let ${X, Y}$ be random variables. Show that ${Y}$ is deterministic if and only if it is simultaneously determined by ${X}$ , and independent of ${X}$ .

Exercise 14 Show that a complex random variable ${X}$ is a complex gaussian random variable (i.e. its distribution is a complex normal distribution) if and only if its real and imaginary parts ${Re(X), Im(X)}$ are independent real gaussian random variables with the same variance. In particular, the variance of ${Re(X)}$ and ${Im(X)}$ will be half of the variance of ${X}$ .

One key advantage of working with jointly independent random variables and events is that one can compute various probabilistic quantities quite easily. We give some key examples below.

Exercise 15 If ${E_1,\ldots,E_k}$ are jointly independent events, show that

$\displaystyle {\bf P}( \bigwedge_{i=1}^k E_i ) = \prod_{i=1}^k {\bf P}(E_i) \ \ \ \ \ (28)$

and

$\displaystyle {\bf P}( \bigvee_{i=1}^k E_i ) = 1-\prod_{i=1}^k (1-{\bf P}(E_i)) \ \ \ \ \ (29)$

Show that the converse statement (i.e. that (28) and (29) imply joint independence) is true for ${k=2}$ , but fails for higher ${k}$ . Can one find a correct replacement for this converse for higher ${k}$ ?

Exercise 16

If ${X_1,\ldots,X_k}$ are jointly independent random variables taking values in ${[0,+\infty]}$ , show that
$\displaystyle \mathop{\bf E} \prod_{i=1}^k X_i = \prod_{i=1}^k \mathop{\bf E} X_i.$

If ${X_1,\ldots,X_k}$ are jointly independent absolutely integrable scalar random variables, show that ${\prod_{i=1}^k X_i}$ is absolutely integrable, and
$\displaystyle \mathop{\bf E} \prod_{i=1}^k X_i = \prod_{i=1}^k \mathop{\bf E} X_i.$

Remark 3 The above exercise combines well with Exercise 12. For instance, if ${X_1,\ldots,X_k}$ are jointly independent subgaussian variables, then from Exercises 12, 16 we see that

$\displaystyle \mathop{\bf E} \prod_{i=1}^k e^{tX_i} = \prod_{i=1}^k \mathop{\bf E} e^{tX_i} \ \ \ \ \ (30)$

for any complex ${t}$ . This identity is a key component of the exponential moment method, which we will discuss in the next set of notes.

The following result is a key component of the second moment method.

Exercise 17 (Pairwise independence implies linearity of variance) If ${X_1,\ldots,X_k}$ are pairwise independent scalar random variables of finite mean and variance, show that

$\displaystyle {\bf Var}(\sum_{i=1}^k X_i) = \sum_{i=1}^k {\bf Var}(X_i)$

and more generally

$\displaystyle {\bf Var}(\sum_{i=1}^k c_i X_i) = \sum_{i=1}^k |c_i|^2 {\bf Var}(X_i)$

for any scalars ${c_i}$ (compare with (12), (22)).
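
As a numerical illustration of Exercise 17 (not a proof), one can compare the two sides on the sample spaces of Example 3 and Exercise 11. In this sketch the coordinates are treated as the ordinary integers ${0, 1}$ rather than as elements of ${{\bf F}_2}$ , which is an assumption made so that sums and variances make sense; the function name `variance_report` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def variance_report(V):
    """Return (Var(X1+X2+X3), Var(X1)+Var(X2)+Var(X3)) for (X1,X2,X3) uniform on V."""
    prob = Fraction(1, len(V))

    def expect(f):
        return sum(f(x) * prob for x in V)

    def var(f):
        m = expect(f)
        return expect(lambda x: (f(x) - m) ** 2)

    var_of_sum = var(lambda x: x[0] + x[1] + x[2])
    sum_of_vars = sum(var(lambda x, i=i: x[i]) for i in range(3))
    return var_of_sum, sum_of_vars

cube = list(product([0, 1], repeat=3))
V_example3 = [x for x in cube if (x[0] + x[1] + x[2]) % 2 == 0]   # pairwise independent
V_exercise11 = [x for x in cube if (x[0] + x[1]) % 2 == 0]        # X1, X2 dependent

print(variance_report(V_example3))
print(variance_report(V_exercise11))
```

In the first case the two quantities agree even though the variables are not jointly independent; in the second case, where ${X_1}$ and ${X_2}$ fail to be independent, they differ.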

The product measure construction allows us to extend Lemma 4:

Exercise 18 (Creation of new, independent random variables) Let ${(X_\alpha)_{\alpha \in A}}$ be a family of random variables (not necessarily independent or finite), and let ${(\mu_\beta)_{\beta \in B}}$ be a collection (not necessarily finite) of probability measures ${\mu_\beta}$ on measurable spaces ${R_\beta}$ . Then, after extending the sample space if necessary, one can find a family ${(Y_\beta)_{\beta \in B}}$ of independent random variables, such that each ${Y_\beta}$ has distribution ${\mu_\beta}$ , and the two families ${(X_\alpha)_{\alpha \in A}}$ and ${(Y_\beta)_{\beta \in B}}$ are independent of each other.

We isolate the important case when ${\mu_\beta = \mu}$ is independent of ${\beta}$ . We say that a family ${(X_\alpha)_{\alpha \in A}}$ of random variables is independently and identically distributed, or iid for short, if they are jointly independent and all the ${X_\alpha}$ have the same distribution.

Corollary 7 Let ${(X_\alpha)_{\alpha \in A}}$ be a family of random variables (not necessarily independent or finite), let ${\mu}$ be a probability measure on a measurable space ${R}$ , and let ${B}$ be an arbitrary set. Then, after extending the sample space if necessary, one can find an iid family ${(Y_\beta)_{\beta \in B}}$ with distribution ${\mu}$ which is independent of ${(X_\alpha)_{\alpha \in A}}$ .

Thus, for instance, one can create arbitrarily large iid families of Bernoulli random variables, Gaussian random variables, etc., regardless of what other random variables are already in play. We thus see that the freedom to extend the underlying sample space allows us access to an unlimited source of randomness. This is in contrast to a situation studied in complexity theory and computer science, in which one does not assume that the sample space can be extended at will, and the amount of randomness one can use is therefore limited.

Remark 4 Given two probability measures ${\mu_X, \mu_Y}$ on two measurable spaces ${R_X, R_Y}$ , a joining or coupling of these measures is a random variable ${(X,Y)}$ taking values in the product space ${R_X \times R_Y}$ , whose individual components ${X, Y}$ have distribution ${\mu_X, \mu_Y}$ respectively. Exercise 18 shows that one can always couple two distributions together in an independent manner; but one can certainly create non-independent couplings as well. The study of couplings (or joinings) is particularly important in ergodic theory, but this will not be the focus of this course.

— 4. Conditioning —

Random variables are inherently non-deterministic in nature, and as such one has to be careful when applying deterministic laws of reasoning to such variables. For instance, consider the law of the excluded middle: a statement ${P}$ is either true or false, but not both. If this statement is a random variable, rather than deterministic, then instead it is true with some probability ${p}$ and false with some complementary probability ${1-p}$ . Also, applying set-theoretic constructions with random inputs can lead to sets, spaces, and other structures which are themselves random variables, which can be quite confusing and require a certain amount of technical care; consider, for instance, the task of rigorously defining a Euclidean space ${{\bf R}^d}$ when the dimension ${d}$ is itself a random variable.

Now, one can always eliminate these difficulties by explicitly working with points ${\omega}$ in the underlying sample space ${\Omega}$ , and replacing every random variable ${X}$ by its evaluation ${X(\omega)}$ at that point; this removes all the randomness from consideration, making everything deterministic (for fixed ${\omega}$ ). This approach is rigorous, but goes against the “probabilistic way of thinking”, as one now needs to take some care in extending the sample space.

However, if instead one only seeks to remove a partial amount of randomness from consideration, then one can do this in a manner consistent with the probabilistic way of thinking, by introducing the machinery of conditioning. By conditioning an event to be true or false, or conditioning a random variable to be fixed, one can turn that random event or variable into a deterministic one, while preserving the random nature of other events and variables (particularly those which are independent of the event or variable being conditioned upon).

We begin by considering the simpler situation of conditioning on an event.

Definition 8 (Conditioning on an event) Let ${E}$ be an event (or statement) which holds with positive probability ${{\bf P}(E)}$ . By conditioning on the event ${E}$ , we mean the act of replacing the underlying sample space ${\Omega}$ with the subset of ${\Omega}$ where ${E}$ holds, and replacing the underlying probability measure ${{\bf P}}$ by the conditional probability measure ${{\bf P}(|E)}$ , defined by the formula

$\displaystyle {\bf P}(F|E) := {\bf P}(F \wedge E) / {\bf P}(E). \ \ \ \ \ (31)$

All events ${F}$ on the original sample space can thus be viewed as events ${(F|E)}$ on the conditioned space, which we model set-theoretically as the set of all ${\omega}$ in ${E}$ obeying ${F}$ . Note that this notation is compatible with (31).

All random variables ${X}$ on the original sample space can also be viewed as random variables ${X}$ on the conditioned space, by restriction. We will refer to this conditioned random variable as ${(X|E)}$ , and thus define conditional distribution ${\mu_{(X|E)}}$ and conditional expectation ${{\bf E}(X|E)}$ (if ${X}$ is scalar) accordingly.

One can also condition on the complementary event ${\overline{E}}$ , provided that this event holds with positive probability also.

By undoing this conditioning, we revert the underlying sample space and measure back to their original (or unconditional) values. Note that any random variable which has been defined both after conditioning on ${E}$ , and conditioning on ${\overline{E}}$ , can still be viewed as a combined random variable after undoing the conditioning.

Conditioning affects the underlying probability space in a manner which is different from extension, and so the act of conditioning is not guaranteed to preserve probabilistic concepts such as distribution, probability, or expectation. Nevertheless, the conditioned versions of these concepts are closely related to their unconditional counterparts:

Exercise 19 If ${E}$ and ${\overline{E}}$ both occur with positive probability, establish the identities

$\displaystyle {\bf P}(F) = {\bf P}(F|E) {\bf P}(E) + {\bf P}(F|\overline{E}) {\bf P}(\overline{E}) \ \ \ \ \ (32)$

for any (unconditional) event ${F}$ and

$\displaystyle \mu_X = \mu_{(X|E)} {\bf P}(E) + \mu_{(X|\overline{E})} {\bf P}(\overline{E}) \ \ \ \ \ (33)$

for any (unconditional) random variable ${X}$ (in the original sample space). In a similar spirit, if ${X}$ is a non-negative or absolutely integrable scalar (unconditional) random variable, show that ${(X|E)}$ , ${(X|\overline{E})}$ are also non-negative or absolutely integrable on their respective conditioned spaces, and that

$\displaystyle {\bf E} X = {\bf E}(X|E) {\bf P}(E) + {\bf E}(X|\overline{E}) {\bf P}(\overline{E}). \ \ \ \ \ (34)$

In the degenerate case when ${E}$ occurs with full probability, conditioning to the complementary event ${\overline{E}}$ is not well defined, but show that in those cases we can still obtain the above formulae if we adopt the convention that any term involving the vanishing factor ${{\bf P}(\overline{E})}$ should be omitted. Similarly if ${E}$ occurs with zero probability.

The above identities allow one to study probabilities, distributions, and expectations on the original sample space by conditioning to the two conditioned spaces.
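
The identities (32) and (34) can be checked on a toy example. In the sketch below, the die-roll sample space, the events `E` (even roll) and `F` (roll at least 5), and the helpers `P`, `P_given`, `E_given` are illustrative assumptions; `P_given` implements the definition (31).

```python
from fractions import Fraction

# Illustrative sample space (an assumption): outcomes w of a fair six-sided die.
omega = range(1, 7)
prob = Fraction(1, 6)

E = lambda w: w % 2 == 0       # event conditioned upon: the roll is even
not_E = lambda w: not E(w)     # complementary event
F = lambda w: w >= 5           # another event
X = lambda w: w                # a scalar random variable

def P(A):
    return sum(prob for w in omega if A(w))

def P_given(A, B):
    """Conditional probability P(A|B), following (31)."""
    return P(lambda w: A(w) and B(w)) / P(B)

def E_given(f, B):
    """Conditional expectation E(f|B): the mean of f under the measure P(.|B)."""
    return sum(f(w) * prob for w in omega if B(w)) / P(B)

# Right-hand sides of (32) and (34).
total_prob = P_given(F, E) * P(E) + P_given(F, not_E) * P(not_E)
total_mean = E_given(X, E) * P(E) + E_given(X, not_E) * P(not_E)

print(total_prob, P(F))
print(total_mean, sum(X(w) * prob for w in omega))
```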

From (32) we obtain the inequality

$\displaystyle {\bf P}(F|E) \leq {\bf P}(F) / {\bf P}(E), \ \ \ \ \ (35)$

thus conditioning can magnify probabilities by a factor of at most ${1/{\bf P}(E)}$ . In particular,

If ${F}$ occurs unconditionally surely, it occurs surely conditioning on ${E}$ also.
If ${F}$ occurs unconditionally almost surely, it occurs almost surely conditioning on ${E}$ also.
If ${F}$ occurs unconditionally with overwhelming probability, it occurs with overwhelming probability conditioning on ${E}$ also, provided that ${{\bf P}(E) \geq c n^{-C}}$ for some ${c, C>0}$ independent of ${n}$ .
If ${F}$ occurs unconditionally with high probability, it occurs with high probability conditioning on ${E}$ also, provided that ${{\bf P}(E) \geq c n^{-a}}$ for some ${c >0}$ and some sufficiently small ${a>0}$ independent of ${n}$ .
If ${F}$ occurs unconditionally asymptotically almost surely, it occurs asymptotically almost surely conditioning on ${E}$ also, provided that ${{\bf P}(E) \geq c}$ for some ${c>0}$ independent of ${n}$ .

Conditioning can distort the probability of events and the distribution of random variables. Most obviously, conditioning on ${E}$ elevates the probability of ${E}$ to ${1}$ , and sends the probability of the complementary event ${\overline{E}}$ to zero. In a similar spirit, if ${X}$ is a random variable uniformly distributed on some finite set ${S}$ , and ${S'}$ is a non-empty subset of ${S}$ , then conditioning to the event ${X \in S'}$ alters the distribution of ${X}$ to now become the uniform distribution on ${S'}$ rather than ${S}$ (and conditioning to the complementary event produces the uniform distribution on ${S \backslash S'}$ ).
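
The following sketch illustrates this distortion, together with the bound (35), for a uniform variable on a finite set. The particular sets `S` and `S_prime`, and the event ${F = (X \leq 3)}$ , are hypothetical choices.

```python
from fractions import Fraction

# Illustrative choices (assumptions): X uniform on S, conditioned to lie in S_prime.
S = range(1, 11)
S_prime = {2, 3, 5, 7}
prob = Fraction(1, len(S))

P_E = sum(prob for x in S if x in S_prime)

# Conditional distribution of X given X in S_prime, via (31).
cond_law = {x: prob / P_E for x in S if x in S_prime}
is_uniform_on_S_prime = all(p == Fraction(1, len(S_prime)) for p in cond_law.values())

# The bound (35) for the event F = (X <= 3).
P_F = sum(prob for x in S if x <= 3)
P_F_given_E = sum(prob for x in S if x <= 3 and x in S_prime) / P_E
bound_holds = P_F_given_E <= P_F / P_E

print(cond_law, P_F, P_F_given_E, bound_holds)
```

Here conditioning raises the probability of ${F}$ from ${3/10}$ to ${1/2}$ , which is within the factor ${1/{\bf P}(E) = 5/2}$ permitted by (35).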

However, events and random variables that are independent of the event ${E}$ being conditioned upon are essentially unaffected by conditioning. Indeed, if ${F}$ is an event independent of ${E}$ , then ${(F|E)}$ occurs with the same probability as ${F}$ ; and if ${X}$ is a random variable independent of ${E}$ (or equivalently, independent of the indicator ${{\bf I}(E)}$ ), then ${(X|E)}$ has the same distribution as ${X}$ .

Remark 5 One can view conditioning to an event ${E}$ and its complement ${\overline{E}}$ as the probabilistic analogue of the law of the excluded middle. In deterministic logic, given a statement ${P}$ , one can divide into two separate cases, depending on whether ${P}$ is true or false; and any other statement ${Q}$ is unconditionally true if and only if it is conditionally true in both of these two cases. Similarly, in probability theory, given an event ${E}$ , one can condition into two separate sample spaces, depending on whether ${E}$ is conditioned to be true or false; and the unconditional statistics of any random variable or event are then a weighted average of the conditional statistics on the two sample spaces, where the weights are given by the probability of ${E}$ and its complement.

Now we consider conditioning with respect to a discrete random variable ${Y}$ , taking values in some range ${R}$ . One can condition on any event ${Y=y}$ , ${y \in R}$ which occurs with positive probability. It is then not difficult to establish the analogous identities to those in Exercise 19:

Exercise 20 Let ${Y}$ be a discrete random variable with range ${R}$ . Then we have

$\displaystyle {\bf P}(F) = \sum_{y \in R} {\bf P}(F|Y=y) {\bf P}(Y=y) \ \ \ \ \ (36)$

for any (unconditional) event ${F}$ , and

$\displaystyle \mu_X = \sum_{y \in R} \mu_{(X|Y=y)} {\bf P}(Y=y) \ \ \ \ \ (37)$

for any (unconditional) random variable ${X}$ (where the sum of non-negative measures is defined in the obvious manner), and for absolutely integrable or non-negative (unconditional) random variables ${X}$ , one has

$\displaystyle {\bf E} X = \sum_{y \in R} {\bf E}(X|Y=y) {\bf P}(Y=y). \ \ \ \ \ (38)$

In all of these identities, we adopt the convention that any term involving ${{\bf P}(Y=y)}$ is ignored when ${{\bf P}(Y=y) = 0}$ .

With the notation as in the above exercise, we define the conditional probability ${{\bf P}(F|Y)}$ of an (unconditional) event ${F}$ conditioning on ${Y}$ to be the (unconditional) random variable that is defined to equal ${{\bf P}(F|Y=y)}$ whenever ${Y=y}$ , and similarly, for any absolutely integrable or non-negative (unconditional) random variable ${X}$ , we define the conditional expectation ${{\bf E}(X|Y)}$ to be the (unconditional) random variable that is defined to equal ${{\bf E}(X|Y=y)}$ whenever ${Y=y}$ . (Strictly speaking, since we are not defining conditional expectation when ${{\bf P}(Y=y)=0}$ , these random variables are only defined almost surely, rather than surely, but this will not cause difficulties in practice; see Remark 1.) Thus (36), (38) simplify to

$\displaystyle {\bf P}(F) = {\bf E}( {\bf P}(F|Y) ) \ \ \ \ \ (39)$

and

$\displaystyle {\bf E}(X) = {\bf E}( {\bf E}(X|Y) ). \ \ \ \ \ (40)$
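To make the identities (39) and (40) concrete, here is a minimal sketch that builds ${{\bf E}(X|Y)}$ as an honest random variable on a finite sample space and checks both identities exactly. The weighted die `P` and the particular choices of `X`, `Y` and `F` are assumptions made only for illustration.

```python
from fractions import Fraction

# Hypothetical finite sample space: outcome -> probability (a weighted die).
P = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 6),
     4: Fraction(1, 6), 5: Fraction(1, 4), 6: Fraction(1, 4)}
X = lambda w: w * w                  # a scalar random variable
Y = lambda w: w % 3                  # a discrete random variable with range {0, 1, 2}
F = lambda w: 1 if w >= 4 else 0     # indicator of an event F

def expectation(Z):
    """E(Z) for a random variable Z on the finite space P."""
    return sum(Z(w) * p for w, p in P.items())

def cond_expectation_given(Z, y):
    """E(Z | Y = y), defined only when P(Y = y) > 0."""
    p_y = sum(p for w, p in P.items() if Y(w) == y)
    return sum(Z(w) * p for w, p in P.items() if Y(w) == y) / p_y

def cond_expectation(Z):
    """E(Z|Y) as a random variable: it equals E(Z | Y = y) wherever Y = y."""
    return lambda w: cond_expectation_given(Z, Y(w))

lhs_40, rhs_40 = expectation(X), expectation(cond_expectation(X))
lhs_39, rhs_39 = expectation(F), expectation(cond_expectation(F))   # P(F) = E(I(F))
print(lhs_40, rhs_40)
print(lhs_39, rhs_39)
```

Here ${{\bf P}(F|Y)}$ is computed as the conditional expectation of the indicator of ${F}$ , which is one convenient way to treat (39) as a special case of (40).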

Remark 6 One can interpret conditional expectation as a type of orthogonal projection; see for instance these previous lecture notes of mine. But we will not use this perspective in this course. Just as conditioning on an event and its complement can be viewed as the probabilistic analogue of the law of the excluded middle, conditioning on a discrete random variable can be viewed as the probabilistic analogue of dividing into finitely or countably many cases. For instance, one could condition on the outcome ${Y \in \{1,2,3,4,5,6\}}$ of a six-sided die, thus conditioning the underlying sample space into six separate subspaces. If the die is fair, then the unconditional statistics of a random variable or event would be an unweighted average of the conditional statistics of the six conditioned subspaces; if the die is weighted, one would take a weighted average instead.

Example 4 Let ${X_1, X_2}$ be iid signed Bernoulli random variables, and let ${Y := X_1+X_2}$ , thus ${Y}$ is a discrete random variable taking values in ${-2,0,+2}$ (with probability ${1/4}$ , ${1/2}$ , ${1/4}$ respectively). Then ${X_1}$ remains a signed Bernoulli random variable when conditioned to ${Y=0}$ , but becomes the deterministic variable ${+1}$ when conditioned to ${Y=+2}$ , and similarly becomes the deterministic variable ${-1}$ when conditioned to ${Y=-2}$ . As a consequence, the conditional expectation ${{\bf E}(X_1|Y)}$ is equal to ${0}$ when ${Y=0}$ , ${+1}$ when ${Y=+2}$ , and ${-1}$ when ${Y=-2}$ ; thus ${{\bf E}(X_1|Y) = Y/2}$ . Similarly ${{\bf E}(X_2|Y)=Y/2}$ ; summing and using the linearity of (conditional) expectation (which follows automatically from the unconditional version) we obtain the obvious identity ${{\bf E}(Y|Y)=Y}$ .
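The computation in Example 4 is small enough to verify by brute-force enumeration. The sketch below lists the four equally likely outcomes of ${(X_1,X_2)}$ and conditions on each value of ${Y}$ ; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = list(product([-1, 1], repeat=2))   # outcomes (X_1, X_2), each with probability 1/4

def cond_dist_X1(y):
    """Distribution of X_1 conditioned on Y = X_1 + X_2 = y."""
    fibre = [w for w in omega if w[0] + w[1] == y]
    values = sorted({w[0] for w in fibre})
    return {x: Fraction(sum(1 for w in fibre if w[0] == x), len(fibre)) for x in values}

def cond_exp_X1(y):
    """E(X_1 | Y = y)."""
    return sum(x * p for x, p in cond_dist_X1(y).items())

for y in (-2, 0, 2):
    print(y, cond_dist_X1(y), cond_exp_X1(y))
```

In each of the three cases the conditional expectation agrees with ${y/2}$ , as claimed.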

If ${X, Y}$ are independent, then ${(X|Y=y) \equiv X}$ for all ${y}$ (with the convention that those ${y}$ for which ${{\bf P}(Y=y)=0}$ are ignored), which implies in particular (for absolutely integrable ${X}$ ) that

$\displaystyle {\bf E}(X|Y) = {\bf E}(X)$

(so in this case the conditional expectation is a deterministic quantity).

Example 5 Let ${X, Y}$ be bounded scalar random variables (not necessarily independent), with ${Y}$ discrete. Then we have

$\displaystyle {\bf E}(XY) = {\bf E}( {\bf E}(XY|Y) ) = {\bf E}( Y {\bf E}(X|Y) )$

where the latter equality holds since ${Y}$ clearly becomes deterministic after conditioning on ${Y}$ .
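As an illustration with dependent variables, one can take (as an assumed example) ${X := X_1}$ and ${Y := X_1+X_2}$ with ${X_1,X_2}$ iid signed Bernoulli, and check the identity of Example 5 by enumeration. The function names below are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = list(product([-1, 1], repeat=2))   # outcomes (X_1, X_2), uniform measure
p = Fraction(1, len(omega))
X = lambda w: w[0]                         # X = X_1
Y = lambda w: w[0] + w[1]                  # Y = X_1 + X_2, not independent of X

def expectation(Z):
    return sum(Z(w) * p for w in omega)

def E_X_given_Y(w):
    """The random variable E(X|Y), evaluated at the sample point w."""
    fibre = [v for v in omega if Y(v) == Y(w)]
    return Fraction(sum(X(v) for v in fibre), len(fibre))

lhs = expectation(lambda w: X(w) * Y(w))             # E(XY)
rhs = expectation(lambda w: Y(w) * E_X_given_Y(w))   # E(Y E(X|Y))
print(lhs, rhs)
```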

We will also need to condition with respect to continuous random variables (this is the probabilistic analogue of dividing into a potentially uncountable number of cases). To do this formally, we need to proceed a little differently from the discrete case, introducing the notion of a disintegration of the underlying sample space.

Definition 9 (Disintegration) Let ${Y}$ be a random variable with range ${R}$ . A disintegration ${(R', (\mu_y)_{y \in R'})}$ of the underlying sample space ${\Omega}$ with respect to ${Y}$ is a subset ${R'}$ of ${R}$ of full measure in ${\mu_Y}$ (thus ${Y \in R'}$ almost surely), together with an assignment of a probability measure ${\mu_y = {\bf P}(|Y=y)}$ on the subspace ${\Omega_y := \{ \omega \in \Omega: Y(\omega)=y\}}$ of ${\Omega}$ for each ${y \in R'}$ , which is measurable in the sense that the map ${y \mapsto {\bf P}(F|Y=y)}$ is measurable for every event ${F}$ , and such that

$\displaystyle {\bf P}(F) = {\bf E} {\bf P}(F|Y)$

for all such events, where ${{\bf P}(F|Y)}$ is the (almost surely defined) random variable defined to equal ${{\bf P}(F|Y=y)}$ whenever ${Y=y}$ .

Given such a disintegration, we can then condition to the event ${Y=y}$ for any ${y \in R'}$ by replacing ${\Omega}$ with the subspace ${\Omega_y}$ (with the induced ${\sigma}$ -algebra), but replacing the underlying probability measure ${{\bf P}}$ with ${{\bf P}(|Y=y)}$ . We can thus condition (unconditional) events ${F}$ and random variables ${X}$ to this event to create conditioned events ${(F|Y=y)}$ and random variables ${(X|Y=y)}$ on the conditioned space, giving rise to conditional probabilities ${{\bf P}(F|Y=y)}$ (which is consistent with the existing notation for this expression) and conditional expectations ${{\bf E}(X|Y=y)}$ (assuming absolute integrability in this conditioned space). We then set ${{\bf E}(X|Y)}$ to be the (almost surely defined) random variable defined to equal ${{\bf E}(X|Y=y)}$ whenever ${Y=y}$ .

Example 6 (Discrete case) If ${Y}$ is a discrete random variable, one can set ${R'}$ to be the essential range of ${Y}$ , which in the discrete case is the set of all ${y \in R}$ for which ${{\bf P}(Y=y)>0}$ . For each ${y \in R'}$ , we define ${{\bf P}(|Y=y)}$ to be the conditional probability measure relative to the event ${Y=y}$ , as defined in Definition 8. It is easy to verify that this is indeed a disintegration; thus the continuous notion of conditional probability generalises the discrete one.

Example 7 (Independent case) Starting with an initial sample space ${\Omega}$ , and a probability measure ${\mu}$ on a measurable space ${R}$ , one can adjoin a random variable ${Y}$ taking values in ${R}$ with distribution ${\mu}$ that is independent of all previously existing random variables, by extending ${\Omega}$ to ${\Omega \times R}$ as in Lemma 4. One can then disintegrate ${Y}$ by taking ${R':=R}$ and letting ${\mu_y}$ be the probability measure on ${\Omega_y = \Omega \times \{y\}}$ induced by the obvious isomorphism between ${\Omega \times \{y\}}$ and ${\Omega}$ ; this is easily seen to be a disintegration. Note that if ${X}$ is any random variable from the original space ${\Omega}$ , then ${(X|Y=y)}$ has the same distribution as ${X}$ for any ${y \in R}$ .

Example 8 Let ${\Omega = [0,1]^2}$ with Lebesgue measure, and let ${(X_1,X_2)}$ be the coordinate random variables of ${\Omega}$ , thus ${X_1,X_2}$ are iid with the uniform distribution on ${[0,1]}$ . Let ${Y}$ be the random variable ${Y := X_1+X_2}$ with range ${R = {\bf R}}$ . Then one can disintegrate ${Y}$ by taking ${R' = [0,2]}$ and letting ${\mu_y}$ be normalised Lebesgue measure on the diagonal line segment ${\{ (x_1,x_2) \in [0,1]^2: x_1+x_2=y\}}$ .
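One can test the defining identity ${{\bf P}(F) = {\bf E} {\bf P}(F|Y)}$ numerically for this disintegration, say for the event ${F = \{X_1 \leq t\}}$ . The sketch below relies on a standard fact not stated in the example: ${Y}$ has the triangular density on ${[0,2]}$ , which equals the length of the ${x_1}$ -range of the segment ${\{x_1+x_2=y\}}$ . The function names and the midpoint-rule quadrature are assumptions made for illustration.

```python
def segment(y):
    """x_1-range [lo, hi] of the segment {x_1 + x_2 = y} inside [0,1]^2, for 0 < y < 2."""
    return max(0.0, y - 1.0), min(1.0, y)

def cond_prob_X1_le(t, y):
    """P(X_1 <= t | Y = y) under normalised Lebesgue measure on the segment."""
    lo, hi = segment(y)
    return max(0.0, min(hi, t) - lo) / (hi - lo)

def density_Y(y):
    """Triangular density of Y = X_1 + X_2 (y on [0,1], 2 - y on [1,2])."""
    lo, hi = segment(y)
    return hi - lo

def prob_X1_le(t, cells=2000):
    """Approximate E P(X_1 <= t | Y) by the midpoint rule over y in [0, 2]."""
    h = 2.0 / cells
    midpoints = ((k + 0.5) * h for k in range(cells))
    return sum(cond_prob_X1_le(t, y) * density_Y(y) * h for y in midpoints)

total = prob_X1_le(0.5)
print(total)   # should be close to P(X_1 <= 1/2) = 1/2
```

The midpoints never hit the degenerate endpoints ${y=0}$ and ${y=2}$ , where the segment collapses to a single point.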

Exercise 21 (Almost uniqueness of disintegrations) Let ${(R',(\mu_y)_{y \in R'})}$ , ${(\tilde R',(\tilde \mu_y)_{y \in \tilde R'})}$ be two disintegrations of the same random variable ${Y}$ . Show that for any event ${F}$ , one has ${{\bf P}(F|Y=y) = \tilde {\bf P}(F|Y=y)}$ for ${\mu_Y}$ -almost every ${y \in R}$ , where the conditional probabilities ${{\bf P}(|Y=y)}$ and ${\tilde {\bf P}(|Y=y)}$ are defined using the disintegrations ${(R',(\mu_y)_{y \in R'})}$ , ${(\tilde R',(\tilde \mu_y)_{y \in \tilde R'})}$ respectively. (Hint: argue by contradiction, and consider the set of ${y}$ for which ${{\bf P}(F|Y=y)}$ exceeds ${\tilde {\bf P}(F|Y=y)}$ (or vice versa) by some fixed ${\epsilon > 0}$ .)

Similarly, for a scalar random variable ${X}$ , show that for ${\mu_Y}$ -almost every ${y \in R}$ , ${(X|Y=y)}$ is absolutely integrable with respect to the first disintegration if and only if it is absolutely integrable with respect to the second disintegration, and one has ${{\bf E}(X|Y=y) = \tilde {\bf E}(X|Y=y)}$ in such cases.

Remark 7 Under some mild topological assumptions on the underlying sample space (and on the measurable space ${R}$ ), one can always find at least one disintegration for every random variable ${Y}$ , by using tools such as the Radon-Nikodym theorem; see Theorem 4 of these previous lecture notes of mine. In practice, we will not invoke these general results here (as it is not natural for us to place topological conditions on the sample space), and instead construct disintegrations by hand in specific cases, for instance by using the construction in Example 7.

Remark 8 Strictly speaking, disintegration is not a probabilistic concept; there is no canonical way to extend a disintegration when extending the sample space. However, due to the (almost) uniqueness and existence results alluded to earlier, this will not be a difficulty in practice. Still, we will try to use conditioning on continuous variables sparingly, in particular containing their use inside the proofs of various lemmas, rather than in their statements, due to their slight incompatibility with the “probabilistic way of thinking”.

Exercise 22 (Fubini-Tonelli theorem) Let ${(R', (\mu_y)_{y \in R'})}$ be a disintegration of a random variable ${Y}$ taking values in a measurable space ${R}$ , and let ${X}$ be a non-negative (resp. absolutely integrable) scalar random variable. Show that for ${\mu_Y}$ -almost all ${y \in R}$ , ${(X|Y=y)}$ is a non-negative (resp. absolutely integrable) random variable, and one has the identity

$\displaystyle \mathop{\bf E}( \mathop{\bf E}(X|Y) ) = \mathop{\bf E}(X), \ \ \ \ \ (41)$

where ${\mathop{\bf E}(X|Y)}$ is the (almost surely defined) random variable that equals ${\mathop{\bf E}(X|Y=y)}$ whenever ${y \in R'}$ . (Note that one first needs to show that ${\mathop{\bf E}(X|Y)}$ is measurable before one can take the expectation.) More generally, show that

$\displaystyle \mathop{\bf E}( \mathop{\bf E}(X|Y) f(Y) ) = \mathop{\bf E}(X f(Y)), \ \ \ \ \ (42)$

whenever ${f: R \rightarrow {\bf R}}$ is a non-negative (resp. bounded) measurable function. (One can essentially take (42), together with the fact that ${\mathop{\bf E}(X|Y)}$ is determined by ${Y}$ , as a definition of the conditional expectation ${\mathop{\bf E}(X|Y)}$ , but we will not adopt this approach here.)

A typical use of conditioning is to deduce a probabilistic statement from a deterministic one. For instance, suppose one has a random variable ${X}$ , and a parameter ${y}$ in some range ${R}$ , and an event ${E(X,y)}$ that depends on both ${X}$ and ${y}$ . Suppose we know that ${{\bf P} E(X,y) \leq \epsilon}$ for every ${y \in R}$ . Then, we can conclude that whenever ${Y}$ is a random variable in ${R}$ independent of ${X}$ , we also have ${{\bf P} E(X,Y) \leq \epsilon}$ , regardless of what the actual distribution of ${Y}$ is. Indeed, if we condition ${Y}$ to be a fixed value ${y}$ (using the construction in Example 7, extending the underlying sample space if necessary), we see that ${{\bf P}(E(X,Y)|Y=y) \leq \epsilon}$ for each ${y}$ ; and then one can integrate out the conditioning using (41) to obtain the claim.

The act of conditioning a random variable to be fixed is occasionally also called freezing.
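The freezing argument can be sketched on a finite example. Everything specific below is hypothetical: the fair die for ${X}$ , the arbitrary distribution chosen for ${Y}$ , and the event ${E(x,y)}$ ; the value `epsilon` is taken as the worst frozen probability over the values that ${Y}$ can attain.

```python
from fractions import Fraction

dist_X = {x: Fraction(1, 6) for x in range(1, 7)}                     # fair die
dist_Y = {1: Fraction(1, 2), 3: Fraction(1, 3), 7: Fraction(1, 6)}    # arbitrary law, Y independent of X

def event(x, y):
    """The event E(x, y); here, hypothetically, 'x and y differ by at most 1'."""
    return abs(x - y) <= 1

def prob_frozen(y):
    """P(E(X, y)) for a frozen, deterministic value y."""
    return sum(p for x, p in dist_X.items() if event(x, y))

# Deterministic input: a bound valid for every frozen value y.
epsilon = max(prob_frozen(y) for y in dist_Y)

# Integrate out the conditioning: P(E(X,Y)) = sum_y P(E(X,Y) | Y=y) P(Y=y),
# where P(E(X,Y) | Y=y) = P(E(X,y)) because X and Y are independent.
prob_random = sum(q * prob_frozen(y) for y, q in dist_Y.items())

# Direct computation on the product space, for comparison.
prob_direct = sum(p * q for x, p in dist_X.items()
                  for y, q in dist_Y.items() if event(x, y))

print(epsilon, prob_random, prob_direct)
```

The bound `prob_random <= epsilon` holds whatever distribution is substituted for `dist_Y`, since a weighted average of numbers that are each at most `epsilon` is itself at most `epsilon`.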

— 5. Convergence —

In a first course in undergraduate real analysis, we learn what it means for a sequence ${x_n}$ of scalars to converge to a limit ${x}$ : for every ${\epsilon > 0}$ , we have ${|x_n-x| \leq \epsilon}$ for all sufficiently large ${n}$ . Later on, this notion of convergence is generalised to metric space convergence, and generalised further to topological space convergence; in these generalisations, the sequence ${x_n}$ can lie in some other space than the space of scalars (though one usually insists that this space is independent of ${n}$ ).

Now suppose that we have a sequence ${X_n}$ of random variables, all taking values in some space ${R}$ ; we will primarily be interested in the scalar case when ${R}$ is equal to ${{\bf R}}$ or ${{\bf C}}$ , but will also need to consider fancier random variables, such as point processes or empirical spectral distributions. In what sense can we say that ${X_n}$ “converges” to a random variable ${X}$ , also taking values in ${R}$ ?

It turns out that there are several different notions of convergence which are of interest. For us, the four most important (in decreasing order of strength) will be almost sure convergence, convergence in probability, convergence in distribution, and tightness of distribution.

Definition 10 (Modes of convergence) Let ${R = (R,d)}$ be a ${\sigma}$ -compact, locally compact metric space (with the Borel ${\sigma}$ -algebra), and let ${X_n}$ be a sequence of random variables taking values in ${R}$ . Let ${X}$ be another random variable taking values in ${R}$ .

${X_n}$ converges almost surely to ${X}$ if, for almost every ${\omega \in \Omega}$ , ${X_n(\omega)}$ converges to ${X(\omega)}$ , or equivalently
$\displaystyle {\bf P}( \limsup_{n \rightarrow \infty} d(X_n,X) \leq \epsilon ) = 1$

for every ${\epsilon > 0}$ .

${X_n}$ converges in probability to ${X}$ if, for every ${\epsilon > 0}$ , one has
$\displaystyle \liminf_{n \rightarrow \infty} {\bf P}( d(X_n,X) \leq \epsilon ) = 1,$

or equivalently if ${d(X_n,X) \leq \epsilon}$ holds asymptotically almost surely for every ${\epsilon > 0}$ .

${X_n}$ converges in distribution to ${X}$ if, for every bounded continuous function ${F: R \rightarrow {\bf R}}$ , one has
$\displaystyle \lim_{n \rightarrow\infty} \mathop{\bf E} F(X_n) = \mathop{\bf E} F(X).$

${X_n}$ has a tight sequence of distributions if, for every ${\epsilon > 0}$ , there exists a compact subset ${K}$ of ${R}$ such that ${\mathop{\bf P}( X_n \in K ) \geq 1 - \epsilon}$ for all sufficiently large ${n}$ .

Remark 9 One can relax the requirement that ${R}$ be a ${\sigma}$ -compact, locally compact metric space in the definitions, but then some of the nice equivalences and other properties of these modes of convergence begin to break down. In our applications, though, we will only need to consider the ${\sigma}$ -compact, locally compact metric space case. Note that all of these notions are probabilistic (i.e. they are preserved under extensions of the sample space).

Exercise 23 (Implications and equivalences) Let ${X_n, X}$ be random variables taking values in a ${\sigma}$ -compact, locally compact metric space ${R}$ .

(i) Show that if ${X_n}$ converges almost surely to ${X}$ , then ${X_n}$ converges in probability to ${X}$ . (Hint: Fatou’s lemma.)

(ii) Show that if ${X_n}$ converges in distribution to ${X}$ , then ${X_n}$ has a tight sequence of distributions.

(iii) Show that if ${X_n}$ converges in probability to ${X}$ , then ${X_n}$ converges in distribution to ${X}$ . (Hint: first show tightness, then use the fact that on compact sets, continuous functions are uniformly continuous.)

(iv) Show that ${X_n}$ converges in distribution to ${X}$ if and only if ${\mu_{X_n}}$ converges to ${\mu_X}$ in the vague topology (i.e. ${\int f\ d\mu_{X_n} \rightarrow \int f\ d\mu_X}$ for all continuous functions ${f: R \rightarrow {\bf R}}$ of compact support).

(v) Conversely, if ${X_n}$ has a tight sequence of distributions, and ${\mu_{X_n}}$ is convergent in the vague topology, show that ${X_n}$ is convergent in distribution to another random variable (possibly after extending the sample space). What happens if the tightness hypothesis is dropped?

(vi) If ${X}$ is deterministic, show that ${X_n}$ converges in probability to ${X}$ if and only if ${X_n}$ converges in distribution to ${X}$ .

(vii) If ${X_n}$ has a tight sequence of distributions, show that there is a subsequence of the ${X_n}$ which converges in distribution. (This is known as Prokhorov’s theorem).

(viii) If ${X_n}$ converges in probability to ${X}$ , show that there is a subsequence of the ${X_n}$ which converges almost surely to ${X}$ .

(ix) ${X_n}$ converges in distribution to ${X}$ if and only if ${\liminf_{n \rightarrow \infty} {\bf P}(X_n \in U) \geq {\bf P}(X \in U)}$ for every open subset ${U}$ of ${R}$ , or equivalently if ${\limsup_{n \rightarrow \infty} {\bf P}(X_n \in K) \leq {\bf P}(X \in K)}$ for every closed subset ${K}$ of ${R}$ .

Remark 10 The relationship between almost sure convergence and convergence in probability may be clarified by the following observation. If ${E_n}$ is a sequence of events, then the indicators ${{\bf I}(E_n)}$ converge in probability to zero iff ${{\bf P}(E_n) \rightarrow 0}$ as ${n \rightarrow \infty}$ , but converge almost surely to zero iff ${{\bf P}(\bigcup_{n \geq N} E_n) \rightarrow 0}$ as ${N \rightarrow \infty}$ .

Example 9 Let ${Y}$ be a random variable drawn uniformly from ${[0,1]}$ . For each ${n \geq 1}$ , let ${E_n}$ be the event that the decimal expansion of ${Y}$ begins with the decimal expansion of ${n}$ , e.g. every real number in ${[0.25, 0.26)}$ lies in ${E_{25}}$ . (Let us ignore the annoying ${0.999\ldots=1.000\ldots}$ ambiguity in the decimal expansion here, as it will almost surely not be an issue.) Then the indicators ${{\bf I}(E_n)}$ converge in probability and in distribution to zero, but do not converge almost surely.
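The mechanism behind Example 9 can be seen in a short sketch: the probability of ${E_n}$ decays like ${10^{-d}}$ where ${d}$ is the number of digits of ${n}$ , yet a fixed sample point whose first digit is non-zero lies in ${E_n}$ for every ${n}$ that is a prefix of its digit string. The particular digit string and the helper names are hypothetical, and the code assumes the literal reading in which ${n}$ is written without leading zeroes.

```python
from fractions import Fraction

def prob_E(n):
    """P(E_n): Y must match all len(str(n)) digits of n, giving probability 10^(-digits)."""
    return Fraction(1, 10 ** len(str(n)))

def in_E(n, y_digits):
    """Does the sample point with decimal digits `y_digits` (a string) lie in E_n?"""
    return y_digits.startswith(str(n))

y_digits = "2718281828"     # hypothetical sample point Y = 0.2718281828...
hits = [n for n in range(1, 30000) if in_E(n, y_digits)]

print(prob_E(25), prob_E(2718))   # 1/100 and 1/10000: P(E_n) -> 0
print(hits)                       # one hit per prefix length: the indicators keep returning to 1
```

So ${{\bf I}(E_n)}$ equals ${1}$ along an infinite sequence of ${n}$ at such a sample point, which rules out almost sure convergence to zero even though ${{\bf P}(E_n) \rightarrow 0}$ .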

If ${y_n}$ is the ${n^{th}}$ digit of ${Y}$ , then the ${y_n}$ converge in distribution (to the uniform distribution on ${\{0,1,\ldots,9\}}$ ), but do not converge in probability or almost surely. Thus we see that the latter two notions are sensitive not only to the distribution of the random variables, but also to how they are positioned in the sample space.

The limit of a sequence converging almost surely or in probability is clearly unique up to almost sure equivalence, whereas the limit of a sequence converging in distribution is only unique up to equivalence in distribution. Indeed, convergence in distribution is really a statement about the distributions ${\mu_{X_n}, \mu_X}$ rather than about the random variables ${X_n, X}$ themselves. In particular, for convergence in distribution one does not care about how correlated or dependent the ${X_n}$ are with respect to each other, or with ${X}$ ; indeed, they could even live on different sample spaces ${\Omega_n, \Omega}$ and we would still have a well-defined notion of convergence in distribution, even though the other two notions cease to make sense (except when ${X}$ is deterministic, in which case we can recover convergence in probability by Exercise 23(vi)).

Exercise 24 (Borel-Cantelli lemma) Suppose that ${X_n, X}$ are random variables such that ${\sum_n {\bf P}(d(X_n,X) \geq \epsilon) < \infty}$ for every ${\epsilon > 0}$ . Show that ${X_n}$ converges almost surely to ${X}$ .

Exercise 25 (Convergence and moments) Let ${X_n}$ be a sequence of scalar random variables, and let ${X}$ be another scalar random variable. Let ${k, \epsilon > 0}$ .

(i) If ${\sup_n {\bf E} |X_n|^k < \infty}$ , show that ${X_n}$ has a tight sequence of distributions.

(ii) If ${\sup_n {\bf E} |X_n|^k < \infty}$ and ${X_n}$ converges in distribution to ${X}$ , show that ${{\bf E} |X|^k \leq \liminf_{n \rightarrow \infty} {\bf E} |X_n|^k}$ .

(iii) If ${\sup_n {\bf E} |X_n|^{k+\epsilon} < \infty}$ and ${X_n}$ converges in distribution to ${X}$ , show that ${{\bf E} |X|^k = \lim_{n \rightarrow \infty} {\bf E} |X_n|^k}$ .

(iv) Give a counterexample to show that (iii) fails when ${\epsilon=0}$ , even if we upgrade convergence in distribution to almost sure convergence.

(v) If the ${X_n}$ are uniformly bounded and real-valued, and ${{\bf E} X^k = \lim_{n \rightarrow \infty} {\bf E} X_n^k}$ for every ${k=0,1,2,\ldots}$ , then ${X_n}$ converges in distribution to ${X}$ . (Hint: use the Weierstrass approximation theorem. Alternatively, use the analytic nature of the moment generating function ${{\bf E} e^{tX}}$ and analytic continuation.)

(vi) If the ${X_n}$ are uniformly bounded and complex-valued, and ${{\bf E} X^k \overline{X}^l = \lim_{n \rightarrow \infty} {\bf E} X_n^k \overline{X_n}^l}$ for every ${k,l=0,1,2,\ldots}$ , then ${X_n}$ converges in distribution to ${X}$ . Give a counterexample to show that the claim fails if one only considers the cases when ${l=0}$ .

There are other interesting modes of convergence on random variables and on distributions, such as convergence in total variation norm, in the Lévy-Prokhorov metric, or in Wasserstein metric, but we will not need these concepts in this course.

134 comments

11 April, 2012 at 7:01 pm

Rex

When defining real-valued random variables, does one typically equip the real line with the Borel measure, or its Lebesgue completion? Does this distinction matter much in practice?

For instance, does one have to do a significant amount of extra work to check that certain random variables are Lebesgue-measurable as opposed to merely Borel-measurable?

In the definition you refer only to the Borel measure, but later on you mention some issues about pullbacks of (Lebesgue) null sets when discussing absolute continuity of random variables.

11 April, 2012 at 8:28 pm

Terence Tao

In general, the Borel sigma algebra is slightly more convenient to use than the Lebesgue sigma algebra for the _range_ of a measurable function, but the Lebesgue sigma algebra can be the more convenient one to use for the _domain_ of a measurable function. But the main advantage of Lebesgue measure, namely completeness, is more useful in measure theory than in probability theory; for most probabilistic applications one does not actually need completeness.

p.s. I don’t know what issue about pullbacks of null sets you are referring to in your comment.

11 April, 2012 at 8:32 pm

Rex

I did not really mean to say there was any “issue”, but rather just that you switched from Borel measure to Lebesgue measure in the following passage:

“Now we turn to non-discrete random variables {X} taking values in some range {R}. We say that a random variable is continuous if {{\bf P}(X=x)=0} for all {x \in R} (here we assume that all points are measurable). If {R} is already equipped with some reference measure {dm} (e.g. Lebesgue measure in the case of scalar, vector, or matrix-valued random variables), we say that the random variable is absolutely continuous if {{\bf P}(X \in S)=0} for all null sets {S} in {R}. ”

and it was not clear to me whether there was any significance in this switch.

11 April, 2012 at 8:43 pm

Terence Tao

Ah. I tend to use Lebesgue measure to denote both the standard measure on the Lebesgue sigma algebra, as well as its restriction to the Borel sigma algebra (which is indeed the slightly more natural sigma algebra to use in this context). (The terminology “the Borel measure on ${\bf R}$ ” to denote this restriction is also in use, but somewhat less common, perhaps because it can be confused with the more general concept of a Borel measure.)

5 September, 2012 at 6:12 pm

Gelasio Salazar

Dear Terry,

One question about the distinction between “with high probability” and “asymptotically almost surely”. We just got a referee report in which they ask us to change “w.h.p.” to “a.a.s.”, since in a particular lemma, all we can prove is that a certain event holds with probability 1 -o(1). I would have normally used w.h.p. and a.a.s. interchangeably, but after the referee’s remark (s/he gave your Notes as reference) I realized we need to be more careful. In the revised version we’ll use “a.a.s.”, and I was wondering if you were aware of other sources in which this distinction between “overwhelming probability”, “with high probability” and “asymptotically almost surely” is used.

Last but not least, thanks for your comprehensive notes in Probability Theory.

5 September, 2012 at 8:21 pm

Terence Tao

“asymptotically almost surely”, when it is used in literature, invariably means 1-o(1), but for “with high probability” there is less consensus; I have seen it used for both 1-o(1) and for 1-O(n^{-c}) (though not in the same paper, of course). But given that a.a.s. is a perfectly useful and accepted notation for 1-o(1), it seems logical to me to exclusively use w.h.p for 1-O(n^{-c}) instead.

19 September, 2012 at 10:59 am

Jack

Could you give an example of your saying that “If one was particularly well organised, one could in principle work out in advance all of the random variables one would ever want or need, and then specify the sample space accordingly, before doing any actual probability theory.”?

19 September, 2012 at 11:28 am

Jack

Can one say to some degree that a random variable $X$ on $(\Omega,\mathcal{B},\mathcal{P})$ can be regarded as an extension $\pi$ ?

27 September, 2012 at 5:24 pm

Jack

I’m confused about the concept “pushforward”. Let $(\Omega,{\mathcal A},P)$ be a probability space and $X$ a random variable on this space with range $(R, {\mathcal R})$ . Then $\mu_X$ is a probability measure on $(R, {\mathcal R})$ . However, according to your notes of 245A, $(R, {\mathcal R})$ can be any measurable space, for example $({\mathbb R},2^{\mathbb R})$ . But I also learned that one cannot assign a measure to $({\mathbb R},2^{\mathbb R})$ that renders it a measure space. What am I doing wrong here?

28 September, 2012 at 3:08 am

Terence Tao

One can place several measures on $({\mathbb R}, 2^{\mathbb R})$ , e.g. a Dirac measure. (But one cannot have a non-trivial translation-invariant measure on this space, due to Banach-Tarski type paradoxes.)

28 September, 2012 at 6:02 am

Jack

Ah, I see the point. As you said in this note, the underlying sample space of a random variable is often not specified. And the range of the random variable $X$ , according to Remark 2, can also go unmentioned, as I understand it. I’m puzzled about this: to what extent should one specify a random variable? What’s left of a function when one does not specify its domain and range?

I have often seen people say something like “consider an $R$ -valued random variable”. They don’t even specify which $\sigma-$ algebra is used for $R$ . What’s more, I’ve never read that one specifies a measure for the range measurable space $(R,{\mathcal R})$ . Is it because we have an immediate one, $\mu_X$ , or does it not matter at all?

28 September, 2012 at 12:27 pm

Terence Tao

We usually don’t specify the sample space of a random variable for much the same reason we don’t specify which base (e.g. base 10, binary, etc.) we use to represent numbers, or which coordinate system we use to represent a manifold. We could specify these representations, if desired, but we wish to focus on those aspects of the mathematical objects being studied that are independent of the choice of representation, and so it is usually counterproductive to devote too much attention to these representations. (This is discussed near the beginning of this blog post.)

Also, one should make a distinction between a measurable space $(R, {\mathcal R})$ and a measure space $(R, {\mathcal R}, \mu)$ . A measurable space can be turned into a measure space by specifying a measure, but there are multiple measures one could use for this purpose, and in many cases it is better to not specify a measure at all.

15 October, 2012 at 9:01 am

SAADA

Professor Tao, would you have a French version of this note?
Thank you in advance.

15 October, 2012 at 10:08 am

Terence Tao

No, but automatic translation tools, such as Google Translate, can do a reasonable job: http://translate.google.com/translate?sl=en&tl=fr&js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fterrytao.wordpress.com%2F2010%2F01%2F01%2F254a-notes-0-a-review-of-probability-theory&act=url

15 October, 2012 at 10:35 am

Daniel

Thank you very much, Professor.

http://www.daniel-saada.eu

19 January, 2013 at 12:14 pm

tim

Hi Prof Tao,

Thanks for sharing your note!

I don’t quite understand why the extension is defined to be surjective in Tao’s blog. Is the concept “extension” trying to become morphisms in some category of probability spaces?

I found two links http://theoreticalatlas.wordpress.com/2010/11/11/categorifying-measure-theory/ and http://mathoverflow.net/questions/49426/is-there-a-category-structure-one-can-place-on-measure-spaces-so-that-category-th. Although I can only understand some small part of them, I hope they can be interesting to you!

Thanks!

23 March, 2013 at 5:05 am

Marek

Dear Prof. Tao,

Excellent post, thank you for sharing! I have a related question, and I would be very grateful if you or someone else could provide me an answer.

As is well-known, the total variation distance between (the laws of) two random variables X and Y defined on R is given by sup|E[g(X)]−E[g(Y)]|, where the supremum is taken over all g:R→[0,1] that are measurable. Here is my question: do we obtain the same definition if one considers the supremum over all g:R→[0,1] that are continuous? Or over all g:R→[0,1] that are differentiable?

Thanks a lot in advance!

23 March, 2013 at 7:11 am

Terence Tao

Yes, basically because test functions are dense in $L^1(\mu)$ for any Borel probability measure on the real line (note that all Borel probability measures on R are Radon measures), thanks to basic tools such as the Stone-Weierstrass theorem, Lusin’s theorem, the Tietze extension theorem, and Urysohn’s lemma (as discussed in this previous blog post). Applying this general fact to $\mu := (\mu_X + \mu_Y)/2$ , the Borel probability measure formed by the average of the two laws, and using the Radon-Nikodym theorem to write $\mu_X, \mu_Y$ as bounded multiples of $\mu$ , we obtain the desired equivalence.

23 March, 2013 at 7:56 am

Marek

Thank you very much for the prompt reply! It is very clear in my mind now. Best, Marek

11 July, 2013 at 11:53 am

Probability and Statistics Books Online | Download free ebook

[…] A review of probability theory by Terence Tao, 2010 […]

16 September, 2013 at 7:12 am

Free Mathematics eBooks Online : Probability and Statistics | Top Free Books

[…] A review of probability theory by Terence Tao, 2010 […]

16 November, 2013 at 10:12 pm

Qualitative probability theory, types, and the group chunk and group configuration theorems | What's new

[…] classical foundations of probability theory (discussed for instance in this previous blog post) is founded on the notion of a probability space – a space (the sample space) equipped with […]

10 June, 2014 at 6:23 pm

Ehsan

Dear Prof. Tao,

Thank you for sharing your notes!

My question is about the probabilistic way of thinking. I understand that the cardinality of an event is not a probabilistic notion because it is not preserved under the extension of the probability space.

Can rank and eigenvalues be probabilistic notions? If we consider the space of n-by-n matrices and extend it to the space of (n+1)-by-(n+1) matrices, then the rank and eigenvalues may change as well.

I think I am missing something here.

Thanks!

10 June, 2014 at 9:13 pm

Terence Tao

The space of n by n matrices is not a probability space (and conversely, most probability spaces are not spaces of matrices), so I don’t see how your scenario would connect with probability theory.

Of course, if one had a matrix-valued random variable $A: \Omega \to M_n({\bf C})$ (i.e. a random matrix), one could compute its rank and eigenvalues, giving a random natural number and a random (multi-)set of points (i.e. a point process) respectively. These would all be probabilistic notions, but in all cases one would be extending the probabilistic domain $\Omega$ rather than the matrix space $M_n({\bf C})$ when performing probabilistic extensions.
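As a minimal sketch of this point, one can take a toy finite sample space (the four-point space, the particular $2 \times 2$ matrices, and the biased coin below are all assumptions made up for illustration) and check that the law of the rank of a random matrix is unchanged when $\Omega$ is extended, while the matrix space itself is never touched:

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample space: four outcomes with uniform probability.
OMEGA = {w: Fraction(1, 4) for w in range(4)}

def A(w):
    """A random 2x2 matrix, i.e. a function from the sample space to matrices."""
    return [
        [[0, 0], [0, 0]],  # rank 0
        [[1, 0], [0, 0]],  # rank 1
        [[1, 2], [2, 4]],  # rank 1
        [[1, 0], [0, 1]],  # rank 2
    ][w]

def rank2x2(m):
    """Rank of a 2x2 integer matrix via its determinant."""
    (a, b), (c, d) = m
    if a * d - b * c != 0:
        return 2
    return 1 if any((a, b, c, d)) else 0

def law(space, variable):
    """Distribution of a random variable on a finite probability space."""
    dist = Counter()
    for w, p in space.items():
        dist[variable(w)] += p
    return dict(dist)

# Extend the sample space by an independent biased coin.
COIN = {"H": Fraction(1, 3), "T": Fraction(2, 3)}
OMEGA_EXT = {(w, c): p * q for w, p in OMEGA.items() for c, q in COIN.items()}

def pi(w_ext):
    """Factor map from the extended space back to the original one."""
    return w_ext[0]

rank_law = law(OMEGA, lambda w: rank2x2(A(w)))
rank_law_ext = law(OMEGA_EXT, lambda w_ext: rank2x2(A(pi(w_ext))))
```

Here `rank_law` and `rank_law_ext` coincide, whereas the number of outcomes in the sample space doubles, which is the sense in which cardinality is not a probabilistic notion but the rank of a random matrix is.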

11 June, 2014 at 3:05 pm

Ehsan

Dear Prof. Tao,

Thank you for the answer. As you mentioned, I missed the fact that the probabilistic domain $\Omega$ is extended rather than the space of matrices.

16 June, 2014 at 7:47 pm

CRQ

Dear Prof Tao,

Quick notation question: in definition 1, you say “for some $C$ independent of $n$ and all $n \geq C$”. Do you mean instead that there exists $n_0$ such that this holds “for some $C$ independent of $n$ and all $n \geq n_0$”? Thank you!

17 June, 2014 at 6:26 am

Terence Tao

As long as there are no parameters in play besides $n$ , this is equivalent to the definition presented, since one could always raise either $C$ or $n_0$ to be equal to each other. Once there are parameters, though, one has to be more careful. For instance if there are two parameters $k,l$ and one is asserting that $X = O_k(Y)$ , then the conventions I use are that this means that $|X| \leq C_k Y$ for all $n \geq C_k$ , where $C_k$ depends only on $k$ but not on $l$ or $n$ . With the definition involving $n_0$ , it can be ambiguous as to whether $n_0$ is allowed to depend on $l$ or not.
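In the one-parameter case the equivalence can be written out explicitly. This is a sketch, assuming $Y \geq 0$; the name $C'$ is introduced here for illustration:

```latex
|X| \leq C Y \ \text{ for all } n \geq n_0
\quad \Longrightarrow \quad
|X| \leq C' Y \ \text{ for all } n \geq C', \qquad C' := \max(C, n_0),
```

since $n \geq C'$ forces $n \geq n_0$, and $C Y \leq C' Y$. Conversely, a bound valid for all $n \geq C$ is of the second form with $n_0 := C$.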

17 June, 2014 at 10:03 pm

nomen

I have been trying to define/explore “conditional random variables” for a while. In particular, I find that “sampling notation”, and what appears to be an adjunction in the algebra of random variables, makes them very natural. For example, consider $X \sim \mathbf{Z}, Y \sim \mathbf{U}_X$ . It is very natural to treat $Y$ as being conditional on $X$ , since $Y$ is uniformly distributed on $(0,X)$ .

However, when I have tried to find out about this idea (mostly on math.stackexchange.com and mathoverflow.net), people have been dismissive. Actually, somebody on mathoverflow.net was supportive and suggested that I look into disintegrations.

Can you comment on the feasibility of defining such objects? Are they compatible with disintegrations, as you have used them?

17 June, 2014 at 10:09 pm

nomen

Sorry for the self-reply. I just wanted to clarify my motivation.

I often see that $x\sim (X|Y)$ used as shorthand for

$y \sim Y, x \sim (X|Y=y)$

This “sample and bind” shorthand seems completely straight-forward to me, since we can “always” recover the distribution of Y from the joint distribution of (X,Y) by marginalizing X out. Are marginalizing and conditioning adjoints? It seems that the notation witnesses triple-ability, if they are adjoint at all.

This is effectively what I asked on mathoverflow.
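The “sample and bind” shorthand can be illustrated with a small simulation of the earlier example $X \sim \mathbf{Z}, Y \sim \mathbf{U}_X$. This is only a sketch: the law $\mathbf{Z}$ is not specified in the comment, so an exponential distribution is used as a stand-in, and the function names are hypothetical.

```python
import random

def sample_and_bind(sample_first, sample_second_given_first, rng):
    """Draw the first variable, then draw the second from the law it indexes."""
    first = sample_first(rng)
    second = sample_second_given_first(first, rng)
    return first, second

def draw_x(rng):
    # Assumption: an Exp(1) law stands in for the unspecified distribution Z.
    return rng.expovariate(1.0)

def draw_y_given_x(x, rng):
    # Y is uniform on (0, x) once X = x has been observed.
    return rng.uniform(0.0, x)

rng = random.Random(0)  # fixed seed so the run is reproducible
pairs = [sample_and_bind(draw_x, draw_y_given_x, rng) for _ in range(1000)]

# Marginalising is just forgetting a coordinate of the joint sample.
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
```

The joint samples in `pairs` are what the two-step notation describes; `xs` and `ys` are the marginal samples obtained by discarding the other coordinate.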

22 June, 2014 at 11:15 am

Felix V.

Dear Prof. Tao,

Thank you very much for this nice post and your book on Random Matrices.

As I noted in this stackexchange post (http://math.stackexchange.com/questions/842989/the-completeness-assumption-in-prokhorovs-theorem/843397#843397), Exercise 23 seems to have mild problems.

You only assume that $R$ is $\sigma$ -compact, not necessarily complete.

In that case, part (iv) of that exercise fails. Take e.g. $R = \mathbb{Q}$ . Then there are no nontrivial continuous functions with compact support, so that one of the conditions is vacuous.

I am also not sure that part (i) is really true in that setting. The proofs that I found all use the completeness of $R$ to show that certain sets (namely those of the form $K = \bigcap_j \bigcup_{i=1}^{k_j} \overline{B}(a_i, 1/j)$ ) are compact, as they are closed and totally bounded.

Sadly, I did not manage to produce a counterexample up to now.

Best regards,

Felix V.

22 June, 2014 at 1:11 pm

Terence Tao

Oops, you’re right; I had intended to assume that the metric space is both sigma compact and locally compact, but I think I thought that the latter hypothesis was implied by the former one for metric spaces, but as you point out, the example of the rationals shows that this is not the case. I’ve modified the text accordingly.

24 June, 2014 at 5:14 am

Felix V.

Thanks for the answer.

In my post above I was actually referring to part (ii) of the exercise, not part (i), sorry about that.

Do you know if this part (ii) is correct without the completeness assumption? Wikipedia (http://en.wikipedia.org/wiki/Prokhorov%27s_theorem , part 1 of “Statement of the theorem”) also only assumes that the metric space is separable, but they give neither a proof, nor a source.

24 June, 2014 at 7:42 am

Xiteng Liu

Congratulations Terry for The Breakthrough Award! You deserve it for your endeavor and enthusiasm in math.

27 June, 2014 at 3:00 am

Alon Gonen

Dear Professor Terence Tao,
How do you derive equation 1.24 using dominated convergence?
Thanks

27 June, 2014 at 12:29 pm

Terence Tao

$\lambda^k 1_{|X| \geq \lambda}$ is dominated by $|X|^k$ , and converges a.s. to zero as $\lambda \to \infty$ .
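Written out, and assuming that the equation in question is the decay statement $\lambda^k {\bf P}(|X| \geq \lambda) \to 0$ for $X$ with finite $k^{th}$ moment (the equation itself is not reproduced in this thread), the argument reads:

```latex
0 \leq \lambda^k 1_{|X| \geq \lambda} \leq |X|^k, \qquad {\bf E} |X|^k < \infty,
\\
\lambda^k 1_{|X| \geq \lambda} \to 0 \ \text{ almost surely as } \lambda \to \infty
\quad (\text{since } |X| < \infty \text{ a.s.}),
\\
\lambda^k {\bf P}(|X| \geq \lambda) = {\bf E} \big( \lambda^k 1_{|X| \geq \lambda} \big) \to 0
\quad \text{by dominated convergence}.
```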

15 July, 2014 at 4:22 am

Real analysis relative to a finite measure space | What's new

[…] 1 In my previous post on the foundations of probability theory, I emphasised the freedom to extend the sample space to a larger sample space whenever one wished […]

11 August, 2014 at 7:40 am

254A, Notes 0: A review of probability theory | Mathemania

[…] 254A, Notes 0: A review of probability theory. […]

4 March, 2015 at 12:57 pm

Aryeh

Dear Prof. Tao, I don’t see how Poisson variables are subgaussian. In particular, the Laplace transform of the Poisson distribution is not upper-bounded by $e^{c t^2}$. Best, -Aryeh

[Oops, you’re right – corrected, thanks -T.]
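For reference, the computation behind this observation is the standard one for a Poisson variable $X$ of intensity $\lambda$ (the symbols $\lambda$ and $c$ below are generic and not tied to a specific equation in the post):

```latex
{\bf E} e^{tX} = \sum_{k=0}^\infty e^{tk} \, e^{-\lambda} \frac{\lambda^k}{k!}
 = \exp\big( \lambda (e^t - 1) \big),
```

and for any fixed $c > 0$ one has $\lambda(e^t - 1) > c t^2$ once $t$ is sufficiently large, so no bound of the form $e^{ct^2}$ can hold for all $t$.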

29 September, 2015 at 9:54 pm

275A, Notes 0: Foundations of probability theory | What's new

[…] the classical measure-theoretic foundations of probability. (I wrote on these foundations also in this previous blog post, but in that post I already assumed that the reader was familiar with measure theory and basic […]

3 October, 2015 at 2:58 pm

275A, Notes 1: Integration and expectation | What's new

[…] Remark 2 There is one minor exception to this general rule if we do not impose the additional requirement that the factor map is surjective. Namely, for non-surjective , it can become possible that two events are unequal in the original sample space model, but become equal in the extension (and similarly for random variables), although the converse never happens (events that are equal in the original sample space always remain equal in the extension). For instance, let be the discrete probability space with and , and let be the discrete probability space with , and non-surjective factor map defined by . Then the event modeled by in is distinct from the empty event when viewed in , but becomes equal to that event when viewed in . Thus we see that extending the sample space by a non-surjective factor map can identify previously distinct events together (though of course, being probability preserving, this can only happen if those two events were already almost surely equal anyway). This turns out to be fairly harmless though; while it is nice to know if two given events are equal, or if they differ by a non-null event, it is almost never useful to know that two events are unequal if they are already almost surely equal. Alternatively, one can add the additional requirement of surjectivity in the definition of an extension, which is also a fairly harmless constraint to impose (this is what I chose to do in this previous set of notes). […]

6 October, 2015 at 9:40 am

Ben Golub

As a complement to Durrett’s book, I always found Amir Dembo’s notes on probability to be useful for a course of this type… their main virtue is that they are very systematic and (perhaps obsessively!) organized. http://statweb.stanford.edu/~adembo/stat-310a/lnotes.pdf

24 May, 2016 at 7:40 am

Robert

Thanks for the post. In relation to Terry’s definition of random matrices, does anybody know the best way to define a random matrix when its shape is also a random variable? For instance, suppose A is a random matrix such that A: Omega -> {space of matrices with m rows}. Thus A maps to the space of matrices with m rows, but the number of columns is itself a random variable. Is there a common way to construct such a random matrix?

13 July, 2017 at 4:39 pm

Making Probability Mathematical | Infinite Series

[…] 254A, Notes 0: A review of probability theory Kolmogorov – Foundations of the Theory of Probability Ian Hacking – The Emergence of Probability […]

22 November, 2017 at 8:04 am

Mateo Wirth

I am having trouble with the definition of a disintegration given above. In particular, without the condition that for any measurable subset S of R, letting E be the event that Y is in S, we have
$P(F \wedge E) = E P(F|Y) I(Y \in S),$
I can’t seem to conclude that
$P(F| Y = y) = P(F \wedge E | Y = y) \quad \forall y \in S$ .

22 November, 2017 at 8:13 am

Mateo Wirth

Never mind, I see how it works now. Sorry!

12 February, 2019 at 6:33 am

The probability space as a fiction | Hydrobates

[…] search starting from this suspicion let me to an enlightening blog post of Terry Tao called ‘Notes 0: A review of probability theory‘. There he reviews ‘foundational aspects of probability theory’. Fairly early in […]

23 February, 2019 at 1:12 pm

What's the meaning of "random" in Mathematics? | Page 2 | Physics Forums

[…] measure space; we will return to this point when we discuss free probability later in this course. https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ so for starting out: why not focus on probabilistic concepts as opposed to representation in […]

7 August, 2019 at 8:33 am

Making Probability Mathematical | Infinite Series - دکتر تیز

[…] 254A, Notes 0: A review of probability theory Kolmogorov – Foundations of the Theory of Probability Ian Hacking – The Emergence of Probability […]

12 November, 2019 at 6:47 pm

254A, Notes 9 – second moment and entropy methods | What's new

[…] In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review. […]

14 December, 2019 at 12:58 pm

Is there a category consisting of probability spaces as objects and measurable functions as morphisms? – Lofgren's financial decisions

[…] this post, Terance Tao writes ”probability theory is only ‘allowed’ to study concepts and […]

7 July, 2020 at 7:05 pm

Server Bug Fix: Categorification of probability theory: what does a “probability sheaf” tell us (if anything) about probability theory? - TECHPRPR

[…] If you’re interested in this field, I think you will want to read these references (plus the blog post by Tao that Simpson cites), and you may need to give it time before super compelling applications […]

12 January, 2021 at 2:44 am

Anonymous

Dear Prof. Tao,
I am an undergraduate student majoring in mathematics, and I really enjoyed the way you explained the “probabilistic way of thinking” and ideas of conditioning on events or extending the sample space in a way far better than most introductory texts that I’ve read. (Although most of this article went over my head). I would be really grateful if you could suggest some references/ lecture notes written in a similar style, which are more suited to an undergraduate looking to learn probability theory.
Thank you.

11 October, 2021 at 9:19 am

254A, Supplement 4: Probabilistic models and heuristics for the primes (optional) | What's new

[…] material in this set of notes presumes some prior exposure to probability theory. See for instance this previous post for a quick review of the relevant […]
