Starting this week, I will be teaching an introductory graduate course (Math 275A) on probability theory here at UCLA. While I find myself *using* probabilistic methods routinely nowadays in my research (for instance, the probabilistic concept of Shannon entropy played a crucial role in my recent paper on the Chowla and Elliott conjectures, and random multiplicative functions similarly played a central role in the paper on the Erdos discrepancy problem), this will actually be the first time I will be *teaching* a course on probability itself (although I did give a course on random matrix theory some years ago that presumed familiarity with graduate-level probability theory). As such, I will be relying primarily on an existing textbook, in this case Durrett’s Probability: Theory and Examples. I still need to prepare lecture notes, though, and so I thought I would continue my practice of putting my notes online, although in this particular case they will be less detailed or complete than with other courses, as they will mostly be focusing on those topics that are not already comprehensively covered in the text of Durrett. Below the fold are my first such set of notes, concerning the classical measure-theoretic foundations of probability. (I wrote on these foundations also in this previous blog post, but in that post I already assumed that the reader was familiar with measure theory and basic probability, whereas in this course not every student will have a strong background in these areas.)

Note: as this set of notes is primarily concerned with foundational issues, it will contain a large number of pedantic (and nearly trivial) formalities and philosophical points. We dwell on these technicalities in this set of notes primarily so that they are out of the way in later notes, when we work with the actual mathematics of probability, rather than on the supporting foundations of that mathematics. In particular, the excessively formal and philosophical language in this set of notes will not be replicated in later notes.

** — 1. Some philosophical generalities — **

By default, mathematical reasoning is understood to take place in a *deterministic* mathematical universe. In such a universe, any given mathematical statement (that is to say, a sentence with no free variables) is either true or false, with no intermediate truth value available. Similarly, any deterministic variable can take on only one specific value at a time.

However, for a variety of reasons, both within pure mathematics and in the applications of mathematics to other disciplines, it is often desirable to have a rigorous mathematical framework in which one can discuss *non-deterministic* statements and variables – that is to say, statements which are not always true or always false, but in some intermediate state, or variables that do not take one particular value or another with definite certainty, but are again in some intermediate state. In probability theory, which is by far the most widely adopted mathematical framework to formally capture the concept of non-determinism, non-deterministic statements are referred to as *events*, and non-deterministic variables are referred to as *random variables*. In the standard foundations of probability theory, as laid out by Kolmogorov, we can then *model* these events and random variables by introducing a sample space (which will be given the structure of a probability space) to capture all the ambient sources of randomness; events are then modeled as measurable subsets of this sample space, and random variables are modeled as measurable functions on this sample space. (We will briefly discuss a more abstract way to set up probability theory, as well as other frameworks to capture non-determinism than classical probability theory, at the end of this set of notes; however, the rest of the course will be concerned exclusively with classical probability theory using the orthodox Kolmogorov models.)

Note carefully that sample spaces (and their attendant structures) will be used to *model* probabilistic concepts, rather than to actually *be* those concepts. This distinction (a mathematical analogue of the map-territory distinction in philosophy) is actually implicit in much of modern mathematics, when we make a distinction between an abstract version of a mathematical object, and a concrete representation (or *model*) of that object. For instance:

- In linear algebra, we distinguish between an abstract vector space $V$, and a concrete system of coordinates given by some basis of $V$.
- In group theory, we distinguish between an abstract group $G$, and a concrete representation of that group as isomorphisms of some space $X$.
- In differential geometry, we distinguish between an abstract manifold $M$, and a concrete atlas of coordinate systems that coordinatises that manifold.
- Though it is rarely mentioned explicitly, the abstract number systems such as ${\bf R}$ are distinguished from the concrete numeral systems (e.g. the decimal or binary systems) that are used to represent them (this distinction is particularly useful to keep in mind when faced with the infamous identity $0.999\ldots = 1$, or when switching from one numeral representation system to another).

The distinction between abstract objects and concrete models can be fairly safely discarded if one is only going to use a single model for each abstract object, particularly if that model is “canonical” in some sense. However, one needs to keep the distinction in mind if one plans to switch between different models of a single object (e.g. to perform change of basis in linear algebra, change of coordinates in differential geometry, or base change in algebraic geometry). As it turns out, in probability theory it is often desirable to change the sample space model (for instance, one could *extend* the sample space by adding in new sources of randomness, or one could couple together two systems of random variables by *joining* their sample space models together). Because of this, we will take some care in this foundational set of notes to distinguish probabilistic concepts (such as events and random variables) from their sample space models. (But we may be more willing to conflate the two in later notes, once the foundational issues are out of the way.)

From a foundational point of view, it is often logical to begin with some axiomatic description of the abstract version of a mathematical object, and discuss the concrete representations of that object later; for instance, one could start with the axioms of an abstract group, and then later consider concrete representations of such a group by permutations, invertible linear transformations, and so forth. This approach is often employed in the more algebraic areas of mathematics. However, there are at least two other ways to present these concepts which can be preferable from a pedagogical point of view. One way is to start with the concrete representations as motivating examples, and only later give the abstract object that these representations are modeling; this is how linear algebra, for instance, is often taught at the undergraduate level, by starting first with ${\bf R}^2$, ${\bf R}^3$, and ${\bf R}^n$, and only later introducing the abstract vector spaces. Another way is to avoid the abstract objects altogether, and focus exclusively on concrete representations, but taking care to emphasise how these representations transform when one switches from one representation to another. For instance, in general relativity courses in undergraduate physics, it is not uncommon to see tensors presented purely through the concrete representation of coordinates indexed by multiple indices, with the transformation of such tensors under changes of variable carefully described; the abstract constructions of tensors and tensor spaces using operations such as tensor product and duality of vector spaces or vector bundles are often left to an advanced differential geometry class to set up properly.

The foundations of probability theory are usually presented (almost by default) using the last of the above three approaches; namely, one talks almost exclusively about sample space models for probabilistic concepts such as events and random variables, and only occasionally dwells on the need to extend or otherwise modify the sample space when one needs to introduce new sources of randomness (or to forget about some existing sources of randomness). However, much as in differential geometry one tends to work with manifolds without specifying any given atlas of coordinate charts, in probability one usually manipulates events and random variables without explicitly specifying any given sample space. For a student raised exclusively on concrete sample space foundations of probability, this can be a bit confusing; for instance, it can give the misconception that any given random variable is somehow associated to its own unique sample space, with different random variables possibly living on different sample spaces, which often leads to nonsense when one then tries to combine those random variables together. Because of such confusions, we will try to take particular care in these notes to separate probabilistic concepts from their sample space models.

** — 2. A simple class of models: discrete probability spaces — **

The simplest models of probability theory are those generated by *discrete probability spaces*, which are adequate models for many applications (particularly in combinatorics and other areas of discrete mathematics), and which already capture much of the essence of probability theory while avoiding some of the finer measure-theoretic subtleties. We thus begin by considering discrete sample space models.

Definition 1 (Discrete probability theory) A *discrete probability space* $(\Omega, (p_\omega)_{\omega \in \Omega})$ is an at most countable set $\Omega$ (whose elements $\omega \in \Omega$ will be referred to as *outcomes*), together with a non-negative real number $p_\omega$ assigned to each outcome $\omega$ such that $\sum_{\omega \in \Omega} p_\omega = 1$; we refer to $p_\omega$ as the *probability* of the outcome $\omega$. The set $\Omega$ itself, without the structure $(p_\omega)_{\omega \in \Omega}$, is often referred to as the *sample space*, though we will often abuse notation by using the sample space $\Omega$ to refer to the entire discrete probability space $(\Omega, (p_\omega)_{\omega \in \Omega})$.

In discrete probability theory, we choose an ambient discrete probability space $(\Omega, (p_\omega)_{\omega \in \Omega})$ as the randomness model. We then model *events* $E$ by subsets $E_\Omega$ of the sample space $\Omega$. The *probability* ${\bf P}(E)$ of an event $E$ is defined to be the quantity
$${\bf P}(E) := \sum_{\omega \in E_\Omega} p_\omega;$$
note that this is a real number in the interval $[0,1]$. An event $E$ is *surely true* or is the *sure event* if $E_\Omega = \Omega$, and is *surely false* or is the *empty event* if $E_\Omega = \emptyset$.

We model *random variables* $X$ taking values in a range $R$ by functions $X_\Omega: \Omega \rightarrow R$ from the sample space $\Omega$ to the range $R$. Random variables taking values in ${\bf R}$ will be called *real random variables* or *random real numbers*; similarly for random variables taking values in ${\bf C}$. We refer to real and complex random variables collectively as *scalar random variables*.

We consider two events $E, F$ to be equal if they are modeled by the same set: $E = F \iff E_\Omega = F_\Omega$. Similarly, two random variables $X, Y$ taking values in a common range $R$ are considered to be equal if they are modeled by the same function: $X = Y \iff X_\Omega = Y_\Omega$. In particular, if the discrete sample space $\Omega$ is understood from context, we will usually abuse notation by identifying an event $E$ with its model $E_\Omega$, and similarly identify a random variable $X$ with its model $X_\Omega$.

Remark 2 One can view classical (deterministic) mathematics as the special case of discrete probability theory in which $\Omega$ is a singleton set $\{\omega\}$ (there is only one outcome $\omega$), and the probability assigned to the single outcome $\omega$ in $\Omega$ is $1$: $p_\omega = 1$. Then there are only two events (the surely true and surely false events), and a random variable taking values in $R$ can be identified with a deterministic element of $R$. Thus we can view probability theory as a generalisation of deterministic mathematics.
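To make Definition 1 concrete in code, here is a minimal sketch (in Python; this is not part of the original notes, and helper names such as `make_space` and `prob` are ad hoc) of a discrete probability space as a dictionary of outcome probabilities, with an event modeled as a subset of the sample space and its probability computed by summation:

```python
from fractions import Fraction

# A discrete probability space: a dictionary mapping each outcome omega
# to its probability p_omega, with the probabilities summing to 1.
def make_space(probs):
    assert sum(probs.values()) == 1, "probabilities must sum to 1"
    return probs

# The probability of an event (modeled as a subset of the sample space)
# is the sum of the probabilities of its outcomes.
def prob(space, event):
    return sum(space[w] for w in event)

# A fair six-sided die.
die = make_space({i: Fraction(1, 6) for i in range(1, 7)})

even = {2, 4, 6}              # the event "the roll is even"
print(prob(die, even))        # 1/2
print(prob(die, set(die)))    # 1 (the sure event)
```

Using exact `Fraction` arithmetic sidesteps floating-point issues when checking that the probabilities sum to $1$.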

As discussed in the preceding section, the distinction between events and random variables and their models becomes important if one ever wishes to modify the sample space, and in particular to *extend* the sample space to a larger space that can accommodate new sources of randomness (an operation which we will define formally later, but which for now can be thought of as an analogue of change of basis in linear algebra, change of coordinates in differential geometry, or base change in algebraic geometry). This is best illustrated with a simple example.

Example 3 (Extending the sample space) Suppose one wishes to model the outcome $X$ of rolling a single, unbiased six-sided die using discrete probability theory. One can do this by choosing the discrete probability space $\Omega$ to be the six-element set $\{1,2,3,4,5,6\}$, with each outcome $i \in \Omega$ given an equal probability of $p_i := 1/6$ of occurring; this outcome $i$ may be interpreted as the state in which the die roll ended up being equal to $i$. The outcome $X$ of rolling a die may then be identified with the identity function $X_\Omega: \Omega \rightarrow \{1,\dots,6\}$, defined by $X_\Omega(i) := i$ for $i \in \Omega$. If we let $E$ be the event that the outcome $X$ of rolling the die is an even number, then with this model we have $E_\Omega = \{2,4,6\}$, and
$${\bf P}(E) = \sum_{i \in E_\Omega} \frac{1}{6} = \frac{1}{2}.$$

Now suppose that we wish to roll the die again to obtain a second random variable $Y$. The sample space $\Omega = \{1,\dots,6\}$ is inadequate for modeling both the original die roll $X$ and the second die roll $Y$. To accommodate this new source of randomness, we can then move to the larger discrete probability space $\Omega' := \{1,\dots,6\} \times \{1,\dots,6\}$, with each outcome $(i,j) \in \Omega'$ now having probability $p_{(i,j)} := 1/36$; this outcome $(i,j)$ can be interpreted as the state in which the die roll $X$ ended up being $i$, and the die roll $Y$ ended up being $j$. The random variable $X$ is now modeled by a new function $X_{\Omega'}: \Omega' \rightarrow \{1,\dots,6\}$ defined by $X_{\Omega'}(i,j) := i$ for $(i,j) \in \Omega'$; the random variable $Y$ is similarly modeled by the function $Y_{\Omega'}(i,j) := j$ for $(i,j) \in \Omega'$. The event $E$ that $X$ is even is now modeled by the set
$$E_{\Omega'} = \{2,4,6\} \times \{1,\dots,6\}.$$

This set is distinct from the previous model $E_\Omega$ of $E$ (for instance, $E_{\Omega'}$ has eighteen elements, whereas $E_\Omega$ has just three), but the probability of $E$ is unchanged:
$${\bf P}(E) = \sum_{(i,j) \in E_{\Omega'}} \frac{1}{36} = \frac{18}{36} = \frac{1}{2}.$$

One can of course also combine together the random variables $X, Y$ in various ways. For instance, the sum $X + Y$ of the two die rolls is a random variable taking values in $\{2,\dots,12\}$; it cannot be modeled by the sample space $\Omega$, but in $\Omega'$ it is modeled by the function
$$(X+Y)_{\Omega'}: (i,j) \mapsto i + j.$$

Similarly, the event $X = Y$ that the two die rolls are equal cannot be modeled by $\Omega$, but is modeled in $\Omega'$ by the set
$$(X=Y)_{\Omega'} = \{(i,i): i \in \{1,\dots,6\}\},$$

and the probability of this event is
$${\bf P}(X=Y) = \sum_{i=1}^6 \frac{1}{36} = \frac{1}{6}.$$

We thus see that extending the probability space has also enlarged the space of events one can consider, as well as the random variables one can define, but that existing events and random variables continue to be interpretable in the extended model, and that probabilistic concepts such as the probability of an event remain unchanged by the extension of the model.
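The bookkeeping in Example 3 can be carried out mechanically. The following sketch (Python, purely illustrative and not part of the original notes) builds both models and confirms that extending the sample space changes the model of an event but not its probability:

```python
from fractions import Fraction
from itertools import product

def prob(space, event):
    return sum(space[w] for w in event)

# Original model: one die roll, outcomes 1..6 with probability 1/6 each.
omega = {i: Fraction(1, 6) for i in range(1, 7)}
E = {i for i in omega if i % 2 == 0}          # model of "X is even" in omega

# Extended model: two die rolls, outcomes (i, j) with probability 1/36 each.
omega2 = {(i, j): Fraction(1, 36) for (i, j) in product(range(1, 7), repeat=2)}
X = lambda w: w[0]                            # model of X in omega2
Y = lambda w: w[1]                            # model of Y in omega2

E2 = {w for w in omega2 if X(w) % 2 == 0}     # new model of the same event
print(len(E), len(E2))                        # 3 18
print(prob(omega, E) == prob(omega2, E2))     # True: probability unchanged

eq = {w for w in omega2 if X(w) == Y(w)}      # event "X = Y", needs omega2
print(prob(omega2, eq))                       # 1/6
```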

The set-theoretic operations on the sample space induce similar boolean operations on events:

- The *conjunction* $E \wedge F$ of two events $E, F$ is defined through the intersection of their models: $(E \wedge F)_\Omega := E_\Omega \cap F_\Omega$.
- The *disjunction* $E \vee F$ of two events $E, F$ is defined through the union of their models: $(E \vee F)_\Omega := E_\Omega \cup F_\Omega$.
- The *symmetric difference* $E \Delta F$ of two events $E, F$ is defined through the symmetric difference of their models: $(E \Delta F)_\Omega := E_\Omega \Delta F_\Omega$.
- The *complement* $\overline{E}$ of an event $E$ is defined through the complement of its model: $(\overline{E})_\Omega := \Omega \backslash E_\Omega$.
- We say that one event $E$ is *contained in* or *implies* another event $F$, and write $E \subset F$, if we have containment of their models: $E_\Omega \subset F_\Omega$. We also write "$F$ is true on $E$" synonymously with $E \subset F$.
- Two events $E, F$ are *disjoint* if their conjunction is the empty event, or equivalently if their models $E_\Omega, F_\Omega$ are disjoint.

Thus, for instance, the conjunction of the event that a die roll $X$ is even, and the event that it is less than $3$, is the event that the die roll is exactly $2$. As before, we will usually be in a situation in which the sample space $\Omega$ is clear from context, and in that case one can safely identify events with their models, and view the symbols $\wedge$ and $\vee$ as being synonymous with their set-theoretic counterparts $\cap$ and $\cup$ (this is for instance what is done in Durrett).
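Once events are identified with their models, the boolean operations above are literally set operations. A small illustrative snippet (Python, not part of the notes):

```python
# Events modeled as subsets of a common sample space; boolean operations on
# events become set operations on their models.
omega = set(range(1, 7))            # one fair die roll
even = {2, 4, 6}                    # "the roll is even"
small = {1, 2}                      # "the roll is less than 3"

print(even & small)                 # {2}   (conjunction: intersection)
print(sorted(even | small))         # [1, 2, 4, 6]   (disjunction: union)
print(omega - even == {1, 3, 5})    # True  (complement)
print((even & small) <= even)       # True  (the conjunction implies each event)
```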

With these operations, the space of all events (known as the *event space*) thus has the structure of a boolean algebra (defined below in Definition 4). We observe that the probability ${\bf P}$ is *finitely additive* in the sense that
$${\bf P}(E \vee F) = {\bf P}(E) + {\bf P}(F)$$
whenever $E, F$ are disjoint events; by induction this implies that
$${\bf P}(E_1 \vee \dots \vee E_n) = {\bf P}(E_1) + \dots + {\bf P}(E_n)$$
whenever $E_1, \dots, E_n$ are pairwise disjoint events. We have ${\bf P}(\emptyset) = 0$ and ${\bf P}(\Omega) = 1$, and more generally
$${\bf P}(E) + {\bf P}(\overline{E}) = 1$$
for any event $E$. We also have monotonicity: if $E \subset F$, then ${\bf P}(E) \leq {\bf P}(F)$.

Now we define operations on random variables. Whenever one has a function $f: R \rightarrow S$ from one range $R$ to another $S$, and a random variable $X$ taking values in $R$, one can define a random variable $f(X)$ taking values in $S$ by composing the relevant models:
$$f(X)_\Omega := f \circ X_\Omega,$$
thus $f(X)_\Omega$ maps $\omega$ to $f(X_\Omega(\omega))$ for any outcome $\omega \in \Omega$. Given a finite number of random variables $X_1, \dots, X_n$ taking values in ranges $R_1, \dots, R_n$, we can form the joint random variable $(X_1,\dots,X_n)$ taking values in the Cartesian product $R_1 \times \dots \times R_n$ by concatenation of the models, thus
$$(X_1,\dots,X_n)_\Omega: \omega \mapsto (X_{1,\Omega}(\omega), \dots, X_{n,\Omega}(\omega)).$$

Combining these two operations, given any function $f: R_1 \times \dots \times R_n \rightarrow S$ of $n$ variables in ranges $R_1, \dots, R_n$, and random variables $X_1, \dots, X_n$ taking values in $R_1, \dots, R_n$ respectively, we can form a random variable $f(X_1,\dots,X_n)$ taking values in $S$ by the formula
$$f(X_1,\dots,X_n) := f((X_1,\dots,X_n)).$$

Thus for instance we can add, subtract, or multiply two scalar random variables to obtain another scalar random variable.

A deterministic element $x$ of a range $R$ will (by abuse of notation) be identified with the random variable taking values in $R$ whose model in $\Omega$ is constant: $x_\Omega(\omega) = x$ for all $\omega \in \Omega$. Thus for instance $3$ is a scalar random variable.

Given a relation $P: R_1 \times \dots \times R_n \rightarrow \{ \hbox{true}, \hbox{false} \}$ on ranges $R_1, \dots, R_n$, and random variables $X_1, \dots, X_n$ taking values in $R_1, \dots, R_n$, we can define the event $P(X_1,\dots,X_n)$ by setting
$$P(X_1,\dots,X_n)_\Omega := \{ \omega \in \Omega: P(X_{1,\Omega}(\omega), \dots, X_{n,\Omega}(\omega)) \hbox{ true} \}.$$

Thus for instance, for two real random variables $X, Y$, the event $X = Y$ is modeled as
$$(X=Y)_\Omega := \{ \omega \in \Omega: X_\Omega(\omega) = Y_\Omega(\omega) \}$$

and the event $X \leq Y$ is modeled as
$$(X \leq Y)_\Omega := \{ \omega \in \Omega: X_\Omega(\omega) \leq Y_\Omega(\omega) \}.$$

At this point we encounter a slight notational conflict between the dual role of the equality symbol as a logical symbol and as a binary relation: we are interpreting $X = Y$ both as an external equality relation between the two random variables (which is true iff the functions $X_\Omega$, $Y_\Omega$ are identical), and as an internal event (modeled by $(X=Y)_\Omega$). However, it is clear that $X = Y$ is true in the external sense if and only if the internal event $X = Y$ is surely true. As such, we shall abuse notation and continue to use the equality symbol for both the internal and external concepts of equality (and use the modifier "surely" for emphasis when referring to the external usage).

It is clear that any equational identity concerning functions or operations on deterministic variables implies the same identity (in the external, or surely true, sense) for random variables. For instance, the commutativity of addition for deterministic real numbers immediately implies the commutativity of addition for real random variables: $X + Y = Y + X$ is surely true for all real random variables $X, Y$; similarly, $(X + Y) Z = XZ + YZ$ is surely true for all scalar random variables $X, Y, Z$, etc. We will freely apply the usual laws of algebra for scalar random variables without further comment.

Given an event $E$, we can associate the *indicator random variable* $1_E$ (also written as ${\bf 1}_E$ in some texts), defined to be the unique real random variable such that $1_E = 1$ when $E$ is true and $1_E = 0$ when $E$ is false; thus $(1_E)_\Omega(\omega)$ is equal to $1$ when $\omega \in E_\Omega$ and $0$ otherwise. (The indicator random variable is sometimes called the *characteristic function* in analysis, and sometimes denoted $\chi_E$ instead of $1_E$, but we avoid using the term "characteristic function" here, as it will have an unrelated but important meaning in probability theory.) We record the trivial but useful fact that Boolean operations on events correspond to arithmetic manipulations on their indicators. For instance, if $E, F$ are events, we have
$$1_{E \wedge F} = 1_E 1_F, \qquad 1_{\overline{E}} = 1 - 1_E,$$
and the inclusion-exclusion principle
$$1_{E \vee F} = 1_E + 1_F - 1_{E \wedge F}. \ \ \ \ \ (1)$$

In particular, if the events $E, F$ are disjoint, then
$$1_{E \vee F} = 1_E + 1_F.$$

Also note that $E \subset F$ if and only if the assertion $1_E \leq 1_F$ is surely true. We will use these identities and equivalences throughout the course without further comment.
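These indicator identities are finitary statements that can be checked by brute force on a small sample space. The following sketch (Python, not part of the original notes) verifies them for every pair of events over a six-point space:

```python
from itertools import combinations

omega = list(range(1, 7))

# All 2^6 = 64 events (subsets) of the sample space.
def powerset(s):
    return [set(c) for r in range(len(s) + 1) for c in combinations(s, r)]

# Indicator of an event, evaluated at an outcome w.
def ind(event, w):
    return 1 if w in event else 0

# Check the three indicator identities for every pair of events and outcome:
#   1_{E and F} = 1_E * 1_F,   1_{not E} = 1 - 1_E,
#   1_{E or F}  = 1_E + 1_F - 1_{E and F}   (inclusion-exclusion).
ok = all(
    ind(E & F, w) == ind(E, w) * ind(F, w)
    and ind(set(omega) - E, w) == 1 - ind(E, w)
    and ind(E | F, w) == ind(E, w) + ind(F, w) - ind(E & F, w)
    for E in powerset(omega) for F in powerset(omega) for w in omega
)
print(ok)   # True
```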

Given a scalar random variable $X$, we can attempt to define the *expectation* ${\bf E} X$ through the model $X_\Omega$ by the formula
$${\bf E} X := \sum_{\omega \in \Omega} X_\Omega(\omega) p_\omega.$$

If the discrete sample space $\Omega$ is finite, then this sum is always well-defined, and so every scalar random variable has an expectation. If however the discrete sample space $\Omega$ is infinite, the expectation may not be well defined. There are however two key cases in which one has a meaningful expectation. The first is if the random variable $X$ is *unsigned*, that is to say it takes values in the non-negative reals $[0,+\infty)$, or more generally in the extended non-negative real line $[0,+\infty]$. In that case, one can interpret the expectation ${\bf E} X$ as an element of $[0,+\infty]$. The other case is when the random variable $X$ is *absolutely integrable*, which means that the absolute value $|X|$ (which is an unsigned random variable) has finite expectation: ${\bf E} |X| < \infty$. In that case, the series defining ${\bf E} X$ is absolutely convergent to a real or complex number (depending on whether $X$ was a real or complex random variable).

We have the basic link
$${\bf P}(E) = {\bf E} 1_E$$
between probability and expectation, valid for any event $E$. We also have the obvious, but fundamentally important, property of *linearity of expectation*: we have
$${\bf E}(cX) = c \, {\bf E} X$$
and
$${\bf E}(X + Y) = {\bf E} X + {\bf E} Y$$
whenever $c$ is a scalar and $X, Y$ are scalar random variables, either under the assumption that $c, X, Y$ are all unsigned, or that $X, Y$ are absolutely integrable. Thus for instance by applying expectations to (1) we obtain the identity
$${\bf P}(E \vee F) = {\bf P}(E) + {\bf P}(F) - {\bf P}(E \wedge F).$$
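As a quick illustration of the expectation formula, its link to probability via indicators, and linearity (a Python sketch, not part of the notes; the helper `expect` is ad hoc):

```python
from fractions import Fraction

# Expectation of a scalar random variable X over a discrete probability space:
# E[X] = sum over outcomes omega of X(omega) * p_omega.
space = {i: Fraction(1, 6) for i in range(1, 7)}   # fair die

def expect(space, X):
    return sum(X(w) * p for w, p in space.items())

X = lambda w: w                       # the die roll itself
E = {2, 4, 6}                         # the event "the roll is even"
ind_E = lambda w: 1 if w in E else 0  # indicator random variable 1_E

print(expect(space, X))               # 7/2
print(expect(space, ind_E))           # 1/2, which is P(E)

# Linearity: E[2X + 1_E] = 2 E[X] + E[1_E]
lhs = expect(space, lambda w: 2 * X(w) + ind_E(w))
print(lhs == 2 * expect(space, X) + expect(space, ind_E))   # True
```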

We close this section by noting that discrete probabilistic models stumble when trying to model *continuous* random variables, which take on an uncountable number of values. Suppose for instance one wants to model a random real number $X$ drawn uniformly at random from the unit interval $[0,1]$, which is an uncountable set. One would then expect, for any subinterval $[a,b]$ of $[0,1]$, that $X$ will fall into this interval with probability $b - a$. Setting $a = b$ (or, if one wishes instead, taking a limit such as $b \rightarrow a^+$), we conclude in particular that for any real number $x$ in $[0,1]$, $X$ will equal $x$ with probability $0$. If one attempted to model this situation by a discrete probability model, we would find that each outcome $\omega$ of the discrete sample space $\Omega$ has to occur with probability $p_\omega = 0$ (since for each $\omega$, the random variable $X$ has only the single value $X_\Omega(\omega)$). But we are also requiring that the sum $\sum_{\omega \in \Omega} p_\omega$ is equal to $1$, a contradiction. In order to address this defect we must generalise from discrete models to more general probabilistic models, to which we now turn.
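No discrete model can repair this, but the phenomenon itself is easy to observe numerically. A quick Monte Carlo sketch (Python, illustrative only, not part of the notes) shows a uniform variable landing in $[a,b]$ with frequency close to $b-a$, while any single prescribed value is essentially never hit:

```python
import random

random.seed(0)
N = 100_000
samples = [random.random() for _ in range(N)]   # X uniform on [0, 1)

a, b = 0.25, 0.75
freq = sum(a <= x <= b for x in samples) / N
print(abs(freq - (b - a)) < 0.01)       # True with overwhelming probability

# Any single prescribed value is hit with probability 0.
print(sum(x == 0.5 for x in samples))   # 0, almost surely
```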

** — 3. The Kolmogorov foundations of probability theory — **

We now present the more general measure-theoretic foundation of Kolmogorov which subsumes the discrete theory, while also allowing one to model continuous random variables. It turns out that in order to perform sums, limits and integrals properly, the finite additivity property of probability needs to be amplified to *countable* additivity (but, as we shall see, *uncountable* additivity is too strong of a property to ask for).

We begin with the notion of a measurable space. (See also this previous blog post, which covers similar material from the perspective of a real analysis graduate class rather than a probability class.)

Definition 4 (Measurable space) Let $\Omega$ be a set. A *Boolean algebra* in $\Omega$ is a collection ${\mathcal B}$ of subsets of $\Omega$ which

- contains $\emptyset$ and $\Omega$;
- is closed under pairwise unions and intersections (thus if $E, F \in {\mathcal B}$, then $E \cup F$ and $E \cap F$ also lie in ${\mathcal B}$); and
- is closed under complements (thus if $E \in {\mathcal B}$, then $\Omega \backslash E$ also lies in ${\mathcal B}$).

(Note that some of these assumptions are redundant and can be dropped, thanks to de Morgan's laws.) A *$\sigma$-algebra* in $\Omega$ (also known as a *$\sigma$-field*) is a Boolean algebra ${\mathcal B}$ in $\Omega$ which is also

- closed under countable unions and countable intersections (thus if $E_1, E_2, E_3, \ldots \in {\mathcal B}$, then $\bigcup_{n=1}^\infty E_n \in {\mathcal B}$ and $\bigcap_{n=1}^\infty E_n \in {\mathcal B}$).

Again, thanks to de Morgan's laws, one only needs to verify closure under just countable union (or just countable intersection) in order to verify that a Boolean algebra is a $\sigma$-algebra. A *measurable space* is a pair $(\Omega, {\mathcal B})$, where $\Omega$ is a set and ${\mathcal B}$ is a $\sigma$-algebra in $\Omega$. Elements of ${\mathcal B}$ are referred to as *measurable sets* in this measurable space.

If ${\mathcal B}, {\mathcal B}'$ are two $\sigma$-algebras in $\Omega$, we say that ${\mathcal B}$ is *coarser than* ${\mathcal B}'$ (or ${\mathcal B}'$ is *finer than* ${\mathcal B}$) if ${\mathcal B} \subset {\mathcal B}'$, thus every set that is measurable in $(\Omega, {\mathcal B})$ is also measurable in $(\Omega, {\mathcal B}')$.

Example 5 (Trivial measurable space) Given any set $\Omega$, the collection $\{\emptyset, \Omega\}$ is a $\sigma$-algebra; in fact it is the coarsest $\sigma$-algebra one can place on $\Omega$. We refer to $(\Omega, \{\emptyset, \Omega\})$ as the *trivial* measurable space on $\Omega$.

Example 6 (Discrete measurable space) At the other extreme, given any set $\Omega$, the power set $2^\Omega := \{E: E \subset \Omega\}$ is a $\sigma$-algebra (and is the finest $\sigma$-algebra one can place on $\Omega$). We refer to $(\Omega, 2^\Omega)$ as the *discrete* measurable space on $\Omega$.

Example 7 (Atomic measurable spaces) Suppose we have a partition $\Omega = \bigcup_{\alpha \in A} E_\alpha$ of a set $\Omega$ into disjoint subsets $E_\alpha$ (which we will call *atoms*), indexed by some label set $A$ (which may be finite, countable, or uncountable). Such a partition defines a $\sigma$-algebra on $\Omega$, consisting of all sets of the form $\bigcup_{\alpha \in B} E_\alpha$ for subsets $B$ of $A$ (we allow $B$ to be empty); thus a set is measurable here if and only if it can be described as a union of atoms. One can easily verify that this is indeed a $\sigma$-algebra. The trivial and discrete measurable spaces in the preceding two examples are special cases of this atomic construction, corresponding to the trivial partition $\Omega = \Omega$ (in which there is just one atom $\Omega$) and the discrete partition $\Omega = \bigcup_{x \in \Omega} \{x\}$ (in which the atoms are individual points in $\Omega$).

Exercise 8 Let $\Omega$ be an uncountable set, and let ${\mathcal B}$ be the collection of sets $E$ in $\Omega$ which are either at most countable, or are cocountable (their complement $\Omega \backslash E$ is at most countable). Show that this is a $\sigma$-algebra on $\Omega$ which is non-atomic (i.e. it is not of the form of the preceding example).

Example 9 (Generated measurable spaces) It is easy to see that if one has a non-empty family $({\mathcal B}_\alpha)_{\alpha \in A}$ of $\sigma$-algebras on a set $\Omega$, then their intersection $\bigcap_{\alpha \in A} {\mathcal B}_\alpha$ is also a $\sigma$-algebra, even if $A$ is uncountably infinite. Because of this, whenever one has an arbitrary collection ${\mathcal F}$ of subsets in $\Omega$, one can define the $\sigma$-algebra $\langle {\mathcal F} \rangle$ *generated* by ${\mathcal F}$ to be the intersection of all the $\sigma$-algebras that contain ${\mathcal F}$ (note that there is always at least one $\sigma$-algebra participating in this intersection, namely the discrete $\sigma$-algebra). Equivalently, $\langle {\mathcal F} \rangle$ is the coarsest $\sigma$-algebra that views every set in ${\mathcal F}$ as being measurable. (This is a rather indirect way to describe $\langle {\mathcal F} \rangle$, as it does not make it easy to figure out exactly what sets lie in $\langle {\mathcal F} \rangle$. There is a more direct description of this $\sigma$-algebra, but it requires the use of the first uncountable ordinal; see Exercise 15 of these notes.) In Durrett, the notation $\sigma({\mathcal F})$ is used in place of $\langle {\mathcal F} \rangle$.
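For a *finite* set, the generated $\sigma$-algebra can also be computed directly, by iterating closure under complements and unions until nothing new appears (for finite $\Omega$, closure under pairwise unions already gives closure under countable unions). A small Python sketch, not part of the notes:

```python
# For a finite set, the sigma-algebra generated by a family of subsets can be
# computed by closing under complements and unions until nothing new appears.
def generated_sigma_algebra(omega, family):
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(s) for s in family}
    while True:
        new = {omega - s for s in sigma}                 # complements
        new |= {s | t for s in sigma for t in sigma}     # pairwise unions
        if new <= sigma:                                 # closed: done
            return sigma
        sigma |= new

# On {1,2,3,4}, the single set {1} generates the atomic sigma-algebra
# with atoms {1} and {2,3,4}.
sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}])
print(sorted(sorted(s) for s in sigma))
# [[], [1], [1, 2, 3, 4], [2, 3, 4]]
```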

Example 10 (Borel $\sigma$-algebra) Let $\Omega$ be a topological space; to avoid pathologies let us assume that $\Omega$ is locally compact Hausdorff and $\sigma$-compact, though the definition below can also be made for more general spaces. For instance, one could take $\Omega = {\bf R}^n$ or $\Omega = {\bf C}^n$ for some finite $n$. We define the *Borel $\sigma$-algebra* ${\mathcal B}[\Omega]$ on $\Omega$ to be the $\sigma$-algebra generated by the open sets of $\Omega$. (Due to our topological hypotheses on $\Omega$, the Borel $\sigma$-algebra is also generated by the compact sets of $\Omega$.) Measurable subsets in the Borel $\sigma$-algebra are known as *Borel sets*. Thus for instance open and closed sets are Borel, and countable unions and countable intersections of Borel sets are Borel. In fact, as a rule of thumb, any subset of ${\bf R}^n$ or ${\bf C}^n$ that arises from a "non-pathological" construction (not using the axiom of choice, or from a deliberate attempt to build a non-Borel set) can be expected to be a Borel set. Nevertheless, non-Borel sets exist in abundance if one looks hard enough for them, even without the axiom of choice; see for instance Exercise 16 of this previous blog post.

The following exercise gives a useful tool (somewhat analogous to mathematical induction) to verify properties regarding measurable sets in generated -algebras, such as Borel -algebras.

Exercise 11 Let ${\mathcal F}$ be a collection of subsets of a set $\Omega$, and let $P(E)$ be a property of subsets $E$ of $\Omega$ (thus $P(E)$ is true or false for each $E \subset \Omega$). Assume the following axioms:

- $P(\emptyset)$ is true.
- $P(E)$ is true for all $E \in {\mathcal F}$.
- If $E \subset \Omega$ is such that $P(E)$ is true, then $P(\Omega \backslash E)$ is also true.
- If $E_1, E_2, \ldots \subset \Omega$ are such that $P(E_n)$ is true for all $n$, then $P(\bigcup_{n=1}^\infty E_n)$ is true.

Show that $P(E)$ is true for all $E \in \langle {\mathcal F} \rangle$. (*Hint:* what can one say about the collection $\{E \subset \Omega: P(E) \hbox{ true}\}$?)

Thus, for instance, if a property of subsets of ${\bf R}^n$ is true for all open sets, and the class of sets for which it is true is closed under countable unions and complements, then it is automatically true for all Borel sets.

Example 12 (Pullback) Let $(R, {\mathcal B})$ be a measurable space, and let $\phi: \Omega \rightarrow R$ be any function from another set $\Omega$ to $R$. Then we can define the *pullback* $\phi^*({\mathcal B})$ of the $\sigma$-algebra ${\mathcal B}$ to be the collection of all subsets in $\Omega$ that are of the form $\phi^{-1}(S)$ for some $S \in {\mathcal B}$. This is easily verified to be a $\sigma$-algebra. We refer to the measurable space $(\Omega, \phi^*({\mathcal B}))$ as the pullback of the measurable space $(R, {\mathcal B})$ by $\phi$. Thus for instance an atomic measurable space on $\Omega$ generated by a partition $\Omega = \bigcup_{\alpha \in A} E_\alpha$ is the pullback of $A$ (viewed as a discrete measurable space) by the "colouring" map from $\Omega$ to $A$ that sends each element of $E_\alpha$ to $\alpha$ for all $\alpha \in A$.

Remark 13 In probabilistic terms, one can interpret the space $\Omega$ in the above construction as a sample space, and the function $\phi: \Omega \rightarrow R$ as some collection of "random variables" or "measurements" on that space, with $R$ being all the possible outcomes of these measurements. The pullback $\phi^*({\mathcal B})$ then represents all the "information" one can extract from that given set of measurements.

Example 14 (Product space) Let $(R_\alpha, {\mathcal B}_\alpha)_{\alpha \in A}$ be a family of measurable spaces indexed by a (possibly infinite or uncountable) set $A$. We define the product $(R, {\mathcal B}) = \prod_{\alpha \in A} (R_\alpha, {\mathcal B}_\alpha)$ on the Cartesian product space $R := \prod_{\alpha \in A} R_\alpha$ by defining ${\mathcal B} = \prod_{\alpha \in A} {\mathcal B}_\alpha$ to be the $\sigma$-algebra generated by the basic cylinder sets of the form
$$\{ (x_\beta)_{\beta \in A} \in R: x_\alpha \in E_\alpha \}$$
for $\alpha \in A$ and $E_\alpha \in {\mathcal B}_\alpha$. For instance, given two measurable spaces $(R_1, {\mathcal B}_1)$ and $(R_2, {\mathcal B}_2)$, the product $\sigma$-algebra ${\mathcal B}_1 \times {\mathcal B}_2$ is generated by the sets $E_1 \times R_2$ and $R_1 \times E_2$ for $E_1 \in {\mathcal B}_1, E_2 \in {\mathcal B}_2$. (One can also show that ${\mathcal B}_1 \times {\mathcal B}_2$ is the $\sigma$-algebra generated by the products $E_1 \times E_2$ for $E_1 \in {\mathcal B}_1, E_2 \in {\mathcal B}_2$, but this observation does not extend to uncountable products of measurable spaces.)

Exercise 15 Show that ${\bf R}^n$ with the Borel $\sigma$-algebra is the product of $n$ copies of ${\bf R}$ with the Borel $\sigma$-algebra.

As with almost any other notion of space in mathematics, there is a natural notion of a map (or *morphism*) between measurable spaces.

Definition 16 A function $f: \Omega \rightarrow \Omega'$ between two measurable spaces $(\Omega, {\mathcal B})$, $(\Omega', {\mathcal B}')$ is said to be *measurable* if one has $f^{-1}(E) \in {\mathcal B}$ for all $E \in {\mathcal B}'$.

Thus for instance the pullback of a measurable space $(R, {\mathcal B})$ by a map $\phi: \Omega \rightarrow R$ could alternatively be defined as the coarsest measurable space structure on $\Omega$ for which $\phi$ is still measurable. It is clear that the composition of measurable functions is also measurable.

Exercise 17 Show that any continuous map $f: X \rightarrow Y$ from one topological space $X$ to another $Y$ is necessarily measurable (when one gives $X$ and $Y$ the Borel $\sigma$-algebras).

Exercise 18 If $X_i: \Omega \rightarrow R_i$ are measurable functions into measurable spaces $R_i$ for $i = 1, \dots, n$, show that the joint function $(X_1,\dots,X_n): \Omega \rightarrow R_1 \times \dots \times R_n$ into the product space, defined by $(X_1,\dots,X_n)(\omega) := (X_1(\omega), \dots, X_n(\omega))$, is also measurable.

As a corollary of the above exercise, we see that if $X_1, \dots, X_n$ are measurable functions taking values in measurable spaces $R_1, \dots, R_n$, and $f: R_1 \times \dots \times R_n \rightarrow S$ is measurable, then $f(X_1,\dots,X_n)$ is also measurable. In particular, if $X$ and $Y$ are scalar measurable functions, then so are $X + Y$, $X - Y$, $XY$, etc.

Next, we turn measurable spaces into measure spaces by adding a measure.

Definition 19 (Measure spaces) Let $(\Omega, {\mathcal B})$ be a measurable space. A *finitely additive measure* on this space is a map $\mu: {\mathcal B} \rightarrow [0,+\infty]$ obeying the following axioms:

- (Empty set) $\mu(\emptyset) = 0$.
- (Finite additivity) If $E, F \in {\mathcal B}$ are disjoint, then $\mu(E \cup F) = \mu(E) + \mu(F)$.

A *countably additive measure* is a finitely additive measure $\mu: {\mathcal B} \rightarrow [0,+\infty]$ obeying the following additional axiom:

- (Countable additivity) If $E_1, E_2, E_3, \ldots \in {\mathcal B}$ are disjoint, then $\mu(\bigcup_{n=1}^\infty E_n) = \sum_{n=1}^\infty \mu(E_n)$.

A *probability measure* on $\Omega$ is a countably additive measure $\mu: {\mathcal B} \rightarrow [0,+\infty]$ obeying the following additional axiom:

- (Unit total probability) $\mu(\Omega) = 1$.

A *measure space* is a triplet $(\Omega, {\mathcal B}, \mu)$ where $(\Omega, {\mathcal B})$ is a measurable space and $\mu: {\mathcal B} \rightarrow [0,+\infty]$ is a countably additive measure on that space. If $\mu$ is furthermore a probability measure, we call $(\Omega, {\mathcal B}, \mu)$ a *probability space*.

A set $E \in {\mathcal B}$ of measure zero ($\mu(E) = 0$) is known as a *null set*. A property $P(\omega)$ that holds for all $\omega$ outside of a null set is said to hold *almost everywhere* or *for almost every $\omega$*.

Example 20 (Discrete probability measures) Let $(\Omega, 2^\Omega)$ be a discrete measurable space, and for each $\omega \in \Omega$, let $p_\omega$ be a non-negative real number such that $\sum_{\omega \in \Omega} p_\omega = 1$. (Note that this implies that there are at most countably many $\omega$ for which $p_\omega > 0$ – why?) Then one can form a probability measure $\mu$ on $\Omega$ by defining
$$\mu(E) := \sum_{\omega \in E} p_\omega$$
for all $E \subset \Omega$.

Example 21 (Lebesgue measure) Let ${\bf R}$ be given the Borel $\sigma$-algebra. Then it turns out there is a unique measure $m$ on ${\bf R}$, known as *Lebesgue measure* (or more precisely, the restriction of Lebesgue measure to the Borel $\sigma$-algebra) such that $m([a,b]) = b - a$ for every closed interval $[a,b]$ with $a \leq b$ (this is also true if one uses open intervals or half-open intervals in place of closed intervals). More generally, there is a unique measure $m$ on ${\bf R}^n$ for any natural number $n$, also known as Lebesgue measure, such that
$$m([a_1,b_1] \times \dots \times [a_n,b_n]) = (b_1 - a_1) \dots (b_n - a_n)$$
for all closed boxes $[a_1,b_1] \times \dots \times [a_n,b_n]$, that is to say products of closed intervals. The construction of Lebesgue measure is a little tricky; see this previous blog post for details.

We can then set up general probability theory similarly to how we set up discrete probability theory:

Definition 22 (Probability theory) In probability theory, we choose an ambient probability space $\Omega = (\Omega, {\mathcal B}, \mu)$ as the randomness model, and refer to the set $\Omega$ (without the additional structures ${\mathcal B}$, $\mu$) as the *sample space* for that model. We then model an *event* $E$ by an element $E_\Omega$ of the $\sigma$-algebra ${\mathcal B}$. The *probability* ${\bf P}(E)$ of an event $E$ is defined to be the quantity
$${\bf P}(E) := \mu(E_\Omega).$$

An event $E$ is *surely true* or is the *sure event* if $E_\Omega = \Omega$, and is *surely false* or is the *empty event* if $E_\Omega = \emptyset$. It is *almost surely true* or an *almost sure event* if ${\bf P}(E) = 1$, and *almost surely false* or a *null event* if ${\bf P}(E) = 0$.

We model *random variables* $X$ taking values in a range $R$ (now required to be a measurable space) by measurable functions $X_\Omega: \Omega \rightarrow R$ from the sample space $\Omega$ to the range $R$. We define real, complex, and scalar random variables as in the discrete case (giving ${\bf R}$ and ${\bf C}$ their Borel $\sigma$-algebras).

As in the discrete case, we consider two events $E, F$ to be equal if they are modeled by the same set: $E = F$ iff $E_\Omega = F_\Omega$. Similarly, two random variables $X, Y$ taking values in a common range $R$ are considered to be equal if they are modeled by the same function: $X = Y$ iff $X_\Omega = Y_\Omega$. Again, if the sample space is understood from context, we will usually abuse notation by identifying an event $E$ with its model $E_\Omega$, and similarly identify a random variable $X$ with its model $X_\Omega$.

As in the discrete case, set-theoretic operations on the sample space induce similar boolean operations on events. Furthermore, since the $\sigma$-algebra ${\mathcal F}$ is closed under countable unions and countable intersections, we may similarly define the countable conjunction $\bigwedge_{n=1}^\infty E_n$ or countable disjunction $\bigvee_{n=1}^\infty E_n$ of a sequence $E_1, E_2, E_3, \dots$ of events; however, we do *not* define uncountable conjunctions or disjunctions as these may not be well-defined as events.

The axioms of a probability space then yield the Kolmogorov axioms for probability:

- ${\mathbf P}(E) \geq 0$ for every event $E$.
- ${\mathbf P}(\Omega) = 1$.
- If $E_1, E_2, E_3, \dots$ are disjoint events, then ${\mathbf P}\left(\bigvee_{n=1}^\infty E_n\right) = \sum_{n=1}^\infty {\mathbf P}(E_n)$.

We can manipulate random variables just as in the discrete case, with the only caveat being that we have to restrict attention to *measurable* operations. For instance, if $X$ is a random variable taking values in a measurable space $R$, and $f: R \rightarrow S$ is a measurable map, then $f(X)$ is well defined as a random variable taking values in $S$. Similarly, if $f: R_1 \times \dots \times R_n \rightarrow S$ is a measurable map and $X_1,\dots,X_n$ are random variables taking values in $R_1,\dots,R_n$ respectively, then $f(X_1,\dots,X_n)$ is a random variable taking values in $S$. Similarly we can create events out of *measurable* relations (giving the boolean range $\{ \hbox{true}, \hbox{false} \}$ the discrete $\sigma$-algebra, of course). Finally, we continue to view deterministic elements $x$ of a space $R$ as a special case of a random element of $R$, and associate the indicator random variable $1_E$ to any event $E$ as before.

We say that two random variables $X, Y$ *agree almost surely* if the event $X = Y$ is almost surely true; this is an equivalence relation. In many cases we are willing to consider random variables up to almost sure equivalence. In particular, we can generalise the notion of a random variable slightly by considering random variables $X$ whose models $X_\Omega$ are only defined almost surely, i.e. their domain is not all of $\Omega$, but instead $\Omega$ with a set of measure zero removed. This is, technically, not a random variable as we have defined it, but it can be associated canonically with an equivalence class of random variables up to almost sure equivalence, and so we view such objects as random variables “up to almost sure equivalence”. Similarly, we declare two events $E$ and $F$ *almost surely equivalent* if their symmetric difference is a null event, and will often consider events up to almost sure equivalence only.

We record some simple consequences of the measure-theoretic axioms:

Exercise 23. Let $(\Omega, {\mathcal F}, \mu)$ be a measure space.

- (Monotonicity) If $E \subset F$ are measurable, then $\mu(E) \leq \mu(F)$.
- (Subadditivity) If $E_1, E_2, E_3, \dots$ are measurable (not necessarily disjoint), then $\mu\left(\bigcup_{n=1}^\infty E_n\right) \leq \sum_{n=1}^\infty \mu(E_n)$.
- (Continuity from below) If $E_1 \subset E_2 \subset E_3 \subset \dots$ are measurable, then $\mu\left(\bigcup_{n=1}^\infty E_n\right) = \lim_{n \rightarrow \infty} \mu(E_n)$.
- (Continuity from above) If $E_1 \supset E_2 \supset E_3 \supset \dots$ are measurable and $\mu(E_1)$ is finite, then $\mu\left(\bigcap_{n=1}^\infty E_n\right) = \lim_{n \rightarrow \infty} \mu(E_n)$. Give a counterexample to show that the claim can fail when $\mu(E_1)$ is infinite.

Of course, these measure-theoretic facts immediately imply their probabilistic counterparts (and the pesky hypothesis that $\mu(E_1)$ is finite is automatic and can thus be dropped):

- (Monotonicity) If $E \subset F$ are events, then ${\mathbf P}(E) \leq {\mathbf P}(F)$. (In particular, ${\mathbf P}(E) \leq 1$ for any event $E$.)
- (Subadditivity) If $E_1, E_2, E_3, \dots$ are events (not necessarily disjoint), then ${\mathbf P}\left(\bigvee_{n=1}^\infty E_n\right) \leq \sum_{n=1}^\infty {\mathbf P}(E_n)$.
- (Continuity from below) If $E_1 \subset E_2 \subset E_3 \subset \dots$ are events, then ${\mathbf P}\left(\bigvee_{n=1}^\infty E_n\right) = \lim_{n \rightarrow \infty} {\mathbf P}(E_n)$.
- (Continuity from above) If $E_1 \supset E_2 \supset E_3 \supset \dots$ are events, then ${\mathbf P}\left(\bigwedge_{n=1}^\infty E_n\right) = \lim_{n \rightarrow \infty} {\mathbf P}(E_n)$.
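These probabilistic consequences can be sanity-checked numerically on a toy model. The sketch below (a fair six-sided die, an illustrative choice of ours rather than an example from the text) verifies monotonicity, subadditivity, and the inclusion-exclusion identity for a pair of overlapping events.

```python
from fractions import Fraction

# Toy probability space: a fair six-sided die (illustrative choice).
p = {omega: Fraction(1, 6) for omega in range(1, 7)}

def prob(E):
    """P(E) for an event E, modeled as a subset of the sample space."""
    return sum(p[omega] for omega in E)

E, F = {1, 2, 3}, {3, 4}

# Monotonicity: E is contained in E ∪ F, so P(E) <= P(E ∪ F) <= 1.
assert prob(E) <= prob(E | F) <= 1
# Subadditivity: P(E ∪ F) <= P(E) + P(F), since the overlap is double-counted.
assert prob(E | F) <= prob(E) + prob(F)
# Inclusion-exclusion for two events:
assert prob(E | F) == prob(E) + prob(F) - prob(E & F)
```

Exact rationals are used so that the identities hold with `==` rather than up to floating-point error.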

Note that if a countable sequence $E_1, E_2, E_3, \dots$ of events each hold almost surely, then their conjunction $\bigwedge_{n=1}^\infty E_n$ does as well (by applying subadditivity to the complementary events $\overline{E_1}, \overline{E_2}, \overline{E_3}, \dots$). As a general rule of thumb, the notion of “almost surely” behaves like “surely” as long as one only performs an at most countable number of operations (which already suffices for a large portion of analysis, such as taking limits or performing infinite sums).

Exercise 24. Let $(\Omega, {\mathcal F})$ be a measurable space.

- If $f: \Omega \rightarrow [-\infty,+\infty]$ is a function taking values in the extended reals $[-\infty,+\infty]$, show that $f$ is measurable (giving $[-\infty,+\infty]$ the Borel $\sigma$-algebra) if and only if the sets $\{ \omega \in \Omega: f(\omega) > t \}$ are measurable for all real $t$.
- If $f, g: \Omega \rightarrow [-\infty,+\infty]$ are functions, show that $f = g$ if and only if $\{ \omega \in \Omega: f(\omega) > t \} = \{ \omega \in \Omega: g(\omega) > t \}$ for all reals $t$.
- If $f_1, f_2, f_3, \dots: \Omega \rightarrow [-\infty,+\infty]$ are measurable, show that $\sup_n f_n$, $\inf_n f_n$, $\limsup_{n \rightarrow \infty} f_n$, and $\liminf_{n \rightarrow \infty} f_n$ are all measurable.

Remark 25. Occasionally, there is need to consider uncountable suprema or infima, e.g. $\sup_{t \in {\bf R}} f_t$. It is then no longer automatically the case that such an uncountable supremum or infimum of measurable functions is again measurable. However, in practice one can avoid this issue by carefully rewriting such uncountable suprema or infima in terms of countable ones. For instance, if it is known that $f_t(\omega)$ depends continuously on $t$ for each $\omega$, then $\sup_{t \in {\bf R}} f_t(\omega) = \sup_{t \in {\bf Q}} f_t(\omega)$, and so measurability is not an issue.

Using the above exercise, when given a sequence $X_1, X_2, X_3, \dots$ of random variables taking values in the extended real line $[-\infty,+\infty]$, we can define the random variables $\sup_n X_n$, $\inf_n X_n$, $\limsup_{n \rightarrow \infty} X_n$, $\liminf_{n \rightarrow \infty} X_n$, which also take values in the extended real line, and which obey relations such as
$$\left( \sup_n X_n > t \right) = \bigvee_{n=1}^\infty \left( X_n > t \right)$$
for any real number $t$.
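The relation $(\sup_n X_n > t) = \bigvee_n (X_n > t)$ can also be checked empirically. The sketch below (our own illustration, truncating to finitely many uniform variables, an obvious simplification) verifies that the two events agree on each sampled outcome.

```python
import random

random.seed(0)

# Empirical sanity check: the outcome omega lies in the event
# {sup_n X_n > t} exactly when it lies in some event {X_n > t}.
N, t = 50, 0.9
for _ in range(1000):
    xs = [random.random() for _ in range(N)]   # one sampled outcome omega
    lhs = max(xs) > t                          # sup_n X_n(omega) > t
    rhs = any(x > t for x in xs)               # omega in the union of {X_n > t}
    assert lhs == rhs
```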

We now say that a sequence $X_1, X_2, X_3, \dots$ of random variables in the extended real line converges almost surely if one has
$$\limsup_{n \rightarrow \infty} X_n = \liminf_{n \rightarrow \infty} X_n$$
almost surely, in which case we can define the limit $\lim_{n \rightarrow \infty} X_n$ (up to almost sure equivalence) as
$$\lim_{n \rightarrow \infty} X_n := \limsup_{n \rightarrow \infty} X_n = \liminf_{n \rightarrow \infty} X_n.$$
This corresponds closely to the concept of almost everywhere convergence in measure theory, which is a slightly weaker notion than pointwise convergence, as it allows for bad behaviour on a set of measure zero. (See this previous blog post for more discussion on different notions of convergence of measurable functions.)

We will defer the general construction of expectation of a random variable to the next set of notes, where we review the notion of integration on a measure space. For now, we quickly review the basic construction of continuous scalar random variables.

Exercise 26. Let $\mu$ be a probability measure on the real line ${\bf R}$ (with the Borel $\sigma$-algebra). Define the *Stieltjes measure function* $F: {\bf R} \rightarrow [0,1]$ associated to $\mu$ by the formula
$$F(x) := \mu( (-\infty, x] ).$$
Establish the following properties of $F$:

- (i) $F$ is non-decreasing.
- (ii) $\lim_{x \rightarrow -\infty} F(x) = 0$ and $\lim_{x \rightarrow +\infty} F(x) = 1$.
- (iii) $F$ is right-continuous, thus $F(x) = \lim_{y \rightarrow x^+} F(y)$ for all $x \in {\bf R}$.

There is a somewhat difficult converse to this exercise: if $F: {\bf R} \rightarrow [0,1]$ is a function obeying the above three properties, then there is a unique probability measure $\mu$ on ${\bf R}$ (the Lebesgue-Stieltjes measure associated to $F$) for which $F$ is the Stieltjes measure function. See Section 3 of this previous post for details. As a consequence of this, we have

Corollary 27 (Construction of a single continuous random variable). Let $F: {\bf R} \rightarrow [0,1]$ be a function obeying the properties (i)-(iii) of the above exercise. Then, by using a suitable probability space model, we can construct a real random variable $X$ with the property that
$${\mathbf P}(X \leq x) = F(x)$$
for all $x \in {\bf R}$.

Indeed, we can take the probability space to be ${\bf R}$ with the Borel $\sigma$-algebra and the Lebesgue-Stieltjes measure associated to $F$. This corollary is not fully satisfactory, because often we may already have chosen a probability space to model some other random variables, and the probability space provided by this corollary may be completely unrelated to the one used. We can resolve these issues with product measures and other joinings, but this will be deferred to a later set of notes.

Define the cumulative distribution function $F = F_X: {\bf R} \rightarrow [0,1]$ of a real random variable $X$ to be the function
$$F(x) := {\mathbf P}( X \leq x ).$$

Thus we see that cumulative distribution functions obey the properties (i)-(iii) above, and conversely any function with those properties is the cumulative distribution function of some real random variable. We say that two real random variables (possibly on different sample spaces) *agree in distribution* if they have the same cumulative distribution function. One can therefore define a real random variable, up to agreement in distribution, by specifying the cumulative distribution function. See Durrett for some standard real distributions (uniform, normal, geometric, etc.) that one can define in this fashion.

Exercise 28. Let $X$ be a real random variable with cumulative distribution function $F$. For any real number $x$, show that
$${\mathbf P}(X < x) = \lim_{y \rightarrow x^-} F(y)$$
and
$${\mathbf P}(X = x) = F(x) - \lim_{y \rightarrow x^-} F(y).$$

In particular, one has ${\mathbf P}(X = x) = 0$ for all $x$ if and only if $F$ is continuous.
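The correspondence between atoms of $X$ and jumps of $F$ can be illustrated with a simple discrete example. The sketch below (a Bernoulli-type variable of our own choosing, not from the text) writes down the CDF explicitly and approximates its left limits numerically.

```python
# For a fair coin flip X taking values 0 and 1 with probability 1/2 each,
# the CDF jumps by 1/2 at x = 0 and x = 1, and P(X = x) equals the jump
# F(x) - lim_{y -> x^-} F(y).
def F(x):
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

def jump(x, eps=1e-9):
    """Approximate the jump F(x) - F(x^-) numerically."""
    return F(x) - F(x - eps)

assert jump(0) == 0.5    # P(X = 0)
assert jump(1) == 0.5    # P(X = 1)
assert jump(0.5) == 0.0  # no atom at 1/2, where F is locally constant
```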

Note in particular that this illustrates the distinction between almost sure and sure events: if $X$ has a continuous cumulative distribution function, and $x$ is a real number, then the event $X = x$ is almost surely false, but it does not have to be surely false. (Indeed, if one takes the sample space to be ${\bf R}$ and $X$ to be the identity function, then $X = x$ will not be surely false.) On the other hand, the fact that $X$ is equal to *some* real number is of course surely true. The reason these statements are consistent with each other is that there are uncountably many real numbers $x$. (Countable additivity tells us that a countable disjunction of null events is still null, but says nothing about uncountable disjunctions.)

Exercise 29 (Skorokhod representation of scalar variables). Let $U$ be a uniform random variable taking values in $[0,1]$ (thus $U$ has cumulative distribution function $F_U(u) = \min(\max(u,0),1)$), and let $F: {\bf R} \rightarrow [0,1]$ be another cumulative distribution function. Show that the random variables
$$X := \sup \{ x \in {\bf R}: F(x) < U \}$$
and
$$Y := \inf \{ y \in {\bf R}: F(y) > U \}$$
are indeed random variables (that is to say, they are measurable in any given model $(\Omega, {\mathcal F}, {\mathbf P})$), and have cumulative distribution function $F$. (This construction is attributed to Skorokhod, but it should not be confused with the Skorokhod representation theorem. It provides a quick way to generate a single scalar variable, but unfortunately it is difficult to modify this construction to generate multiple scalar variables, especially if they are somehow coupled to each other.)
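When $F$ is continuous and strictly increasing, both random variables in this construction reduce to $F^{-1}(U)$, i.e. the familiar inverse-transform sampling method. The sketch below (using the exponential CDF $F(x) = 1 - e^{-x}$ as an illustrative choice of ours) compares the empirical CDF of the resulting samples against $F$.

```python
import math
import random

random.seed(1)

# Inverse-transform sampling: for F(x) = 1 - exp(-x) on x >= 0 (continuous
# and strictly increasing there), the Skorokhod construction reduces to
# X = F^{-1}(U) = -log(1 - U) for U uniform on [0, 1).
def sample(u):
    return -math.log(1 - u)

samples = [sample(random.random()) for _ in range(100_000)]

# The empirical CDF of the samples should be close to F at each test point.
for x in (0.5, 1.0, 2.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    exact = 1 - math.exp(-x)
    assert abs(empirical - exact) < 0.01
```

As the exercise notes, this trick produces one scalar variable at a time; coupling several variables requires more than their individual CDFs.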

There is a multidimensional analogue of the above theory, which is almost identical, except that the monotonicity property has to be strengthened:

Exercise 30. Let $\mu$ be a probability measure on ${\bf R}^d$ (with the Borel $\sigma$-algebra). Define the *Stieltjes measure function* $F: {\bf R}^d \rightarrow [0,1]$ associated to $\mu$ by the formula
$$F(x_1,\dots,x_d) := \mu\left( \{ (y_1,\dots,y_d) \in {\bf R}^d: y_i \leq x_i \hbox{ for all } i = 1,\dots,d \} \right).$$
Establish the following properties of $F$:

- (i) $F$ is non-decreasing: $F(x_1,\dots,x_d) \leq F(y_1,\dots,y_d)$ whenever $x_i \leq y_i$ for all $i$.
- (ii) $F(x_1,\dots,x_d) \rightarrow 0$ when $\min_i x_i \rightarrow -\infty$, and $F(x_1,\dots,x_d) \rightarrow 1$ when $\min_i x_i \rightarrow +\infty$.
- (iii) $F$ is right-continuous, thus $F(x) = \lim_{y \rightarrow x^+} F(y)$ for all $x \in {\bf R}^d$, where the superscript $+$ denotes that we restrict each coordinate $y_i$ of $y$ to be greater than or equal to the corresponding coordinate $x_i$ of $x$.
- (iv) One has
$$\sum_{\epsilon_1,\dots,\epsilon_d \in \{0,1\}} (-1)^{\epsilon_1 + \dots + \epsilon_d} F( x_1^{\epsilon_1}, \dots, x_d^{\epsilon_d} ) \geq 0$$
whenever $x_i^1 \leq x_i^0$ are real numbers for $i = 1,\dots,d$. (*Hint:* try to express the measure $\mu\left( \prod_{i=1}^d (x_i^1, x_i^0] \right)$ of a box with respect to $\mu$ in terms of the Stieltjes measure function $F$.)

Again, there is a difficult converse to this exercise: if $F: {\bf R}^d \rightarrow [0,1]$ is a function obeying the above four properties, then there is a unique probability measure $\mu$ on ${\bf R}^d$ for which $F$ is the Stieltjes measure function. See Durrett for details; one can also modify the arguments in this previous post. In particular, we have

Corollary 31 (Construction of several continuous random variables). Let $F: {\bf R}^d \rightarrow [0,1]$ be a function obeying the properties (i)-(iv) of the above exercise. Then, by using a suitable probability space model, we can construct real random variables $X_1,\dots,X_d$ with the property that
$${\mathbf P}( X_1 \leq x_1 \wedge \dots \wedge X_d \leq x_d ) = F(x_1,\dots,x_d)$$
for all $x_1,\dots,x_d \in {\bf R}$.

Again, this corollary is not completely satisfactory because the probability space produced by it (which one can take to be ${\bf R}^d$ with the Borel $\sigma$-algebra and the Lebesgue-Stieltjes measure associated to $F$) may not be the probability space one wants to use; we will return to this point later.

** — 4. Variants of the standard foundations (optional) — **

We have focused on the orthodox foundations of probability theory in which we model events and random variables through probability spaces. In this section, we briefly discuss some alternate ways to set up the foundations, as well as alternatives to probability theory itself. (Actually, many of the basic objects and concepts in mathematics have multiple such foundations; see for instance this blog post exploring the many different ways to define the notion of a group.) We mention them here in order to exclude them from discussion in subsequent notes, which will be focused almost exclusively on orthodox probability.

One approach to the foundations of probability is to view the event space ${\mathcal E}$ as an *abstract* $\sigma$-algebra – a collection of abstract objects with operations such as $\wedge$ and $\vee$ (and $\bigwedge_{n=1}^\infty$ and $\bigvee_{n=1}^\infty$) that obey a number of axioms; see this previous post for a formal definition. The probability map can then be viewed as an abstract probability measure on ${\mathcal E}$, that is to say a map ${\mathbf P}: {\mathcal E} \rightarrow [0,1]$ that obeys the Kolmogorov axioms. Random variables $X$ taking values in a measurable space $R$ can be identified with their pullback map $X^*$, which is the morphism of (abstract) $\sigma$-algebras that sends a measurable set $S \subset R$ to the event $(X \in S)$ in ${\mathcal E}$; with some care one can then redefine all the operations in previous sections (e.g. applying a measurable map $f: R \rightarrow S$ to a random variable taking values in $R$ to obtain a random variable taking values in $S$) in terms of this pullback map, allowing one to define random variables satisfactorily in this abstract setting. The probability space models discussed above can then be viewed as *representations* of abstract probability spaces by concrete ones. It turns out that (up to null events) any abstract probability space can be represented by a concrete one, a result known as the *Loomis-Sikorski theorem*; see this previous post for details.

Another, related, approach is to start not with the event space, but with the space of scalar random variables, and more specifically with the space of *almost surely bounded* scalar random variables $X$ (thus, there is a deterministic scalar $C$ such that $|X| \leq C$ almost surely). It turns out that this space has the structure of a commutative tracial (abstract) von Neumann algebra. Conversely, starting from a commutative tracial von Neumann algebra one can form an abstract probability space (using the idempotent elements of the algebra as the events), and thus represent this algebra (up to null events) by a concrete probability space. This particular choice of probabilistic foundations is particularly convenient when one wishes to generalise classical probability to *noncommutative* probability, as this is simply a matter of dropping the axiom that the von Neumann algebra is commutative. This leads in particular to the subjects of quantum probability and free probability, which are generalisations of classical probability that are beyond the scope of this course (but see this blog post for an introduction to the latter, and this previous post for an abstract algebraic description of a probability space).

It is also possible to model continuous probability via a nonstandard version of discrete probability (or even finite probability), which removes some of the technicalities of measure theory at the cost of replacing them with the formalism of nonstandard analysis instead. This approach was pioneered by Ed Nelson, but will not be discussed further here. (See also these previous posts on the Loeb measure construction, which is a closely related way to combine the power of measure theory with the conveniences of nonstandard analysis.)

One can generalise the traditional, countably additive, form of probability by replacing countable additivity with finite additivity, but then one loses much of the ability to take limits or infinite sums, which reduces the amount of analysis one can perform in this setting. Still, finite additivity is good enough for many applications, particularly in discrete mathematics. An even broader generalisation is that of *qualitative* probability, in which events that are neither almost surely true nor almost surely false are not assigned any specific numerical probability between $0$ and $1$, but are simply assigned a symbol to indicate their indeterminate status; see this previous blog post for this generalisation, which can for instance be used to view the concept of a “generic point” in algebraic geometry or metric space topology in probabilistic terms.

There have been multiple attempts to move more radically beyond the paradigm of probability theory and its relatives as discussed above, in order to more accurately capture mathematically the concept of non-determinism. One family of approaches is based on replacing deterministic *logic* by some sort of probabilistic logic; another is based on allowing several parameters in one’s model to be unknown (as opposed to being probabilistic random variables), leading to the area of uncertainty quantification. These topics are well beyond the scope of this course.

## 90 comments


29 September, 2015 at 10:28 pm

Bo Jacoby: https://www.academia.edu/3247833/Statistical_induction_and_prediction This useful elementary result is not widely known.

1 October, 2015 at 9:45 pm

Anonymous: Is this available anywhere else, like arxiv.org? There doesn’t seem to be a way to download it without enrolling on that annoying academia.edu social media site. Thanks.

5 October, 2015 at 7:44 am

Bo Jacoby: Sorry, I was not aware that academia.edu is annoying. Here is a dropbox link. https://www.dropbox.com/s/yzxugqjxigkpw5b/Induction.pdf?dl=0

30 September, 2015 at 4:35 am

Pedro Lauridsen Ribeiro: Since these notes will also have a focus on foundations, it would be interesting to point out the connection of Kolmogorov’s axioms for probability to (quantifying) plausible reasoning via the connection of $\sigma$-algebras to Boolean $\sigma$-algebras provided by the Loomis-Sikorski theorem, so that probability measures amount to “generalized truth” functions, so to speak.

[This point is briefly covered in the final section of the notes – T.]

30 September, 2015 at 8:57 am

Scott Thomas: I look forward to reading the notes. I do wonder why Durrett’s book was chosen, however. It has a nasty reputation.

30 September, 2015 at 9:25 am

Brendan Murphy: Durrett’s book is free and covers the standard topics. I liked some of the problems in it as well. Maybe it’s not the best choice for self-study, but as an adjunct to a graduate course it’s pretty good.

30 September, 2015 at 3:51 pm

D Ghatak: Which book is a good choice for self-study?

1 October, 2015 at 9:47 pm

Anonymous: I’ve found Wikipedia’s coverage to be pretty good, though maybe it’s rotted my mind. I’m looking forward to following these notes. There are a couple of well-known books by Alfréd Rényi that I’ve been wanting to read but so far I’ve only looked at a few pages.

2 October, 2015 at 2:03 am

D Ghatak: Thanks.

2 October, 2015 at 12:07 pm

jldohmann: Durrett’s book is pretty standard, at least in my experience with probability and stochastics courses. He also has interesting advanced topics, like his book on random graph dynamics (that I’m currently reading) which his previous books sort of set the stage for. Though I agree, he’s not always the easiest to read.

30 September, 2015 at 11:11 am

Greg Martin: Excellent post! In Definition 21, you write “We then model an event {E} by subsets {E_\Omega} of the sample space {\Omega}.” Should that instead be “We then model an event {E} by elements {E_\Omega} of the {\sigma}-algebra {\mathcal F}”, or is there a nuance I’m missing that makes the original phrasing more appropriate?

(Also, a typo: in Exercise 23, the penultimate curly bracket of the penultimate math expression should be a close-bracket rather than open-bracket.)

[Sorry, that was a result of a careless cut-and-paste, now corrected. -T]

19 October, 2015 at 5:58 am

Fred Lunnon: The subsequent example makes it clear that Definition 1 should read

“We then model an event {E} by a subset {E_\Omega} of the sample space {\Omega}.”

or more readably

“We then model events {E} by subsets {E_\Omega} of the sample space {\Omega}.”

[Corrected, thanks – T.]

1 October, 2015 at 12:38 pm

James: Dear Terence:

Have you read or browsed ‘Probability: The Logic of Science’ by the late physicist E.T. Jaynes, and if so, what did you think of it, and would you ever consider using it for parts of such a course?

Sincerely,

James

2 October, 2015 at 2:10 pm

Terence Tao: I have not looked in detail at this text, but it may be more suitable for a foundations of probability course rather than a graduate mathematics course in probability like this one. (This current set of notes is indeed devoted to foundations, but the bulk of the course will not be.)

3 October, 2015 at 6:54 pm

Anonymous: Do you ever have the intention of doing a foundations course and offering your opinions there?

1 October, 2015 at 3:29 pm

Chris: From the measure-theoretic point of view a “random variable” $X$ is considered as a deterministic function (as defined above). On the other hand, in practice one often just writes something like $X \sim N(\mu,\sigma)$ to signify that the random variable is distributed as, say, the normal distribution, without reference to any sample spaces. The above corollary shows this suffices in practice, as only the CDF really matters. The question that has always been a bit of a puzzle from a foundations point of view is how to interpret the randomness of the so-called random variable. How does the random variable modelling, say, a coin toss as an explicit function from a sample space fit with the practice of generating a coin toss, either by computer or with physical coins (which is determined by the CDF)? A related issue comes with the concept of a “statistic”, which is taken to be some function of a random variable or several such, and what can be realised/explicitly calculated from a given set of data.

1 October, 2015 at 8:50 pm

Terence Tao: It is the equidistribution of many “real-life” processes that allows them to be usefully modeled by probability theory, even if they are ultimately generated by some deterministic physical or mathematical law. For instance, a well-designed pseudorandom number generator is deterministic, but asymptotically has the same statistics as a genuinely random number generator, and so can be accurately modeled by such.

To me, the more interesting phenomenon is that of *universality* – not only can complex systems be modeled by probability theory, but moreover one only needs to use a small set of universal probability distributions (e.g. gaussian, uniform, GUE, etc.) to model such systems, almost irrespective of the underlying mechanics of that system. I discuss this phenomenon in this previous post. Probability theory can help explain this phenomenon by establishing universality results such as the central limit theorem.

By the way, the CDF of a random variable only serves to accurately model that random variable in isolation. If that variable is to be coupled with other random variables, knowing the CDFs of the individual random variables is no longer sufficient (unless one assumes joint independence of these variables); one must know the full joint CDF instead.

2 October, 2015 at 6:40 am

Chris: On the foundation aspect I just want to draw attention to the “lingo” used to bridge between the thinking behind the concept as it is used in applications and the realisation in a mathematical framework. Something like… to answer the question “What is meant by a random variable?” we could say: A random variable is a variable that takes on different values depending on the outcome of a “random event”. Or: A random variable is a variable whose value depends on “unknown events”, where we can summarise the unknown events as a set of outcomes in a sample space, so that the random variable can be considered as a function from this space to a set of values (hence the above measurable function formulation). The stats textbooks often ignore or don’t mention the event space, as Corollary 26 or 29 are effectively invoked in most cases, and certainly don’t point out that the sample space may need extending or augmenting even though the exact set of outcomes/events might be irrelevant (philosophically very much like how the atlas is not normally referred to explicitly in Differential Geometry).

On the universality aspect I think it is important to emphasise the framework popularised by K Pearson / RA Fisher, in the way many of the probability models have in mind some Epidemiology/Biostatistics application or philosophy underpinning them (since the research branch was developed from such “model” problems). The particular statistical/probability terminology used reflects this, and the meaning and interpretation that surrounds the concepts is often important to be clear on (as it is often subject to misunderstanding).

2 October, 2015 at 11:00 am

Bookmarks for October 1st through October 2nd | Chris's Digital Detritus[…] 275A, Notes 0: Foundations of probability theory | What’s new – […]

2 October, 2015 at 11:55 am

Byron Schmuland: Thanks very much for posting this!

A minor point: Durrett’s surname has two “r”s.

[Corrected, thanks – T.]

2 October, 2015 at 12:31 pm

Travis: In Definition #1, the discrete sample space $\Omega$ is defined in terms of itself, using $\Omega$ again. Is this circular definition intentional? Why not use a different designation for the discrete sample space?

2 October, 2015 at 1:01 pm

Terence Tao: $\Omega$ is used here as a synecdoche, which is a standard convention in mathematics when referring to a structured space. For instance a group is formally a tuple $(G, \cdot)$ consisting of a set $G$ together with various operations on it, but we usually abbreviate it via synecdoche as just $G$, thus by abuse of notation $G = (G, \cdot)$. Similarly for vector spaces, topological spaces, rings, fields, manifolds, metric spaces, etc..

2 October, 2015 at 5:09 pm

Tapen Sinha: Jumping from the finitely additive probabilities to the countably additive ones looks deceptively simple but it is not. Rajeeva Karandikar has a beautiful lecture on the subject. He should know. He has done some deep work on that matter.

http://math.iisc.ernet.in/~imi/downloads/LimitTheoremsFinitelyAdditiveProbability3.pdf

Tapen Sinha

3 October, 2015 at 12:28 am

Rajeeva Karandikar: I would like to elaborate a little on what my friend Tapen Sinha has mentioned.

One naively believes that it is the countable additivity assumption that enables us to prove limit theorems in probability theory. This is not quite correct. I had shown that almost all limit theorems with convergence in law or convergence in distribution that are true in the countably additive framework are also true in the finitely additive framework.

See:

http://goo.gl/uHxANk (paper from Transactions of AMS, 1982)

and

http://goo.gl/KuOHgq (paper from Journal of Multivariate Analysis, 1988).

3 October, 2015 at 3:20 am

First new cache-coherence mechanism in 30 years « Pink Iguana[…] Terry Tao, What’s New, 275A, Notes 0: Foundations of probability theory, here. […]

3 October, 2015 at 2:58 pm

275A, Notes 1: Integration and expectation | What's new[…] Notes 0, we introduced the notion of a measure space , which includes as a special case the notion of a […]

3 October, 2015 at 3:25 pm

Rex: Dear Terry,

I’m not sure that base change in algebraic geometry is a good example of the modelling you have in mind. It is quite different from coordinate changes in linear algebra and differential geometry. For one thing, it is irreversible; one can base change from a smaller base to a larger one, but not the other way around.

3 October, 2015 at 5:37 pm

Terence Tao: Actually, it is this irreversibility that made me include base change in the list of analogues to changing the probabilistic model, because there is a similar irreversibility in probability: one can extend a sample space to a larger one by adding more sources of randomness, but one cannot reverse the process and reduce back to a smaller space (because events and random variables that are measurable in the larger model may cease to be measurable in the smaller model).

3 October, 2015 at 5:56 pm

Rex: Would you have an example to illustrate this irreversibility for sample spaces?

As for base change, it is not clear to me what the invariant thing that schemes under different bases are modeling. One guess would be the underlying defining equations. However, I usually see it done the other way: that the defining equations are models giving a concrete realization of a variety in affine space.

4 October, 2015 at 8:50 am

Terence Tao: See Example 3: one always has the freedom to extend the sample space to a larger space, but not conversely (one can only pass back down to the smaller space if one has discarded all the events and random variables that cannot be modeled in that smaller space).

The base change analogy is not perfect because the original scheme embeds into the larger scheme, rather than being a quotient (or factor) of it; for instance a variety over the reals embeds into the same variety over the complex numbers, in contrast to probabilistic spaces where the extended space projects back down to the original space. Of course, the change of linear basis analogy (or the change of coordinates analogy) are not perfect either, since in these cases the models are isomorphic rather than being related by an embedding or a projection. However they still are examples of the general theme of distinguishing abstract mathematical objects from their concrete models in order to be able to change the latter without disrupting the former, and my main point was that this distinction is not some quirk unique to probability theory but is in fact pervasive throughout modern mathematics, albeit in somewhat different forms as one goes from one subfield of mathematics to another.

(Perhaps a closer analogy would be some sort of “numerical base change” in which a variety that is being computationally modeled by (say) single-precision reals, is instead modeled by the extended model of double-precision reals, since in this case it is more natural to project the latter to the former than to embed the former in the latter, but this is not a concept that is well suited for formalisation in modern algebraic geometry.)

ADDED LATER: The base change analogy looks more aligned with probabilistic language if one views the base space geometrically rather than algebraically. For instance, an analogue of a real-valued random variable modeled by some probability space would be a section of a line bundle over some base space. Extending the probability space to some extension corresponds to performing a base change on the line bundle using some extension of the underlying base. There is also an interesting analogy between a “random element” in probability theory and a “generic point” in algebraic geometry. It might even be possible to put probability theory and algebraic geometry on exactly the same footing by using some topos-theoretic language, but I haven’t worked this through carefully.

4 October, 2015 at 4:04 pm

Rex: At first I thought I understood what the “object vs model” analogy you were trying to describe is, but now I’m not so sure.

Are you saying that a fixed sample space is something like a finite-level approximation of some ultimate (perhaps non-mathematical) “space” of all possible outcomes (including future experiments we have not yet decided to perform or record)? In the same way that, say, the $p$-adic integers can be “modeled” by the projections to their finite quotients?

4 October, 2015 at 4:21 pm

Terence Tao: Yes, I think this is a good way to think about things. In principle one could try to pass to an inverse limit of “all” sample spaces to work in some “universal” sample space, but one would have to exit the category of sets to do so, in order to avoid Russell type paradoxes, so this seems like more trouble than it is worth.

4 October, 2015 at 4:31 pm

Rex: Ah, okay. That makes more sense. Are there examples where an extension of sample spaces is something more subtle than just adding an extra factor?

In example 3, for instance, rolling the die a second time amounts to taking a direct product of our old sample space with a six-point set. Is it always the case that extending the sample space amounts to taking a direct product with something?

4 October, 2015 at 10:18 pm

Terence Tao: One can be substantially more general than a direct product. Consider for instance a “gambler’s ruin” (or “stopping time”) type of scenario in which one performs some random actions at a casino until one’s money runs out. The sample space is extended as time progresses, but not as a direct product because for a certain portion of the outcomes, one is no longer gambling and producing additional sources of randomness. More generally, any new source of randomness that is coupled somehow to prior sources of randomness will lead to an extension that is not a direct product. (One often analyses such situations by jumping to the final extension of the sample space after the game has completely played out, viewing times prior to the final time as being associated to various sub-sigma-algebras of the final sigma algebra, but one can also take a more “real-time” perspective, thinking of the sample space as dynamically evolving over time. The choice of which perspective to use is to some extent a matter of personal taste.)

For a more concrete example, one could be studying some random graph (e.g. an Erdos-Renyi random graph) and find it convenient at some point in the analysis to take some random subgraph of that graph, e.g. by randomly selecting each edge of the graph to survive in the subgraph with some probability p. This requires extending the sample space in a manner more complicated than a direct product, since the amount of randomness one needs to build the subgraph depends on how many edges the original graph had. (In this particular case one could model things by a direct product, e.g. by viewing the random subgraph as the intersection of the original graph with an independent Erdos-Renyi graph, and this perspective may in fact be useful for whatever task one is trying to accomplish, but one could imagine more complicated probabilistic constructions which do not naturally arise from a direct product. In certain parts of theoretical computer science, such as computational complexity and cryptography, it can be useful to view randomness as a finite resource, to be used sparingly if possible, in which case it can be more efficient to use non-direct-product models in which a random number generator is only used when absolutely necessary, which in particular makes each new use of that generator conditional on previous ones.)
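The two-stage graph construction can be sketched in a few lines (illustrative Python with made-up parameters): the number of coin flips consumed in the second stage equals the number of edges produced by the first stage, so the extension depends on the first stage's outcome.

```python
import random

def erdos_renyi(n, q, rng):
    """Edge set of a G(n, q) random graph on vertices 0..n-1."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < q}

def random_subgraph(edges, p, rng):
    """Keep each edge independently with probability p.

    The number of coin flips consumed here is len(edges), which
    depends on the random outcome of the first stage, so the
    combined experiment is not naturally a direct product.
    """
    return {e for e in edges if rng.random() < p}

rng = random.Random(1)
g = erdos_renyi(8, 0.5, rng)
h = random_subgraph(g, 0.3, rng)
print(h <= g)  # True: the subgraph only ever uses edges of g
```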

3 October, 2015 at 11:46 pm

PerryZhao: Reblogged this on 木秀于林.

4 October, 2015 at 8:33 am

Colin Rust: In Exercise 28, in the first sentence it looks like the words “real line” are extraneous. In part (iv), I think you’re missing a factor of . Possibly you might also want to add a hint that the formula represents the measure of a box in .

[Corrections and hint added, thanks – T.]

4 October, 2015 at 11:58 am

Anonymous: The last centered equation in Example 3 is missing a “\” in the omega, and there’s a right parenthesis missing in the first sentence of Exercise 11.

[Corrected, thanks – T.]

4 October, 2015 at 1:09 pm

Anonymous: In the definition of the expectation, how does one know that it is independent of the model?

4 October, 2015 at 2:36 pm

Terence Tao: This is not shown in this set of notes (since we did not specify how we were permitted to extend the model), but is shown in Notes 1 (see the discussion at the end of Section 2 there).

6 October, 2015 at 3:55 am

M. Klazar: In the displayed formula after Exercise 24, should there not rather be “=” and a disjunction?

[Corrected, thanks – T.]

6 October, 2015 at 10:05 am

Ben Golub: As a complement to Durrett’s book, I always found Amir Dembo’s notes on probability to be useful for a course of this type… their main virtue is that they are very systematic and (perhaps obsessively!) organized. http://statweb.stanford.edu/~adembo/stat-310a/lnotes.pdf

(Initially posted this on old notes but thought it would be more relevant here.)

9 October, 2015 at 11:23 am

EXP LOG: Dear Professor Tao, have you thought about videotaping your classes so that students can benefit from your lectures in a better way, especially given that real analysis and probability theory have a large audience beyond PhD students in math, e.g., engineering, physics, economics, and finance?

11 October, 2015 at 11:10 am

Alik: Dear Professor Tao, my impression is that in probability people focus a lot on independence of random variables. Why is independence so much more important than any other type of dependence between random variables?

11 October, 2015 at 1:55 pm

Terence Tao: For one thing, it is the simplest model to compute with, since we have for independent random variables and arbitrary . So even if in practice one needs to work with random variables with more complicated couplings, one often works with the independent case first as a simplified model to establish some baselines as to what to expect in more general cases. The other thing is that independence (or something close to it) is often a reasonable assumption in real life settings, for instance when one has two sources of randomness that one believes to not be in communication with each other, e.g. because of physical separation, or lack of knowledge of one source of the other, and which are not both drawing on a common third source of randomness. Contrapositively, one can use an independence assumption to make various predictions which, if they then fail to match observed data, suggest that there is some non-trivial correlation between two observables, and which can then be used to draw statistical inferences (or, in some cases, even causal inferences).
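The product rule for independent events can be checked exactly in a toy two-dice model (an illustrative sketch; the particular events A and B are arbitrary choices): under the uniform product measure, the probability of a joint event factors into the product of the marginal probabilities.

```python
from fractions import Fraction

# Uniform measure on the product sample space of two fair die rolls.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """Exact probability of an event (a predicate on outcomes)."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = {2, 4, 6}          # first roll is even
B = {1, 2}             # second roll is small
joint = prob(lambda w: w[0] in A and w[1] in B)
product = prob(lambda w: w[0] in A) * prob(lambda w: w[1] in B)
print(joint == product)  # True: the product measure makes the rolls independent
```

Here joint and product are both exactly 1/6; exact rational arithmetic avoids any floating-point caveats.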

There are certainly however other types of coupling than independence which are widely studied. For instance, in highly noncommutative settings (e.g. in random matrix theory), free independence often becomes the more natural notion to study than classical independence.

11 October, 2015 at 2:32 pm

Alik: Thanks!

12 October, 2015 at 12:35 pm

275A, Notes 2: Product measures and independence | What's new: […] number of probability spaces for , where can be infinite or even uncountable. Recall from Notes 0 that the product -algebra on is defined to be the -algebra generated by the sets for and , […]

14 October, 2015 at 1:04 pm

Venky: In Example 14, shouldn’t E_\beta \in \Beta_\beta, not \Beta_\alpha? Great notes. Thanks.

[Corrected, thanks – T.]

16 October, 2015 at 4:43 am

L.: Dear Professor Tao,

In the definitions of boolean operations of events below Exercise 3, would it be better to say, for example, “The conjunction {E \wedge F} of two events {E}, {F} is modeled (instead of defined) as the intersection of their models”?

Several lines below regarding the relation {F}, I think “the event {F(R_1,\dots,R_n)}” should be “{F(X_1,\dots,X_n)}”.

Below Remark 25, “{(\sup_n X_n > t) := \bigwedge_{n=1}^\infty (X_n > t)}” should be “{(\sup_n X_n > t) := \bigvee_{n=1}^\infty (X_n > t)}”.

In Exercise 30 (ii), “{F(t)}” should be “{F(t_1,…,t_n)}”, and in (iii), “{F(s)}” should be “{F(s_1,…s_n)}”.

[Corrected, thanks – T.]

18 October, 2015 at 3:10 am

MartinScorsese: Dear Professor Tao,

do you know of a text (book, article – anything) that focuses on the foundational aspects of probability and develops and treats in greater detail the issues you touched upon in these notes?

20 October, 2015 at 5:28 pm

Fred Lunnon: Example 9 : for “Becaue” read “Because” ;

Exercise 15 : for “{\sigma-}algebra” read ” {\sigma}-algebra” ;

Remark 25 foll. : delete “if one is ” ; for “if one has” read “when” .

[Corrected, thanks – T.]

2 November, 2015 at 1:52 am

zxmzxmzxm: Reblogged this on ZXM rules!.

2 November, 2015 at 7:06 pm

275A, Notes 4: The central limit theorem | What's new: […] (iv) Show that converges in distribution to if and only if, after extending the probability space model if necessary, one can find copies and of and respectively such that converges in probability to . (Hint: use the Skorohod representation, Exercise 29 of Notes 0.) […]

10 November, 2015 at 10:28 am

ManInThe: Dear Professor Tao,

In Definition 1 and in Definition 22, the meaning of “to model something by something else” is not clear to me. In particular, are you saying that the measurable subsets are called events, or are you saying something else?

Thanks for your clarification

10 November, 2015 at 10:50 am

Terence Tao: A model is a representation of a mathematical object, but strictly speaking one should distinguish it from the object itself, although often in practice one sees the “abuse of notation” of identifying an object with its model, particularly if one is working with a fixed representation for each class of objects being considered.

A basic example is the distinction between a natural number and its representation in a numeral system such as the decimal system. Strictly speaking, a string of digits such as 13 is not a number, it is rather a numeral used to model a number. Of course, we usually abuse notation and identify numerals with numbers, e.g. “let n be the natural number 13”. This usually does not cause problems unless one needs to use two different numeral systems for the same number. For instance, if n is modeled in the decimal system by the numeral 13, then it is also modeled in the octal system by the numeral 15, despite the fact that 13 and 15 are clearly different numerals (different strings of digits). This can be confusing unless one is very clear about the distinction between a number and the numerals that model it in various representations. Using the notational conventions of this post, the accurate thing to write here is that there is a natural number whose decimal representation is 13, and whose octal representation is 15; they both model the same natural number despite 13 and 15 being distinct numerals. However, to be pedantic, the number itself is distinct from both the numeral 13 and the numeral 15; we have and , but and .
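The decimal/octal example can be checked mechanically (a small Python illustration): the two numerals are distinct as strings, yet parse to the same underlying number.

```python
# The same natural number has different numeral models in different bases.
n = int("13", 10)      # the decimal numeral "13"
m = int("15", 8)       # the octal numeral "15"
print(n == m)          # True: both numerals model the number thirteen
print("13" == "15")    # False: the numerals themselves are distinct strings
```

The distinction between `13` the numeral (a string) and thirteen the number (an `int`) is exactly the object/model distinction described above.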

Another example is the distinction between a vector in some vector space (e.g. a two-dimensional plane) and its representation as a pair of numbers with respect to some basis. Again, if one is committed to only using one basis for the vector space in question, there is little harm in abusing notation and identifying a vector in that space with its model , but this can lead to confusion when one wishes to change the basis (which is often useful, for instance when performing a multi-step linear algebra calculation, one component of which is easy to carry out in one basis, and another component of which is easy to carry out in a different basis). For instance, a vector which is modeled in one basis by the pair might also be modeled in a different basis (rotated from the first one by 45 degrees) by . (Similarly when changing coordinate systems, e.g. from Cartesian coordinates to polar coordinates.) With the notational conventions of this post, I would write that and ; they both model the same underlying vector , despite the fact that are two distinct pairs of real numbers. Again, to be pedantic, the vector should be kept distinct from both the pair and the pair .
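A small sketch of the same point (illustrative code; the 45-degree rotated basis is the one mentioned above): the same underlying vector has different coordinate models in different orthonormal bases.

```python
import math

def coords(v, e1, e2):
    """Coordinates of v in the orthonormal basis (e1, e2)."""
    dot = lambda a, b: a[0] * b[0] + a[1] * b[1]
    return (dot(v, e1), dot(v, e2))

v = (1.0, 0.0)   # the underlying vector, written in standard coordinates
t = math.pi / 4  # a basis rotated 45 degrees from the standard one
e1, e2 = (math.cos(t), math.sin(t)), (-math.sin(t), math.cos(t))

standard_model = coords(v, (1, 0), (0, 1))
rotated_model = coords(v, e1, e2)
print(standard_model)   # (1.0, 0.0)
print(rotated_model)    # approximately (0.7071, -0.7071)
```

The two printed pairs are distinct as pairs of numbers, yet both model the same vector `v`.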

[The one-dimensional version of the above example is also instructive, illustrating the role of physical units to model physical quantities (such as length, mass, speed, etc.) as numbers. A given length might be modeled both as 3 meters by the metric system and (approximately) 10 feet in the imperial system, despite 3 and 10 being different numbers; as such, it is strictly speaking inaccurate to think of a length as a number, but rather that numbers can be used to model lengths in various representation systems.]

Once one makes the distinction between an object and its models, we see that the models of the object do not define the object in an ontological sense: if one is told for instance that a numeral is a finite string of digits from the set {0,1,2,3,4,5,6,7,8,9} that does not start with 0, this does not quite define what a natural number is, unless one abuses notation by identifying a natural number with its numeral model. However, models can serve to define the functionality of a mathematical object, which is often good enough for mathematical purposes. For instance, as we are taught in primary school, the numeral system can be used to define arithmetic operations on natural numbers such as addition and multiplication. This lets us manipulate natural numbers for the purposes of doing mathematics, even if it still doesn’t address the philosophical question of what a natural number actually is. (Of course, one has to check at some point that the arithmetic operations defined from different numeral systems are consistent with each other, e.g. that long multiplication in decimal is consistent with long multiplication in octal. This requires some mathematical argument to check; the simplest way to do it is to verify that both operations are consistent with the multiplication operation in an axiomatic description of the natural numbers, e.g. the Peano axiom formulation.)

Of course, one can also try to define mathematical objects without recourse to models. For instance, the natural numbers can be described axiomatically using axiom systems such as Peano arithmetic or ZFC. It is also possible to set up the foundations of probability axiomatically (e.g. through the Kolmogorov axioms, together with the sigma-algebra version of the axioms of Boolean algebra), but I prefer to avoid this approach for a first course in graduate probability, for much the same reason that arithmetic is often first taught using the decimal representation, or linear algebra is first taught by modeling vectors as tuples of numbers.

11 November, 2015 at 11:24 am

ManInThe: Dear Professor Tao,

Thanks for your answer. It helped me improve my understanding a lot.

28 November, 2015 at 1:58 pm

Anonymous: In Exercise 17, was the intent to write “continuous function from X to Y”?

[Corrected, thanks – T.]

28 November, 2015 at 2:09 pm

Anonymous: In Definition 22, an event E is meant to be an element (singular) of F, but it is written “elements”.

[Corrected, thanks – T.]

28 November, 2015 at 2:20 pm

Anonymous: After Remark 25, in the equation (sup X_n > t) := …, does the equality need the “defined equal” colon?

[Corrected, thanks – T.]

28 November, 2015 at 2:38 pm

Anonymous: Exercise 30 (ii) might be written as a vector in the limit notation.

5 February, 2016 at 2:09 am

Final Word on Richard Carrier | Letters to Nature: […] Bertsekas and Tsitsiklis; Rosenthal; Bayer; Dembo; Sokol and Rønn-Nielsen; Venkatesh; Durrett; Tao. I recently stopped by Sydney University’s Library to pick up a book on nuclear reactions, […]

17 February, 2016 at 9:18 am

Joel K: I recently came across a quite interesting approach to probability theory based on using the Henstock integral. See this book by P. Muldowney: A Modern Theory of Random Variation. A draft is available here.

This approach simplifies some ideas in standard probability, and can be used for example to provide a nice probabilistic interpretation of Feynman path integrals.

Has anyone here had experience with this approach and its potential advantages and disadvantages?

19 April, 2016 at 7:04 am

Anonymous: In linear algebra, we distinguish between an abstract vector space, and a concrete system of coordinates given by some basis. In group theory, we distinguish between an abstract group, and a concrete representation of that group as isomorphisms on some space. In the above two cases, one has a set of mathematical axioms for each of the abstract concepts, “vector spaces” and “groups”.

I’m confused by Definition 1 and Definition 19. You say that the probability spaces are models, which, as I understand, correspond to the concrete coordinate systems and representations above. But what are the axioms for the “stuff” the probability spaces are modeling? This note vaguely mentions that they are modeling “randomness”, but what is “randomness” mathematically? Also, in Definition 1, you distinguish an event and its model. Then what is the definition/axiom for the abstract concept “events”? (Similar question for “random variables”.)

19 April, 2016 at 9:47 am

Terence Tao: Abstract probability spaces are discussed in Section 4 of the blog post. Alternatively, one can eschew axiomatisation altogether, and simply define an abstract probability space to be a structure of events and probabilities that has at least one concrete model. This would be analogous to defining (say) a group to be an object with a multiplication operation that had at least one concrete representation as an action; indeed, this is close to how groups were first viewed historically. (In this particular case, Cayley’s theorem tells us that the two definitions of a group are equivalent.)

14 November, 2016 at 6:28 am

Anonymous: In Definition 1,

We consider two events to be equal if they are modeled by the same set: .

Do we need this to be true for any models? Namely, for any ?

[Yes, we only consider faithful models (aka faithful representations) of probability spaces. -T.]

14 November, 2016 at 6:52 am

Anonymous: When we say “We then model events by subsets of the sample space”, what mathematical object (or category?) is an “event”? Is it a set?

[It can be, if you like; but it doesn’t have to be. From a formal mathematical viewpoint, the only requirement is that the events lie in an abstract sigma algebra equipped with a probability measure. An analogy is with vectors in a finite-dimensional space; they can be modeled by row or column vectors of numbers, but they themselves could just be elements of an abstract vector space. -T.]

14 November, 2016 at 12:46 pm

Anonymous: I’m confused about your comments.

(I) an “event” “does not have to” be a set.

(II)”the only requirement is that the events lie in an abstract sigma algebra equipped with a probability measure”

Would you elaborate on what you mean by “abstract sigma algebra”? Isn’t a sigma algebra a collection of subsets of a given set with the sigma algebra structure, so that any element lying in an abstract sigma algebra must be a “set”? (If an event is not a set, would you give an example to illustrate what else it could be?)

I’m also not very clear about the analogy with vectors in a finite-dimensional space. Suppose is a real vector space of dimension . Then a model for is . But isn’t it true that the abstract vector space V itself is also a “set” in the first place by definition?

14 November, 2016 at 4:21 pm

Terence Tao: You are describing a concrete sigma algebra. See for instance https://terrytao.wordpress.com/2009/01/12/245b-notes-1-the-stone-and-loomis-sikorski-representation-theorems-optional/ for a definition of an abstract sigma algebra and its relationship with the concrete sigma algebra notion. A simple example would be the (abstract) Boolean algebra with the usual Boolean operations. In this case there are two events, and ; depending on how one is modeling the natural numbers, these events might be sets, but there is no necessity for them to be so.

As for the vector analogy, the vector space is always a set, but the individual vectors do not have to be sets (or arrays of numbers, or anything else for that matter). Similarly, the event space is always a set, but individual events (the elements of the event space) need not be.

15 November, 2016 at 5:00 am

Anonymous: I think it is clear now. I find the linked article on the Kolmogorov axioms (https://en.wikipedia.org/wiki/Probability_axioms) actually very confusing. If one checks the third axiom there, one can see that individual events are described as sets, while they do not have to be, as you pointed out. At least one should notice that the events in the event space do not have to be subsets of the probability space at all.

15 November, 2016 at 5:25 am

Anonymous: As I read from the English translation of the work by Kolmogorov himself (http://www.york.ac.uk/depts/maths/histstat/kolmogorov_foundations.pdf), it seems that Kolmogorov didn’t use the notion of an abstract sigma algebra but the concrete one. So the link in

“The axioms of a probability space then yield the Kolmogorov axioms for probability”

might not be very appropriate?

In Definition 22,

We then model an event by an element of -algebra , with each such element describing an event.

should be a concrete sigma algebra and thus a collection of subsets of ? And hence, we are not only modeling an event by an element of but also modeling the “event space” by a concrete sigma algebra (hence the notation might be better)?

[Fair enough; I’ve added the subscript. -T]

13 May, 2016 at 6:14 pm

Visualising random variables | What's new: […] variables) that do depend (in some measurable fashion) on the state . (As discussed in this previous post, it is often helpful to adopt a perspective that suppresses the sample space as much as possible, […]

20 June, 2016 at 8:14 am

Confused Robitaille: I’m also not sure how good your analogy to linear algebra is, because of the different levels of abstraction used in the definitions of a real vector space vs. a real vector space model, resp. an abstract probability space (in the sense you mentioned of Kolmogorov’s axioms + the sigma-algebra version of the Boolean algebra axioms) vs. a probability space (as in Definition 19):

Real vector spaces are abstract, but representations of them are concrete, in the sense that each representation is a fixed set built out of numbers that one could write down. But abstract probability spaces and probability spaces are both abstract, because the latter is not a concrete set that one could write down, but itself a set that is given abstractly by axioms that characterize it.

21 June, 2016 at 7:32 am

Terence Tao: This is technically true, but in practice, probability spaces are often quite concrete, being a set of outcomes (with the discrete sigma algebra) or some explicit subset of (with the Borel sigma algebra), or perhaps a finite or infinite product of these spaces.

It’s true though that the object that is considered “concrete” in one setting could be considered “abstract” in another. For instance, one could imagine a probability space, concrete in the sense of probability theory, in which the sample space was a manifold, which was abstract in the sense of differential geometry, but for which the real number system underlying both the probability space and the manifold is given concretely by digit strings, as opposed to some axiomatic presentation of the reals. I don’t see any inconsistency here, though, and as long as one keeps all the different categories (probability theory, differential geometry, real closed fields, etc.) here conceptually distinct from each other there should not be any serious confusion.

14 November, 2016 at 6:57 am

Anonymous: Are “conjunction” and “disjunction” actually “set” operations, and is it just because we want to distinguish explicitly between an event and its model in a sample space that we deliberately use different terminologies than those in set theory?

14 November, 2016 at 7:14 am

Anonymous: (I) On the one hand, we have the following:

Given a finite number of random variables taking values in ranges , we can form the joint random variable taking values in the Cartesian product by concatenation of the models, thus

(II) But you write in Example 3 the following:

Now suppose that we wish to roll the die again to obtain a second random variable. The sample space is inadequate for modeling both the original die roll and the second die roll.

It seems from (I) that one does not need to enlarge the sample space no matter how many random variables are added in Example 3. How should one reconcile this “contradiction”?

14 November, 2016 at 9:41 am

Terence Tao: Making the sample space explicit, (I) asserts that if the sample space is already capable of simultaneously modeling each of the random variables , then it can also model the joint random variable . But in the situation in (II), one cannot model both and simultaneously, and one certainly cannot model the joint random variable . This is not in contradiction with (I) because the hypothesis in (I) was not satisfied.

14 November, 2016 at 8:01 am

Anonymous: Would you elaborate on how Corollary 27 is proved?

It is said in the notes that

Indeed, we can take the probability space to be with the Borel -algebra and the Lebesgue-Stieltjes measure associated to .

I’m very confused here. What is the random variable with the desired property in Corollary 27?

14 November, 2016 at 9:44 am

Terence Tao: In this case, the random variable is modeled by the identity function .

Strictly speaking, Corollary 27 covers both discrete and continuous random variables (the measure is allowed to be either a discrete measure, a continuous measure, or a combination of both), but given that discrete random variables are already easy to construct, the main novelty of the corollary is in the continuous case.
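A hands-on companion to this construction is inverse-transform sampling, which realizes a random variable with a prescribed distribution function from a single uniform variable (essentially the Skorohod representation mentioned elsewhere in these notes). The sketch below is illustrative code, with the exponential distribution chosen purely as an example of F; all names are ad hoc.

```python
import math
import random

def sample_from_cdf(inverse_cdf, rng, n):
    """Inverse-transform sampling: if U is uniform on (0,1), then
    F^{-1}(U) is a random variable with cumulative distribution F."""
    return [inverse_cdf(rng.random()) for _ in range(n)]

# Illustrative choice: the exponential CDF F(t) = 1 - exp(-t), whose
# generalized inverse is F^{-1}(u) = -log(1 - u).
rng = random.Random(42)
xs = sample_from_cdf(lambda u: -math.log(1.0 - u), rng, 100000)
mean = sum(xs) / len(xs)
print(abs(mean - 1.0) < 0.05)  # the exponential(1) distribution has mean 1
```

This is the continuous case where the corollary has its main content; for a discrete distribution the same recipe works with a step-function inverse.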

14 November, 2016 at 8:11 am

Anonymous: Also, as far as I can see from the statement of Corollary 27, there is nothing about “continuous” random variables. What definition is given for “continuous random variables”?

14 November, 2016 at 8:12 am

Anonymous: Similarly in Corollary 31. I think what you really want to say is “construction of random variables/vectors with the given distribution function ”?

15 January, 2017 at 12:35 pm

Anonymous: In Example 3, it is said that

The sample space is inadequate for modeling both the original die roll and the second die roll.

On the other hand, if one “only” wants to know the probability of the event that the roll is even, one can just use the old model instead of the new model to conclude that

.

In a more general case, if one assumes that the two rolls are “independent”, then the joint distribution is completely determined by P(X=i) and P(Y=j), the calculation of which can be done in the old model. Can one say that in this case one does not need to expand the sample space?

16 January, 2017 at 11:30 am

Anonymous: Sorry for the wrong question above.

In Example 3, it is said that “Now suppose that we wish to roll the die again to obtain a second random variable {Y}.” At this point, do we implicitly assume that {Y} is independent of the first roll, so that we can define the joint distribution later, or do we have to check by calculation of the joint distribution (via the definitions of the two rolls) that they are indeed independent?

17 January, 2017 at 10:00 am

Terence Tao: In this example, the die rolls are constructed to be independent through the choice of the probability distribution . (This is not stressed in the example, because the notion of independence is only discussed later in the post.) One could model correlated die rolls using the same set and random variables , , but with a different choice of probability distribution . For instance, if the second die roll is going to be equal to the first die roll with a probability of , and is a completely independent die roll otherwise, one could use the distribution instead.
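The correlated distribution described here can be written down explicitly on the two-roll sample space. Since the probability value in the comment was left unspecified, the sketch below (illustrative code, ad hoc names) uses p = 1/2: with probability p the second roll copies the first, otherwise it is an independent fair roll.

```python
from fractions import Fraction

# Correlated die rolls on the sample space {1..6}^2: with (illustrative)
# probability p the second roll copies the first, otherwise it is an
# independent fair roll.
p = Fraction(1, 2)   # placeholder value; the comment leaves p unspecified
mu = {(i, j): Fraction(1, 6) * (p * (i == j) + (1 - p) * Fraction(1, 6))
      for i in range(1, 7) for j in range(1, 7)}

total = sum(mu.values())
marginal_y = [sum(mu[(i, j)] for i in range(1, 7)) for j in range(1, 7)]
print(total == 1)                                    # True: a probability measure
print(all(m == Fraction(1, 6) for m in marginal_y))  # True: each roll is still fair
print(mu[(1, 1)] == mu[(1, 2)])                      # False: the rolls are correlated
```

The sample space and the random variables are unchanged; only the measure differs from the independent case, which is exactly the point of the comment.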

17 February, 2017 at 9:57 pm

254A, Notes 2: The central limit theorem | What's new[…] (iv) Show that converges in distribution to if and only if, after extending the probability space model if necessary, one can find copies and of and respectively such that converges almost surely to . (Hint: use the Skorohod representation, Exercise 29 of Notes 0.) […]

15 April, 2017 at 10:17 am

fsaymon: Dear Prof. Tao,

you wrote in an earlier post, on 14 November, 2016 at 4:21 pm, that

“As for the vector analogy, the vector space is always a set, but the individual vectors do not have to be sets (or arrays of numbers, or anything else for that matter).”

This completely confused me, as reviewing the possibilities of formalizing the concept of a vector space I can’t see any way to arrive at it in such a way that the vector space is constructed out of an “underlying” set (+ vector operations on it), but whose elements are not sets, or anything else.

My main point of confusion comes from the standard (ZFC induced) convention that everything in “standard” mathematics is a set, resp. can be constructed as one, even though we don’t think of it as a set (in a similar way in which we don’t think of a number as a set). Thus, there can’t be such a thing as a set whose elements are not sets.

Are you perhaps having a different, impure set theory in mind, such as the one you developed in your Analysis 1 book, where you have primitive notions of “sets” and “objects”?

Even though axiomatization models more closely how we think about mathematical objects such as numbers (which we informally don’t regard as sets), this would still be a non-standard set theory, that probably confuses beginners.

Or did you perhaps mean to perceive the theory of vector spaces in a way similar to how mathematical theories are approached in introductory books on mathematical logic? (I.e., a formal language is defined and axioms are stated, completely outside of any set theory.)

Of course, this precludes us from doing various things that are commonly done when developing linear algebra, such as considering maps between two different vector spaces (since, working at this very basic formal level, we are working outside a framework of set theory that permits us to consider mappings between vector spaces).

Could you please help me (and other readers) clarify this issue?

(And sorry in case I double posted, my account did some weird things the first time I posted this comment.)

16 April, 2017 at 7:29 pm

Terence Tao: It’s more accurate to say that everything in mathematics can be encoded as a set, as opposed to being a set. See for instance https://mathoverflow.net/a/255824/766 and https://mathoverflow.net/a/90945/766.

20 April, 2017 at 9:21 am

fsaymon: But you did not yet clarify how to encode vectors, if they are *not* supposed to be sets “or anything else”.

(Of course I agree with your remark; I used the word “constructed as a set” instead of “encoded”, which seems to convey the intention more clearly.)

20 April, 2017 at 2:27 pm

Terence Tao: I prefer to remain agnostic on this issue; one can encode all (or almost all) mathematical objects as sets, if one wishes, but it is also perfectly possible to work in formal systems (e.g. ZFC with urelements) in which one does not do this. Or one can adopt the practices of most working mathematicians and work with a reasoning system which is formalisable, rather than formal; one does not strictly adhere to any given formal system such as ZFC, but one structures one’s arguments in such a way that, if one chose, one could translate the arguments into such formal systems (at which point one would indeed be forced to encode various objects, such as vectors in a vector space, as sets, if one insisted on working strictly within ZFC).

For instance, take the standard vector space . We generally view elements of this vector space as -tuples of real numbers. Now, if one strictly wanted to adhere to ZFC, one would have to interpret such tuples as sets. One can do so by a variety of means (e.g. by viewing a -tuple as a function from to the reals, and then encoding functions as sets by viewing them as an ordered triple consisting of the domain, range, and graph of the function, and then using something like the Kuratowski pair construction to model the ordered triple as a set). But outside of foundational and model-theoretic considerations, these sorts of encodings play virtually no role in the linear algebra of , and I think it is much clearer conceptually to not try to fix any such encoding when actually doing linear algebra, though it certainly should be noted that these encodings exist should one care to use them.

18 March, 2019 at 7:04 am

Rith: Dear Professor Tao,

This post really deepened my understanding of probability theory. I just got a quick question about the notation used for the joint random variables. For this part you wrote

“Given a finite number of random variables taking values in ranges , we can form the joint random variable taking values in the Cartesian product by concatenation of the models, thus

”

I feel I don’t understand this notation well, and I’m wondering why you are not using the following notation:

As it seems to be a mapping from n outcomes to n numbers.

Thank you very much for the clarification!

18 March, 2019 at 9:10 am

Terence Tao: The random variables, as well as their concatenation, are all being modeled by the same sample space, the elements of which represent the outcomes of all conceivable measurements on the sample (which includes, but is not restricted to, the outcomes of the random variables in question).

Suppose for instance one is modeling the outcomes of three dice, e.g., a red die, a green die, and a blue die, with the outcome of the red die and the outcome of the green die. One possible sample space to model this is , with each element of this sample space describing the outcomes of all three dice. Then is the map that extracts the outcome of the red die, is the map that extracts the outcome of the green die, and is the map that extracts the outcome of the red and green dice jointly.
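The three-dice model can be spelled out in a few lines (an illustrative sketch; the names X, Y, and XY are ad hoc): each random variable is just a projection map on the common sample space, and the joint variable is the concatenation of the individual models.

```python
from itertools import product

# Sample space for three dice: each outcome records all three rolls.
omega = list(product(range(1, 7), repeat=3))

X = lambda w: w[0]            # extracts the outcome of the red die
Y = lambda w: w[1]            # extracts the outcome of the green die
XY = lambda w: (w[0], w[1])   # the joint random variable (X, Y)

w = (3, 5, 2)                  # one particular outcome
print(XY(w) == (X(w), Y(w)))   # True: concatenation of the individual models
print(len(omega))              # 216 outcomes in this model
```

All three maps are defined on the same `omega`; the blue die's roll `w[2]` is carried along by every outcome even though none of these variables reads it.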

Of course, if one is only interested in modeling the red and green dice and is totally uninterested in the blue die, one could also work instead with the smaller sample space that only describes the outcomes of the former two dice and ignores the third die. Or one could work with a much larger sample space incorporating lots of other measurements as well. The point of the probabilistic way of thinking, though, is that the choice of model is to a large extent irrelevant for one’s analysis.

18 March, 2019 at 1:55 pm

Rith: Thank you very much, Prof. Tao. This really helps.