Starting this week, I will be teaching an introductory graduate course (Math 275A) on probability theory here at UCLA. While I find myself *using* probabilistic methods routinely nowadays in my research (for instance, the probabilistic concept of Shannon entropy played a crucial role in my recent paper on the Chowla and Elliott conjectures, and random multiplicative functions similarly played a central role in the paper on the Erdos discrepancy problem), this will actually be the first time I will be *teaching* a course on probability itself (although I did give a course on random matrix theory some years ago that presumed familiarity with graduate-level probability theory). As such, I will be relying primarily on an existing textbook, in this case Durrett’s Probability: Theory and Examples. I still need to prepare lecture notes, though, and so I thought I would continue my practice of putting my notes online, although in this particular case they will be less detailed or complete than with other courses, as they will mostly be focusing on those topics that are not already comprehensively covered in the text of Durrett. Below the fold are my first such set of notes, concerning the classical measure-theoretic foundations of probability. (I wrote on these foundations also in this previous blog post, but in that post I already assumed that the reader was familiar with measure theory and basic probability, whereas in this course not every student will have a strong background in these areas.)

Note: as this set of notes is primarily concerned with foundational issues, it will contain a large number of pedantic (and nearly trivial) formalities and philosophical points. We dwell on these technicalities in this set of notes primarily so that they are out of the way in later notes, when we work with the actual mathematics of probability, rather than on the supporting foundations of that mathematics. In particular, the excessively formal and philosophical language in this set of notes will not be replicated in later notes.

** — 1. Some philosophical generalities — **

By default, mathematical reasoning is understood to take place in a *deterministic* mathematical universe. In such a universe, any given mathematical statement (that is to say, a sentence with no free variables) is either true or false, with no intermediate truth value available. Similarly, any deterministic variable can take on only one specific value at a time.

However, for a variety of reasons, both within pure mathematics and in the applications of mathematics to other disciplines, it is often desirable to have a rigorous mathematical framework in which one can discuss *non-deterministic* statements and variables – that is to say, statements which are not always true or always false, but in some intermediate state, or variables that do not take one particular value or another with definite certainty, but are again in some intermediate state. In probability theory, which is by far the most widely adopted mathematical framework to formally capture the concept of non-determinism, non-deterministic statements are referred to as *events*, and non-deterministic variables are referred to as *random variables*. In the standard foundations of probability theory, as laid out by Kolmogorov, we can then *model* these events and random variables by introducing a sample space (which will be given the structure of a probability space) to capture all the ambient sources of randomness; events are then modeled as measurable subsets of this sample space, and random variables are modeled as measurable functions on this sample space. (We will briefly discuss a more abstract way to set up probability theory, as well as other frameworks to capture non-determinism than classical probability theory, at the end of this set of notes; however, the rest of the course will be concerned exclusively with classical probability theory using the orthodox Kolmogorov models.)

Note carefully that sample spaces (and their attendant structures) will be used to *model* probabilistic concepts, rather than to actually *be* the concepts themselves. This distinction (a mathematical analogue of the map-territory distinction in philosophy) actually is implicit in much of modern mathematics, when we make a distinction between an abstract version of a mathematical object, and a concrete representation (or *model*) of that object. For instance:

- In linear algebra, we distinguish between an abstract vector space , and a concrete system of coordinates given by some basis of .
- In group theory, we distinguish between an abstract group , and a concrete representation of that group as isomorphisms on some space .
- In differential geometry, we distinguish between an abstract manifold , and a concrete atlas of coordinate systems that coordinatises that manifold.
- Though it is rarely mentioned explicitly, the abstract number systems such as are distinguished from the concrete numeral systems (e.g. the decimal or binary systems) that are used to represent them (this distinction is particularly useful to keep in mind when faced with the infamous identity , or when switching from one numeral representation system to another).

The distinction between abstract objects and concrete models can be fairly safely discarded if one is only going to use a single model for each abstract object, particularly if that model is “canonical” in some sense. However, one needs to keep the distinction in mind if one plans to switch between different models of a single object (e.g. to perform change of basis in linear algebra, change of coordinates in differential geometry, or base change in algebraic geometry). As it turns out, in probability theory it is often desirable to change the sample space model (for instance, one could *extend* the sample space by adding in new sources of randomness, or one could couple together two systems of random variables by *joining* their sample space models together). Because of this, we will take some care in this foundational set of notes to distinguish probabilistic concepts (such as events and random variables) from their sample space models. (But we may be more willing to conflate the two in later notes, once the foundational issues are out of the way.)

From a foundational point of view, it is often logical to begin with some axiomatic description of the abstract version of a mathematical object, and discuss the concrete representations of that object later; for instance, one could start with the axioms of an abstract group, and then later consider concrete representations of such a group by permutations, invertible linear transformations, and so forth. This approach is often employed in the more algebraic areas of mathematics. However, there are at least two other ways to present these concepts which can be preferable from a pedagogical point of view. One way is to start with the concrete representations as motivating examples, and only later give the abstract object that these representations are modeling; this is how linear algebra, for instance, is often taught at the undergraduate level, by starting first with , , and , and only later introducing the abstract vector spaces. Another way is to avoid the abstract objects altogether, and focus exclusively on concrete representations, but taking care to emphasise how these representations transform when one switches from one representation to another. For instance, in general relativity courses in undergraduate physics, it is not uncommon to see tensors presented purely through the concrete representation of coordinates indexed by multiple indices, with the transformation of such tensors under changes of variable carefully described; the abstract constructions of tensors and tensor spaces using operations such as tensor product and duality of vector spaces or vector bundles are often left to an advanced differential geometry class to set up properly.

The foundations of probability theory are usually presented (almost by default) using the last of the above three approaches; namely, one talks almost exclusively about sample space models for probabilistic concepts such as events and random variables, and only occasionally dwells on the need to extend or otherwise modify the sample space when one needs to introduce new sources of randomness (or to forget about some existing sources of randomness). However, much as in differential geometry one tends to work with manifolds without specifying any given atlas of coordinate charts, in probability one usually manipulates events and random variables without explicitly specifying any given sample space. For a student raised exclusively on concrete sample space foundations of probability, this can be a bit confusing, for instance it can give the misconception that any given random variable is somehow associated to its own unique sample space, with different random variables possibly living on different sample spaces, which often leads to nonsense when one then tries to combine those random variables together. Because of such confusions, we will try to take particular care in these notes to separate probabilistic concepts from their sample space models.

** — 2. A simple class of models: discrete probability spaces — **

The simplest models of probability theory are those generated by *discrete probability spaces*, which are adequate models for many applications (particularly in combinatorics and other areas of discrete mathematics), and which already capture much of the essence of probability theory while avoiding some of the finer measure-theoretic subtleties. We thus begin by considering discrete sample space models.

Definition 1 (Discrete probability theory)Adiscrete probability spaceis an at most countable set (whose elements will be referred to asoutcomes), together with a non-negative real number assigned to each outcome such that ; we refer to as theprobabilityof the outcome . The set itself, without the structure , is often referred to as thesample space, though we will often abuse notation by using the sample space to refer to the entire discrete probability space .In discrete probability theory, we choose an ambient discrete probability space as the randomness model. We then model

eventsby subsets of the sample space . Theprobabilityof an event is defined to be the quantitynote that this is a real number in the interval . An event is

surely trueor is thesure eventif , and issurely falseor is theempty eventif .We model random variables taking values in the range by functions from the sample space to the range . Random variables taking values in will be called

real random variablesorrandom real numbers. Similarly for random variables taking values in . We refer to real and complex random variables collectively asscalar random variables.We consider two events to be equal if they are modeled by the same set: . Similarly, two random variables taking values in a common range are considered to be equal if they are modeled by the same function: . In particular, if the discrete sample space is understood from context, we will usually abuse notation by identifying an event with its model , and similarly identify a random variable with its model .

Remark 2One can view classical (deterministic) mathematics as the special case of discrete probability theory in which is a singleton set (there is only one outcome ), and the probability assigned to the single outcome in is : . Then there are only two events (the surely true and surely false events), and a random variable in can be identified with a deterministic element of . Thus we can view probability theory as a generalisation of deterministic mathematics.

As discussed in the preceding section, the distinction between a collection of events and random variable and its models becomes important if one ever wishes to modify the sample space, and in particular to *extend* the sample space to a larger space that can accommodate new sources of randomness (an operation which we will define formally later, but which for now can be thought of as an analogue to change of basis in linear algebra, coordinate change in differential geometry, or base change in algebraic geometry). This is best illustrated with a simple example.

Example 3 (Extending the sample space)Suppose one wishes to model the outcome of rolling a single, unbiased six-sided die using discrete probability theory. One can do this by choosing the discrete proability space to be the six-element set , with each outcome given an equal probability of of occurring; this outcome may be interpreted as the state in which the die roll ended up being equal to . The outcome of rolling a die may then be identified with the identity function , defined by for . If we let be the event that the outcome of rolling the die is an even number, then with this model we have , andNow suppose that we wish to roll the die again to obtain a second random variable . The sample space is inadequate for modeling both the original die roll and the second die roll . To accommodate this new source of randomness, we can then move to the larger discrete probability space , where each outcome now having probability ; this outcome can be interpreted as the state in which the die roll ended up being , and the die roll ended up being . The random variable is now modeled by a new function defined by for ; the random variable is similarly modeled by the function defined by for . The event that is even is now modeled by the set

This set is distinct from the previous model of (for instance, has eighteen elements, whereas has just three), but the probability of is unchanged:

One can of course also combine together the random variables in various ways. For instance, the sum of the two die rolls is a random variable taking values in ; it cannot be modeled by the sample space , but in it is modeled by the function

Similarly, the event that the two die rolls are equal cannot be modeled by , but is modeled in by the set

and the probability of this event is

We thus see that extending the probability space has also enlarged the space of events one can consider, as well as the random variables one can define, but that existing events and random variables continue to be interpretable in the extended model, and that probabilistic concepts such as the probability of an event remain unchanged by the extension of the model.

The set-theoretic operations on the sample space induce similar boolean operations on events:

- The
*conjunction*of two events is defined through the intersection of their models: . - The
*disjunction*of two events is defined through the the union of their models: . - The
*symmetric difference*of two events is defined through the symmetric difference of their models: . - The
*complement*of an event is defined through the complement of their models: . - We say that one event is
*contained in*or*implies*another event , and write , if we have containment of their models: . We also write “ is true on ” synonymously with . - Two events are
*disjoint*if their conjunction is the empty event, or equivalently if their models are disjoint.

Thus, for instance, the conjunction of the event that a die roll is even, and that it is less than , is the event that the die roll is exactly . As before, we will usually be in a situation in which the sample space is clear from context, and in that case one can safely identify events with their models, and view the symbols and as being synonymous with their set-theoretic counterparts and (this is for instance what is done in Durrett).

With these operations, the space of all events (known as the *event space*) thus has the structure of a boolean algebra (defined below in Definition 4). We observe that the probability is *finitely additive* in the sense that

whenever are disjoint events; by induction this implies that

whenever are pairwise disjoint events. We have and , and more generally

for any event . We also have monotonicity: if , then .

Now we define operations on random variables. Whenever one has a function from one range to another , and a random variable taking values in , one can define a random variable taking values in by composing the relevant models:

thus maps to for any outcome . Given a finite number of random variables taking values in ranges , we can form the joint random variable taking values in the Cartesian product by concatenation of the models, thus

Combining these two operations, given any function of variables in ranges , and random variables taking values in respectively, we can form a random variable taking values in by the formula

Thus for instance we can add, subtract, or multiply two scalar random variables to obtain another scalar random variable.

A deterministic element of a range will (by abuse of notation) be identified with a random variable taking values in , whose model in is constant: for all . Thus for instance is a scalar random variable.

Given a relation on ranges , and random variables , we can define the event by setting

Thus for instance, for two real random variables , the event is modeled as

and the event is modeled as

At this point we encounter a slight notational conflict between the dual role of the equality symbol as a logical symbol and as a binary relation: we are interpreting both as an external equality relation between the two random variables (which is true iff the functions , are identical), and as an internal event (modeled by ). However, it is clear that is true in the external sense if and only if the internal event is surely true. As such, we shall abuse notation and continue to use the equality symbol for both the internal and external concepts of equality (and use the modifier “surely” for emphasis when referring to the external usage).

It is clear that any equational identity concerning functions or operations on deterministic variables implies the same identity (in the external, or surely true, sense) for random variables. For instance, the commutativity of addition for deterministic real numbers immediately implies the commutativity of addition: is surely true for real random variables ; similarly is surely true for all scalar random variables , etc.. We will freely apply the usual laws of algebra for scalar random variables without further comment.

Given an event , we can associate the indicator random variable (also written as in some texts) to be the unique real random variable such that when is true and when is false, thus is equal to when and otherwise. (The indicator random variable is sometimes called the *characteristic function* in analysis, and sometimes denoted instead of , but we avoid using the term “characteristic function” here, as it will have an unrelated but important meaning in probability theory.) We record the trivial but useful fact that Boolean operations on events correspond to arithmetic manipulations on their indicators. For instance, if are events, we have

and the inclusion-exclusion principle

In particular, if the events are disjoint, then

Also note that if and only if the assertion is surely true. We will use these identities and equivalences throughout the course without further comment.

Given a scalar random variable , we can attempt to define the expectation through the model by the formula

If the discrete sample space is finite, then this sum is always well-defined and so every scalar random variable has an expectation. If however the discrete sample space is infinite, the expectation may not be well defined. There are however two key cases in which one has a meaningful expectation. The first is if the random variable is *unsigned*, that is to say it takes values in the non-negative reals , or more generally in the extended non-negative real line . In that case, one can interpret the expectation as an element of . The other case is when the random variable is *absolutely integrable*, which means that the absolute value (which is an unsigned random variable) has finite expectation: . In that case, the series defining is absolutely convergent to a real or complex number (depending on whether was a real or complex random variable.)

We have the basic link

between probability and expectation, valid for any event . We also have the obvious, but fundamentally important, property of *linearity of expectation*: we have

and

whenever is a scalar and are scalar random variables, either under the assumption that are all unsigned, or that are absolutely integrable. Thus for instance by applying expectations to (1) we obtain the identity

We close this section by noting that discrete probabilistic models stumble when trying to model *continuous* random variables, which take on an uncountable number of values. Suppose for instance one wants to model a random real number drawn uniformly at random from the unit interval , which is an uncountable set. One would then expect, for any subinterval of , that will fall into this interval with probability . Setting (or, if one wishes instead, taking a limit such as ), we conclude in particular that for any real number in , that will equal with probability . If one attempted to model this situation by a discrete probability model, we would find that each outcome of the discrete sample space has to occur with probability (since for each , the random variable has only a single value ). But we are also requiring that the sum is equal to , a contradiction. In order to address this defect we must generalise from discrete models to more general probabilistic models, to which we now turn.

** — 3. The Kolmogorov foundations of probability theory — **

We now present the more general measure-theoretic foundation of Kolmogorov which subsumes the discrete theory, while also allowing one to model continuous random variables. It turns out that in order to perform sums, limits and integrals properly, the finite additivity property of probability needs to be amplified to *countable* additivity (but, as we shall see, *uncountable* additivity is too strong of a property to ask for).

We begin with the notion of a measurable space. (See also this previous blog post, which covers similar material from the perspective of a real analysis graduate class rather than a probability class.)

Definition 4 (Measurable space)Let be a set. A Boolean algebra in is a collection of subsets of which

- contains and ;
- is closed under pairwise unions and intersections (thus if , then and also lie in ); and
- is closed under complements (thus if , then also lies in .
(Note that some of these assumptions are redundant and can be dropped, thanks to de Morgan’s laws.) A -algebra in (also known as a -field) is a Boolean algebra in which is also

- closed under countable unions and countable intersections (thus if , then and ).
Again, thanks to de Morgan’s laws, one only needs to verify closure under just countable union (or just countable intersection) in order to verify that a Boolean algebra is a -algebra. A

measurable spaceis a pair , where is a set and is a -algebra in . Elements of are referred to asmeasurable setsin this measurable space.If are two -algebras in , we say that is

coarser than(or isfiner than) if , thus every set that is measurable in is also measurable in .

Example 5 (Trivial measurable space)Given any set , the collection is a -algebra; in fact it is the coarsest -algebra one can place on . We refer to as thetrivialmeasurable space on .

Example 6 (Discrete measurable space)At the other extreme, given any set , the power set is a -algebra (and is the finest -algebra one can place on ). We refer to as thediscretemeasurable space on .

Example 7 (Atomic measurable spaces)Suppose we have a partition of a set into disjoint subsets (which we will callatoms), indexed by some label set (which may be finite, countable, or uncountable). Such a partition defines a -algebra on , consisting of all sets of the form for subsets of (we allow to be empty); thus a set is measurable here if and only if it can be described as a union of atoms. One can easily verify that this is indeed a -algebra. The trivial and discrete measurable spaces in the preceding two examples are special cases of this atomic construction, corresponding to the trivial partition (in which there is just one atom ) and the discrete partition (in which the atoms are individual points in ).

Example 8Let be an uncountable set, and let be the collection of sets in which are either at most countable, or are cocountable (their complement is at most countable). Show that this is a -algebra on which is non-atomic (i.e. it is not of the form of the preceding example).

Example 9 (Generated measurable spaces)It is easy to see that if one has a non-empty family of -algebras on a set , then their intersection is also a -algebra, even if is uncountably infinite. Because of this, whenever one has an arbitrary collection of subsets in , one can define the -algebrageneratedby to be the intersection of all the -algebras that contain (note that there is always at least one -algebra participating in this intersection, namely the discrete -algebra). Equivalently, is the coarsest -algebra that views every set in as being measurable. (This is a rather indirect way to describe , as it does not make it easy to figure out exactly what sets lie in . There is a more direct description of this -algebra, but it requires the use of the first uncountable ordinal; see Exercise 15 of these notes.) In Durrett, the notation is used in place of .

Example 10 (Borel -algebra)Let be a topological space; to avoid pathologies let us assume that is locally compact Hausdorff and -compact, though the definition below also can be made for more general spaces. For instance, one could take or for some finite . We define the Borel -algebra on to be the -algebra generated by the open sets of . (Due to our topological hypotheses on , the Borel -algebra is also generated by the compact sets of .) Measurable subsets in the Borel -algebra are known asBorel sets. Thus for instance open and closed sets are Borel, and countable unions and countable intersections of Borel sets are Borel. In fact, as a rule of thumb, any subset of or that arises from a “non-pathological” construction (not using the axiom of choice, or from a deliberate attempt to build a non-Borel set) can be expected to be a Borel set. Nevertheless, non-Borel sets exist in abundance if one looks hard enough for them, even without the axiom of choice; see for instance Exercise 16 of this previous blog post.

The following exercise gives a useful tool (somewhat analogous to mathematical induction) to verify properties regarding measurable sets in generated -algebras, such as Borel -algebras.

Exercise 11Let be a collection of subsets of a set , and let be a property of subsets of (thus is true or false for each ). Assume the following axioms:

- is true.
- is true for all .
- If is such that is true, then is also true.
- If are such that is true for all , then is true.
Show that is true for all . (

Hint:what can one say about ?)

Thus, for instance, if a property of subsets of is true for all open sets, and is closed under countable unions and complements, then it is automatically true for all Borel sets.

Example 12 (Pullback)Let be a measurable space, and let be any function from another set to . Then we can define thepullbackof the -algebra to be the collection of all subsets in that are of the form for some . This is easily verified to be a -algebra. We refer to the measurable space as the pullback of the measurable space by . Thus for instance an atomic measurable space on generated by a partition is the pullback of (viewed as a discrete measurable space) by the “colouring” map from to that sends each element of to for all .

Remark 13In probabilistic terms, one can interpret the space in the above construction as a sample space, and the function as some collection of “random variables” or “measurements” on that space, with being all the possible outcomes of these measurements. The pullback then represents all the “information” one can extract from that given set of measurements.

Example 14 (Product space)Let be a family of measurable spaces indexed by a (possibly infinite or uncountable) set . We define the product on the Cartesian product space by defining to be the -algebra generated by the basic cylinder sets of the formfor and . For instance, given two measurable spaces and , the product -algebra is generated by the sets and for . (One can also show that is the -algebra generated by the products for , but this observation does not extend to uncountable products of measurable spaces.)

Exercise 15Show that with the Borel -algebra is the product of copies of with the Borel -algebra.

As with almost any other notion of space in mathematics, there is a natural notion of a map (or *morphism*) between measurable spaces.

Definition 16A function between two measurable spaces , is said to bemeasurableif one has for all .

Thus for instance the pullback of a measurable space by a map could alternatively be defined as the coarsest measurable space structure on for which is still measurable. It is clear that the composition of measurable functions is also measurable.

Exercise 17Show that any continuous map from one topological space to another is necessarily measurable (when one gives and the Borel -algebras).

Exercise 18If are measurable functions into measurable spaces , show that the joint function into the product space defined by is also measurable.

As a corollary of the above exercise, we see that if are measurable, and is measurable, then is also measurable. In particular, if or are scalar measurable functions, then so are , , , etc..

Next, we turn measurable spaces into measure spaces by adding a measure.

Definition 19 (Measure spaces)Let be a measurable space. Afinitely additive measureon this space is a map obeying the following axioms:

- (Empty set) .
- (Finite additivity) If are disjoint, then .
A countably additive measure is a finitely additive measure obeying the following additional axiom:

- (Countable additivity) If are disjoint, then .
A probability measure on is a countably additive measure obeying the following additional axiom:

- (Unit total probability) .
A measure space is a triplet where is a measurable space and is a measure on that space. If is furthermore a probability measure, we call a probability space.

A set of measure zero is known as a

null set. A property that holds for all outside of a null set is said to holdalmost everywhereorfor almost every.

Example 20 (Discrete probability measures)Let be a discrete measurable space, and for each , let be a non-negative real number such that . (Note that this implies that there are at most countably many for which – why?.) Then one can form a probability measure on by definingfor all .

Example 21 (Lebesgue measure)Let be given the Borel -algebra. Then it turns out there is a unique measure on , known as Lebesgue measure (or more precisely, the restriction of Lebesgue measure to the Borel -algebra) such that for every closed interval with (this is also true if one uses open intervals or half-open intervals in place of closed intervals). More generally, there is a unique measure on for any natural number , also known as Lebesgue measure, such thatfor all closed boxes , that is to say products of closed intervals. The construction of Lebesgue measure is a little tricky; see this previous blog post for details.

We can then set up general probability theory similarly to how we set up discrete probability theory:

Definition 22 (Probability theory)In probability theory, we choose an ambient probability space as the randomness model, and refer to the set (without the additional structures , ) as thesample spacefor that model. We then model aneventby an element of -algebra , with each such element describing an event. Theprobabilityof an event is defined to be the quantityAn event is

surely trueor is thesure eventif , and issurely falseor is theempty eventif . It isalmost surely trueor analmost sure eventif , andalmost surely falseor anull eventif .We model random variables taking values in the range by measurable functions from the sample space to the range . We define real, complex, and scalar random variables as in the discrete case.

As in the discrete case, we consider two events to be equal if they are modeled by the same set: . Similarly, two random variables taking values in a common range are considered to be equal if they are modeled by the same function: . Again, if the sample space is understood from context, we will usually abuse notation by identifying an event with its model , and similarly identify a random variable with its model .

As in the discrete case, set-theoretic operations on the sample space induce similar boolean operations on events. Furthermore, since the -algebra is closed under countable unions and countable intersections, we may similarly define the countable conjunction or countable disjunction of a sequence of events; however, we do *not* define uncountable conjunctions or disjunctions as these may not be well-defined as events.

The axioms of a probability space then yield the Kolmogorov axioms for probability:

- .
- .
- If are disjoint events, then .

We can manipulate random variables just as in the discrete case, with the only caveat being that we have to restrict attention to *measurable* operations. For instance, if is a random variable taking values in a measurable space , and is a measurable map, then is well defined as a random variable taking values in . Similarly, if is a measurable map and are random variables taking values in respectively, then is a random variable taking values in . Similarly we can create events out of *measurable* relations (giving the boolean range the discrete -algebra, of course). Finally, we continue to view deterministic elements of a space as a special case of a random element of , and associate the indicator random variable to any event as before.

We say that two random variables *agree almost surely* if the event is almost surely true; this is an equivalence relation. In many cases we are willing to consider random variables up to almost sure equivalence. In particular, we can generalise the notion of a random variable slightly by considering random variables whose models are only defined almost surely, i.e. their domain is not all of , but instead with a set of measure zero removed. This is, technically, not a random variable as we have defined it, but it can be associated canonically with an equivalence class of random variables up to almost sure equivalence, and so we view such objects as random variables “up to almost sure equivalence”. Similarly, we declare two events and *almost surely equivalent* if their symmetric difference is a null event, and will often consider events up to almost sure equivalence only.

We record some simple consequences of the measure-theoretic axioms:

Exercise 23Let be a measure space.

- (Monotonicity) If are measurable, then .
- (Subadditivity) If are measurable (not necessarily disjoint), then .
- (Continuity from below) If are measurable, then .
- (Continuity from above) If are measurable and is finite, then . Give a counterexample to show that the claim can fail when is infinite.

Of course, these measure-theoretic facts immediately imply their probabilistic counterparts (and the pesky hypothesis that is finite is automatic and can thus be dropped):

- (Monotonicity) If are events, then . (In particular, for any event .)
- (Subadditivity) If are events (not necessarily disjoint), then .
- (Continuity from below) If are events, then .
- (Continuity from above) If is events, then .

Note that if a countable sequence of events each hold almost surely, then their conjunction does as well (by applying subadditivity to the complementary events . As a general rule of thumb, the notion of “almost surely” behaves like “surely” as long as one only performs an at most countable number of operations (which already suffices for a large portion of analysis, such as taking limits or performing infinite sums).

Exercise 24Let be a measurable space.

- If is a function taking values in the extended reals , show that is measurable (giving the Borel -algebra) if and only if the sets are measurable for all real .
- If are functions, show that if and only if for all reals .
- If are measurable, show that , , , and are all measurable.

Remark 25Occasionally, there is need to consider uncountable suprema or infima, e.g. . It is then no longer automatically the case that such an uncountable supremum or infimum of measurable functions is again measurable. However, in practice one can avoid this issue by carefully rewriting such uncountable suprema or infima in terms of countable ones. For instance, if it is known that depends continuously on for each , then , and so measurability is not an issue.

Using the above exercise, when given a sequence of random variables taking values in the extended real line , we can define the random variables , , , which also take values in the extended real line, and which obey relations such as

for any real number .

We now say that a sequence of random variables in the extended real line converges almost surely if one has

almost surely, in which case we can define the limit (up to almost sure equivalence) as

This corresponds closely to the concept of almost everywhere convergence in measure theory, which is a slightly weaker notion than pointwise convergence which allows for bad behaviour on a set of measure zero. (See this previous blog post for more discussion on different notions of convergence of measurable functions.)

We will defer the general construction of expectation of a random variable to the next set of notes, where we review the notion of integration on a measure space. For now, we quickly review the basic construction of continuous scalar random variables.

Exercise 26Let be a probability measure on the real line (with the Borel -algebra). Define theStieltjes measure functionassociated to by the formulaEstablish the following properties of :

- (i) is non-decreasing.
- (ii) and .
- (iii) is right-continuous, thus for all .

There is a somewhat difficult converse to this exercise: if is a function obeying the above three properties, then there is a unique probability measure on (the Lebesgue-Stieltjes measure associated to ) for which is the Stieltjes measure function. See Section 3 of this previous post for details. As a consequence of this, we have

Corollary 27 (Construction of a single continuous random variable)Let be a function obeying the properties (i)-(iii) of the above exercise. Then, by using a suitable probability space model, we can construct a real random variable with the property thatfor all .

Indeed, we can take the probability space to be with the Borel -algebra and the Lebesgue-Stieltjes measure associated to . This corollary is not fully satisfactory, because often we may already have chosen a probability space to model some other random variables, and the probability space provided by this corollary may be completely unrelated to the one used. We can resolve these issues with product measures and other joinings, but this will be deferred to a later set of notes.

Define the cumulative distribution function of a real random variable to be the function

Thus we see that cumulative distribution functions obey the properties (i)-(iii) above, and conversely any function with those properties is the cumulative distribution function of some real random variable. We say that two real random variables (possibly on different sample spaces) *agree in distribution* if they have the same cumulative distribution function. One can therefore define a real random variable, up to agreement in distribution, by specifying the cumulative distribution function. See Durrett for some standard real distributions (uniform, normal, geometric, etc.) that one can define in this fashion.

Exercise 28Let be a real random variable with cumulative distribution function . For any real number , show thatand

In particular, one has for all if and only if is continuous.

Note in particular that this illustrates the distinction between almost sure and sure events: if has a continuous cumulative distribution function, and is a real number, then is almost surely false, but it does not have to be surely false. (Indeed, if one takes the sample space to be and to be the identity function, then will not be surely false.) On the other hand, the fact that is equal to *some* real number is of course surely true. The reason these statements are consistent with each other is that there are uncountably many real numbers . (Countable additivity tells us that a countable disjunction of null events is still null, but says nothing about uncountable disjunctions.)

Exercise 29 (Skorokhod representation of scalar variables)Let be a uniform random variable taking values in (thus has cumulative distribution function ), and let be another cumulative distribution function. Show that the random variablesand

are indeed random variables (that is to say, they are measurable in any given model ), and have cumulative distribution function . (This construction is attributed to Skorokhod, but it should not be confused with the Skorokhod representation theorem. It provides a quick way to generate a single scalar variable, but unfortunately it is difficult to modify this construction to generate multiple scalar variables, especially if they are somehow coupled to each other.)

There is a multidimensional analogue of the above theory, which is almost identical, except that the monotonicity property has to be strengthened:

Exercise 30Let be a probability measure on (with the Borel -algebra). Define theStieltjes measure functionassociated to by the formulaEstablish the following properties of :

- (i) is non-decreasing: whenever for all .
- (ii) and .
- (iii) is right-continuous, thus for all , where the superscript denotes that we restrict each to be greater than or equal to .
- (iv) One has
whenever are real numbers for . (

Hint:try to express the measure of a box with respect to in terms of the Stieltjes measure function .)

Again, there is a difficult converse to this exercise: if is a function obeying the above four properties, then there is a unique probability measure on for which is the Stieltjes measure function. See Durrett for details; one can also modify the arguments in this previous post. In particular, we have

Corollary 31 (Construction of several continuous random variables)Let be a function obeying the properties (i)-(iv) of the above exercise. Then, by using a suitable probability space model, we can construct real random variables with the property thatfor all .

Again, this corollary is not completely satisfactory because the probability space produced by it (which one can take to be with the Borel -algebra and the Lebesgue-Stieltjes measure on ) may not be the probability space one wants to use; we will return to this point later.

** — 4. Variants of the standard foundations (optional) — **

We have focused on the orthodox foundations of probability theory in which we model events and random variables through probability spaces. In this section, we briefly discuss some alternate ways to set up the foundations, as well as alternatives to probability theory itself. (Actually, many of the basic objects and concepts in mathematics have multiple such foundations; see for instance this blog post exploring the many different ways to define the notion of a group.) We mention them here in order exclude them from discussion in subsequent notes, which will be focused almost exclusively on orthodox probability.

One approach to the foundations of probability is to view the event space as an *abstract* -algebra – a collection of abstract objects with operations such as and (and and ) that obey a number of axioms; see this previous post for a formal definition. The probability map can then be viewed as an abstract probability measure on , that is to say a map from to that obeys the Kolmogorov axioms. Random variables taking values in a measurable space can be identified with their pullback map , which is the morphism of (abstract) -algebras that sends a measurable set to the event in ; with some care one can then redefine all the operations in previous sections (e.g. applying a measurable map to a random variable taking values in to obtain a random variable taking values in ) in terms of this pullback map, allowing one to define random variables satisfactorily in this abstract setting. The probability space models discussed above can then be viewed as *representations* of abstract probability spaces by concrete ones. It turns out that (up to null events) any abstract probability space can be represented by a concrete one, a result known as the *Loomis-Sikorski theorem*; see this previous post for details.

Another, related, approach is to start not with the event space, but with the space of scalar random variables, and more specifically with the space of *almost surely bounded* scalar random variables (thus, there is a deterministic scalar such that almost surely). It turns out that this space has the structure of a commutative tracial (abstract) von Neumann algebra. Conversely, starting from a commutative tracial von Neumann algebra one can form an abstract probability space (using the idempotent elements of the algebra as the events), and thus represent this algebra (up to null events) by a concrete probability space. This particular choice of probabilistic foundations is particularly convenient when one wishes to generalise classical probability to *noncommutative* probability, as this is simply a matter of dropping the axiom that the von Neumann algebra is commutative. This leads in particular to the subjects of quantum probability and free probability, which are generalisations of classical probability that are beyond the scope of this course (but see this blog post for an introduction to the latter, and this previous post for an abstract algebraic description of a probability space).

It is also possible to model continuous probability via a nonstandard version of discrete probability (or even finite probability), which removes some of the technicalities of measure theory at the cost of replacing them with the formalism of nonstandard analysis instead. This approach was pioneered by Ed Nelson, but will not be discussed further here. (See also these previous posts on the Loeb measure construction, which is a closely related way to combine the power of measure theory with the conveniences of nonstandard analysis.)

One can generalise the traditional, countably additive, form of probability by replacing countable additivity with finite additivity, but then one loses much of the ability to take limits or infinite sums, which reduces the amount of analysis one can perform in this setting. Still, finite additivity is good enough for many applications, particularly in discrete mathematics. An even broader generalisation is that of *qualitative* probability, in which events that are neither almost surely true or almost surely false are not assigned any specific numerical probability between or , but are simply assigned a symbol such as to indicate their indeterminate status; see this previous blog post for this generalisation, which can for instance be used to view the concept of a “generic point” in algebraic geometry or metric space topology in probabilistic terms.

There have been multiple attempts to move more radically beyond the paradigm of probability theory and its relatives as discussed above, in order to more accurately capture mathematically the concept of non-determinism. One family of approaches is based on replacing deterministic *logic* by some sort of probabilistic logic; another is based on allowing several parameters in one’s model to be unknown (as opposed to being probabilistic random variables), leading to the area of uncertainty quantification. These topics are well beyond the scope of this course.

## 118 comments

Comments feed for this article

14 June, 2019 at 7:59 pm

Rajeeva KarandikarDear Prof Tao,

This is regarding your comment about finitely additive probability theory : “One can generalise the traditional, countably additive, form of probability by replacing countable additivity with finite additivity, but then one loses much of the ability to take limits or infinite sums, which reduces the amount of analysis one can perform in this setting. Still, finite additivity is good enough for many applications, particularly in discrete mathematics.”

Well. Lots of limit theory can be done in finitely additive setting. Long ago, I had proven a meta-theorem: Almost all limit theorems that are true in countably additive setup are also true in finitely additive setup:

For the case of a sequence of independent random variables- see:

A General Principle For Limit Theorems In Finitely Additive Probability

Transactions Of The American Mathematical Society Volume 273, Number 2. October 1982

Click to access S0002-9947-1982-0667159-6.pdf

and for dependent case:

A general principle for limit theorems in finitely additive probability: The dependent case

Journal of Multivariate Analysis

Volume 24, Issue 2, February 1988, Pages 189-206

https://doi.org/10.1016/0047-259X(88)90035-8

30 August, 2019 at 8:36 am

QuentinDear Professor Tao,

Regarding Exercise 29, let be the sample space and suppose that there exists such that . Is this possibility allowed by the definition of ? Isn’t undefined?

Thanks,

Quentin

30 August, 2019 at 10:35 am

Terence TaoThis possibility can only occur with probability 0, so this is not a major issue (all the usual concepts of probability continue to hold for random variables that are only defined almost surely, as long as one only works with at most countably many such random variables at once; cf. the discussion preceding Exercise 23).

12 November, 2019 at 6:47 pm

254A, Notes 9 – second moment and entropy methods | What's new[…] space , but this distinction will not be so important in these notes and so we shall ignore it. See this previous set of notes for more […]

30 December, 2019 at 2:06 pm

AnonymousDear Prof. Tao:

May I ask you if this course notes together with Durrett’s book are suitable for someone interested in learning probability theory and that also read your books on analysis?

31 December, 2019 at 6:03 pm

Terence TaoThese are notes for a first graduate course in probability; it would be expected that the students would already have been exposed to an undergraduate course in probability (similar to how students in a first course in real analysis would have been expected to already been exposed to single and multivariable calculus).

26 February, 2020 at 8:32 pm

AnonymousThe link to Durrett’s book at the top of the post is now dead, but this Wayback link works: https://web.archive.org/web/20141030190358if_/http://www.math.duke.edu:80/~rtd/PTE/PTE4_1.pdf

Can I ask if you like Rényi’s books “Foundations of Probability” and “Probability Theory”?

26 February, 2020 at 2:11 am

יומיות 03.10.2015: ספרי המדב”פ הטובים ביותר של אוקטובר, משחק ברווז וערב מפתחי משחקי מחשב בהאקרספייס, טרילוגיית קוטל המלכים, פורומים לקוסמים וע[…] טאו, המתמטיקאי החביב על הבלוג, מעביר קורס חדש: הבסיס של תורת ההסתברות. כרגיל אצל טאו, טקסט מלא של כל ההרצאות מופיע אצלו בבלוג. […]

29 February, 2020 at 8:29 am

AnonymousDear Prof Tao,

In example 14, “product sigma-algebra B1xB2 is generated by E1xR2 and R1xE2”, I have difficulty understanding which set in B1 is E1 referring to since B1 contains many sets, if E1 means taking each set in B1 consecutively, wouldn’t it mean the product sigma-algebra is generated by R1xR2? Thank you.

29 February, 2020 at 9:04 am

AnonymousSorry, I realize the problem of my last question, and figure out what is E1, please ignore the question and thank you for the wonderful notes.

27 March, 2020 at 3:37 pm

AnonymousIn the world of mathematics, there seems to be no difference between the notion of “Bayesian probability” and that of “frequentist probability”, which are very different in the realm of statistics. Are there axioms for these two notions? Or are they really the same thing in mathematics?

6 October, 2020 at 8:45 am

adityaguharoyLet me reply on behalf.

So, let’s address the questions one by one.

1. What is a “formal definition” of a probability ?

It comes from measure theory : a probability (or a probability measure) is a normalized finite measure.

The first definition of probability people (including myself) hear is usually the axiomatic definition named after Kolmogorov. But these two definitions (i.e., Kolmogorov’s axiomatic definition and the above mentioned measure theoretic definition) are exactly the same.

2. Do Bayesian and / or frequentists use a different definition of probability ?

NO. What they call a probability is also a normalized finite measure.

3.Do Bayesian and frequentists have different opinions on the notion of probability ?

YES. While the formal definition of probability remains the same always, the notion is slightly different. Frequentists believe in the long term behaviour to shape their probabilities. So, a frequentist would say that a fair coin has probability of showing heads = 0.5, because if you toss it many times, you’ll get roughly equal number of heads as the number of tails.

Bayesian statistics assumes a prior distribution of the parameters. So, for instance if you have a coin and you don’t know the probability that it will shows heads, then you may assume that the probability itself follows some distribution (this may be easier to understand if you consider several parallel events and the probabilities being sampled). This distribution depends on several factors (but I will not discuss about them over here).

Hope this resolves your doubt.

6 December, 2020 at 10:53 pm

AnonymousIn the definition of indicator random variable, how to derive “E is true ?” I cannot find the definition of “E is true”, it seems not the same as “E is surely true” in Definition 1, thanks.

[This is done in the fifth bullet point after Example 3: the statement “ is true when is true” is synonymous with “ is true when $\omega \in E_\Omega$” for all models and all . -T]7 December, 2020 at 6:26 pm

AnonymousThank you!

23 December, 2020 at 8:05 am

AnonymousIn Definition 1,

whatare (definiton of) “events” and ? It is clear that they aremodeledby “sets” and . Is it intentional that they are somehow agnostic in this definition?In the abstract-vs-models examples at the beginning of this text, the

abstractnotions of vector spaces, groups, and manifolds are all “sets” (together with some other structure): the vector space , the group , and the manifold . Are “events” in Definition 1 sets as well?23 December, 2020 at 6:45 pm

Terence TaoYes, this is intentionally agnostic. Depending on one’s personal taste and the application at hand, one could think of an event in many different ways, such as

(1) A measurable subset of a sample space (identifying an event with its representative in some designated probability space model );

(2) An element of an abstract sigma-complete Boolean algebra (which is equipped with a countably additive probability measure);

(3) An idempotent in an abstract commutative tracial von Neumann algebra (this is the noncommutative probability perspective);

(4) (if one has a sufficiently informal metamathematics) actual potentially realisable events in the physical universe;

(5) etc.

Only interpretation (1) actually views an event as a set; interpretations (2), (3) view events instead as elements of various abstract spaces. The spaces themselves are sets (with additional structures), but the events need not be. (To connect with the examples at the beginning of the text, a single event would be analogous to a single vector in an abstract vector space, a single group element of an abstract group, or a single point in an abstract manifold. These objects are typically not viewed primarily as sets, though in a pure set theoretic universe they could be modeled as such.) All of the above viewpoints are useful and so I choose to remain agnostic regarding which one is to be “preferred”; any interpretation is viable so long as it can be modeled by probability spaces in the manner indicated in this blog post. In particular if one wishes to avoid philosophical and foundational issues altogether one can simply go with interpretation (1) which is the route taken in most probability texts, though if one does so one can encounter conceptual difficulties whenever one wishes to extend the probability space to add more sources of randomness, as discussed in this post. (The presentation given in this post is probably most compatible with interpretation (2); presentations in other texts may be optimised for other interpretations.)

See also the discussion in Section 4 of this post.

23 December, 2020 at 9:23 am

Anonymous(I) On the one hand, in example 3, it is said that

… the sum of the two die rolls is a random variable taking values in ; it cannot be modeled by the sample space , but in it is modeled by the function

(II) On the other hand, it is said after example 3 that

Combining these two operations, given any function of variables in ranges , and random variables taking values in respectively, we can form a random variable taking values in by the formula

Thus for instance we can add, subtract, or multiply two scalar random variables to obtain another scalar random variable.

As a special case where in (II), can’t we model in (I) by with ? Why is necessary in (I) but not in (II)?

23 December, 2020 at 4:46 pm

AnonymousOkay, this question seems to have been answered in this thread:

https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/#comment-474361

(II)

assumesthat has been sufficient to model and together, while in (I) is not sufficient.24 December, 2020 at 6:15 pm

Ed TamHi Terry,

In Exercise 18, I believe there is a small typo: it should be “R_n” instead of “R_m”.

Thanks!

[Corrected, thanks – T.]21 March, 2022 at 5:59 am

AnonymousIn Example 3, consider the event being and . Then this event can be only modeled in , not .

By counting in , one has .

On the other hand, the probability can be “calculated/interpreted” as , where the right-hand side involves only events in the “smaller” model . In this particular case, one finds the probability of an event that can be only modeled in using only the information of the smaller model .

Is this coincidence?

21 March, 2022 at 7:42 am

Terence TaoThis occurs because we are modeling independent identically distributed variables (iid) in this example. If one is modeling variables that are independent but not identically distributed (e.g., rolling a 6-sided die and then an 8-sided die), then one can model the joint variables by the product of several different probability spaces, each of which models one of the variables, but one cannot reduce all the probability computations to that of a single random variable as in the iid case.

17 April, 2022 at 11:56 pm

Aditya Guha RoyIn Exercise 11, in should be I think

This is a beautiful result that captures the good-set principle which as far as I have seen is one of the most widely used proving techniques in measure theory, and which in my opinion is a broader form of the well-ordering principle one uses in number theory (the connection is that here we can speak about smallest sigma algebra containing a collection of subsets of and in number theory one utilizes the existence of a smallest element of every non-empty subset of the natural numbers).