— 1. Product measures —
It is intuitively obvious that Lebesgue measure $m^{d_1+d_2}$ on ${\bf R}^{d_1+d_2}$ ought to be related to the Lebesgue measures $m^{d_1}$, $m^{d_2}$ on ${\bf R}^{d_1}$, ${\bf R}^{d_2}$ by the relationship

$\displaystyle m^{d_1+d_2}(E_1 \times E_2) = m^{d_1}(E_1) m^{d_2}(E_2)$

for any Borel sets $E_1 \subset {\bf R}^{d_1}$, $E_2 \subset {\bf R}^{d_2}$. This is in fact true (see Exercise 4 below), and is part of a more general phenomenon, which we phrase here in the case of probability measures:
Theorem 1 (Product of two probability spaces) Let $(X, {\mathcal X}, \mu)$ and $(Y, {\mathcal Y}, \nu)$ be probability spaces. Then there is a unique probability measure $\mu \times \nu$ on the product space $(X \times Y, {\mathcal X} \times {\mathcal Y})$ with the property that

$\displaystyle (\mu \times \nu)(E \times F) = \mu(E) \nu(F)$

for all $E \in {\mathcal X}$ and $F \in {\mathcal Y}$. Furthermore, we have the following two facts:
- (Tonelli theorem) If $f: X \times Y \rightarrow [0,+\infty]$ is measurable, then for each $x \in X$, the function $y \mapsto f(x,y)$ is measurable on $Y$, and the function $x \mapsto \int_Y f(x,y)\ d\nu(y)$ is measurable on $X$. Similarly, for each $y \in Y$, the function $x \mapsto f(x,y)$ is measurable on $X$ and $y \mapsto \int_X f(x,y)\ d\mu(x)$ is measurable on $Y$. Finally, we have

$\displaystyle \int_{X \times Y} f\ d(\mu \times \nu) = \int_X \left(\int_Y f(x,y)\ d\nu(y)\right) d\mu(x) = \int_Y \left(\int_X f(x,y)\ d\mu(x)\right) d\nu(y).$
- (Fubini theorem) If $f: X \times Y \rightarrow {\bf C}$ is absolutely integrable, then for $\mu$-almost every $x \in X$, the function $y \mapsto f(x,y)$ is absolutely integrable on $Y$, and the function $x \mapsto \int_Y f(x,y)\ d\nu(y)$ is absolutely integrable on $X$. Similarly, for $\nu$-almost every $y \in Y$, the function $x \mapsto f(x,y)$ is absolutely integrable on $X$ and $y \mapsto \int_X f(x,y)\ d\mu(x)$ is absolutely integrable on $Y$. Finally, we have

$\displaystyle \int_{X \times Y} f\ d(\mu \times \nu) = \int_X \left(\int_Y f(x,y)\ d\nu(y)\right) d\mu(x) = \int_Y \left(\int_X f(x,y)\ d\mu(x)\right) d\nu(y).$
The Fubini and Tonelli theorems are often used together (so much so that one may refer to them as a single theorem, the Fubini-Tonelli theorem, often also just referred to as Fubini’s theorem in the literature). For instance, given an absolutely integrable function $f: X \rightarrow {\bf C}$ and an absolutely integrable function $g: Y \rightarrow {\bf C}$, the Tonelli theorem tells us that the tensor product $f \otimes g: X \times Y \rightarrow {\bf C}$ defined by

$\displaystyle (f \otimes g)(x,y) := f(x) g(y)$

for $(x,y) \in X \times Y$, is absolutely integrable and one has the factorisation

$\displaystyle \int_{X \times Y} f \otimes g\ d(\mu \times \nu) = \left(\int_X f\ d\mu\right) \left(\int_Y g\ d\nu\right).$
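These identities can be checked concretely on finite probability spaces, where every integral is a finite sum. The following sketch (the weights and function values are arbitrary choices of ours, not from the text) verifies both the equality of the iterated integrals and the factorisation of the tensor product:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two finite probability spaces: point masses with weights summing to 1.
mu = rng.random(5); mu /= mu.sum()
nu = rng.random(7); nu /= nu.sum()

# An arbitrary f on the product (everything is trivially absolutely integrable here).
f = rng.standard_normal((5, 7))

# Iterated integrals in both orders, and the integral against the product measure.
int_x_then_y = np.sum(nu * np.sum(mu[:, None] * f, axis=0))
int_y_then_x = np.sum(mu * np.sum(nu[None, :] * f, axis=1))
joint = np.sum(np.outer(mu, nu) * f)
assert abs(int_x_then_y - joint) < 1e-12 and abs(int_y_then_x - joint) < 1e-12

# Tensor product g(x) h(y): its integral factorises into the two marginal integrals.
g = rng.standard_normal(5); h = rng.standard_normal(7)
lhs = np.sum(np.outer(mu, nu) * np.outer(g, h))
rhs = np.sum(mu * g) * np.sum(nu * h)
assert abs(lhs - rhs) < 1e-12
```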
Our proof of Theorem 1 will be based on the monotone class lemma, which allows one to conveniently generate a $\sigma$-algebra from a Boolean algebra. (In Durrett, the closely related $\pi$-$\lambda$ theorem is used in place of the monotone class lemma.) Define a monotone class in a set $X$ to be a collection ${\mathcal M}$ of subsets of $X$ with the following two closure properties:

- If $E_1 \subset E_2 \subset \dots$ all lie in ${\mathcal M}$, then $\bigcup_{n=1}^\infty E_n$ lies in ${\mathcal M}$.
- If $E_1 \supset E_2 \supset \dots$ all lie in ${\mathcal M}$, then $\bigcap_{n=1}^\infty E_n$ lies in ${\mathcal M}$.

Thus for instance any $\sigma$-algebra is a monotone class, but not conversely. Nevertheless, there is a key way in which monotone classes “behave like” $\sigma$-algebras:
Lemma 2 (Monotone class lemma) Let ${\mathcal A}$ be a Boolean algebra on $X$. Then the $\sigma$-algebra generated by ${\mathcal A}$ is the smallest monotone class that contains ${\mathcal A}$.
Proof: Let be the intersection of all the monotone classes that contain . Since is clearly one such class, is a subset of . Our task is then to show that contains .
It is also clear that is a monotone class that contains . By replacing all the elements of with their complements, we see that is necessarily closed under complements.
For any , consider the set of all sets such that , , , and all lie in . It is clear that contains ; since is a monotone class, we see that is also. By definition of , we conclude that for all .
Next, let be the set of all such that , , , and all lie in for all . By the previous discussion, we see that contains . One also easily verifies that is a monotone class. By definition of , we conclude that . Since is also closed under complements, this implies that is closed with respect to finite unions. Since this class also contains , which contains , we conclude that is a Boolean algebra. Since is also closed under increasing countable unions, we conclude that it is closed under arbitrary countable unions, and is thus a -algebra. As it contains , it must also contain .
We now begin the proof of Theorem 1. We begin with the uniqueness claim. Suppose that we have two measures on that are product measures of and in the sense that
for all and . If we then set to be the collection of all such that , then contains all sets of the form with and . In fact contains the collection of all sets that are “elementary” in the sense that they are of the form for finite and for , since such sets can be easily decomposed into a finite union of disjoint products , at which point the claim follows from (4) and finite additivity. But is a Boolean algebra that generates as a -algebra, and from continuity from above and below we see that is a monotone class. By the monotone class lemma, we conclude that is all of , and hence . This gives uniqueness.

Now we prove existence. We first claim that for any measurable set , the sets are measurable in . Indeed, the claim is obvious for sets that are “elementary” in the sense that they belong to the Boolean algebra defined previously, and the collection of all such sets is a monotone class, so the claim follows from the monotone class lemma. A similar argument (relying on monotone or dominated convergence) shows that the function
is measurable in for all . Thus, for any , we can define the quantity by
A routine application of the monotone convergence theorem verifies that is a countably additive measure; one easily checks that (2) holds for all , and in particular is a probability measure.
By construction, we see that the identity
holds (with all functions integrated being measurable) whenever is an indicator function with . By linearity of integration, the same identity holds (again with all functions measurable) when is an unsigned simple function. Since any unsigned measurable function can be expressed as the monotone non-decreasing limit of unsigned simple functions (for instance, one can round down to the largest multiple of that is less than and ), the above identity also holds for unsigned measurable by the monotone convergence theorem. Applying this fact to the absolute value of an absolutely integrable function , we conclude for such functions that
which by Markov’s inequality implies that
for -almost every . In other words, the function is absolutely integrable on for -almost every . By monotonicity we conclude that
and hence the function is absolutely integrable. Hence it makes sense to ask whether the identity
holds for absolutely integrable , as both sides are well-defined. We have already established this claim when is unsigned and absolutely integrable; by subtraction this implies the claim for real-valued absolutely integrable , and by taking real and imaginary parts we obtain the claim for complex-valued absolutely integrable .
We may reverse the roles of and , and define instead by the formula
By the previously proved uniqueness of product measure, we see that this defines the same product measure as previously. Repeating the previous arguments we obtain all the above claims with the roles of and reversed. This gives all the claims required for Theorem 1.
One can extend the product construction easily to finite products:
Exercise 3 (Finite products) Show that for any finite collection of probability spaces, there exists a unique probability measure on such that
whenever for . Furthermore, show that
for any partition (after making the obvious identification between and ). Thus for instance one has the associativity property
for any probability spaces for .
By writing as products of pairs of probability spaces in many different ways, one can obtain a higher-dimensional analogue of the Fubini and Tonelli theorems; we leave the precise statement of such a theorem to the interested reader.
It is important to be aware that the Fubini theorem identity
for measurable functions $f$ that are not unsigned is usually only justified when $f$ is absolutely integrable on $X \times Y$, or equivalently (by the Tonelli theorem) when the function $x \mapsto \int_Y |f(x,y)|\ d\nu(y)$ is absolutely integrable on $X$ (or when $y \mapsto \int_X |f(x,y)|\ d\mu(x)$ is absolutely integrable on $Y$). Without this joint absolute integrability (and without any unsigned property on $f$), the identity (5) can fail even if both sides are well-defined. For instance, let $X = Y$ be the unit interval $[0,1]$, let $\mu = \nu$ be the uniform probability measure on this interval, and set
One can check that both sides of (5) are well-defined, but that the left-hand side is and the right-hand side is . Of course, this function is neither unsigned nor jointly absolutely integrable, so this counterexample does not violate either of the Fubini or Tonelli theorems. Thus one should take care to only blindly interchange integrals when the integrands are known to be either unsigned or jointly absolutely integrable.
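A discrete cousin of this failure is easy to exhibit with iterated sums. The doubly-indexed array below (our own choice, not the example in the text) has every row summing to zero but a single column summing to one, so the two iterated sums disagree; of course, the array is neither unsigned nor jointly absolutely summable:

```python
# a(m, n) = +1 on the diagonal, -1 just above it, 0 elsewhere.
def a(m, n):
    return 1 if m == n else (-1 if n == m + 1 else 0)

N = 100  # the inner sums below are exact: each row/column has finite support
sum_rows_first = sum(sum(a(m, n) for n in range(N + 2)) for m in range(N))  # each row sums to 0
sum_cols_first = sum(sum(a(m, n) for m in range(N + 2)) for n in range(N))  # only column 0 sums to 1
print(sum_rows_first, sum_cols_first)  # 0 1
```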
The above theory extends from probability spaces to finite measure spaces, and more generally to measure spaces that are -finite, that is to say they are expressible as the countable union of sets of finite measure. (With a bit of care, some portions of product measure theory are even extendible to non-sigma-finite settings, though I urge caution in applying these results blindly in that case.) We will not give the details of these generalisations here, but content ourselves with one example:
Exercise 4 Establish (4) for all Borel sets . (Hint: can be viewed as the disjoint union of a countable sequence of sets of measure .)
Remark 5 When doing real analysis (as opposed to probability), it is convenient to complete the Borel -algebra on spaces such as , to form the larger Lebesgue -algebra , defined as the collection of all subsets in that differ from a Borel set in by a sub-null set, in the sense that for some Borel subset of of zero Lebesgue measure. There are analogues of the Fubini and Tonelli theorems for such complete -algebras; see these previous lecture notes for details. However, one should be cautioned that the product of Lebesgue -algebras is not the Lebesgue -algebra , but is instead an intermediate -algebra between and , which causes some additional small complications. For instance, if is Lebesgue measurable, then the functions can only be found to be Lebesgue measurable on for almost every , rather than for all . We will not dwell on these subtleties further here, as we will rarely have any need to complete the -algebras used in probability theory.
It is also important in probability theory applications to form the product of an infinite number of probability spaces for , where can be infinite or even uncountable. Recall from Notes 0 that the product -algebra on is defined to be the -algebra generated by the sets for and , where is the usual coordinate projection. Equivalently, if we define an elementary set to be a subset of of the form , where is a finite subset of , is the obvious projection map to , and is a measurable set in , then can be defined as the -algebra generated by the collection of elementary sets. (Elementary sets are the measure-theoretic analogue of cylinder sets in point set topology.) For future reference we note the useful fact that is a Boolean algebra.
We define a product measure to be a probability measure on the measurable space which extends all of the finite products in the sense that
for all finite subsets of and all in , where . If this product measure exists, it is unique:
Exercise 6 Show that for any collection of probability spaces for , there is at most one product measure . (Hint: adapt the uniqueness argument in Theorem 1 that used the monotone class lemma.)
Exercise 7 Let be probability measures on , and let be their Stieltjes measure functions. Show that is the unique probability measure on whose Stieltjes transform is the tensor product of .
In the case of finite , the finite product constructed in Exercise 3 is clearly the unique product. But for infinite , the construction of product measure is a more nontrivial issue. We can generalise the problem as follows:
Problem 8 (Extension problem) Let be a collection of measurable spaces. For each finite , let be a probability measure on obeying the compatibility condition
for all finite and , where is the obvious restriction. Can one then define a probability measure on such that
Note that the compatibility condition (6) is clearly necessary if one is to find a measure obeying (7).
Again, one has uniqueness:
Exercise 9 Show that for any and for finite as in the above extension problem, there is at most one probability measure with the stated properties.
The extension problem is trivial for finite index sets, but for infinite ones there are unfortunately examples where the probability measure fails to exist. However, there is one key case in which we can build the extension, thanks to the Kolmogorov extension theorem. Call a measurable space standard Borel if it is isomorphic as a measurable space to a Borel subset of the unit interval with the Borel $\sigma$-algebra, that is to say there is a bijection from the space to a Borel subset of $[0,1]$ such that this bijection and its inverse are both measurable. (In Durrett, such spaces are called nice spaces.) Note that one can easily replace $[0,1]$ by other standard spaces such as ${\bf R}$ if desired, since these spaces are isomorphic as measurable spaces (why?).
Theorem 10 (Kolmogorov extension theorem) Let the situation be as in Problem 8. If all the measurable spaces are standard Borel, then there exists probability measure solving the extension problem (which is then unique, thanks to Exercise 9).
The proof of this theorem is lengthy and is deferred to the next (optional) section. Specialising to the product case, we conclude
Corollary 11 Let be a collection of probability spaces with standard Borel. Then there exists a product measure (which is then unique, thanks to Exercise 6).
Of course, to use this theorem we would like to have a large supply of standard Borel spaces. Here is one tool that often suffices:
Lemma 12 Let be a complete separable metric space, and let be a Borel subset of . Then (with the Borel -algebra) is standard Borel.
Proof: Let us call two topological spaces Borel isomorphic if their corresponding Borel structures are isomorphic as measurable spaces. Using the binary expansion, we see that $[0,1]$ is Borel isomorphic to $\{0,1\}^{\bf N}$ (the countable number of points that have two binary expansions can be easily permuted to obtain a genuine isomorphism). Similarly $[0,1]^{\bf N}$ is Borel isomorphic to $(\{0,1\}^{\bf N})^{\bf N}$. Since ${\bf N} \times {\bf N}$ is in bijection with ${\bf N}$, we conclude that $[0,1]^{\bf N}$ is Borel isomorphic to $[0,1]$. Thus it will suffice to show that every complete separable metric space is Borel isomorphic to a Borel subset of $[0,1]^{\bf N}$. But if we let $x_1, x_2, \dots$ be a countable dense subset in the space, the map
can easily be seen to be a homeomorphism between the space and a subset of $[0,1]^{\bf N}$, which is completely metrisable and hence Borel (in fact it is a $G_\delta$ set – the countable intersection of open sets – why?). The claim follows.
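The digit-interleaving idea behind these Borel isomorphisms can be sketched in code; the helper names, the truncation to finitely many bits, and the restriction to dyadic rationals are our own simplifications (the measure-zero set of reals with two binary expansions is ignored, as in the text):

```python
# Sketch of the Borel isomorphism between [0,1)^2 and [0,1): interleave the
# binary digits of the two coordinates.
def interleave(x, y, bits=20):
    xi, yi = int(x * 2**bits), int(y * 2**bits)
    z = 0
    for k in range(bits - 1, -1, -1):            # most significant digit first
        z = (z << 2) | (((xi >> k) & 1) << 1) | ((yi >> k) & 1)
    return z / 4.0**bits

def deinterleave(z, bits=20):
    zi = int(z * 4**bits)
    xi = yi = 0
    for k in range(bits):
        xi |= ((zi >> (2 * k + 1)) & 1) << k     # odd-position bits give x
        yi |= ((zi >> (2 * k)) & 1) << k         # even-position bits give y
    return xi / 2.0**bits, yi / 2.0**bits

x, y = 0.3125, 0.75                              # exact dyadic rationals
assert deinterleave(interleave(x, y)) == (x, y)  # round trip is exact here
```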
Exercise 13 (Kolmogorov extension theorem, alternate form) For each natural number , let be a probability measure on with the property that
for and any box in , where we identify with in the usual manner. Show that there exists a unique probability measure on (with the product -algebra, or equivalently the Borel -algebra on the product topology) such that
for all and Borel sets .
— 2. Proof of the Kolmogorov extension theorem (optional) —
We now prove Theorem 10. By the definition of a standard Borel space, we may assume without loss of generality that each is a Borel subset of with the Borel -algebra, and then by extending each to we may in fact assume without loss of generality that each is simply with the Borel -algebra. Thus each for finite is a probability measure on the cube .
We will exploit the regularity properties of such measures:
Exercise 14 Let be a finite set, and let be a probability measure on (with the Borel -algebra). For any Borel set in , establish the inner regularity property
and the outer regularity property
Hint: use the monotone class lemma.
Another way of stating the above exercise is that finite Borel measures on the cube are automatically Radon measures. In fact there is nothing particularly special about the unit cube here; the claim holds for any compact separable metric space. Radon measures are often used in real analysis (see e.g. these lecture notes) but we will not develop their theory further here.
Observe that one can define the elementary measure of any elementary set in by defining
for any finite and any Borel . This definition is well-defined thanks to the compatibility hypothesis (6). From the finite additivity of the it is easy to see that is a finitely additive probability measure on the Boolean algebra of elementary sets.
We would like to extend to a countably additive probability measure on . The standard approach to do this is via the Carathéodory extension theorem in measure theory (or the closely related Hahn-Kolmogorov theorem); this approach is presented in these previous lecture notes, and a similar approach is taken in Durrett. Here, we will try to avoid developing the Carathéodory extension theorem, and instead take a more direct approach similar to the direct construction of Lebesgue measure, given for instance in these previous lecture notes.
Given any subset (not necessarily Borel), we define its outer measure to be the quantity
where we say that is an open elementary cover of if each is an open elementary set, and . Some properties of this outer measure are easily established:
Exercise 15
- (i) Show that .
- (ii) (Monotonicity) Show that if then .
- (iii) (Countable subadditivity) For any countable sequence of subsets of , show that . In particular (from part (i)) we have the finite subadditivity for all .
- (iv) (Elementary sets) If is an elementary set, show that . (Hint: first establish the claim when is compact, relying heavily on the regularity properties of the provided by Exercise 14, then extend to the general case by further heavy reliance on regularity.) In particular, we have .
- (v) (Approximation) Show that if , then for any there exists an elementary set such that . (Hint: use the monotone class lemma. When dealing with an increasing sequence of measurable sets obeying the required property, approximate these sets by an increasing sequence of elementary sets , and use the finite additivity of elementary measure and the fact that bounded monotone sequences converge.)
From part (v) of the above exercise, we see that every can be viewed as a “limit” of a sequence of elementary sets such that . From parts (iii), (iv) we see that the sequence is a Cauchy sequence and thus converges to a limit, which we denote ; one can check from further application of (iii), (iv) that this quantity does not depend on the specific choice of . From this definition we see that extends (thus for any elementary set ), and from the above exercise one checks that is countably additive. Thus is a probability measure with the desired properties, and the proof of the Kolmogorov extension theorem is complete.
— 3. Independence —
Using the notion of product measure, we can now quickly define the notion of independence:
Definition 16 A collection of random variables (each of which takes values in some measurable space ) is said to be jointly independent if the distribution of is the product of the distributions of the . Or equivalently (after expanding all the definitions), we have
for all finite and all measurable subsets of . We say that two random variables are independent (or that is independent of ) if the pair is jointly independent.
It is worth reiterating that unless otherwise specified, all random variables under consideration are being modeled by a single probability space. The notion of independence between random variables does not make sense if the random variables are only being modeled by separate probability spaces; they have to be coupled together into a single probability space before independence becomes a meaningful notion.
Independence is a non-trivial notion only when one has two or more random variables; by chasing through the definitions we see that any collection of zero or one variables is automatically jointly independent.
Example 17 If we let be drawn uniformly from a product of two Borel sets in of positive finite Lebesgue measure, then and are independent. However, if is drawn uniformly from another shape (e.g. a parallelogram), then one usually does not expect to have independence.
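As a quick numerical illustration of this example (the particular product set and test events below are our own choices), one can check by Monte Carlo that the coordinates of a uniform draw from a product set obey the product rule for probabilities:

```python
import random

random.seed(1)
# (X1, X2) drawn uniformly from the product set [0, 1/2] x [0, 1].
n = 100_000
pts = [(0.5 * random.random(), random.random()) for _ in range(n)]
in_A = lambda x: x < 0.2          # A = [0, 0.2), so P(X1 in A) = 0.4
in_B = lambda y: y < 0.3          # B = [0, 0.3), so P(X2 in B) = 0.3
p_joint = sum(1 for x, y in pts if in_A(x) and in_B(y)) / n
p_A = sum(1 for x, _ in pts if in_A(x)) / n
p_B = sum(1 for _, y in pts if in_B(y)) / n
# Independence of the coordinates: the joint probability factorises
# (up to Monte Carlo error).
assert abs(p_joint - p_A * p_B) < 0.01
```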
Thus, for instance, a finite family of random variables taking values in is jointly independent if one has
for all measurable in for .
Suppose that is a family of independent random variables, with each taking values in . From Exercise 3 we see that
whenever are disjoint finite subsets of , is the tuple , and is a measurable subset of . In particular, we see that the tuples are also jointly independent. This implies in turn that are jointly independent for any measurable functions . Thus, for instance, if are jointly independent random variables taking values in respectively, then and are independent for any measurable and . In particular, if two scalar random variables are jointly independent of a third random variable (i.e. the triple are jointly independent), then combinations such as or are also independent of .
If and are scalar random variables, then from the Fubini and Tonelli theorems we see that
if and are either both unsigned, or both absolutely integrable. We caution however that the converse is not true: just because two random variables happen to obey (8) does not necessarily mean that they are independent; instead, we say merely that they are uncorrelated, which is a weaker statement.
More generally, if and are random variables taking values in ranges respectively, then
for any scalar functions on respectively, provided that and are either both unsigned, or both absolutely integrable. This is the property of and which is equivalent to independence (as can be seen by specialising to those that take values in ): thus for instance independence of two unsigned random variables entails not only (8), but , , etc. Similarly when discussing the joint independence of larger numbers of random variables. It is this ability to easily decouple expectations of independent random variables that makes independent variables particularly easy to compute with in probability.
Exercise 18 Show that a random variable is independent of itself (i.e. and are independent) if and only if is almost surely equal to a constant.
Exercise 19 Show that a constant (deterministic) random variable is independent of any other random variable.
Exercise 20 Let be discrete random variables (i.e. they take values in at most countable spaces equipped with the discrete sigma-algebra). Show that are jointly independent if and only if one has
for all .
Exercise 21 Let be real scalar random variables. Show that are jointly independent if and only if one has
for all .
The following exercise demonstrates that probabilistic independence is analogous to linear independence:
Exercise 22 Let be a finite-dimensional vector space over a finite field , and let be a random variable drawn uniformly at random from . Let be a non-degenerate bilinear form on , and let be vectors in . Show that the random variables are jointly independent if and only if the vectors are linearly independent.
Exercise 23 Give an example of three random variables which are pairwise independent (that is, any two of are independent of each other), but not jointly independent. (Hint: one can use the preceding exercise.)
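One standard example along these lines (we spell out one possible answer in code; skip this if you would rather solve the exercise yourself) takes two fair bits and their XOR:

```python
from itertools import product

# Sample space: two fair bits (X1, X2) and X3 = X1 XOR X2; each of the
# four outcomes of (X1, X2) has probability 1/4.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

def P(event):                     # probability of an event (a predicate on outcomes)
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Pairwise independence: P(Xi=a, Xj=b) = P(Xi=a) P(Xj=b) for every pair i < j.
for i, j in ((0, 1), (0, 2), (1, 2)):
    for a, b in product((0, 1), repeat=2):
        assert P(lambda o: o[i] == a and o[j] == b) \
               == P(lambda o: o[i] == a) * P(lambda o: o[j] == b)

# But not joint independence: (X1, X2, X3) = (0, 0, 1) is impossible,
# while the product of the three individual probabilities is 1/8.
assert P(lambda o: o == (0, 0, 1)) == 0.0
```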
Another analogy is with orthogonality:
Exercise 24 Let be a random variable taking values in with the Gaussian distribution, in the sense that
(where denotes the Euclidean norm on ), and let be vectors in . Show that the random variables (with denoting the Euclidean inner product) are jointly independent if and only if the are pairwise orthogonal.
We say that a family of events are jointly independent if their indicator random variables are jointly independent. Undoing the definitions, this is equivalent to requiring that
for all disjoint finite subsets of . This condition is complicated, but simplifies in the case of just two events:
Exercise 25
- (i) Show that two events are independent if and only if .
- (ii) If are events, show that the condition is necessary, but not sufficient, to ensure that are jointly independent.
- (iii) Give an example of three events that are pairwise independent, but not jointly independent.
Because of the product measure construction, it is easy to insert independent sources of randomness into an existing randomness model by extending that model, thus giving a more useful version of Corollaries 27 and 31 of Notes 0:
Proposition 26 Suppose one has a collection of events and random variables modeled by some probability space , and let be a probability measure on a measurable space . Then there exists an extension of the probability space , and a random variable modeled by taking values in , such that has distribution and is independent of all random variables that were previously modeled by .
More generally, given a finite collection of probability spaces on measurable spaces , there exists an extension of and random variables modeled by taking values in for each , such that each has distribution and and are jointly independent for any random variable that was previously modeled by .
If the are all standard Borel spaces, then one can also take to be infinite (even if is uncountable).
Proof: For the first part, we define the extension to be the product of with the probability space , with factor map defined by , and with modeled by . It is then routine to verify all the claimed properties. The other parts of the proposition are proven similarly, using Corollary 11 for the final part.
Using this proposition, for instance, one can start with a given random variable and create an independent copy of that variable, which has the same distribution as but is independent of , by extending the probability model. Indeed one can create any finite number of independent copies, or even an infinite number if the variable takes values in a standard Borel space (in particular, one can do this if it is a scalar random variable). A finite or infinite sequence of random variables that are jointly independent and all have the same distribution is said to be an independent and identically distributed (or iid for short) sequence of random variables. The above proposition allows us to easily generate such sequences by extending the sample space as necessary.
Exercise 27 Let be random variables that are independent and identically distributed copies of the Bernoulli random variable with expectation , that is to say the are jointly independent with for all .
- (i) Show that the random variable is uniformly distributed on the unit interval .
- (ii) Show that the random variable has the distribution of Cantor measure (constructed for instance in Example 1.2.4 of Durrett).
Note that part (i) of this exercise provides a means to construct Lebesgue measure on the unit interval (although, when one unpacks the construction, it is actually not too different from the standard construction, as given for instance in this previous set of notes).
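Part (i) can be probed numerically by sampling the random binary expansion and checking the moments and distribution function of a uniform variable (the bit count and sample size below are arbitrary choices):

```python
import random

random.seed(0)

# X = sum_n X_n 2^{-n} with X_n iid fair bits, truncated to 32 bits.
def sample(bits=32):
    return sum(random.getrandbits(1) * 2.0 ** -(n + 1) for n in range(bits))

xs = [sample() for _ in range(50_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# A uniform variable on [0,1] has mean 1/2, variance 1/12, and cdf F(t) = t.
assert abs(mean - 0.5) < 0.01
assert abs(var - 1 / 12) < 0.01
for t in (0.25, 0.5, 0.75):
    assert abs(sum(1 for x in xs if x <= t) / len(xs) - t) < 0.01
```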
Given two square integrable random variables , the covariance between the two is defined by the formula
Of course, for real-valued the complex conjugation sign may be dropped. The covariance is well-defined thanks to the Cauchy-Schwarz inequality, and it is not difficult to see that one has the alternate formula
for the covariance. Note that the variance is a special case of the covariance: .
From construction we see that if are independent square integrable variables, then the covariance vanishes. The converse is not true:
Exercise 28 Give an example of two square-integrable variables which have vanishing covariance , but are not independent.
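A minimal sketch of one common such example (our choice, not necessarily the intended answer): take X uniform on three points and Y = X², so Y is a function of X yet the covariance vanishes:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1} and Y = X^2; exact arithmetic via Fractions.
outcomes = [(-1, 1), (0, 0), (1, 1)]      # (X, Y), each with probability 1/3
p = Fraction(1, 3)
EX = sum(p * x for x, _ in outcomes)      # E[X]  = 0
EY = sum(p * y for _, y in outcomes)      # E[Y]  = 2/3
EXY = sum(p * x * y for x, y in outcomes) # E[XY] = E[X^3] = 0
assert EXY - EX * EY == 0                 # vanishing covariance
# ...but not independent: P(X=0, Y=0) = 1/3, while P(X=0) P(Y=0) = 1/9.
assert p != p * p
```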
However, there is one key case in which the converse does hold, namely that of gaussian random vectors.
Exercise 29 A random vector taking values in is said to be a gaussian random vector if there exists and a positive definite matrix such that
for all Borel sets (where we identify elements of with column vectors). The distribution of is called a multivariate normal distribution.
- (i) If is a gaussian random vector with the indicated parameters , show that and for . In particular . Thus we see that the parameters of a gaussian random variable can be recovered from the mean and covariances.
- (ii) If is a gaussian random vector and , show that and are independent if and only if the covariance vanishes. Furthermore, show that are jointly independent if and only if all the covariances for vanish. In particular, for gaussian random vectors, joint independence is equivalent to pairwise independence. (Contrast this with Exercise 23.)
- (iii) Give an example of two real random variables , each of which is gaussian, and for which , but such that and are not independent. (Hint: take to be the product of with a random sign.) Why does this not contradict (ii)?
We have discussed independence of random variables, and independence of events. It is also possible to define a notion of independence of -algebras. More precisely, define a -algebra of events to be a collection of events that contains the empty event, is closed under Boolean operations (in particular, under complements ) and under countable conjunctions and countable disjunctions. Each such -algebra, when using a probability space model , is modeled by a -algebra of measurable sets in , which behaves under an extension in the obvious pullback fashion:
A random variable taking values in some range is said to be measurable with respect to a -algebra of events if the event lies in for every measurable subset of ; in terms of a probabilistic model , is measurable with respect to if and only if is measurable with respect to . Note that every random variable generates a -algebra of events, defined to be the collection of all events of the form for a measurable subset of ; this is the smallest -algebra with respect to which is measurable. More generally, given any collection of random variables, one can define the -algebra to be the smallest -algebra of events with respect to which all of the are measurable; in terms of a model , we have
where is the range of . Similarly, any collection of events generates a -algebra of events , defined as the smallest -algebra of events that contains all of the ; with respect to a model , one has
Definition 30 A collection of -algebras of events are said to be jointly independent if, whenever is a random variable measurable with respect to for , the tuple is jointly independent. Equivalently, is jointly independent if and only if one has
whenever is a finite subset of and for (why is this equivalent?).
Thus, for instance, and are independent -algebras of events if and only if one has
for all and , that is to say that all the events in are independent of all the events in .
The above notion generalises the notion of independence for random variables:
Exercise 31 If are a collection of random variables, show that are jointly independent random variables if and only if are jointly independent -algebras.
Exercise 32 Let be a sequence of random variables. Show that are jointly independent if and only if is independent of for all natural numbers .
Suppose one has a sequence of random variables (such a sequence can be referred to as a discrete stochastic process). For each natural number , we can define the -algebras , as the smallest algebra that makes all of the for measurable; for instance, this -algebra contains any event that is definable in terms of measurable relations of finitely many of the , together with countable boolean operations on such events. These -algebras are clearly decreasing in . We can define the tail -algebra to be the intersection of all these -algebras, that is to say consists of those events which lie in for every . For instance, if the are scalar random variables that converge almost surely to a limit , then we see that (after modification on a null set) is measurable with respect to the tail -algebra .
We have the remarkable Kolmogorov zero-one law, which says that the tail $\sigma$-algebra of a sequence of independent random variables is essentially trivial:
Theorem 33 (Kolmogorov zero-one law) Let be a sequence of jointly independent random variables. Then every event in the tail -algebra has probability equal to either or .
Example 34 Let $X_1, X_2, \dots$ be a sequence of jointly independent real-valued random variables (not necessarily identically distributed). The random variable $\limsup_{n \rightarrow \infty} X_n$ is measurable in the tail algebra, and hence must be almost surely constant, thus there exists $c_+ \in [-\infty,+\infty]$ such that $\limsup_{n \rightarrow \infty} X_n = c_+$ almost surely. Similarly there exists $c_- \in [-\infty,+\infty]$ such that $\liminf_{n \rightarrow \infty} X_n = c_-$ almost surely. Thus, either we have $c_- = c_+$ and the $X_n$ converge almost surely to a deterministic limit, or $c_- < c_+$ and the $X_n$ almost surely do not converge. What cannot happen is (for instance) that $X_n$ converges with probability $1/2$, and diverges with probability $1/2$; the zero-one law forces the only available probabilities of tail events to be zero or one.
Proof: Since are jointly independent, the -algebra is independent of for any . In particular, is independent of . Since the -algebra is generated by the for , a simple application of the monotone class lemma then shows that is also independent of . But contains , hence is independent of itself. But the only events that are independent of themselves have probability or , and the claim follows.
Note that the zero-one law gives no guidance as to which of the two probabilities actually occurs for a given tail event. This usually cannot be determined from such “soft” tools as the zero-one law; instead one has to work with more “hard” estimates, in particular explicit inequalities for the probabilities of various events that approximate the given tail event.
The zero-one law suggests that many asymptotic statistics of random variables will almost surely have deterministic values. We will see specific examples of this in the next few notes, when we discuss the law of large numbers and the central limit theorem.
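As a quick numerical sketch of the dichotomy in Example 34 (with a hypothetical choice of iid Uniform(0,1) variables, which the text does not fix): for such variables the limit superior is almost surely the deterministic constant 1, and a finite-n proxy for it concentrates accordingly.

```python
import random

random.seed(0)

def tail_max(n_vars: int) -> float:
    """Max of X_{n/2},...,X_n for X_i iid Uniform(0,1): a crude finite-n
    proxy for limsup_n X_n, which ignores any fixed initial segment."""
    xs = [random.random() for _ in range(n_vars)]
    return max(xs[n_vars // 2:])

# The zero-one law predicts limsup_n X_n is almost surely a deterministic
# constant; for Uniform(0,1) variables that constant is 1, since
# P(X_n > 1 - eps infinitely often) = 1 for every eps > 0.
proxies = [tail_max(10_000) for _ in range(100)]
print(min(proxies))  # every trial lands within 0.01 of the constant 1
```

No single simulated trial can certify an almost-sure statement, of course; the point is that the proxy shows no trial-to-trial randomness in the limit value.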
for all . Thus for instance if , and is written in block form as
for some row vector , column vector , and minor , one has
The inverse sweep operation is given by a nearly identical set of formulae:
for all . One can check that these operations invert each other. Actually, each sweep turns out to have order , so that : an inverse sweep performs the same operation as three forward sweeps. Sweeps also preserve the space of symmetric matrices (allowing one to cut down computational run time in that case by a factor of two), and behave well with respect to principal minors; a sweep of a principal minor is a principal minor of a sweep, after adjusting indices appropriately.
Remarkably, the sweep operators all commute with each other: . If and we perform the first sweeps (in any order) to a matrix
with a minor, a matrix, a matrix, and a matrix, one obtains the new matrix
Note the appearance of the Schur complement in the bottom right block. Thus, for instance, one can essentially invert a matrix by performing all sweeps:
If a matrix has the form
for a minor , column vector , row vector , and scalar , then performing the first sweeps gives
and all the components of this matrix are usable for various numerical linear algebra applications in statistics (e.g. in least squares regression). Given that sweeps behave well with inverses, it is perhaps not surprising that sweeps also behave well under determinants: the determinant of can be factored as the product of the entry and the determinant of the matrix formed from by removing the row and column. As a consequence, one can compute the determinant of fairly efficiently (so long as the sweep operations don’t come close to dividing by zero) by sweeping the matrix for in turn, and multiplying together the entry of the matrix just before the sweep for to obtain the determinant.
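The sweep operation described above can be sketched in a few lines. The code below uses one common sign convention (conventions in the literature differ by signs), under which sweeping every pivot of an invertible matrix produces minus its inverse, and the pivot entries encountered along the way multiply to the determinant; the function name `sweep` is our own.

```python
import copy

def sweep(A, k):
    """Return the sweep of square matrix A (list of lists) on pivot k.

    Convention used here (signs vary in the literature):
        b[k][k] = -1/a[k][k]
        b[i][k] =  a[i][k]/a[k][k],  b[k][j] = a[k][j]/a[k][k]   (i, j != k)
        b[i][j] =  a[i][j] - a[i][k]*a[k][j]/a[k][k]             (i, j != k)
    With this convention, sweeping every pivot of A produces -A^{-1}.
    """
    n = len(A)
    p = A[k][k]
    B = copy.deepcopy(A)
    B[k][k] = -1.0 / p
    for i in range(n):
        if i != k:
            B[i][k] = A[i][k] / p
            B[k][i] = A[k][i] / p
    for i in range(n):
        for j in range(n):
            if i != k and j != k:
                B[i][j] = A[i][j] - A[i][k] * A[k][j] / p
    return B

A = [[2.0, 1.0], [1.0, 3.0]]

# Sweeping all pivots inverts (up to sign), and the pivot entries read off
# just before each sweep multiply to the determinant.
M, det = A, 1.0
for k in range(2):
    det *= M[k][k]
    M = sweep(M, k)
print(M)    # -A^{-1} ≈ [[-0.6, 0.2], [0.2, -0.4]]
print(det)  # det(A) = 5.0
```

One can also check numerically that each sweep has order four: applying `sweep(·, k)` four times returns the original matrix.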
It turns out that there is a simple geometric explanation for these seemingly magical properties of the sweep operation. Any matrix creates a graph (where we think of as the space of column vectors). This graph is an -dimensional subspace of . Conversely, most subspaces of arise as graphs; there are some that fail the vertical line test, but these are a positive codimension set of counterexamples.
We use to denote the standard basis of , with the standard basis for the first factor of and the standard basis for the second factor. The operation of sweeping the entry then corresponds to a ninety degree rotation in the plane, that sends to (and to ), keeping all other basis vectors fixed: thus we have
for generic (more precisely, those with non-vanishing entry ). For instance, if and is of the form (1), then is the set of tuples obeying the equations
The image of under is . Since we can write the above system of equations (for ) as
we see from (2) that is the graph of . Thus the sweep operation is a multidimensional generalisation of the high school geometry fact that the line in the plane becomes after applying a ninety degree rotation.
It is then an instructive exercise to use this geometric interpretation of the sweep operator to recover all the remarkable properties about these operations listed above. It is also useful to compare the geometric interpretation of sweeping as rotation of the graph to that of Gaussian elimination, which instead shears and reflects the graph by various elementary transformations (this is what is going on geometrically when one performs Gaussian elimination on an augmented matrix). Rotations are less distorting than shears, so one can see geometrically why sweeping can produce fewer numerical artefacts than Gaussian elimination.
These operations obey various axioms; for instance, the boolean operations on events obey the axioms of a Boolean algebra, and the probability function obeys the Kolmogorov axioms. However, we will not focus on the axiomatic approach to probability theory here, instead basing the foundations of probability theory on the sample space models as discussed in Notes 0. (But see this previous post for a treatment of one such axiomatic approach.)
It turns out that almost all of the other operations on random events and variables we need can be constructed in terms of the above basic operations. In particular, this allows one to safely extend the sample space in probability theory whenever needed, provided one uses an extension that respects the above basic operations. We gave a simple example of such an extension in the previous notes, but now we give a more formal definition:
Definition 1 Suppose that we are using a probability space as the model for a collection of events and random variables. An extension of this probability space is a probability space , together with a measurable map (sometimes called the factor map) which is probability-preserving in the sense that
for all . (Caution: this does not imply that for all – why not?)
An event which is modeled by a measurable subset in the sample space , will be modeled by the measurable set in the extended sample space . Similarly, a random variable taking values in some range that is modeled by a measurable function in , will be modeled instead by the measurable function in . We also allow the extension to model additional events and random variables that were not modeled by the original sample space (indeed, this is one of the main reasons why we perform extensions in probability in the first place).
Thus, for instance, the sample space in Example 3 of the previous post is an extension of the sample space in that example, with the factor map given by the first coordinate projection . One can verify that all of the basic operations on events and random variables listed above are unaffected by the above extension (with one caveat, see remark below). For instance, the conjunction of two events can be defined via the original model by the formula
or via the extension via the formula
The two definitions are consistent with each other, thanks to the obvious set-theoretic identity
Similarly, the assumption (1) is precisely what is needed to ensure that the probability of an event remains unchanged when one replaces a sample space model with an extension. We leave the verification of preservation of the other basic operations described above under extension as exercises to the reader.
Remark 2 There is one minor exception to this general rule if we do not impose the additional requirement that the factor map is surjective. Namely, for non-surjective , it can become possible that two events are unequal in the original sample space model, but become equal in the extension (and similarly for random variables), although the converse never happens (events that are equal in the original sample space always remain equal in the extension). For instance, let be the discrete probability space with and , and let be the discrete probability space with , and non-surjective factor map defined by . Then the event modeled by in is distinct from the empty event when viewed in , but becomes equal to that event when viewed in . Thus we see that extending the sample space by a non-surjective factor map can identify previously distinct events together (though of course, being probability preserving, this can only happen if those two events were already almost surely equal anyway). This turns out to be fairly harmless though; while it is nice to know if two given events are equal, or if they differ by a non-null event, it is almost never useful to know that two events are unequal if they are already almost surely equal. Alternatively, one can add the additional requirement of surjectivity in the definition of an extension, which is also a fairly harmless constraint to impose (this is what I chose to do in this previous set of notes).
Roughly speaking, one can define probability theory as the study of those properties of random events and random variables that are model-independent in the sense that they are preserved by extensions. For instance, the cardinality of the model of an event is not a concept within the scope of probability theory, as it is not preserved by extensions: continuing Example 3 from Notes 0, the event that a die roll is even is modeled by a set of cardinality in the original sample space model , but by a set of cardinality in the extension. Thus it does not make sense in the context of probability theory to refer to the “cardinality of an event “.
On the other hand, the supremum of a collection of random variables in the extended real line is a valid probabilistic concept. This can be seen by manually verifying that this operation is preserved under extension of the sample space, but one can also see this by defining the supremum in terms of existing basic operations. Indeed, note from Exercise 24 of Notes 0 that a random variable in the extended real line is completely specified by the threshold events for ; in particular, two such random variables are equal if and only if the events and are surely equal for all . From the identity
we thus see that one can completely specify in terms of using only the basic operations provided in the above list (and in particular using the countable conjunction .) Of course, the same considerations hold if one replaces supremum, by infimum, limit superior, limit inferior, or (if it exists) the limit.
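The identity alluded to here (its displayed formula did not survive) is presumably the standard threshold identity, which we can write out:

```latex
\left\{ \sup_n X_n \le t \right\} \;=\; \bigcap_{n} \left\{ X_n \le t \right\},
\qquad t \in {\bf R},
```

so that the threshold events of the supremum are countable conjunctions of the threshold events of the individual variables, and are hence constructible from the basic operations.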
In this set of notes, we will define some further important operations on scalar random variables, in particular the expectation of these variables. In the sample space models, expectation corresponds to the notion of integration on a measure space. As we will need to use both expectation and integration in this course, we will thus begin by quickly reviewing the basics of integration on a measure space, although we will then translate the key results of this theory into probabilistic language.
As the finer details of the Lebesgue integral construction are not the core focus of this probability course, some of the details of this construction will be left to exercises. See also Chapter 1 of Durrett, or these previous blog notes, for a more detailed treatment.
— 1. Integration on measure spaces —
Let be a measure space, and let be a measurable function on , taking values either in the reals , the non-negative extended reals , the extended reals , or the complex numbers . We would like to define the integral
of on . (One could make the integration variable explicit, e.g. by writing , but we will usually not do so here.) When integrating a reasonably nice function (e.g. a continuous function) on a reasonably nice domain (e.g. a box in ), the Riemann integral that one learns about in undergraduate calculus classes suffices for this task; however, for the purposes of probability theory, we need the much more general notion of a Lebesgue integral in order to properly define (2) for the spaces and functions we will need to study.
Not every measurable function can be integrated by the Lebesgue integral. There are two key classes of functions for which the integral exists and is well behaved:
One could in principle extend the Lebesgue integral to slightly more general classes of functions, e.g. to sums of absolutely integrable functions and unsigned functions. However, the above two classes already suffice for most applications (and as a general rule of thumb, it is dangerous to apply the Lebesgue integral to functions that are not unsigned or absolutely integrable, unless you really know what you are doing).
We will construct the Lebesgue integral in the following four stages. First, we will define the Lebesgue integral just for unsigned simple functions – unsigned measurable functions that take on only finitely many values. Then, by a limiting procedure, we extend the Lebesgue integral to unsigned functions. After that, by decomposing a real absolutely integrable function into unsigned components, we extend the integral to real absolutely integrable functions. Finally, by taking real and imaginary parts, we extend to complex absolutely integrable functions. (This is not the only order in which one could perform this construction; for instance, in Durrett, one first constructs integration of bounded functions on finite measure support before passing to arbitrary unsigned functions.)
First consider an unsigned simple function , thus is measurable and takes only finitely many values. Then we can express as a finite linear combination (in ) of indicator functions. Indeed, if we enumerate the values that takes as (avoiding repetitions) and set for , then it is clear that
(It should be noted at this point that the operations of addition and multiplication on are defined by setting for all , and for all positive , but that is defined to equal . To put it another way, multiplication is defined to be continuous from below, rather than from above: . One can verify that the commutative, associative, and distributive laws continue to hold on , but we caution that the cancellation laws do not hold when is involved.)
Conversely, given any coefficients (not necessarily distinct) and measurable sets in (not necessarily disjoint), the sum is an unsigned simple function.
A single simple function can be decomposed in multiple ways as a linear combination of indicator functions. For instance, on the real line , the function can also be written as or as . However, there is an invariant of all these decompositions:
Exercise 3 Suppose that an unsigned simple function has two representations as the linear combination of indicator functions:
where are nonnegative integers, lie in , and are measurable sets. Show that
(Hint: first handle the special case where the are all disjoint and non-empty, and each of the is expressible as the union of some subcollection of the . Then handle the general case by considering the atoms of the finite boolean algebra generated by and .)
We capture this invariant by introducing the simple integral of an unsigned simple function by the formula
whenever admits a decomposition . The above exercise is then precisely the assertion that the simple integral is well-defined as an element of .
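To make this invariance concrete, here is a small sketch (the finite space, the weights, and the function names are our own illustration, not from the text): two different representations of the same unsigned simple function yield the same simple integral.

```python
def simple_integral(coeffs_and_sets, mu):
    """Integrate a representation sum_i c_i * 1_{E_i} against a measure mu
    given by its weights on the points of a finite space."""
    return sum(c * sum(mu[x] for x in E) for c, E in coeffs_and_sets)

mu = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

# Two different representations of the same simple function
# f(0)=1, f(1)=3, f(2)=2, f(3)=0:
rep1 = [(1, {0, 1}), (2, {1, 2})]      # overlapping sets
rep2 = [(1, {0}), (3, {1}), (2, {2})]  # disjoint level sets
print(simple_integral(rep1, mu), simple_integral(rep2, mu))
# both give 1.3 (up to floating-point rounding)
```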
Exercise 4 Let be unsigned simple functions, and let .
- (i) (Linearity) Show that
and
- (ii) Show that if and are equal almost everywhere, then
- (iii) Show that , with equality if and only if is zero almost everywhere.
- (iv) (Monotonicity) If almost everywhere, show that .
- (v) (Markov inequality) Show that for any .
Now we extend from unsigned simple functions to more general unsigned functions. If is an unsigned measurable function, we define the unsigned integral as
where the supremum is over all unsigned simple functions such that for all .
Many of the properties of the simple integral carry over to the unsigned integral easily:
Exercise 5 Let be unsigned functions, and let .
- (i) (Superadditivity) Show that
and
- (ii) Show that if and are equal almost everywhere, then
- (iii) Show that , with equality if and only if is zero almost everywhere.
- (iv) (Monotonicity) If almost everywhere, show that .
- (v) (Markov inequality) Show that for any . In particular, if , then is finite almost everywhere.
- (vi) (Compatibility with simple integral) If is simple, show that .
- (vii) (Compatibility with measure) For any measurable set , show that .
Exercise 6 If is a discrete probability space (with the associated probability measure ), and is a function, show that
(Note that the condition in the definition of a discrete probability space is not required to prove this identity.)
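As an illustration of this identity in a hypothetical example (the geometric-type weights p(n) = 2^{-n} and the function f(n) = n are our own choice, not from the text), the integral reduces to a countable sum whose partial sums increase to the exact value 2:

```python
from fractions import Fraction

# Discrete probability space Omega = {1, 2, 3, ...} with p(n) = 2^{-n},
# and unsigned f(n) = n.  Exercise 6 says the integral equals
# sum_n f(n) p(n), computed here by exact partial sums.
def truncated_integral(N):
    return sum(Fraction(n, 2**n) for n in range(1, N + 1))

# Partial sums increase to sum_{n>=1} n 2^{-n} = 2.
print(float(truncated_integral(50)))  # ≈ 2.0
```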
The observant reader will notice that the linearity property of simple functions has been weakened to superadditivity. This can be traced back to a breakdown of symmetry in the definition (3); the unsigned simple integral of is defined via approximation from below, but not from above. Indeed the opposite claim
can fail. For a counterexample, take to be the discrete probability space with probabilities , and let be the function . By Exercise 6 we have . On the other hand, any simple function with must equal on a set of positive measure (why?) and so the right-hand side of (4) can be infinite. However, one can get around this difficulty under some further assumptions on , and thus recover full linearity for the unsigned integral:
Exercise 7 (Linearity of the unsigned integral) Let be a measure space.
- (i) Let be an unsigned measurable function which is both bounded (i.e., there is a finite such that for all ) and has finite measure support (i.e., there is a measurable set with such that for all ). Show that (4) holds for this function .
- (ii) Establish the additivity property
whenever are unsigned measurable functions that are bounded with finite measure support.
- (iii) Show that
as whenever is unsigned measurable.
- (iv) Using (iii), extend (ii) to the case where are unsigned measurable functions with finite measure support, but are not necessarily bounded.
- (v) Show that
as whenever is unsigned measurable.
- (vi) Using (iii) and (v), show that (ii) holds for any unsigned measurable (which are not necessarily bounded or of finite measure support).
Next, we apply the integral to absolutely integrable functions. We call a scalar (real- or complex-valued) function absolutely integrable if it is measurable and the unsigned integral is finite. A real-valued absolutely integrable function can be expressed as the difference of two unsigned absolutely integrable functions ; indeed, one can check that the choice and work for this. Conversely, any difference of unsigned absolutely integrable functions is absolutely integrable (this follows from the triangle inequality ). A single absolutely integrable function may be written as a difference of unsigned absolutely integrable functions in more than one way, for instance we might have
for unsigned absolutely integrable functions . But when this happens, we can rearrange to obtain
and thus by linearity of the unsigned integral
By the absolute integrability of , all the integrals are finite, so we may rearrange this identity as
This allows us to define the Lebesgue integral of a real-valued absolutely integrable function to be the expression
for any given decomposition of as the difference of two unsigned absolutely integrable functions. Note that if is both unsigned and absolutely integrable, then the unsigned integral and the Lebesgue integral of agree (as can be seen by using the decomposition ), and so there is no ambiguity in using the same notation to denote both integrals. (By the same token, we may now drop the modifier from the simple integral of a simple unsigned , which we may now also denote by .)
The Lebesgue integral also enjoys good linearity properties:
Exercise 8 Let be real-valued absolutely integrable functions, and let .
- (i) (Linearity) Show that and are also real-valued absolutely integrable functions, with
and
(For the second relation, one may wish to first treat the special cases and .)
- (ii) Show that if and are equal almost everywhere, then
- (iii) Show that , with equality if and only if is zero almost everywhere.
- (iv) (Monotonicity) If almost everywhere, show that .
- (v) (Markov inequality) Show that for any .
Because of part (iii) of the above exercise, we can extend the Lebesgue integral to real-valued absolutely integrable functions that are only defined and real-valued almost everywhere, rather than everywhere. In particular, we can apply the Lebesgue integral to functions that are sometimes infinite, so long as they are only infinite on a set of measure zero, and the function is absolutely integrable everywhere else.
Finally, we extend to complex-valued functions. If is absolutely integrable, observe that the real and imaginary parts are also absolutely integrable (because ). We then define the (complex) Lebesgue integral of in terms of the real Lebesgue integral by the formula
Clearly, if is real-valued and absolutely integrable, then the real Lebesgue integral and the complex Lebesgue integral of coincide, so it does not create ambiguity to use the same symbol for both concepts. It is routine to extend the linearity properties of the real Lebesgue integral to its complex counterpart:
Exercise 9 Let be complex-valued absolutely integrable functions, and let .
- (i) (Linearity) Show that and are also complex-valued absolutely integrable functions, with
and
(For the second relation, one may wish to first treat the special cases and .)
- (ii) Show that if and are equal almost everywhere, then
- (iii) Show that , with equality if and only if is zero almost everywhere.
- (iv) (Markov inequality) Show that for any .
We record a simple, but incredibly fundamental, inequality concerning the Lebesgue integral:
Lemma 10 (Triangle inequality) If is a complex-valued absolutely integrable function, then
Proof: We have
This looks weaker than what we want to prove, but we can “amplify” this inequality to the full strength triangle inequality as follows. Replacing by for any real , we have
Since we can choose the phase to make the expression equal to , the claim follows.
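Spelled out (a reconstruction of the standard amplification argument, since the displayed formulas did not survive): the real-valued case gives, for every phase θ,

```latex
\mathrm{Re}\left( e^{i\theta} \int_X f\ d\mu \right)
= \int_X \mathrm{Re}\left(e^{i\theta} f\right)\ d\mu
\le \int_X \left|e^{i\theta} f\right|\ d\mu
= \int_X |f|\ d\mu,
```

and choosing θ so that the quantity inside the real part on the left-hand side equals the absolute value of the integral of f yields the full triangle inequality.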
Finally, we observe that the Lebesgue integral extends the Riemann integral, which is particularly useful when it comes to actually computing some of these integrals:
Exercise 11 If is a Riemann integrable function on a compact interval , show that is also absolutely integrable, and that the Lebesgue integral (with Lebesgue measure restricted to ) coincides with the Riemann integral . Similarly if is Riemann integrable on a box .
— 2. Expectation of random variables —
We now translate the above notions of integration on measure spaces to the probabilistic setting.
A random variable taking values in the unsigned extended real line is said to be simple if it takes on at most finitely many values. Equivalently, can be expressed as a finite unsigned linear combination
of indicator random variables, where are unsigned and are events. We then define the simple expectation of to be the quantity
and one checks that this definition is independent of the choice of decomposition of into indicator functions. Observe that if we model the random variable using a probability space , then the simple expectation of is precisely the simple integral of the corresponding unsigned simple function .
Next, given an arbitrary unsigned random variable taking values in , one defines its (unsigned) expectation as
where ranges over all simple unsigned random variables such that is surely true. This extends the simple expectation (thus for all simple unsigned ), and in terms of a probability space model , the expectation is precisely the unsigned integral of .
A scalar random variable is said to be absolutely integrable if , thus for instance any bounded random variable is absolutely integrable. If is real-valued and absolutely integrable, we define its expectation by the formula
where is any representation of as the difference of unsigned absolutely integrable random variables ; one can check that this definition does not depend on the choice of representation and is thus well-defined. For complex-valued absolutely integrable , we then define
In all of these cases, the expectation of is equal to the integral of the representation of in any probability space model; in the case that is given by a discrete probability model, one can check that this definition of expectation agrees with the one given in Notes 0. Using the former fact, we can translate the properties of integration already established to the probabilistic setting:
Proposition 12
- (i) (Unsigned linearity) If are unsigned random variables, and is a deterministic unsigned quantity, then and . (Note that these identities hold even when are not absolutely integrable.)
- (ii) (Complex linearity) If are absolutely integrable random variables, and is a deterministic complex quantity, then and are also absolutely integrable, with and .
- (iii) (Compatibility with probability) If is an event, then . In particular, .
- (iv) (Almost sure equivalence) If are unsigned (resp. absolutely integrable) and almost surely, then .
- (v) If is unsigned or absolutely integrable, then , with equality if and only if almost surely.
- (vi) (Monotonicity) If are unsigned or real-valued absolutely integrable, and almost surely, then .
- (vii) (Markov inequality) If is unsigned or absolutely integrable, then for any deterministic .
- (viii) (Triangle inequality) If is absolutely integrable, then .
As before, we can use part (iv) to define expectation of scalar random variables that are only defined and finite almost surely, rather than surely.
Note that we have built the notion of expectation (and of related notions, such as absolute integrability) out of notions that were already probabilistic in nature, in the sense that they were unaffected if one replaced the underlying probabilistic model with an extension. Therefore, the notion of expectation is automatically probabilistic in the same sense. Because of this, we will easily be able to manipulate expectations of random variables without having to explicitly mention an underlying probability space , and so one will now see such spaces fade from view starting from this point in the course.
— 3. Exchanging limits with integrals or expectations —
When performing analysis on measure spaces, it is important to know if one can interchange a limit with an integral:
Similarly, in probability theory, we often wish to interchange a limit with an expectation:
Of course, one needs the integrands or random variables to be either unsigned or absolutely integrable, and the limits to be well-defined to have any hope of doing this. Naively, one could hope that limits and integrals could always be exchanged when the expressions involved are well-defined, but this is unfortunately not the case. In the case of integration on, say, the real line using Lebesgue measure , we already see four key examples:
In all these examples, the limit of the integral exceeds the integral of the limit; by replacing with in the first three examples (which involve absolutely integrable functions) one can also build examples where the limit of the integral is less than the integral of the limit. Most of these examples rely on the infinite measure of the real line and thus do not directly have probabilistic analogues, but the concentrating bump example involves functions that are all supported on the unit interval and thus also poses a problem in the probabilistic setting.
Nevertheless, there are three important cases in which we can relate the limit (or, in the case of Fatou’s lemma, the limit inferior) of the integral to the integral of the limit (or limit inferior). Informally, they are:
These three results then have analogues for convergence of random variables. We will also mention a fourth useful tool in that setting, which allows one to exchange limits and expectations when one controls a higher moment. There are a few more such general results allowing limits to be exchanged with integrals or expectations, but my advice would be to work out such exchanges by hand rather than blindly cite (possibly incorrectly) an additional convergence theorem beyond the four mentioned above, as this is safer and will help strengthen one’s intuition on the situation.
We now state and prove these results more explicitly.
Lemma 13 (Fatou’s lemma) Let be a measure space, and let be a sequence of unsigned measurable functions. Then
An equivalent form of this lemma is that if one has
for some and all sufficiently large , then one has
as well. That is to say, if the original unsigned functions eventually have “mass” less than or equal to , then the limit (inferior) also has “mass” less than or equal to . The limit may have substantially less mass, as the four examples above show, but it can never have more mass (asymptotically) than the functions that comprise the limit. Of course, one can replace limit inferior by limit in the left or right hand side if one knows that the relevant limit actually exists (but one cannot replace limit inferior by limit superior if one does not already have convergence, see Example 15 below). On the other hand, it is essential that the are unsigned for Fatou’s lemma to work, as can be seen by negating one of the first three key examples mentioned above.
Proof: By definition of the unsigned integral, it suffices to show that
whenever is an unsigned simple function with . Multiplying by , it thus suffices to show that
for any and any unsigned as above.
We can write as the sum for some strictly positive and disjoint ; we allow the and the measures to be infinite. On each , we have . Thus, if we define
then the increase to as : . By continuity from below (Exercise 23 of Notes 0), we thus have
as . Since
we conclude upon integration that
and thus on taking limit inferior
But the right-hand side is , and the claim follows.
Of course, Fatou’s lemma may be phrased probabilistically:
Lemma 14 (Fatou’s lemma for random variables) Let be a sequence of unsigned random variables. Then
As a corollary, if are unsigned and converge almost surely to a random variable , then
Example 15 We now give an example to show that limit inferior cannot be replaced with limit superior in Fatou’s lemma. Let be drawn uniformly at random from , and for each , let be the binary digit of , thus when has odd integer part, and otherwise. (There is some ambiguity with the binary expansion when is a terminating binary decimal, but this event almost surely does not occur and can thus be safely ignored.) One has for all (why?). It is then easy to see that is almost surely (which is consistent with Fatou’s lemma) but is almost surely (so Fatou’s lemma fails if one replaces limit inferior with limit superior).
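A quick sketch of the digit variables in Example 15 (note that a Python float carries only about 53 random bits, so this is a finite-precision approximation of the binary expansion):

```python
import random

random.seed(2)

def binary_digits(u: float, n: int):
    """First n binary digits of u in [0, 1)."""
    digits = []
    for _ in range(n):
        u *= 2
        d = int(u)
        digits.append(d)
        u -= d
    return digits

# For U uniform in [0,1), the digits X_1, X_2, ... behave like iid fair
# coin flips, so almost surely both values occur infinitely often:
# liminf X_n = 0 and limsup X_n = 1.
digits = binary_digits(random.random(), 64)
print(min(digits), max(digits))  # 0 1: both digit values occur
```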
Next, we establish the monotone convergence theorem.
Theorem 16 (Monotone convergence theorem) Let be a measure space, and let be a sequence of unsigned measurable functions which is monotone increasing, thus for all and . Then
Note that the limits exist on both sides because monotone sequences always have limits. Indeed, the limit on either side is equal to the corresponding supremum. The receding infinity example shows that it is important that the functions here are monotone increasing rather than monotone decreasing. We also observe that it is enough for the to be increasing almost everywhere rather than everywhere, since one can then modify the on a set of measure zero to be increasing everywhere, which does not affect the integrals on either side of this theorem.
Proof: From Fatou’s lemma we already have
On the other hand, from monotonicity we see that
for any natural number , and on taking limits as we obtain the claim.
An important corollary of the monotone convergence theorem is that one can freely interchange infinite sums with integrals for unsigned functions, that is to say
for any unsigned (not necessarily monotone). Indeed, to see this one simply applies the monotone convergence theorem to the partial sums .
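This interchange can be checked numerically in a toy case (the three-point space with counting measure and the functions f_k(x) = x^k are our own example): the sum of the integrals matches the integral of the closed-form sum x/(1-x).

```python
omega = [0.1, 0.2, 0.3]  # finite space, counting measure

# Unsigned functions f_k(x) = x^k for k = 1, 2, 3, ...
# The corollary of monotone convergence asserts
#     sum_k integral(f_k) = integral(sum_k f_k),
# and here sum_k f_k(x) = x/(1-x) in closed form.
lhs = sum(sum(x**k for x in omega) for k in range(1, 60))  # truncated sum
rhs = sum(x / (1 - x) for x in omega)
print(lhs, rhs)  # both ≈ 0.78968 (truncation error is negligible)
```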
We of course can translate this into the probabilistic context:
Theorem 17 (Monotone convergence theorem for random variables) Let be a monotone non-decreasing sequence of unsigned random variables. Then
Similarly, for any unsigned random variables , we have
Again, it is sufficient for the to be non-decreasing almost surely. We note a basic but important corollary of this theorem, namely the (first) Borel-Cantelli lemma:
Lemma 18 (Borel-Cantelli lemma) Let be a sequence of events with . Then almost surely, at most finitely many of the events hold; that is to say, one has almost surely.
Proof: From the monotone convergence theorem, we have
By Markov’s inequality, this implies that is almost surely finite, as required.
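A quick simulation sketch of the lemma: take events E_n = {U_n ≤ 1/n²} for hypothetical independent uniform variables U_n, so the probabilities are summable, and count how many events actually occur in a sample (the seed and event choice are illustrative assumptions):

```python
import math
import random

random.seed(0)

# Events E_n with P(E_n) = 1/n^2; the probabilities sum to pi^2/6 < infinity,
# so by Borel-Cantelli only finitely many events should occur.
N = 100_000
count = sum(1 for n in range(1, N + 1) if random.random() <= 1 / n**2)

expected = sum(1 / n**2 for n in range(1, N + 1))  # expected number of events
print(count, "events occurred; expected number is about", round(expected, 3))
```

The expected count is bounded by π²/6 ≈ 1.645 no matter how large N is, consistent with only finitely many events holding.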
We will develop a partial converse to this lemma (the “second” Borel-Cantelli lemma) in a subsequent set of notes. For now, we give a crude converse in which we assume not only that the probabilities sum to infinity, but that they are in fact uniformly bounded from below:
Exercise 19 Let be a sequence of events with . Show that with positive probability, an infinite number of the hold; that is to say, . (Hint: if for all , establish the lower bound for all . Alternatively, one can apply Fatou’s lemma to the random variables .)
Finally, we give the dominated convergence theorem.
Theorem 20 (Dominated convergence theorem) Let be a measure space, and let be measurable functions which converge pointwise to some limit. Suppose that there is an unsigned absolutely integrable function which dominates the in the sense that for all and all . Then
In particular, the limit on the right-hand side exists.
Again, it will suffice for to dominate each almost everywhere rather than everywhere, as one can upgrade this to everywhere domination by modifying each on a set of measure zero. Similarly, pointwise convergence can be replaced with pointwise convergence almost everywhere. The domination of each by a single function implies that the integrals are uniformly bounded in , but this latter condition is not sufficient by itself to guarantee interchangeability of the limit and integral, as can be seen by the first three examples given at the start of this section.
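To see concretely why uniform boundedness of the integrals does not suffice, one can tabulate the concentrating bump f_n = n·1_{(0,1/n)} on [0,1]: every integral equals 1, yet for each fixed x > 0 the values f_n(x) are eventually 0. A sketch with exact arithmetic:

```python
from fractions import Fraction

# Concentrating bump: f_n = n * 1_{(0, 1/n)} on [0, 1].
def integral_fn(n):
    # \int_0^1 f_n = n * (length of (0, 1/n)) = n * (1/n) = 1
    return Fraction(n) * Fraction(1, n)

def f(n, x):
    return n if 0 < x < Fraction(1, n) else 0

print([integral_fn(n) for n in (1, 10, 100)])        # [1, 1, 1]
print([f(n, Fraction(1, 2)) for n in (1, 2, 3, 4)])  # [1, 0, 0, 0]
```

The integrals stay at 1 while the pointwise limit is 0 (whose integral is 0), so no single absolutely integrable function can dominate all the f_n.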
Proof: By splitting into real and imaginary parts, we may assume without loss of generality that the are real-valued. As is absolutely integrable, it is finite almost everywhere; after modification on a set of measure zero we may assume it is finite everywhere. Let denote the pointwise limit of the . From Fatou’s lemma applied to the unsigned functions and , we have
and
Rearranging this (taking crucial advantage of the finite nature of the , and hence and ), we conclude that
and the claim follows.
Remark 21 Amusingly, one can use the dominated convergence theorem to give an (extremely indirect) proof of the divergence of the harmonic series . For, if that series were convergent, then the function would be absolutely integrable, and the spreading bump example described above would contradict the dominated convergence theorem. (Expert challenge: see if you can deconstruct the above argument enough to lower bound the rate of divergence of the harmonic series .)
We again translate the above theorem to the probabilistic context:
Theorem 22 (Dominated convergence theorem for random variables) Let be scalar random variables which converge almost surely to a limit . Suppose there is an unsigned absolutely integrable random variable such that almost surely for each . Then
As a corollary of the dominated convergence theorem for random variables we have the bounded convergence theorem: if are scalar random variables that converge almost surely to a limit , and are almost surely bounded in magnitude by a uniform constant , then we have
(In Durrett, the bounded convergence theorem is proven first, and then used to establish Fatou’s lemma and the dominated and monotone convergence theorems. The order in which one establishes these results – which are all closely related to each other – is largely a matter of personal taste.) A further corollary of the dominated convergence theorem is that one has the identity
whenever are scalar random variables with absolutely integrable (or equivalently, that is finite).
Another useful variant of the dominated convergence theorem is
Theorem 23 (Convergence for random variables with bounded moment) Let be scalar random variables which converge almost surely to a limit . Suppose there is and such that for all . Then
This theorem fails for , as the concentrating bump example shows. The case (that is to say, bounded second moment ) is already quite useful. The intuition here is that concentrating bumps are in some sense the only obstruction to interchanging limits and expectations, and these can be eliminated by hypotheses such as a bounded higher moment hypothesis or a domination hypothesis.
Proof: By taking real and imaginary parts we may assume that the (and hence ) are real-valued. For any natural number , let denote the truncation of to the interval , and similarly define . Then converges pointwise to , and hence by the bounded convergence theorem
On the other hand, we have
(why?) and thus on taking expectations and using the triangle inequality
where we are using the asymptotic notation to denote a quantity bounded in magnitude by for an absolute constant . Also, from Fatou’s lemma we have
so we similarly have
Putting all this together, we see that
Sending , we obtain the claim.
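The heart of this truncation argument is the tail bound E[|X|·1_{|X|>M}] ≤ E[|X|²]/M (the second-moment instance of the bounded-moment hypothesis, since |X| ≤ |X|²/M on the event |X| > M). An exact check on a toy three-point distribution (the distribution itself is an illustrative assumption):

```python
from fractions import Fraction

# Discrete random variable X with P(X = k) = p_k.
dist = {1: Fraction(1, 2), 2: Fraction(1, 4), 10: Fraction(1, 4)}
M = 5

# E[|X| 1_{|X| > M}] vs. the bound E[|X|^2] / M:
tail = sum(abs(k) * p for k, p in dist.items() if abs(k) > M)
second_moment = sum(k**2 * p for k, p in dist.items())

print(tail, second_moment / M)  # 5/2 and 53/10
```

As expected, the tail contribution (5/2 here) is controlled by the second-moment bound (53/10), which tends to 0 as M grows when the second moment is bounded.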
Remark 24 The essential point about the condition was that the function grew faster than linearly as . One could accomplish the same result with any other function with this property, e.g. a hypothesis such as would also suffice. The most natural general condition to impose here is that of uniform integrability, which encompasses the hypotheses already mentioned, but we will not focus on this condition here.
Exercise 25 (Scheffé’s lemma) Let be a sequence of absolutely integrable scalar random variables that converge almost surely to another absolutely integrable scalar random variable . Suppose also that converges to as . Show that converges to zero as . (Hint: there are several ways to prove this result, known as Scheffé’s lemma. One is to split into two components , such that is dominated by but converges almost surely to , and is such that . Then apply the dominated convergence theorem.)
— 4. The distribution of a random variable —
We have seen that the expectation of a random variable is a special case of the more general notion of Lebesgue integration on a measure space. There is however another way to think of expectation as a special case of integration, which is particularly convenient for computing expectations. We first need the following definition.
Definition 26 Let be a random variable taking values in a measurable space . The distribution of (also known as the law of ) is the probability measure on defined by the formula
for all measurable sets ; one easily sees from the Kolmogorov axioms that this is indeed a probability measure.
Example 27 If only takes on at most countably many values (and if every point in is measurable), then the distribution is the discrete measure that assigns each point in the range of a measure of .
Example 28 If is a real random variable with cumulative distribution function , then is the Lebesgue-Stieltjes measure associated to . For instance, if is drawn uniformly at random from , then is Lebesgue measure restricted to . In particular, two scalar variables are equal in distribution if and only if they have the same cumulative distribution function.
Example 29 If and are the results of two separate rolls of a fair die (as in Example 3 of Notes 0), then and are equal in distribution, but are not equal as random variables.
Remark 30 In the converse direction, given a probability measure on a measurable space , one can always build a probability space model and a random variable represented by that model whose distribution is . Indeed, one can perform the “tautological” construction of defining the probability space model to be , and to be the identity function , and then one easily checks that . Compare with Corollaries 26 and 29 of Notes 0. Furthermore, one can view this tautological model as a “base” model for random variables of distribution as follows. Suppose one has a random variable of distribution which is modeled by some other probability space , thus is a measurable function such that
for all . Then one can view the probability space as an extension of the tautological probability space using as the factor map.
We say that two random variables are equal in distribution, and write , if they have the same law: , that is to say for any measurable set in the range. This definition makes sense even when are defined on different sample spaces. Roughly speaking, the distribution captures the “size” and “shape” of the random variable, but not its “location” or how it relates to other random variables.
Theorem 31 (Change of variables formula) Let be a random variable taking values in a measurable space . Let or be a measurable scalar function (giving or the Borel -algebra of course) such that either , or that . Then
Thus for instance, if is a real random variable, then
and more generally
for all ; furthermore, if is unsigned or absolutely integrable, one has
The point here is that the integration is not over some unspecified sample space , but over a very explicit domain, namely the reals; we have “changed variables” to integrate over instead of over , with the distribution representing the “Jacobian” factor that typically shows up in such change of variables formulae.
Proof: First suppose that is unsigned and only takes on a finite number of values. Then
and hence
as required.
Next, suppose that is unsigned but can take on infinitely many values. We can express as the monotone increasing limit of functions that only take a finite number of values; for instance we can define to be rounded down to the largest multiple of less than both and . By the preceding computation, we have
and on taking limits as using the monotone convergence theorem we obtain the claim in this case.
Now suppose that is real-valued with . We write where and ; then we have and
for . Subtracting one of these identities from the other, we obtain the claim.
Finally, the case of complex-valued with follows from the real-valued case by taking real and imaginary parts.
Example 32 Let be the uniform distribution on , then
for any Riemann integrable ; thus for instance
for any .
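In the discrete case, the change of variables formula is just a regrouping of a sum. A sketch for a fair die with the illustrative choice g(x) = x²:

```python
from fractions import Fraction

# Fair die: the law mu_X assigns mass 1/6 to each of 1,...,6.
law = {k: Fraction(1, 6) for k in range(1, 7)}
g = lambda x: x * x

# Expectation computed on the (tautological) sample space model:
e_sample = sum(g(omega) * Fraction(1, 6) for omega in range(1, 7))
# Expectation computed against the distribution mu_X:
e_law = sum(g(x) * p for x, p in law.items())

print(e_sample, e_law)  # both equal 91/6
```

Both computations regroup the same terms, so they agree exactly, in accordance with the change of variables formula.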
Remark 33 An alternate way to prove the change of variables formula is to observe that the formula is obviously true when one uses the tautological model for , and then the claim follows from the model-independence of expectation and the observation from Remark 30 that any other model for is an extension of the tautological model.
— 5. Some basic inequalities —
We record here for future reference some basic inequalities concerning expectation that we will need in the sequel. We have already seen the triangle inequality
for absolutely integrable , and the Markov inequality
for arbitrary scalar and (note the inequality is trivial if is not absolutely integrable). Applying the Markov inequality to the quantity we obtain the important Chebyshev inequality
for absolutely integrable and , where the variance of is defined as
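As an exact sanity check of Chebyshev’s inequality on a small discrete distribution (chosen, as an illustration, so that the bound is attained with equality):

```python
from fractions import Fraction

# X takes values -2, 0, 2; P(|X - E[X]| >= lambda) <= Var(X) / lambda^2.
dist = {-2: Fraction(1, 4), 0: Fraction(1, 2), 2: Fraction(1, 4)}
mean = sum(x * p for x, p in dist.items())               # 0
var = sum((x - mean) ** 2 * p for x, p in dist.items())  # 2
lam = 2
lhs = sum(p for x, p in dist.items() if abs(x - mean) >= lam)
print(lhs, var / lam**2)  # 1/2 and 1/2: the inequality is sharp here
```

This two-sided distribution shows that Chebyshev’s inequality cannot be improved in general.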
Next, we record
Lemma 34 (Jensen’s inequality) If is a convex function, is a real random variable with and both absolutely integrable, then
Proof: Let be a real number. Being convex, the graph of must be supported by some line at , that is to say there exists a slope (depending on ) such that for all . (If is differentiable at , one can take to be the derivative of at , but one always has a supporting line even in the non-differentiable case.) In particular
Taking expectations and using linearity of expectation, we conclude
and the claim follows from setting .
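For a concrete instance of Jensen’s inequality, take the convex function φ(x) = x² and check φ(E[X]) ≤ E[φ(X)] exactly on a three-point distribution (an illustrative choice):

```python
from fractions import Fraction

# Jensen's inequality with phi(x) = x^2: phi(E[X]) <= E[phi(X)].
dist = {1: Fraction(1, 3), 2: Fraction(1, 3), 6: Fraction(1, 3)}
mean = sum(x * p for x, p in dist.items())             # 3
phi_of_mean = mean ** 2                                # 9
mean_of_phi = sum(x**2 * p for x, p in dist.items())   # 41/3
print(phi_of_mean, mean_of_phi)  # 9 <= 41/3
```

The gap 41/3 − 9 = 14/3 is exactly the variance of X in this case, reflecting the identity E[X²] − (E[X])² = Var(X).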
Exercise 35 (Complex Jensen inequality) Let be a convex function (thus for all complex and all ), and let be a complex random variable with and both absolutely integrable. Show that
Note that the triangle inequality is the special case of Jensen’s inequality (or the complex Jensen’s inequality, if is complex-valued) corresponding to the convex function on (or on ). Another useful example is
As a related application of convexity, observe from the convexity of the function that
for any and . This implies in particular Young’s inequality
for any scalar and any exponents with ; note that this inequality is also trivially true if one or both of are infinite. Taking expectations, we conclude that
if are scalar random variables and are deterministic exponents with . In particular, if are absolutely integrable, then so is , and
We can amplify this inequality as follows. Multiplying by some and dividing by the same , we conclude that
optimising the right-hand side in , we obtain (after some algebra, and after disposing of some edge cases when or is almost surely zero) the important Hölder inequality
where we use the notation
for . Using the convention
(thus is the essential supremum of ), we also see from the triangle inequality that the Hölder inequality applies in the boundary case when one of is allowed to be (so that the other is equal to ):
The case is the important Cauchy-Schwarz inequality
valid whenever are square-integrable in the sense that are finite.
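The p = q = 2 case can be checked numerically on a small uniform sample space (the vectors X and Y below are arbitrary illustrative choices):

```python
from fractions import Fraction
import math

# Cauchy-Schwarz (Holder with p = q = 2) on a 3-point uniform sample space.
w = [Fraction(1, 3)] * 3
X = [1, 2, 3]
Y = [2, 1, 2]

e_xy = float(sum(x * y * p for x, y, p in zip(X, Y, w)))          # E[XY] = 10/3
norm_x = math.sqrt(sum(x * x * p for x, p in zip(X, w)))          # ||X||_2
norm_y = math.sqrt(sum(y * y * p for y, p in zip(Y, w)))          # ||Y||_2

print(e_xy <= norm_x * norm_y + 1e-12)  # True
```

Here E[XY] = 10/3 ≈ 3.33, while ‖X‖₂‖Y‖₂ ≈ 3.74, so the inequality holds with some room to spare (equality would require X and Y to be proportional).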
Exercise 36 Show that the expressions are non-decreasing in for . In particular, if is finite for some , then it is automatically finite for all smaller values of .
Exercise 37 For any square-integrable , show that
Exercise 38 If and are scalar random variables with , use Hölder’s inequality to establish that
and
and then conclude the Minkowski inequality
Show that this inequality is also valid at the endpoint cases and .
Exercise 39 If is non-negative and square-integrable, and , establish the Paley-Zygmund inequality
(Hint: use the Cauchy-Schwarz inequality to upper bound in terms of and .)
Note: as this set of notes is primarily concerned with foundational issues, it will contain a large number of pedantic (and nearly trivial) formalities and philosophical points. We dwell on these technicalities in this set of notes primarily so that they are out of the way in later notes, when we work with the actual mathematics of probability, rather than on the supporting foundations of that mathematics. In particular, the excessively formal and philosophical language in this set of notes will not be replicated in later notes.
— 1. Some philosophical generalities —
By default, mathematical reasoning is understood to take place in a deterministic mathematical universe. In such a universe, any given mathematical statement (that is to say, a sentence with no free variables) is either true or false, with no intermediate truth value available. Similarly, any deterministic variable can take on only one specific value at a time.
However, for a variety of reasons, both within pure mathematics and in the applications of mathematics to other disciplines, it is often desirable to have a rigorous mathematical framework in which one can discuss non-deterministic statements and variables – that is to say, statements which are not always true or always false, but in some intermediate state, or variables that do not take one particular value or another with definite certainty, but are again in some intermediate state. In probability theory, which is by far the most widely adopted mathematical framework to formally capture the concept of non-determinism, non-deterministic statements are referred to as events, and non-deterministic variables are referred to as random variables. In the standard foundations of probability theory, as laid out by Kolmogorov, we can then model these events and random variables by introducing a sample space (which will be given the structure of a probability space) to capture all the ambient sources of randomness; events are then modeled as measurable subsets of this sample space, and random variables are modeled as measurable functions on this sample space. (We will briefly discuss a more abstract way to set up probability theory, as well as other frameworks to capture non-determinism than classical probability theory, at the end of this set of notes; however, the rest of the course will be concerned exclusively with classical probability theory using the orthodox Kolmogorov models.)
Note carefully that sample spaces (and their attendant structures) will be used to model probabilistic concepts, rather than to actually be the concepts themselves. This distinction (a mathematical analogue of the map-territory distinction in philosophy) actually is implicit in much of modern mathematics, when we make a distinction between an abstract version of a mathematical object, and a concrete representation (or model) of that object. For instance:
The distinction between abstract objects and concrete models can be fairly safely discarded if one is only going to use a single model for each abstract object, particularly if that model is “canonical” in some sense. However, one needs to keep the distinction in mind if one plans to switch between different models of a single object (e.g. to perform change of basis in linear algebra, change of coordinates in differential geometry, or base change in algebraic geometry). As it turns out, in probability theory it is often desirable to change the sample space model (for instance, one could extend the sample space by adding in new sources of randomness, or one could couple together two systems of random variables by joining their sample space models together). Because of this, we will take some care in this foundational set of notes to distinguish probabilistic concepts (such as events and random variables) from their sample space models. (But we may be more willing to conflate the two in later notes, once the foundational issues are out of the way.)
From a foundational point of view, it is often logical to begin with some axiomatic description of the abstract version of a mathematical object, and discuss the concrete representations of that object later; for instance, one could start with the axioms of an abstract group, and then later consider concrete representations of such a group by permutations, invertible linear transformations, and so forth. This approach is often employed in the more algebraic areas of mathematics. However, there are at least two other ways to present these concepts which can be preferable from a pedagogical point of view. One way is to start with the concrete representations as motivating examples, and only later give the abstract object that these representations are modeling; this is how linear algebra, for instance, is often taught at the undergraduate level, by starting first with , , and , and only later introducing the abstract vector spaces. Another way is to avoid the abstract objects altogether, and focus exclusively on concrete representations, but taking care to emphasise how these representations transform when one switches from one representation to another. For instance, in general relativity courses in undergraduate physics, it is not uncommon to see tensors presented purely through the concrete representation of coordinates indexed by multiple indices, with the transformation of such tensors under changes of variable carefully described; the abstract constructions of tensors and tensor spaces using operations such as tensor product and duality of vector spaces or vector bundles are often left to an advanced differential geometry class to set up properly.
The foundations of probability theory are usually presented (almost by default) using the last of the above three approaches; namely, one talks almost exclusively about sample space models for probabilistic concepts such as events and random variables, and only occasionally dwells on the need to extend or otherwise modify the sample space when one needs to introduce new sources of randomness (or to forget about some existing sources of randomness). However, much as in differential geometry one tends to work with manifolds without specifying any given atlas of coordinate charts, in probability one usually manipulates events and random variables without explicitly specifying any given sample space. For a student raised exclusively on concrete sample space foundations of probability, this can be a bit confusing, for instance it can give the misconception that any given random variable is somehow associated to its own unique sample space, with different random variables possibly living on different sample spaces, which often leads to nonsense when one then tries to combine those random variables together. Because of such confusions, we will try to take particular care in these notes to separate probabilistic concepts from their sample space models.
— 2. A simple class of models: discrete probability spaces —
The simplest models of probability theory are those generated by discrete probability spaces, which are adequate models for many applications (particularly in combinatorics and other areas of discrete mathematics), and which already capture much of the essence of probability theory while avoiding some of the finer measure-theoretic subtleties. We thus begin by considering discrete sample space models.
Definition 1 (Discrete probability theory) A discrete probability space is an at most countable set (whose elements will be referred to as outcomes), together with a non-negative real number assigned to each outcome such that ; we refer to as the probability of the outcome . The set itself, without the structure , is often referred to as the sample space, though we will often abuse notation by using the sample space to refer to the entire discrete probability space .
In discrete probability theory, we choose an ambient discrete probability space as the randomness model. We then model events by subsets of the sample space . The probability of an event is defined to be the quantity
note that this is a real number in the interval . An event is surely true or is the sure event if , and is surely false or is the empty event if .
We model random variables taking values in the range by functions from the sample space to the range . Random variables taking values in will be called real random variables or random real numbers. Similarly for random variables taking values in . We refer to real and complex random variables collectively as scalar random variables.
We consider two events to be equal if they are modeled by the same set: . Similarly, two random variables taking values in a common range are considered to be equal if they are modeled by the same function: . In particular, if the discrete sample space is understood from context, we will usually abuse notation by identifying an event with its model , and similarly identify a random variable with its model .
Remark 2 One can view classical (deterministic) mathematics as the special case of discrete probability theory in which is a singleton set (there is only one outcome ), and the probability assigned to the single outcome in is : . Then there are only two events (the surely true and surely false events), and a random variable in can be identified with a deterministic element of . Thus we can view probability theory as a generalisation of deterministic mathematics.
As discussed in the preceding section, the distinction between a collection of events and random variable and its models becomes important if one ever wishes to modify the sample space, and in particular to extend the sample space to a larger space that can accommodate new sources of randomness (an operation which we will define formally later, but which for now can be thought of as an analogue to change of basis in linear algebra, coordinate change in differential geometry, or base change in algebraic geometry). This is best illustrated with a simple example.
Example 3 (Extending the sample space) Suppose one wishes to model the outcome of rolling a single, unbiased six-sided die using discrete probability theory. One can do this by choosing the discrete probability space to be the six-element set , with each outcome given an equal probability of of occurring; this outcome may be interpreted as the state in which the die roll ended up being equal to . The outcome of rolling a die may then be identified with the identity function , defined by for . If we let be the event that the outcome of rolling the die is an even number, then with this model we have , and
Now suppose that we wish to roll the die again to obtain a second random variable . The sample space is inadequate for modeling both the original die roll and the second die roll . To accommodate this new source of randomness, we can then move to the larger discrete probability space , with each outcome now having probability ; this outcome can be interpreted as the state in which the die roll ended up being , and the die roll ended up being . The random variable is now modeled by a new function defined by for ; the random variable is similarly modeled by the function defined by for . The event that is even is now modeled by the set
This set is distinct from the previous model of (for instance, has eighteen elements, whereas has just three), but the probability of is unchanged:
One can of course also combine together the random variables in various ways. For instance, the sum of the two die rolls is a random variable taking values in ; it cannot be modeled by the sample space , but in it is modeled by the function
Similarly, the event that the two die rolls are equal cannot be modeled by , but is modeled in by the set
and the probability of this event is
We thus see that extending the probability space has also enlarged the space of events one can consider, as well as the random variables one can define, but that existing events and random variables continue to be interpretable in the extended model, and that probabilistic concepts such as the probability of an event remain unchanged by the extension of the model.
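The bookkeeping in this example is easy to reproduce by brute-force enumeration of the 36-outcome product space:

```python
from fractions import Fraction
from itertools import product

# Product sample space {1,...,6}^2 for two fair die rolls X and Y,
# each outcome (i, j) having probability 1/36.
omega = list(product(range(1, 7), repeat=2))
prob = Fraction(1, 36)

p_first_even = sum(prob for (i, j) in omega if i % 2 == 0)  # 18 outcomes
p_equal = sum(prob for (i, j) in omega if i == j)           # 6 outcomes
print(p_first_even, p_equal)  # 1/2 and 1/6
```

The event that X is even is modeled by 18 of the 36 outcomes (probability 1/2, unchanged from the smaller model), while the event X = Y, which could not be expressed in the original six-element sample space at all, has probability 6/36 = 1/6.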
The set-theoretic operations on the sample space induce similar boolean operations on events:
Thus, for instance, the conjunction of the event that a die roll is even, and that it is less than , is the event that the die roll is exactly . As before, we will usually be in a situation in which the sample space is clear from context, and in that case one can safely identify events with their models, and view the symbols and as being synonymous with their set-theoretic counterparts and (this is for instance what is done in Durrett).
With these operations, the space of all events (known as the event space) thus has the structure of a boolean algebra (defined below in Definition 4). We observe that the probability is finitely additive in the sense that
whenever are disjoint events; by induction this implies that
whenever are pairwise disjoint events. We have and , and more generally
for any event . We also have monotonicity: if , then .
Now we define operations on random variables. Whenever one has a function from one range to another , and a random variable taking values in , one can define a random variable taking values in by composing the relevant models:
thus maps to for any outcome . Given a finite number of random variables taking values in ranges , we can form the joint random variable taking values in the Cartesian product by concatenation of the models, thus
Combining these two operations, given any function of variables in ranges , and random variables taking values in respectively, we can form a random variable taking values in by the formula
Thus for instance we can add, subtract, or multiply two scalar random variables to obtain another scalar random variable.
A deterministic element of a range will (by abuse of notation) be identified with a random variable taking values in , whose model in is constant: for all . Thus for instance is a scalar random variable.
Given a relation on ranges , and random variables , we can define the event by setting
Thus for instance, for two real random variables , the event is modeled as
and the event is modeled as
At this point we encounter a slight notational conflict between the dual role of the equality symbol as a logical symbol and as a binary relation: we are interpreting both as an external equality relation between the two random variables (which is true iff the functions , are identical), and as an internal event (modeled by ). However, it is clear that is true in the external sense if and only if the internal event is surely true. As such, we shall abuse notation and continue to use the equality symbol for both the internal and external concepts of equality (and use the modifier “surely” for emphasis when referring to the external usage).
It is clear that any equational identity concerning functions or operations on deterministic variables implies the same identity (in the external, or surely true, sense) for random variables. For instance, the commutativity of addition for deterministic real numbers immediately implies the commutativity of addition: is surely true for real random variables ; similarly is surely true for all scalar random variables , etc.. We will freely apply the usual laws of algebra for scalar random variables without further comment.
Given an event , we can associate the indicator random variable (also written as in some texts) to be the unique real random variable such that when is true and when is false, thus is equal to when and otherwise. (The indicator random variable is sometimes called the characteristic function in analysis, and sometimes denoted instead of , but we avoid using the term “characteristic function” here, as it will have an unrelated but important meaning in probability theory.) We record the trivial but useful fact that Boolean operations on events correspond to arithmetic manipulations on their indicators. For instance, if are events, we have
and the inclusion-exclusion principle
In particular, if the events are disjoint, then
Also note that if and only if the assertion is surely true. We will use these identities and equivalences throughout the course without further comment.
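These indicator identities can be verified exhaustively, since an indicator only takes the values 0 and 1; it suffices to check all four truth-value combinations:

```python
from itertools import product

# Check 1_{E and F} = 1_E * 1_F and the inclusion-exclusion identity
# 1_{E or F} = 1_E + 1_F - 1_E * 1_F over all 0/1 combinations.
ok = True
for e, f in product((0, 1), repeat=2):
    ok &= (e and f) == e * f
    ok &= int(bool(e) or bool(f)) == e + f - e * f
print(ok)  # True
```

Since the identities hold for every combination of indicator values, they hold pointwise on any sample space.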
Given a scalar random variable , we can attempt to define the expectation through the model by the formula
If the discrete sample space is finite, then this sum is always well-defined and so every scalar random variable has an expectation. If however the discrete sample space is infinite, the expectation may not be well defined. There are however two key cases in which one has a meaningful expectation. The first is if the random variable is unsigned, that is to say it takes values in the non-negative reals , or more generally in the extended non-negative real line . In that case, one can interpret the expectation as an element of . The other case is when the random variable is absolutely integrable, which means that the absolute value (which is an unsigned random variable) has finite expectation: . In that case, the series defining is absolutely convergent to a real or complex number (depending on whether was a real or complex random variable.)
We have the basic link
between probability and expectation, valid for any event . We also have the obvious, but fundamentally important, property of linearity of expectation: we have
and
whenever is a scalar and are scalar random variables, either under the assumption that are all unsigned, or that are absolutely integrable. Thus for instance by applying expectations to (1) we obtain the identity
We close this section by noting that discrete probabilistic models stumble when trying to model continuous random variables, which take on an uncountable number of values. Suppose for instance one wants to model a random real number drawn uniformly at random from the unit interval , which is an uncountable set. One would then expect, for any subinterval of , that will fall into this interval with probability . Setting (or, if one wishes instead, taking a limit such as ), we conclude in particular that for any real number in , that will equal with probability . If one attempted to model this situation by a discrete probability model, we would find that each outcome of the discrete sample space has to occur with probability (since for each , the random variable has only a single value ). But we are also requiring that the sum is equal to , a contradiction. In order to address this defect we must generalise from discrete models to more general probabilistic models, to which we now turn.
— 3. The Kolmogorov foundations of probability theory —
We now present the more general measure-theoretic foundation of Kolmogorov which subsumes the discrete theory, while also allowing one to model continuous random variables. It turns out that in order to perform sums, limits and integrals properly, the finite additivity property of probability needs to be amplified to countable additivity (but, as we shall see, uncountable additivity is too strong of a property to ask for).
We begin with the notion of a measurable space. (See also this previous blog post, which covers similar material from the perspective of a real analysis graduate class rather than a probability class.)
Definition 4 (Measurable space) Let be a set. A Boolean algebra in is a collection of subsets of which
- contains and ;
- is closed under pairwise unions and intersections (thus if , then and also lie in ); and
- is closed under complements (thus if , then also lies in ).
(Note that some of these assumptions are redundant and can be dropped, thanks to de Morgan’s laws.) A -algebra in (also known as a -field) is a Boolean algebra in which is also
- closed under countable unions and countable intersections (thus if , then and ).
Again, thanks to de Morgan’s laws, one only needs to verify closure under just countable union (or just countable intersection) in order to verify that a Boolean algebra is a -algebra. A measurable space is a pair , where is a set and is a -algebra in . Elements of are referred to as measurable sets in this measurable space.
If are two -algebras in , we say that is coarser than (or is finer than ) if , thus every set that is measurable in is also measurable in .
Example 5 (Trivial measurable space) Given any set , the collection is a -algebra; in fact it is the coarsest -algebra one can place on . We refer to as the trivial measurable space on .
Example 6 (Discrete measurable space) At the other extreme, given any set , the power set is a -algebra (and is the finest -algebra one can place on ). We refer to as the discrete measurable space on .
Example 7 (Atomic measurable spaces) Suppose we have a partition of a set into disjoint subsets (which we will call atoms), indexed by some label set (which may be finite, countable, or uncountable). Such a partition defines a -algebra on , consisting of all sets of the form for subsets of (we allow to be empty); thus a set is measurable here if and only if it can be described as a union of atoms. One can easily verify that this is indeed a -algebra. The trivial and discrete measurable spaces in the preceding two examples are special cases of this atomic construction, corresponding to the trivial partition (in which there is just one atom ) and the discrete partition (in which the atoms are individual points in ).
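For a finite sample space, the atomic construction can be made completely concrete. The following sketch (a hypothetical illustration, not part of the text) enumerates all unions of atoms and checks the Boolean closure properties directly:

```python
from itertools import combinations

def atomic_sigma_algebra(atoms):
    """All unions of atoms (including the empty union), as frozensets."""
    algebra = set()
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            algebra.add(frozenset().union(*combo))
    return algebra

# Hypothetical example: partition {1,2,3,4} into the atoms {1,2} and {3,4}.
X = frozenset({1, 2, 3, 4})
algebra = atomic_sigma_algebra([frozenset({1, 2}), frozenset({3, 4})])

# Closure checks: complements and pairwise unions stay inside the algebra.
assert all(X - A in algebra for A in algebra)
assert all(A | B in algebra for A in algebra for B in algebra)
print(len(algebra))  # 4 measurable sets: {}, {1,2}, {3,4}, {1,2,3,4}
```

With two atoms one gets four measurable sets; in general, a partition into k atoms generates an algebra of 2^k sets.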
Example 8 Let be an uncountable set, and let be the collection of sets in which are either at most countable, or are cocountable (their complement is at most countable). Show that this is a -algebra on which is non-atomic (i.e. it is not of the form of the preceding example).
Example 9 (Generated measurable spaces) It is easy to see that if one has a non-empty family of -algebras on a set , then their intersection is also a -algebra, even if is uncountably infinite. Because of this, whenever one has an arbitrary collection of subsets in , one can define the -algebra generated by to be the intersection of all the -algebras that contain (note that there is always at least one -algebra participating in this intersection, namely the discrete -algebra). Equivalently, is the coarsest -algebra that views every set in as being measurable. (This is a rather indirect way to describe , as it does not make it easy to figure out exactly what sets lie in . There is a more direct description of this -algebra, but it requires the use of the first uncountable ordinal; see Exercise 15 of these notes.) In Durrett, the notation is used in place of .
Example 10 (Borel -algebra) Let be a topological space; to avoid pathologies let us assume that is locally compact Hausdorff and -compact, though the definition below also can be made for more general spaces. For instance, one could take or for some finite . We define the Borel -algebra on to be the -algebra generated by the open sets of . (Due to our topological hypotheses on , the Borel -algebra is also generated by the compact sets of .) Measurable subsets in the Borel -algebra are known as Borel sets. Thus for instance open and closed sets are Borel, and countable unions and countable intersections of Borel sets are Borel. In fact, as a rule of thumb, any subset of or that arises from a “non-pathological” construction (not using the axiom of choice, or from a deliberate attempt to build a non-Borel set) can be expected to be a Borel set. Nevertheless, non-Borel sets exist in abundance if one looks hard enough for them, even without the axiom of choice; see for instance Exercise 16 of this previous blog post.
The following exercise gives a useful tool (somewhat analogous to mathematical induction) to verify properties regarding measurable sets in generated -algebras, such as Borel -algebras.
Exercise 11 Let be a collection of subsets of a set , and let be a property of subsets of (thus is true or false for each in ). Assume the following axioms:
- is true.
- is true for all .
- If is such that is true, then is also true.
- If are such that is true for all , then is true.
Show that is true for all . (Hint: what can one say about ?)
Thus, for instance, if a property of subsets of is true for all open sets, and is closed under countable unions and complements, then it is automatically true for all Borel sets.
Example 12 (Pullback) Let be a measurable space, and let be any function from another set to . Then we can define the pullback of the -algebra to be the collection of all subsets in that are of the form for some . This is easily verified to be a -algebra. We refer to the measurable space as the pullback of the measurable space by . Thus for instance an atomic measurable space on generated by a partition is the pullback of (viewed as a discrete measurable space) by the “colouring” map from to that sends each element of to for all .
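On a finite set the pullback construction can be computed directly as the family of preimages; the toy example below (the parity "colouring" map is a hypothetical choice) shows that it coincides with the atomic sigma-algebra generated by the fibers of the map:

```python
from itertools import combinations

def pullback_sigma_algebra(X, f, range_points):
    """Pullback of the discrete sigma-algebra on range_points by f:
    the collection of all preimages f^{-1}(E) for E a subset of the range."""
    algebra = set()
    for r in range(len(range_points) + 1):
        for E in combinations(range_points, r):
            algebra.add(frozenset(x for x in X if f(x) in E))
    return algebra

# Hypothetical "colouring" map on a five-point sample space: parity.
X = {0, 1, 2, 3, 4}
sigma = pullback_sigma_algebra(X, lambda x: x % 2, [0, 1])

# The measurable sets are exactly the unions of the fibers {0,2,4} and {1,3}.
assert sigma == {frozenset(), frozenset({0, 2, 4}),
                 frozenset({1, 3}), frozenset(X)}
```

Here the two fibers play the role of atoms: no event in the pullback can distinguish two sample points with the same colour, matching the interpretation of the pullback as the information extractable from the measurement f.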
Remark 13 In probabilistic terms, one can interpret the space in the above construction as a sample space, and the function as some collection of “random variables” or “measurements” on that space, with being all the possible outcomes of these measurements. The pullback then represents all the “information” one can extract from that given set of measurements.
Example 14 (Product space) Let be a family of measurable spaces indexed by a (possibly infinite or uncountable) set . We define the product on the Cartesian product space by defining to be the -algebra generated by the basic cylinder sets of the form
for and . For instance, given two measurable spaces and , the product -algebra is generated by the sets and for . (One can also show that is the -algebra generated by the products for , but this observation does not extend to uncountable products of measurable spaces.)
Exercise 15 Show that with the Borel -algebra is the product of copies of with the Borel -algebra.
As with almost any other notion of space in mathematics, there is a natural notion of a map (or morphism) between measurable spaces.
Definition 16 A function between two measurable spaces , is said to be measurable if one has for all .
Thus for instance the pullback of a measurable space by a map could alternatively be defined as the coarsest measurable space structure on for which is still measurable. It is clear that the composition of measurable functions is also measurable.
Exercise 17 Show that any continuous map from topological spaces is measurable (when one gives and the Borel -algebras).
Exercise 18 If are measurable functions into measurable spaces , show that the joint function into the product space defined by is also measurable.
As a corollary of the above exercise, we see that if are measurable, and is measurable, then is also measurable. In particular, if or are scalar measurable functions, then so are , , , etc.
Next, we turn measurable spaces into measure spaces by adding a measure.
Definition 19 (Measure spaces) Let be a measurable space. A finitely additive measure on this space is a map obeying the following axioms:
- (Empty set) .
- (Finite additivity) If are disjoint, then .
A countably additive measure is a finitely additive measure obeying the following additional axiom:
- (Countable additivity) If are disjoint, then .
A probability measure on is a countably additive measure obeying the following additional axiom:
- (Unit total probability) .
A measure space is a triplet where is a measurable space and is a measure on that space. If is furthermore a probability measure, we call a probability space.
Example 20 (Discrete probability measures) Let be a discrete measurable space, and for each , let be a non-negative real number such that . (Note that this implies that there are at most countably many for which – why?) Then one can form a probability measure on by defining
for all .
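As a quick illustration of Example 20 (with hypothetical weights), one can realise a discrete probability measure as a function on subsets, summing the point masses:

```python
def discrete_measure(weights):
    """mu(E) = sum of the point masses p_x over x in E (as in Example 20)."""
    total = sum(weights.values())
    assert abs(total - 1.0) < 1e-12  # unit total probability
    return lambda E: sum(p for x, p in weights.items() if x in E)

# Hypothetical weights: a biased three-outcome experiment.
mu = discrete_measure({'a': 0.5, 'b': 0.3, 'c': 0.2})
print(mu({'a', 'b'}))
# Finite additivity on disjoint events:
assert abs(mu({'a', 'b', 'c'}) - (mu({'a'}) + mu({'b'}) + mu({'c'}))) < 1e-12
```

Countable additivity holds for the same reason, since an absolutely convergent series of non-negative terms can be summed in any order and in any grouping.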
Example 21 (Lebesgue measure) Let be given the Borel -algebra. Then it turns out there is a unique measure on , known as Lebesgue measure (or more precisely, the restriction of Lebesgue measure to the Borel -algebra) such that for every closed interval with (this is also true if one uses open intervals or half-open intervals in place of closed intervals). More generally, there is a unique measure on for any natural number , also known as Lebesgue measure, such that
for all closed boxes , that is to say products of closed intervals. The construction of Lebesgue measure is a little tricky; see this previous blog post for details.
We can then set up general probability theory similarly to how we set up discrete probability theory:
Definition 22 (Probability theory) In probability theory, we choose an ambient probability space as the randomness model, and refer to the set (without the additional structures , ) as the sample space for that model. We then model an event by elements of the -algebra . The probability of an event is defined to be the quantity
An event is surely true or is the sure event if , and is surely false or is the empty event if . It is almost surely true or an almost sure event if , and almost surely false or a null event if .
We model random variables taking values in the range by measurable functions from the sample space to the range . We define real, complex, and scalar random variables as in the discrete case.
As in the discrete case, we consider two events to be equal if they are modeled by the same set: . Similarly, two random variables taking values in a common range are considered to be equal if they are modeled by the same function: . Again, if the sample space is understood from context, we will usually abuse notation by identifying an event with its model , and similarly identify a random variable with its model .
As in the discrete case, set-theoretic operations on the sample space induce similar boolean operations on events. Furthermore, since the -algebra is closed under countable unions and countable intersections, we may similarly define the countable conjunction or countable disjunction of a sequence of events; however, we do not define uncountable conjunctions or disjunctions as these may not be well-defined as events.
The axioms of a probability space then yield the Kolmogorov axioms for probability:
We can manipulate random variables just as in the discrete case, with the only caveat being that we have to restrict attention to measurable operations. For instance, if is a random variable taking values in a measurable space , and is a measurable map, then is well defined as a random variable taking values in . Similarly, if is a measurable map and are random variables taking values in respectively, then is a random variable taking values in . Similarly we can create events out of measurable relations (giving the boolean range the discrete -algebra, of course). Finally, we continue to view deterministic elements of a space as a special case of a random element of , and associate the indicator random variable to any event as before.
We say that two random variables agree almost surely if the event is almost surely true; this is an equivalence relation. In many cases we are willing to consider random variables up to almost sure equivalence. In particular, we can generalise the notion of a random variable slightly by considering random variables whose models are only defined almost surely, i.e. their domain is not all of , but instead with a set of measure zero removed. This is, technically, not a random variable as we have defined it, but it can be associated canonically with an equivalence class of random variables up to almost sure equivalence, and so we view such objects as random variables “up to almost sure equivalence”. Similarly, we declare two events and almost surely equivalent if their symmetric difference is a null event, and will often consider events up to almost sure equivalence only.
We record some simple consequences of the measure-theoretic axioms:
Exercise 23 Let be a measure space.
- (Monotonicity) If are measurable, then .
- (Subadditivity) If are measurable (not necessarily disjoint), then .
- (Continuity from below) If are measurable, then .
- (Continuity from above) If are measurable and is finite, then . Give a counterexample to show that the claim can fail when is infinite.
Of course, these measure-theoretic facts immediately imply their probabilistic counterparts (and the pesky hypothesis that is finite is automatic and can thus be dropped):
Note that if a countable sequence of events each hold almost surely, then their conjunction does as well (by applying subadditivity to the complementary events). As a general rule of thumb, the notion of “almost surely” behaves like “surely” as long as one only performs an at most countable number of operations (which already suffices for a large portion of analysis, such as taking limits or performing infinite sums).
Exercise 24 Let be a measurable space.
- If is a function taking values in the extended reals , show that is measurable (giving the Borel -algebra) if and only if the sets are measurable for all real .
- If are functions, show that if and only if for all reals .
- If are measurable, show that , , , and are all measurable.
Remark 25 Occasionally, there is need to consider uncountable suprema or infima, e.g. . It is then no longer automatically the case that such an uncountable supremum or infimum of measurable functions is again measurable. However, in practice one can avoid this issue by carefully rewriting such uncountable suprema or infima in terms of countable ones. For instance, if it is known that depends continuously on for each , then , and so measurability is not an issue.
Using the above exercise, if one is given a sequence of random variables taking values in the extended real line , we can define the random variables , , , which also take values in the extended real line, and which obey relations such as
for any real number .
We now say that a sequence of random variables in the extended real line converges almost surely if one has
almost surely, in which case we can define the limit (up to almost sure equivalence) as
This corresponds closely to the concept of almost everywhere convergence in measure theory, a slightly weaker notion than pointwise convergence that allows for bad behaviour on a set of measure zero. (See this previous blog post for more discussion on different notions of convergence of measurable functions.)
We will defer the general construction of expectation of a random variable to the next set of notes, where we review the notion of integration on a measure space. For now, we quickly review the basic construction of continuous scalar random variables.
Exercise 26 Let be a probability measure on the real line (with the Borel -algebra). Define the Stieltjes measure function associated to by the formula
Establish the following properties of :
- (i) is non-decreasing.
- (ii) and .
- (iii) is right-continuous, thus for all .
There is a somewhat difficult converse to this exercise: if is a function obeying the above three properties, then there is a unique probability measure on (the Lebesgue-Stieltjes measure associated to ) for which is the Stieltjes measure function. See Section 3 of this previous post for details. As a consequence of this, we have
Corollary 27 (Construction of a single continuous random variable) Let be a function obeying the properties (i)-(iii) of the above exercise. Then, by using a suitable probability space model, we can construct a real random variable with the property that
for all .
Indeed, we can take the probability space to be with the Borel -algebra and the Lebesgue-Stieltjes measure associated to . This corollary is not fully satisfactory, because often we may already have chosen a probability space to model some other random variables, and the probability space provided by this corollary may be completely unrelated to the one used. We can resolve these issues with product measures and other joinings, but this will be deferred to a later set of notes.
Define the cumulative distribution function of a real random variable to be the function
Thus we see that cumulative distribution functions obey the properties (i)-(iii) above, and conversely any function with those properties is the cumulative distribution function of some real random variable. We say that two real random variables (possibly on different sample spaces) agree in distribution if they have the same cumulative distribution function. One can therefore define a real random variable, up to agreement in distribution, by specifying the cumulative distribution function. See Durrett for some standard real distributions (uniform, normal, geometric, etc.) that one can define in this fashion.
Exercise 28 Let be a real random variable with cumulative distribution function . For any real number , show that
and
In particular, one has for all if and only if is continuous.
Note in particular that this illustrates the distinction between almost sure and sure events: if has a continuous cumulative distribution function, and is a real number, then is almost surely false, but it does not have to be surely false. (Indeed, if one takes the sample space to be and to be the identity function, then will not be surely false.) On the other hand, the fact that is equal to some real number is of course surely true. The reason these statements are consistent with each other is that there are uncountably many real numbers . (Countable additivity tells us that a countable disjunction of null events is still null, but says nothing about uncountable disjunctions.)
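The identity in Exercise 28 — that the jump of the cumulative distribution function at a point recovers the probability of that point — can be checked numerically for a purely atomic distribution (the point masses below are a hypothetical example):

```python
def cdf(masses, x):
    """F(x) = P(X <= x) for a purely atomic distribution."""
    return sum(p for a, p in masses.items() if a <= x)

def left_limit(masses, x):
    """F(x^-) = P(X < x)."""
    return sum(p for a, p in masses.items() if a < x)

# Hypothetical atomic distribution with three point masses.
masses = {0.0: 0.25, 1.0: 0.5, 2.0: 0.25}
for x in [0.0, 0.5, 1.0]:
    jump = cdf(masses, x) - left_limit(masses, x)
    print(x, jump)  # the jump F(x) - F(x^-) equals P(X = x)
```

At the atoms the jump is the point mass (0.25 and 0.5 here), and at non-atoms such as 0.5 the jump vanishes, consistent with F being continuous exactly at the points carrying no mass.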
Exercise 29 (Skorokhod representation of scalar variables) Let be a uniform random variable taking values in (thus has cumulative distribution function ), and let be another cumulative distribution function. Show that the random variables
and
are indeed random variables (that is to say, they are measurable in any given model ), and have cumulative distribution function . (This construction is attributed to Skorokhod, but it should not be confused with the Skorokhod representation theorem. It provides a quick way to generate a single scalar variable, but unfortunately it is difficult to modify this construction to generate multiple scalar variables, especially if they are somehow coupled to each other.)
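The Skorokhod construction amounts to inverse-CDF sampling. Here is a minimal numerical sketch, computing the generalized inverse inf{x : F(x) >= u} by bisection and applying it to uniform draws; the exponential CDF is just an illustrative choice, not one taken from the text:

```python
import math
import random

def generalized_inverse(F, u, lo=-50.0, hi=50.0, iters=60):
    """inf { x : F(x) >= u }, computed by bisection for a nondecreasing F."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) >= u:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative CDF: the exponential distribution, F(x) = 1 - e^{-x} for x >= 0.
F = lambda x: 1 - math.exp(-x) if x >= 0 else 0.0
random.seed(0)
samples = [generalized_inverse(F, random.random()) for _ in range(20000)]
print(sum(samples) / len(samples))  # should be close to the exponential mean, 1
```

The bisection step is a stand-in for the sup/inf formulas in the exercise; for a continuous strictly increasing F the two generalized inverses coincide and are the ordinary inverse function.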
There is a multidimensional analogue of the above theory, which is almost identical, except that the monotonicity property has to be strengthened:
Exercise 30 Let be a probability measure on (with the Borel -algebra). Define the Stieltjes measure function associated to by the formula
Establish the following properties of :
- (i) is non-decreasing: whenever for all .
- (ii) and .
- (iii) is right-continuous, thus for all , where the superscript denotes that we restrict each to be greater than or equal to .
- (iv) One has
whenever are real numbers for . (Hint: try to express the measure of a box with respect to in terms of the Stieltjes measure function .)
Again, there is a difficult converse to this exercise: if is a function obeying the above four properties, then there is a unique probability measure on for which is the Stieltjes measure function. See Durrett for details; one can also modify the arguments in this previous post. In particular, we have
Corollary 31 (Construction of several continuous random variables) Let be a function obeying the properties (i)-(iv) of the above exercise. Then, by using a suitable probability space model, we can construct real random variables with the property that
for all .
Again, this corollary is not completely satisfactory because the probability space produced by it (which one can take to be with the Borel -algebra and the Lebesgue-Stieltjes measure on ) may not be the probability space one wants to use; we will return to this point later.
— 4. Variants of the standard foundations (optional) —
We have focused on the orthodox foundations of probability theory in which we model events and random variables through probability spaces. In this section, we briefly discuss some alternate ways to set up the foundations, as well as alternatives to probability theory itself. (Actually, many of the basic objects and concepts in mathematics have multiple such foundations; see for instance this blog post exploring the many different ways to define the notion of a group.) We mention them here in order to exclude them from discussion in subsequent notes, which will be focused almost exclusively on orthodox probability.
One approach to the foundations of probability is to view the event space as an abstract -algebra – a collection of abstract objects with operations such as and (and and ) that obey a number of axioms; see this previous post for a formal definition. The probability map can then be viewed as an abstract probability measure on , that is to say a map from to that obeys the Kolmogorov axioms. Random variables taking values in a measurable space can be identified with their pullback map , which is the morphism of (abstract) -algebras that sends a measurable set to the event in ; with some care one can then redefine all the operations in previous sections (e.g. applying a measurable map to a random variable taking values in to obtain a random variable taking values in ) in terms of this pullback map, allowing one to define random variables satisfactorily in this abstract setting. The probability space models discussed above can then be viewed as representations of abstract probability spaces by concrete ones. It turns out that (up to null events) any abstract probability space can be represented by a concrete one, a result known as the Loomis-Sikorski theorem; see this previous post for details.
Another, related, approach is to start not with the event space, but with the space of scalar random variables, and more specifically with the space of almost surely bounded scalar random variables (thus, there is a deterministic scalar such that almost surely). It turns out that this space has the structure of a commutative tracial (abstract) von Neumann algebra. Conversely, starting from a commutative tracial von Neumann algebra one can form an abstract probability space (using the idempotent elements of the algebra as the events), and thus represent this algebra (up to null events) by a concrete probability space. This particular choice of probabilistic foundations is particularly convenient when one wishes to generalise classical probability to noncommutative probability, as this is simply a matter of dropping the axiom that the von Neumann algebra is commutative. This leads in particular to the subjects of quantum probability and free probability, which are generalisations of classical probability that are beyond the scope of this course (but see this blog post for an introduction to the latter, and this previous post for an abstract algebraic description of a probability space).
It is also possible to model continuous probability via a nonstandard version of discrete probability (or even finite probability), which removes some of the technicalities of measure theory at the cost of replacing them with the formalism of nonstandard analysis instead. This approach was pioneered by Ed Nelson, but will not be discussed further here. (See also these previous posts on the Loeb measure construction, which is a closely related way to combine the power of measure theory with the conveniences of nonstandard analysis.)
One can generalise the traditional, countably additive, form of probability by replacing countable additivity with finite additivity, but then one loses much of the ability to take limits or infinite sums, which reduces the amount of analysis one can perform in this setting. Still, finite additivity is good enough for many applications, particularly in discrete mathematics. An even broader generalisation is that of qualitative probability, in which events that are neither almost surely true nor almost surely false are not assigned any specific numerical probability between or , but are simply assigned a symbol such as to indicate their indeterminate status; see this previous blog post for this generalisation, which can for instance be used to view the concept of a “generic point” in algebraic geometry or metric space topology in probabilistic terms.
There have been multiple attempts to move more radically beyond the paradigm of probability theory and its relatives as discussed above, in order to more accurately capture mathematically the concept of non-determinism. One family of approaches is based on replacing deterministic logic by some sort of probabilistic logic; another is based on allowing several parameters in one’s model to be unknown (as opposed to being probabilistic random variables), leading to the area of uncertainty quantification. These topics are well beyond the scope of this course.
In particular, if it is rare for to lie in , then it is also rare for to lie in .
If and do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that
for any . In particular, if it is rare for to lie in , and are close in total variation, then it is also rare for to lie in .
A basic inequality in information theory is Pinsker’s inequality
where the Kullback-Leibler divergence is defined by the formula
(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that is non-negative (Gibbs’ inequality), and vanishes if and only if , have the same distribution; thus one can think of as a measure of how close the distributions of and are to each other, although one should caution that this is not a symmetric notion of distance, as in general. Inserting Pinsker’s inequality into (1), we see for instance that
Thus, if is close to in the Kullback-Leibler sense, and it is rare for to lie in , then it is rare for to lie in as well.
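For discrete distributions both quantities are easy to compute, and Pinsker's inequality can be verified directly. A small sketch, using the normalisation d_TV <= sqrt(D/2) (conventions differ by constants across the literature); the distributions p, q below are hypothetical:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """d_TV(p, q) = sup_E |p(E) - q(E)| = (1/2) sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical pair of distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.4, 0.4]

# Pinsker's inequality in this normalisation: d_TV <= sqrt(D/2).
assert total_variation(p, q) <= math.sqrt(kl(p, q) / 2)
# Gibbs' inequality: D(p || q) >= 0, with equality iff p == q.
assert kl(p, q) >= 0 and kl(p, p) == 0
```

Note the asymmetry: kl(p, q) and kl(q, p) generally differ, whereas the total variation distance is symmetric in its arguments.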
We can specialise this inequality to the case when is a uniform random variable on a finite range of some cardinality , in which case the Kullback-Leibler divergence simplifies to
where
is the Shannon entropy of . Again, a routine application of Jensen’s inequality shows that , with equality if and only if is uniformly distributed on . The above inequality then becomes
Thus, if is a small fraction of (so that it is rare for to lie in ), and the entropy of is very close to the maximum possible value of , then it is rare for to lie in also.
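The simplification of the Kullback-Leibler divergence against a uniform variable, D(X || U) = log n - H(X), together with the Jensen bound H(X) <= log n, can be checked numerically (the distribution p below is a hypothetical example):

```python
import math

def entropy(p):
    """Shannon entropy H(p), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distribution on a four-point range.
p = [0.5, 0.25, 0.125, 0.125]
n = len(p)
uniform = [1.0 / n] * n

# D(X || U) = log n - H(X) when U is uniform on a range of size n.
assert abs(kl(p, uniform) - (math.log(n) - entropy(p))) < 1e-12
# Jensen: H(X) <= log n, with equality exactly for the uniform distribution.
assert entropy(p) <= math.log(n)
assert abs(entropy(uniform) - math.log(n)) < 1e-12
```

The identity follows by expanding each term pi * log(pi * n) = pi * log(pi) + pi * log(n) and summing.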
The inequality (2) is only useful when the entropy is close to in the sense that ; otherwise the bound is worse than the trivial bound of . In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy was allowed to be smaller than . More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:
Lemma 1 (Pinsker-type inequality) Let be a random variable taking values in a finite range of cardinality , let be a uniformly distributed random variable in , and let be a subset of . Then
Proof: Consider the conditional entropy . On the one hand, we have
by Jensen’s inequality. On the other hand, one has
where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim.
Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality
for arbitrary random variables taking values in the same discrete range , which follows from the data processing inequality
for arbitrary functions , applied to the indicator function . Indeed one has
where is the entropy function.
Thus, for instance, if one has
and
for some much larger than (so that ), then
More informally: if the entropy of is somewhat close to the maximum possible value of , and it is exponentially rare for a uniform variable to lie in , then it is still somewhat rare for to lie in . The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable which is uniformly distributed inside a small set with some probability and uniformly distributed outside of with probability , for some parameter .
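The near-sharpness example can be made quantitative: for a variable that is uniform inside a small set A with probability delta and uniform outside A with probability 1 - delta, the entropy is a closed-form expression, and the deficit from the maximum is easy to evaluate (the sizes and delta below are hypothetical choices):

```python
import math

def mixture_entropy(delta, size_A, size_R):
    """Entropy (nats) of X: uniform on A with prob. delta,
    uniform on the complement of A with prob. 1 - delta."""
    h = -delta * math.log(delta) - (1 - delta) * math.log(1 - delta)
    return h + delta * math.log(size_A) + (1 - delta) * math.log(size_R - size_A)

# Hypothetical parameters: |R| = 10^6, |A| = 10^3, delta = 0.05.
size_R, size_A, delta = 10**6, 10**3, 0.05
H = mixture_entropy(delta, size_A, size_R)
deficit = math.log(size_R) - H

# X lies in A with probability delta = 0.05, yet the entropy deficit is small.
print(deficit)  # roughly 0.15 nats, against log|R| of about 13.8
```

So a variable can place non-negligible mass on an exponentially small set while keeping its entropy within a small additive constant of the maximum, which is why a bound of the above type cannot be improved much in this regime.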
It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as
with exponentially high probability, where is a uniform distribution and is some reasonable function of . Combining this with the above lemma, we can then obtain approximations of the form
with somewhat high probability, if the entropy of is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form
for “most” choices of and a suitable choice of (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as through the multiplicativity of , while the right-hand side, being a linear correlation involving two parameters rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.
This pair of papers is an outgrowth of these two recent blog posts and the ensuing discussion. In the first paper, we establish the following logarithmically averaged version of the Chowla conjecture (in the case of two-point correlations (or “pair correlations”)):
Theorem 1 (Logarithmically averaged Chowla conjecture) Let be natural numbers, and let be integers such that . Let be a quantity depending on that goes to infinity as . Let denote the Liouville function. Then one has
For comparison, the non-averaged Chowla conjecture would imply that
which is a strictly stronger estimate than (2), and remains open.
The arguments also extend to other completely multiplicative functions than the Liouville function. In particular, one obtains a slightly averaged version of the non-asymptotic Elliott conjecture that was shown in the previous blog post to imply a positive solution to the Erdos discrepancy problem. The averaged version of the conjecture established in this paper is slightly weaker than the one assumed in the previous blog post, but it turns out that the arguments there can be modified without much difficulty to accept this averaged Elliott conjecture as input. In particular, we obtain an unconditional solution to the Erdos discrepancy problem as a consequence; this is detailed in the second paper listed above. In fact we can also handle the vector-valued version of the Erdos discrepancy problem, in which the sequence takes values in the unit sphere of an arbitrary Hilbert space, rather than in .
Estimates such as (2) or (3) are known to be subject to the “parity problem” (discussed numerous times previously on this blog), which roughly speaking means that they cannot be proven solely using “linear” estimates on functions such as the von Mangoldt function. However, it is known that the parity problem can be circumvented using “bilinear” estimates, and this is basically what is done here.
We now describe in informal terms the proof of Theorem 1, focusing on the model case (2) for simplicity. Suppose for contradiction that the left-hand side of (2) was large and (say) positive. Using the multiplicativity , we conclude that
is also large and positive for all primes that are not too large; note here how the logarithmic averaging allows us to leave the constraint unchanged. Summing in , we conclude that
is large and positive for any given set of medium-sized primes. By a standard averaging argument, this implies that
is large for many choices of , where is a medium-sized parameter at our disposal to choose, and we take to be some set of primes that are somewhat smaller than . (A similar approach was taken in this recent paper of Matomaki, Radziwill, and myself to study sign patterns of the Möbius function.) To obtain the required contradiction, one thus wants to demonstrate significant cancellation in the expression (4). As in that paper, we view as a random variable, in which case (4) is essentially a bilinear sum of the random sequence along a random graph on , in which two vertices are connected if they differ by a prime in that divides . A key difficulty in controlling this sum is that for randomly chosen , the sequence and the graph need not be independent. To get around this obstacle we introduce a new argument which we call the “entropy decrement argument” (in analogy with the “density increment argument” and “energy increment argument” that appear in the literature surrounding Szemerédi’s theorem on arithmetic progressions, and also reminiscent of the “entropy compression argument” of Moser and Tardos, discussed in this previous post). This argument, which is a simple consequence of the Shannon entropy inequalities, can be viewed as a quantitative version of the standard subadditivity argument that establishes the existence of Kolmogorov-Sinai entropy in topological dynamical systems; it allows one to select a scale parameter (in some suitable range ) for which the sequence and the graph exhibit some weak independence properties (or more precisely, the mutual information between the two random variables is small).
Informally, the entropy decrement argument goes like this: if the sequence has significant mutual information with , then the entropy of the sequence for will grow a little slower than linearly, due to the fact that the graph has zero entropy (knowledge of more or less completely determines the shifts of the graph); this can be formalised using the classical Shannon inequalities for entropy (and specifically, the non-negativity of conditional mutual information). But the entropy cannot drop below zero, so by increasing as necessary, at some point one must reach a metastable region (cf. the finite convergence principle discussed in this previous blog post), within which very little mutual information can be shared between the sequence and the graph . Curiously, for the application it is not enough to have a purely qualitative version of this argument; one needs a quantitative bound (which gains a factor of a bit more than on the trivial bound for mutual information), and this is surprisingly delicate (it ultimately comes down to the fact that the series diverges, which is only barely true).
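For readers unfamiliar with the entropy inequalities invoked here, the basic fact being used (non-negativity of mutual information, which vanishes exactly at independence) is easy to verify numerically. The following toy computation, with an arbitrarily chosen pair of nearly independent observables, is only meant to illustrate the definitions, not the argument itself:

```python
# Toy illustration of the Shannon inequality underlying the entropy
# decrement argument: for any joint distribution,
# I(X;Y) = H(X) + H(Y) - H(X,Y) >= 0, with equality iff X, Y independent.
# The observables below (parity of n, and the parity of the number of
# prime factors of n) are an arbitrary nearly-independent toy pair.
import math
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def omega_parity(n):
    # parity of the number of prime factors of n, counted with multiplicity
    count, d = 0, 2
    while d * d <= n:
        while n % d == 0:
            n //= d
            count += 1
        d += 1
    if n > 1:
        count += 1
    return count % 2

pairs = [(n % 2, omega_parity(n)) for n in range(1, 10**4 + 1)]
hx = entropy(Counter(x for x, _ in pairs))
hy = entropy(Counter(y for _, y in pairs))
hxy = entropy(Counter(pairs))
mutual_info = hx + hy - hxy
print(mutual_info)  # non-negative, and tiny here (near-independence)
```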
Once one locates a scale with the low mutual information property, one can use standard concentration of measure results such as the Hoeffding inequality to approximate (4) by the significantly simpler expression
The important thing here is that Hoeffding’s inequality gives exponentially strong bounds on the failure probability, which are needed to counteract the logarithms that are inevitably present whenever trying to use entropy inequalities. The expression (5) can then be controlled in turn by an application of the Hardy-Littlewood circle method and a non-trivial estimate
for averaged short sums of a modulated Liouville function established in another recent paper by Matomäki, Radziwiłł and myself.
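The exponential strength of Hoeffding's concentration bound can be seen in a small simulation; the parameters below are arbitrary illustrative choices:

```python
# Illustration of the exponentially strong failure probability in
# Hoeffding's inequality: for independent mean-zero +-1 steps,
# P(|S_n / n| >= t) <= 2 exp(-n t^2 / 2).  Parameters are illustrative.
import math
import random

random.seed(0)
n, t, trials = 2000, 0.1, 2000
failures = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))
    if abs(s / n) >= t:
        failures += 1

empirical = failures / trials
bound = 2 * math.exp(-n * t * t / 2)
print(empirical, bound)  # both are very small; the bound is ~1e-4
```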
When one uses this method to study more general sums such as
one ends up having to consider expressions such as
where is the coefficient . When attacking this sum with the circle method, one soon finds oneself in the situation of wanting to locate the large Fourier coefficients of the exponential sum
In many cases (such as in the application to the Erdős discrepancy problem), the coefficient is identically , and one can understand this sum satisfactorily using the classical results of Vinogradov: basically, is large when lies in a “major arc” and is small when it lies in a “minor arc”. For more general functions , the coefficients are more or less arbitrary; the large values of are no longer confined to the major arc case. Fortunately, even in this general situation one can use a restriction theorem for the primes established some time ago by Ben Green and myself to show that there are still only a bounded number of possible locations (up to the uncertainty mandated by the Heisenberg uncertainty principle) where is large, and we can still conclude by using (6). (Actually, as recently pointed out to me by Ben, one does not need the full strength of our result; one only needs the restriction theorem for the primes, which can be proven fairly directly using Plancherel’s theorem and some sieve theory.)
It is tempting to also use the method to attack higher order cases of the (logarithmically) averaged Chowla conjecture, for instance one could try to prove the estimate
The above arguments reduce matters to obtaining some non-trivial cancellation for sums of the form
A little bit of “higher order Fourier analysis” (as was done for very similar sums in the ergodic theory context by Frantzikinakis-Host-Kra and Wooley-Ziegler) lets one control this sort of sum if one can establish a bound of the form
where goes to infinity and is a very slowly growing function of . This looks very similar to (6), but the fact that the supremum is now inside the integral makes the problem much more difficult. However it looks worth attacking (7) further, as this estimate looks like it should have many nice applications (beyond just the case of the logarithmically averaged Chowla or Elliott conjectures, which is already interesting).
For higher than , the same line of analysis requires one to replace the linear phase by more complicated phases, such as quadratic phases or even -step nilsequences. Given that (7) is already beyond the reach of current literature, these even more complicated expressions are also unavailable at present, but one can imagine that they will eventually become tractable, in which case we would obtain an averaged form of the Chowla conjecture for all , which would have a number of consequences (such as a logarithmically averaged version of Sarnak’s conjecture, as per this blog post).
It would of course be very nice to remove the logarithmic averaging, and be able to establish bounds such as (3). I did attempt to do so, but I do not see a way to use the entropy decrement argument in a manner that does not require some sort of averaging of logarithmic type, as it requires one to pick a scale that one cannot specify in advance, which is not a problem for logarithmic averages (which are quite stable with respect to dilations) but is problematic for ordinary averages. But perhaps the problem can be circumvented by some clever modification of the argument. One possible approach would be to start exploiting multiplicativity at products of primes, and not just individual primes, to try to keep the scale fixed, but this makes the concentration of measure part of the argument much more complicated as one loses some independence properties (coming from the Chinese remainder theorem) which allowed one to conclude just from the Hoeffding inequality.
as for any fixed natural number . This conjecture remains open, though there are a number of partial results (e.g. these two previous results of Matomaki, Radziwill, and myself).
A natural generalisation of Chowla’s conjecture was proposed by Elliott. For simplicity we will only consider Elliott’s conjecture for the pair correlations
For such correlations, the conjecture was that one had
as for any natural number , as long as was a completely multiplicative function with magnitude bounded by , and such that
for any Dirichlet character and any real number . In the language of “pretentious number theory”, as developed by Granville and Soundararajan, the hypothesis (2) asserts that the completely multiplicative function does not “pretend” to be like the completely multiplicative function for any character and real number . A condition of this form is necessary; for instance, if is precisely equal to and has period , then is equal to as and (1) clearly fails. The prime number theorem in arithmetic progressions implies that the Liouville function obeys (2), and so the Elliott conjecture contains the Chowla conjecture as a special case.
As it turns out, Elliott’s conjecture is false as stated, with the counterexample having the property that “pretends” locally to be the function for in various intervals , where and go to infinity in a certain prescribed sense. See this paper of Matomaki, Radziwill, and myself for details. However, we view this as a technicality, and continue to believe that certain “repaired” versions of Elliott’s conjecture still hold. For instance, our counterexample does not apply when is restricted to be real-valued rather than complex, and we believe that Elliott’s conjecture is valid in this setting. Returning to the complex-valued case, we still expect the asymptotic (1) provided that the condition (2) is replaced by the stronger condition
as for all fixed Dirichlet characters . In our paper we supported this claim by establishing a certain “averaged” version of this conjecture; see that paper for further details. (See also this recent paper of Frantzikinakis and Host which establishes a different averaged version of this conjecture.)
One can make a stronger “non-asymptotic” version of this corrected Elliott conjecture, in which the parameter does not go to infinity, or equivalently that the function is permitted to depend on :
Conjecture 1 (Non-asymptotic Elliott conjecture) Let , let be sufficiently large depending on , and let be sufficiently large depending on . Suppose that is a completely multiplicative function with magnitude bounded by , such that
for all Dirichlet characters of period at most . Then one has
for all natural numbers .
The -dependent factor in the constraint is necessary, as can be seen by considering the completely multiplicative function (for instance). Again, the results in my previous paper with Matomaki and Radziwill can be viewed as establishing an averaged version of this conjecture.
Meanwhile, we have the following conjecture that is the focus of the Polymath5 project:
Conjecture 2 (Erdős discrepancy conjecture) For any function , the discrepancy
is infinite.
It is instructive to compute some near-counterexamples to Conjecture 2 that illustrate the difficulty of the Erdős discrepancy problem. The first near-counterexample is that of a non-principal Dirichlet character that takes values in rather than . For this function, one has from the complete multiplicativity of that
If denotes the period of , then has mean zero on every interval of length , and thus
Thus has bounded discrepancy.
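One can verify this bounded-discrepancy phenomenon numerically. The sketch below uses the real non-principal character of period 3; the ranges searched are illustrative choices, and the bound does not grow as they are enlarged:

```python
# Sanity check that a real non-principal Dirichlet character has bounded
# discrepancy: here chi is the character of period 3 (chi(n) = +1 for
# n = 1 mod 3, -1 for n = 2 mod 3, and 0 for n = 0 mod 3).  The sums
# chi(d) + chi(2d) + ... + chi(nd) stay bounded, since
# chi(jd) = chi(j) chi(d) and chi has mean zero on every block of 3
# consecutive integers.  The search ranges below are illustrative.
def chi(n):
    return (0, 1, -1)[n % 3]

disc = 0
for d in range(1, 50):
    s = 0
    for j in range(1, 1000):
        s += chi(j * d)
        disc = max(disc, abs(s))
print(disc)  # stays at 1, independent of the ranges
```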
Of course, this is not a true counterexample to Conjecture 2 because can take the value . Let us now consider the following variant example, which is the simplest member of a family of examples studied by Borwein, Choi, and Coons. Let be the non-principal Dirichlet character of period (thus equals when , when , and when ), and define the completely multiplicative function by setting when and . This is about the simplest modification one can make to the previous near-counterexample to eliminate the zeroes. Now consider the sum
with for some large . Writing with coprime to and at most , we can write this sum as
Now observe that . The function has mean zero on every interval of length three, and is equal to mod , and thus
for every , and thus
Thus also has unbounded discrepancy, but only barely so (it grows logarithmically in ). These examples suggest that the main “enemy” to proving Conjecture 2 comes from completely multiplicative functions that somehow “pretend” to be like a Dirichlet character but do not vanish at the zeroes of that character. (Indeed, the special case of Conjecture 2 when is completely multiplicative is already open, and appears to be an important subcase.)
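The logarithmic growth in the Borwein-Choi-Coons example can be checked directly. In the sketch below, the checkpoints n = 1 + 3 + ... + 3^(k-1) are my own illustrative choice of test points; at these points the partial sum of the modified function works out to be exactly k:

```python
# Numerical check of the Borwein-Choi-Coons near-counterexample: the
# completely multiplicative f with f(3) = +1 and f(p) = chi_3(p) for
# p != 3 has partial sums growing only logarithmically.  The checkpoints
# n = (3^k - 1)/2 = 1 + 3 + ... + 3^(k-1) are an illustrative choice of
# test points; at each, the partial sum equals k.
def f(n):
    # strip out all factors of 3, then apply the character mod 3
    while n % 3 == 0:
        n //= 3
    return 1 if n % 3 == 1 else -1

K = 7
partial = 0
sums = {}
checkpoints = {(3**k - 1) // 2: k for k in range(1, K + 1)}
for n in range(1, (3**K - 1) // 2 + 1):
    partial += f(n)
    if n in checkpoints:
        sums[checkpoints[n]] = partial
print(sums)  # {1: 1, 2: 2, ..., 7: 7}: logarithmic growth in n
```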
All of these conjectures remain open. However, I would like to record in this blog post the following striking connection, illustrating the power of the Elliott conjecture (particularly in its non-asymptotic formulation):
Theorem 3 (Elliott conjecture implies unbounded discrepancy) Conjecture 1 implies Conjecture 2.
The argument relies heavily on two observations that were previously made in connection with the Polymath5 project. The first is a Fourier-analytic reduction that replaces the Erdős discrepancy problem with an averaged version for completely multiplicative functions . An application of Cauchy-Schwarz then shows that any counterexample to that version will violate the conclusion of Conjecture 1, so if one assumes that conjecture then must pretend to be like a function of the form . One then uses (a generalisation of) a second argument from Polymath5 to rule out this case, basically by reducing matters to a more complicated version of the Borwein-Choi-Coons analysis. Details are provided below the fold.
There is some hope that the Chowla and Elliott conjectures can be attacked, as the parity barrier which is so impervious to attack for the twin prime conjecture seems to be more permeable in this setting. (For instance, in my previous post I raised a possible approach, based on establishing expander properties of a certain random graph, which seems to get around the parity problem, in principle at least.)
(Update, Sep 25: fixed some treatment of error terms, following a suggestion of Andrew Granville.)
— 1. Fourier reduction —
We will prove Theorem 3 by contradiction, assuming that there is a function with bounded discrepancy and then concluding a violation of the Elliott conjecture.
The function need not have any multiplicativity properties, but by using an argument from Polymath5 we can extract a random completely multiplicative function which also has good discrepancy properties (albeit in a probabilistic sense only):
Proposition 4 (Fourier reduction) Suppose that is a function such that
Then there exists a random completely multiplicative function of magnitude such that
uniformly for all natural numbers (we allow implied constants to depend on ).
Proof: For the reader’s convenience, we reproduce the Polymath5 argument.
The space of completely multiplicative functions of magnitude can be identified with the infinite product since is determined by its values at the primes. In particular, this space is compact metrisable in the product topology. The functions are continuous in this topology for all . By vague compactness of probability measures on compact metrisable spaces (Prokhorov’s theorem), it thus suffices to construct, for each , a random completely multiplicative function of magnitude such that
for all , where the implied constant is uniform in and .
for all (the implied constant can depend on but is otherwise absolute). Let , and let be the primes up to . Let be a natural number that we assume to be sufficiently large depending on . Define a function by the formula
for . We also define the function by setting whenever is in (this is well defined for ). Applying (4) for and of the form with , we conclude that
for all and all but of the elements of . For the exceptional elements, we have the trivial bound
Square-summing in , we conclude (if is sufficiently large depending on ) that
By Fourier expansion, we can write
where , , and
A little Fourier-analytic calculation then allows us to write the left-hand side of (5) as
On the other hand, from the Plancherel identity we have
and so we can interpret as the probability distribution of a random frequency . The estimate (5) now takes the form
for all . If we then define the completely multiplicative function by setting for , and for all other primes, we obtain
for all , as desired.
Remark 5 A similar reduction applies if the original function took values in the unit sphere of a complex Hilbert space, rather than in . Conversely, the random constructed above can be viewed as an element of a unit sphere of a suitable Hilbert space, so the conclusion of Proposition 4 is in fact logically equivalent to the failure of the Hilbert space version of the Erdős discrepancy conjecture.
Remark 6 From linearity of expectation, we see from Proposition 4 that for any natural number , we have
and hence for each we conclude that there exists a deterministic completely multiplicative function of unit magnitude such that
This was the original formulation of the Fourier reduction in Polymath5; however, the fact that varied with made this formulation inconvenient for our argument.
— 2. Applying the Elliott conjecture —
Suppose for contradiction that Conjecture 1 holds but that there exists a function of bounded discrepancy in the sense of (3). By Proposition 4, we may thus find a random completely multiplicative function of magnitude such that
We now use Elliott’s conjecture as a sort of “inverse theorem” (in the spirit of the inverse sumset theorem of Freiman, and the inverse theorems for the Gowers uniformity norms) to force to pretend to behave like a modulated character quite often.
Proposition 7 Let the hypotheses and notation be as above. Let , and suppose that is sufficiently large depending on . Then with probability , one can find a Dirichlet character of period and a real number such that
Proof: We use the van der Corput trick. Let be a moderately large natural number depending on to be chosen later, and suppose that is sufficiently large depending on . From (6) and the triangle inequality we have
so from Markov’s inequality we see with probability that
Let us condition to this event. Shifting by we conclude (for large enough) that
and hence by the triangle inequality
which we rewrite as
We can square the left-hand side out as
The diagonal term contributes to this expression. Thus, for sufficiently large depending on , we can apply the triangle inequality and pigeonhole principle to find distinct such that
By symmetry we can take . Setting , we conclude (for large enough) that
Applying Conjecture 1 in the contrapositive, we obtain the claim.
The conclusion (8) asserts that in some sense, “pretends” to be like the function ; as it has magnitude one, it should resemble the function discussed in the introduction. The remaining task is to adapt the argument that showed that had (logarithmically) large discrepancy, in order to show that likewise fails to obey (6).
— 3. Ruling out correlation with modulated characters —
We now use (a generalisation of) this Polymath5 argument. Let be the random completely multiplicative function provided by Proposition 4. We will need the following parameters:
By Proposition 7, we see with probability that there exists a Dirichlet character of period and a real number such that
By reducing if necessary we may assume that is primitive.
It will be convenient to cut down the size of .
Proof: By Proposition 7 with replaced by , we see that with probability , one can find a Dirichlet character of period and a real number such that
We may assume that , since we are done otherwise. Applying the pretentious triangle inequality (see Lemma 3.1 of this paper of Granville and Soundararajan), we conclude that
However, from the Vinogradov-Korobov zero-free region for (see this previous blog post) it is not difficult to show that
if is sufficiently large depending on , a contradiction. The claim follows.
Let us now condition to the probability event that , exist obeying (8) and the bound (9).
The bound (8) asserts that “pretends” to be like the completely multiplicative function . We can formalise this by making the factorisation
where is the completely multiplicative function of magnitude defined by setting for and for , and is the completely multiplicative function of magnitude defined by setting for , and for . The function should be compared with the function of the same name studied in the introduction.
The bound (8) then becomes
We now perform some manipulations to remove the and factors from and isolate the factor, which is more tractable to compute with; then we will perform more computations to arrive at an expression just involving which we will be able to evaluate fairly easily.
From (6) and the triangle inequality we have
for all (even after conditioning to the event). The averaging will not be used until much later in the argument, and the reader may wish to ignore it for now.
By (10), the above estimate can be written as
For we can use (9) to conclude that . The contribution of the error term is negligible, thus
for all . We can factor out the to obtain
For we can crudely bound the left-hand side by . If is sufficiently small, we can then sum weighted by and conclude that
(The zeta function type weight of will be convenient later in the argument when one has to perform some multiplicative number theory, as the relevant sums can be computed quite directly and easily using Euler products.) Thus, with probability , one has
We condition to this event. We have successfully eliminated the role of ; we now work to eliminate . Call a residue class bad if is divisible by for some and , and good otherwise. We restrict to good residue classes, thus
By Cauchy-Schwarz, we conclude that
Now we claim that for a in a good residue class , the quantity does not depend on . Indeed, by hypothesis, is not divisible by for any and is thus a factor of , and is coprime to . We then factor
where in the last line we use the periodicity of . Thus we have , and so
Shifting by we see that
Now, we perform some multiplicative number theory to understand the innermost sum. From taking Euler products we have
for ; from (11) and Mertens’ theorem one can easily verify that
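Mertens' second theorem, which drives the estimates in this step, is easy to check numerically; the cutoff below is an illustrative choice:

```python
# Numerical check of Mertens' second theorem, used in the multiplicative
# number theory step: sum_{p <= x} 1/p = log log x + M + o(1), where
# M ~= 0.2615 is the Meissel-Mertens constant.  x is an illustrative cutoff.
import math

x = 10**5
sieve = bytearray([1]) * (x + 1)
sieve[0:2] = b"\x00\x00"
for p in range(2, int(x**0.5) + 1):
    if sieve[p]:
        sieve[p * p :: p] = bytearray(len(sieve[p * p :: p]))

prime_recip_sum = sum(1 / p for p in range(2, x + 1) if sieve[p])
approx = math.log(math.log(x)) + 0.26149721
print(prime_recip_sum, approx)  # the two agree closely at this height
```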
More generally, for any Dirichlet character we have
Since
we have
which, after using (11), Cauchy-Schwarz, and Mertens’ theorem, gives
for any non-principal character of period dividing ; for a principal character of period dividing we have
since and hence for all , where we have also used (13). By expansion into Dirichlet characters we conclude that
for all and primitive residue classes . For non-primitive residue classes , we write and . The previous arguments then give
which since gives (again using (13))
for all (not necessarily primitive). Inserting this back into (12) we see that
and thus by (13) we conclude (for large enough) that
We have now eliminated both and . The remaining task is to establish some lower bound on the discrepancy of that will contradict (14). As mentioned above, this will be a more complicated variant of the Borwein-Choi-Coons analysis in the introduction. The square in (14) will be helpful in dealing with the fact that we don’t have good control on the for (as we shall see, the squaring introduces two terms of this type that end up cancelling each other).
We expand (14) to obtain
Write and , thus and for we have
We thus have
We reinstate the bad . The number of such is at most , so their total contribution here is which is negligible, thus
For or , the inner sum is , which by the divisor bound gives a negligible contribution. Thus we may restrict to . Note that as is already restricted to numbers coprime to , and divide , we may replace the constraints with for .
We consider the contribution of an off-diagonal term for a fixed choice of . To handle these terms we expand the non-principal character as a linear combination of for with Fourier coefficients . Thus we can expand out
as a linear combination of expressions of the form
with and coefficients of size .
The constraints are either inconsistent, or constrain to a single residue class . Writing , we have
for some phase that can depend on but is independent of . If , then at least one of the two quantities and is divisible by a prime that does not divide the other quantity. Therefore cannot be divisible by , and thus by . We can then sum the geometric series in (or ) and conclude that
and so by the divisor bound the off-diagonal terms contribute at most to (15). For large, this is negligible, and thus we only need to consider the diagonal contribution . Here the terms helpfully cancel, and we obtain
We have now eliminated , leaving only the Dirichlet character which is much easier to work with. We gather terms and write the left-hand side as
The summand in is now non-negative. We can thus throw away all the except of the form with , to conclude that
It is now that we finally take advantage of the averaging to simplify the summation. Observe from the triangle inequality that for any and one has
summing over we conclude that
In particular, by the pigeonhole principle there exists such that
Shifting by and discarding some terms, we conclude that
Observe that for a fixed there is exactly one in the inner sum, and . Thus we have
Making the change of variables , we thus have
But is periodic of period with mean , thus
and thus
which leads to a contradiction for large enough (note the logarithmic growth in here, matching the logarithmic growth in the Borwein-Choi-Coons analysis). The claim follows.
Theorem 1 Let . Then each of the sign patterns in is attained by the Liouville function for a set of natural numbers of positive lower density.
Thus for instance one has for a set of of positive lower density. The case of this theorem already appears in the original paper of Matomäki and Radziwiłł (and the significantly simpler case of the sign patterns and was treated previously by Harman, Pintz, and Wolke).
The basic strategy in all of these arguments is to assume for sake of contradiction that a certain sign pattern occurs extremely rarely, and then exploit the complete multiplicativity of (which implies in particular that , , and for all ) together with some combinatorial arguments (vaguely analogous to solving a Sudoku puzzle!) to establish more complex sign patterns for the Liouville function, that are either inconsistent with each other, or with results such as the Matomäki-Radziwiłł result. To illustrate this, let us give some examples, arguing a little informally to emphasise the combinatorial aspects of the argument. First suppose that the sign pattern almost never occurs. The prime number theorem tells us that and are each equal to about half of the time, which by inclusion-exclusion implies that the sign pattern almost never occurs. In other words, we have for almost all . But from the multiplicativity property this implies that one should have
and
for almost all . But the above three statements are contradictory, and the claim follows.
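The multiplicativity identities used in this argument, and the conclusion that the sign pattern in question does in fact occur often, can be checked numerically; the cutoff is an illustrative choice:

```python
# The combinatorial arguments lean on complete multiplicativity of the
# Liouville function, e.g. lam(2n) = -lam(n) and lam(3n) = -lam(n); here
# is a direct numerical check, together with a count showing that the
# sign pattern (+1, +1) does in fact occur frequently (the assumption
# refuted in the text).  The cutoff N is illustrative.
N = 3 * 10**4

# lam(n) = (-1)^Omega(n) via a smallest-prime-factor sieve
lam = [1] * (N + 1)
spf = [0] * (N + 1)
for p in range(2, N + 1):
    if spf[p] == 0:
        for m in range(p, N + 1, p):
            if spf[m] == 0:
                spf[m] = p
for n in range(2, N + 1):
    lam[n] = -lam[n // spf[n]]

assert all(lam[2 * n] == -lam[n] for n in range(1, N // 2 + 1))
assert all(lam[3 * n] == -lam[n] for n in range(1, N // 3 + 1))

plus_plus = sum(1 for n in range(1, N) if lam[n] == 1 and lam[n + 1] == 1)
print(plus_plus / N)  # roughly 1/4: the pattern (+1, +1) is common
```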
Similarly, if we assume that the sign pattern almost never occurs, then a similar argument to the above shows that for any fixed , one has for almost all . But this means that the mean is abnormally large for most , which (for large enough) contradicts the results of Matomäki and Radziwiłł. Here we see that the “enemy” to defeat is the scenario in which only changes sign very rarely, in which case one rarely sees the pattern .
It turns out that similar (but more combinatorially intricate) arguments work for sign patterns of length three (but are unlikely to work for most sign patterns of length four or greater). We give here one fragment of such an argument (due to Hildebrand) which hopefully conveys the Sudoku-type flavour of the combinatorics. Suppose for instance that the sign pattern almost never occurs. Now suppose is a typical number with . Since we almost never have the sign pattern , we must (almost always) then have . By multiplicativity this implies that
We claim that this (almost always) forces . For if , then by the lack of the sign pattern , this (almost always) forces , which by multiplicativity forces , which by lack of (almost always) forces , which by multiplicativity contradicts . Thus we have ; a similar argument gives almost always, which by multiplicativity gives , a contradiction. Thus we almost never have , which by the inclusion-exclusion argument mentioned previously shows that for almost all .
One can continue these Sudoku-type arguments and conclude eventually that for almost all . To put it another way, if denotes the non-principal Dirichlet character of modulus , then is almost always constant away from the multiples of . (Conversely, if changed sign very rarely outside of the multiples of three, then the sign pattern would never occur.) Fortunately, the main result of Matomäki and Radziwiłł shows that this scenario cannot occur, which establishes that the sign pattern must occur rather frequently. The other sign patterns are handled by variants of these arguments.
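As an empirical complement to the length-three case of Theorem 1 (numerics only, of course, not a proof), one can count how often each of the eight length-three sign patterns of the Liouville function occurs up to an illustrative cutoff:

```python
# Empirical check of the k = 3 case of Theorem 1: each of the 8 sign
# patterns (lam(n), lam(n+1), lam(n+2)) occurs for a positive proportion
# of n.  The cutoff N is an illustrative choice.
from collections import Counter

N = 10**5
lam = [1] * (N + 1)
spf = [0] * (N + 1)
for p in range(2, N + 1):
    if spf[p] == 0:
        for m in range(p, N + 1, p):
            if spf[m] == 0:
                spf[m] = p
for n in range(2, N + 1):
    lam[n] = -lam[n // spf[n]]

patterns = Counter((lam[n], lam[n + 1], lam[n + 2]) for n in range(1, N - 1))
print(len(patterns), min(patterns.values()) / N)
# all 8 patterns appear, each with frequency bounded away from zero
```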
Excluding a sign pattern of length three leads to useful implications like “if , then ” which turn out to be just barely strong enough to quite rigidly constrain the Liouville function using Sudoku-like arguments. In contrast, excluding a sign pattern of length four only gives rise to implications like “if , then ”, and these seem to be much weaker for this purpose (the hypothesis in these implications just isn’t satisfied nearly often enough). So a different idea seems to be needed if one wishes to extend the above theorem to larger values of .
Our second theorem gives an analogous result for the Möbius function (which takes values in rather than ), but the analysis turns out to be remarkably difficult and we are only able to get up to :
Theorem 2 Let . Then each of the sign patterns in is attained by the Möbius function for a set of positive lower density.
It turns out that the prime number theorem and elementary sieve theory can be used to handle the case and all the cases that involve at least one , leaving only the four sign patterns to handle. It is here that the zeroes of the Möbius function cause a significant new obstacle. Suppose for instance that the sign pattern almost never occurs for the Möbius function. The same arguments that were used in the Liouville case then show that will be almost always equal to , provided that are both square-free. One can try to chain this together as before to create a long string where the Möbius function is constant, but this cannot work for any larger than three, because the Möbius function vanishes at every multiple of four.
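Similarly, one can check numerically that the four sign patterns of the Möbius function with both entries nonzero all occur with positive frequency (again an illustration, not a proof; the cutoff is arbitrary):

```python
# Empirical check for the Mobius function: each of the four sign patterns
# (mu(n), mu(n+1)) with both entries in {-1, +1} occurs for a positive
# proportion of n (the harder cases of Theorem 2; the patterns involving
# a zero are handled by sieve theory).  N is an illustrative cutoff.
from collections import Counter

N = 10**5
mu = [1] * (N + 1)
mu[0] = 0
sieve = bytearray([1]) * (N + 1)
for p in range(2, N + 1):
    if sieve[p]:  # p is prime
        for m in range(2 * p, N + 1, p):
            sieve[m] = 0
        for m in range(p, N + 1, p):
            mu[m] = -mu[m]
        for m in range(p * p, N + 1, p * p):
            mu[m] = 0  # not squarefree

patterns = Counter(
    (mu[n], mu[n + 1]) for n in range(1, N) if mu[n] != 0 and mu[n + 1] != 0
)
print(min(patterns.values()) / N)  # each of the 4 patterns is common
```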
The constraints we assume on the Möbius function can be depicted using a graph on the squarefree natural numbers, in which any two adjacent squarefree natural numbers are connected by an edge. The main difficulty is then that this graph is highly disconnected due to the multiples of four not being squarefree.
To get around this, we need to enlarge the graph. Note from multiplicativity that if is almost always equal to when are squarefree, then is almost always equal to when are squarefree and is divisible by . We can then form a graph on the squarefree natural numbers by connecting to whenever are squarefree and is divisible by . If this graph is “locally connected” in some sense, then will be constant on almost all of the squarefree numbers in a large interval, which turns out to be incompatible with the results of Matomäki and Radziwiłł. Because of this, matters are reduced to establishing the connectedness of a certain graph. More precisely, it turns out to be sufficient to establish the following claim:
Theorem 3 For each prime , let be a residue class chosen uniformly at random. Let be the random graph whose vertices consist of those integers not equal to for any , and whose edges consist of pairs in with . Then with probability , the graph is connected.
We were able to show the connectedness of this graph, though it turned out to be remarkably tricky to do so. Roughly speaking (and suppressing a number of technicalities), the main steps in the argument were as follows.
It seems of interest to understand random graphs like further. In particular, the graph on the integers formed by connecting to for all in a randomly selected residue class mod for each prime is particularly interesting (it is to the Liouville function as is to the Möbius function); if one could show some “local expander” properties of this graph , then one would have a chance of modifying the above methods to attack the first unsolved case of the Chowla conjecture, namely that has asymptotic density zero (perhaps working with logarithmic density instead of natural density to avoid some technicalities).
However, there is an intriguing “alternate universe” in which the Möbius function is strongly correlated with some structured functions, and specifically with some Dirichlet characters, leading to the existence of the infamous “Siegel zero”. In this scenario, the parity problem obstruction disappears, and it becomes possible, in principle, to attack problems such as the twin prime conjecture. In particular, we have the following result of Heath-Brown:
Theorem 1 At least one of the following two statements is true:
- (Twin prime conjecture) There are infinitely many primes such that is also prime.
- (No Siegel zeroes) There exists a constant such that for every real Dirichlet character of conductor , the associated Dirichlet -function has no zeroes in the interval .
Informally, this result asserts that if one had an infinite sequence of Siegel zeroes, one could use this to generate infinitely many twin primes. See this survey of Friedlander and Iwaniec for more on this “illusory” or “ghostly” parallel universe in analytic number theory, which should not actually exist, but is surprisingly self-consistent and has, to date, proven impossible to banish from the realm of possibility.
The strategy of Heath-Brown’s proof is fairly straightforward to describe. The usual starting point is to try to lower bound
for some large value of , where is the von Mangoldt function. Actually, in this post we will work with the slight variant
where
is the second von Mangoldt function, and denotes Dirichlet convolution, and is an (unsquared) Selberg sieve that damps out small prime factors. This sum also detects twin primes, but will lead to slightly simpler computations. For technical reasons we will also smooth out the interval and remove very small primes from , but we will skip over these steps for the purpose of this informal discussion. (In Heath-Brown’s original paper, the Selberg sieve is essentially replaced by the more combinatorial restriction for some large , where is the primorial of , but I found the computations to be slightly easier if one works with a Selberg sieve, particularly if the sieve is not squared to make it nonnegative.)
If there is a Siegel zero with close to and a Dirichlet character of conductor , then multiplicative number theory methods can be used to show that the Möbius function “pretends” to be like the character in the sense that for “most” primes near (e.g. in the range for some small and large ). Traditionally, one uses complex-analytic methods to demonstrate this, but one can also use elementary multiplicative number theory methods to establish these results (qualitatively at least), as will be shown below the fold.
The fact that pretends to be like can be used to construct a tractable approximation (after inserting the sieve weight ) in the range (where for some large ) for the second von Mangoldt function , namely the function
Roughly speaking, we think of the periodic function and the slowly varying function as being of about the same “complexity” as the constant function , so that is roughly of the same “complexity” as the divisor function
for which it is considerably simpler to obtain asymptotics than for the von Mangoldt function, as the Möbius function is no longer present. (For instance, note from the Dirichlet hyperbola method that one can estimate to accuracy with little difficulty, whereas to obtain a comparable level of accuracy for or is essentially the Riemann hypothesis.)
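To illustrate the hyperbola method just mentioned, the following sketch computes the divisor summatory function exactly in O(√x) time and compares it against the classical asymptotic x log x + (2γ − 1)x; the cutoff is an arbitrary choice and plays no role in the argument:

```python
import math

def divisor_summatory(x):
    """Sum of d(n) for n <= x via the Dirichlet hyperbola method:
    count lattice points (a, b) with a*b <= x, splitting along the
    hyperbola at sqrt(x) and subtracting the double-counted square."""
    r = math.isqrt(x)
    return 2 * sum(x // a for a in range(1, r + 1)) - r * r

x = 10**7
exact = divisor_summatory(x)          # O(sqrt(x)) time
gamma = 0.5772156649015329            # Euler-Mascheroni constant
approx = x * math.log(x) + (2 * gamma - 1) * x
print(exact, round(approx), exact - round(approx))
```

The error here is well below √x, in line with the classical error term for the divisor problem.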
One expects to be a good approximant to if is of size and has no prime factors less than for some large constant . The Selberg sieve will be mostly supported on numbers with no prime factor less than . As such, one can hope to approximate (1) by the expression
as it turns out, the error between this expression and (1) is easily controlled by sieve-theoretic techniques. Let us ignore the Selberg sieve for now and focus on the slightly simpler sum
As discussed above, this sum should be thought of as a slightly more complicated version of the sum
Accordingly, let us look (somewhat informally) at the task of estimating the model sum (3). One can think of this problem as basically that of counting solutions to the equation with in various ranges; this is clearly related to understanding the equidistribution of the hyperbola in . Taking Fourier transforms, the latter problem is closely related to estimation of the Kloosterman sums
where denotes the inverse of in . One can then use the Weil bound
where is the greatest common divisor of (with the convention that this is equal to if vanish), and the decays to zero as . The Weil bound yields good enough control on error terms to estimate (3), and as it turns out the same method also works to estimate (2) (provided that with large enough).
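To make the Weil bound concrete, here is a naive numerical check for a prime modulus; the prime 101 and the O(p)-time summation are for illustration only and play no role in the actual argument:

```python
import cmath
import math

def kloosterman(a, b, p):
    """Naive evaluation of the Kloosterman sum S(a, b; p) for prime p:
    the sum of exp(2*pi*i*(a*x + b*x^{-1})/p) over x = 1, ..., p-1."""
    total = 0j
    for x in range(1, p):
        xinv = pow(x, -1, p)  # modular inverse (Python 3.8+)
        total += cmath.exp(2j * cmath.pi * (a * x + b * xinv) / p)
    return total

p = 101
for a in range(1, 6):
    s = abs(kloosterman(a, 1, p))
    # Weil bound for prime modulus: |S(a, b; p)| <= 2*sqrt(p) when p does not divide ab
    print(a, round(s, 3), s <= 2 * math.sqrt(p))
```

The sums are real (the substitution x → −x conjugates each term), and their magnitudes stay comfortably under 2√p.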
Actually one does not need the full strength of the Weil bound here; any power savings over the trivial bound of will do. In particular, it will suffice to use the weaker, but easier to prove, bounds of Kloosterman:
Lemma 2 (Kloosterman bound) One has
whenever and are coprime to , where the is with respect to the limit (and is uniform in ).
Proof: Observe from a change of variables that the Kloosterman sum is unchanged if one replaces with for . For fixed , the number of such pairs is at least , thanks to the divisor bound. Thus it will suffice to establish the fourth moment bound
The left-hand side can be rearranged as
which by Fourier summation is equal to
Observe from the quadratic formula and the divisor bound that each pair has at most solutions to the system of equations . Hence the number of quadruples of the desired form is , and the claim follows.
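As a numerical sanity check on this fourth moment computation, one can count such quadruples directly for small prime moduli by bucketing ordered pairs on the invariant pair (sum, sum of inverses); the moduli below are arbitrary, and the count indeed grows like a bounded multiple of the square of the modulus:

```python
from collections import Counter

def quadruple_count(p):
    """Count quadruples (x1, x2, x3, x4) of units mod p with
    x1 + x2 = x3 + x4 and x1^{-1} + x2^{-1} = x3^{-1} + x4^{-1} (mod p),
    by bucketing ordered pairs on (sum, sum of inverses) and summing
    the squares of the bucket sizes."""
    inv = [0] + [pow(x, -1, p) for x in range(1, p)]
    buckets = Counter()
    for x1 in range(1, p):
        for x2 in range(1, p):
            buckets[((x1 + x2) % p, (inv[x1] + inv[x2]) % p)] += 1
    return sum(c * c for c in buckets.values())

for p in [11, 31, 101]:
    q = quadruple_count(p)
    print(p, q, q / p**2)  # the ratio stays bounded, consistent with O(p^2)
```

For an odd prime p the quadratic-formula argument makes this count exactly computable: a pair with nonzero invariants determines its unordered pair of roots, giving 3(p−1)² − 3(p−1) quadruples in total.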
We will also need another easy case of the Weil bound to handle some other portions of (2):
Lemma 3 (Easy Weil bound) Let be a primitive real Dirichlet character of conductor , and let . Then
Proof: As is the conductor of a primitive real Dirichlet character, is equal to times a squarefree odd number for some . By the Chinese remainder theorem, it thus suffices to establish the claim when is an odd prime. We may assume that is not divisible by this prime , as the claim is trivial otherwise. If vanishes then does not vanish, and the claim follows from the mean zero nature of ; similarly if vanishes. Hence we may assume that do not vanish, and then we can normalise them to equal . By completing the square it now suffices to show that
whenever . As is on the quadratic residues and on the non-residues, it now suffices to show that
But by making the change of variables , the left-hand side becomes , and the claim follows.
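The last step rests on the classical fact that, for the Legendre symbol χ mod an odd prime p and any shift h not divisible by p, the complete sum of χ(n)χ(n+h) over a period equals −1 (after the change of variables in the proof, one is summing χ over all residues except one). This is easy to confirm numerically; the modulus 103 below is an arbitrary choice:

```python
def legendre(a, p):
    """Legendre symbol (a|p) for an odd prime p, via Euler's criterion."""
    a %= p
    if a == 0:
        return 0
    return 1 if pow(a, (p - 1) // 2, p) == 1 else -1

def char_sum(h, p):
    """Sum of chi(n) * chi(n + h) over a full period mod p,
    where chi is the Legendre symbol mod p."""
    return sum(legendre(n, p) * legendre(n + h, p) for n in range(p))

p = 103
print([char_sum(h, p) for h in range(1, 6)])  # each value is -1
```

By contrast, the degenerate shift h = 0 gives the sum of χ(n)², which is p − 1; it is the nonzero shift that produces the square-root (indeed better) cancellation used above.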
While the basic strategy of Heath-Brown’s argument is relatively straightforward to describe, implementing it requires a large amount of computation to control both main terms and error terms. I experimented for a while with rearranging the argument to try to reduce the amount of computation; I did not fully succeed in arriving at a satisfactorily minimal amount of superfluous calculation, but I was able to at least reduce this amount a bit, mostly by replacing a combinatorial sieve with a Selberg-type sieve (which did not need to be positive, so I dispensed with the squaring aspect of the Selberg sieve to simplify the calculations a little further; also, for minor reasons it was convenient to retain a tiny portion of the combinatorial sieve to eliminate extremely small primes). Some modest reductions in complexity can also be obtained by using the second von Mangoldt function in place of . These exercises were primarily for my own benefit, but I am placing them here in case they are of interest to some other readers.
— 1. Consequences of a Siegel zero —
It is convenient to phrase Heath-Brown’s theorem in the following equivalent form:
Theorem 4 Suppose one has a sequence of real Dirichlet characters of conductor going to infinity, and a sequence of real zeroes with as . Then there are infinitely many prime twins.
Henceforth, we omit the dependence on from all of our quantities (unless they are explicitly declared to be “fixed”), and the asymptotic notation , , , etc. will always be understood to be with respect to the parameter, e.g. means that for some fixed . (In the language of this previous blog post, we are thus implicitly using “cheap nonstandard analysis”, although we will not explicitly use nonstandard analysis notation (other than the asymptotic notation mentioned above) further in this post.) With this convention, we now have a single (but not fixed) Dirichlet character of some conductor with a Siegel zero
It will also be convenient to use the crude bound
which can be proven by elementary means (see e.g. Exercise 57 of this post), although one can use Siegel’s theorem to obtain the better bound . Standard arguments (see also Lemma 59 of this blog post) then give
We now use this Siegel zero to show that pretends to be like for primes that are comparable (in log-scale) to :
For more precise estimates on the error, see the paper of Heath-Brown (particularly Lemma 3).
Proof: It suffices to show, for sufficiently large fixed , that
for each fixed natural number .
We begin by considering the sum
for some large (which we will eventually take to be a power of ); we will exploit the fact that this sum is very stable for comparable to in log-scale. By the Dirichlet hyperbola method, we can write this as
Since , one can show through summation by parts (see Lemma 71 of this previous post) that
for any , while from the integral test (see Lemma 2 of this previous post) we have
We can thus estimate (9) as
From summation by parts we again have
and we have the crude bound
so by using (7) and we arrive at
for any , where the exponent does not depend on . In particular, if and is large enough, then by (6), (7), (8) we have
Setting and and subtracting, we conclude that
On the other hand, observe that is always non-negative, and that whenever and , with primes with . Since any number with has at most representations of the form with and , and no outside of the range has such a representation, we thus see that
Comparing this with (10), we conclude that
since , the claim follows.
— 2. Main argument —
We let be a large absolute constant ( will do) and set to be the primorial of . Set for some large fixed (large compared to or ). Let be a smooth non-negative function supported on and equal to at . Set
and
Thus is a smooth cutoff to the region , and is a smooth cutoff to the region . It will suffice to establish the lower bound
because the non-twin primes contribute at most to the left-hand side. The weight is an unsquared Selberg sieve designed to damp out those for which or have somewhat small prime factors; we did not square this weight, as is customary with the Selberg sieve, in order to simplify the calculations slightly (the fact that the weight can sometimes be negative will not be a serious concern for us).
Thus is non-negative, and supported on those products of primes with and . Convolving (1