You are currently browsing the monthly archive for October 2015.

The Chowla conjecture asserts, among other things, that one has the asymptotic

$\displaystyle \frac{1}{X} \sum_{n \leq X} \lambda(n+h_1) \dots \lambda(n+h_k) = o(1)$

as ${X \rightarrow \infty}$ for any distinct integers ${h_1,\dots,h_k}$, where ${\lambda}$ is the Liouville function. (The usual formulation of the conjecture also allows one to consider more general linear forms ${a_i n + b_i}$ than the shifts ${n+h_i}$, but for the sake of discussion let us focus on the shift case.) This conjecture remains open for ${k \geq 2}$, though there are now some partial results when one averages either in ${x}$ or in the ${h_1,\dots,h_k}$, as discussed in this recent post.

A natural generalisation of the Chowla conjecture is the Elliott conjecture. Its original formulation was basically as follows: one had

$\displaystyle \frac{1}{X} \sum_{n \leq X} g_1(n+h_1) \dots g_k(n+h_k) = o(1) \ \ \ \ \ (1)$

whenever ${g_1,\dots,g_k}$ were bounded completely multiplicative functions and ${h_1,\dots,h_k}$ were distinct integers, and one of the ${g_i}$ was “non-pretentious” in the sense that

$\displaystyle \sum_p \frac{1 - \hbox{Re}( g_i(p) \overline{\chi(p)} p^{-it})}{p} = +\infty \ \ \ \ \ (2)$

for all Dirichlet characters ${\chi}$ and real numbers ${t}$. It is easy to see that some condition like (2) is necessary; for instance if ${g(n) := \chi(n) n^{it}}$ and ${\chi}$ has period ${q}$ then ${\frac{1}{X} \sum_{n \leq X} g(n+q) \overline{g(n)}}$ can be verified to be bounded away from zero as ${X \rightarrow \infty}$.
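As a quick numerical sanity check of this last claim, here is a minimal Python sketch. The specific choices (the non-principal character of period ${4}$, the value ${t=1}$, the cutoff ${X = 10^4}$, and the helper names `chi4`, `g`, `shifted_correlation`) are illustrative assumptions and do not come from the discussion above.

```python
import cmath
import math

def chi4(n):
    """Non-principal Dirichlet character of period 4 (an illustrative choice)."""
    r = n % 4
    if r == 1:
        return 1
    if r == 3:
        return -1
    return 0

def g(n, t):
    """g(n) = chi(n) n^{it}, which fails the non-pretentiousness condition (2)."""
    return chi4(n) * cmath.exp(1j * t * math.log(n))

def shifted_correlation(X, q, t):
    """Compute (1/X) * sum_{n <= X} g(n+q) * conj(g(n))."""
    total = sum(g(n + q, t) * g(n, t).conjugate() for n in range(1, X + 1))
    return total / X

corr = shifted_correlation(10_000, 4, 1.0)
# The magnitude stays near 1/2 (the density of the odd integers), not near 0.
print(abs(corr))
```

Here ${g(n+q) \overline{g(n)} = |\chi(n)|^2 (1+q/n)^{it}}$, which is close to ${1}$ for most ${n}$ coprime to ${q}$, so no cancellation can occur in the average.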

In a previous paper with Matomaki and Radziwill, we provided a counterexample to the original formulation of the Elliott conjecture, and proposed that (2) be replaced with the stronger condition

$\displaystyle \inf_{|t| \leq X} \sum_{p \leq X} \frac{1 - \hbox{Re}( g_i(p) \overline{\chi(p)} p^{-it})}{p} \rightarrow +\infty \ \ \ \ \ (3)$

as ${X \rightarrow \infty}$ for any Dirichlet character ${\chi}$. To support this conjecture, we proved an averaged and non-asymptotic version of this conjecture which roughly speaking showed a bound of the form

$\displaystyle \frac{1}{H^k} \sum_{h_1,\dots,h_k \leq H} |\frac{1}{X} \sum_{n \leq X} g_1(n+h_1) \dots g_k(n+h_k)| \leq \varepsilon$

whenever ${H}$ was an arbitrarily slowly growing function of ${X}$, ${X}$ was sufficiently large (depending on ${\varepsilon,k}$ and the rate at which ${H}$ grows), and one of the ${g_i}$ obeyed the condition

$\displaystyle \inf_{|t| \leq AX} \sum_{p \leq X} \frac{1 - \hbox{Re}( g_i(p) \overline{\chi(p)} p^{-it})}{p} \geq A \ \ \ \ \ (4)$

for some ${A}$ that was sufficiently large depending on ${k,\varepsilon}$, and all Dirichlet characters ${\chi}$ of period at most ${A}$. As further support of this conjecture, I recently established the bound

$\displaystyle \frac{1}{\log \omega} |\sum_{X/\omega \leq n \leq X} \frac{g_1(n+h_1) g_2(n+h_2)}{n}| \leq \varepsilon$

under the same hypotheses, where ${\omega}$ is an arbitrarily slowly growing function of ${X}$.

In view of these results, it is tempting to conjecture that the condition (4) for one of the ${g_i}$ should be sufficient to obtain the bound

$\displaystyle |\frac{1}{X} \sum_{n \leq X} g_1(n+h_1) \dots g_k(n+h_k)| \leq \varepsilon$

when ${A}$ is large enough depending on ${k,\varepsilon}$. This may well be the case for ${k=2}$. However, the purpose of this blog post is to record a simple counterexample for ${k>2}$. Let’s take ${k=3}$ for simplicity. Let ${t_0}$ be a quantity much larger than ${X}$ but much smaller than ${X^2}$ (e.g. ${t_0 = X^{3/2}}$), and set

$\displaystyle g_1(n) := n^{it_0}; \quad g_2(n) := n^{-2it_0}; \quad g_3(n) := n^{it_0}.$

For ${X/2 \leq n \leq X}$, Taylor expansion gives

$\displaystyle (n+1)^{it_0} = n^{it_0} \exp( i t_0 / n ) + o(1)$

and

$\displaystyle (n+2)^{it_0} = n^{it_0} \exp( 2 i t_0 / n ) + o(1)$

and hence

$\displaystyle g_1(n) g_2(n+1) g_3(n+2) = 1 + o(1)$

and hence

$\displaystyle |\frac{1}{X} \sum_{X/2 \leq n \leq X} g_1(n) g_2(n+1) g_3(n+2)| \gg 1.$
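The lower bound can be observed numerically. The following is a minimal sketch in Python, under the assumed (purely illustrative) choices ${X = 10^4}$ and ${t_0 = X^{3/2}}$; the helper names `n_it` and `triple_correlation` are hypothetical.

```python
import cmath
import math

def n_it(n, t):
    """Return n^{it} = exp(i t log n)."""
    return cmath.exp(1j * t * math.log(n))

def triple_correlation(X, t0):
    """Compute (1/X) * sum_{X/2 <= n <= X} g1(n) g2(n+1) g3(n+2),
    where g1(n) = n^{i t0}, g2(n) = n^{-2 i t0}, g3(n) = n^{i t0}."""
    total = 0j
    for n in range(X // 2, X + 1):
        total += n_it(n, t0) * n_it(n + 1, -2 * t0) * n_it(n + 2, t0)
    return total / X

X = 10_000
t0 = X ** 1.5   # much larger than X, much smaller than X^2
avg = triple_correlation(X, t0)
# Each summand has phase about -t0/n^2 = O(X^{-1/2}), so the terms do not cancel;
# the average is close to 1/2 because about half of the n <= X lie in [X/2, X].
print(abs(avg))
```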

On the other hand one can easily verify that all of the ${g_1,g_2,g_3}$ obey (4) (the restriction ${|t| \leq AX}$ there prevents ${t}$ from getting anywhere close to ${t_0}$). So it seems the correct non-asymptotic version of the Elliott conjecture is the following:

Conjecture 1 (Non-asymptotic Elliott conjecture) Let ${k}$ be a natural number, and let ${h_1,\dots,h_k}$ be integers. Let ${\varepsilon > 0}$, let ${A}$ be sufficiently large depending on ${k,\varepsilon,h_1,\dots,h_k}$, and let ${X}$ be sufficiently large depending on ${k,\varepsilon,h_1,\dots,h_k,A}$. Let ${g_1,\dots,g_k}$ be bounded multiplicative functions such that for some ${1 \leq i \leq k}$, one has

$\displaystyle \inf_{|t| \leq AX^{k-1}} \sum_{p \leq X} \frac{1 - \hbox{Re}( g_i(p) \overline{\chi(p)} p^{-it})}{p} \geq A$

for all Dirichlet characters ${\chi}$ of conductor at most ${A}$. Then

$\displaystyle |\frac{1}{X} \sum_{n \leq X} g_1(n+h_1) \dots g_k(n+h_k)| \leq \varepsilon.$

The ${k=1}$ case of this conjecture follows from the work of Halasz; in my recent paper a logarithmically averaged version of the ${k=2}$ case of this conjecture is established. The requirement to take ${t}$ to be as large as ${A X^{k-1}}$ does not emerge in the averaged Elliott conjecture in my previous paper with Matomaki and Radziwill; it thus seems that this averaging has concealed some of the subtler features of the Elliott conjecture. (However, this subtlety does not seem to affect the asymptotic version of the conjecture formulated in that paper, in which the hypothesis is of the form (3), and the conclusion is of the form (1).)

A similar subtlety arises when trying to control the maximal integral

$\displaystyle \frac{1}{X} \int_X^{2X} \sup_\alpha \frac{1}{H} |\sum_{x \leq n \leq x+H} g(n) e(\alpha n)|\ dx. \ \ \ \ \ (5)$

In my previous paper with Matomaki and Radziwill, we could show that the easier expression

$\displaystyle \frac{1}{X} \sup_\alpha \int_X^{2X} \frac{1}{H} |\sum_{x \leq n \leq x+H} g(n) e(\alpha n)|\ dx \ \ \ \ \ (6)$

was small (for ${H}$ a slowly growing function of ${X}$) if ${g}$ was bounded and completely multiplicative, and one had a condition of the form

$\displaystyle \inf_{|t| \leq AX} \sum_{p \leq X} \frac{1 - \hbox{Re}( g(p) \overline{\chi(p)} p^{-it})}{p} \geq A \ \ \ \ \ (7)$

for some large ${A}$. However, to obtain an analogous bound for (5) it now appears that one needs to strengthen the above condition to

$\displaystyle \inf_{|t| \leq AX^2} \sum_{p \leq X} \frac{1 - \hbox{Re}( g(p) \overline{\chi(p)} p^{-it})}{p} \geq A$

in order to address the counterexample in which ${g(n) = n^{it_0}}$ for some ${t_0}$ between ${X}$ and ${X^2}$. This seems to suggest that proving (5) (which is closely related to the ${k=3}$ case of the Chowla conjecture) could in fact be rather difficult; the estimation of (6) relied primarily on prior work of Matomaki and Radziwill which used the hypothesis (7), but as this hypothesis is not sufficient to conclude (5), some additional input must also be used.

One of the major activities in probability theory is studying the various statistics that can be produced from a complex system with many components. One of the simplest possible systems one can consider is a finite sequence ${X_1,\dots,X_n}$ or an infinite sequence ${X_1,X_2,\dots}$ of jointly independent scalar random variables, with the case when the ${X_i}$ are also identically distributed (i.e. the ${X_i}$ are iid) being a model case of particular interest. (In some cases one may consider a triangular array ${(X_{n,i})_{1 \leq i \leq n}}$ of scalar random variables, rather than a finite or infinite sequence.) There are many statistics of such sequences that one can study, but among the most basic of these are the partial sums

$\displaystyle S_n := X_1 + \dots + X_n.$

The first fundamental result about these sums is the law of large numbers (or LLN for short), which comes in two formulations, weak (WLLN) and strong (SLLN). To state these laws, we first must define the notion of convergence in probability.

Definition 1 Let ${X_n}$ be a sequence of random variables taking values in a separable metric space ${R = (R,d)}$ (e.g. the ${X_n}$ could be scalar random variables, taking values in ${{\bf R}}$ or ${{\bf C}}$), and let ${X}$ be another random variable taking values in ${R}$. We say that ${X_n}$ converges in probability to ${X}$ if, for every radius ${\varepsilon > 0}$, one has ${{\bf P}( d(X_n,X) > \varepsilon ) \rightarrow 0}$ as ${n \rightarrow \infty}$. Thus, if ${X_n, X}$ are scalar, we have ${X_n}$ converging to ${X}$ in probability if ${{\bf P}( |X_n-X| > \varepsilon ) \rightarrow 0}$ as ${n \rightarrow \infty}$ for any given ${\varepsilon > 0}$.

The measure-theoretic analogue of convergence in probability is convergence in measure.

It is instructive to compare the notion of convergence in probability with almost sure convergence. It is easy to see that ${X_n}$ converges almost surely to ${X}$ if and only if, for every radius ${\varepsilon > 0}$, one has ${{\bf P}( \bigvee_{n \geq N} (d(X_n,X)>\varepsilon) ) \rightarrow 0}$ as ${N \rightarrow \infty}$; thus, roughly speaking, convergence in probability is good for controlling how a single random variable ${X_n}$ is close to its putative limiting value ${X}$, while almost sure convergence is good for controlling how the entire tail ${(X_n)_{n \geq N}}$ of a sequence of random variables is close to its putative limit ${X}$.

We have the following easy relationships between convergence in probability and almost sure convergence:

Exercise 2 Let ${X_n}$ be a sequence of scalar random variables, and let ${X}$ be another scalar random variable.

• (i) If ${X_n \rightarrow X}$ almost surely, show that ${X_n \rightarrow X}$ in probability. Give a counterexample to show that the converse does not necessarily hold.
• (ii) Suppose that ${\sum_n {\bf P}( |X_n-X| > \varepsilon ) < \infty}$ for all ${\varepsilon > 0}$. Show that ${X_n \rightarrow X}$ almost surely. Give a counterexample to show that the converse does not necessarily hold.
• (iii) If ${X_n \rightarrow X}$ in probability, show that there is a subsequence ${X_{n_j}}$ of the ${X_n}$ such that ${X_{n_j} \rightarrow X}$ almost surely.
• (iv) If ${X_n,X}$ are absolutely integrable and ${{\bf E} |X_n-X| \rightarrow 0}$ as ${n \rightarrow \infty}$, show that ${X_n \rightarrow X}$ in probability. Give a counterexample to show that the converse does not necessarily hold.
• (v) (Urysohn subsequence principle) Suppose that every subsequence ${X_{n_j}}$ of ${X_n}$ has a further subsequence ${X_{n_{j_k}}}$ that converges to ${X}$ in probability. Show that ${X_n}$ also converges to ${X}$ in probability.
• (vi) Does the Urysohn subsequence principle still hold if “in probability” is replaced with “almost surely” throughout?
• (vii) If ${X_n}$ converges in probability to ${X}$, and ${F: {\bf R} \rightarrow {\bf R}}$ or ${F: {\bf C} \rightarrow {\bf C}}$ is continuous, show that ${F(X_n)}$ converges in probability to ${F(X)}$. More generally, if for each ${i=1,\dots,k}$, ${X^{(i)}_n}$ is a sequence of scalar random variables that converge in probability to ${X^{(i)}}$, and ${F: {\bf R}^k \rightarrow {\bf R}}$ or ${F: {\bf C}^k \rightarrow {\bf C}}$ is continuous, show that ${F(X^{(1)}_n,\dots,X^{(k)}_n)}$ converges in probability to ${F(X^{(1)},\dots,X^{(k)})}$. (Thus, for instance, if ${X_n}$ and ${Y_n}$ converge in probability to ${X}$ and ${Y}$ respectively, then ${X_n + Y_n}$ and ${X_n Y_n}$ converge in probability to ${X+Y}$ and ${XY}$ respectively.)
• (viii) (Fatou’s lemma for convergence in probability) If ${X_n}$ are non-negative and converge in probability to ${X}$, show that ${{\bf E} X \leq \liminf_{n \rightarrow \infty} {\bf E} X_n}$.
• (ix) (Dominated convergence in probability) If ${X_n}$ converge in probability to ${X}$, and one almost surely has ${|X_n| \leq Y}$ for all ${n}$ and some absolutely integrable ${Y}$, show that ${{\bf E} X_n}$ converges to ${{\bf E} X}$.

Exercise 3 Let ${X_1,X_2,\dots}$ be a sequence of scalar random variables converging in probability to another random variable ${X}$.

• (i) Suppose that there is a random variable ${Y}$ which is independent of ${X_i}$ for each individual ${i}$. Show that ${Y}$ is also independent of ${X}$.
• (ii) Suppose that the ${X_1,X_2,\dots}$ are jointly independent. Show that ${X}$ is almost surely constant (i.e. there is a deterministic scalar ${c}$ such that ${X=c}$ almost surely).

We can now state the weak and strong law of large numbers, in the model case of iid random variables.

Theorem 4 (Law of large numbers, model case) Let ${X_1, X_2, \dots}$ be an iid sequence of copies of an absolutely integrable random variable ${X}$ (thus the ${X_i}$ are independent and all have the same distribution as ${X}$). Write ${\mu := {\bf E} X}$, and for each natural number ${n}$, let ${S_n}$ denote the random variable ${S_n := X_1 + \dots + X_n}$.

• (i) (Weak law of large numbers) The random variables ${S_n/n}$ converge in probability to ${\mu}$.
• (ii) (Strong law of large numbers) The random variables ${S_n/n}$ converge almost surely to ${\mu}$.

Informally: if ${X_1,\dots,X_n}$ are iid with mean ${\mu}$, then ${X_1 + \dots + X_n \approx \mu n}$ for ${n}$ large. Clearly the strong law of large numbers implies the weak law, but the weak law is easier to prove (and has somewhat better quantitative estimates). There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, or when the random variable ${X}$ is not absolutely integrable, or if one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants below the fold.
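The informal statement can be illustrated by a small simulation. This is only a sketch under assumed choices: the ${X_i}$ are taken to be fair six-sided die rolls (so ${\mu = 3.5}$), the seed and sample sizes are arbitrary, and the helper name `empirical_mean` is hypothetical.

```python
import random

def empirical_mean(n, rng):
    """Return S_n / n for n iid fair die rolls X_1, ..., X_n."""
    s_n = sum(rng.randint(1, 6) for _ in range(n))
    return s_n / n

rng = random.Random(2015)   # fixed seed, used only for reproducibility
mu = 3.5                    # E X for a fair six-sided die
for n in (10, 1_000, 100_000):
    # The deviation |S_n/n - mu| typically shrinks as n grows.
    print(n, abs(empirical_mean(n, rng) - mu))
```

A single run of this kind only samples the behaviour described by the weak law at a few values of ${n}$; it says nothing by itself about the almost sure convergence asserted by the strong law.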

It is instructive to compare the law of large numbers with what one can obtain from the Kolmogorov zero-one law, discussed in Notes 2. Observe that if the ${X_n}$ are real-valued, then the limit superior ${\limsup_{n \rightarrow \infty} S_n/n}$ and limit inferior ${\liminf_{n \rightarrow \infty} S_n/n}$ are tail random variables in the sense that they are not affected if one changes finitely many of the ${X_n}$; in particular, events such as ${\limsup_{n \rightarrow \infty} S_n/n > t}$ are tail events for any ${t \in {\bf R}}$. From this and the zero-one law we see that there must exist deterministic quantities ${-\infty \leq \mu_- \leq \mu_+ \leq +\infty}$ such that ${\limsup_{n \rightarrow \infty} S_n/n = \mu_+}$ and ${\liminf_{n \rightarrow \infty} S_n/n = \mu_-}$ almost surely. The strong law of large numbers can then be viewed as the assertion that ${\mu_- = \mu_+ = \mu}$ when ${X}$ is absolutely integrable. On the other hand, the zero-one law argument does not require absolute integrability (and one can replace the denominator ${n}$ by other functions of ${n}$ that go to infinity as ${n \rightarrow \infty}$).

The law of large numbers asserts, roughly speaking, that the theoretical expectation ${\mu}$ of a random variable ${X}$ can be approximated by taking a large number of independent samples ${X_1,\dots,X_n}$ of ${X}$ and then forming the empirical mean ${S_n/n = \frac{X_1+\dots+X_n}{n}}$. This ability to approximate the theoretical statistics of a probability distribution through empirical data is one of the basic starting points for mathematical statistics, though this is not the focus of the course here. The tendency of statistics such as ${S_n/n}$ to cluster closely around their mean value ${\mu}$ is the simplest instance of the concentration of measure phenomenon, which is of tremendous significance not only within probability, but also in applications of probability to disciplines such as statistics, theoretical computer science, combinatorics, random matrix theory and high dimensional geometry. We will not discuss these topics much in this course, but see this previous blog post for some further discussion.

There are several ways to prove the law of large numbers (in both forms). One basic strategy is to use the moment method – controlling statistics such as ${S_n/n}$ by computing moments such as the mean ${{\bf E} S_n/n}$, variance ${{\bf E} |S_n/n - {\bf E} S_n/n|^2}$, or higher moments such as ${{\bf E} |S_n/n - {\bf E} S_n/n|^k}$ for ${k = 4, 6, \dots}$. The joint independence of the ${X_i}$ makes such moments fairly easy to compute, requiring only some elementary combinatorics. A direct application of the moment method typically requires one to make a finite moment assumption such as ${{\bf E} |X|^k < \infty}$, but as we shall see, one can reduce fairly easily to this case by a truncation argument.

For the strong law of large numbers, one can also use methods relating to the theory of martingales, such as stopping time arguments and maximal inequalities; we present some classical arguments of Kolmogorov in this regard.

In the previous set of notes, we constructed the measure-theoretic notion of the Lebesgue integral, and used this to set up the probabilistic notion of expectation on a rigorous footing. In this set of notes, we will similarly construct the measure-theoretic concept of a product measure (restricting to the case of probability measures to avoid unnecessary technicalities), and use this to set up the probabilistic notion of independence on a rigorous footing. (To quote Durrett: “measure theory ends and probability theory begins with the definition of independence.”) We will be able to take virtually any collection of random variables (or probability distributions) and couple them together to be independent via the product measure construction, though for infinite products there is the slight technicality (a requirement of the Kolmogorov extension theorem) that the random variables need to range in standard Borel spaces. This is not the only way to couple together such random variables, but it is the simplest and the easiest to compute with in practice, as we shall see in the next few sets of notes.

I recently learned about a curious operation on square matrices known as sweeping, which is used in numerical linear algebra (particularly in applications to statistics), as a useful and more robust variant of the usual Gaussian elimination operations seen in undergraduate linear algebra courses. Given an ${n \times n}$ matrix ${A := (a_{ij})_{1 \leq i,j \leq n}}$ (with, say, complex entries) and an index ${1 \leq k \leq n}$, with the entry ${a_{kk}}$ non-zero, the sweep ${\hbox{Sweep}_k[A] = (\hat a_{ij})_{1 \leq i,j \leq n}}$ of ${A}$ at ${k}$ is the matrix given by the formulae

$\displaystyle \hat a_{ij} := a_{ij} - \frac{a_{ik} a_{kj}}{a_{kk}}$

$\displaystyle \hat a_{ik} := \frac{a_{ik}}{a_{kk}}$

$\displaystyle \hat a_{kj} := \frac{a_{kj}}{a_{kk}}$

$\displaystyle \hat a_{kk} := \frac{-1}{a_{kk}}$

for all ${i,j \in \{1,\dots,n\} \backslash \{k\}}$. Thus for instance if ${k=1}$, and ${A}$ is written in block form as

$\displaystyle A = \begin{pmatrix} a_{11} & X \\ Y & B \end{pmatrix} \ \ \ \ \ (1)$

for some ${1 \times n-1}$ row vector ${X}$, ${n-1 \times 1}$ column vector ${Y}$, and ${n-1 \times n-1}$ minor ${B}$, one has

$\displaystyle \hbox{Sweep}_1[A] = \begin{pmatrix} -1/a_{11} & X / a_{11} \\ Y/a_{11} & B - a_{11}^{-1} YX \end{pmatrix}. \ \ \ \ \ (2)$

The inverse sweep operation ${\hbox{Sweep}_k^{-1}[A] = (\check a_{ij})_{1 \leq i,j \leq n}}$ is given by a nearly identical set of formulae:

$\displaystyle \check a_{ij} := a_{ij} - \frac{a_{ik} a_{kj}}{a_{kk}}$

$\displaystyle \check a_{ik} := -\frac{a_{ik}}{a_{kk}}$

$\displaystyle \check a_{kj} := -\frac{a_{kj}}{a_{kk}}$

$\displaystyle \check a_{kk} := \frac{-1}{a_{kk}}$

for all ${i,j \in \{1,\dots,n\} \backslash \{k\}}$. One can check that these operations invert each other. Actually, each sweep turns out to have order ${4}$, so that ${\hbox{Sweep}_k^{-1} = \hbox{Sweep}_k^3}$: an inverse sweep performs the same operation as three forward sweeps. Sweeps also preserve the space of symmetric matrices (allowing one to cut down computational run time in that case by a factor of two), and behave well with respect to principal minors; a sweep of a principal minor is a principal minor of a sweep, after adjusting indices appropriately.
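The sweep formulae translate directly into code. The following minimal Python sketch uses exact rational arithmetic and a ${3 \times 3}$ symmetric test matrix chosen purely for illustration; the function name `sweep`, its `inverse` flag, and the use of 0-based indices (the text uses 1-based indices) are assumptions of this sketch.

```python
from fractions import Fraction

def sweep(A, k, inverse=False):
    """Return Sweep_k[A], or the inverse sweep if inverse=True.

    A is a square list of lists, k is a 0-based index, and A[k][k] must be non-zero."""
    n = len(A)
    pivot = A[k][k]
    sign = -1 if inverse else 1   # only row k and column k differ between the two
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = -1 / pivot
            elif i == k or j == k:
                out[i][j] = sign * A[i][j] / pivot
            else:
                out[i][j] = A[i][j] - A[i][k] * A[k][j] / pivot
    return out

A = [[Fraction(x) for x in row] for row in [[4, 1, 2], [1, 3, 0], [2, 0, 5]]]
k = 0
once = sweep(A, k)
thrice = sweep(sweep(once, k), k)

print(sweep(once, k, inverse=True) == A)    # the two operations invert each other
print(sweep(thrice, k) == A)                # four forward sweeps give back A
print(thrice == sweep(A, k, inverse=True))  # inverse sweep = three forward sweeps
print(all(once[i][j] == once[j][i] for i in range(3) for j in range(3)))  # symmetry is preserved
```

Exact fractions are used here so that the identities can be tested with `==`; a floating point version would need a tolerance instead.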

Remarkably, the sweep operators all commute with each other: ${\hbox{Sweep}_k \hbox{Sweep}_l = \hbox{Sweep}_l \hbox{Sweep}_k}$. If ${1 \leq k \leq n}$ and we perform the first ${k}$ sweeps (in any order) to a matrix

$\displaystyle A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$

with ${A_{11}}$ a ${k \times k}$ minor, ${A_{12}}$ a ${k \times n-k}$ matrix, ${A_{21}}$ a ${n-k \times k}$ matrix, and ${A_{22}}$ a ${n-k \times n-k}$ matrix, one obtains the new matrix

$\displaystyle \hbox{Sweep}_1 \dots \hbox{Sweep}_k[A] = \begin{pmatrix} -A_{11}^{-1} & A_{11}^{-1} A_{12} \\ A_{21} A_{11}^{-1} & A_{22} - A_{21} A_{11}^{-1} A_{12} \end{pmatrix}.$

Note the appearance of the Schur complement in the bottom right block. Thus, for instance, one can essentially invert a matrix ${A}$ by performing all ${n}$ sweeps:

$\displaystyle \hbox{Sweep}_1 \dots \hbox{Sweep}_n[A] = -A^{-1}.$

If a matrix has the form

$\displaystyle A = \begin{pmatrix} B & X \\ Y & a \end{pmatrix}$

for a ${n-1 \times n-1}$ minor ${B}$, ${n-1 \times 1}$ column vector ${X}$, ${1 \times n-1}$ row vector ${Y}$, and scalar ${a}$, then performing the first ${n-1}$ sweeps gives

$\displaystyle \hbox{Sweep}_1 \dots \hbox{Sweep}_{n-1}[A] = \begin{pmatrix} -B^{-1} & B^{-1} X \\ Y B^{-1} & a - Y B^{-1} X \end{pmatrix}$

and all the components of this matrix are usable for various numerical linear algebra applications in statistics (e.g. in least squares regression). Given that sweeps behave well with inverses, it is perhaps not surprising that sweeps also behave well under determinants: the determinant of ${A}$ can be factored as the product of the entry ${a_{kk}}$ and the determinant of the ${n-1 \times n-1}$ matrix formed from ${\hbox{Sweep}_k[A]}$ by removing the ${k^{th}}$ row and column. As a consequence, one can compute the determinant of ${A}$ fairly efficiently (so long as the sweep operations don’t come close to dividing by zero) by sweeping the matrix for ${k=1,\dots,n}$ in turn, and multiplying together the ${kk^{th}}$ entry of the matrix just before the ${k^{th}}$ sweep for ${k=1,\dots,n}$ to obtain the determinant.
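The inversion and determinant recipes can be checked on a small example. This is a minimal sketch, not a numerically robust routine: it uses exact fractions, 0-based indices, an arbitrarily chosen ${3 \times 3}$ matrix, and hypothetical helper names (`sweep`, `sweep_all`, `matmul`), and it assumes no pivot vanishes along the way.

```python
from fractions import Fraction

def sweep(A, k):
    """Forward sweep of the square matrix A at the 0-based index k."""
    n = len(A)
    p = A[k][k]
    return [[-1 / p if (i == k and j == k)
             else A[i][j] / p if (i == k or j == k)
             else A[i][j] - A[i][k] * A[k][j] / p
             for j in range(n)] for i in range(n)]

def sweep_all(A):
    """Sweep at k = 0, ..., n-1 in turn; return (swept matrix, product of pivots)."""
    det = Fraction(1)
    for k in range(len(A)):
        det *= A[k][k]   # the kk-th entry just before the k-th sweep
        A = sweep(A, k)
    return A, det

def matmul(P, Q):
    """Plain matrix product of two square matrices of the same size."""
    n = len(P)
    return [[sum(P[i][l] * Q[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

A = [[Fraction(x) for x in row] for row in [[4, 1, 2], [1, 3, 0], [2, 0, 5]]]
S, det = sweep_all(A)
minus_identity = [[Fraction(-1) if i == j else Fraction(0) for j in range(3)]
                  for i in range(3)]

print(det)                                             # 43, the determinant of A
print(matmul(S, A) == minus_identity)                  # True: S equals -A^{-1}
print(sweep(sweep(A, 0), 1) == sweep(sweep(A, 1), 0))  # True: sweeps commute
```

For this matrix the successive pivots are ${4}$, ${11/4}$ and ${43/11}$, whose product is ${43}$.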

It turns out that there is a simple geometric explanation for these seemingly magical properties of the sweep operation. Any ${n \times n}$ matrix ${A}$ creates a graph ${\hbox{Graph}[A] := \{ (X, AX): X \in {\bf R}^n \}}$ (where we think of ${{\bf R}^n}$ as the space of column vectors). This graph is an ${n}$-dimensional subspace of ${{\bf R}^n \times {\bf R}^n}$. Conversely, most ${n}$-dimensional subspaces of ${{\bf R}^n \times {\bf R}^n}$ arise as graphs; there are some that fail the vertical line test, but these are a positive codimension set of counterexamples.

We use ${e_1,\dots,e_n,f_1,\dots,f_n}$ to denote the standard basis of ${{\bf R}^n \times {\bf R}^n}$, with ${e_1,\dots,e_n}$ the standard basis for the first factor of ${{\bf R}^n}$ and ${f_1,\dots,f_n}$ the standard basis for the second factor. The operation of sweeping the ${k^{th}}$ entry then corresponds to a ninety degree rotation ${\hbox{Rot}_k: {\bf R}^n \times {\bf R}^n \rightarrow {\bf R}^n \times {\bf R}^n}$ in the ${e_k,f_k}$ plane, that sends ${f_k}$ to ${e_k}$ (and ${e_k}$ to ${-f_k}$), keeping all other basis vectors fixed: thus we have

$\displaystyle \hbox{Graph}[ \hbox{Sweep}_k[A] ] = \hbox{Rot}_k \hbox{Graph}[A]$

for generic ${n \times n}$ matrices ${A}$ (more precisely, those ${A}$ with non-vanishing entry ${a_{kk}}$). For instance, if ${k=1}$ and ${A}$ is of the form (1), then ${\hbox{Graph}[A]}$ is the set of tuples ${(r,R,s,S) \in {\bf R} \times {\bf R}^{n-1} \times {\bf R} \times {\bf R}^{n-1}}$ obeying the equations

$\displaystyle a_{11} r + X R = s$

$\displaystyle Y r + B R = S.$

The image of ${(r,R,s,S)}$ under ${\hbox{Rot}_1}$ is ${(s, R, -r, S)}$. Since we can write the above system of equations (for ${a_{11} \neq 0}$) as

$\displaystyle \frac{-1}{a_{11}} s + \frac{X}{a_{11}} R = -r$

$\displaystyle \frac{Y}{a_{11}} s + (B - a_{11}^{-1} YX) R = S$

we see from (2) that ${\hbox{Rot}_1 \hbox{Graph}[A]}$ is the graph of ${\hbox{Sweep}_1[A]}$. Thus the sweep operation is a multidimensional generalisation of the high school geometry fact that the line ${y = mx}$ in the plane becomes ${y = \frac{-1}{m} x}$ after applying a ninety degree rotation.
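The identity ${\hbox{Graph}[ \hbox{Sweep}_k[A] ] = \hbox{Rot}_k \hbox{Graph}[A]}$ can also be tested directly: take a point ${(X, AX)}$ of the graph, rotate it, and check that the result has the form ${(X', \hbox{Sweep}_k[A] X')}$. The sketch below does this for an arbitrarily chosen ${3 \times 3}$ matrix and a handful of vectors (which include a basis of ${{\bf R}^3}$); the helper names `sweep`, `matvec`, `rot` and the 0-based indexing are assumptions of the sketch.

```python
from fractions import Fraction

def sweep(A, k):
    """Forward sweep of the square matrix A at the 0-based index k."""
    n = len(A)
    p = A[k][k]
    return [[-1 / p if (i == k and j == k)
             else A[i][j] / p if (i == k or j == k)
             else A[i][j] - A[i][k] * A[k][j] / p
             for j in range(n)] for i in range(n)]

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rot(x, y, k):
    """Rot_k applied to the point (x, y) of R^n x R^n: f_k -> e_k, e_k -> -f_k."""
    x2, y2 = list(x), list(y)
    x2[k], y2[k] = y[k], -x[k]
    return x2, y2

A = [[Fraction(v) for v in row] for row in [[4, 1, 2], [1, 3, 0], [2, 0, 5]]]
points = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [2, -1, 3]]

ok = True
for k in range(3):
    swept = sweep(A, k)
    for x in points:
        x = [Fraction(v) for v in x]
        x2, y2 = rot(x, matvec(A, x), k)          # rotate a point of Graph[A]
        ok = ok and matvec(swept, x2) == y2       # it lies on Graph[Sweep_k[A]]
print(ok)
```

Since both sides are linear subspaces of the same dimension, checking the inclusion on a basis of ${\hbox{Graph}[A]}$ is enough to confirm equality for this particular matrix.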

It is then an instructive exercise to use this geometric interpretation of the sweep operator to recover all the remarkable properties about these operations listed above. It is also useful to compare the geometric interpretation of sweeping as rotation of the graph to that of Gaussian elimination, which instead shears and reflects the graph by various elementary transformations (this is what is going on geometrically when one performs Gaussian elimination on an augmented matrix). Rotations are less distorting than shears, so one can see geometrically why sweeping can produce fewer numerical artefacts than Gaussian elimination.

In Notes 0, we introduced the notion of a measure space ${\Omega = (\Omega, {\mathcal F}, \mu)}$, which includes as a special case the notion of a probability space. By selecting one such probability space ${(\Omega,{\mathcal F},\mu)}$ as a sample space, one obtains a model for random events and random variables, with random events ${E}$ being modeled by measurable sets ${E_\Omega}$ in ${{\mathcal F}}$, and random variables ${X}$ taking values in a measurable space ${R}$ being modeled by measurable functions ${X_\Omega: \Omega \rightarrow R}$. We then defined some basic operations on these random events and variables:

• Given events ${E,F}$, we defined the conjunction ${E \wedge F}$, the disjunction ${E \vee F}$, and the complement ${\overline{E}}$. For countable families ${E_1,E_2,\dots}$ of events, we similarly defined ${\bigwedge_{n=1}^\infty E_n}$ and ${\bigvee_{n=1}^\infty E_n}$. We also defined the empty event ${\emptyset}$ and the sure event ${\overline{\emptyset}}$, and what it meant for two events to be equal.
• Given random variables ${X_1,\dots,X_n}$ in ranges ${R_1,\dots,R_n}$ respectively, and a measurable function ${F: R_1 \times \dots \times R_n \rightarrow S}$, we defined the random variable ${F(X_1,\dots,X_n)}$ in range ${S}$. (As the special case ${n=0}$ of this, every deterministic element ${s}$ of ${S}$ was also a random variable taking values in ${S}$.) Given a relation ${P: R_1 \times \dots \times R_n \rightarrow \{\hbox{true}, \hbox{false}\}}$, we similarly defined the event ${P(X_1,\dots,X_n)}$. Conversely, given an event ${E}$, we defined the indicator random variable ${1_E}$. Finally, we defined what it meant for two random variables to be equal.
• Given an event ${E}$, we defined its probability ${{\bf P}(E)}$.

These operations obey various axioms; for instance, the boolean operations on events obey the axioms of a Boolean algebra, and the probability function ${E \mapsto {\bf P}(E)}$ obeys the Kolmogorov axioms. However, we will not focus on the axiomatic approach to probability theory here, instead basing the foundations of probability theory on the sample space models as discussed in Notes 0. (But see this previous post for a treatment of one such axiomatic approach.)

It turns out that almost all of the other operations on random events and variables we need can be constructed in terms of the above basic operations. In particular, this allows one to safely extend the sample space in probability theory whenever needed, provided one uses an extension that respects the above basic operations; this is an important operation when one needs to add new sources of randomness to an existing system of events and random variables, or to couple together two separate such systems into a joint system that extends both of the original systems. We gave a simple example of such an extension in the previous notes, but now we give a more formal definition:

Definition 1 Suppose that we are using a probability space ${\Omega = (\Omega, {\mathcal F}, \mu)}$ as the model for a collection of events and random variables. An extension of this probability space is a probability space ${\Omega' = (\Omega', {\mathcal F}', \mu')}$, together with a measurable map ${\pi: \Omega' \rightarrow \Omega}$ (sometimes called the factor map) which is probability-preserving in the sense that

$\displaystyle \mu'( \pi^{-1}(E) ) = \mu(E) \ \ \ \ \ (1)$

for all ${E \in {\mathcal F}}$. (Caution: this does not imply that ${\mu(\pi(F)) = \mu'(F)}$ for all ${F \in {\mathcal F}'}$ – why not?)

An event ${E}$ which is modeled by a measurable subset ${E_\Omega}$ in the sample space ${\Omega}$, will be modeled by the measurable set ${E_{\Omega'} := \pi^{-1}(E_\Omega)}$ in the extended sample space ${\Omega'}$. Similarly, a random variable ${X}$ taking values in some range ${R}$ that is modeled by a measurable function ${X_\Omega: \Omega \rightarrow R}$ in ${\Omega}$, will be modeled instead by the measurable function ${X_{\Omega'} := X_\Omega \circ \pi}$ in ${\Omega'}$. We also allow the extension ${\Omega'}$ to model additional events and random variables that were not modeled by the original sample space ${\Omega}$ (indeed, this is one of the main reasons why we perform extensions in probability in the first place).

Thus, for instance, the sample space ${\Omega'}$ in Example 3 of the previous post is an extension of the sample space ${\Omega}$ in that example, with the factor map ${\pi: \Omega' \rightarrow \Omega}$ given by the first coordinate projection ${\pi(i,j) := i}$. One can verify that all of the basic operations on events and random variables listed above are unaffected by the above extension (with one caveat, see remark below). For instance, the conjunction ${E \wedge F}$ of two events can be defined via the original model ${\Omega}$ by the formula

$\displaystyle (E \wedge F)_\Omega := E_\Omega \cap F_\Omega$

or via the extension ${\Omega'}$ via the formula

$\displaystyle (E \wedge F)_{\Omega'} := E_{\Omega'} \cap F_{\Omega'}.$

The two definitions are consistent with each other, thanks to the obvious set-theoretic identity

$\displaystyle \pi^{-1}( E_\Omega \cap F_\Omega ) = \pi^{-1}(E_\Omega) \cap \pi^{-1}(F_\Omega).$

Similarly, the assumption (1) is precisely what is needed to ensure that the probability ${\mathop{\bf P}(E)}$ of an event remains unchanged when one replaces a sample space model with an extension. We leave the verification of preservation of the other basic operations described above under extension as exercises to the reader.

Remark 2 There is one minor exception to this general rule if we do not impose the additional requirement that the factor map ${\pi}$ is surjective. Namely, for non-surjective ${\pi}$, it can become possible that two events ${E, F}$ are unequal in the original sample space model, but become equal in the extension (and similarly for random variables), although the converse never happens (events that are equal in the original sample space always remain equal in the extension). For instance, let ${\Omega}$ be the discrete probability space ${\{a,b\}}$ with ${p_a=1}$ and ${p_b=0}$, and let ${\Omega'}$ be the discrete probability space ${\{ a'\}}$ with ${p'_{a'}=1}$, and non-surjective factor map ${\pi: \Omega' \rightarrow \Omega}$ defined by ${\pi(a') := a}$. Then the event modeled by ${\{b\}}$ in ${\Omega}$ is distinct from the empty event when viewed in ${\Omega}$, but becomes equal to that event when viewed in ${\Omega'}$. Thus we see that extending the sample space by a non-surjective factor map can identify previously distinct events together (though of course, being probability preserving, this can only happen if those two events were already almost surely equal anyway). This turns out to be fairly harmless though; while it is nice to know if two given events are equal, or if they differ by a non-null event, it is almost never useful to know that two events are unequal if they are already almost surely equal. Alternatively, one can add the additional requirement of surjectivity in the definition of an extension, which is also a fairly harmless constraint to impose (this is what I chose to do in this previous set of notes).

Roughly speaking, one can define probability theory as the study of those properties of random events and random variables that are model-independent in the sense that they are preserved by extensions. For instance, the cardinality ${|E_\Omega|}$ of the model ${E_\Omega}$ of an event ${E}$ is not a concept within the scope of probability theory, as it is not preserved by extensions: continuing Example 3 from Notes 0, the event ${E}$ that a die roll ${X}$ is even is modeled by a set ${E_\Omega = \{2,4,6\}}$ of cardinality ${3}$ in the original sample space model ${\Omega}$, but by a set ${E_{\Omega'} = \{2,4,6\} \times \{1,2,3,4,5,6\}}$ of cardinality ${18}$ in the extension. Thus it does not make sense in the context of probability theory to refer to the “cardinality of an event ${E}$”.
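The die example can be made concrete in a few lines. This is a minimal sketch that assumes both sample spaces carry the uniform probability measure (the natural model for fair dice, but an assumption here); the names `omega`, `omega_ext`, `pi`, `pullback` and `prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = [1, 2, 3, 4, 5, 6]                 # sample space for one die roll
omega_ext = list(product(omega, omega))    # extension: pairs (i, j)

def pi(point):
    """Factor map from the extension to omega: first coordinate projection."""
    return point[0]

def pullback(event):
    """Model of an event in the extension: the preimage under pi."""
    return {w for w in omega_ext if pi(w) in event}

def prob(event, space):
    """Probability under the (assumed) uniform measure on a finite space."""
    return Fraction(len(event), len(space))

E = {w for w in omega if w % 2 == 0}       # the die roll is even
F = {w for w in omega if w >= 4}           # the die roll is at least 4

print(len(E), len(pullback(E)))                            # 3 18: cardinality changes
print(prob(E, omega), prob(pullback(E), omega_ext))        # 1/2 1/2: probability does not
print(pullback(E & F) == pullback(E) & pullback(F))        # True: conjunction is preserved
```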

On the other hand, the supremum ${\sup_n X_n}$ of a collection of random variables ${X_n}$ in the extended real line ${[-\infty,+\infty]}$ is a valid probabilistic concept. This can be seen by manually verifying that this operation is preserved under extension of the sample space, but one can also see this by defining the supremum in terms of existing basic operations. Indeed, note from Exercise 24 of Notes 0 that a random variable ${X}$ in the extended real line is completely specified by the threshold events ${(X \leq t)}$ for ${t \in {\bf R}}$; in particular, two such random variables ${X,Y}$ are equal if and only if the events ${(X \leq t)}$ and ${(Y \leq t)}$ are surely equal for all ${t}$. From the identity

$\displaystyle (\sup_n X_n \leq t) = \bigwedge_{n=1}^\infty (X_n \leq t)$

we thus see that one can completely specify ${\sup_n X_n}$ in terms of ${X_n}$ using only the basic operations provided in the above list (and in particular using the countable conjunction ${\bigwedge_{n=1}^\infty}$). Of course, the same considerations hold if one replaces the supremum by the infimum, limit superior, limit inferior, or (if it exists) the limit.

In this set of notes, we will define some further important operations on scalar random variables, in particular the expectation of these variables. In the sample space models, expectation corresponds to the notion of integration on a measure space. As we will need to use both expectation and integration in this course, we will thus begin by quickly reviewing the basics of integration on a measure space, although we will then translate the key results of this theory into probabilistic language.

As the finer details of the Lebesgue integral construction are not the core focus of this probability course, some of the details of this construction will be left to exercises. See also Chapter 1 of Durrett, or these previous blog notes, for a more detailed treatment.