We continue our study of basic ergodic theorems, establishing the maximal and pointwise ergodic theorems of Birkhoff. Using these theorems, we can then give several equivalent notions of the fundamental concept of ergodicity, which (roughly speaking) plays the role in measure-preserving dynamics that minimality plays in topological dynamics. A general measure-preserving system is not necessarily ergodic, but we shall introduce the ergodic decomposition, which allows one to express any non-ergodic measure as an average of ergodic measures (generalising the decomposition of a permutation into disjoint cycles).

– The maximal ergodic theorem –

Just as we derived the mean ergodic theorem from the more abstract von Neumann ergodic theorem in the previous lecture, we shall derive the maximal ergodic theorem from the following abstract maximal inequality.

Theorem 1. (Dunford-Schwartz maximal inequality) Let $(X, {\mathcal X}, \mu)$ be a probability space, and let $P: L^1(X,{\mathcal X},\mu) \to L^1(X,{\mathcal X},\mu)$ be a linear operator with $P1=1$ and $P^* 1 = 1$ (i.e. $\int_X Pf\ d\mu = \int_X f\ d\mu$ for all $f \in L^1(X,{\mathcal X},\mu)$). Assume also that P maps non-negative functions to non-negative functions. Then the maximal function $Mf := \sup_{N > 0} \frac{1}{N} \sum_{n=0}^{N-1} P^n f$ obeys the inequality

$\lambda \mu( \{ Mf > \lambda \} ) \leq \int_{Mf > \lambda} f\ d\mu$ (1)

for any $\lambda \in {\Bbb R}$.

Proof. We can rewrite (1) as

$\int_{Mf - \lambda > 0} (f-\lambda)\ d\mu \geq 0$. (2)

Since $Mf-\lambda = M(f-\lambda)$, we thus see (by replacing f with $f-\lambda$) that we can reduce to proving (2) in the case $\lambda=0$.

For every $m \geq 1$, consider the modified maximal function $F_m := \sup_{0 \leq N \leq m} \sum_{n=0}^{N-1} P^n f$. Observe that $Mf(x) > 0$ if and only if $F_m(x) > 0$ for all sufficiently large m. By the dominated convergence theorem, it thus suffices to show that

$\int_{F_m > 0} f\ d\mu \geq 0$ (3)

for all m. But observe from definition of $F_m$ (and the positivity preserving nature of P) that we have the pointwise recursive inequality

$F_m(x) \leq F_{m+1}(x) \leq \max( 0, f(x) + P F_m(x) )$. (4)

Integrating this on the region $F_m > 0$ and using the non-negativity of $F_m$, we obtain

$\int_X F_m\ d\mu \leq \int_{F_m > 0} f\ d\mu + \int_X P F_m\ d\mu$. (6)

Since $F_m \in L^1(X,{\mathcal X},\mu)$ and $P^* 1 = 1$, the claim follows. $\Box$
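As a sanity check on (1), here is a minimal Python sketch that tests the inequality in the simplest setting covered by Theorem 1: $X = {\Bbb Z}/n{\Bbb Z}$ with uniform measure and P the cyclic shift $Pf(x) = f(x+1)$, which obeys $P1=1$, $P^*1=1$ and preserves non-negativity. The function names are illustrative and are not part of the lecture. Since the shift has period n, an average over $N = qn+r$ terms is a convex combination of the averages over n and over r terms, so the supremum defining Mf is attained at some $1 \leq N \leq n$ and exact rational arithmetic can be used.

```python
from fractions import Fraction

def maximal_function(f):
    """Return Mf on Z/n for the cyclic shift Pf(x) = f(x+1).

    Only 1 <= N <= n is examined; by periodicity this attains the supremum.
    """
    n = len(f)
    Mf = []
    for x in range(n):
        partial = Fraction(0)
        best = None
        for N in range(1, n + 1):
            partial += f[(x + N - 1) % n]   # adds the term P^{N-1} f(x)
            avg = partial / N
            if best is None or avg > best:
                best = avg
        Mf.append(best)
    return Mf

def maximal_inequality_holds(f, lam):
    """Check lam * mu{Mf > lam} <= integral of f over {Mf > lam} (uniform mu)."""
    n = len(f)
    Mf = maximal_function(f)
    region = [x for x in range(n) if Mf[x] > lam]
    lhs = Fraction(lam) * Fraction(len(region), n)
    rhs = Fraction(sum(f[x] for x in region), n)
    return lhs <= rhs
```

For instance, with f = [1, -1] on ${\Bbb Z}/2{\Bbb Z}$ one has Mf = [1, 0], and for $\lambda = 1/2$ the two sides of (1) are 1/4 and 1/2 respectively. Of course this only tests finitely many cases and is no substitute for the proof.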

Applying this in the case when P is a shift operator, and replacing f by |f|, we obtain

Corollary 1. (Maximal ergodic theorem) Let $(X,{\mathcal X},\mu,T)$ be a measure-preserving system. Then for any $f \in L^1(X,{\mathcal X},\mu)$ and $\lambda > 0$ one has

$\mu( \{ \sup_N \frac{1}{N} \sum_{n=0}^{N-1} |T^n f| > \lambda \} ) \leq \frac{1}{\lambda} \|f\|_{L^1(X,{\mathcal X},\mu)}$. (7)

Note that this inequality implies Markov’s inequality

$\mu( \{ |f| > \lambda \} ) \leq \frac{1}{\lambda} \int_X |f|\ d\mu$ (8)

as a special case. Applying the real interpolation method, one also easily deduces the maximal inequality

$\| \sup_N \frac{1}{N} \sum_{n=0}^{N-1} |T^n f| \|_{L^p(X,{\mathcal X},\mu)} \leq C_p \|f\|_{L^p(X,{\mathcal X},\mu)}$ (9)

for all $1 < p \leq \infty$, where the constant $C_p$ depends on p (it blows up like $O(1/(p-1))$ in the limit $p \to 1$).

Exercise 1 (Rising sun inequality). If $f \in l^1({\Bbb Z})$, and $f^*(m) := \sup_N \frac{1}{N} \sum_{n=0}^{N-1} f(m+n)$, establish the rising sun inequality

$\lambda | \{ m \in {\Bbb Z}: f^*(m) > \lambda \}| \leq \sum_{m \in {\Bbb Z}: f^*(m) > \lambda} f(m)$ (10)

for any $\lambda > 0$. (Hint: one can either adapt the proof of Theorem 1, or else partition the set appearing in (10) into disjoint intervals. The latter proof also leads to a proof of Corollary 1 which avoids the Dunford-Schwartz trick of introducing the functions $F_m$. The terminology “rising sun” comes from seeing how these intervals interact with the graph of the partial sums of f, which resembles the shadows cast on a hilly terrain by a rising sun.) $\diamond$

Exercise 2. (Transference principle) Show that Corollary 1 can be deduced directly from (10). (Hint: given $f \in L^1(X, {\mathcal X},\mu)$, apply (10) to the functions $f_x(n) := T^n f(x)$ for each $x \in X$ (truncating the integers to a finite set if necessary), and then integrate in x using Fubini’s theorem.) This is an example of a transference principle between maximal inequalities on ${\Bbb Z}$ and maximal inequalities on measure-preserving systems. $\diamond$

Exercise 3 (Stein-Stromberg maximal inequality). Derive a continuous version of the Dunford-Schwartz maximal inequality, in which the operators $P^n$ are replaced by a semigroup $P_t$ acting on both $L^1$ and $L^\infty$, in which the underlying measure space is only assumed to be $\sigma$-finite rather than a probability space, and the averages $\frac{1}{N} \sum_{n=0}^{N-1} P^n$ are replaced by $\frac{1}{T} \int_0^T P_t\ dt$. Apply this continuous version with $P_t := e^{t\Delta}$ equal to the heat operator on ${\Bbb R}^d$ for $d \geq 1$ to deduce the Stein-Stromberg maximal inequality

$m( \{ x \in {\Bbb R}^d: \sup_{R > 0} \frac{1}{m(B(x,R))} \int_{B(x,R)} |f|\ dm > \lambda \} ) \leq \frac{Cd}{\lambda} \| f \|_{L^1({\Bbb R}^d, dm)}$ (11)

for all $\lambda > 0$ and $f \in L^1({\Bbb R}^d, dm)$, where m is Lebesgue measure, $B(x,R)$ is the Euclidean ball of radius R centred at x, and the constant C is absolute (independent of d). This improves upon the Hardy-Littlewood maximal inequality, which gives the same estimate but with $Cd$ replaced by $C^d$. It is an open question whether the dependence on d can be removed entirely; the estimate (11) is still the best known in high dimension. For d=1, the best constant C is known to be $\frac{11+\sqrt{61}}{12}=1.567\ldots$, a result of Melas. $\diamond$

Remark 1. The study of maximal inequalities in ergodic theory is, of course, a subject in itself; a classical reference is this monograph of Stein. $\diamond$

– The pointwise ergodic theorem –

Using the maximal ergodic theorem and a standard limiting argument we can now deduce

Theorem 2 (Pointwise ergodic theorem). Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system, and let $f \in L^1(X, {\mathcal X}, \mu)$. Then for $\mu$-almost every $x \in X$, $\frac{1}{N} \sum_{n=0}^{N-1} T^n f(x)$ converges to ${\Bbb E}(f|{\mathcal X}^T)(x)$.

Proof. By subtracting ${\Bbb E}(f|{\mathcal X}^T)$ from f if necessary, it suffices to show that

$\limsup_{N \to \infty} |\frac{1}{N} \sum_{n=0}^{N-1} T^n f(x)| = 0$ (12)

a.e. whenever ${\Bbb E}(f|{\mathcal X}^T) = 0$. By telescoping series, (12) is already true when f takes the form $f=Tg-g$ for some $g \in L^\infty(X,{\mathcal X},\mu)$. So by the arguments used to prove the von Neumann ergodic theorem from the previous lecture, we have already established the claim for a dense class of functions f in $L^2(X,{\mathcal X},\mu)$ with ${\Bbb E}(f|{\mathcal X}^T) = 0$, and thus also for a dense class of functions in $L^1(X,{\mathcal X},\mu)$ with ${\Bbb E}(f|{\mathcal X}^T) = 0$ (since the former space is dense in the latter, and the $L^2$ norm controls the $L^1$ norm by the Cauchy-Schwarz inequality).

Now we use a standard limiting argument. Let $f \in L^1(X,{\mathcal X},\mu)$ with ${\Bbb E}(f|{\mathcal X}^T) = 0$. Then we can find a sequence $f_j$ in the above dense class which converges in $L^1$ to f. For almost every x, we thus have

$\lim_{N \to \infty} |\frac{1}{N} \sum_{n=0}^{N-1} T^n f_j(x)| = 0$ (13)

for all j, and so by the triangle inequality we have

$\limsup_{N \to \infty} |\frac{1}{N} \sum_{n=0}^{N-1} T^n f(x)| \leq \sup_N \frac{1}{N} \sum_{n=0}^{N-1} T^n |f-f_j|(x)$. (14)

But by Corollary 1 we see that the right-hand side of (14) converges to zero in measure as $j \to \infty$. Since the left-hand side does not depend on j, it must vanish almost everywhere, as required. $\Box$

Remark 2. More generally, one can derive a pointwise convergence result on a class of rough functions by first establishing convergence for a dense subclass of functions, and then establishing a maximal inequality which is strong enough to allow one to take limits and establish pointwise convergence for all functions in the larger class. Conversely, principles such as Stein’s maximal principle indicate that in many cases this is in some sense the only way to establish such pointwise convergence results for rough functions. $\diamond$

Remark 3. Using the dominated convergence theorem (starting first with bounded functions f in order to get the domination), one can deduce the mean ergodic theorem from the pointwise ergodic theorem. But the converse is significantly more difficult; pointwise convergence for various ergodic averages is often a much harder result to establish than the corresponding norm convergence result (in particular, many of the techniques discussed in this course appear to be of sharply limited utility for pointwise convergence problems), and many questions in this area remain open. $\diamond$

Exercise 4 (Lebesgue differentiation theorem). Let $f \in L^1({\Bbb R}^d, dm)$ with Lebesgue measure dm. Show that for almost every $x \in {\Bbb R}^d$, we have $\lim_{r \to 0^+} \frac{1}{m(B(x,r))} \int_{B(x,r)} |f(y)-f(x)|\ dy = 0$, and in particular that $\lim_{r \to 0^+} \frac{1}{m(B(x,r))} \int_{B(x,r)} f(y)\ dy = f(x)$. $\diamond$

– Ergodicity –

Combining the mean ergodic theorem with the pointwise ergodic theorem (and with Exercises 7, 8 from the previous lecture) we have

Theorem 3 (Characterisations of ergodicity) Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system. Then the following are equivalent:

1. Any set $E \in {\mathcal X}$ which is invariant (thus TE=E) has either full measure $\mu(E)=1$ or zero measure $\mu(E)=0$.
2. Any set $E \in {\mathcal X}$ which is almost invariant (thus TE differs from E by a null set) has either full measure or zero measure.
3. Any measurable function f with $Tf = f$ a.e. is constant a.e.
4. For any $1 < p < \infty$ and $f \in L^p(X,{\mathcal X},\mu)$, the averages $\frac{1}{N} \sum_{n=0}^{N-1} T^n f$ converge in $L^p$ norm to $\int_X f\ d\mu$.
5. For any two $f, g \in L^\infty(X, {\mathcal X}, \mu)$, we have $\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \int_X (T^n f) g\ d\mu = (\int_X f\ d\mu) (\int_X g\ d\mu)$.
6. For any two measurable sets E and F, we have $\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \mu(T^n E \cap F) = \mu(E) \mu(F)$.
7. For any $f \in L^1(X, {\mathcal X},\mu)$, the averages $\frac{1}{N} \sum_{n=0}^{N-1} T^n f$ converge pointwise almost everywhere to $\int_X f\ d\mu$.

A measure-preserving system with any (and hence all) of the above properties is said to be ergodic.
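To see property 7 at work numerically, the following rough Python sketch computes ergodic averages along a single orbit of the circle shift $x \mapsto x + \alpha$. The helper names and the choice of test functions are assumptions made for illustration only; floating-point arithmetic is used, so this is a heuristic illustration along one orbit rather than a verification of almost everywhere convergence.

```python
import math

def birkhoff_average(f, x0, alpha, N):
    """Average f over the first N points x0, x0 + alpha, ... of the circle-shift orbit."""
    total = 0.0
    x = x0 % 1.0
    for _ in range(N):
        total += f(x)
        x = (x + alpha) % 1.0
    return total / N

def f1(x):
    """A test function with integral zero over the circle."""
    return math.cos(2 * math.pi * x)

def f4(x):
    """Another mean-zero function; it is invariant under the shift by 1/4."""
    return math.cos(2 * math.pi * 4 * x)

# Irrational rotation: the averages of f1 approach the integral of f1, namely 0.
irrational_average = birkhoff_average(f1, 0.1, math.sqrt(2), 10000)

# Rational rotation by 1/4: f4 is invariant, so its averages stay at f4(x0), not at 0.
rational_average = birkhoff_average(f4, 0.1, 0.25, 10000)
```

In the second case the invariant function f4 is non-constant, so property 3 fails and the averages do not converge to $\int_X f\ d\mu$.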

Remark 4. Strictly speaking, ergodicity is a property that applies to a measure-preserving system $(X, {\mathcal X}, \mu, T)$. However, we shall sometimes abuse notation and apply the adjective “ergodic” to a single component of a system, such as the measure $\mu$ or the shift T, when the other three components of the system are clear from context. $\diamond$

Here are some simple examples of ergodicity:

Example 1. If X is finite with uniform measure, then a shift map $T: X \to X$ is ergodic if and only if it is a cycle. $\diamond$

Example 2. If a shift T is ergodic, then so is $T^{-1}$. However, from Example 1 we see that it is not necessarily true that $T^n$ is ergodic for all n (this latter property is also known as total ergodicity). $\diamond$
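Examples 1 and 2 can be checked mechanically on small permutations. The sketch below (with illustrative function names, and with a permutation of $\{0,\ldots,n-1\}$ encoded as a Python list) decomposes a permutation into cycles and uses the criterion of Example 1 as the test for ergodicity with respect to uniform measure.

```python
def cycles(perm):
    """Decompose a permutation of {0,...,n-1} (given as a list) into disjoint cycles."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        result.append(cycle)
    return result

def is_ergodic(perm):
    """Criterion of Example 1: ergodic (for uniform measure) iff there is a single cycle."""
    return len(cycles(perm)) == 1

def power(perm, k):
    """Return the k-fold composition perm^k for k >= 0."""
    result = list(range(len(perm)))
    for _ in range(k):
        result = [perm[x] for x in result]
    return result
```

For the 4-cycle T = [1, 2, 3, 0], `is_ergodic(T)` holds and so does `is_ergodic(power(T, 3))` (note that $T^3 = T^{-1}$ here), whereas `power(T, 2)` splits into the two cycles [0, 2] and [1, 3] and is therefore not ergodic.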

Exercise 5. Show that the circle shift $({\Bbb R}/{\Bbb Z}, x \mapsto x+\alpha)$ (with the usual Lebesgue measure) is ergodic if and only if $\alpha$ is irrational. (Hint: analyse the equation $Tf=f$ for (say) $f \in L^2(X,{\mathcal X},\mu)$ using Fourier analysis. Added, Feb 21: As pointed out to me in class, another way to proceed is to use the Lebesgue density theorem (or Lebesgue differentiation theorem) combined with Exercise 14 from Lecture 6.) $\diamond$

Exercise 6. Let $(\Omega,{\mathcal B},\mu)$ be a standard Borel probability space. Show that the Bernoulli shift on the product system $(\Omega^{\Bbb Z}, {\mathcal B}^{\Bbb Z}, \mu^{\Bbb Z})$ is ergodic. (Hint: first establish property 6 of Theorem 3 when E and F each depend on only finitely many of the coordinates of $\Omega^{\Bbb Z}$.) $\diamond$

Exercise 7. Let $(X, {\mathcal X}, \mu, T)$ be an ergodic system. Show that if $\lambda$ is an eigenvalue of $T: L^2(X, {\mathcal X}, \mu) \to L^2(X, {\mathcal X}, \mu)$, then $|\lambda|=1$, the eigenspace $\{ f \in L^2(X, {\mathcal X}, \mu): Tf = \lambda f \}$ is one-dimensional, and every eigenfunction f has constant magnitude |f| a.e. Show that the eigenspaces are orthogonal to each other in $L^2(X, {\mathcal X}, \mu)$, and that the set of all eigenvalues of T forms an at most countable subgroup of the unit circle $S^1$. $\diamond$

Now we give a less trivial example of an ergodic system.

Proposition 1. (Ergodicity of skew shift) Let $\alpha \in {\Bbb R}$ be irrational. Then the skew shift $(({\Bbb R}/{\Bbb Z})^2, (x,y) \mapsto (x+\alpha,y+x))$ is ergodic.

Proof. Write the skew shift system as $(X, {\mathcal X}, \mu, T)$. To simplify the notation we shall omit the phrase “almost everywhere” in what follows.

We use an argument of Parry. If the system is not ergodic, then we can find a non-constant $f \in L^2(X, {\mathcal X}, \mu)$ such that Tf = f. Next, we use Fourier analysis to write $f = \sum_m f_m$, where $f_m(x,y) := \int_{{\Bbb R}/{\Bbb Z}} f(x,y+\theta) e^{-2\pi i m\theta}\ d\theta$. Since f is T-invariant, and the vertical rotations $(x,y) \mapsto (x,y+\theta)$ commute with T, we see that the $f_m$ are also T-invariant. The function $f_0$ depends only on the x variable, and so is constant by Exercise 5. So it suffices to show that $f_m$ is zero for all non-zero m.

Fix a non-zero m. We can factorise $f_m(x,y) = F_m(x) e^{2\pi i my}$. The T-invariance of $f_m$ now implies that $F_m(x+\alpha) = e^{-2\pi i mx} F_m(x)$. If we then define $F_{m,\theta}(x) := F_m(x+\theta) \overline{F_m(x)}$ for $\theta \in {\Bbb R}$, we see that $F_{m,\theta}(x+\alpha) = e^{-2\pi i m \theta} F_{m,\theta}(x)$, thus $F_{m,\theta}$ is an eigenfunction of the circle shift with eigenvalue $e^{-2\pi i m \theta}$. But this implies (by Exercise 7) that $F_{m,\theta}$ is orthogonal to $F_{m,0}$ for all non-zero $\theta$ sufficiently close to zero (since the eigenvalues $e^{-2\pi i m \theta}$ and 1 are then distinct). Taking limits as $\theta \to 0$ we see that $F_{m,0}$ is orthogonal to itself and must vanish; this implies that $F_m$ and hence $f_m$ vanish as well, as desired. $\Box$

Exercise 8. Show that for any irrational $\alpha$ and any $d \geq 1$, the iterated skew shift system $(({\Bbb R}/{\Bbb Z})^d, (x_1,\ldots,x_d) \mapsto (x_1+\alpha,x_2+x_1,\ldots,x_d+x_{d-1}))$ is ergodic. $\diamond$

– Generic points –

Now let us suppose that we have a topological measure-preserving system $(X, {\mathcal F}, \mu, T)$, i.e. a measure-preserving system $(X, {\mathcal X}, \mu, T)$ which is also a topological dynamical system $(X, {\mathcal F}, T)$, with ${\mathcal X}$ the Borel $\sigma$-algebra of ${\mathcal F}$. Then we have the space C(X) of continuous (real or complex-valued) functions on X, which is dense inside $L^2(X)$. From the Stone-Weierstrass theorem we also see that C(X) is separable.

A sequence $x_1, x_2, x_3, \ldots$ in X is said to be uniformly distributed with respect to $\mu$ if we have

$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N f(x_n) = \int_X f\ d\mu$ (15)

for all $f \in C(X)$. A point x in X is said to be generic if the forward orbit $x, Tx, T^2 x, \ldots$ is uniformly distributed.

Exercise 9. Let $(X, {\mathcal F},\mu)$ be a compact metrisable space with a Borel probability measure $\mu$, and let $x_1,x_2,\ldots$ be a sequence in X. Show that this sequence is uniformly distributed if and only if $\lim_{N \to \infty} \frac{1}{N} |\{ 1\leq i \leq N: x_i \in U\}| = \mu(U)$ for all open sets U in X with $\mu(\partial U)=0$. What happens if the hypothesis that the boundary of $U$ has measure zero is removed? $\diamond$

From Theorem 3 and the separability of C(X) we obtain

Proposition 2. A topological measure-preserving system is ergodic if and only if almost every point is generic.

A topological measure-preserving system is said to be uniquely ergodic if every point is generic. The following exercise explains the terminology:

Exercise 10. Show that a topological measure-preserving system $(X, {\mathcal F}, \mu, T)$ is uniquely ergodic if and only if the only T-invariant Borel probability measure on X is $\mu$. (Hint: use Lemma 1 from Lecture 7.) Because of this fact, one can sensibly define what it means for a topological dynamical system $(X, {\mathcal F}, T)$ to be uniquely ergodic, namely that it has a unique T-invariant Borel probability measure. $\diamond$

It is not always the case that an ergodic system is uniquely ergodic. For instance, in the Bernoulli system $\{0,1\}^{\Bbb Z}$ (with uniform measure on $\{0,1\}$, say), the point $0^{\Bbb Z}$ is not generic. However, for more algebraic systems, it turns out that ergodicity and unique ergodicity are largely equivalent. We illustrate this with the circle and skew shifts:

Exercise 11. Show that the circle shift $({\Bbb R}/{\Bbb Z}, x \mapsto x+\alpha)$ (with the usual Lebesgue measure) is uniquely ergodic if and only if $\alpha$ is irrational. (Hint: first show in the circle shift system that any translate of a generic point is generic.) $\diamond$

Proposition 3. (Unique ergodicity of skew shift) Let $\alpha \in {\Bbb R}$ be irrational. Then the skew shift $(({\Bbb R}/{\Bbb Z})^2, (x,y) \mapsto (x+\alpha,y+x))$ is uniquely ergodic.

Proof. We use an argument of Furstenberg. We again write the skew shift as $(X, {\mathcal X}, \mu, T)$. Suppose for contradiction that this system is not uniquely ergodic; then by Exercise 10 there is another shift-invariant Borel probability measure $\mu' \neq \mu$. If we push $\mu$ and $\mu'$ down to the circle shift system $({\Bbb R}/{\Bbb Z}, x \mapsto x+\alpha)$ by the projection map $(x,y) \mapsto x$, then by Exercises 10, 11 we must get the same measure. Thus $\mu$ and $\mu'$ must agree on any set of the form $A \times ({\Bbb R}/{\Bbb Z})$.

Let E denote the set of points in $X$ which are generic with respect to $\mu$; note that this set is Borel measurable. By Proposition 2, this set has full measure in $\mu$. Also, since the vertical rotations $(x,y) \mapsto (x,y+\theta)$ commute with T and preserve $\mu$, we see that E must be invariant under such rotations; thus it is of the form $A \times ({\Bbb R}/{\Bbb Z})$ for some A. By the preceding discussion, we conclude that E also has full measure in $\mu'$. But then (by the pointwise or mean ergodic theorem for $(X, {\mathcal X}, \mu', T)$) we conclude that ${\Bbb E}_{\mu'}(f|{\mathcal X}^T) = \int_X f\ d\mu$ $\mu'$-almost everywhere for every continuous f, and thus on integrating with respect to $\mu'$ we obtain $\int_X f\ d\mu' = \int_X f\ d\mu$ for every continuous f. But then by the Riesz representation theorem we have $\mu = \mu'$, a contradiction. $\Box$

Corollary 2. If $\alpha \in {\Bbb R}$ is irrational, then the sequence $(\alpha n^2 \hbox{ mod } 1)_{n \in {\Bbb N}}$ is uniformly distributed in ${\Bbb R}/{\Bbb Z}$ (with respect to uniform measure).
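One way to see where the sequence $\alpha n^2$ comes from is to run the skew shift with $\alpha$ replaced by $2\alpha$, starting from the point $(\alpha, 0)$: the n-th iterate then has second coordinate $\alpha + 3\alpha + \ldots + (2n-1)\alpha = \alpha n^2$ mod 1. The Python sketch below (function names are illustrative) iterates this map and measures the empirical frequency with which the second coordinate visits an interval. It uses floating-point arithmetic and a finite orbit, so it is only a numerical illustration of Corollary 2.

```python
import math

def skew_shift_orbit_y(alpha, N):
    """Second coordinates of the first N points of the orbit of (alpha, 0)
    under the skew shift (x, y) -> (x + 2*alpha, y + x) on the torus."""
    x, y = alpha % 1.0, 0.0
    ys = []
    for _ in range(N):
        ys.append(y)
        # the right-hand side uses the old value of x, as in the skew shift
        x, y = (x + 2 * alpha) % 1.0, (y + x) % 1.0
    return ys

def fraction_in_interval(points, a, b):
    """Proportion of the points lying in the interval [a, b)."""
    return sum(1 for p in points if a <= p < b) / len(points)

alpha = math.sqrt(2)
ys = skew_shift_orbit_y(alpha, 20000)
# For a uniformly distributed sequence this frequency tends to b - a = 0.3.
frequency = fraction_in_interval(ys, 0.2, 0.5)
```

The interval $[0.2, 0.5)$ has boundary of measure zero, so by Exercise 9 the frequency should approach 0.3 as the orbit length grows.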

Exercise 11a. Show that the systems considered in Exercise 8 are uniquely ergodic. Conclude that the exponent 2 in Corollary 2 can be replaced by any positive integer d. $\diamond$

Note that the topological dynamics theory developed in Lecture 6 only establishes the weaker statement that the above sequence is dense in ${\Bbb R}/{\Bbb Z}$ rather than uniformly distributed. More generally, it seems that ergodic theory methods can prove topological dynamics results, but not vice versa. Here is another simple example of the same phenomenon:

Exercise 12. Show that a uniquely ergodic topological dynamical system (with the support of the measure equal to the whole space) is necessarily minimal. (The converse is not necessarily true, as already mentioned in Remark 6 of Lecture 7.) $\diamond$

– The ergodic decomposition –

Just as not every topological dynamical system is minimal, not every measure-preserving system is ergodic. Nevertheless, there is an important decomposition that allows one to represent non-ergodic measures as averages of ergodic measures. One can already see this in the finite case, when X is a finite set with the discrete $\sigma$-algebra, and $T: X \to X$ is a permutation on X, which can be decomposed as the disjoint union of cycles on a partition $X = C_1 \cup \ldots \cup C_m$ of X. In this case, all shift-invariant probability measures take the form

$\mu = \sum_{j=1}^{m} \alpha_j \mu_j$ (16)

where $\mu_j$ is the uniform probability measure on the cycle $C_j$, and $\alpha_j$ are non-negative constants adding up to 1. Each of the $\mu_j$ is ergodic, but no non-trivial convex combination of these measures is ergodic. Thus we see in the finite case that every shift-invariant measure can be uniquely expressed as a convex combination of ergodic measures.
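The finite decomposition (16) is easy to compute explicitly. In the following Python sketch (illustrative names; a permutation of $\{0,\ldots,n-1\}$ is encoded as a list and a measure as a list of exact fractions), the weights are $\alpha_j = \mu(C_j)$ and $\mu_j$ is uniform on $C_j$.

```python
from fractions import Fraction

def cycles(perm):
    """Decompose a permutation of {0,...,n-1} (given as a list) into disjoint cycles."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        result.append(cycle)
    return result

def ergodic_decomposition(perm, mu):
    """Return pairs (alpha_j, mu_j) with mu = sum_j alpha_j * mu_j.

    mu must be invariant, i.e. mu[perm[x]] == mu[x] for every point x.
    """
    if any(mu[perm[x]] != mu[x] for x in range(len(perm))):
        raise ValueError("mu is not invariant under perm")
    parts = []
    for cycle in cycles(perm):
        alpha_j = sum(mu[x] for x in cycle)          # alpha_j = mu(C_j)
        mu_j = [Fraction(0)] * len(perm)
        for x in cycle:
            mu_j[x] = Fraction(1, len(cycle))        # uniform measure on C_j
        parts.append((alpha_j, mu_j))
    return parts
```

For example, for the permutation [1, 0, 3, 4, 2] with cycles $\{0,1\}$ and $\{2,3,4\}$ and the invariant measure with masses 1/8, 1/8, 1/4, 1/4, 1/4, the weights are 1/4 and 3/4.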

It turns out that a similar decomposition is available in general, at least if the underlying measure space is a compact topological space (or more generally, a Radon space). This is because of the following general theorem from measure theory.

Definition 1 (Probability kernel). Let $(X, {\mathcal X})$ and $(Y, {\mathcal Y})$ be measurable spaces. A probability kernel $y \mapsto \mu_y$ from Y to X is an assignment of a probability measure $\mu_y$ on X to each $y \in Y$ in such a way that the map $y \mapsto \int_X f\ d\mu_y$ is measurable for every bounded measurable $f: X \to {\Bbb C}$.

Example 3. Every measurable map $\phi: Y \to X$ induces a probability kernel $y \mapsto \delta_{\phi(y)}$. Every probability measure on X can be viewed as a probability kernel from a point to X. If $y \mapsto \mu_y$ and $x \mapsto \nu_x$ are two probability kernels from Y to Z and from X to Y respectively, their composition $x \mapsto (\mu \circ \nu)_x := \int_Y \mu_y\ d\nu_x(y)$ is also a probability kernel, where $\int_Y \mu_y\ d\nu_x(y)$ is the measure that assigns $\int_Y \mu_y(E)\ d\nu_x(y)$ to any measurable set E in Z. Thus one can view the class of measurable spaces and their probability kernels as a category, which includes the class of measurable spaces and their measurable maps as a subcategory. $\diamond$
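When all the spaces involved are finite (with the discrete $\sigma$-algebra), a probability kernel is just a row-stochastic matrix and composition of kernels is matrix multiplication. The Python sketch below makes this concrete; the encoding of kernels as lists of rows and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def compose(mu, nu):
    """Compose kernels: nu from X to Y and mu from Y to Z give (mu o nu) from X to Z.

    A kernel from A to B is a list of rows; row a is the probability vector of the
    measure attached to the point a. The entry for (x, z) is sum_y mu_y({z}) nu_x({y}).
    """
    n_z = len(mu[0])
    return [[sum(nu_x[y] * mu[y][z] for y in range(len(mu))) for z in range(n_z)]
            for nu_x in nu]

def kernel_of_map(phi, n_target):
    """Kernel y -> delta_{phi(y)} induced by a map phi, given as a list of images."""
    return [[Fraction(int(phi[y] == z)) for z in range(n_target)]
            for y in range(len(phi))]

def is_probability_kernel(kernel):
    """Every row must be a probability vector."""
    return all(all(p >= 0 for p in row) and sum(row) == 1 for row in kernel)
```

In this encoding, composing the kernels induced by two maps gives the kernel induced by the composed map, which reflects the subcategory statement above in the finite setting.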

Definition 2. (Regular space) A measurable space $(X, {\mathcal X})$ is said to be regular if there exists a compact metrisable topology ${\mathcal F}$ on X for which ${\mathcal X}$ is the Borel $\sigma$-algebra.

Example 4. Every topological measure-preserving system is regular. $\diamond$

Remark 5. Measurable spaces $(X, {\mathcal X})$ in which ${\mathcal X}$ is the Borel $\sigma$-algebra of a topological space generated by a separable complete metric space (i.e. a Polish space) are known as standard Borel spaces. It is a non-trivial theorem from descriptive set theory that up to measurable isomorphism, there are only three types of standard Borel spaces: finite discrete spaces, countable discrete spaces, and the unit interval [0,1] with the usual Borel $\sigma$-algebra. From this one can see that regular spaces are the same as standard Borel spaces, though we will not need this fact here. $\diamond$

Theorem 4 (Disintegration theorem). Let $(X, {\mathcal X},\mu)$ and $(Y, {\mathcal Y},\nu)$ be probability spaces, with $(X,{\mathcal X})$ regular. Let $\pi: X \to Y$ be a morphism (thus $\nu = \pi_\# \mu$). Then there exists a probability kernel $y \mapsto \mu_y$ such that

$\int_X f (g \circ \pi)\ d\mu = \int_Y (\int_X f\ d\mu_y) g(y)\ d\nu(y)$ (17)

for any bounded measurable $f: X \to {\Bbb C}$ and $g: Y \to {\Bbb C}$. Also, for any such g, we have

$g \circ \pi = g(y)$ $\mu_y$-a.e. (18)

for $\nu$-a.e. y.

Furthermore, this probability kernel is unique up to $\nu$-almost everywhere equivalence, in the sense that if $y \mapsto \mu'_y$ is another probability kernel with the same properties, then $\mu_y = \mu'_y$ for $\nu$-almost every $y$.

We refer to the probability kernel $y \mapsto \mu_y$ generated by the above theorem as the disintegration of $\mu$ relative to the factor map $\pi$.

Proof. We begin by proving uniqueness. Suppose we have two probability kernels $y \mapsto \mu_y, y \mapsto \mu'_y$ with the above properties. Then on subtraction we have

$\int_Y (\int_X f\ d(\mu_y-\mu'_y)) g(y)\ d\nu(y) = 0$ (19)

for all bounded measurable $f: X \to {\Bbb C}$, $g: Y \to {\Bbb C}$. Specialising to $f=1_E$ for some measurable set $E \in {\mathcal X}$, we conclude that $\mu_y(E) = \mu'_y(E)$ for $\nu$-almost every y. Since ${\mathcal X}$ is regular, it is separable and we conclude that $\mu_y = \mu'_y$ for $\nu$-almost every y, as required.

Now we prove existence. The pullback map $\pi^\#: L^2( Y, {\mathcal Y}, \nu) \to L^2( X, {\mathcal X}, \mu)$ defined by $g \mapsto g \circ \pi$ has an adjoint $\pi_\#: L^2(X, {\mathcal X},\mu) \to L^2(Y, {\mathcal Y}, \nu)$, thus

$\int_X f (g \circ \pi)\ d\mu = \int_Y (\pi_\# f) g\ d\nu$ (20)

for all $f \in L^2(X,{\mathcal X},\mu)$ and $g \in L^2(Y,{\mathcal Y},\nu)$. It is easy to see from duality that we have $\| \pi_\# f \|_{L^\infty(Y, {\mathcal Y}, \nu)} \leq \| f \|_{C(X)}$ for all $f \in C(X)$ (where we select a compact metrisable topology that generates the regular $\sigma$-algebra ${\mathcal X}$). Recall that $\pi_\# f$ is not quite a measurable function, but is instead an equivalence class of measurable functions modulo $\nu$-almost everywhere equivalence. Since C(X) is separable, we can find for every $f \in C(X)$ a measurable representative $\tilde \pi_\# f: Y \to {\Bbb C}$ of $\pi_\# f$ which varies linearly with f, and is such that $|\tilde \pi_\# f(y)| \leq \|f\|_{C(X)}$ for all y outside of a set E of $\nu$-measure zero and for all $f \in C(X)$. For all such y, we can then apply the Riesz representation theorem to obtain a Radon probability measure $\mu_y$ such that

$\tilde \pi_\# f(y) = \int_X f\ d\mu_y$ (21)

for all $f \in C(X)$. We set $\mu_y$ equal to some arbitrarily fixed Radon probability measure for $y \in E$. We then observe that the required properties (including the measurability of $y \mapsto \int_X f\ d\mu_y$) are already obeyed for $f \in C(X)$. To generalise this to bounded measurable f, observe that the class ${\mathcal C}$ of f obeying the required properties is closed under dominated pointwise convergence, and so contains the indicator functions of open or compact sets (by Urysohn’s lemma). Applying dominated pointwise convergence again and inner and outer regularity, we see that the indicator function of any Borel set lies in ${\mathcal C}$. Thus all simple measurable functions lie in ${\mathcal C}$, and on taking uniform limits we obtain the claim.

Finally, we prove (18). From two applications of (17) we have

$\int_Y (\int_X f (g \circ \pi)\ d\mu_y) h(y)\ d\nu(y) = \int_Y (\int_X f g(y)\ d\mu_y) h(y)\ d\nu(y)$ (22)

for all bounded measurable $f: X \to {\Bbb C}$ and $h: Y \to {\Bbb C}$. The claim follows (using the separability of the space of all f). $\Box$
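In the finite setting the disintegration can be written down by hand: $\nu$ is the pushforward of $\mu$, and $\mu_y$ is $\mu$ conditioned on the fibre $\pi^{-1}(\{y\})$ whenever $\nu(\{y\}) > 0$. The following Python sketch (illustrative names, exact rational arithmetic, with measures and functions encoded as lists) builds this kernel and evaluates both sides of (17). It is a toy instance under these assumptions and does not touch the measure-theoretic subtleties handled in the proof above.

```python
from fractions import Fraction

def disintegrate(mu, pi, n_y):
    """Return (nu, kernel) for a finite X = {0,...,len(mu)-1} and a map pi: X -> {0,...,n_y-1}.

    nu is the pushforward of mu under pi; kernel[y] is mu conditioned on the fibre over y.
    On the nu-null set of y with nu[y] == 0 an arbitrary fixed measure is used.
    """
    nu = [Fraction(0)] * n_y
    for x, p in enumerate(mu):
        nu[pi[x]] += p
    kernel = []
    for y in range(n_y):
        if nu[y] > 0:
            kernel.append([p / nu[y] if pi[x] == y else Fraction(0)
                           for x, p in enumerate(mu)])
        else:
            kernel.append([Fraction(int(x == 0)) for x in range(len(mu))])  # arbitrary choice
    return nu, kernel

def both_sides_of_17(mu, pi, n_y, f, g):
    """Evaluate the left-hand and right-hand sides of (17) for functions given as lists."""
    nu, kernel = disintegrate(mu, pi, n_y)
    lhs = sum(f[x] * g[pi[x]] * mu[x] for x in range(len(mu)))
    rhs = sum(sum(f[x] * kernel[y][x] for x in range(len(mu))) * g[y] * nu[y]
              for y in range(n_y))
    return lhs, rhs
```

Note that each $\mu_y$ with $\nu(\{y\}) > 0$ is supported on the fibre $\pi^{-1}(\{y\})$ by construction, which is the finite analogue of (18).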

Exercise 13. Let the notation and assumptions be as in Theorem 4. Suppose that ${\mathcal Y}$ is also regular, and that the map $\pi: X \to Y$ is continuous with respect to some compact metrisable topologies that generate ${\mathcal X}$ and ${\mathcal Y}$ respectively. Then show that for $\nu$-almost every y, the probability measure $\mu_y$ is supported in $\pi^{-1}(\{y\})$. $\diamond$

Proposition 4 (Ergodic decomposition). Let $(X, {\mathcal X}, \mu, T)$ be a regular measure-preserving system. Let $(Y, {\mathcal Y}, \nu, S)$ be the system defined by $Y := X$, ${\mathcal Y} := {\mathcal X}^T$, $\nu := \mu\downharpoonright_{\mathcal Y}$, and $S := T$, and let $\pi: X \to Y$ be the identity map. Let $y \mapsto \mu_y$ be the disintegration of $\mu$ with respect to the factor map $\pi$. Then for $\nu$-almost every y, the measure $\mu_y$ is T-invariant and ergodic.

Proof. Observe from the T-invariance $\mu = T_\# \mu$ of $\mu$ (and of ${\mathcal X}^T$) that the probability kernel $y \mapsto T_\# \mu_y$ would also be a disintegration of $\mu$. Thus we have $\mu_y = T_\# \mu_y$ for $\nu$-almost every y.

Now we show the ergodicity. As the space of bounded measurable $f: X \to {\Bbb C}$ is separable, it suffices by Theorem 3 and a limiting argument to show that for any fixed such f, the averages $\frac{1}{N} \sum_{n=1}^N T^n f$ converge pointwise $\mu_y$-a.e. to $\int_X f\ d\mu_y$ for $\nu$-a.e. y.

From the pointwise ergodic theorem, we already know that $\frac{1}{N} \sum_{n=1}^N T^n f$ converges to ${\Bbb E}(f|{\mathcal X}^T)$ outside of a set of $\mu$-measure zero. By (17), this set also has $\mu_y$-measure zero for $\nu$-almost every y. Thus it will suffice to show that ${\Bbb E}(f|{\mathcal X}^T)$ is $\mu_y$-a.e. equal to $\int_X f\ d\mu_y$ for $\nu$-a.e. y. Now observe that ${\Bbb E}(f|{\mathcal X}^T)(x) = \pi_\# f( \pi(x) )$, so the claim follows from (18) and (21). $\Box$

Exercise 14. Let $(X, {\mathcal X})$ be a separable measurable space, and let $T: X \to X$ be a bimeasurable bijection. Let $M({\mathcal X})$ denote the Banach space of all finite measures on ${\mathcal X}$ with the total variation norm. Let $\hbox{Pr}({\mathcal X})^T \subset M({\mathcal X})$ denote the collection of probability measures on ${\mathcal X}$ which are T-invariant. Show that this is a closed convex subset of $M({\mathcal X})$, and the extreme points of $\hbox{Pr}({\mathcal X})^T$ are precisely the ergodic probability measures (which also form a closed subset of $M({\mathcal X})$). This allows one to prove a variant of Proposition 4 using Choquet’s theorem. $\diamond$

Exercise 15. Show that a topological measure-preserving system $(X, {\mathcal F}, \mu, T)$ is uniquely ergodic if and only if the only ergodic shift-invariant Borel probability measure on X is $\mu$. $\diamond$

[Update, Feb 6: Some corrections; new exercises added.]

[Update, Feb 23: More exercises added.]