We now begin our study of measure-preserving systems $(X, {\mathcal X}, \mu, T)$, i.e. a probability space $(X, {\mathcal X}, \mu)$ together with a probability space isomorphism $T: (X, {\mathcal X}, \mu) \to (X, {\mathcal X}, \mu)$ (thus $T: X \to X$ is invertible, with T and $T^{-1}$ both being measurable, and $\mu(T^n E) = \mu(E)$ for all $E \in {\mathcal X}$ and all n). For various technical reasons it is convenient to restrict to the case when the $\sigma$-algebra ${\mathcal X}$ is separable, i.e. countably generated. One reason for this is as follows:

Exercise 1. Let $(X, {\mathcal X}, \mu)$ be a probability space with ${\mathcal X}$ separable. Then the Banach spaces $L^p(X, {\mathcal X}, \mu)$ are separable (i.e. have a countable dense subset) for every $1 \leq p < \infty$; in particular, the Hilbert space $L^2(X, {\mathcal X}, \mu)$ is separable. Show that the claim can fail for $p = \infty$. (We allow the $L^p$ spaces to be either real or complex valued, unless otherwise specified.) $\diamond$

Remark 1. In practice, the requirement that ${\mathcal X}$ be separable is not particularly onerous. For instance, if one is studying the recurrence properties of a function $f: X \to {\Bbb R}$ on a non-separable measure-preserving system $(X, {\mathcal X}, \mu, T)$, one can restrict ${\mathcal X}$ to the separable sub-$\sigma$-algebra ${\mathcal X}'$ generated by the level sets $\{ x \in X: T^n f(x) > q \}$ for integer n and rational q, thus passing to a separable measure-preserving system $(X, {\mathcal X}', \mu, T)$ on which f is still measurable. Thus we see that in many cases of interest, we can immediately reduce to the separable case. (In particular, for many of the theorems in this course, the hypothesis of separability can be dropped, though we won’t bother to specify for which ones this is the case.) $\diamond$

We are interested in the recurrence properties of sets $E \in {\mathcal X}$ or functions $f \in L^p(X, {\mathcal X}, \mu)$. The simplest such recurrence theorem is

Theorem 1. (Poincaré recurrence theorem) Let $(X,{\mathcal X},\mu,T)$ be a measure-preserving system, and let $E \in {\mathcal X}$ be a set of positive measure. Then $\limsup_{n \to +\infty} \mu( E \cap T^n E ) \geq \mu(E)^2$. In particular, $E \cap T^n E$ has positive measure (and is thus non-empty) for infinitely many n.

(Compare with Theorem 1 of Lecture 3.)

Proof. For any integer $N > 1$, observe that $\int_X \sum_{n=1}^N 1_{T^n E}\ d\mu = N \mu(E)$, and thus by Cauchy-Schwarz

$\int_X (\sum_{n=1}^N 1_{T^n E})^2\ d\mu \geq N^2 \mu(E)^2.$ (1)

The left-hand side of (1) can be rearranged as

$\sum_{n=1}^N \sum_{m=1}^N \mu( T^n E \cap T^m E ).$ (2)

On the other hand, $\mu( T^n E \cap T^m E) = \mu( E \cap T^{m-n} E )$. From this one easily obtains the asymptotic

$(2)\leq (\limsup_{n \to \infty} \mu( E \cap T^n E ) + o(1)) N^2,$ (3)

where o(1) denotes an expression which goes to zero as N goes to infinity. Combining (1), (2), (3) and taking limits as $N \to +\infty$ we obtain

$\limsup_{n \to \infty} \mu( E \cap T^n E ) \geq \mu(E)^2$ (4)

as desired. $\Box$
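The inequalities in this proof can be checked by hand on a toy system. The following is a minimal sketch only, assuming the rotation $T(x) = x + a$ on ${\Bbb Z}/m{\Bbb Z}$ with uniform measure; the particular values of m, a, E and N are illustrative choices, not taken from the text. It evaluates the double sum (2) exactly and compares it with the lower bound $N^2 \mu(E)^2$ from (1).

```python
from fractions import Fraction

def shift_set(E, a, m, n):
    """Return T^n E for the rotation T(x) = x + a (mod m)."""
    return {(x + n * a) % m for x in E}

def overlap(E, a, m, n):
    """mu(E intersect T^n E) for the uniform probability measure on Z/m."""
    return Fraction(len(E & shift_set(E, a, m, n)), m)

def pairwise_sum(E, a, m, N):
    """Expression (2): sum over 1 <= n, n' <= N of mu(T^n E intersect T^n' E)."""
    images = [shift_set(E, a, m, n) for n in range(1, N + 1)]
    return sum(Fraction(len(A & B), m) for A in images for B in images)

# Illustrative parameters (assumptions, not from the text).
m, a = 12, 5
E = {0, 1, 4}
mu_E = Fraction(len(E), m)
N = 30

lhs = pairwise_sum(E, a, m, N)                               # left-hand side of (1)
best = max(overlap(E, a, m, n) for n in range(1, N + 1))     # largest overlap seen

print(lhs >= N**2 * mu_E**2, best >= mu_E**2)
```

In a finite system the recurrence itself is trivial (some power of T is the identity), so the point of the sketch is only to make the Cauchy-Schwarz step (1) and the rearrangement (2) concrete.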

Remark 2. In classical physics, the evolution of a physical system in a compact phase space is given by a (continuous-time) measure-preserving system (this is Hamilton’s equations of motion combined with Liouville’s theorem). The Poincaré recurrence theorem then has the following unintuitive consequence: every collection E of states of positive measure, no matter how small, must eventually return to overlap itself given sufficient time. For instance, if one were to burn a piece of paper in a closed system, then there exist arbitrarily small perturbations of the initial conditions such that, if one waits long enough, the piece of paper will eventually reassemble (modulo arbitrarily small error)! This seems to contradict the second law of thermodynamics, but the reason for the discrepancy is that the time required for the recurrence theorem to take effect is inversely proportional to the measure of the set E, which in physical situations is exponentially small in the number of degrees of freedom (which is already typically quite large, e.g. of the order of the Avogadro constant). This gives more than enough opportunity for Maxwell’s demon to come into play to reverse the increase of entropy. (This can be viewed as a manifestation of the curse of dimensionality.) The more sophisticated recurrence theorems we will see later have much poorer quantitative bounds still, so much so that they basically have no direct significance for any physical dynamical system with many relevant degrees of freedom. $\diamond$

Exercise 2. Prove the following generalisation of the Poincaré recurrence theorem: if $(X, {\mathcal X}, \mu, T)$ is a measure-preserving system and $f \in L^1(X, {\mathcal X},\mu)$ is non-negative, then $\limsup_{n \to +\infty} \int_X f T^n f\ d\mu \geq (\int_X f\ d\mu)^2$. $\diamond$

Exercise 3. Give examples to show that the quantity $\mu(E)^2$ in the conclusion of Theorem 1 cannot be replaced by any larger quantity in general, regardless of the actual value of $\mu(E)$. (Hint: use a Bernoulli system example.) $\diamond$

Exercise 4. Using the pigeonhole principle instead of the Cauchy-Schwarz inequality (and in particular, the statement that if $\mu(E_1) + \ldots + \mu(E_n) > 1$, then the sets $E_1,\ldots,E_n$ cannot all be disjoint), prove the weaker statement that for any set E of positive measure in a measure-preserving system, the set $E \cap T^n E$ is non-empty for infinitely many n. (This exercise illustrates the general point that the Cauchy-Schwarz inequality can be viewed as a quantitative strengthening of the pigeonhole principle.) $\diamond$

For this lecture and the next we shall study several variants of the Poincaré recurrence theorem. We begin by looking at the mean ergodic theorem, which studies the limiting behaviour of the ergodic averages $\frac{1}{N} \sum_{n=0}^{N-1} T^n f$ in various $L^p$ spaces, and in particular in $L^2$.

— Hilbert space formulation —

We begin with the Hilbert space formulation of the mean ergodic theorem, due to von Neumann.

Theorem 2. (Von Neumann ergodic theorem) Let $U: H \to H$ be a unitary operator on a separable Hilbert space H. Then for every $v \in H$ we have

$\lim_{N \to +\infty} \frac{1}{N} \sum_{n=0}^{N-1} U^n v = \pi(v)$, (5)

where $\pi: H \to H^U$ is the orthogonal projection from H to the closed subspace $H^U := \{ v \in H: Uv = v \}$ consisting of the U-invariant vectors.

Proof. We give the slick (but not particularly illuminating) proof of Riesz. It is clear that (5) holds if v is already invariant (i.e. $v \in H^U$). Next, let W denote the (possibly non-closed) space $W := \{ Uw - w: w \in H \}$. If Uw-w lies in W and v lies in $H^U$, then by unitarity

$\langle Uw-w, v \rangle = \langle w, U^{-1} v \rangle - \langle w, v \rangle = \langle w, v \rangle - \langle w, v \rangle = 0$ (6)

and thus W is orthogonal to $H^U$. In particular $\pi(Uw-w) = 0$. From the telescoping identity

$\frac{1}{N} \sum_{n=0}^{N-1} U^n (Uw - w) = \frac{1}{N} (U^{N} w - w )$ (7)

we conclude that (5) also holds if $v \in W$; by linearity we conclude that (5) holds for all v in $H^U + W$. A standard limiting argument (using the fact that the linear transformations $v \mapsto \pi(v)$ and $v \mapsto \frac{1}{N} \sum_{n=0}^{N-1} U^n v$ are bounded on H, uniformly in N) then shows that (5) holds for v in the closure $\overline{H^U + W}$.

To conclude, it suffices to show that the closed space $\overline{H^U + W}$ is all of H. Suppose for contradiction that this is not the case. Then there exists a non-zero vector w which is orthogonal to all of $\overline{H^U + W}$. In particular, w is orthogonal to Uw – w. Applying the easily verified identity $\| Uw-w\|^2 = -2 \hbox{Re} \langle Uw-w, w\rangle$ (related to the parallelogram law) we conclude that Uw=w, thus w lies in $H^U$. This implies that w is orthogonal to itself and is thus zero, a contradiction. $\Box$
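For completeness, here is one way to verify the identity used in the last step; the computation uses only the unitarity of U, in the form $\|Uw\| = \|w\|$:

$\| Uw-w\|^2 = \|Uw\|^2 - 2 \hbox{Re} \langle Uw, w \rangle + \|w\|^2 = 2 \|w\|^2 - 2 \hbox{Re} \langle Uw, w \rangle = -2 \hbox{Re} \langle Uw-w, w\rangle.$

In particular, if w is orthogonal to Uw-w, then the right-hand side vanishes and so Uw=w.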

On a measure-preserving system $(X, {\mathcal X}, \mu, T)$, the shift map $f \mapsto Tf$ is a unitary transformation on the separable Hilbert space $L^2(X, {\mathcal X}, \mu)$. We conclude

Corollary 1. (mean ergodic theorem) Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system, and let $f \in L^2(X,{\mathcal X},\mu)$. Then $\frac{1}{N} \sum_{n=0}^{N-1} T^n f$ converges in $L^2(X,{\mathcal X},\mu)$ norm to $\pi(f)$, where $\pi: L^2(X,{\mathcal X},\mu) \to L^2(X,{\mathcal X},\mu)^T$ is the orthogonal projection to the space $L^2(X,{\mathcal X},\mu)^T := \{ f \in L^2(X,{\mathcal X},\mu): Tf =f \}$ consisting of the shift-invariant functions in $L^2(X, {\mathcal X},\mu)$.

Example 4. (Finite case) Suppose that $(X, {\mathcal X}, \mu, T)$ is a finite measure-preserving system, with ${\mathcal X}$ discrete and $\mu$ the uniform probability measure. Then T is a permutation on X and thus decomposes as the direct sum of disjoint cycles (possibly including trivial cycles of length 1). Then the shift-invariant functions are precisely those functions which are constant on each of these cycles, and the map $f \mapsto \pi(f)$ replaces a function $f: X \to {\Bbb C}$ with its average value on each of these cycles. It is then an instructive exercise to verify the mean ergodic theorem by hand in this case. $\diamond$
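The verification by hand can also be mechanised. Below is a minimal sketch in exact rational arithmetic, assuming the convention $Tf := f \circ T^{-1}$ for the shift; the permutation and the function f are arbitrary illustrative choices. When N is a multiple of every cycle length, the ergodic average agrees exactly with the cycle average $\pi(f)$, and for other N the error is O(1/N).

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def cycles(perm):
    """Decompose a permutation (a list with x -> perm[x]) into disjoint cycles."""
    seen, out = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = perm[x]
        out.append(cyc)
    return out

def shift(f, perm):
    """Tf := f o T^{-1}, i.e. (Tf)(T x) = f(x)  (assumed convention)."""
    g = [None] * len(f)
    for x, y in enumerate(perm):
        g[y] = f[x]
    return g

def ergodic_average(f, perm, N):
    """(1/N) * sum_{n=0}^{N-1} T^n f, computed exactly."""
    total, g = [Fraction(0)] * len(f), list(f)
    for _ in range(N):
        total = [t + v for t, v in zip(total, g)]
        g = shift(g, perm)
    return [t / N for t in total]

def cycle_average(f, perm):
    """pi(f): replace f by its mean value on each cycle."""
    out = [None] * len(f)
    for cyc in cycles(perm):
        mean = sum(Fraction(f[x]) for x in cyc) / len(cyc)
        for x in cyc:
            out[x] = mean
    return out

# Illustrative data (assumptions): cycles (0 1 2)(3 4)(5).
perm = [1, 2, 0, 4, 3, 5]
f = [Fraction(v) for v in (6, 0, 3, 1, 5, 7)]
# Least common multiple of the cycle lengths.
period = reduce(lambda p, q: p * q // gcd(p, q), (len(c) for c in cycles(perm)), 1)

print(ergodic_average(f, perm, period) == cycle_average(f, perm))
```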

Exercise 5. With the notation and assumptions of Corollary 1, show that the limit $\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \int_X T^n f \overline{f}\ d\mu$ exists, is real, and is greater than or equal to $|\int_X f\ d\mu|^2$. (Hint: the constant function $1$ lies in $L^2(X, {\mathcal X}, \mu)^T$.) Note that this is stronger than the conclusion of Exercise 2. $\diamond$

Let us now give some other proofs of the von Neumann ergodic theorem. We first give a proof (close to von Neumann’s original proof) using the spectral theorem for unitary operators. This theorem asserts (among other things) that a unitary operator $U: H \to H$ can be expressed in the form $U = \int_{S^1} \lambda\ d\mu(\lambda)$, where $S^1 := \{ z \in {\Bbb C}: |z|=1\}$ is the unit circle and $\mu$ is a projection-valued Borel measure on the circle. More generally, we have

$U^n = \int_{S^1} \lambda^n\ d\mu(\lambda)$ (8)

and so for any vector v in H and any positive integer N

$\frac{1}{N} \sum_{n=0}^{N-1} U^n v = \int_{S^1} \frac{1}{N} \sum_{n=0}^{N-1} \lambda^n\ d\mu(\lambda) v$. (9)

We separate off the $\lambda=1$ portion of this integral. For $\lambda \neq 1$, we have the geometric series formula

$\frac{1}{N} \sum_{n=0}^{N-1} \lambda^n = \frac{1}{N} \frac{\lambda^{N}-1}{\lambda-1}$ (10)

(compare with (7)), thus we can rewrite (9) as

$\mu(\{1\}) v + \int_{S^1 \backslash \{1\}} \frac{1}{N} \frac{\lambda^{N}-1}{\lambda-1}\ d\mu(\lambda) v$. (11)

Now observe (using (10)) that $\frac{1}{N} \frac{\lambda^{N}-1}{\lambda-1}$ is bounded in magnitude by 1 and converges to zero as $N \to \infty$ for any fixed $\lambda \neq 1$. Applying the dominated convergence theorem (which requires a little bit of justification in this vector-valued case), we see that the second term in (11) goes to zero as $N \to \infty$. So we see that (9) converges to $\mu(\{1\}) v$. But $\mu(\{1\})$ is just the orthogonal projection to the eigenspace of U with eigenvalue 1, i.e. the space $H^U$, thus recovering the von Neumann ergodic theorem. (It is instructive to use spectral theory to interpret Riesz’s proof of this theorem and see how it relates to the argument just given.)
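The two properties of the averaged geometric series used above (boundedness by 1 and decay for fixed $\lambda \neq 1$) are easy to probe numerically. The following is a rough floating-point sketch; the sample angle `theta` is an arbitrary assumption. It also checks the explicit bound $2/(N|\lambda-1|)$ that follows from (10) and the triangle inequality.

```python
import cmath

def kernel(lam, N):
    """(1/N) * sum_{n=0}^{N-1} lam**n, summed term by term."""
    total, power = 0j, 1 + 0j
    for _ in range(N):
        total += power
        power *= lam
    return total / N

def closed_form(lam, N):
    """Right-hand side of (10); only meaningful for lam != 1."""
    return (lam ** N - 1) / (N * (lam - 1))

theta = 0.3                          # arbitrary sample angle (an assumption)
lam = cmath.exp(1j * theta)          # a point on the unit circle, lam != 1

# Magnitude of the kernel for increasing N.
values = {N: abs(kernel(lam, N)) for N in (1, 10, 100, 1000, 10000)}
for N, val in values.items():
    print(N, val, 2 / (N * abs(lam - 1)))
```

The bound $2/(N|\lambda-1|)$ degrades as $\lambda$ approaches 1, which is why the convergence in the second term of (11) need not be uniform in $\lambda$.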

Remark 3. The above argument in fact shows that the rate of convergence in the von Neumann ergodic theorem is controlled by the spectral gap of U – i.e. how well-separated the trivial component $\{1\}$ of the spectrum is from the rest of the spectrum. This is one of the reasons why results on spectral gaps of various operators are highly prized. $\diamond$

We now give another proof of Theorem 2, based on the energy decrement method; this proof is significantly lengthier, but is particularly well suited for conversion to finitary quantitative settings. For any positive integer N, define the averaging operators $A_N := \frac{1}{N} \sum_{n=0}^{N-1} U^n$; by the triangle inequality we see that $\| A_N v \| \leq \|v\|$ for all v. Now we observe

Lemma 1. (Lack of uniformity implies energy decrement) Suppose $\| A_N v \| \geq \varepsilon$. Then $\| v - A_N^* A_N v \|^2 \leq \|v\|^2 - \varepsilon^2$.

Proof. This follows from the identity

$\| v - A_N^* A_N v \|^2 = \|v\|^2 - 2 \|A_N v \|^2 + \| A_N^* A_N v \|^2$ (12)

and the fact that $A_N^*$ has operator norm at most 1. $\Box$
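As a sanity check, identity (12) and the resulting energy decrement can be tested exactly on a finite-dimensional example. The sketch below assumes that U is the (real) permutation operator $(Uv)(Tx) = v(x)$ on a six-point set, so that $U^* = U^{-1}$; the permutation, the vector v and the value of N are illustrative assumptions.

```python
from fractions import Fraction

def apply_U(v, perm, inverse=False):
    """(Uv)(perm[x]) = v(x); with inverse=True apply U^{-1} = U^* instead."""
    out = [None] * len(v)
    for x, y in enumerate(perm):
        if inverse:
            out[x] = v[y]
        else:
            out[y] = v[x]
    return out

def average(v, perm, N, adjoint=False):
    """A_N v = (1/N) sum_{n<N} U^n v, or A_N^* v when adjoint=True."""
    total, g = [Fraction(0)] * len(v), list(v)
    for _ in range(N):
        total = [t + c for t, c in zip(total, g)]
        g = apply_U(g, perm, inverse=adjoint)
    return [t / N for t in total]

def energy(v):
    """Squared norm ||v||^2 for the counting inner product (real vectors)."""
    return sum(c * c for c in v)

# Illustrative data (assumptions): a 4-cycle and a 2-cycle.
perm = [1, 2, 3, 0, 5, 4]
v = [Fraction(c) for c in (3, -1, 2, 0, 1, 4)]
N = 3

Av = average(v, perm, N)
AstarAv = average(Av, perm, N, adjoint=True)
residual = [p - q for p, q in zip(v, AstarAv)]

# Identity (12), then the decrement ||v - A*Av||^2 <= ||v||^2 - ||Av||^2.
print(energy(residual) == energy(v) - 2 * energy(Av) + energy(AstarAv))
print(energy(residual) <= energy(v) - energy(Av))
```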

We now iterate this to obtain

Proposition 1. (Koopman-von Neumann type theorem) Let v be a unit vector, let $\varepsilon > 0$, and let $1 < N_1 < N_2 < \ldots < N_J$ be a sequence of integers with $J > 1/\varepsilon^2 + 2$. Then there exists $1 \leq j < J$ and a decomposition $v = s + r$ where $\| Us - s\| = O( J \frac{1}{N_{j+1}} )$ and $\| A_N r \| \leq \varepsilon$ for all $N \geq N_j$.

(The letters s, r stand for “structured” and “random” (or “residual”) respectively. For more on decompositions into structured and random components, see my FOCS lecture notes.)

Proof. We perform the following algorithm:

1. Initialise j := J-1, s := 0, and r := v.
2. If $\| A_N r \| \leq \varepsilon$ for all $N \geq N_j$ then STOP. If instead $\|A_N r \| > \varepsilon$ for some $N \geq N_j$, observe from Lemma 1 that $\| r - A_N^* A_N r \|^2 \leq \|r\|^2 - \varepsilon^2$.
3. Replace r with $r - A_N^* A_N r$, replace s with $s + A_N^* A_N r$, and replace j with j-1. Then return to Step 2.

Observe that this procedure must terminate in at most $1/\varepsilon^2$ steps (since the energy $\|r\|^2$ starts at 1, drops by at least $\varepsilon^2$ at each stage, and cannot go below zero). In particular, j stays positive. Observe also that r always has norm at most 1, and thus $\| (U - I) A_N^* A_N r \| = O( 1/N )$ at any given stage of the algorithm. From this and the triangle inequality one easily verifies the required claims. $\Box$
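The algorithm in this proof can be sketched in code for a finite-dimensional toy model. This is only an illustration under several assumptions: U is a permutation operator on a six-point set, the condition "for all $N \geq N_j$" is truncated to $N_j \leq N \leq$ `n_max` (a hypothetical cutoff, needed to make the check finite), and when the check fails we simply take the smallest offending N. Squared norms are used throughout so that the arithmetic stays rational.

```python
from fractions import Fraction

def apply_U(v, perm, inverse=False):
    """(Uv)(perm[x]) = v(x); with inverse=True apply U^{-1} = U^*."""
    out = [None] * len(v)
    for x, y in enumerate(perm):
        if inverse:
            out[x] = v[y]
        else:
            out[y] = v[x]
    return out

def average(v, perm, N, adjoint=False):
    """A_N v, or A_N^* v when adjoint=True."""
    total, g = [Fraction(0)] * len(v), list(v)
    for _ in range(N):
        total = [t + c for t, c in zip(total, g)]
        g = apply_U(g, perm, inverse=adjoint)
    return [t / N for t in total]

def energy(v):
    """Squared norm ||v||^2 (real vectors, counting inner product)."""
    return sum(c * c for c in v)

def decompose(v, perm, scales, eps2, n_max):
    """Energy decrement loop. scales = [N_1, ..., N_J], eps2 = eps**2;
    n_max truncates the check 'for all N >= N_j' (an assumption)."""
    j = len(scales) - 1                       # Step 1: j := J - 1 (1-based index)
    s, r = [Fraction(0)] * len(v), list(v)
    steps = 0
    while True:
        if j < 1:
            raise ValueError("ran out of scales; J was too small")
        # Step 2: look for some N >= N_j with ||A_N r|| > eps.
        bad = next((N for N in range(scales[j - 1], n_max + 1)
                    if energy(average(r, perm, N)) > eps2), None)
        if bad is None:
            return j, s, r, steps             # STOP
        # Step 3: move A_N^* A_N r from r to s, and decrease j.
        piece = average(average(r, perm, bad), perm, bad, adjoint=True)
        r = [p - q for p, q in zip(r, piece)]
        s = [p + q for p, q in zip(s, piece)]
        j -= 1
        steps += 1

# Illustrative data (assumptions): unit vector v, eps = 1/2, J = 7 > 1/eps^2 + 2.
perm = [1, 2, 3, 0, 5, 4]
v = [Fraction(3, 5), Fraction(4, 5)] + [Fraction(0)] * 4
scales = [2, 4, 8, 16, 32, 64, 128]
j, s, r, steps = decompose(v, perm, scales, Fraction(1, 4), n_max=200)
print(j, steps, energy(r))
```

In this small example the loop stops after a single decrement, with s equal to the invariant part of v; in general the number of decrements is at most $1/\varepsilon^2$, as in the proof.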

Corollary 2 (partial von Neumann ergodic theorem). For any vector v, the averages $A_N v$ form a Cauchy sequence in H.

Proof. Without loss of generality we can take v to be a unit vector. Suppose for contradiction that $A_N v$ were not Cauchy. Then one could find $\varepsilon > 0$ and $1 < N_1 < M_1 < N_2 < M_2 < \ldots$ such that $\|A_{N_j} v - A_{M_j} v \| \geq 5 \varepsilon$ (say) for all j. By sparsifying the sequence if necessary we can assume that $N_{j+1}$ is large compared to $N_j$, $M_j$ and $\varepsilon$. Now we apply Proposition 1 to find $j = O_\varepsilon(1)$ and a decomposition v = s + r such that $\|Us-s\| = O_\varepsilon( 1 / N_{j+1} )$ and $\| A_{N_j} r \|, \|A_{M_j} r \| \leq \varepsilon$. If $N_{j+1}$ is large enough depending on $N_j, M_j, \varepsilon$, we thus have $\|A_{N_j} s - s\|, \|A_{M_j} s - s \| \leq \varepsilon$, and thus by the triangle inequality, $\|A_{N_j} v - A_{M_j} v \| \leq 4 \varepsilon$, a contradiction. $\Box$

Remark 4. This result looks weaker than Theorem 2, but the argument is much more robust; for instance, one can modify it to establish convergence of multiple averages such as $\frac{1}{N} \sum_{n=1}^N T_1^n f_1 T_2^n f_2 T_3^n f_3$ in $L^p$ norms for commuting shifts $T_1, T_2, T_3$, which does not seem possible using the other arguments given here; see this paper of mine for details. Further quantitative analysis of the mean ergodic theorem can be found in this paper of Avigad, Gerhardy, and Towsner. $\diamond$

Corollary 2 can be used to recover Theorem 2 in its full strength, by combining it with a weak form of Theorem 2:

Proposition 2 (Weak von Neumann ergodic theorem) The conclusion (5) of Theorem 2 holds in the weak topology.

Proof. The averages $A_N v$ lie in a bounded subset of the separable Hilbert space H, and are thus precompact in the weak topology by the sequential Banach-Alaoglu theorem. Thus, if (5) fails, then there exists a subsequence $A_{N_j} v$ which converges in the weak topology to some limit w other than $\pi(v)$. By telescoping series we see that $\| U A_{N_j} v - A_{N_j} v \| \leq 2 \|v\|/N_j$, and so on taking limits we see that $\|Uw - w\|=0$, i.e. $w \in H^U$. On the other hand, if y is any vector in $H^U$, then $A_{N_j}^* y = y$, and thus on taking inner products with v we obtain $\langle y, A_{N_j} v \rangle = \langle y, v \rangle$. Taking limits we obtain $\langle y, w \rangle = \langle y, v \rangle$, i.e. v-w is orthogonal to $H^U$. These facts imply that $w = \pi(v)$, giving the desired contradiction. $\Box$

— Conditional expectation —

We now turn away from the abstract Hilbert space approach to the ergodic theorem (which is excellent for proving the mean ergodic theorem, but not flexible enough to handle more general ergodic theorems) and turn to a more measure-theoretic dynamics approach, based on manipulating the four components $X, {\mathcal X}, \mu, T$ of the underlying system separately, rather than working with the single object $L^2( X, {\mathcal X}, \mu)$ (with the unitary shift T). In particular it is useful to replace the $\sigma$-algebra ${\mathcal X}$ by a sub-$\sigma$-algebra ${\mathcal X}' \subset {\mathcal X}$, thus reducing the number of measurable functions. This creates an isometric embedding of Hilbert spaces

$L^2( X, {\mathcal X}', \mu) \subset L^2( X, {\mathcal X}, \mu)$ (13)

and so the former space is a closed subspace of the latter. In particular, we have an orthogonal projection ${\Bbb E}( \cdot|{\mathcal X}'): L^2( X, {\mathcal X}, \mu) \to L^2( X, {\mathcal X}', \mu)$, which can be viewed as the adjoint of the inclusion (13). In other words, for any $f \in L^2(X,{\mathcal X},\mu)$, ${\Bbb E}(f|{\mathcal X}')$ is the unique element of $L^2(X, {\mathcal X}',\mu)$ such that

$\int_X {\Bbb E}(f|{\mathcal X}') \overline{g}\ d\mu = \int_X f \overline{g}\ d\mu$ (14)

for all $g \in L^2(X, {\mathcal X}', \mu)$. (A reminder: when dealing with $L^p$ spaces, we identify any two functions which agree $\mu$-almost everywhere. Thus, technically speaking, elements of $L^p$ spaces are not actually functions, but rather equivalence classes of functions.)

Example 5. (Finite case) Let X be a finite set, thus ${\mathcal X}$ can be viewed as a partition of X, and ${\mathcal X}' \subset {\mathcal X}$ is a coarser partition of X. To avoid degeneracies, assume that every point in X has positive measure with respect to $\mu$. Then an element f of $L^2(X, {\mathcal X}, \mu)$ is just a function $f: X \to {\Bbb C}$ which is constant on each atom of ${\mathcal X}$. Similarly for $L^2(X, {\mathcal X}', \mu)$. The conditional expectation ${\Bbb E}(f|{\mathcal X}')$ is then the function whose value on each atom A of ${\mathcal X}'$ is equal to the average value $\frac{1}{\mu(A)} \int_A f\ d\mu$ on that atom. (What needs to be changed here if some points have zero measure?) $\diamond$
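To make this example concrete, here is a small sketch that computes the conditional expectation on a four-point space and checks the defining property (14) against one ${\mathcal X}'$-measurable test function. It assumes that ${\mathcal X}$ is the discrete $\sigma$-algebra, that all functions are real-valued (so the complex conjugate in (14) can be dropped), and that every point has positive mass; the specific weights and values are made up for illustration.

```python
from fractions import Fraction

def cond_exp(f, mu, atoms):
    """E(f | X') on a finite set: the mu-weighted average of f on each atom.
    f and mu are dicts keyed by point; atoms is a list of disjoint sets."""
    out = {}
    for A in atoms:
        mass = sum(mu[x] for x in A)               # mu(A), assumed positive
        mean = sum(f[x] * mu[x] for x in A) / mass
        for x in A:
            out[x] = mean
    return out

def inner(f, g, mu):
    """Real inner product: the integral of f * g with respect to mu."""
    return sum(f[x] * g[x] * mu[x] for x in mu)

# Illustrative data (assumptions).
mu = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
atoms = [{0, 1}, {2, 3}]                           # atoms of the coarser partition X'
f = {0: Fraction(4), 1: Fraction(-2), 2: Fraction(1), 3: Fraction(5)}
g = {0: Fraction(7), 1: Fraction(7), 2: Fraction(-3), 3: Fraction(-3)}  # constant on atoms

Ef = cond_exp(f, mu, atoms)
print(Ef)
print(inner(Ef, g, mu) == inner(f, g, mu))         # property (14) for this g
```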

We leave the following standard properties of conditional expectation as an exercise.

Exercise 6. Let $(X, {\mathcal X}, \mu)$ be a probability space, and let ${\mathcal X}'$ be a sub-$\sigma$-algebra. Let $f \in L^2(X, {\mathcal X}, \mu)$.

1. The operator $f \mapsto {\Bbb E}( f|{\mathcal X'})$ is a bounded self-adjoint projection on $L^2(X,{\mathcal X},\mu)$. It maps real functions to real functions, it preserves constant functions (and more generally preserves ${\mathcal X'}$-measurable functions), and commutes with complex conjugation.
2. If f is non-negative, then ${\Bbb E}( f|{\mathcal X'})$ is non-negative (up to sets of measure zero, of course). More generally, we have a comparison principle: if f, g are real-valued and $f \leq g$ pointwise a.e., then ${\Bbb E}( f|{\mathcal X'}) \leq {\Bbb E}( g|{\mathcal X'})$ a.e. Similarly, we have the triangle inequality $|{\Bbb E}( f|{\mathcal X'})| \leq {\Bbb E}( |f||{\mathcal X'})$ a.e.
3. (Module property) If $g \in L^\infty(X, {\mathcal X}', \mu)$, then ${\Bbb E}( f g|{\mathcal X'}) = {\Bbb E}( f|{\mathcal X'}) g$ a.e.
4. (Contraction) If $f \in L^2(X, {\mathcal X},\mu) \cap L^p(X, {\mathcal X},\mu)$ for some $1 \leq p \leq \infty$, then $\|{\Bbb E}(f|{\mathcal X'})\|_{L^p} \leq \|f\|_{L^p}$. (Hint: do the p=1 and $p=\infty$ cases first.) This implies in particular that conditional expectation has a unique continuous extension to $L^p(X, {\mathcal X},\mu)$ for $1 \leq p \leq \infty$ (the $p=\infty$ case is exceptional, but note that $L^\infty$ is contained in $L^2$ since $\mu$ is finite). $\diamond$

For applications to ergodic theory, we will only be interested in taking conditional expectations with respect to a shift-invariant sub-$\sigma$-algebra ${\mathcal X'}$, thus $T$ and $T^{-1}$ preserve ${\mathcal X'}$. In that case T preserves $L^2(X,{\mathcal X}',\mu)$, and thus T commutes with conditional expectation, or in other words that

${\Bbb E}( T^n f | {\mathcal X}' ) = T^n {\Bbb E}( f | {\mathcal X}' )$ (15)

a.e. for all $f \in L^2(X,{\mathcal X}, \mu)$ and all n.

Now we connect conditional expectation to the mean ergodic theorem. Let ${\mathcal X}^T := \{ E \in {\mathcal X}: TE = E \hbox{ a.e.} \}$ be the set of essentially shift-invariant sets. One easily verifies that this is a shift-invariant sub-$\sigma$-algebra of ${\mathcal X}$.

Exercise 7. Show that if E lies in ${\mathcal X}^T$, then there exists a set $F \in {\mathcal X}$ which is genuinely invariant (TF=F) and which differs from E only by a set of measure zero. Thus it does not matter whether we deal with shift-invariance or essential shift-invariance here. (More generally, it will not make any significant difference if we modify any of the sets in our $\sigma$-algebras by null sets.) $\diamond$

The relevance of this algebra to the mean ergodic theorem arises from the following identity:

Exercise 8. Show that $L^2( X, {\mathcal X}, \mu)^T = L^2( X, {\mathcal X}^T, \mu)$. $\diamond$

As a corollary of this and Corollary 1, we have

Corollary 3. (Mean ergodic theorem, again) Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system. Then for any $f \in L^2(X, {\mathcal X}, \mu)$, the averages $\frac{1}{N} \sum_{n=0}^{N-1} T^n f$ converge in $L^2$ norm to ${\Bbb E}(f|{\mathcal X}^T)$.

Exercise 9. Show that Corollary 3 continues to hold if $L^2$ is replaced throughout by $L^p$ for any $1 \leq p < \infty$. (Hint: for the case $p<2$, use that $L^2$ is dense in $L^p$. For the case $p>2$, use that $L^\infty$ is dense in $L^p$.) What happens when $p = \infty$? $\diamond$

Let us now give another proof of Corollary 3 (leading to a fourth proof of the mean ergodic theorem). The key here will be the decomposition $f = f_{U^\perp} + f_U$, where $f_{U^\perp} := {\Bbb E}(f|{\mathcal X}^T)$ is the “structured” part of f (at least as far as the mean ergodic theorem is concerned) and $f_U := f - f_{U^\perp}$ is the “random” part. (The subscripts $U^\perp, U$ stand for “anti-uniform” and “uniform” respectively; this notation is not standard.) As $f_{U^\perp}$ is shift-invariant, we clearly have

$\frac{1}{N} \sum_{n=0}^{N-1} T^n f_{U^\perp} = f_{U^\perp}$ (16)

so it suffices to show that

$\| \frac{1}{N} \sum_{n=0}^{N-1} T^n f_U \|_{L^2}^2 \to 0$ (17)
as $N \to \infty$. But we can expand out the left-hand side (using the unitarity of T) as

$\langle F_N, f_U \rangle := \int_X F_N \overline{f_U}\ d\mu$ (18)

where $F_N$ is the dual function of $f_U$, defined as

$F_N := \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} T^{n-m} f_U$. (19)

Now, from the triangle inequality we know that the sequence of dual functions $F_N$ is uniformly bounded in $L^2$ norm, and so by Cauchy-Schwarz we know that the inner products $\langle F_N, f_U \rangle$ are bounded. If they converge to zero, we are done; otherwise, by the Bolzano-Weierstrass theorem, we have $\langle F_{N_j}, f_U \rangle \to c$ for some subsequence $N_j$ and some non-zero c.
(One could also use ultrafilters instead of subsequences here if desired; it makes little difference to the argument.) By the Banach-Alaoglu theorem (or more precisely, the sequential version of this in the separable case), there is a further subsequence $F_{N'_j}$ which converges weakly (or equivalently in this Hilbert space case, in the weak-* sense) to some limit $F_\infty \in L^2(X, {\mathcal X}, \mu)$. Since c is non-zero, $F_\infty$ must also be non-zero. On the other hand, from telescoping series one easily computes that $\| T F_N - F_N\|_{L^2}$ decays like O(1/N) as $N \to \infty$, so on taking limits we have $T F_\infty - F_\infty = 0$. In other words, $F_\infty$ lies in $L^2(X, {\mathcal X}^T, \mu)$.

On the other hand, by construction of $f_U$ we have ${\Bbb E}(f_U|{\mathcal X}^T) = 0$. From (15) and linearity we conclude that ${\Bbb E}(F_N|{\mathcal X}^T) = 0$ for all N, so on taking limits we have ${\Bbb E}(F_\infty|{\mathcal X}^T) = 0$. But since $F_\infty$ is already in $L^2(X, {\mathcal X}^T, \mu)$, we conclude $F_\infty=0$, a contradiction.
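The identity between the squared norm in (17) and the inner product (18) with the dual function (19) can be confirmed exactly in a finite model. The sketch below assumes a permutation system on five points with uniform measure, real-valued functions, and the convention $Tf := f \circ T^{-1}$; the data are illustrative. It uses the observation that in (19) the shift $T^k$ occurs $N - |k|$ times.

```python
from fractions import Fraction

def shift(f, perm, k):
    """T^k f for Tf := f o T^{-1}; negative k applies the inverse shift."""
    g = list(f)
    for _ in range(abs(k)):
        h = [None] * len(g)
        for x, y in enumerate(perm):
            if k > 0:
                h[y] = g[x]
            else:
                h[x] = g[y]
        g = h
    return g

def combo(terms, size):
    """Linear combination sum(c * vec) over (c, vec) pairs."""
    out = [Fraction(0)] * size
    for c, vec in terms:
        out = [o + c * t for o, t in zip(out, vec)]
    return out

def inner(f, g):
    """Inner product for the uniform probability measure (real-valued case)."""
    return sum(p * q for p, q in zip(f, g)) / len(f)

def averages(f_U, perm, N):
    """Return the ergodic average A_N f_U and the dual function F_N of (19)."""
    size = len(f_U)
    A_N = combo([(Fraction(1, N), shift(f_U, perm, n)) for n in range(N)], size)
    # T^{n-m} with 0 <= n, m < N equals T^k exactly N - |k| times.
    F_N = combo([(Fraction(N - abs(k), N * N), shift(f_U, perm, k))
                 for k in range(-(N - 1), N)], size)
    return A_N, F_N

# Illustrative data (assumptions): cycles (0 1 2)(3 4).
perm = [1, 2, 0, 4, 3]
f = [Fraction(c) for c in (5, -1, 2, 3, 7)]
f_struct = [Fraction(2)] * 3 + [Fraction(5)] * 2     # cycle averages of f
f_U = [p - q for p, q in zip(f, f_struct)]

A_N, F_N = averages(f_U, perm, 7)
print(inner(A_N, A_N) == inner(F_N, f_U))
```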

Remark 5. This argument is lengthier than some of the other proofs of the mean ergodic theorem, but it turns out to be fairly robust; it demonstrates (using the compactness properties of certain “dual functions”) that a function $f_U$ with sufficiently strong “mixing” properties (in this case, we require that ${\Bbb E}(f_U | {\mathcal X}^T) = 0$) will cancel itself out when taking suitable ergodic averages, thus reducing the study of averages of f to the study of averages of $f_{U^\perp} = {\Bbb E}(f|{\mathcal X}^T)$. In the modern jargon, this means that ${\mathcal X}^T$ is (the $\sigma$-algebra induced by) a characteristic factor of the ergodic average $f \mapsto \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} T^n f$. We will see further examples of characteristic factors for other averages later in this course. $\diamond$

Exercise 10. Let $(\Gamma,\cdot)$ be a countably infinite discrete group. A Følner sequence is a sequence of increasing finite non-empty sets $F_n$ in $\Gamma$ with $\bigcup_n F_n = \Gamma$ with the property that for any given finite set $S \subset \Gamma$, we have $|(F_n \cdot S) \Delta F_n|/|F_n| \to 0$ as $n \to \infty$, where $F_n \cdot S := \{ fs: f \in F_n, s \in S\}$ is the product set of $F_n$ and S, $|F_n|$ denotes the cardinality of $F_n$, and $\Delta$ denotes symmetric difference. (For instance, in the case $\Gamma = {\Bbb Z}$, the sequence $F_n := \{-n,\ldots,n\}$ is a Følner sequence.) If $\Gamma$ acts (on the left) in a measure-preserving manner on a probability space $(X, {\mathcal X}, \mu)$, and $f \in L^2(X, {\mathcal X}, \mu)$, show that $\frac{1}{|F_n|} \sum_{\gamma \in F_n} f \circ \gamma^{-1}$ converges in $L^2$ to ${\Bbb E}(f|{\mathcal X}^\Gamma)$, where ${\mathcal X}^\Gamma$ is the collection of all measurable sets which are $\Gamma$-invariant modulo null sets, and $f \circ \gamma^{-1}$ is the function $x \mapsto f(\gamma^{-1} x)$. $\diamond$
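The Følner condition for the example $F_n = \{-n,\ldots,n\}$ in ${\Bbb Z}$ can be examined directly. The following sketch writes the group additively and takes one arbitrary finite set S as an assumption; it computes the ratio $|(F_n + S) \Delta F_n|/|F_n|$ exactly for a few values of n.

```python
from fractions import Fraction

def folner_ratio(n, S):
    """|(F_n + S) symmetric-difference F_n| / |F_n| for F_n = {-n, ..., n} in Z."""
    F = set(range(-n, n + 1))
    FS = {x + s for x in F for s in S}      # the sumset F_n + S
    return Fraction(len(FS ^ F), len(F))

S = {-2, 0, 3}                              # arbitrary finite test set (assumption)
ratios = [folner_ratio(n, S) for n in (5, 50, 500)]
print(ratios)
```

For this choice of S the symmetric difference consists of the five boundary points $-n-2, -n-1, n+1, n+2, n+3$, so the ratio is $5/(2n+1)$, which tends to zero as required.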

[Update, Jan 30: exercise corrected, another exercise added.]

[Update, Feb 1: Some corrections.]

[Update, Feb 4: Ergodic averages changed to sum over 0 to N-1 rather than over 1 to N.]

[Update, Feb 11: Discussion of the ergodic theorem in weak topologies (Proposition 2) added.]

[Update, Feb 21: Exercise 10 corrected.]

[Update, Mar 31: Exercise 5 corrected.]

[Update, Jun 14: Some corrections.]