In the previous lecture, we studied the recurrence properties of compact systems, which are systems in which all measurable functions exhibit almost periodicity – they almost return completely to themselves after repeated shifting. Now, we consider the opposite extreme of mixing systems – those in which all measurable functions (of mean zero) exhibit mixing – they become orthogonal to themselves after repeated shifting. (Actually, there are two different types of mixing, strong mixing and weak mixing, depending on whether the orthogonality occurs individually or on the average; it is the latter concept which is of more importance to the task of establishing the Furstenberg recurrence theorem.)

We shall see that for weakly mixing systems, averages such as $\frac{1}{N} \sum_{n=0}^{N-1} T^n f \ldots T^{(k-1)n} f$ can be computed very explicitly (in fact, this average converges to the constant $(\int_X f\ d\mu)^{k-1}$). More generally, we shall see that weakly mixing components of a system tend to average themselves out and thus become irrelevant when studying many types of ergodic averages. Our main tool here will be the humble Cauchy-Schwarz inequality, and in particular a certain consequence of it, known as the van der Corput lemma.

As one application of this theory, we will be able to establish Roth’s theorem (the k=3 case of Szemerédi’s theorem).

— Mixing functions —

Much as compact systems were characterised by their abundance of almost periodic functions, we will characterise mixing systems by their abundance of mixing functions (this is not standard terminology). To define and motivate this concept, it will be convenient to introduce a weak notion of convergence (this notation is also not standard):

Definition 1. (Cesàro convergence) A sequence $c_n$ in a normed vector space is said to converge in the Cesàro sense to a limit c if the averages $\frac{1}{N} \sum_{n=0}^{N-1} c_n$ converge strongly to c, in which case we write $C\!-\!\lim_{n \to \infty} c_n = c$. We also write $C\!-\!\sup_{n \to \infty} c_n := \limsup_{N \to \infty} \|\frac{1}{N} \sum_{n=0}^{N-1} c_n\|$ (thus $C\!-\!\lim_{n \to \infty} c_n = 0$ if and only if $C\!-\!\sup_{n \to \infty} c_n = 0$).

Example 1. The sequence $0,1,0,1,\ldots$ has a Cesàro limit of 1/2. $\diamond$

Exercise 1. Let $c_n$ be a bounded sequence of non-negative numbers. Show that the following three statements are equivalent:

1. $C\!-\!\lim_{n \to \infty} c_n = 0$.
2. $C\!-\!\lim_{n \to \infty} |c_n|^2 = 0$.
3. $c_n$ converges to zero in density. [We say $c_n$ converges in density to c if for any $\varepsilon > 0$, the set $\{n \in {\Bbb N}: |c_n-c| > \varepsilon\}$ has upper density zero.]

Which of the implications between 1, 2, 3 remain valid if $c_n$ is not bounded? $\diamond$Let $(X,{\mathcal X},\mu,T)$ be a measure-preserving system, and let $f \in L^2(X,{\mathcal X},\mu)$ be a function. We consider the correlation coefficients $\langle T^n f, f \rangle := \int_X T^n f \overline{f}\ d\mu$ as n goes to infinity. Note that we have the symmetry $\langle T^n f, f \rangle = \overline{\langle T^{-n} f, f \rangle}$, so we only need to consider the case when n is positive. The mean ergodic theorem (Corollary 2 from Lecture 8) tells us the Cesàro behaviour of these coefficients. Indeed, we have

$C\!-\!\lim_{n \to \infty} \langle T^n f, f \rangle = \langle {\Bbb E}(f|{\mathcal X}^T), f \rangle = \| {\Bbb E}(f|{\mathcal X}^T) \|_{L^2(X, {\mathcal X}, \mu)}^2$ (1)

where ${\mathcal X}^T$ is the $\sigma$-algebra of essentially shift-invariant sets. In particular, if the system is ergodic, and f has mean zero (i.e. $\int_X f\ d\mu = 0$), then we have

$C\!-\!\lim_{n \to \infty} \langle T^n f, f \rangle = 0$, (2)

thus the correlation coefficients go to zero in the Cesàro sense. However, this does not necessarily imply that these coefficients go to zero pointwise. For instance, consider a circle shift system $({\Bbb R}/{\Bbb Z}, x \mapsto x+\alpha)$ with $\alpha$ irrational (and with uniform measure), thus this system is ergodic by Exercise 5 from Lecture 9. Then the function $f(x) := e^{2\pi i x}$ has mean zero, but one easily computes that $\langle T^n f, f \rangle = e^{2\pi i n \alpha}$. The coefficients $e^{2\pi i n \alpha}$ converge in the Cesàro sense to zero, but have magnitude 1 and thus do not converge to zero pointwise.

Definition 2. (Mixing) Let $(X,{\mathcal X},\mu,T)$ be a measure-preserving system. A function $f \in L^2(X,{\mathcal X},\mu)$ is strongly mixing if $\lim_{n \to \infty} \langle T^n f, f \rangle = 0$, and weakly mixing if $C\!-\!\lim_{n \to \infty} |\langle T^n f, f \rangle| = 0$.

Remark 1. Clearly strong mixing implies weak mixing. From (1) we also see that if f is weakly mixing, then ${\Bbb E}(f|{\mathcal X}^T)$ must vanish a.e. $\diamond$

Exercise 2. Show that if f is both almost periodic and weakly mixing, then it must be 0 almost everywhere. In particular, in a compact system, the only weakly mixing function is 0 (up to a.e. equivalence). $\diamond$

Exercise 3. In any Bernoulli system $\Omega^{\Bbb Z}$ with the product $\sigma$-algebra and a product measure, and the standard shift, show that any function of mean zero is strongly mixing. (Hint: first do this for functions that depend on only finitely many of the variables.) $\diamond$

Exercise 4. Consider a skew shift system $( ({\Bbb R}/{\Bbb Z})^2, (x,y) \mapsto (x+\alpha,y+x) )$ with the usual Lebesgue measure and Borel $\sigma$-algebra, and with $\alpha$ irrational. Show that the function $f(x,y) := e^{2\pi i x}$ is neither strongly mixing nor weakly mixing, but that the function $g(x,y) := e^{2\pi i y}$ is both strongly mixing and weakly mixing. $\diamond$

Exercise 5. Let $X := {\Bbb C}^{\Bbb Z}$ be given the product Borel $\sigma$-algebra ${\mathcal X}$ and the shift $T: (z_n)_{n \in {\Bbb Z}} \to (z_{n+1})_{n \in {\Bbb Z}}$. For each $d \geq 1$, let $\mu_d$ be the probability distribution in X of the random sequence $(z_n)_{n \in {\Bbb Z}}$ given by the rule

$z_n := \frac{1}{2^{d/2}} \sum_{\omega_1,\ldots,\omega_d \in \{0,1\}} w_{\omega_1,\ldots,\omega_d} e^{2\pi i \sum_{j=1}^d \omega_j n / 100^j}$, (3)

where the $w_{\omega_1,\ldots,\omega_d}$ are iid standard complex Gaussians (thus each w has probability distribution $e^{-\pi |w|^2}\ dw$). Show that each $\mu_d$ is shift invariant. If $\mu$ is a vague limit point of the sequence $\mu_d$, and $f: X \to {\Bbb C}$ is the function defined as $f( (z_n)_{n \in {\Bbb Z}} ) := \hbox{sgn}( \hbox{Re} z_0 )$, show that f is weakly mixing but not strongly mixing (and more specifically, that $\langle T^{100^j} f, f \rangle$ stays bounded away from zero) with respect to the system $(X, {\mathcal X}, \mu, T)$. $\diamond$

Remark 2. Exercise 5 illustrates an important point, namely that stationary processes yield a rich source of measure-preserving systems (indeed the two notions are almost equivalent in some sense, especially after one distinguishes a specific function f on the measure-preserving system). However, we will not adopt this more probabilistic perspective to ergodic theory here. $\diamond$

Remark 3. We briefly discuss the finitary analogue of the weak mixing concept in the context of functions $f: {\Bbb Z}/N{\Bbb Z} \to {\Bbb C}$ on a large cyclic group ${\Bbb Z}/N{\Bbb Z}$ with the usual shift $x \mapsto x+1$. Then one can compute

$C\!-\!\lim_{n \to \infty} |\langle T^n f, f \rangle|^2 = \sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} |\hat f(\xi)|^4$ (4)

where $\hat f(\xi) := \frac{1}{N} \sum_{x \in {\Bbb Z}/N{\Bbb Z}} f(x) e^{-2\pi i x \xi/N}$ are the Fourier coefficients of f. Comparing this against the Plancherel identity $\| f \|_{L^2}^2 = \sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} |\hat f(\xi)|^2$ we thus see that a function f bounded in $L^2$ norm should be considered “weakly mixing” if it has no large Fourier coefficients. Contrast this with Remark 7 from Lecture 11. $\diamond$

Now let us see some consequences of the weak mixing property. We need the following lemma, which gives a useful criterion as to whether a sequence of bounded vectors in a Hilbert space converges in the Cesàro sense to zero.

Lemma 1 (van der Corput lemma). Let $v_1, v_2, v_3, \ldots$ be a bounded sequence of vectors in a Hilbert space H. If

$C\!-\!\lim_{h\to \infty} C\!-\!\sup_{n \to \infty} \langle v_n, v_{n+h} \rangle = 0$ (5)

then $C\!-\!\lim_{n \to \infty} v_n = 0$.

Informally, this lemma asserts that if each vector in a bounded sequence tends to be orthogonal to nearby elements in that sequence, then the vectors will converge to zero in the Cesàro sense. This formulation of the lemma is essentially the version in this paper by Bergelson, except that we have made the minor change of replacing one of the Cesàro limits with a Cesàro supremum.

Proof. We can normalise so that $\|v_n\| \leq 1$ for all n. In particular, we have $v_n = O(1)$, where O(1) denotes a vector of bounded magnitude. For any h and $N \geq 1$, we thus have the telescoping identity

$\frac{1}{N} \sum_{n=0}^{N-1} v_{n+h} = \frac{1}{N} \sum_{n=0}^{N-1} v_n + O( |h| / N );$ (6)

averaging this over all h from 0 to H-1 for some $H \geq 1$, we obtain

$\frac{1}{N} \sum_{n=0}^{N-1} \frac{1}{H} \sum_{h=0}^{H-1} v_{n+h} = \frac{1}{N} \sum_{n=0}^{N-1} v_n + O( H / N );$ (7)

by the triangle inequality we thus have

$\| \frac{1}{N} \sum_{n=0}^{N-1} v_n \| \leq \frac{1}{N} \sum_{n=0}^{N-1} \| \frac{1}{H} \sum_{h=0}^{H-1} v_{n+h} \| + O( H/N )$ (8)

where the O() terms are now scalars rather than vectors. We square this (using the crude inequality $(a+b)^2 \leq 2a^2 + 2b^2$) and apply Cauchy-Schwarz to obtain

$\| \frac{1}{N} \sum_{n=0}^{N-1} v_n \|^2 \leq O( \frac{1}{N} \sum_{n=0}^{N-1} \| \frac{1}{H} \sum_{h=0}^{H-1} v_{n+h} \|^2 ) + O( H^2/N^2 )$ (9)

which we rearrange as

$\| \frac{1}{N} \sum_{n=0}^{N-1} v_n \|^2 \leq O( \frac{1}{H^2} \sum_{0 \leq h,h' < H} \frac{1}{N} \sum_{n=0}^{N-1} \langle v_{n+h}, v_{n+h'} \rangle ) + O( H^2/N^2 )$. (10)

We take limits as $N \to \infty$ (keeping H fixed for now) to conclude

$\limsup_{N \to \infty} \| \frac{1}{N} \sum_{n=0}^{N-1} v_n \|^2$

$\leq O( \frac{1}{H^2} \sum_{0 \leq h,h' < H} C\!-\!\sup_{n \to \infty} \langle v_{n+h}, v_{n+h'} \rangle )$. (11)

Another telescoping argument (and symmetry) gives us

$C\!-\!\sup_{n \to \infty} \langle v_{n+h}, v_{n+h'} \rangle = C\!-\!\sup_{n \to \infty} \langle v_{n+|h-h'|}, v_n \rangle$ (12)

and so

$\limsup_{N \to \infty} \| \frac{1}{N} \sum_{n=0}^{N-1} v_n \|^2 \leq O( \frac{1}{H} \sum_{0 \leq h < H} C\!-\!\sup_{n \to \infty} \langle v_{n+h}, v_{n} \rangle )$. (13)

Taking limits as $H \to \infty$ and using (5) we obtain the claim. $\Box$

Exercise 6. Let $P: {\Bbb Z} \to {\Bbb R}/{\Bbb Z}$ be a polynomial with at least one irrational non-constant coefficient. Using Lemma 1 (in the scalar case $H = {\Bbb C}$) and an induction on degree, show that $C\!-\!\lim_{n \to \infty} e^{2\pi i P(n)} = 0$. Conclude that the sequence $(P(n))_{n \in {\Bbb N}}$ is uniformly distributed with respect to uniform measure (see Lecture 9 for a definition of uniform distribution). $\diamond$

Exercise 7. Using Exercise 6, give another proof of Theorem 1 from Lecture 6. $\diamond$

We now apply the van der Corput lemma to weakly mixing functions.

Corollary 1. Let $(X,{\mathcal X},\mu,T)$ be a measure-preserving system, and let $f \in L^2(X, {\mathcal X},\mu)$ be weakly mixing. Then for any $g \in L^2(X, {\mathcal X},\mu)$ we have $C\!-\!\lim_{n \to \infty} |\langle T^n f, g \rangle| = 0$ and $C\!-\!\lim_{n \to \infty} |\langle f, T^n g \rangle| = 0$.

Proof. We just prove the first claim, as the second claim is similar. By Exercise 1, it suffices to show that

$\frac{1}{N} \sum_{n=0}^{N-1} |\langle T^n f, g \rangle|^2 \to 0$ (14)

as $N \to \infty$. The left-hand side can be rewritten as

$\langle \frac{1}{N} \sum_{n=0}^{N-1} \langle g, T^n f \rangle T^n f, g \rangle$ (15)

so by Cauchy-Schwarz it suffices to show that

$C\!-\!\lim_{N \to \infty} \langle g, T^n f \rangle T^n f = 0.$ (16)

Applying the van der Corput lemma and discarding the bounded coefficients $\langle g, T^n f \rangle$, it suffices to show that

$C\!-\!\lim_{H \to \infty} C\!-\!\sup_{n \to \infty} |\langle T^{n+h} f, T^n f \rangle| = 0.$ (17)

But $\langle T^{n+h} f, T^n f \rangle = \langle T^h f, f \rangle$, and the claim now follows from the weakly mixing nature of f. $\Box$

— Weakly mixing systems —

Now we consider systems which are full of mixing functions.

Definition 3. (Mixing systems) A measure-preserving system $(X, {\mathcal X}, \mu, T)$ is weakly mixing (resp. strongly mixing) if every function $f \in L^2(X, {\mathcal X}, \mu)$ with mean zero is weakly mixing (resp. strongly mixing).

Example 2. From Exercise 2, we know that any system with a non-trivial Kronecker factor is not weakly mixing (and thus not strongly mixing). On the other hand, from Exercise 3, we know that any Bernoulli system is strongly mixing (and thus weakly mixing also). From Remark 1 we see that any strongly or weakly mixing system must be ergodic. $\diamond$

Exercise 8. Show that the system in Exercise 5 is weakly mixing but not strongly mixing. $\diamond$

Here is another characterisation of weak mixing:

Exercise 9. Let $(X, {\mathcal X}, \mu, T)$ be a measure preserving system. Show that the following are equivalent:

1. $(X, {\mathcal X}, \mu, T)$ is weakly mixing.
2. For every $f, g \in L^2(X, {\mathcal X}, \mu)$, $\langle T^n f, g \rangle$ converges in density to $(\int_X f\ d\mu) (\int_X \overline{g}\ d\mu)$. (See Exercise 1 for a definition of convergence in density.)
3. For any measurable $E, F$, $\mu( T^n E \cap F )$ converges in density to $\mu(E) \mu(F)$.
4. The product system $(X \times X, {\mathcal X} \times {\mathcal X}, \mu \times \mu, T \times T)$ is ergodic.

[Hints: To equate 1 and 2, use the decomposition $f = (f-\int_X f\ d\mu) + \int_X f\ d\mu$ of a function into its mean and mean-free components. To equate 2 and 4, use the fact that the space $L^2( X \times X, {\mathcal X} \times {\mathcal X}, \mu \times \mu)$ is spanned (in the topological vector space sense) by tensor products $(x,y) \mapsto f(x) g(y)$ with $f, g \in L^2(X, {\mathcal X}, \mu)$.] $\diamond$

Exercise 10. Show that the equivalences between 1, 2, 3 in Exercise 9 remain if “weak mixing” and “converges in density” are replaced by “strong mixing” and “converges” respectively. $\diamond$

Exercise 11. Let $(X, {\mathcal F}, T)$ be any minimal topological system with Borel $\sigma$-algebra ${\mathcal B}$, and let $\mu$ be a shift invariant Borel probability measure. Show that if $(X, {\mathcal B},\mu, T)$ is weakly mixing (resp. strongly mixing), then $(X, {\mathcal F}, T)$ is topologically weakly mixing (resp. topologically mixing), as defined in Definition 3 and Exercise 12 of Lecture 7. $\diamond$

Exercise 12. If $(X, {\mathcal X}, \mu, T)$ is weakly mixing, show that $(X, {\mathcal X}, \mu, T^n)$ is weakly mixing for any non-zero n. $\diamond$

Exercise 13. Let $(X, {\mathcal X}, \mu, T)$ be a measure preserving system. Show that the following are equivalent:

1. $(X, {\mathcal X}, \mu, T)$ is weakly mixing.
2. Whenever $(Y, {\mathcal Y}, \nu, S)$ is ergodic, the product system $(X \times Y, {\mathcal X} \times {\mathcal Y}, \mu \times \nu, T \times S)$ is ergodic.

(Hint: To obtain 1 from 2, use Exercise 9. To obtain 2 from 1, repeat the methods used to prove Exercise 9.) $\diamond$

Exercise 14. Show that the product of two weakly mixing systems is again weakly mixing. (Hint: use Exercises 9 and 13.) $\diamond$

Now we come to an important type of observation for the purposes of establishing the Furstenberg recurrence theorem: in weakly mixing systems, functions of mean zero are negligible as far as multiple averages are concerned.

Proposition 1. Let $a_1, \ldots, a_k \in {\Bbb Z}$ be distinct non-zero integers for some $k \geq 1$. Let $(X, {\mathcal X}, \mu, T)$ be weakly mixing, and let $f_1, \ldots, f_k \in L^\infty(X, {\mathcal X}, \mu)$ be such that at least one of $f_1,\ldots,f_k$ has mean zero. Then we have

$C\!-\!\lim_{n \to \infty} T^{a_1 n} f_1 \ldots T^{a_k n} f_k = 0$ (18)

in $L^2(X, {\mathcal X}, \mu)$.

Proof. We induct on k. When k=1 the claim follows from the mean ergodic theorem and Exercise 12 (recall from Example 2 that all weakly mixing systems are ergodic).

Now let $k \geq 2$ and suppose that the claim has already been proven for k-1. Without loss of generality we may assume that it is $f_1$ which has mean zero. Applying the van der Corput lemma (Lemma 1), it suffices to show that

$C\!-\!\sup_{n \to \infty} \langle T^{a_1 (n+h)} f_1 \ldots T^{a_k (n+h)} f_k, T^{a_1 n} f_1 \ldots T^{a_k n} f_k \rangle$ (19)

converges in density to zero as $h \to \infty$. But the left-hand side can be rearranged as

$C\!-\!\sup_{n \to \infty} \int_X T^{(a_1-a_k)n} f_{1,h} \ldots T^{(a_{k-1}-a_k)n} f_{k-1,h} f_{k,h}\ d\mu$ (20)

where $f_{j,h} := T^{a_j h} f_j \overline{f_j}$. Applying Cauchy-Schwarz, it suffices to show that

$C\!-\!\sup_{n \to \infty} T^{(a_1-a_k)n} f_{1,h} \ldots T^{(a_{k-1}-a_k)n} f_{k-1,h}$ (21)

converges in density to zero as $h \to \infty$.

Since $(X, {\mathcal X}, \mu, T)$ is weakly mixing, the mean-zero function $f_1$ is weakly mixing, and so the mean of $f_{1,h}$ goes to zero in density as $h \to \infty$. As all functions are assumed to be bounded, we can thus subtract the mean from $f_{1,h}$ in (21) without affecting the desired conclusion, leaving behind the mean-zero component $f_{1,h} - \int_X f_{1,h}\ d\mu$. But then the contribution of this expression to (21) vanishes by the induction hypothesis. $\Box$

Remark 4. The key point here was that functions f of mean zero were weakly mixing and thus had the property that $T^h f \overline{f}$ almost had mean zero, and were thus almost weakly mixing. One could iterate this further to investigate the behaviour of “higher derivatives” of f such as $T^{h+h'} f \overline{T^h f} \overline{T^{h'} f} f$. Pursuing this analysis further leads to the Gowers-Host-Kra seminorms, which are closely related to the Gowers uniformity norms in additive combinatorics. $\diamond$

Corollary 2. Let $a_1, \ldots, a_k \in {\Bbb Z}$ be distinct integers for some $k \geq 1$, let $(X, {\mathcal X}, \mu, T)$ be a weakly mixing system, and let $f_1,\ldots,f_k \in L^\infty(X, {\mathcal X}, \mu)$. Then $\int_X T^{a_1 n} f_1 \ldots T^{a_k n} f_k\ d\mu$ converges in the Cesáro sense to $(\int_X f_1\ d\mu) \ldots (\int_X f_k\ d\mu)$.

Note in particular that this establishes the Furstenberg recurrence theorem (Theorem 1 from Lecture 11) in the case of weakly mixing systems.

Proof. We again induct on k. The k=1 case is trivial, so suppose $k > 1$ and the claim has already been proven for k-1. If any of the functions $f_j$ is constant then the claim follows from the induction hypothesis, so we may subtract off the mean from each function and suppose that all functions have mean zero. By shift-invariance we may also fix $a_k$ (say) to be zero. The claim now follows from Proposition 1 and Cauchy-Schwarz. $\Box$

Exercise 15. Show that the Cesáro convergence in Corollary 2 can be strengthened to convergence in density. (Hint: first reduce to the mean zero case, then apply Exercise 14 to work with the product system instead.) $\diamond$

Exercise 16. Let $(X, {\mathcal X}, \mu, T)$ be a weakly mixing system, and let $f \in L^\infty(X, {\mathcal X}, \mu)$ have mean zero. Show that $T^{n^2} f$ converges in the Cesáro sense in $L^2(X, {\mathcal X}, \mu)$ to zero. (Hint: use van der Corput and Proposition 1 or Corollary 2.) $\diamond$

Exercise 17. Show that Corollary 2 continues to hold if the linear polynomials $a_1 n, \ldots, a_k n$ are replaced by arbitrary polynomials $P_1(n),\ldots,P_k(n)$ from the integers to the integers, so long as the difference between any two of these polynomials is non-constant. (Hint: you will need the “PET induction” machinery from Exercise 3 of Lecture 5. This result was first established by Bergelson.) $\diamond$

— Hilbert-Schmidt operators —

We have now established the Furstenberg recurrence theorem for two distinct types of systems: compact systems and weakly mixing systems. From Example 2 we know that these systems are indeed quite distinct from each other. Here is another indication of “distinctness”:

Exercise 18. In any measure-preserving system $(X, {\mathcal X}, \mu, T)$, show that almost periodic functions and weakly mixing functions are always orthogonal to each other. $\diamond$

On the other hand, there are certainly systems which are neither weakly mixing nor compact (e.g. the skew shift). But we have the following important dichotomy (cf. Theorem 3 from Lecture 7):

Theorem 1. Suppose that $(X, {\mathcal X}, \mu, T)$ is a measure-preserving system. Then exactly one of the following statements is true:

1. (Structure) $(X, {\mathcal X}, \mu, T)$ has a non-trivial compact factor.
2. (Randomness) $(X, {\mathcal X}, \mu, T)$ is weakly mixing.

[Note: in ergodic theory, a factor of a measure-preserving system is simply a morphism from that system to some other measure-preserving system. Unlike the case with topological dynamics, we do not need to assume surjectivity of the morphism, since in the measure-theoretic setting, the image of a morphism always has full measure.]

In Example 2 we have already shown that 1 and 2 cannot be both true; the tricky part is to show that lack of weak mixing implies a non-trivial compact factor.

In order to prove this result, we recall some standard results about Hilbert-Schmidt operators on a separable Hilbert space. (As usual, the hypothesis of separability is not absolutely essential, but is convenient to assume throughout; for instance, it assures that orthonormal bases always exist and are at most countable.) We begin by recalling the notion of a tensor product of two Hilbert spaces:

Proposition 2. Let $H, H'$ be two separable Hilbert spaces. Then there exists another separable Hilbert space $H \otimes H'$ and a bilinear tensor product map $\otimes: H \times H' \to H \otimes H'$ such that

$\langle v \otimes v', w \otimes w' \rangle_{H \otimes H'} = \langle v, w \rangle_H \langle v', w' \rangle_{H'}$ (22)

for all $v, w \in H$ and $v', w' \in H'$. Furthermore, the tensor products $(e_n \otimes e'_{n'})_{n \in A, n' \in A'}$ between any orthonormal bases $(e_n)_{n \in A}$, $(e'_{n'})_{n' \in A'}$ of H and H’ respectively, form an orthonormal basis of $H \otimes H'$.

It is easy to see that $H \otimes H'$ is unique up to isomorphism, and so we shall abuse notation slightly and refer to $H \otimes H'$ as the tensor product of H and H’.

Example 3. The tensor product of $L^2(X,{\mathcal X},\mu)$ and $L^2(Y, {\mathcal Y},\nu)$ is $L^2(X \times Y, {\mathcal X} \times {\mathcal Y}, \mu \times \nu)$, with the tensor product operation $f \otimes g(x,y) := f(x) g(y)$. The tensor product of ${}{\Bbb C}^m$ and ${\Bbb C}^n$ is ${\Bbb C}^{n \times m}$, which can be thought of as the Hilbert space of $n \times m$ (or $m \times n$) matrices, with the inner product $\langle A, B \rangle := \hbox{tr}(A B^\dagger) = \hbox{tr}(A^\dagger B)$. $\diamond$

Proof. Take any orthonormal bases $(e_n)_{n \in A}$ and $(e'_{n'})_{n' \in A'}$ of H and H’ respectively, and let $H \otimes H'$ be the Hilbert space generated by declaring the formal quantities $e_n \otimes e'_{n'}$ to be an orthonormal basis. If one then defines

$(\sum_n c_n e_n) \otimes (\sum_{n'} c'_{n'} e_{n'}) := \sum_n \sum_{n'} c_n c'_{n'} e_n \otimes e_{n'}$ (23)

for all square-summable sequences $c_n$ and $c'_{n'}$, one easily verifies that $\otimes$ is indeed a bilinear map that obeys (22). in particular, if $(f_m)_{m \in B}$ and $(f'_{m'})_{m' \in B'}$ are some other orthonormal bases of $H, H'$ respectively, then from (22) $(f_m \otimes f'_{m'})_{m \in B, m' \in B'}$ is an orthonormal set, and one can approximate any element $e_n \otimes e'_{n'}$ in the original orthonormal basis to arbitrary accuracy by linear combinations from this orthonormal set, and so this set is in fact an orthonormal basis as required. $\Box$

Given a Hilbert space H, define its complex conjugate $\overline{H}$ to be the same set as H, but with the conjugated scalar multiplication structure $z, v \mapsto \overline{z} v$ and the conjugated inner product $\langle z, w \rangle_{\overline H} := \overline{\langle z, w \rangle_H} = \langle w, z \rangle_H$, but with all other structures unchanged. This is also a Hilbert space. (Of course, for real Hilbert spaces rather than complex, the notion of complex conjugation is trivial.)

Example 4. The conjugation map $f \mapsto \overline{f}$ is a Hilbert space isometry between the Hilbert space $L^2(X, {\mathcal X}, \mu)$ and its complex conjugate. $\diamond$

Every element $K \in \overline{H} \otimes H'$ induces a bounded linear operator $T_K: H \to H'$, defined via duality by the formula

$\langle T_K v, v' \rangle_{H'} := \langle K, v \otimes v' \rangle$ (24)

for all $v \in H, v' \in H'$. We refer to K as the kernel of $T_K$. Any operator $T = T_K$ that arises in this manner is called a Hilbert-Schmidt operator from H to H’. The Hilbert space structure on the space $\overline{H} \otimes H'$ of kernels induces an analogous Hilbert space structure on the Hilbert-Schmidt operators, leading to the Hilbert-Schmidt norm $\|T\|_{HS}$ and inner product $\langle S, T \rangle_{HS}$ for such operators. Here are some other characterisations of this concept:

Exercise 19. Let $H, H'$ be Hilbert spaces with orthonormal bases $(e_n)_{n \in A}$ and $(e'_{n'})_{n' \in A'}$ respectively, and let $T: H \to H'$ be a bounded linear operator. Show that the following are equivalent:

1. T is a Hilbert-Schmidt operator.
2. $\sum_{n \in A} \| T e_n \|_{H'}^2 < \infty$.
3. $\sum_{n \in A} \sum_{n' \in A'} |\langle T e_n, e'_{n'} \rangle_{H'}|^2 < \infty$.

Also, show that if $T, S: H \to H'$ are Hilbert-Schmidt operators, then

$\langle T, S \rangle_{HS} = \sum_{n \in A} \langle T e_n, S e_n \rangle_{H'}$ (25)

and

$\|T\|_{HS}^2 = \sum_{n \in A} \|T e_n\|_{H'}^2 = \sum_{n \in A} \sum_{n' \in A'} |\langle T e_n, e'_{n'} \rangle_{H'}|^2$. (26) $\diamond$

As one consequence of the above exercise, we see that the Hilbert-Schmidt norm controls the operator norm, thus $\|Tv\| \leq \| T\|_{HS} \|v\|$ for all vectors v.

Remark 5. From this exercise and Fatou’s lemma, we see in particular that the limit (in either the norm, strong or weak operator topologies) of a sequence of Hilbert-Schmidt operators with uniformly bounded Hilbert-Schmidt norm, is still Hilbert-Schmidt. We also see that the composition of a Hilbert-Schmidt operator with a bounded operator is still Hilbert-Schmidt (thus the Hilbert-Schmidt operators can be viewed as a closed two-sided ideal in the space of bounded operators). $\diamond$

Example 5. An operator $T: L^2(X, {\mathcal X}, \mu) \to L^2(Y, {\mathcal Y},\nu)$ is Hilbert-Schmidt if and only if it takes the form $Tf(y) := \int_X K(x,y) f(x)\ d\mu(x)$ for some kernel $K \in L^2(X \times Y, {\mathcal X} \times {\mathcal Y}, \mu \times \nu)$, in which case the Hilbert-Schmidt norm is $\| K \|_{L^2(X \times Y, {\mathcal X} \times {\mathcal Y}, \mu \times \nu)}$. The Hilbert-Schmidt inner product is defined similarly. $\diamond$

Example 6. The identity operator on an infinite-dimensional Hilbert space is never Hilbert-Schmidt, despite being bounded. On the other hand, every finite rank operator is Hilbert-Schmidt. $\diamond$

One of the key properties of Hilbert-Schmidt operators which will be relevant to us is the following.

Lemma 2. If $T: H \to H'$ is Hilbert-Schmidt, then it is compact (i.e. the image of any bounded set is precompact).

Proof. Let $\varepsilon > 0$ be arbitrary. By Exercise 19 and monotone convergence, we can find a finite orthonormal set $e_1,\ldots,e_N$ such that $\sum_{n=1}^N \| Te_n\|_{H'}^2 \geq \|T\|_{HS}^2 - \varepsilon^2$, and in particular that $\|Te_{n+1}\|_{H'} \leq \varepsilon$ for any $e_{n+1}$ orthogonal to $e_1,\ldots,e_n$. As a consequence, the image of the unit ball of H under T lies within $\varepsilon$ of the image of the unit ball of the finite-dimensional space $\hbox{span}(e_1,\ldots,e_N)$. This image is therefore totally bounded and thus precompact. $\Box$

The following exercise may help illuminate the distinction between bounded operators, Hilbert-Schmidt operators, and compact operators:

Exercise 20. Let $\lambda_n$ be a sequence of complex numbers, and consider the diagonal operator $T: (z_n)_{n \in {\Bbb N}} \mapsto (\lambda_n z_n)_{n \in {\Bbb N}}$ on $l^2({\Bbb N})$.

1. Show that T is a well-defined bounded linear operator on $l^2({\Bbb N})$ if and only if the sequence $(\lambda_n)$ is bounded.
2. Show that T is Hilbert-Schmidt if and only if the sequence $(\lambda_n)$ is square-summable.
3. Show that T is compact if and only if the sequence $(\lambda_n)$ goes to zero as $n \to \infty$. $\diamond$

Now we apply the above theory to establish Theorem 1. Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system, and let $f \in L^2(X, {\mathcal X}, \mu)$. The rank one operators $g \mapsto \langle g, T^n f \rangle T^n f$ can easily be verified to have a Hilbert-Schmidt norm of $\|f\|_{L^2}^2$, and so by the triangle inequality, their averages $S_{f,N}: g \mapsto \frac{1}{N} \sum_{n=0}^{N-1} \langle g, T^n f \rangle T^n f$ have a Hilbert-Schmidt norm of at most $\|f\|_{L^2}^2$. On the other hand, from the identity

$\langle S_{f,N} g, h \rangle = \frac{1}{N} \sum_{n=0}^{N-1} \langle g \otimes \overline{h}, (T \otimes T)^n (f \otimes \overline{f}) \rangle$ (27)

and the mean ergodic theorem (applied to the product space) we see that $S_{f,N}$ converges in the weak operator norm to some limit $S_f$, which is then also Hilbert-Schmidt by Remark 5, and thus compact by Lemma 2. (Actually, $S_{f,N}$ converges to $S_f$ in the Hilbert-Schmidt norm, and thus also in the operator norm and in the strong topology: this is another application of the mean ergodic theorem, which we leave as an exercise.  Since each of the $S_{f,N}$ is clearly finite rank, this gives a direct proof of the compactness of $S_f$.) Also, it is easy to see that $S_f$ is self-adjoint and commutes with T. As a consequence, we conclude that for any $g \in L^2(X, {\mathcal X}, \mu)$, the image $S_f g$ is almost periodic (since $\{ T^n S_f g: n \in {\Bbb Z} \} = S_f \{ T^n g: n \in {\Bbb Z} \}$ is the image of a bounded set by the compact operator $S_f$ and therefore precompact).

On the other hand, observe that

$\langle S_f f, f \rangle = C\!-\!\lim_{n \to \infty} |\langle T^n f, f \rangle|^2$. (28)

Thus by Definition 2 (and Exercise 1), we see that $\langle S_f f, f \rangle \neq 0$ whenever f is not weakly mixing. In particular, f is not orthogonal to the almost periodic function $S_f f$. From this and Exercise 18, we have thus shown

Proposition 3. (Dichotomy between structure and randomness) Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system. A function $f \in L^2(X, {\mathcal X}, \mu)$ is weakly mixing if and only if it is orthogonal to all almost periodic functions (or equivalently, orthogonal to all eigenfunctions).

Remark 6. Interestingly, essentially the same result appears in the spectral and scattering theory of linear Schrödinger equations, which in that context is known as the “RAGE theorem” (after Ruelle, Amrein-Georgescu, and Enss). $\diamond$

Remark 7. The finitary analogue of the expression $S_f f$ is the dual function (of order 2) of f (the dual function of order 1 was briefly discussed in Lecture 8). If we are working on ${\Bbb Z}/N{\Bbb Z}$ with the usual shift, then $S_f$ can be viewed as a Fourier multiplier which multiplies the Fourier coefficient at $\xi$ by $|\hat f(\xi)|^2$; informally, $S_f$ filters out all the low amplitude frequencies of f, leaving only a handful of high-amplitude frequencies. $\diamond$

Recall from Proposition 2 and Exercise 5 of Lecture 11 that a function $f \in L^2(X, {\mathcal X}, \mu)$ is almost periodic if and only if it is ${\mathcal Z}_1$-measurable, or if it lies in the pure point component ${\Bbb H}_{pp}$ of the shift operator T. We thus have

Corollary 3. (Koopman-von Neumann theorem) Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system, and let $f \in L^2(X, {\mathcal X}, \mu)$. Let ${\mathcal Z}_1$ be the $\sigma$-algebra generated by the eigenfunctions of T.

1. f is almost periodic if and only if $f \in L^2(X, {\mathcal Z}_1, \mu)$ if and only if $f \in {\Bbb H}_{pp}$.
2. f is weakly mixing if and only if ${\Bbb E}(f|{\mathcal Z}_1) = 0$ a.e. if and only if $f \in {\Bbb H}_c = {\Bbb H}_{sc} + {\Bbb H}_{ac}$ (corresponding to the continuous spectrum of T).
3. In general, f has a unique decomposition $f = f_{U^\perp} + f_U$ into an almost periodic function $f_{U^\perp}$ and a weakly mixing function $f_U$. Indeed, $f_{U^\perp} = {\Bbb E}(f|{\mathcal Z}_1)$ and $f_U = f - {\Bbb E}(f|{\mathcal Z}_1)$.

Theorem 1 follows immediately from this Corollary. Indeed, if a system is not weakly mixing, then by the above Corollary we see that ${\mathcal Z}_1$ is non-trivial, and the identity map from $(X, {\mathcal X}, \mu, T)$ to $(X, {\mathcal Z}_1, \mu, T)$ yields a non-trivial compact factor.

— Roth’s theorem —

As a quick application of the above machinery we give a proof of Roth’s theorem. We first need a variant of Corollary 2, which is proven by much the same means:

Exercise 21. Let $(X, {\mathcal X}, \mu, T)$ be an ergodic measure-preserving system, let $a_1, a_2, a_3$ be distinct integers, and let $f_1, f_2, f_3 \in L^\infty(X, {\mathcal X}, \mu)$ with at least one of $f_1, f_2, f_3$ weakly mixing. Show that $C\!-\!\lim_{n \to \infty} \int_X T^{a_1 n} f_1 T^{a_2 n} f_2 T^{a_3 n} f_3\ d\mu = 0$. $\diamond$

Theorem 2 (Roth’s theorem). Let $(X, {\mathcal X}, \mu, T)$ be an ergodic measure-preserving system, and let $f \in L^\infty(X, {\mathcal X}, \mu)$ be non-negative with $\int_X f\ d\mu > 0$. Then

$\liminf_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \int_X f T^n f T^{2n} f\ d\mu > 0$. (29)

Proof. We decompose $f = f_{U^\perp} + f_U$ as in Corollary 3. The contribution of $f_U$ is negligible by Exercise 21, so it suffices to show that

$\liminf_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \int_X f_{U^\perp} T^n f_{U^\perp} T^{2n} f_{U^\perp}\ d\mu > 0$. (30)

But as $f_{U^\perp}$ is almost periodic, the claim follows from Proposition 1 of Lecture 11. $\Box$

One can then immediately establish the k=3 case of Furstenberg’s theorem (Theorem 2 from Lecture 10) by combining the above result with the ergodic decomposition (Proposition 4 from Lecture 9). The k=3 case of Szemerèdi’s theorem (i.e. Roth’s theorem) then follows from the Furstenberg correspondence principle (see Lecture 10).

Exercise 22. Let $(X, {\mathcal X}, \mu, T)$ be a measure-preserving system, and let $f \in L^2(X, {\mathcal X}, \mu)$ be non-negative. Show that for every $\varepsilon > 0$, one has $\langle T^n f, f \rangle \geq \int_X (f\ d\mu)^2 - \varepsilon$ for infinitely many n. (Hint: first show this when f is almost periodic, and then use Corollary 1 and Corollary 3 to prove the general case.) This is a simplified version of the Khintchine recurrence theorem, which asserts that the set of such n is not only infinite, but is also syndetic. Analogues of the Khintchine recurrence theorem hold for double recurrence but not for triple recurrence; see this paper of Bergelson, Host, and Kra for details. $\diamond$

[Update, Mar 3: Added the observation that as $S_{f,N}$ converges to $S_f$ in the operator norm, $S_f$ is the limit of finite rank operators and thus clearly compact.  Thanks to Guatam Zubin for this remark.]