You are currently browsing the category archive for the ‘math.DS’ category.

Tamar Ziegler and I have just uploaded to the arXiv two related papers: “Concatenation theorems for anti-Gowers-uniform functions and Host-Kra characteoristic factors” and “polynomial patterns in primes“, with the former developing a “quantitative Bessel inequality” for local Gowers norms that is crucial in the latter.

We use the term “concatenation theorem” to denote results in which structural control of a function in two or more “directions” can be “concatenated” into structural control in a joint direction. A trivial example of such a concatenation theorem is the following: if a function ${f: {\bf Z} \times {\bf Z} \rightarrow {\bf R}}$ is constant in the first variable (thus ${x \mapsto f(x,y)}$ is constant for each ${y}$), and also constant in the second variable (thus ${y \mapsto f(x,y)}$ is constant for each ${x}$), then it is constant in the joint variable ${(x,y)}$. A slightly less trivial example: if a function ${f: {\bf Z} \times {\bf Z} \rightarrow {\bf R}}$ is affine-linear in the first variable (thus, for each ${y}$, there exist ${\alpha(y), \beta(y)}$ such that ${f(x,y) = \alpha(y) x + \beta(y)}$ for all ${x}$) and affine-linear in the second variable (thus, for each ${x}$, there exist ${\gamma(x), \delta(x)}$ such that ${f(x,y) = \gamma(x)y + \delta(x)}$ for all ${y}$) then ${f}$ is a quadratic polynomial in ${x,y}$; in fact it must take the form

$\displaystyle f(x,y) = \epsilon xy + \zeta x + \eta y + \theta \ \ \ \ \ (1)$

for some real numbers ${\epsilon, \zeta, \eta, \theta}$. (This can be seen for instance by using the affine linearity in ${y}$ to show that the coefficients ${\alpha(y), \beta(y)}$ are also affine linear.)

The same phenomenon extends to higher degree polynomials. Given a function ${f: G \rightarrow K}$ from one additive group ${G}$ to another, we say that ${f}$ is of degree less than ${d}$ along a subgroup ${H}$ of ${G}$ if all the ${d}$-fold iterated differences of ${f}$ along directions in ${H}$ vanish, that is to say

$\displaystyle \partial_{h_1} \dots \partial_{h_d} f(x) = 0$

for all ${x \in G}$ and ${h_1,\dots,h_d \in H}$, where ${\partial_h}$ is the difference operator

$\displaystyle \partial_h f(x) := f(x+h) - f(x).$

(We adopt the convention that the only ${f}$ of degree less than ${0}$ is the zero function.)

We then have the following simple proposition:

Proposition 1 (Concatenation of polynomiality) Let ${f: G \rightarrow K}$ be of degree less than ${d_1}$ along one subgroup ${H_1}$ of ${G}$, and of degree less than ${d_2}$ along another subgroup ${H_2}$ of ${G}$, for some ${d_1,d_2 \geq 1}$. Then ${f}$ is of degree less than ${d_1+d_2-1}$ along the subgroup ${H_1+H_2}$ of ${G}$.

Note the previous example was basically the case when ${G = {\bf Z} \times {\bf Z}}$, ${H_1 = {\bf Z} \times \{0\}}$, ${H_2 = \{0\} \times {\bf Z}}$, ${K = {\bf R}}$, and ${d_1=d_2=2}$.

Proof: The claim is trivial for ${d_1=1}$ or ${d_2=1}$ (in which ${f}$ is constant along ${H_1}$ or ${H_2}$ respectively), so suppose inductively ${d_1,d_2 \geq 2}$ and the claim has already been proven for smaller values of ${d_1-1}$.

We take a derivative in a direction ${h_1 \in H_1}$ along ${h_1}$ to obtain

$\displaystyle T^{-h_1} f = f + \partial_{h_1} f$

where ${T^{-h_1} f(x) = f(x+h_1)}$ is the shift of ${f}$ by ${-h_1}$. Then we take a further shift by a direction ${h_2 \in H_2}$ to obtain

$\displaystyle T^{-h_1-h_2} f = T^{-h_2} f + T^{-h_2} \partial_{h_1} f = f + \partial_{h_2} f + T^{-h_2} \partial_{h_1} f$

$\displaystyle \partial_{h_1+h_2} f = \partial_{h_2} f + T^{-h_2} \partial_{h_1} f.$

Since ${f}$ has degree less than ${d_1}$ along ${H_1}$ and degree less than ${d_2}$ along ${H_2}$, ${\partial_{h_1} f}$ has degree less than ${d_1-1}$ along ${H_1}$ and less than ${d_2}$ along ${H_2}$, so is degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$ by induction hypothesis. Similarly ${\partial_{h_2} f}$ is also of degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$. Combining this with the cocycle equation we see that ${\partial_{h_1+h_2}f}$ is of degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$ for any ${h_1+h_2 \in H_1+H_2}$, and hence ${f}$ is of degree less than ${d_1+d_2-1}$ along ${H_1+H_2}$, as required. $\Box$

While this proposition is simple, it already illustrates some basic principles regarding how one would go about proving a concatenation theorem:

• (i) One should perform induction on the degrees ${d_1,d_2}$ involved, and take advantage of the recursive nature of degree (in this case, the fact that a function is of less than degree ${d}$ along some subgroup ${H}$ of directions iff all of its first derivatives along ${H}$ are of degree less than ${d-1}$).
• (ii) Structure is preserved by operations such as addition, shifting, and taking derivatives. In particular, if a function ${f}$ is of degree less than ${d}$ along some subgroup ${H}$, then any derivative ${\partial_k f}$ of ${f}$ is also of degree less than ${d}$ along ${H}$, even if ${k}$ does not belong to ${H}$.

Here is another simple example of a concatenation theorem. Suppose an at most countable additive group ${G}$ acts by measure-preserving shifts ${T: g \mapsto T^g}$ on some probability space ${(X, {\mathcal X}, \mu)}$; we call the pair ${(X,T)}$ (or more precisely ${(X, {\mathcal X}, \mu, T)}$) a ${G}$-system. We say that a function ${f \in L^\infty(X)}$ is a generalised eigenfunction of degree less than ${d}$ along some subgroup ${H}$ of ${G}$ and some ${d \geq 1}$ if one has

$\displaystyle T^h f = \lambda_h f$

almost everywhere for all ${h \in H}$, and some functions ${\lambda_h \in L^\infty(X)}$ of degree less than ${d-1}$ along ${H}$, with the convention that a function has degree less than ${0}$ if and only if it is equal to ${1}$. Thus for instance, a function ${f}$ is an generalised eigenfunction of degree less than ${1}$ along ${H}$ if it is constant on almost every ${H}$-ergodic component of ${G}$, and is a generalised function of degree less than ${2}$ along ${H}$ if it is an eigenfunction of the shift action on almost every ${H}$-ergodic component of ${G}$. A basic example of a higher order eigenfunction is the function ${f(x,y) := e^{2\pi i y}}$ on the skew shift ${({\bf R}/{\bf Z})^2}$ with ${{\bf Z}}$ action given by the generator ${T(x,y) := (x+\alpha,y+x)}$ for some irrational ${\alpha}$. One can check that ${T^h f = \lambda_h f}$ for every integer ${h}$, where ${\lambda_h: x \mapsto e^{2\pi i \binom{h}{2} \alpha} e^{2\pi i h x}}$ is a generalised eigenfunction of degree less than ${2}$ along ${{\bf Z}}$, so ${f}$ is of degree less than ${3}$ along ${{\bf Z}}$.

We then have

Proposition 2 (Concatenation of higher order eigenfunctions) Let ${(X,T)}$ be a ${G}$-system, and let ${f \in L^\infty(X)}$ be a generalised eigenfunction of degree less than ${d_1}$ along one subgroup ${H_1}$ of ${G}$, and a generalised eigenfunction of degree less than ${d_2}$ along another subgroup ${H_2}$ of ${G}$, for some ${d_1,d_2 \geq 1}$. Then ${f}$ is a generalised eigenfunction of degree less than ${d_1+d_2-1}$ along the subgroup ${H_1+H_2}$ of ${G}$.

The argument is almost identical to that of the previous proposition and is left as an exercise to the reader. The key point is the point (ii) identified earlier: the space of generalised eigenfunctions of degree less than ${d}$ along ${H}$ is preserved by multiplication and shifts, as well as the operation of “taking derivatives” ${f \mapsto \lambda_k}$ even along directions ${k}$ that do not lie in ${H}$. (To prove this latter claim, one should restrict to the region where ${f}$ is non-zero, and then divide ${T^k f}$ by ${f}$ to locate ${\lambda_k}$.)

A typical example of this proposition in action is as follows: consider the ${{\bf Z}^2}$-system given by the ${3}$-torus ${({\bf R}/{\bf Z})^3}$ with generating shifts

$\displaystyle T^{(1,0)}(x,y,z) := (x+\alpha,y,z+y)$

$\displaystyle T^{(0,1)}(x,y,z) := (x,y+\alpha,z+x)$

for some irrational ${\alpha}$, which can be checked to give a ${{\bf Z}^2}$ action

$\displaystyle T^{(n,m)}(x,y,z) := (x+n\alpha, y+m\alpha, z+ny+mx+nm\alpha).$

The function ${f(x,y,z) := e^{2\pi i z}}$ can then be checked to be a generalised eigenfunction of degree less than ${2}$ along ${{\bf Z} \times \{0\}}$, and also less than ${2}$ along ${\{0\} \times {\bf Z}}$, and less than ${3}$ along ${{\bf Z}^2}$. One can view this example as the dynamical systems translation of the example (1) (see this previous post for some more discussion of this sort of correspondence).

The main results of our concatenation paper are analogues of these propositions concerning a more complicated notion of “polynomial-like” structure that are of importance in additive combinatorics and in ergodic theory. On the ergodic theory side, the notion of structure is captured by the Host-Kra characteristic factors ${Z^{ of a ${G}$-system ${X}$ along a subgroup ${H}$. These factors can be defined in a number of ways. One is by duality, using the Gowers-Host-Kra uniformity seminorms (defined for instance here) ${\| \|_{U^d_H(X)}}$. Namely, ${Z^{ is the factor of ${X}$ defined up to equivalence by the requirement that

$\displaystyle \|f\|_{U^d_H(X)} = 0 \iff {\bf E}(f | Z^{

An equivalent definition is in terms of the dual functions ${{\mathcal D}^d_H(f)}$ of ${f}$ along ${H}$, which can be defined recursively by setting ${{\mathcal D}^0_H(f) = 1}$ and

$\displaystyle {\mathcal D}^d_H(f) = {\bf E}_h T^h f {\mathcal D}^{d-1}( f \overline{T^h f} )$

where ${{\bf E}_h}$ denotes the ergodic average along a Følner sequence in ${G}$ (in fact one can also define these concepts in non-amenable abelian settings as per this previous post). The factor ${Z^{ can then be alternately defined as the factor generated by the dual functions ${{\mathcal D}^d_H(f)}$ for ${f \in L^\infty(X)}$.

In the case when ${G=H={\bf Z}}$ and ${X}$ is ${G}$-ergodic, a deep theorem of Host and Kra shows that the factor ${Z^{ is equivalent to the inverse limit of nilsystems of step less than ${d}$. A similar statement holds with ${{\bf Z}}$ replaced by any finitely generated group by Griesmer, while the case of an infinite vector space over a finite field was treated in this paper of Bergelson, Ziegler, and myself. The situation is more subtle when ${X}$ is not ${G}$-ergodic, or when ${X}$ is ${G}$-ergodic but ${H}$ is a proper subgroup of ${G}$ acting non-ergodically, when one has to start considering measurable families of directional nilsystems; see for instance this paper of Austin for some of the subtleties involved (for instance, higher order group cohomology begins to become relevant!).

One of our main theorems is then

Proposition 3 (Concatenation of characteristic factors) Let ${(X,T)}$ be a ${G}$-system, and let ${f}$ be measurable with respect to the factor ${Z^{ and with respect to the factor ${Z^{ for some ${d_1,d_2 \geq 1}$ and some subgroups ${H_1,H_2}$ of ${G}$. Then ${f}$ is also measurable with respect to the factor ${Z^{.

We give two proofs of this proposition in the paper; an ergodic-theoretic proof using the Host-Kra theory of “cocycles of type ${ (along a subgroup ${H}$)”, which can be used to inductively describe the factors ${Z^{, and a combinatorial proof based on a combinatorial analogue of this proposition which is harder to state (but which roughly speaking asserts that a function which is nearly orthogonal to all bounded functions of small ${U^{d_1}_{H_1}}$ norm, and also to all bounded functions of small ${U^{d_2}_{H_2}}$ norm, is also nearly orthogonal to alll bounded functions of small ${U^{d_1+d_2-1}_{H_1+H_2}}$ norm). The combinatorial proof parallels the proof of Proposition 2. A key point is that dual functions ${F := {\mathcal D}^d_H(f)}$ obey a property analogous to being a generalised eigenfunction, namely that

$\displaystyle T^h F = {\bf E}_k \lambda_{h,k} F_k$

where ${F_k := T^k F}$ and ${\lambda_{h,k} := {\mathcal D}^{d-1}( T^h f \overline{T^k f} )}$ is a “structured function of order ${d-1}$” along ${H}$. (In the language of this previous paper of mine, this is an assertion that dual functions are uniformly almost periodic of order ${d}$.) Again, the point (ii) above is crucial, and in particular it is key that any structure that ${F}$ has is inherited by the associated functions ${\lambda_{h,k}}$ and ${F_k}$. This sort of inheritance is quite easy to accomplish in the ergodic setting, as there is a ready-made language of factors to encapsulate the concept of structure, and the shift-invariance and ${\sigma}$-algebra properties of factors make it easy to show that just about any “natural” operation one performs on a function measurable with respect to a given factor, returns a function that is still measurable in that factor. In the finitary combinatorial setting, though, encoding the fact (ii) becomes a remarkably complicated notational nightmare, requiring a huge amount of “epsilon management” and “second-order epsilon management” (in which one manages not only scalar epsilons, but also function-valued epsilons that depend on other parameters). In order to avoid all this we were forced to utilise a nonstandard analysis framework for the combinatorial theorems, which made the arguments greatly resemble the ergodic arguments in many respects (though the two settings are still not equivalent, see this previous blog post for some comparisons between the two settings). Unfortunately the arguments are still rather complicated.

For combinatorial applications, dual formulations of the concatenation theorem are more useful. A direct dualisation of the theorem yields the following decomposition theorem: a bounded function which is small in ${U^{d_1+d_2-1}_{H_1+H_2}}$ norm can be split into a component that is small in ${U^{d_1}_{H_1}}$ norm, and a component that is small in ${U^{d_2}_{H_2}}$ norm. (One may wish to understand this type of result by first proving the following baby version: any function that has mean zero on every coset of ${H_1+H_2}$, can be decomposed as the sum of a function that has mean zero on every ${H_1}$ coset, and a function that has mean zero on every ${H_2}$ coset. This is dual to the assertion that a function that is constant on every ${H_1}$ coset and constant on every ${H_2}$ coset, is constant on every ${H_1+H_2}$ coset.) Combining this with some standard “almost orthogonality” arguments (i.e. Cauchy-Schwarz) give the following Bessel-type inequality: if one has a lot of subgroups ${H_1,\dots,H_k}$ and a bounded function is small in ${U^{2d-1}_{H_i+H_j}}$ norm for most ${i,j}$, then it is also small in ${U^d_{H_i}}$ norm for most ${i}$. (Here is a baby version one may wish to warm up on: if a function ${f}$ has small mean on ${({\bf Z}/p{\bf Z})^2}$ for some large prime ${p}$, then it has small mean on most of the cosets of most of the one-dimensional subgroups of ${({\bf Z}/p{\bf Z})^2}$.)

There is also a generalisation of the above Bessel inequality (as well as several of the other results mentioned above) in which the subgroups ${H_i}$ are replaced by more general coset progressions ${H_i+P_i}$ (of bounded rank), so that one has a Bessel inequailty controlling “local” Gowers uniformity norms such as ${U^d_{P_i}}$ by “global” Gowers uniformity norms such as ${U^{2d-1}_{P_i+P_j}}$. This turns out to be particularly useful when attempting to compute polynomial averages such as

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} f(n) g(n+r^2) h(n+2r^2) \ \ \ \ \ (2)$

for various functions ${f,g,h}$. After repeated use of the van der Corput lemma, one can control such averages by expressions such as

$\displaystyle \sum_{n \leq N} \sum_{h,m,k \leq \sqrt{N}} f(n) f(n+mh) f(n+mk) f(n+m(h+k))$

(actually one ends up with more complicated expressions than this, but let’s use this example for sake of discussion). This can be viewed as an average of various ${U^2}$ Gowers uniformity norms of ${f}$ along arithmetic progressions of the form ${\{ mh: h \leq \sqrt{N}\}}$ for various ${m \leq \sqrt{N}}$. Using the above Bessel inequality, this can be controlled in turn by an average of various ${U^3}$ Gowers uniformity norms along rank two generalised arithmetic progressions of the form ${\{ m_1 h_1 + m_2 h_2: h_1,h_2 \le \sqrt{N}\}}$ for various ${m_1,m_2 \leq \sqrt{N}}$. But for generic ${m_1,m_2}$, this rank two progression is close in a certain technical sense to the “global” interval ${\{ n: n \leq N \}}$ (this is ultimately due to the basic fact that two randomly chosen large integers are likely to be coprime, or at least have a small gcd). As a consequence, one can use the concatenation theorems from our first paper to control expressions such as (2) in terms of global Gowers uniformity norms. This is important in number theoretic applications, when one is interested in computing sums such as

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} \mu(n) \mu(n+r^2) \mu(n+2r^2)$

or

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} \Lambda(n) \Lambda(n+r^2) \Lambda(n+2r^2)$

where ${\mu}$ and ${\Lambda}$ are the Möbius and von Mangoldt functions respectively. This is because we are able to control global Gowers uniformity norms of such functions (thanks to results such as the proof of the inverse conjecture for the Gowers norms, the orthogonality of the Möbius function with nilsequences, and asymptotics for linear equations in primes), but much less control is currently available for local Gowers uniformity norms, even with the assistance of the generalised Riemann hypothesis (see this previous blog post for some further discussion).

By combining these tools and strategies with the “transference principle” approach from our previous paper (as improved using the recent “densification” technique of Conlon, Fox, and Zhao, discussed in this previous post), we are able in particular to establish the following result:

Theorem 4 (Polynomial patterns in the primes) Let ${P_1,\dots,P_k: {\bf Z} \rightarrow {\bf Z}}$ be polynomials of degree at most ${d}$, whose degree ${d}$ coefficients are all distinct, for some ${d \geq 1}$. Suppose that ${P_1,\dots,P_k}$ is admissible in the sense that for every prime ${p}$, there are ${n,r}$ such that ${n+P_1(r),\dots,n+P_k(r)}$ are all coprime to ${p}$. Then there exist infinitely many pairs ${n,r}$ of natural numbers such that ${n+P_1(r),\dots,n+P_k(r)}$ are prime.

Furthermore, we obtain an asymptotic for the number of such pairs ${n,r}$ in the range ${n \leq N}$, ${r \leq N^{1/d}}$ (actually for minor technical reasons we reduce the range of ${r}$ to be very slightly less than ${N^{1/d}}$). In fact one could in principle obtain asymptotics for smaller values of ${r}$, and relax the requirement that the degree ${d}$ coefficients be distinct with the requirement that no two of the ${P_i}$ differ by a constant, provided one had good enough local uniformity results for the Möbius or von Mangoldt functions. For instance, we can obtain an asymptotic for triplets of the form ${n, n+r,n+r^d}$ unconditionally for ${d \leq 5}$, and conditionally on GRH for all ${d}$, using known results on primes in short intervals on average.

The ${d=1}$ case of this theorem was obtained in a previous paper of myself and Ben Green (using the aforementioned conjectures on the Gowers uniformity norm and the orthogonality of the Möbius function with nilsequences, both of which are now proven). For higher ${d}$, an older result of Tamar and myself was able to tackle the case when ${P_1(0)=\dots=P_k(0)=0}$ (though our results there only give lower bounds on the number of pairs ${(n,r)}$, and no asymptotics). Both of these results generalise my older theorem with Ben Green on the primes containing arbitrarily long arithmetic progressions. The theorem also extends to multidimensional polynomials, in which case there are some additional previous results; see the paper for more details. We also get a technical refinement of our previous result on narrow polynomial progressions in (dense subsets of) the primes by making the progressions just a little bit narrower in the case of the density of the set one is using is small.

. This latter Bessel type inequality is particularly useful in combinatorial and number-theoretic applications, as it allows one to convert “global” Gowers uniformity norm (basically, bounds on norms such as ${U^{2d-1}_{H_i+H_j}}$) to “local” Gowers uniformity norm control.

Note: this post is of a particularly technical nature, in particular presuming familiarity with nilsequences, nilsystems, characteristic factors, etc., and is primarily intended for experts.

As mentioned in the previous post, Ben Green, Tamar Ziegler, and myself proved the following inverse theorem for the Gowers norms:

Theorem 1 (Inverse theorem for Gowers norms) Let ${N \geq 1}$ and ${s \geq 1}$ be integers, and let ${\delta > 0}$. Suppose that ${f: {\bf Z} \rightarrow [-1,1]}$ is a function supported on ${[N] := \{1,\dots,N\}}$ such that

$\displaystyle \frac{1}{N^{s+2}} \sum_{n,h_1,\dots,h_{s+1}} \prod_{\omega \in \{0,1\}^{s+1}} f(n+\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}) \geq \delta.$

Then there exists a filtered nilmanifold ${G/\Gamma}$ of degree ${\leq s}$ and complexity ${O_{s,\delta}(1)}$, a polynomial sequence ${g: {\bf Z} \rightarrow G}$, and a Lipschitz function ${F: G/\Gamma \rightarrow {\bf R}}$ of Lipschitz constant ${O_{s,\delta}(1)}$ such that

$\displaystyle \frac{1}{N} \sum_n f(n) F(g(n) \Gamma) \gg_{s,\delta} 1.$

This result was conjectured earlier by Ben Green and myself; this conjecture was strongly motivated by an analogous inverse theorem in ergodic theory by Host and Kra, which we formulate here in a form designed to resemble Theorem 1 as closely as possible:

Theorem 2 (Inverse theorem for Gowers-Host-Kra seminorms) Let ${s \geq 1}$ be an integer, and let ${(X, T)}$ be an ergodic, countably generated measure-preserving system. Suppose that one has

$\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N^{s+1}} \sum_{h_1,\dots,h_{s+1} \in [N]} \int_X \prod_{\omega \in \{0,1\}^{s+1}} f(T^{\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}}x)\ d\mu(x)$

$\displaystyle > 0$

for all non-zero ${f \in L^\infty(X)}$ (all ${L^p}$ spaces are real-valued in this post). Then ${(X,T)}$ is an inverse limit (in the category of measure-preserving systems, up to almost everywhere equivalence) of ergodic degree ${\leq s}$ nilsystems, that is to say systems of the form ${(G/\Gamma, x \mapsto gx)}$ for some degree ${\leq s}$ filtered nilmanifold ${G/\Gamma}$ and a group element ${g \in G}$ that acts ergodically on ${G/\Gamma}$.

It is a natural question to ask if there is any logical relationship between the two theorems. In the finite field category, one can deduce the combinatorial inverse theorem from the ergodic inverse theorem by a variant of the Furstenberg correspondence principle, as worked out by Tamar Ziegler and myself, however in the current context of ${{\bf Z}}$-actions, the connection is less clear.

One can split Theorem 2 into two components:

Theorem 3 (Weak inverse theorem for Gowers-Host-Kra seminorms) Let ${s \geq 1}$ be an integer, and let ${(X, T)}$ be an ergodic, countably generated measure-preserving system. Suppose that one has

$\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N^{s+1}} \sum_{h_1,\dots,h_{s+1} \in [N]} \int_X \prod_{\omega \in \{0,1\}^{s+1}} T^{\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}} f\ d\mu$

$\displaystyle > 0$

for all non-zero ${f \in L^\infty(X)}$, where ${T^h f := f \circ T^h}$. Then ${(X,T)}$ is a factor of an inverse limit of ergodic degree ${\leq s}$ nilsystems.

Theorem 4 (Pro-nilsystems closed under factors) Let ${s \geq 1}$ be an integer. Then any factor of an inverse limit of ergodic degree ${\leq s}$ nilsystems, is again an inverse limit of ergodic degree ${\leq s}$ nilsystems.

Indeed, it is clear that Theorem 2 implies both Theorem 3 and Theorem 4, and conversely that the two latter theorems jointly imply the former. Theorem 4 is, in principle, purely a fact about nilsystems, and should have an independent proof, but this is not known; the only known proofs go through the full machinery needed to prove Theorem 2 (or the closely related theorem of Ziegler). (However, the fact that a factor of a nilsystem is again a nilsystem was established previously by Parry.)

The purpose of this post is to record a partial implication in reverse direction to the correspondence principle:

Proposition 5 Theorem 1 implies Theorem 3.

As mentioned at the start of the post, a fair amount of familiarity with the area is presumed here, and some routine steps will be presented with only a fairly brief explanation.

A few years ago, Ben Green, Tamar Ziegler, and myself proved the following (rather technical-looking) inverse theorem for the Gowers norms:

Theorem 1 (Discrete inverse theorem for Gowers norms) Let ${N \geq 1}$ and ${s \geq 1}$ be integers, and let ${\delta > 0}$. Suppose that ${f: {\bf Z} \rightarrow [-1,1]}$ is a function supported on ${[N] := \{1,\dots,N\}}$ such that

$\displaystyle \frac{1}{N^{s+2}} \sum_{n,h_1,\dots,h_{s+1}} \prod_{\omega \in \{0,1\}^{s+1}} f(n+\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}) \geq \delta.$

Then there exists a filtered nilmanifold ${G/\Gamma}$ of degree ${\leq s}$ and complexity ${O_{s,\delta}(1)}$, a polynomial sequence ${g: {\bf Z} \rightarrow G}$, and a Lipschitz function ${F: G/\Gamma \rightarrow {\bf R}}$ of Lipschitz constant ${O_{s,\delta}(1)}$ such that

$\displaystyle \frac{1}{N} \sum_n f(n) F(g(n) \Gamma) \gg_{s,\delta} 1.$

For the definitions of “filtered nilmanifold”, “degree”, “complexity”, and “polynomial sequence”, see the paper of Ben, Tammy, and myself. (I should caution the reader that this blog post will presume a fair amount of familiarity with this subfield of additive combinatorics.) This result has a number of applications, for instance to establishing asymptotics for linear equations in the primes, but this will not be the focus of discussion here.

The purpose of this post is to record the observation that this “discrete” inverse theorem, together with an equidistribution theorem for nilsequences that Ben and I worked out in a separate paper, implies a continuous version:

Theorem 2 (Continuous inverse theorem for Gowers norms) Let ${s \geq 1}$ be an integer, and let ${\delta>0}$. Suppose that ${f: {\bf R} \rightarrow [-1,1]}$ is a measurable function supported on ${[0,1]}$ such that

$\displaystyle \int_{{\bf R}^{s+1}} \prod_{\omega \in \{0,1\}^{s+1}} f(t+\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1})\ dt dh_1 \dots dh_{s+1} \geq \delta. \ \ \ \ \ (1)$

Then there exists a filtered nilmanifold ${G/\Gamma}$ of degree ${\leq s}$ and complexity ${O_{s,\delta}(1)}$, a (smooth) polynomial sequence ${g: {\bf R} \rightarrow G}$, and a Lipschitz function ${F: G/\Gamma \rightarrow {\bf R}}$ of Lipschitz constant ${O_{s,\delta}(1)}$ such that

$\displaystyle \int_{\bf R} f(t) F(g(t) \Gamma)\ dt \gg_{s,\delta} 1.$

The interval ${[0,1]}$ can be easily replaced with any other fixed interval by a change of variables. A key point here is that the bounds are completely uniform in the choice of ${f}$. Note though that the coefficients of ${g}$ can be arbitrarily large (and this is necessary, as can be seen just by considering functions of the form ${f(t) = \cos( \xi t)}$ for some arbitrarily large frequency ${\xi}$).

It is likely that one could prove Theorem 2 by carefully going through the proof of Theorem 1 and replacing all instances of ${{\bf Z}}$ with ${{\bf R}}$ (and making appropriate modifications to the argument to accommodate this). However, the proof of Theorem 1 is quite lengthy. Here, we shall proceed by the usual limiting process of viewing the continuous interval ${[0,1]}$ as a limit of the discrete interval ${\frac{1}{N} \cdot [N]}$ as ${N \rightarrow \infty}$. However there will be some problems taking the limit due to a failure of compactness, and specifically with regards to the coefficients of the polynomial sequence ${g: {\bf N} \rightarrow G}$ produced by Theorem 1, after normalising these coefficients by ${N}$. Fortunately, a factorisation theorem from a paper of Ben Green and myself resolves this problem by splitting ${g}$ into a “smooth” part which does enjoy good compactness properties, as well as “totally equidistributed” and “periodic” parts which can be eliminated using the measurability (and thus, approximate smoothness), of ${f}$.

Szemerédi’s theorem asserts that any subset of the integers of positive upper density contains arbitrarily large arithmetic progressions. Here is an equivalent quantitative form of this theorem:

Theorem 1 (Szemerédi’s theorem) Let ${N}$ be a positive integer, and let ${f: {\bf Z}/N{\bf Z} \rightarrow [0,1]}$ be a function with ${{\bf E}_{x \in {\bf Z}/N{\bf Z}} f(x) \geq \delta}$ for some ${\delta>0}$, where we use the averaging notation ${{\bf E}_{x \in A} f(x) := \frac{1}{|A|} \sum_{x \in A} f(x)}$, ${{\bf E}_{x,r \in A} f(x) := \frac{1}{|A|^2} \sum_{x, r \in A} f(x)}$, etc.. Then for ${k \geq 3}$ we have

$\displaystyle {\bf E}_{x,r \in {\bf Z}/N{\bf Z}} f(x) f(x+r) \dots f(x+(k-1)r) \geq c(k,\delta)$

for some ${c(k,\delta)>0}$ depending only on ${k,\delta}$.

The equivalence is basically thanks to an averaging argument of Varnavides; see for instance Chapter 11 of my book with Van Vu or this previous blog post for a discussion. We have removed the cases ${k=1,2}$ as they are trivial and somewhat degenerate.

There are now many proofs of this theorem. Some time ago, I took an ergodic-theoretic proof of Furstenberg and converted it to a purely finitary proof of the theorem. The argument used some simplifying innovations that had been developed since the original work of Furstenberg (in particular, deployment of the Gowers uniformity norms, as well as a “dual” norm that I called the uniformly almost periodic norm, and an emphasis on van der Waerden’s theorem for handling the “compact extension” component of the argument). But the proof was still quite messy. However, as discussed in this previous blog post, messy finitary proofs can often be cleaned up using nonstandard analysis. Thus, there should be a nonstandard version of the Furstenberg ergodic theory argument that is relatively clean. I decided (after some encouragement from Ben Green and Isaac Goldbring) to write down most of the details of this argument in this blog post, though for sake of brevity I will skim rather quickly over arguments that were already discussed at length in other blog posts. In particular, I will presume familiarity with nonstandard analysis (in particular, the notion of a standard part of a bounded real number, and the Loeb measure construction), see for instance this previous blog post for a discussion.

I’ve just uploaded to the arXiv my paper “Failure of the ${L^1}$ pointwise and maximal ergodic theorems for the free group“, submitted to Forum of Mathematics, Sigma. This paper concerns a variant of the pointwise ergodic theorem of Birkhoff, which asserts that if one has a measure-preserving shift map ${T: X \rightarrow X}$ on a probability space ${X = (X,\mu)}$, then for any ${f \in L^1(X)}$, the averages ${\frac{1}{N} \sum_{n=1}^N f \circ T^{-n}}$ converge pointwise almost everywhere. (In the important case when the shift map ${T}$ is ergodic, the pointwise limit is simply the mean ${\int_X f\ d\mu}$ of the original function ${f}$.)

The pointwise ergodic theorem can be extended to measure-preserving actions of other amenable groups, if one uses a suitably “tempered” Folner sequence of averages; see this paper of Lindenstrauss for more details. (I also wrote up some notes on that paper here, back in 2006 before I had started this blog.) But the arguments used to handle the amenable case break down completely for non-amenable groups, and in particular for the free non-abelian group ${F_2}$ on two generators.

Nevo and Stein studied this problem and obtained a number of pointwise ergodic theorems for ${F_2}$-actions ${(T_g)_{g \in F_2}}$ on probability spaces ${(X,\mu)}$. For instance, for the spherical averaging operators

$\displaystyle {\mathcal A}_n f := \frac{1}{4 \times 3^{n-1}} \sum_{g \in F_2: |g| = n} f \circ T_g^{-1}$

(where ${|g|}$ denotes the length of the reduced word that forms ${g}$), they showed that ${{\mathcal A}_{2n} f}$ converged pointwise almost everywhere provided that ${f}$ was in ${L^p(X)}$ for some ${p>1}$. (The need to restrict to spheres of even radius can be seen by considering the action of ${F_2}$ on the two-element set ${\{0,1\}}$ in which both generators of ${F_2}$ act by interchanging the elements, in which case ${{\mathcal A}_n}$ is determined by the parity of ${n}$.) This result was reproven with a different and simpler proof by Bufetov, who also managed to relax the condition ${f \in L^p(X)}$ to the weaker condition ${f \in L \log L(X)}$.

The question remained open as to whether the pointwise ergodic theorem for ${F_2}$-actions held if one only assumed that ${f}$ was in ${L^1(X)}$. Nevo and Stein were able to establish this for the Cesáro averages ${\frac{1}{N} \sum_{n=1}^N {\mathcal A}_n}$, but not for ${{\mathcal A}_n}$ itself. About six years ago, Assaf Naor and I tried our hand at this problem, and was able to show an associated maximal inequality on ${\ell^1(F_2)}$, but due to the non-amenability of ${F_2}$, this inequality did not transfer to ${L^1(X)}$ and did not have any direct impact on this question, despite a fair amount of effort on our part to attack it.

Inspired by some recent conversations with Lewis Bowen, I returned to this problem. This time around, I tried to construct a counterexample to the ${L^1}$ pointwise ergodic theorem – something Assaf and I had not seriously attempted to do (perhaps due to being a bit too enamoured of our ${\ell^1(F_2)}$ maximal inequality). I knew of an existing counterexample of Ornstein regarding a failure of an ${L^1}$ ergodic theorem for iterates ${P^n}$ of a self-adjoint Markov operator – in fact, I had written some notes on this example back in 2007. Upon revisiting my notes, I soon discovered that the Ornstein construction was adaptable to the ${F_2}$ setting, thus settling the problem in the negative:

Theorem 1 (Failure of ${L^1}$ pointwise ergodic theorem) There exists a measure-preserving ${F_2}$-action on a probability space ${X}$ and a non-negative function ${f \in L^1(X)}$ such that ${\sup_n {\mathcal A}_{2n} f(x) = +\infty}$ for almost every ${x}$.

To describe the proof of this theorem, let me first briefly sketch the main ideas of Ornstein’s construction, which gave an example of a self-adjoint Markov operator ${P}$ on a probability space ${X}$ and a non-negative ${f \in L^1(X)}$ such that ${\sup_n P^n f(x) = +\infty}$ for almost every ${x}$. By some standard manipulations, it suffices to show that for any given ${\alpha > 0}$ and ${\varepsilon>0}$, there exists a self-adjoint Markov operator ${P}$ on a probability space ${X}$ and a non-negative ${f \in L^1(X)}$ with ${\|f\|_{L^1(X)} \leq \alpha}$, such that ${\sup_n P^n f \geq 1-\varepsilon}$ on a set of measure at least ${1-\varepsilon}$. Actually, it will be convenient to replace the Markov chain ${(P^n f)_{n \geq 0}}$ with an ancient Markov chain ${(f_n)_{n \in {\bf Z}}}$ – that is to say, a sequence of non-negative functions ${f_n}$ for both positive and negative ${f}$, such that ${f_{n+1} = P f_n}$ for all ${n \in {\bf Z}}$. The purpose of requiring the Markov chain to be ancient (that is, to extend infinitely far back in time) is to allow for the Markov chain to be shifted arbitrarily in time, which is key to Ornstein’s construction. (Technically, Ornstein’s original argument only uses functions that go back to a large negative time, rather than being infinitely ancient, but I will gloss over this point for sake of discussion, as it turns out that the ${F_2}$ version of the argument can be run using infinitely ancient chains.)

For any ${\alpha>0}$, let ${P(\alpha)}$ denote the claim that for any ${\varepsilon>0}$, there exists an ancient Markov chain ${(f_n)_{n \in {\bf Z}}}$ with ${\|f_n\|_{L^1(X)} = \alpha}$ such that ${\sup_{n \in {\bf Z}} f_n \geq 1-\varepsilon}$ on a set of measure at least ${1-\varepsilon}$. Clearly ${P(1)}$ holds since we can just take ${f_n=1}$ for all ${n}$. Our objective is to show that ${P(\alpha)}$ holds for arbitrarily small ${\alpha}$. The heart of Ornstein’s argument is then the implication

$\displaystyle P(\alpha) \implies P( \alpha (1 - \frac{\alpha}{4}) ) \ \ \ \ \ (1)$

for any ${0 < \alpha \leq 1}$, which upon iteration quickly gives the desired claim.

Let’s see informally how (1) works. By hypothesis, and ignoring epsilons, we can find an ancient Markov chain ${(f_n)_{n \in {\bf Z}}}$ on some probability space ${X}$ of total mass ${\|f_n\|_{L^1(X)} = \alpha}$, such that ${\sup_n f_n}$ attains the value of ${1}$ or greater almost everywhere. Assuming that the Markov process is irreducible, the ${f_n}$ will eventually converge as ${n \rightarrow \infty}$ to the constant value of ${\|f_n\|_{L^1(X)}}$, in particular its final state will essentially stay above ${\alpha}$ (up to small errors).

Now suppose we duplicate the Markov process by replacing ${X}$ with a double copy ${X \times \{1,2\}}$ (giving ${\{1,2\}}$ the uniform probability measure), and using the disjoint sum of the Markov operators on ${X \times \{1\}}$ and ${X \times \{2\}}$ as the propagator, so that there is no interaction between the two components of this new system. Then the functions ${f'_n(x,i) := f_n(x) 1_{i=1}}$ form an ancient Markov chain of mass at most ${\alpha/2}$ that lives solely in the first half ${X \times \{1\}}$ of this copy, and ${\sup_n f'_n}$ attains the value of ${1}$ or greater on almost all of the first half ${X \times \{1\}}$, but is zero on the second half. The final state of ${f'_n}$ will be to stay above ${\alpha}$ in the first half ${X \times \{1\}}$, but be zero on the second half.

Now we modify the above example by allowing an infinitesimal amount of interaction between the two halves ${X \times \{1\}}$, ${X \times \{2\}}$ of the system (I mentally think of ${X \times \{1\}}$ and ${X \times \{2\}}$ as two identical boxes that a particle can bounce around in, and now we wish to connect the boxes by a tiny tube). The precise way in which this interaction is inserted is not terribly important so long as the new Markov process is irreducible. Once one does so, then the ancient Markov chain ${(f'_n)_{n \in {\bf Z}}}$ in the previous example gets replaced by a slightly different ancient Markov chain ${(f''_n)_{n \in {\bf Z}}}$ which is more or less identical with ${f'_n}$ for negative times ${n}$, or for bounded positive times ${n}$, but for very large values of ${n}$ the final state is now constant across the entire state space ${X \times \{1,2\}}$, and will stay above ${\alpha/2}$ on this space.

Finally, we consider an ancient Markov chain ${F_n}$ which is basically of the form

$\displaystyle F_n(x,i) \approx f''_n(x,i) + (1 - \frac{\alpha}{2}) f_{n-M}(x) 1_{i=2}$

for some large parameter ${M}$ and for all ${n \leq M}$ (the approximation becomes increasingly inaccurate for ${n}$ much larger than ${M}$, but never mind this for now). This is basically two copies of the original Markov process in separate, barely interacting state spaces ${X \times \{1\}, X \times \{2\}}$, but with the second copy delayed by a large time delay ${M}$, and also attenuated in amplitude by a factor of ${1-\frac{\alpha}{2}}$. The total mass of this process is now ${\frac{\alpha}{2} + \frac{\alpha}{2} (1 -\frac{\alpha}{2}) = \alpha (1 - \alpha/4)}$. Because of the ${f''_n}$ component of ${F_n}$, we see that ${\sup_n F_n}$ basically attains the value of ${1}$ or greater on the first half ${X \times \{1\}}$. On the second half ${X \times \{2\}}$, we work with times ${n}$ close to ${M}$. If ${M}$ is large enough, ${f''_n}$ would have averaged out to about ${\alpha/2}$ at such times, but the ${(1 - \frac{\alpha}{2}) f_{n-M}(x)}$ component can get as large as ${1-\alpha/2}$ here. Summing (and continuing to ignore various epsilon losses), we see that ${\sup_n F_n}$ can get as large as ${1}$ on almost all of the second half of ${X \times \{2\}}$. This concludes the rough sketch of how one establishes the implication (1).

It was observed by Bufetov that the spherical averages ${{\mathcal A}_n}$ for a free group action can be lifted up to become powers ${P^n}$ of a Markov operator, basically by randomly assigning a “velocity vector” ${s \in \{a,b,a^{-1},b^{-1}\}}$ to one’s base point ${x}$ and then applying the Markov process that moves ${x}$ along that velocity vector (and then randomly changing the velocity vector at each time step to the “reduced word” condition that the velocity never flips from ${s}$ to ${s^{-1}}$). Thus the spherical average problem has a Markov operator interpretation, which opens the door to adapting the Ornstein construction to the setting of ${F_2}$ systems. This turns out to be doable after a certain amount of technical artifice; the main thing is to work with ${F_2}$-measure preserving systems that admit ancient Markov chains that are initially supported in a very small region in the “interior” of the state space, so that one can couple such systems to each other “at the boundary” in the fashion needed to establish the analogue of (1) without disrupting the ancient dynamics of such chains. The initial such system (used to establish the base case ${P(1)}$) comes from basically considering the action of ${F_2}$ on a (suitably renormalised) “infinitely large ball” in the Cayley graph, after suitably gluing together the boundary of this ball to complete the action. The ancient Markov chain associated to this system starts at the centre of this infinitely large ball at infinite negative time ${n=-\infty}$, and only reaches the boundary of this ball at the time ${n=0}$.

An extremely large portion of mathematics is concerned with locating solutions to equations such as

$\displaystyle f(x) = 0$

or

$\displaystyle \Phi(x) = x \ \ \ \ \ (1)$

for ${x}$ in some suitable domain space (either finite-dimensional or infinite-dimensional), and various maps ${f}$ or ${\Phi}$. To solve the fixed point iteration equation (1), the simplest general method available is the fixed point iteration method: one starts with an initial approximate solution ${x_0}$ to (1), so that ${\Phi(x_0) \approx x_0}$, and then recursively constructs the sequence ${x_1, x_2, x_3, \dots}$ by ${x_n := \Phi(x_{n-1})}$. If ${\Phi}$ behaves enough like a “contraction”, and the domain is complete, then one can expect the ${x_n}$ to converge to a limit ${x}$, which should then be a solution to (1). For instance, if ${\Phi: X \rightarrow X}$ is a map from a metric space ${X = (X,d)}$ to itself, which is a contraction in the sense that

$\displaystyle d( \Phi(x), \Phi(y) ) \leq (1-\eta) d(x,y)$

for all ${x,y \in X}$ and some ${\eta>0}$, then with ${x_n}$ as above we have

$\displaystyle d( x_{n+1}, x_n ) \leq (1-\eta) d(x_n, x_{n-1} )$

for any ${n}$, and so the distances ${d(x_n, x_{n-1} )}$ between successive elements of the sequence decay at at least a geometric rate. This leads to the contraction mapping theorem, which has many important consequences, such as the inverse function theorem and the Picard existence theorem.

A slightly more complicated instance of this strategy arises when trying to linearise a complex map ${f: U \rightarrow {\bf C}}$ defined in a neighbourhood ${U}$ of a fixed point. For simplicity we normalise the fixed point to be the origin, thus ${0 \in U}$ and ${f(0)=0}$. When studying the complex dynamics ${f^2 = f \circ f}$, ${f^3 = f \circ f \circ f}$, ${\dots}$ of such a map, it can be useful to try to conjugate ${f}$ to another function ${g = \psi^{-1} \circ f \circ \psi}$, where ${\psi}$ is a holomorphic function defined and invertible near ${0}$ with ${\psi(0)=0}$, since the dynamics of ${g}$ will be conjguate to that of ${f}$. Note that if ${f(0)=0}$ and ${f'(0)=\lambda}$, then from the chain rule any conjugate ${g}$ of ${f}$ will also have ${g(0)=0}$ and ${g'(0)=\lambda}$. Thus, the “simplest” function one can hope to conjugate ${f}$ to is the linear function ${z \mapsto \lambda z}$. Let us say that ${f}$ is linearisable (around ${0}$) if it is conjugate to ${z \mapsto \lambda z}$ in some neighbourhood of ${0}$. Equivalently, ${f}$ is linearisable if there is a solution to the Schröder equation

$\displaystyle f( \psi(z) ) = \psi(\lambda z) \ \ \ \ \ (2)$

for some ${\psi: U' \rightarrow {\bf C}}$ defined and invertible in a neighbourhood ${U'}$ of ${0}$ with ${\psi(0)=0}$, and all ${z}$ sufficiently close to ${0}$. (The Schröder equation is normalised somewhat differently in the literature, but this form is equivalent to the usual form, at least when ${\lambda}$ is non-zero.) Note that if ${\psi}$ solves the above equation, then so does ${z \mapsto \psi(cz)}$ for any non-zero ${c}$, so we may normalise ${\psi'(0)=1}$ in addition to ${\psi(0)=0}$, which also ensures local invertibility from the inverse function theorem. (Note from winding number considerations that ${\psi}$ cannot be invertible near zero if ${\psi'(0)}$ vanishes.)

We have the following basic result of Koenigs:

Theorem 1 (Koenig’s linearisation theorem) Let ${f: U \rightarrow {\bf C}}$ be a holomorphic function defined near ${0}$ with ${f(0)=0}$ and ${f'(0)=\lambda}$. If ${0 < |\lambda| < 1}$ (attracting case) or ${1 < |\lambda| < \infty}$ (repelling case), then ${f}$ is linearisable near zero.

Proof: Observe that if ${f, \psi, \lambda}$ solve (2), then ${f^{-1}, \psi^{-1}, \lambda^{-1}}$ solve (2) also (in a sufficiently small neighbourhood of zero). Thus we may reduce to the attractive case ${0 < |\lambda| < 1}$.

Let ${r>0}$ be a sufficiently small radius, and let ${X}$ denote the space of holomorphic functions ${\psi: B(0,r) \rightarrow {\bf C}}$ on the complex disk ${B(0,r) := \{z \in {\bf C}: |z| < r \}}$ with ${\psi(0)=0}$ and ${\psi'(0)=1}$. We can view the Schröder equation (2) as a fixed point equation

$\displaystyle \psi = \Phi(\psi)$

where ${\Phi: X' \rightarrow X}$ is the partially defined function on ${X}$ that maps a function ${\psi: B(0,r) \rightarrow {\bf C}}$ to the function ${\Phi(\psi): B(0,r) \rightarrow {\bf C}}$ defined by

$\displaystyle \Phi(\psi)(z) := f^{-1}( \psi( \lambda z ) ),$

assuming that ${f^{-1}}$ is well-defined on the range of ${\psi(B(0,\lambda r))}$ (this is why ${\Phi}$ is only partially defined).

We can solve this equation by the fixed point iteration method, if ${r}$ is small enough. Namely, we start with ${\psi_0: B(0,r) \rightarrow {\bf C}}$ being the identity map, and set ${\psi_1 := \Phi(\psi_0), \psi_2 := \Phi(\psi_1)}$, etc. We equip ${X}$ with the uniform metric ${d( \psi, \tilde \psi ) := \sup_{z \in B(0,r)} |\psi(z) - \tilde \psi(z)|}$. Observe that if ${d( \psi, \psi_0 ), d(\tilde \psi, \psi_0) \leq r}$, and ${r}$ is small enough, then ${\psi, \tilde \psi}$ takes values in ${B(0,2r)}$, and ${\Phi(\psi), \Phi(\tilde \psi)}$ are well-defined and lie in ${X}$. Also, since ${f^{-1}}$ is smooth and has derivative ${\lambda^{-1}}$ at ${0}$, we have

$\displaystyle |f^{-1}(z) - f^{-1}(w)| \leq (1+\varepsilon) |\lambda|^{-1} |z-w|$

if ${z, w \in B(0,r)}$, ${\varepsilon>0}$ and ${r}$ is sufficiently small depending on ${\varepsilon}$. This is not yet enough to establish the required contraction (thanks to Mario Bonk for pointing this out); but observe that the function ${\frac{\psi(z)-\tilde \psi(z)}{z^2}}$ is holomorphic on ${B(0,r)}$ and bounded by ${d(\psi,\tilde \psi)/r^2}$ on the boundary of this ball (or slightly within this boundary), so by the maximum principle we see that

$\displaystyle |\frac{\psi(z)-\tilde \psi(z)}{z^2}| \leq \frac{1}{r^2} d(\psi,\tilde \psi)$

on all of ${B(0,r)}$, and in particular

$\displaystyle |\psi(z)-\tilde \psi(z)| \leq |\lambda|^2 d(\psi,\tilde \psi)$

on ${B(0,\lambda r)}$. Putting all this together, we see that

$\displaystyle d( \Phi(\psi), \Phi(\tilde \psi)) \leq (1+\varepsilon) |\lambda| d(\psi, \tilde \psi);$

since ${|\lambda|<1}$, we thus obtain a contraction on the ball ${\{ \psi \in X: d(\psi,\psi_0) \leq r \}}$ if ${\varepsilon}$ is small enough (and ${r}$ sufficiently small depending on ${\varepsilon}$). From this (and the completeness of ${X}$, which follows from Morera’s theorem) we see that the iteration ${\psi_n}$ converges (exponentially fast) to a limit ${\psi \in X}$ which is a fixed point of ${\Phi}$, and thus solves Schröder’s equation, as required. $\Box$

Koenig’s linearisation theorem leaves open the indifferent case when ${|\lambda|=1}$. In the rationally indifferent case when ${\lambda^n=1}$ for some natural number ${n}$, there is an obvious obstruction to linearisability, namely that ${f^n = 1}$ (in particular, linearisation is not possible in this case when ${f}$ is a non-trivial rational function). An obstruction is also present in some irrationally indifferent cases (where ${|\lambda|=1}$ but ${\lambda^n \neq 1}$ for any natural number ${n}$), if ${\lambda}$ is sufficiently close to various roots of unity; the first result of this form is due to Cremer, and the optimal result of this type for quadratic maps was established by Yoccoz. In the other direction, we have the following result of Siegel:

Theorem 2 (Siegel’s linearisation theorem) Let ${f: U \rightarrow {\bf C}}$ be a holomorphic function defined near ${0}$ with ${f(0)=0}$ and ${f'(0)=\lambda}$. If ${|\lambda|=1}$ and one has the Diophantine condition ${\frac{1}{|\lambda^n-1|} \leq C n^C}$ for all natural numbers ${n}$ and some constant ${C>0}$, then ${f}$ is linearisable at ${0}$.

The Diophantine condition can be relaxed to a more general condition involving the rational exponents of the phase ${\theta}$ of ${\lambda = e^{2\pi i \theta}}$; this was worked out by Brjuno, with the condition matching the one later obtained by Yoccoz. Amusingly, while the set of Diophantine numbers (and hence the set of linearisable ${\lambda}$) has full measure on the unit circle, the set of non-linearisable ${\lambda}$ is generic (the complement of countably many nowhere dense sets) due to the above-mentioned work of Cremer, leading to a striking disparity between the measure-theoretic and category notions of “largeness”.

Siegel’s theorem does not seem to be provable using a fixed point iteration method. However, it can be established by modifying another basic method to solve equations, namely Newton’s method. Let us first review how this method works to solve the equation ${f(x)=0}$ for some smooth function ${f: I \rightarrow {\bf R}}$ defined on an interval ${I}$. We suppose we have some initial approximant ${x_0 \in I}$ to this equation, with ${f(x_0)}$ small but not necessarily zero. To make the analysis more quantitative, let us suppose that the interval ${[x_0-r_0,x_0+r_0]}$ lies in ${I}$ for some ${r_0>0}$, and we have the estimates

$\displaystyle |f(x_0)| \leq \delta_0 r_0$

$\displaystyle |f'(x)| \geq \eta_0$

$\displaystyle |f''(x)| \leq \frac{1}{\eta_0 r_0}$

for some ${\delta_0 > 0}$ and ${0 < \eta_0 < 1/2}$ and all ${x \in [x_0-r_0,x_0+r_0]}$ (the factors of ${r_0}$ are present to make ${\delta_0,\eta_0}$ “dimensionless”).

Lemma 3 Under the above hypotheses, we can find ${x_1}$ with ${|x_1 - x_0| \leq \eta_0 r_0}$ such that

$\displaystyle |f(x_1)| \ll \delta_0^2 \eta_0^{-O(1)} r_0.$

In particular, setting ${r_1 := (1-\eta_0) r_0}$, ${\eta_1 := \eta_0/2}$, and ${\delta_1 = O(\delta_0^2 \eta_0^{-O(1)})}$, we have ${[x_1-r_1,x_1+r_1] \subset [x_0-r_0,x_0+r_0] \subset I}$, and

$\displaystyle |f(x_1)| \leq \delta_1 r_1$

$\displaystyle |f'(x)| \geq \eta_1$

$\displaystyle |f''(x)| \leq \frac{1}{\eta_1 r_1}$

for all ${x \in [x_1-r_1,x_1+r_1]}$.

The crucial point here is that the new error ${\delta_1}$ is roughly the square of the previous error ${\delta_0}$. This leads to extremely fast (double-exponential) improvement in the error upon iteration, which is more than enough to absorb the exponential losses coming from the ${\eta_0^{-O(1)}}$ factor.

Proof: If ${\delta_0 > c \eta_0^{C}}$ for some absolute constants ${C,c>0}$ then we may simply take ${x_0=x_1}$, so we may assume that ${\delta_0 \leq c \eta_0^{C}}$ for some small ${c>0}$ and large ${C>0}$. Using the Newton approximation ${f(x_0+h) \approx f(x_0) + h f'(x_0)}$ we are led to the choice

$\displaystyle x_1 := x_0 - \frac{f(x_0)}{f'(x_0)}$

for ${x_1}$. From the hypotheses on ${f}$ and the smallness hypothesis on ${\delta}$ we certainly have ${|x_1-x_0| \leq \eta_0 r_0}$. From Taylor’s theorem with remainder we have

$\displaystyle f(x_1) = f(x_0) - \frac{f(x_0)}{f'(x_0)} f'(x_0) + O( \frac{1}{\eta_0 r_0} |\frac{f(x_0)}{f'(x_0)}|^2 )$

$\displaystyle = O( \frac{1}{\eta_0 r_0} (\frac{\delta_0 r_0}{\eta_0})^2 )$

and the claim follows. $\Box$

We can iterate this procedure; starting with ${x_0,\eta_0,r_0,\delta_0}$ as above, we obtain a sequence of nested intervals ${[x_n-r_n,x_n+r_n]}$ with ${f(x_n)| \leq \delta_n}$, and with ${\eta_n,r_n,\delta_n,x_n}$ evolving by the recursive equations and estimates

$\displaystyle \eta_n = \eta_{n-1} / 2$

$\displaystyle r_n = (1 - \eta_{n-1}) r_{n-1}$

$\displaystyle \delta_n = O( \delta_{n-1}^2 \eta_{n-1}^{-O(1)} )$

$\displaystyle |x_n - x_{n-1}| \leq \eta_{n-1} r_{n-1}.$

If ${\delta_0}$ is sufficiently small depending on ${\eta_0}$, we see that ${\delta_n}$ converges rapidly to zero (indeed, we can inductively obtain a bound of the form ${\delta_n \leq \eta_0^{C (2^n + n)}}$ for some large absolute constant ${C}$ if ${\delta_0}$ is small enough), and ${x_n}$ converges to a limit ${x \in I}$ which then solves the equation ${f(x)=0}$ by the continuity of ${f}$.

As I recently learned from Zhiqiang Li, a similar scheme works to prove Siegel’s theorem, as can be found for instance in this text of Carleson and Gamelin. The key is the following analogue of Lemma 3.

Lemma 4 Let ${\lambda}$ be a complex number with ${|\lambda|=1}$ and ${\frac{1}{|\lambda^n-1|} \ll n^{O(1)}}$ for all natural numbers ${n}$. Let ${r_0>0}$, and let ${f_0: B(0,r_0) \rightarrow {\bf C}}$ be a holomorphic function with ${f_0(0)=0}$, ${f'_0(0)=\lambda}$, and

$\displaystyle |f_0(z) - \lambda z| \leq \delta_0 r_0 \ \ \ \ \ (3)$

for all ${z \in B(0,r_0)}$ and some ${\delta_0>0}$. Let ${0 < \eta_0 \leq 1/2}$, and set ${r_1 := (1-\eta_0) r_0}$. Then there exists an injective holomorphic function ${\psi_0: B(0, r_1) \rightarrow B(0, r_0)}$ and a holomorphic function ${f_1: B(0,r_1) \rightarrow {\bf C}}$ such that

$\displaystyle f_0( \psi_1(z) ) = \psi_1(f_1(z)) \ \ \ \ \ (4)$

for all ${z \in B(0,r_1)}$, and such that

$\displaystyle |\psi_1(z) - z| \ll \delta_0 \eta_0^{-O(1)} r_1$

and

$\displaystyle |f_1(z) - \lambda z| \leq \delta_1 r_1$

for all ${z \in B(0,r_1)}$ and some ${\delta_1 = O(\delta_0^2 \eta_0^{-O(1)})}$.

Proof: By scaling we may normalise ${r_0=1}$. If ${\delta_0 > c \eta_0^C}$ for some constants ${c,C>0}$, then we can simply take ${\psi_1}$ to be the identity and ${f_1=f_0}$, so we may assume that ${\delta_0 \leq c \eta_0^C}$ for some small ${c>0}$ and large ${C>0}$.

To motivate the choice of ${\psi_1}$, we write ${f_0(z) = \lambda z + \hat f_0(z)}$ and ${\psi_1(z) = z + \hat \psi(z)}$, with ${\hat f_0}$ and ${\hat \psi_1}$ viewed as small. We would like to have ${f_0(\psi_1(z)) \approx \psi_1(\lambda z)}$, which expands as

$\displaystyle \lambda z + \lambda \hat \psi_1(z) + \hat f_0( z + \hat \psi_1(z) ) \approx \lambda z + \hat \psi_1(\lambda z).$

As ${\hat f_0}$ and ${\hat \psi}$ are both small, we can heuristically approximate ${\hat f_0(z + \hat \psi_1(z) ) \approx \hat f_0(z)}$ up to quadratic errors (compare with the Newton approximation ${f(x_0+h) \approx f(x_0) + h f'(x_0)}$), and arrive at the equation

$\displaystyle \hat \psi_1(\lambda z) - \lambda \hat \psi_1(z) = \hat f_0(z). \ \ \ \ \ (5)$

This equation can be solved by Taylor series; the function ${\hat f_0}$ vanishes to second order at the origin and thus has a Taylor expansion

$\displaystyle \hat f_0(z) = \sum_{n=2}^\infty a_n z^n$

and then ${\hat \psi_1}$ has a Taylor expansion

$\displaystyle \hat \psi_1(z) = \sum_{n=2}^\infty \frac{a_n}{\lambda^n - \lambda} z^n.$

We take this as our definition of ${\hat \psi_1}$, define ${\psi_1(z) := z + \hat \psi_1(z)}$, and then define ${f_1}$ implicitly via (4).

Let us now justify that this choice works. By (3) and the generalised Cauchy integral formula, we have ${|a_n| \leq \delta_0}$ for all ${n}$; by the Diophantine assumption on ${\lambda}$, we thus have ${|\frac{a_n}{\lambda^n - \lambda}| \ll \delta_0 n^{O(1)}}$. In particular, ${\hat \psi_1}$ converges on ${B(0,1)}$, and on the disk ${B(0, (1-\eta_0/4))}$ (say) we have the bounds

$\displaystyle |\hat \psi_1(z)|, |\hat \psi'_1(z)| \ll \delta_0 \sum_{n=2}^\infty n^{O(1)} (1-\eta_0/4)^n \ll \eta_0^{-O(1)} \delta_0. \ \ \ \ \ (6)$

In particular, as ${\delta_0}$ is so small, we see that ${\psi_1}$ maps ${B(0, (1-\eta_0/4))}$ injectively to ${B(0,1)}$ and ${B(0,1-\eta_0)}$ to ${B(0,1-3\eta_0/4)}$, and the inverse ${\psi_1^{-1}}$ maps ${B(0, (1-\eta_0/2))}$ to ${B(0, (1-\eta_0/4))}$. From (3) we see that ${f_0}$ maps ${B(0,1-3\eta_0/4)}$ to ${B(0,1-\eta_0/2)}$, and so if we set ${f_1: B(0,1-\eta_0) \rightarrow B(0,1-\eta_0/4)}$ to be the function ${f_1 := \psi_1^{-1} \circ f_0 \circ \psi_1}$, then ${f_1}$ is a holomorphic function obeying (4). Expanding (4) in terms of ${\hat f_0}$ and ${\hat \psi_1}$ as before, and also writing ${f_1(z) = \lambda z + \hat f_1(z)}$, we have

$\displaystyle \lambda z + \lambda \hat \psi_1(z) + \hat f_0( z + \hat \psi_1(z) ) = \lambda z + \hat f_1(z) + \hat \psi_1(\lambda z + \hat f_1(z))$

for ${z \in B(0, 1-\eta_0)}$, which by (5) simplifies to

$\displaystyle \hat f_1(z) = \hat f_0( z + \hat \psi_1(z) ) - \hat f_0(z) + \hat \psi_1(\lambda z) - \hat \psi_1(\lambda z + \hat f_1(z)).$

From (6), the fundamental theorem of calculus, and the smallness of ${\delta_0}$ we have

$\displaystyle |\hat \psi_1(\lambda z) - \hat \psi_1(\lambda z + \hat f_1(z))| \leq \frac{1}{2} |\hat f_1(z)|$

and thus

$\displaystyle |\hat f_1(z)| \leq 2 |\hat f_0( z + \hat \psi_1(z) ) - \hat f_0(z)|.$

From (3) and the Cauchy integral formula we have ${\hat f'_0(z) = O( \delta_0 \eta_0^{-O(1)})}$ on (say) ${B(0,1-\eta_0/4)}$, and so from (6) and the fundamental theorem of calculus we conclude that

$\displaystyle |\hat f_1(z)| \ll \delta_0^2 \eta_0^{-O(1)}$

on ${B(0,1-\eta_0)}$, and the claim follows. $\Box$

If we set ${\eta_0 := 1/2}$, ${f_0 := f}$, and ${\delta_0>0}$ to be sufficiently small, then (since ${f(z)-\lambda z}$ vanishes to second order at the origin), the hypotheses of this lemma will be obeyed for some sufficiently small ${r_0}$. Iterating the lemma (and halving ${\eta_0}$ repeatedly), we can then find sequences ${\eta_n, \delta_n, r_n > 0}$, injective holomorphic functions ${\psi_n: B(0,r_n) \rightarrow B(0,r_{n-1})}$ and holomorphic functions ${f_n: B(0,r_n) \rightarrow {\bf C}}$ such that one has the recursive identities and estimates

$\displaystyle \eta_n = \eta_{n-1} / 2$

$\displaystyle r_n = (1 - \eta_{n-1}) r_{n-1}$

$\displaystyle \delta_n = O( \delta_{n-1}^2 \eta_{n-1}^{-O(1)} )$

$\displaystyle |\psi_n(z) - z| \ll \delta_{n-1} \eta_{n-1}^{-O(1)} r_n$

$\displaystyle |f_n(z) - \lambda z| \leq \delta_n r_n$

$\displaystyle f_{n-1}( \psi_n(z) ) = \psi_n(f_n(z))$

for all ${n \geq 1}$ and ${z \in B(0,r_n)}$. By construction, ${r_n}$ decreases to a positive radius ${r_\infty}$ that is a constant multiple of ${r_0}$, while (for ${\delta_0}$ small enough) ${\delta_n}$ converges double-exponentially to zero, so in particular ${f_n(z)}$ converges uniformly to ${\lambda z}$ on ${B(0,r_\infty)}$. Also, ${\psi_n}$ is close enough to the identity, the compositions ${\Psi_n := \psi_1 \circ \dots \circ \psi_n}$ are uniformly convergent on ${B(0,r_\infty/2)}$ with ${\Psi_n(0)=0}$ and ${\Psi'_n(0)=1}$. From this we have

$\displaystyle f( \Psi_n(z) ) = \Psi_n(f_n(z))$

on ${B(0,r_\infty/4)}$, and on taking limits using Morera’s theorem we obtain a holomorphic function ${\Psi}$ defined near ${0}$ with ${\Psi(0)=0}$, ${\Psi'(0)=1}$, and

$\displaystyle f( \Psi(z) ) = \Psi(\lambda z),$

obtaining the required linearisation.

Remark 5 The idea of using a Newton-type method to obtain error terms that decay double-exponentially, and can therefore absorb exponential losses in the iteration, also occurs in KAM theory and in Nash-Moser iteration, presumably due to Siegel’s influence on Moser. (I discuss Nash-Moser iteration in this note that I wrote back in 2006.)

The von Neumann ergodic theorem (the Hilbert space version of the mean ergodic theorem) asserts that if ${U: H \rightarrow H}$ is a unitary operator on a Hilbert space ${H}$, and ${v \in H}$ is a vector in that Hilbert space, then one has

$\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N U^n v = \pi_{H^U} v$

in the strong topology, where ${H^U := \{ w \in H: Uw = w \}}$ is the ${U}$-invariant subspace of ${H}$, and ${\pi_{H^U}}$ is the orthogonal projection to ${H^U}$. (See e.g. these previous lecture notes for a proof.) The same proof extends to more general amenable groups: if ${G}$ is a countable amenable group acting on a Hilbert space ${H}$ by unitary transformations ${T^g: H \rightarrow H}$ for ${g \in G}$, and ${v \in H}$ is a vector in that Hilbert space, then one has

$\displaystyle \lim_{N \rightarrow \infty} \mathop{\bf E}_{g \in \Phi_N} T^g v = \pi_{H^G} v \ \ \ \ \ (1)$

for any Folner sequence ${\Phi_N}$ of ${G}$, where ${H^G := \{ w \in H: T^g w = w \hbox{ for all }g \in G \}}$ is the ${G}$-invariant subspace, and ${\mathop{\bf E}_{a \in A} f(a) := \frac{1}{|A|} \sum_{a \in A} f(a)}$ is the average of ${f}$ on ${A}$. Thus one can interpret ${\pi_{H^G} v}$ as a certain average of elements of the orbit ${Gv := \{ T^g v: g \in G \}}$ of ${v}$.

In a previous blog post, I noted a variant of this ergodic theorem (due to Alaoglu and Birkhoff) that holds even when the group ${G}$ is not amenable (or not discrete), using a more abstract notion of averaging:

Theorem 1 (Abstract ergodic theorem) Let ${G}$ be an arbitrary group acting unitarily on a Hilbert space ${H}$, and let ${v}$ be a vector in ${H}$. Then ${\pi_{H^G} v}$ is the element in the closed convex hull of ${Gv := \{ T^g v: g \in G \}}$ of minimal norm, and is also the unique element of ${H^G}$ in this closed convex hull.

I recently stumbled upon a different way to think about this theorem, in the additive case ${G = (G,+)}$ when ${G}$ is abelian, which has a closer resemblance to the classical mean ergodic theorem. Given an arbitrary additive group ${G = (G,+)}$ (not necessarily discrete, or countable), let ${{\mathcal F}}$ denote the collection of finite non-empty multisets in ${G}$ – that is to say, unordered collections ${\{a_1,\dots,a_n\}}$ of elements ${a_1,\dots,a_n}$ of ${G}$, not necessarily distinct, for some positive integer ${n}$. Given two multisets ${A = \{a_1,\dots,a_n\}}$, ${B = \{b_1,\dots,b_m\}}$ in ${{\mathcal F}}$, we can form the sum set ${A + B := \{ a_i + b_j: 1 \leq i \leq n, 1 \leq j \leq m \}}$. Note that the sum set ${A+B}$ can contain multiplicity even when ${A, B}$ do not; for instance, ${\{ 1,2\} + \{1,2\} = \{2,3,3,4\}}$. Given a multiset ${A = \{a_1,\dots,a_n\}}$ in ${{\mathcal F}}$, and a function ${f: G \rightarrow H}$ from ${G}$ to a vector space ${H}$, we define the average ${\mathop{\bf E}_{a \in A} f(a)}$ as

$\displaystyle \mathop{\bf E}_{a \in A} f(a) = \frac{1}{n} \sum_{j=1}^n f(a_j).$

Note that the multiplicity function of the set ${A}$ affects the average; for instance, we have ${\mathop{\bf E}_{a \in \{1,2\}} a = \frac{3}{2}}$, but ${\mathop{\bf E}_{a \in \{1,2,2\}} a = \frac{5}{3}}$.

We can define a directed set on ${{\mathcal F}}$ as follows: given two multisets ${A,B \in {\mathcal F}}$, we write ${A \geq B}$ if we have ${A = B+C}$ for some ${C \in {\mathcal F}}$. Thus for instance we have ${\{ 1, 2, 2, 3\} \geq \{1,2\}}$. It is easy to verify that this operation is transitive and reflexive, and is directed because any two elements ${A,B}$ of ${{\mathcal F}}$ have a common upper bound, namely ${A+B}$. (This is where we need ${G}$ to be abelian.) The notion of convergence along a net, now allows us to define the notion of convergence along ${{\mathcal F}}$; given a family ${x_A}$ of points in a topological space ${X}$ indexed by elements ${A}$ of ${{\mathcal F}}$, and a point ${x}$ in ${X}$, we say that ${x_A}$ converges to ${x}$ along ${{\mathcal F}}$ if, for every open neighbourhood ${U}$ of ${x}$ in ${X}$, one has ${x_A \in U}$ for sufficiently large ${A}$, that is to say there exists ${B \in {\mathcal F}}$ such that ${x_A \in U}$ for all ${A \geq B}$. If the topological space ${V}$ is Hausdorff, then the limit ${x}$ is unique (if it exists), and we then write

$\displaystyle x = \lim_{A \rightarrow G} x_A.$

When ${x_A}$ takes values in the reals, one can also define the limit superior or limit inferior along such nets in the obvious fashion.

We can then give an alternate formulation of the abstract ergodic theorem in the abelian case:

Theorem 2 (Abelian abstract ergodic theorem) Let ${G = (G,+)}$ be an arbitrary additive group acting unitarily on a Hilbert space ${H}$, and let ${v}$ be a vector in ${H}$. Then we have

$\displaystyle \pi_{H^G} v = \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v$

in the strong topology of ${H}$.

Proof: Suppose that ${A \geq B}$, so that ${A=B+C}$ for some ${C \in {\mathcal F}}$, then

$\displaystyle \mathop{\bf E}_{a \in A} T^a v = \mathop{\bf E}_{c \in C} T^c ( \mathop{\bf E}_{b \in B} T^b v )$

so by unitarity and the triangle inequality we have

$\displaystyle \| \mathop{\bf E}_{a \in A} T^a v \|_H \leq \| \mathop{\bf E}_{b \in B} T^b v \|_H,$

thus ${\| \mathop{\bf E}_{a \in A} T^a v \|_H^2}$ is monotone non-increasing in ${A}$. Since this quantity is bounded between ${0}$ and ${\|v\|_H}$, we conclude that the limit ${\lim_{A \rightarrow G} \| \mathop{\bf E}_{a \in A} T^a v \|_H^2}$ exists. Thus, for any ${\varepsilon > 0}$, we have for sufficiently large ${A}$ that

$\displaystyle \| \mathop{\bf E}_{b \in B} T^b v \|_H^2 \geq \| \mathop{\bf E}_{a \in A} T^a v \|_H^2 - \varepsilon$

for all ${B \geq A}$. In particular, for any ${g \in G}$, we have

$\displaystyle \| \mathop{\bf E}_{b \in A + \{0,g\}} T^b v \|_H^2 \geq \| \mathop{\bf E}_{a \in A} T^a v \|_H^2 - \varepsilon.$

We can write

$\displaystyle \mathop{\bf E}_{b \in A + \{0,g\}} T^b v = \frac{1}{2} \mathop{\bf E}_{a \in A} T^a v + \frac{1}{2} T^g \mathop{\bf E}_{a \in A} T^a v$

and so from the parallelogram law and unitarity we have

$\displaystyle \| \mathop{\bf E}_{a \in A} T^a v - T^g \mathop{\bf E}_{a \in A} T^a v \|_H^2 \leq 4 \varepsilon$

for all ${g \in G}$, and hence by the triangle inequality (averaging ${g}$ over a finite multiset ${C}$)

$\displaystyle \| \mathop{\bf E}_{a \in A} T^a v - \mathop{\bf E}_{b \in A+C} T^b v \|_H^2 \leq 4 \varepsilon$

for any ${C \in {\mathcal F}}$. This shows that ${\mathop{\bf E}_{a \in A} T^a v}$ is a Cauchy sequence in ${H}$ (in the strong topology), and hence (by the completeness of ${H}$) tends to a limit. Shifting ${A}$ by a group element ${g}$, we have

$\displaystyle \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v = \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A + \{g\}} T^a v = T^g \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v$

and hence ${\lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v}$ is invariant under shifts, and thus lies in ${H^G}$. On the other hand, for any ${w \in H^G}$ and ${A \in {\mathcal F}}$, we have

$\displaystyle \langle \mathop{\bf E}_{a \in A} T^a v, w \rangle_H = \mathop{\bf E}_{a \in A} \langle v, T^{-a} w \rangle_H = \langle v, w \rangle_H$

and thus on taking strong limits

$\displaystyle \langle \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v, w \rangle_H = \langle v, w \rangle_H$

and so ${v - \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v}$ is orthogonal to ${H^G}$. Combining these two facts we see that ${\lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^a v}$ is equal to ${\pi_{H^G} v}$ as claimed. $\Box$

To relate this result to the classical ergodic theorem, we observe

Lemma 3 Let ${G}$ be a countable additive group, with a F{\o}lner sequence ${\Phi_n}$, and let ${f_g}$ be a bounded sequence in a normed vector space indexed by ${G}$. If ${\lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} f_a}$ exists, then ${\lim_{n \rightarrow \infty} \mathop{\bf E}_{a \in \Phi_n} f_a}$ exists, and the two limits are equal.

Proof: From the F{\o}lner property, we see that for any ${A}$ and any ${\varepsilon>0}$, the averages ${\mathop{\bf E}_{a \in \Phi_n} f_a}$ and ${\mathop{\bf E}_{a \in A+\Phi_n} f_a}$ differ by at most ${\varepsilon}$ in norm if ${n}$ is sufficiently large depending on ${A}$, ${\varepsilon}$ (and the ${f_a}$). On the other hand, by the existence of the limit ${\lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} f_a}$, the averages ${\mathop{\bf E}_{a \in A} f_a}$ and ${\mathop{\bf E}_{a \in A + \Phi_n} f_a}$ differ by at most ${\varepsilon}$ in norm if ${A}$ is sufficiently large depending on ${\varepsilon}$ (regardless of how large ${n}$ is). The claim follows. $\Box$

It turns out that this approach can also be used as an alternate way to construct the GowersHost-Kra seminorms in ergodic theory, which has the feature that it does not explicitly require any amenability on the group ${G}$ (or separability on the underlying measure space), though, as pointed out to me in comments, even uncountable abelian groups are amenable in the sense of possessing an invariant mean, even if they do not have a F{\o}lner sequence.

Given an arbitrary additive group ${G}$, define a ${G}$-system ${({\mathrm X}, T)}$ to be a probability space ${{\mathrm X} = (X, {\mathcal X}, \mu)}$ (not necessarily separable or standard Borel), together with a collection ${T^g: X \rightarrow X}$ of invertible, measure-preserving maps, such that ${T^0}$ is the identity and ${T^g T^h = T^{g+h}}$ (modulo null sets) for all ${g,h \in G}$. This then gives isomorphisms ${T^g: L^p({\mathrm X}) \rightarrow L^p({\mathrm X})}$ for ${1 \leq p \leq \infty}$ by setting ${T^g f(x) := f(T^{-g} x)}$. From the above abstract ergodic theorem, we see that

$\displaystyle {\mathbf E}( f | {\mathcal X}^G ) = \lim_{A \rightarrow G} \mathop{\bf E}_{a \in A} T^g f$

in the strong topology of ${L^2({\mathrm X})}$ for any ${f \in L^2({\mathrm X})}$, where ${{\mathcal X}^G}$ is the collection of measurable sets ${E}$ that are essentially ${G}$-invariant in the sense that ${T^g E = E}$ modulo null sets for all ${g \in G}$, and ${{\mathbf E}(f|{\mathcal X}^G)}$ is the conditional expectation of ${f}$ with respect to ${{\mathcal X}^G}$.

In a similar spirit, we have

Theorem 4 (Convergence of Gowers-Host-Kra seminorms) Let ${({\mathrm X},T)}$ be a ${G}$-system for some additive group ${G}$. Let ${d}$ be a natural number, and for every ${\omega \in\{0,1\}^d}$, let ${f_\omega \in L^{2^d}({\mathrm X})}$, which for simplicity we take to be real-valued. Then the expression

$\displaystyle \langle (f_\omega)_{\omega \in \{0,1\}^d} \rangle_{U^d({\mathrm X})} := \lim_{A_1,\dots,A_d \rightarrow G}$

$\displaystyle \mathop{\bf E}_{h_1 \in A_1-A_1,\dots,h_d \in A_d-A_d} \int_X \prod_{\omega \in \{0,1\}^d} T^{\omega_1 h_1 + \dots + \omega_d h_d} f_\omega\ d\mu$

converges, where we write ${\omega = (\omega_1,\dots,\omega_d)}$, and we are using the product direct set on ${{\mathcal F}^d}$ to define the convergence ${A_1,\dots,A_d \rightarrow G}$. In particular, for ${f \in L^{2^d}({\mathrm X})}$, the limit

$\displaystyle \| f \|_{U^d({\mathrm X})}^{2^d} = \lim_{A_1,\dots,A_d \rightarrow G}$

$\displaystyle \mathop{\bf E}_{h_1 \in A_1-A_1,\dots,h_d \in A_d-A_d} \int_X \prod_{\omega \in \{0,1\}^d} T^{\omega_1 h_1 + \dots + \omega_d h_d} f\ d\mu$

converges.

We prove this theorem below the fold. It implies a number of other known descriptions of the Gowers-Host-Kra seminorms ${\|f\|_{U^d({\mathrm X})}}$, for instance that

$\displaystyle \| f \|_{U^d({\mathrm X})}^{2^d} = \lim_{A \rightarrow G} \mathop{\bf E}_{h \in A-A} \| f T^h f \|_{U^{d-1}({\mathrm X})}^{2^{d-1}}$

for ${d > 1}$, while from the ergodic theorem we have

$\displaystyle \| f \|_{U^1({\mathrm X})} = \| {\mathbf E}( f | {\mathcal X}^G ) \|_{L^2({\mathrm X})}.$

This definition also manifestly demonstrates the cube symmetries of the Host-Kra measures ${\mu^{[d]}}$ on ${X^{\{0,1\}^d}}$, defined via duality by requiring that

$\displaystyle \langle (f_\omega)_{\omega \in \{0,1\}^d} \rangle_{U^d({\mathrm X})} = \int_{X^{\{0,1\}^d}} \bigotimes_{\omega \in \{0,1\}^d} f_\omega\ d\mu^{[d]}.$

In a subsequent blog post I hope to present a more detailed study of the ${U^2}$ norm and its relationship with eigenfunctions and the Kronecker factor, without assuming any amenability on ${G}$ or any separability or topological structure on ${{\mathrm X}}$.

The 2014 Fields medallists have just been announced as (in alphabetical order of surname) Artur Avila, Manjul Bhargava, Martin Hairer, and Maryam Mirzakhani (see also these nice video profiles for the winners, which is a new initiative of the IMU and the Simons foundation). This time four years ago, I wrote a blog post discussing one result from each of the 2010 medallists; I thought I would try to repeat the exercise here, although the work of the medallists this time around is a little bit further away from my own direct area of expertise than last time, and so my discussion will unfortunately be a bit superficial (and possibly not completely accurate) in places. As before, I am picking these results based on my own idiosyncratic tastes, and they should not be viewed as necessarily being the “best” work of these medallists. (See also the press releases for Avila, Bhargava, Hairer, and Mirzakhani.)

Artur Avila works in dynamical systems and in the study of Schrödinger operators. The work of Avila that I am most familiar with is his solution with Svetlana Jitormiskaya of the ten martini problem of Kac, the solution to which (according to Barry Simon) he offered ten martinis for, hence the name. The problem involves perhaps the simplest example of a Schrödinger operator with non-trivial spectral properties, namely the almost Mathieu operator ${H^{\lambda,\alpha}_\omega: \ell^2({\bf Z}) \rightarrow \ell^2({\bf Z})}$ defined for parameters ${\alpha,\omega \in {\bf R}/{\bf Z}}$ and ${\lambda>0}$ by a discrete one-dimensional Schrödinger operator with cosine potential:

$\displaystyle (H^{\lambda,\alpha}_\omega u)_n := u_{n+1} + u_{n-1} + 2\lambda (\cos 2\pi(\theta+n\alpha)) u_n.$

This is a bounded self-adjoint operator and thus has a spectrum ${\sigma( H^{\lambda,\alpha}_\omega )}$ that is a compact subset of the real line; it arises in a number of physical contexts, most notably in the theory of the integer quantum Hall effect, though I will not discuss these applications here. Remarkably, the structure of this spectrum depends crucially on the Diophantine properties of the frequency ${\alpha}$. For instance, if ${\alpha = p/q}$ is a rational number, then the operator is periodic with period ${q}$, and then basic (discrete) Floquet theory tells us that the spectrum is simply the union of ${q}$ (possibly touching) intervals. But for irrational ${\alpha}$ (in which case the spectrum is independent of the phase ${\theta}$), the situation is much more fractal in nature, for instance in the critical case ${\lambda=1}$ the spectrum (as a function of ${\alpha}$) gives rise to the Hofstadter butterfly. The “ten martini problem” asserts that for every irrational ${\alpha}$ and every choice of coupling constant ${\lambda > 0}$, the spectrum is homeomorphic to a Cantor set. Prior to the work of Avila and Jitormiskaya, there were a number of partial results on this problem, notably the result of Puig establishing Cantor spectrum for a full measure set of parameters ${(\lambda,\alpha)}$, as well as results requiring a perturbative hypothesis, such as ${\lambda}$ being very small or very large. The result was also already known for ${\alpha}$ being either very close to rational (i.e. a Liouville number) or very far from rational (a Diophantine number), although the analyses for these two cases failed to meet in the middle, leaving some cases untreated. The argument uses a wide variety of existing techniques, both perturbative and non-perturbative, to attack this problem, as well as an amusing argument by contradiction: they assume (in certain regimes) that the spectrum fails to be a Cantor set, and use this hypothesis to obtain additional Lipschitz control on the spectrum (as a function of the frequency ${\alpha}$), which they can then use (after much effort) to improve existing arguments and conclude that the spectrum was in fact Cantor after all!

Manjul Bhargava produces amazingly beautiful mathematics, though most of it is outside of my own area of expertise. One part of his work that touches on an area of my own interest (namely, random matrix theory) is his ongoing work with many co-authors on modeling (both conjecturally and rigorously) the statistics of various key number-theoretic features of elliptic curves (such as their rank, their Selmer group, or their Tate-Shafarevich groups). For instance, with Kane, Lenstra, Poonen, and Rains, Manjul has proposed a very general random matrix model that predicts all of these statistics (for instance, predicting that the ${p}$-component of the Tate-Shafarevich group is distributed like the cokernel of a certain random ${p}$-adic matrix, very much in the spirit of the Cohen-Lenstra heuristics discussed in this previous post). But what is even more impressive is that Manjul and his coauthors have been able to verify several non-trivial fragments of this model (e.g. showing that certain moments have the predicted asymptotics), giving for the first time non-trivial upper and lower bounds for various statistics, for instance obtaining lower bounds on how often an elliptic curve has rank ${0}$ or rank ${1}$, leading most recently (in combination with existing work of Gross-Zagier and of Kolyvagin, among others) to his amazing result with Skinner and Zhang that at least ${66\%}$ of all elliptic curves over ${{\bf Q}}$ (ordered by height) obey the Birch and Swinnerton-Dyer conjecture. Previously it was not even known that a positive proportion of curves obeyed the conjecture. This is still a fair ways from resolving the conjecture fully (in particular, the situation with the presumably small number of curves of rank ${2}$ and higher is still very poorly understood, and the theory of Gross-Zagier and Kolyvagin that this work relies on, which was initially only available for ${{\bf Q}}$, has only been extended to totally real number fields thus far, by the work of Zhang), but it certainly does provide hope that the conjecture could be within reach in a statistical sense at least.

Martin Hairer works in at the interface between probability and partial differential equations, and in particular in the theory of stochastic differential equations (SDEs). The result of his that is closest to my own interests is his remarkable demonstration with Jonathan Mattingly of unique invariant measure for the two-dimensional stochastically forced Navier-Stokes equation

$\displaystyle \partial_t u + (u \cdot \nabla u) = \nu \Delta u - \nabla p + \xi$

$\displaystyle \nabla \cdot u = 0$

on the two-torus ${({\bf R}/{\bf Z})^2}$, where ${\xi}$ is a Gaussian field that forces a fixed set of frequencies. It is expected that for any reasonable choice of initial data, the solution to this equation should asymptotically be distributed according to Kolmogorov’s power law, as discussed in this previous post. This is still far from established rigorously (although there are some results in this direction for dyadic models, see e.g. this paper of Cheskidov, Shvydkoy, and Friedlander). However, Hairer and Mattingly were able to show that there was a unique probability distribution to almost every initial data would converge to asymptotically; by the ergodic theorem, this is equivalent to demonstrating the existence and uniqueness of an invariant measure for the flow. Existence can be established using standard methods, but uniqueness is much more difficult. One of the standard routes to uniqueness is to establish a “strong Feller property” that enforces some continuity on the transition operators; among other things, this would mean that two ergodic probability measures with intersecting supports would in fact have a non-trivial common component, contradicting the ergodic theorem (which forces different ergodic measures to be mutually singular). Since all ergodic measures for Navier-Stokes can be seen to contain the origin in their support, this would give uniqueness. Unfortunately, the strong Feller property is unlikely to hold in the infinite-dimensional phase space for Navier-Stokes; but Hairer and Mattingly develop a clean abstract substitute for this property, which they call the asymptotic strong Feller property, which is again a regularity property on the transition operator; this in turn is then demonstrated by a careful application of Malliavin calculus.

Maryam Mirzakhani has mostly focused on the geometry and dynamics of Teichmuller-type moduli spaces, such as the moduli space of Riemann surfaces with a fixed genus and a fixed number of cusps (or with a fixed number of boundaries that are geodesics of a prescribed length). These spaces have an incredibly rich structure, ranging from geometric structure (such as the Kahler geometry given by the Weil-Petersson metric), to dynamical structure (through the action of the mapping class group on this and related spaces), to algebraic structure (viewing these spaces as algebraic varieties), and are thus connected to many other objects of interest in geometry and dynamics. For instance, by developing a new recursive formula for the Weil-Petersson volume of this space, Mirzakhani was able to asymptotically count the number of simple prime geodesics of length up to some threshold ${L}$ in a hyperbolic surface (or more precisely, she obtained asymptotics for the number of such geodesics in a given orbit of the mapping class group); the answer turns out to be polynomial in ${L}$, in contrast to the much larger class of non-simple prime geodesics, whose asymptotics are exponential in ${L}$ (the “prime number theorem for geodesics”, developed in a classic series of works by Delsart, Huber, Selberg, and Margulis); she also used this formula to establish a new proof of a conjecture of Witten on intersection numbers that was first proven by Kontsevich. More recently, in two lengthy papers with Eskin and with Eskin-Mohammadi, Mirzakhani established rigidity theorems for the action of ${SL_2({\bf R})}$ on such moduli spaces that are close analogues of Ratner’s celebrated rigidity theorems for unipotently generated groups (discussed in this previous blog post). Ratner’s theorems are already notoriously difficult to prove, and rely very much on the polynomial stability properties of unipotent flows; in this even more complicated setting, the unipotent flows are no longer tractable, and Mirzakhani instead uses a recent “exponential drift” method of Benoist and Quint with as a substitute. Ratner’s theorems are incredibly useful for all sorts of problems connected to homogeneous dynamics, and the analogous theorems established by Mirzakhani, Eskin, and Mohammadi have a similarly broad range of applications, for instance in counting periodic billiard trajectories in rational polygons.

As laid out in the foundational work of Kolmogorov, a classical probability space (or probability space for short) is a triplet ${(X, {\mathcal X}, \mu)}$, where ${X}$ is a set, ${{\mathcal X}}$ is a ${\sigma}$-algebra of subsets of ${X}$, and ${\mu: {\mathcal X} \rightarrow [0,1]}$ is a countably additive probability measure on ${{\mathcal X}}$. Given such a space, one can form a number of interesting function spaces, including

• the (real) Hilbert space ${L^2(X, {\mathcal X}, \mu)}$ of square-integrable functions ${f: X \rightarrow {\bf R}}$, modulo ${\mu}$-almost everywhere equivalence, and with the positive definite inner product ${\langle f, g\rangle_{L^2(X, {\mathcal X}, \mu)} := \int_X f g\ d\mu}$; and
• the unital commutative Banach algebra ${L^\infty(X, {\mathcal X}, \mu)}$ of essentially bounded functions ${f: X \rightarrow {\bf R}}$, modulo ${\mu}$-almost everywhere equivalence, with ${\|f\|_{L^\infty(X, {\mathcal X}, \mu)}}$ defined as the essential supremum of ${|f|}$.

There is also a trace ${\tau = \tau_\mu: L^\infty(X, {\mathcal X}, \mu) \rightarrow {\bf C}}$ on ${L^\infty}$ defined by integration: ${\tau(f) := \int_X f\ d\mu}$.

One can form the category ${\mathbf{Prb}}$ of classical probability spaces, by defining a morphism ${\phi: (X, {\mathcal X}, \mu) \rightarrow (Y, {\mathcal Y}, \nu)}$ between probability spaces to be a function ${\phi: X \rightarrow Y}$ which is measurable (thus ${\phi^{-1}(E) \in {\mathcal X}}$ for all ${E \in {\mathcal Y}}$) and measure-preserving (thus ${\mu(\phi^{-1}(E)) = \nu(E)}$ for all ${E \in {\mathcal Y}}$).

Let us now abstract the algebraic features of these spaces as follows; for want of a better name, I will refer to this abstraction as an algebraic probability space, and is very similar to the non-commutative probability spaces studied in this previous post, except that these spaces are now commutative (and real).

Definition 1 An algebraic probability space is a pair ${({\mathcal A}, \tau)}$ where

• ${{\mathcal A}}$ is a unital commutative real algebra;
• ${\tau: {\mathcal A} \rightarrow {\bf R}}$ is a homomorphism such that ${\tau(1)=1}$ and ${\tau( f^2 ) \geq 0}$ for all ${f \in {\mathcal A}}$;
• Every element ${f}$ of ${{\mathcal A}}$ is bounded in the sense that ${\sup_{k \geq 1} \tau( f^{2k} )^{1/2k} < \infty}$. (Technically, this isn’t an algebraic property, but I need it for technical reasons.)

A morphism ${\phi: ({\mathcal A}_1, \tau_1) \rightarrow ({\mathcal A}_2, \tau_2)}$ is a homomorphism ${\phi^*: {\mathcal A}_2 \rightarrow {\mathcal A}_1}$ which is trace-preserving, in the sense that ${\tau_1(\phi^*(f)) = \tau_2(f)}$ for all ${f \in {\mathcal A}_2}$.

For want of a better name, I’ll denote the category of algebraic probability spaces as ${\mathbf{AlgPrb}}$. One can view this category as the opposite category to that of (a subcategory of) the category of tracial commutative real algebras. One could emphasise this opposite nature by denoting the algebraic probability space as ${({\mathcal A}, \tau)^{op}}$ rather than ${({\mathcal A},\tau)}$; another suggestive (but slightly inaccurate) notation, inspired by the language of schemes, would be ${\hbox{Spec}({\mathcal A},\tau)}$ rather than ${({\mathcal A},\tau)}$. However, we will not adopt these conventions here, and refer to algebraic probability spaces just by the pair ${({\mathcal A},\tau)}$.

By the previous discussion, we have a covariant functor ${F: \textbf{Prb} \rightarrow \textbf{AlgPrb}}$ that takes a classical probability space ${(X, {\mathcal X}, \mu)}$ to its algebraic counterpart ${(L^\infty(X, {\mathcal X},\mu), \tau_\mu)}$, with a morphism ${\phi: (X, {\mathcal X}, \mu) \rightarrow (Y, {\mathcal Y}, \nu)}$ of classical probability spaces mapping to a morphism ${F(\phi): (L^\infty(X, {\mathcal X},\mu), \tau_\mu) \rightarrow (L^\infty(Y, {\mathcal Y},\nu), \tau_\nu)}$ of the corresponding algebraic probability spaces by the formula

$\displaystyle F(\phi)^* f := f \circ \phi$

for ${f \in L^\infty(Y, {\mathcal Y}, \nu)}$. One easily verifies that this is a functor.

In this post I would like to describe a functor ${G: \textbf{AlgPrb} \rightarrow \textbf{Prb}}$ which partially inverts ${F}$ (up to natural isomorphism), that is to say a recipe for starting with an algebraic probability space ${({\mathcal A}, \tau)}$ and producing a classical probability space ${(X, {\mathcal X}, \mu)}$. This recipe is not new – it is basically the (commutative) Gelfand-Naimark-Segal construction (discussed in this previous post) combined with the Loomis-Sikorski theorem (discussed in this previous post). However, I wanted to put the construction in a single location for sake of reference. I also wanted to make the point that ${F}$ and ${G}$ are not complete inverses; there is a bit of information in the algebraic probability space (e.g. topological information) which is lost when passing back to the classical probability space. In some future posts, I would like to develop some ergodic theory using the algebraic foundations of probability theory rather than the classical foundations; this turns out to be convenient in the ergodic theory arising from nonstandard analysis (such as that described in this previous post), in which the groups involved are uncountable and the underlying spaces are not standard Borel spaces.

Let us describe how to construct the functor ${G}$, with details postponed to below the fold.

1. Starting with an algebraic probability space ${({\mathcal A}, \tau)}$, form an inner product on ${{\mathcal A}}$ by the formula ${\langle f, g \rangle := \tau(fg)}$, and also form the spectral radius ${\rho(f) :=\lim_{k \rightarrow \infty} \tau(f^{2^k})^{1/2^k}}$.
2. The inner product is clearly positive semi-definite. Quotienting out the null vectors and taking completions, we arrive at a real Hilbert space ${L^2 = L^2({\mathcal A},\tau)}$, to which the trace ${\tau}$ may be extended.
3. Somewhat less obviously, the spectral radius is well-defined and gives a norm on ${{\mathcal A}}$. Taking ${L^2}$ limits of sequences in ${{\mathcal A}}$ of bounded spectral radius gives us a subspace ${L^\infty = L^\infty({\mathcal A},\tau)}$ of ${L^2}$ that has the structure of a real commutative Banach algebra.
4. The idempotents ${1_E}$ of the Banach algebra ${L^\infty}$ may be indexed by elements ${E}$ of an abstract ${\sigma}$-algebra ${{\mathcal B}}$.
5. The Boolean algebra homomorphisms ${\delta_x: {\mathcal B} \rightarrow \{0,1\}}$ (or equivalently, the real algebra homomorphisms ${\iota_x: L^\infty \rightarrow {\bf R}}$) may be indexed by elements ${x}$ of a space ${X}$.
6. Let ${{\mathcal X}}$ denote the ${\sigma}$-algebra on ${X}$ generated by the basic sets ${\overline{E} := \{ x \in X: \delta_x(E) = 1 \}}$ for every ${E \in {\mathcal B}}$.
7. Let ${{\mathcal N}}$ be the ${\sigma}$-ideal of ${{\mathcal X}}$ generated by the sets ${\bigcap_n \overline{E_n}}$, where ${E_n \in {\mathcal B}}$ is a sequence with ${\bigcap_n E_n = \emptyset}$.
8. One verifies that ${{\mathcal B}}$ is isomorphic to ${{\mathcal X}/{\mathcal N}}$. Using this isomorphism, the trace ${\tau}$ on ${L^\infty}$ can be used to construct a countably additive measure ${\mu}$ on ${{\mathcal X}}$. The classical probability space ${(X, {\mathcal X}, \mu)}$ is then ${G( {\mathcal A}, \tau )}$, and the abstract spaces ${L^2, L^\infty}$ may now be identified with their concrete counterparts ${L^2(X, {\mathcal X}, \mu)}$, ${L^\infty(X, {\mathcal X}, \mu)}$.
9. Every algebraic probability space morphism ${\phi: ({\mathcal A}_1,\tau_1) \rightarrow ({\mathcal A}_2,\tau_2)}$ generates a classical probability morphism ${G(\phi): (X_1, {\mathcal X}_1, \mu_1) \rightarrow (X_2, {\mathcal X}_2, \mu_2)}$ via the formula

$\displaystyle \delta_{G(\phi)(x_1)}( E_2 ) = \delta_{x_1}( \phi^*(E_2) )$

using a pullback operation ${\phi^*}$ on the abstract ${\sigma}$-algebras ${{\mathcal B}_1, {\mathcal B}_2}$ that can be defined by density.

Remark 1 The classical probability space ${X}$ constructed by the functor ${G}$ has some additional structure; namely ${X}$ is a ${\sigma}$-Stone space (a Stone space with the property that the closure of any countable union of clopen sets is clopen), ${{\mathcal X}}$ is the Baire ${\sigma}$-algebra (generated by the clopen sets), and the null sets are the meager sets. However, we will not use this additional structure here.

The partial inversion relationship between the functors ${F: \textbf{Prb} \rightarrow \textbf{AlgPrb}}$ and ${G: \textbf{AlgPrb} \rightarrow \textbf{Prb}}$ is given by the following assertion:

1. There is a natural transformation from ${F \circ G: \textbf{AlgPrb} \rightarrow \textbf{AlgPrb}}$ to the identity functor ${I: \textbf{AlgPrb} \rightarrow \textbf{AlgPrb}}$.

More informally: if one starts with an algebraic probability space ${({\mathcal A},\tau)}$ and converts it back into a classical probability space ${(X, {\mathcal X}, \mu)}$, then there is a trace-preserving algebra homomorphism of ${{\mathcal A}}$ to ${L^\infty( X, {\mathcal X}, \mu )}$, which respects morphisms of the algebraic probability space. While this relationship is far weaker than an equivalence of categories (which would require that ${F \circ G}$ and ${G \circ F}$ are both natural isomorphisms), it is still good enough to allow many ergodic theory problems formulated using classical probability spaces to be reformulated instead as an equivalent problem in algebraic probability spaces.

Remark 2 The opposite composition ${G \circ F: \textbf{Prb} \rightarrow \textbf{Prb}}$ is a little odd: it takes an arbitrary probability space ${(X, {\mathcal X}, \mu)}$ and returns a more complicated probability space ${(X', {\mathcal X}', \mu')}$, with ${X'}$ being the space of homomorphisms ${\iota_x: L^\infty(X, {\mathcal X}, \mu) \rightarrow {\bf R}}$. while there is “morally” an embedding of ${X}$ into ${X'}$ using the evaluation map, this map does not exist in general because points in ${X}$ may well have zero measure. However, if one takes a “pointless” approach and focuses just on the measure algebras ${({\mathcal X}, \mu)}$, ${({\mathcal X}', \mu')}$, then these algebras become naturally isomorphic after quotienting out by null sets.

Remark 3 An algebraic probability space captures a bit more structure than a classical probability space, because ${{\mathcal A}}$ may be identified with a proper subset of ${L^\infty}$ that describes the “regular” functions (or random variables) of the space. For instance, starting with the unit circle ${{\bf R}/{\bf Z}}$ (with the usual Haar measure and the usual trace ${\tau(f) = \int_{{\bf R}/{\bf Z}} f}$), any unital subalgebra ${{\mathcal A}}$ of ${L^\infty({\bf R}/{\bf Z})}$ that is dense in ${L^2({\bf R}/{\bf Z})}$ will generate the same classical probability space ${G( {\mathcal A}, \tau )}$ on applying the functor ${G}$, namely one will get the space ${({\bf R}/{\bf Z})'}$ of homomorphisms from ${L^\infty({\bf R}/{\bf Z})}$ to ${{\bf R}}$ (with the measure induced from ${\tau}$). Thus for instance ${{\mathcal A}}$ could be the continuous functions ${C( {\bf R}/{\bf Z} )}$, the Wiener algebra ${A({\bf R}/{\bf Z})}$ or the full space ${L^\infty({\bf R}/{\bf Z})}$, but the classical space ${G( {\mathcal A}, \tau )}$ will be unable to distinguish these spaces from each other. In particular, the functor ${F \circ G}$ loses information (roughly speaking, this functor takes an algebraic probability space and completes it to a von Neumann algebra, but then forgets exactly what algebra was initially used to create this completion). In ergodic theory, this sort of “extra structure” is traditionally encoded in topological terms, by assuming that the underlying probability space ${X}$ has a nice topological structure (e.g. a standard Borel space); however, with the algebraic perspective one has the freedom to have non-topological notions of extra structure, by choosing ${{\mathcal A}}$ to be something other than an algebra ${C(X)}$ of continuous functions on a topological space. I hope to discuss one such example of extra structure (coming from the Gowers-Host-Kra theory of uniformity seminorms) in a later blog post (this generalises the example of the Wiener algebra given previously, which is encoding “Fourier structure”).

A small example of how one could use the functors ${F, G}$ is as follows. Suppose one has a classical probability space ${(X, {\mathcal X}, \mu)}$ with a measure-preserving action of an uncountable group ${\Gamma}$, which is only defined (and an action) up to almost everywhere equivalence; thus for instance for any set ${E}$ and any ${g, h \in \Gamma}$, ${T^{gh} E}$ and ${T^g T^h E}$ might not be exactly equal, but only equal up to a null set. For similar reasons, an element ${E}$ of the invariant factor ${{\mathcal X}^\Gamma}$ might not be exactly invariant with respect to ${\Gamma}$, but instead one only has ${T^g E}$ and ${E}$ equal up to null sets for each ${g \in \Gamma}$. One might like to “clean up” the action of ${\Gamma}$ to make it defined everywhere, and a genuine action everywhere, but this is not immediately achievable if ${\Gamma}$ is uncountable, since the union of all the null sets where something bad occurs may cease to be a null set. However, by applying the functor ${F}$, each shift ${T^g: X \rightarrow X}$ defines a morphism ${T^g: L^\infty(X, {\mathcal X}, \mu) \rightarrow L^\infty(X, {\mathcal X}, \mu)}$ on the associated algebraic probability space (i.e. the Koopman operator), and then applying ${G}$, we obtain a shift ${T^g: X' \rightarrow X'}$ on a new classical probability space ${(X', {\mathcal X}', \mu')}$ which now gives a genuine measure-preserving action of ${\Gamma}$, and which is equivalent to the original action from a measure algebra standpoint. The invariant factor ${{\mathcal X}^\Gamma}$ now consists of those sets in ${{\mathcal X}'}$ which are genuinely ${\Gamma}$-invariant, not just up to null sets. (Basically, the classical probability space ${(X', {\mathcal X}', \mu')}$ contains a Boolean algebra ${\overline{\mathcal B}}$ with the property that every measurable set ${A \in {\mathcal X}'}$ is equivalent up to null sets to precisely one set in ${\overline{\mathcal B}}$, allowing for a canonical “retraction” onto ${\overline{\mathcal B}}$ that eliminates all null set issues.)

More indirectly, the functors ${F, G}$ suggest that one should be able to develop a “pointless” form of ergodic theory, in which the underlying probability spaces are given algebraically rather than classically. I hope to give some more specific examples of this in later posts.

There are a number of ways to construct the real numbers ${{\bf R}}$, for instance

• as the metric completion of ${{\bf Q}}$ (thus, ${{\bf R}}$ is defined as the set of Cauchy sequences of rationals, modulo Cauchy equivalence);
• as the space of Dedekind cuts on the rationals ${{\bf Q}}$;
• as the space of quasimorphisms ${\phi: {\bf Z} \rightarrow {\bf Z}}$ on the integers, quotiented by bounded functions. (I believe this construction first appears in this paper of Street, who credits the idea to Schanuel, though the germ of this construction arguably goes all the way back to Eudoxus.)

There is also a fourth family of constructions that proceeds via nonstandard analysis, as a special case of what is known as the nonstandard hull construction. (Here I will assume some basic familiarity with nonstandard analysis and ultraproducts, as covered for instance in this previous blog post.) Given an unbounded nonstandard natural number ${N \in {}^* {\bf N} \backslash {\bf N}}$, one can define two external additive subgroups of the nonstandard integers ${{}^* {\bf Z}}$:

• The group ${O(N) := \{ n \in {}^* {\bf Z}: |n| \leq CN \hbox{ for some } C \in {\bf N} \}}$ of all nonstandard integers of magnitude less than or comparable to ${N}$; and
• The group ${o(N) := \{ n \in {}^* {\bf Z}: |n| \leq C^{-1} N \hbox{ for all } C \in {\bf N} \}}$ of nonstandard integers of magnitude infinitesimally smaller than ${N}$.

The group ${o(N)}$ is a subgroup of ${O(N)}$, so we may form the quotient group ${O(N)/o(N)}$. This space is isomorphic to the reals ${{\bf R}}$, and can in fact be used to construct the reals:

Proposition 1 For any coset ${n + o(N)}$ of ${O(N)/o(N)}$, there is a unique real number ${\hbox{st} \frac{n}{N}}$ with the property that ${\frac{n}{N} = \hbox{st} \frac{n}{N} + o(1)}$. The map ${n + o(N) \mapsto \hbox{st} \frac{n}{N}}$ is then an isomorphism between the additive groups ${O(N)/o(N)}$ and ${{\bf R}}$.

Proof: Uniqueness is clear. For existence, observe that the set ${\{ x \in {\bf R}: Nx \leq n + o(N) \}}$ is a Dedekind cut, and its supremum can be verified to have the required properties for ${\hbox{st} \frac{n}{N}}$. $\Box$

In a similar vein, we can view the unit interval ${[0,1]}$ in the reals as the quotient

$\displaystyle [0,1] \equiv [N] / o(N) \ \ \ \ \ (1)$

where ${[N]}$ is the nonstandard (i.e. internal) set ${\{ n \in {\bf N}: n \leq N \}}$; of course, ${[N]}$ is not a group, so one should interpret ${[N]/o(N)}$ as the image of ${[N]}$ under the quotient map ${{}^* {\bf Z} \rightarrow {}^* {\bf Z} / o(N)}$ (or ${O(N) \rightarrow O(N)/o(N)}$, if one prefers). Or to put it another way, (1) asserts that ${[0,1]}$ is the image of ${[N]}$ with respect to the map ${\pi: n \mapsto \hbox{st} \frac{n}{N}}$.

In this post I would like to record a nice measure-theoretic version of the equivalence (1), which essentially appears already in standard texts on Loeb measure (see e.g. this text of Cutland). To describe the results, we must first quickly recall the construction of Loeb measure on ${[N]}$. Given an internal subset ${A}$ of ${[N]}$, we may define the elementary measure ${\mu_0(A)}$ of ${A}$ by the formula

$\displaystyle \mu_0(A) := \hbox{st} \frac{|A|}{N}.$

This is a finitely additive probability measure on the Boolean algebra of internal subsets of ${[N]}$. We can then construct the Loeb outer measure ${\mu^*(A)}$ of any subset ${A \subset [N]}$ in complete analogy with Lebesgue outer measure by the formula

$\displaystyle \mu^*(A) := \inf \sum_{n=1}^\infty \mu_0(A_n)$

where ${(A_n)_{n=1}^\infty}$ ranges over all sequences of internal subsets of ${[N]}$ that cover ${A}$. We say that a subset ${A}$ of ${[N]}$ is Loeb measurable if, for any (standard) ${\epsilon>0}$, one can find an internal subset ${B}$ of ${[N]}$ which differs from ${A}$ by a set of Loeb outer measure at most ${\epsilon}$, and in that case we define the Loeb measure ${\mu(A)}$ of ${A}$ to be ${\mu^*(A)}$. It is a routine matter to show (e.g. using the Carathéodory extension theorem) that the space ${{\mathcal L}}$ of Loeb measurable sets is a ${\sigma}$-algebra, and that ${\mu}$ is a countably additive probability measure on this space that extends the elementary measure ${\mu_0}$. Thus ${[N]}$ now has the structure of a probability space ${([N], {\mathcal L}, \mu)}$.

Now, the group ${o(N)}$ acts (Loeb-almost everywhere) on the probability space ${[N]}$ by the addition map, thus ${T^h n := n+h}$ for ${n \in [N]}$ and ${h \in o(N)}$ (excluding a set of Loeb measure zero where ${n+h}$ exits ${[N]}$). This action is clearly seen to be measure-preserving. As such, we can form the invariant factor ${Z^0_{o(N)}([N]) = ([N], {\mathcal L}^{o(N)}, \mu\downharpoonright_{{\mathcal L}^{o(N)}})}$, defined by restricting attention to those Loeb measurable sets ${A \subset [N]}$ with the property that ${T^h A}$ is equal ${\mu}$-almost everywhere to ${A}$ for each ${h \in o(N)}$.

The claim is then that this invariant factor is equivalent (up to almost everywhere equivalence) to the unit interval ${[0,1]}$ with Lebesgue measure ${m}$ (and the trivial action of ${o(N)}$), by the same factor map ${\pi: n \mapsto \hbox{st} \frac{n}{N}}$ used in (1). More precisely:

Theorem 2 Given a set ${A \in {\mathcal L}^{o(N)}}$, there exists a Lebesgue measurable set ${B \subset [0,1]}$, unique up to ${m}$-a.e. equivalence, such that ${A}$ is ${\mu}$-a.e. equivalent to the set ${\pi^{-1}(B) := \{ n \in [N]: \hbox{st} \frac{n}{N} \in B \}}$. Conversely, if ${B \in [0,1]}$ is Lebesgue measurable, then ${\pi^{-1}(B)}$ is in ${{\mathcal L}^{o(N)}}$, and ${\mu( \pi^{-1}(B) ) = m( B )}$.

$\displaystyle [0,1] \equiv Z^0_{o(N)}( [N] )$

of (1).

Proof: We first prove the converse. It is clear that ${\pi^{-1}(B)}$ is ${o(N)}$-invariant, so it suffices to show that ${\pi^{-1}(B)}$ is Loeb measurable with Loeb measure ${m(B)}$. This is easily verified when ${B}$ is an elementary set (a finite union of intervals). By countable subadditivity of outer measure, this implies that Loeb outer measure of ${\pi^{-1}(E)}$ is bounded by the Lebesgue outer measure of ${E}$ for any set ${E \subset [0,1]}$; since every Lebesgue measurable set differs from an elementary set by a set of arbitrarily small Lebesgue outer measure, the claim follows.

Now we establish the forward claim. Uniqueness is clear from the converse claim, so it suffices to show existence. Let ${A \in {\mathcal L}^{o(N)}}$. Let ${\epsilon>0}$ be an arbitrary standard real number, then we can find an internal set ${A_\epsilon \subset [N]}$ which differs from ${A}$ by a set of Loeb measure at most ${\epsilon}$. As ${A}$ is ${o(N)}$-invariant, we conclude that for every ${h \in o(N)}$, ${A_\epsilon}$ and ${T^h A_\epsilon}$ differ by a set of Loeb measure (and hence elementary measure) at most ${2\epsilon}$. By the (contrapositive of the) underspill principle, there must exist a standard ${\delta>0}$ such that ${A_\epsilon}$ and ${T^h A_\epsilon}$ differ by a set of elementary measure at most ${2\epsilon}$ for all ${|h| \leq \delta N}$. If we then define the nonstandard function ${f_\epsilon: [N] \rightarrow {}^* {\bf R}}$ by the formula

$\displaystyle f(n) := \hbox{st} \frac{1}{\delta N} \sum_{m \in [N]: m \leq n \leq m+\delta N} 1_{A_\epsilon}(m),$

then from the (nonstandard) triangle inequality we have

$\displaystyle \frac{1}{N} \sum_{n \in [N]} |f(n) - 1_{A_\epsilon}(n)| \leq 3\epsilon$

(say). On the other hand, ${f}$ has the Lipschitz continuity property

$\displaystyle |f(n)-f(m)| \leq \frac{2|n-m|}{\delta N}$

and so in particular we see that

$\displaystyle \hbox{st} f(n) = \tilde f( \hbox{st} \frac{n}{N} )$

for some Lipschitz continuous function ${\tilde f: [0,1] \rightarrow [0,1]}$. If we then let ${E_\epsilon}$ be the set where ${\tilde f \geq 1 - \sqrt{\epsilon}}$, one can check that ${A_\epsilon}$ differs from ${\pi^{-1}(E_\epsilon)}$ by a set of Loeb outer measure ${O(\sqrt{\epsilon})}$, and hence ${A}$ does so also. Sending ${\epsilon}$ to zero, we see (from the converse claim) that ${1_{E_\epsilon}}$ is a Cauchy sequence in ${L^1}$ and thus converges in ${L^1}$ for some Lebesgue measurable ${E}$. The sets ${A_\epsilon}$ then converge in Loeb outer measure to ${\pi^{-1}(E)}$, giving the claim. $\Box$

Thanks to the Lebesgue differentiation theorem, the conditional expectation ${{\bf E}( f | Z^0_{o(N)}([N]))}$ of a bounded Loeb-measurable function ${f: [N] \rightarrow {\bf R}}$ can be expressed (as a function on ${[0,1]}$, defined ${m}$-a.e.) as

$\displaystyle {\bf E}( f | Z^0_{o(N)}([N]))(x) := \lim_{\epsilon \rightarrow 0} \frac{1}{2\epsilon} \int_{[x-\epsilon N,x+\epsilon N]} f\ d\mu.$

By the abstract ergodic theorem from the previous post, one can also view this conditional expectation as the element in the closed convex hull of the shifts ${T^h f}$, ${h = o(N)}$ of minimal ${L^2}$ norm. In particular, we obtain a form of the von Neumann ergodic theorem in this context: the averages ${\frac{1}{H} \sum_{h=1}^H T^h f}$ for ${H=O(N)}$ converge (as a net, rather than a sequence) in ${L^2}$ to ${{\bf E}( f | Z^0_{o(N)}([N]))}$.

If ${f: [N] \rightarrow [-1,1]}$ is (the standard part of) an internal function, that is to say the ultralimit of a sequence ${f_n: [N_n] \rightarrow [-1,1]}$ of finitary bounded functions, one can view the measurable function ${F := {\bf E}( f | Z^0_{o(N)}([N]))}$ as a limit of the ${f_n}$ that is analogous to the “graphons” that emerge as limits of graphs (see e.g. the recent text of Lovasz on graph limits). Indeed, the measurable function ${F: [0,1] \rightarrow [-1,1]}$ is related to the discrete functions ${f_n: [N_n] \rightarrow [-1,1]}$ by the formula

$\displaystyle \int_a^b F(x)\ dx = \hbox{st} \lim_{n \rightarrow p} \frac{1}{N_n} \sum_{a N_n \leq m \leq b N_n} f_n(m)$

for all ${0 \leq a < b \leq 1}$, where ${p}$ is the nonprincipal ultrafilter used to define the nonstandard universe. In particular, from the Arzela-Ascoli diagonalisation argument there is a subsequence ${n_j}$ such that

$\displaystyle \int_a^b F(x)\ dx = \lim_{j \rightarrow \infty} \frac{1}{N_{n_j}} \sum_{a N_{n_j} \leq m \leq b N_{n_j}} f_n(m),$

thus ${F}$ is the asymptotic density function of the ${f_n}$. For instance, if ${f_n}$ is the indicator function of a randomly chosen subset of ${[N_n]}$, then the asymptotic density function would equal ${1/2}$ (almost everywhere, at least).

I’m continuing to look into understanding the ergodic theory of ${o(N)}$ actions, as I believe this may allow one to apply ergodic theory methods to the “single-scale” or “non-asymptotic” setting (in which one averages only over scales comparable to a large parameter ${N}$, rather than the traditional asymptotic approach of letting the scale go to infinity). I’m planning some further posts in this direction, though this is still a work in progress.