You are currently browsing the category archive for the ‘math.DS’ category.

Joni Teräväinen and I have just uploaded to the arXiv our paper “The structure of logarithmically averaged correlations of multiplicative functions, with applications to the Chowla and Elliott conjectures“, submitted to Duke Mathematical Journal. This paper builds upon my previous paper in which I introduced an “entropy decrement method” to prove the two-point (logarithmically averaged) cases of the Chowla and Elliott conjectures. A bit more specifically, I showed that

$\displaystyle \lim_{m \rightarrow \infty} \frac{1}{\log \omega_m} \sum_{x_m/\omega_m \leq n \leq x_m} \frac{g_0(n+h_0) g_1(n+h_1)}{n} = 0$

whenever ${1 \leq \omega_m \leq x_m}$ were sequences going to infinity, ${h_0,h_1}$ were distinct integers, and ${g_0,g_1: {\bf N} \rightarrow {\bf C}}$ were ${1}$-bounded multiplicative functions which were non-pretentious in the sense that

$\displaystyle \liminf_{X \rightarrow \infty} \inf_{|t_j| \leq X} \sum_{p \leq X} \frac{1-\mathrm{Re}( g_j(p) \overline{\chi_j}(p) p^{it_j})}{p} = \infty \ \ \ \ \ (1)$

for all Dirichlet characters ${\chi_j}$ and for ${j=0,1}$. Thus, for instance, one had the logarithmically averaged two-point Chowla conjecture

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+h)}{n} = o(\log x)$

for fixed any non-zero ${h}$, where ${\lambda}$ was the Liouville function.

One would certainly like to extend these results to higher order correlations than the two-point correlations. This looks to be difficult (though perhaps not completely impossible if one allows for logarithmic averaging): in a previous paper I showed that achieving this in the context of the Liouville function would be equivalent to resolving the logarithmically averaged Sarnak conjecture, as well as establishing logarithmically averaged local Gowers uniformity of the Liouville function. However, in this paper we are able to avoid having to resolve these difficult conjectures to obtain partial results towards the (logarithmically averaged) Chowla and Elliott conjecture. For the Chowla conjecture, we can obtain all odd order correlations, in that

$\displaystyle \sum_{n \leq x} \frac{\lambda(n+h_1) \dots \lambda(n+h_k)}{n} = o(\log x) \ \ \ \ \ (2)$

for all odd ${k}$ and all integers ${h_1,\dots,h_k}$ (which, in the odd order case, are no longer required to be distinct). (Superficially, this looks like we have resolved “half of the cases” of the logarithmically averaged Chowla conjecture; but it seems the odd order correlations are significantly easier than the even order ones. For instance, because of the Katai-Bourgain-Sarnak-Ziegler criterion, one can basically deduce the odd order cases of (2) from the even order cases (after allowing for some dilations in the argument ${n}$).

For the more general Elliott conjecture, we can show that

$\displaystyle \lim_{m \rightarrow \infty} \frac{1}{\log \omega_m} \sum_{x_m/\omega_m \leq n \leq x_m} \frac{g_1(n+h_1) \dots g_k(n+h_k)}{n} = 0$

for any ${k}$, any integers ${h_1,\dots,h_k}$ and any bounded multiplicative functions ${g_1,\dots,g_k}$, unless the product ${g_1 \dots g_k}$ weakly pretends to be a Dirichlet character ${\chi}$ in the sense that

$\displaystyle \sum_{p \leq X} \frac{1 - \hbox{Re}( g_1 \dots g_k(p) \overline{\chi}(p)}{p} = o(\log\log X).$

This can be seen to imply (2) as a special case. Even when ${g_1,\dots,g_k}$ does pretend to be a Dirichlet character ${\chi}$, we can still say something: if the limits

$\displaystyle f(a) := \lim_{m \rightarrow \infty} \frac{1}{\log \omega_m} \sum_{x_m/\omega_m \leq n \leq x_m} \frac{g_1(n+ah_1) \dots g_k(n+ah_k)}{n}$

exist for each ${a \in {\bf Z}}$ (which can be guaranteed if we pass to a suitable subsequence), then ${f}$ is the uniform limit of periodic functions ${f_i}$, each of which is ${\chi}$isotypic in the sense that ${f_i(ab) = f_i(a) \chi(b)}$ whenever ${a,b}$ are integers with ${b}$ coprime to the periods of ${\chi}$ and ${f_i}$. This does not pin down the value of any single correlation ${f(a)}$, but does put significant constraints on how these correlations may vary with ${a}$.

Among other things, this allows us to show that all ${16}$ possible length four sign patterns ${(\lambda(n+1),\dots,\lambda(n+4)) \in \{-1,+1\}^4}$ of the Liouville function occur with positive density, and all ${65}$ possible length four sign patterns ${(\mu(n+1),\dots,\mu(n+4)) \in \{-1,0,+1\}^4 \backslash \{-1,+1\}^4}$ occur with the conjectured logarithmic density. (In a previous paper with Matomaki and Radziwill, we obtained comparable results for length three patterns of Liouville and length two patterns of Möbius.)

To describe the argument, let us focus for simplicity on the case of the Liouville correlations

$\displaystyle f(a) := \lim_{X \rightarrow \infty} \frac{1}{\log X} \sum_{n \leq X} \frac{\lambda(n) \lambda(n+a) \dots \lambda(n+(k-1)a)}{n}, \ \ \ \ \ (3)$

assuming for sake of discussion that all limits exist. (In the paper, we instead use the device of generalised limits, as discussed in this previous post.) The idea is to combine together two rather different ways to control this function ${f}$. The first proceeds by the entropy decrement method mentioned earlier, which roughly speaking works as follows. Firstly, we pick a prime ${p}$ and observe that ${\lambda(pn)=-\lambda(n)}$ for any ${n}$, which allows us to rewrite (3) as

$\displaystyle (-1)^k f(a) = \lim_{X \rightarrow \infty} \frac{1}{\log X}$

$\displaystyle \sum_{n \leq X} \frac{\lambda(pn) \lambda(pn+ap) \dots \lambda(pn+(k-1)ap)}{n}.$

Making the change of variables ${n' = pn}$, we obtain

$\displaystyle (-1)^k f(a) = \lim_{X \rightarrow \infty} \frac{1}{\log X}$

$\displaystyle \sum_{n' \leq pX} \frac{\lambda(n') \lambda(n'+ap) \dots \lambda(n'+(k-1)ap)}{n'} p 1_{p|n'}.$

The difference between ${n' \leq pX}$ and ${n' \leq X}$ is negligible in the limit (here is where we crucially rely on the log-averaging), hence

$\displaystyle (-1)^k f(a) = \lim_{X \rightarrow \infty} \frac{1}{\log X} \sum_{n \leq X} \frac{\lambda(n) \lambda(n+ap) \dots \lambda(n+(k-1)ap)}{n} p 1_{p|n}$

and thus by (3) we have

$\displaystyle (-1)^k f(a) = f(ap) + \lim_{X \rightarrow \infty} \frac{1}{\log X}$

$\displaystyle \sum_{n \leq X} \frac{\lambda(n) \lambda(n+ap) \dots \lambda(n+(k-1)ap)}{n} (p 1_{p|n}-1).$

The entropy decrement argument can be used to show that the latter limit is small for most ${p}$ (roughly speaking, this is because the factors ${p 1_{p|n}-1}$ behave like independent random variables as ${p}$ varies, so that concentration of measure results such as Hoeffding’s inequality can apply, after using entropy inequalities to decouple somewhat these random variables from the ${\lambda}$ factors). We thus obtain the approximate isotopy property

$\displaystyle (-1)^k f(a) \approx f(ap) \ \ \ \ \ (4)$

for most ${a}$ and ${p}$.

On the other hand, by the Furstenberg correspondence principle (as discussed in these previous posts), it is possible to express ${f(a)}$ as a multiple correlation

$\displaystyle f(a) = \int_X g(x) g(T^a x) \dots g(T^{(k-1)a} x)\ d\mu(x)$

for some probability space ${(X,\mu)}$ equipped with a measure-preserving invertible map ${T: X \rightarrow X}$. Using results of Bergelson-Host-Kra, Leibman, and Le, this allows us to obtain a decomposition of the form

$\displaystyle f(a) = f_1(a) + f_2(a) \ \ \ \ \ (5)$

where ${f_1}$ is a nilsequence, and ${f_2}$ goes to zero in density (even along the primes, or constant multiples of the primes). The original work of Bergelson-Host-Kra required ergodicity on ${X}$, which is very definitely a hypothesis that is not available here; however, the later work of Leibman removed this hypothesis, and the work of Le refined the control on ${f_1}$ so that one still has good control when restricting to primes, or constant multiples of primes.

Ignoring the small error ${f_2(a)}$, we can now combine (5) to conclude that

$\displaystyle f(a) \approx (-1)^k f_1(ap).$

Using the equidistribution theory of nilsequences (as developed in this previous paper of Ben Green and myself), one can break up ${f_1}$ further into a periodic piece ${f_0}$ and an “irrational” or “minor arc” piece ${f_3}$. The contribution of the minor arc piece ${f_3}$ can be shown to mostly cancel itself out after dilating by primes ${p}$ and averaging, thanks to Vinogradov-type bilinear sum estimates (transferred to the primes). So we end up with

$\displaystyle f(a) \approx (-1)^k f_0(ap),$

which already shows (heuristically, at least) the claim that ${f}$ can be approximated by periodic functions ${f_0}$ which are isotopic in the sense that

$\displaystyle f_0(a) \approx (-1)^k f_0(ap).$

But if ${k}$ is odd, one can use Dirichlet’s theorem on primes in arithmetic progressions to restrict to primes ${p}$ that are ${1}$ modulo the period of ${f_0}$, and conclude now that ${f_0}$ vanishes identically, which (heuristically, at least) gives (2).

The same sort of argument works to give the more general bounds on correlations of bounded multiplicative functions. But for the specific task of proving (2), we initially used a slightly different argument that avoids using the ergodic theory machinery of Bergelson-Host-Kra, Leibman, and Le, but replaces it instead with the Gowers uniformity norm theory used to count linear equations in primes. Basically, by averaging (4) in ${p}$ using the “${W}$-trick”, as well as known facts about the Gowers uniformity of the von Mangoldt function, one can obtain an approximation of the form

$\displaystyle (-1)^k f(a) \approx {\bf E}_{b: (b,W)=1} f(ab)$

where ${b}$ ranges over a large range of integers coprime to some primorial ${W = \prod_{p \leq w} p}$. On the other hand, by iterating (4) we have

$\displaystyle f(a) \approx f(apq)$

for most semiprimes ${pq}$, and by again averaging over semiprimes one can obtain an approximation of the form

$\displaystyle f(a) \approx {\bf E}_{b: (b,W)=1} f(ab).$

For ${k}$ odd, one can combine the two approximations to conclude that ${f(a)=0}$. (This argument is not given in the current paper, but we plan to detail it in a subsequent one.)

I’ve just uploaded to the arXiv my paper “On the universality of the incompressible Euler equation on compact manifolds“, submitted to Discrete and Continuous Dynamical Systems. This is a variant of my recent paper on the universality of potential well dynamics, but instead of trying to embed dynamical systems into a potential well ${\partial_{tt} u = -\nabla V(u)}$, here we try to embed dynamical systems into the incompressible Euler equations

$\displaystyle \partial_t u + \nabla_u u = - \mathrm{grad}_g p \ \ \ \ \ (1)$

$\displaystyle \mathrm{div}_g u = 0$

on a Riemannian manifold ${(M,g)}$. (One is particularly interested in the case of flat manifolds ${M}$, particularly ${{\bf R}^3}$ or ${({\bf R}/{\bf Z})^3}$, but for the main result of this paper it is essential that one is permitted to consider curved manifolds.) This system, first studied by Ebin and Marsden, is the natural generalisation of the usual incompressible Euler equations to curved space; it can be viewed as the formal geodesic flow equation on the infinite-dimensional manifold of volume-preserving diffeomorphisms on ${M}$ (see this previous post for a discussion of this in the flat space case).

The Euler equations can be viewed as a nonlinear equation in which the nonlinearity is a quadratic function of the velocity field ${u}$. It is thus natural to compare the Euler equations with quadratic ODE of the form

$\displaystyle \partial_t y = B(y,y) \ \ \ \ \ (2)$

where ${y: {\bf R} \rightarrow {\bf R}^n}$ is the unknown solution, and ${B: {\bf R}^n \times {\bf R}^n \rightarrow {\bf R}^n}$ is a bilinear map, which we may assume without loss of generality to be symmetric. One can ask whether such an ODE may be linearly embedded into the Euler equations on some Riemannian manifold ${(M,g)}$, which means that there is an injective linear map ${U: {\bf R}^n \rightarrow \Gamma(TM)}$ from ${{\bf R}^n}$ to smooth vector fields on ${M}$, as well as a bilinear map ${P: {\bf R}^n \times {\bf R}^n \rightarrow C^\infty(M)}$ to smooth scalar fields on ${M}$, such that the map ${y \mapsto (U(y), P(y,y))}$ takes solutions to (2) to solutions to (1), or equivalently that

$\displaystyle U(B(y,y)) + \nabla_{U(y)} U(y) = - \mathrm{grad}_g P(y,y)$

$\displaystyle \mathrm{div}_g U(y) = 0$

for all ${y \in {\bf R}^n}$.

For simplicity let us restrict ${M}$ to be compact. There is an obvious necessary condition for this embeddability to occur, which comes from energy conservation law for the Euler equations; unpacking everything, this implies that the bilinear form ${B}$ in (2) has to obey a cancellation condition

$\displaystyle \langle B(y,y), y \rangle = 0 \ \ \ \ \ (3)$

for some positive definite inner product ${\langle, \rangle: {\bf R}^n \times {\bf R}^n \rightarrow {\bf R}}$ on ${{\bf R}^n}$. The main result of the paper is the converse to this statement: if ${B}$ is a symmetric bilinear form obeying a cancellation condition (3), then it is possible to embed the equations (2) into the Euler equations (1) on some Riemannian manifold ${(M,g)}$; the catch is that this manifold will depend on the form ${B}$ and on the dimension ${n}$ (in fact in the construction I have, ${M}$ is given explicitly as ${SO(n) \times ({\bf R}/{\bf Z})^{n+1}}$, with a funny metric on it that depends on ${B}$).

As a consequence, any finite dimensional portion of the usual “dyadic shell models” used as simplified toy models of the Euler equation, can actually be embedded into a genuine Euler equation, albeit on a high-dimensional and curved manifold. This includes portions of the self-similar “machine” I used in a previous paper to establish finite time blowup for an averaged version of the Navier-Stokes (or Euler) equations. Unfortunately, the result in this paper does not apply to infinite-dimensional ODE, so I cannot yet establish finite time blowup for the Euler equations on a (well-chosen) manifold. It does not seem so far beyond the realm of possibility, though, that this could be done in the relatively near future. In particular, the result here suggests that one could construct something resembling a universal Turing machine within an Euler flow on a manifold, which was one ingredient I would need to engineer such a finite time blowup.

The proof of the main theorem proceeds by an “elimination of variables” strategy that was used in some of my previous papers in this area, though in this particular case the Nash embedding theorem (or variants thereof) are not required. The first step is to lessen the dependence on the metric ${g}$ by partially reformulating the Euler equations (1) in terms of the covelocity ${g \cdot u}$ (which is a ${1}$-form) instead of the velocity ${u}$. Using the freedom to modify the dimension of the underlying manifold ${M}$, one can also decouple the metric ${g}$ from the volume form that is used to obtain the divergence-free condition. At this point the metric can be eliminated, with a certain positive definiteness condition between the velocity and covelocity taking its place. After a substantial amount of trial and error (motivated by some “two-and-a-half-dimensional” reductions of the three-dimensional Euler equations, and also by playing around with a number of variants of the classic “separation of variables” strategy), I eventually found an ansatz for the velocity and covelocity that automatically solved most of the components of the Euler equations (as well as most of the positive definiteness requirements), as long as one could find a number of scalar fields that obeyed a certain nonlinear system of transport equations, and also obeyed a positive definiteness condition. Here I was stuck for a bit because the system I ended up with was overdetermined – more equations than unknowns. After trying a number of special cases I eventually found a solution to the transport system on the sphere, except that the scalar functions sometimes degenerated and so the positive definiteness property I wanted was only obeyed with positive semi-definiteness. I tried for some time to perturb this example into a strictly positive definite solution before eventually working out that this was not possible. Finally I had the brainwave to lift the solution from the sphere to an even more symmetric space, and this quickly led to the final solution of the problem, using the special orthogonal group rather than the sphere as the underlying domain. The solution ended up being rather simple in form, but it is still somewhat miraculous to me that it exists at all; in retrospect, given the overdetermined nature of the problem, relying on a large amount of symmetry to cut down the number of equations was basically the only hope.

I’ve just uploaded to the arXiv my paper “On the universality of potential well dynamics“, submitted to Dynamics of PDE. This is a spinoff from my previous paper on blowup of nonlinear wave equations, inspired by some conversations with Sungjin Oh. Here we focus mainly on the zero-dimensional case of such equations, namely the potential well equation

$\displaystyle \partial_{tt} u = - (\nabla F)(u) \ \ \ \ \ (1)$

for a particle ${u: {\bf R} \rightarrow {\bf R}^m}$ trapped in a potential well with potential ${F: {\bf R}^m \rightarrow {\bf R}}$, with ${F(z) \rightarrow +\infty}$ as ${z \rightarrow \infty}$. This ODE always admits global solutions from arbitrary initial positions ${u(0)}$ and initial velocities ${\partial_t u(0)}$, thanks to conservation of the Hamiltonian ${\frac{1}{2} |\partial_t u|^2 + F(u)}$. As this Hamiltonian is coercive (in that its level sets are compact), solutions to this equation are always almost periodic. On the other hand, as can already be seen using the harmonic oscillator ${\partial_{tt} u = - k^2 u}$ (and direct sums of this system), this equation can generate periodic solutions, as well as quasiperiodic solutions.

All quasiperiodic motions are almost periodic. However, there are many examples of dynamical systems that admit solutions that are almost periodic but not quasiperiodic. So one can pose the question: are the dynamics of potential wells universal in the sense that they can capture all almost periodic solutions?

A precise question can be phrased as follows. Let ${M}$ be a compact manifold, and let ${X}$ be a smooth vector field on ${M}$; to avoid degeneracies, let us take ${X}$ to be non-singular in the sense that it is everywhere non-vanishing. Then the trajectories of the first-order ODE

$\displaystyle \partial_t u = X(u) \ \ \ \ \ (2)$

for ${u: {\bf R} \rightarrow M}$ are always global and almost periodic. Can we then find a (coercive) potential ${F: {\bf R}^m \rightarrow {\bf R}}$ for some ${m}$, as well as a smooth embedding ${\phi: M \rightarrow {\bf R}^m}$, such that every solution ${u}$ to (2) pushes forward under ${\phi}$ to a solution to (1)? (Actually, for technical reasons it is preferable to map into the phase space ${{\bf R}^m \times {\bf R}^m}$, rather than position space ${{\bf R}^m}$, but let us ignore this detail for this discussion.)

It turns out that the answer is no; there is a very specific obstruction. Given a pair ${(M,X)}$ as above, define a strongly adapted ${1}$-form to be a ${1}$-form ${\phi}$ on ${M}$ such that ${\phi(X)}$ is pointwise positive, and the Lie derivative ${{\mathcal L}_X \phi}$ is an exact ${1}$-form. We then have

Theorem 1 A smooth compact non-singular dynamics ${(M,X)}$ can be embedded smoothly in a potential well system if and only if it admits a strongly adapted ${1}$-form.

For the “only if” direction, the key point is that potential wells (viewed as a Hamiltonian flow on the phase space ${{\bf R}^m \times {\bf R}^m}$) admit a strongly adapted ${1}$-form, namely the canonical ${1}$-form ${p dq}$, whose Lie derivative is the derivative ${dL}$ of the Lagrangian ${L := \frac{1}{2} |\partial_t u|^2 - F(u)}$ and is thus exact. The converse “if” direction is mainly a consequence of the Nash embedding theorem, and follows the arguments used in my previous paper.

Interestingly, the same obstruction also works for potential wells in a more general Riemannian manifold than ${{\bf R}^m}$, or for nonlinear wave equations with a potential; combining the two, the obstruction is also present for wave maps with a potential.

It is then natural to ask whether this obstruction is non-trivial, in the sense that there are at least some examples of dynamics ${(M,X)}$ that do not support strongly adapted ${1}$-forms (and hence cannot be modeled smoothly by the dynamics of a potential well, nonlinear wave equation, or wave maps). I posed this question on MathOverflow, and Robert Bryant provided a very nice construction, showing that the vector field ${(\sin(2\pi x), \cos(2\pi x))}$ on the ${2}$-torus ${({\bf R}/{\bf Z})^2}$ had no strongly adapted ${1}$-forms, and hence the dynamics of this vector field cannot be smoothly reproduced by a potential well, nonlinear wave equation, or wave map:

On the other hand, the suspension of any diffeomorphism does support a strongly adapted ${1}$-form (the derivative ${dt}$ of the time coordinate), and using this and the previous theorem I was able to embed a universal Turing machine into a potential well. In particular, there are flows for an explicitly describable potential well whose trajectories have behavior that is undecidable using the usual ZFC axioms of set theory! So potential well dynamics are “effectively” universal, despite the presence of the aforementioned obstruction.

In my previous work on blowup for Navier-Stokes like equations, I speculated that if one could somehow replicate a universal Turing machine within the Euler equations, one could use this machine to create a “von Neumann machine” that replicated smaller versions of itself, which on iteration would lead to a finite time blowup. Now that such a mechanism is present in nonlinear wave equations, it is tempting to try to make this scheme work in that setting. Of course, in my previous paper I had already demonstrated finite time blowup, at least in a three-dimensional setting, but that was a relatively simple discretely self-similar blowup in which no computation occurred. This more complicated blowup scheme would be significantly more effort to set up, but would be proof-of-concept that the same scheme would in principle be possible for the Navier-Stokes equations, assuming somehow that one can embed a universal Turing machine into the Euler equations. (But I’m still hopelessly stuck on how to accomplish this latter task…)

A sequence ${a: {\bf Z} \rightarrow {\bf C}}$ of complex numbers is said to be quasiperiodic if it is of the form

$\displaystyle a(n) = F( \alpha_1 n \hbox{ mod } 1, \dots, \alpha_k n \hbox{ mod } 1)$

for some real numbers ${\alpha_1,\dots,\alpha_k}$ and continuous function ${F: ({\bf R}/{\bf Z})^k \rightarrow {\bf C}}$. For instance, linear phases such as ${n \mapsto e(\alpha n + \beta)}$ (where ${e(\theta) := e^{2\pi i \theta}}$) are examples of quasiperiodic sequences; the top order coefficient ${\alpha}$ (modulo ${1}$) can be viewed as a “frequency” of the integers, and an element of the Pontryagin dual ${\hat {\bf Z} \equiv {\bf R}/{\bf Z}}$ of the integers. Any periodic sequence is also quasiperiodic (taking ${k=1}$ and ${\alpha_1}$ to be the reciprocal of the period). A sequence is said to be almost periodic if it is the uniform limit of quasiperiodic sequences. For instance any Fourier series of the form

$\displaystyle a(n) = \sum_{j=1}^\infty c_j e(\alpha_j n)$

with ${\alpha_1,\alpha_2,\dots}$ real numbers and ${c_1,c_2,\dots}$ an absolutely summable sequence of complex coefficients, will be almost periodic.

These sequences arise in various “complexity one” problems in arithmetic combinatorics and ergodic theory. For instance, if ${(X, \mu, T)}$ is a measure-preserving system – a probability space ${(X,\mu)}$ equipped with a measure-preserving shift, and ${f_1,f_2 \in L^\infty(X)}$ are bounded measurable functions, then the correlation sequence

$\displaystyle a(n) := \int_X f_1(x) f_2(T^n x)\ d\mu(x)$

can be shown to be an almost periodic sequence, plus an error term ${b_n}$ which is “null” in the sense that it has vanishing uniform density:

$\displaystyle \sup_N \frac{1}{M} \sum_{n=N+1}^{N+M} |b_n| \rightarrow 0 \hbox{ as } M \rightarrow \infty.$

This can be established in a number of ways, for instance by writing ${a(n)}$ as the Fourier coefficients of the spectral measure of the shift ${T}$ with respect to the functions ${f_1,f_2}$, and then decomposing that measure into pure point and continuous components.

In the last two decades or so, it has become clear that there are natural higher order versions of these concepts, in which linear polynomials such as ${\alpha n + \beta}$ are replaced with higher degree counterparts. The most obvious candidates for these counterparts would be the polynomials ${\alpha_d n^d + \dots + \alpha_0}$, but this turns out to not be a complete set of higher degree objects needed for the theory. Instead, the higher order versions of quasiperiodic and almost periodic sequences are now known as basic nilsequences and nilsequences respectively, while the higher order version of a linear phase is a nilcharacter; each nilcharacter then has a symbol that is a higher order generalisation of the concept of a frequency (and the collection of all symbols forms a group that can be viewed as a higher order version of the Pontryagin dual of ${{\bf Z}}$). The theory of these objects is spread out in the literature across a number of papers; in particular, the theory of nilcharacters is mostly developed in Appendix E of this 116-page paper of Ben Green, Tamar Ziegler, and myself, and is furthermore written using nonstandard analysis and treating the more general setting of higher dimensional sequences. I therefore decided to rewrite some of that material in this blog post, in the simpler context of the qualitative asymptotic theory of one-dimensional nilsequences and nilcharacters rather than the quantitative single-scale theory that is needed for combinatorial applications (and which necessitated the use of nonstandard analysis in the previous paper).

For technical reasons (having to do with the non-trivial topological structure on nilmanifolds), it will be convenient to work with vector-valued sequences, that take values in a finite-dimensional complex vector space ${{\bf C}^m}$ rather than in ${{\bf C}}$. By doing so, the space of sequences is now, technically, no longer a ring, as the operations of addition and multiplication on vector-valued sequences become ill-defined. However, we can still take complex conjugates of a sequence, and add sequences taking values in the same vector space ${{\bf C}^m}$, and for sequences taking values in different vector spaces ${{\bf C}^m}$, ${{\bf C}^{m'}}$, we may utilise the tensor product ${\otimes: {\bf C}^m \times {\bf C}^{m'} \rightarrow {\bf C}^{mm'}}$, which we will normalise by defining

$\displaystyle (z_1,\dots,z_m) \otimes (w_1,\dots,w_{m'}) = (z_1 w_1, \dots, z_1 w_{m'}, \dots, z_m w_1, \dots, z_m w_{m'} ).$

This product is associative and bilinear, and also commutative up to permutation of the indices. It also interacts well with the Hermitian norm

$\displaystyle \| (z_1,\dots,z_m) \| := \sqrt{|z_1|^2 + \dots + |z_m|^2}$

since we have ${\|z \otimes w\| = \|z\| \|w\|}$.

The traditional definition of a basic nilsequence (as defined for instance by Bergelson, Host, and Kra) is as follows:

Definition 1 (Basic nilsequence, first definition) A nilmanifold of step at most ${d}$ is a quotient ${G/\Gamma}$, where ${G}$ is a connected, simply connected nilpotent Lie group of step at most ${d}$ (thus, all ${d+1}$-fold commutators vanish) and ${\Gamma}$ is a discrete cocompact lattice in ${G}$. A basic nilsequence of degree at most ${d}$ is a sequence of the form ${n \mapsto F(g^n g_0 \Gamma)}$, where ${g_0 \Gamma \in G/\Gamma}$, ${g \in G}$, and ${F: G/\Gamma \rightarrow {\bf C}^m}$ is a continuous function.

For instance, it is not difficult using this definition to show that a sequence is a basic nilsequence of degree at most ${1}$ if and only if it is quasiperiodic. The requirement that ${G}$ be simply connected can be easily removed if desired by passing to a universal cover, but it is technically convenient to assume it (among other things, it allows for a well-defined logarithm map that obeys the Baker-Campbell-Hausdorff formula). When one wishes to perform a more quantitative analysis of nilsequences (particularly when working on a “single scale”. sich as on a single long interval ${\{ N+1, \dots, N+M\}}$), it is common to impose additional regularity conditions on the function ${F}$, such as Lipschitz continuity or smoothness, but ordinary continuity will suffice for the qualitative discussion in this blog post.

Nowadays, and particularly when one needs to understand the “single-scale” equidistribution properties of nilsequences, it is more convenient (as is for instance done in this ICM paper of Green) to use an alternate definition of a nilsequence as follows.

Definition 2 Let ${d \geq 0}$. A filtered group of degree at most ${d}$ is a group ${G}$ together with a sequence ${G_\bullet = (G_0,G_1,G_2,\dots)}$ of subgroups ${G \geq G_0 \geq G_1 \geq \dots}$ with ${G_{d+1}=\{\hbox{id}\}}$ and ${[G_i,G_j] \subset G_{i+j}}$ for ${i,j \geq 0}$. A polynomial sequence ${g: {\bf Z} \rightarrow G}$ into a filtered group is a function such that ${\partial_{h_i} \dots \partial_{h_1} g(n) \in G_i}$ for all ${i \geq 0}$ and ${n,h_1,\dots,h_i \in{\bf Z}}$, where ${\partial_h g(n) := g(n+h) g(n)^{-1}}$ is the difference operator. A filtered nilmanifold of degree at most ${s}$ is a quotient ${G/\Gamma}$, where ${G}$ is a filtered group of degree at most ${s}$ such that ${G}$ and all of the subgroups ${G_i}$ are connected, simply connected nilpotent filtered Lie group, and ${\Gamma}$ is a discrete cocompact subgroup of ${G}$ such that ${\Gamma_i := \Gamma \cap G_i}$ is a discrete cocompact subgroup of ${G_i}$. A basic nilsequence of degree at most ${d}$ is a sequence of the form ${n \mapsto F(g(n))}$, where ${g: {\bf Z} \rightarrow G}$ is a polynomial sequence, ${G/\Gamma}$ is a filtered nilmanifold of degree at most ${d}$, and ${F: G \rightarrow {\bf C}^m}$ is a continuous function which is ${\Gamma}$-automorphic, in the sense that ${F(g \gamma) = F(g)}$ for all ${g \in G}$ and ${\gamma \in \Gamma}$.

One can easily identify a ${\Gamma}$-automorphic function on ${G}$ with a function on ${G/\Gamma}$, but there are some (very minor) advantages to working on the group ${G}$ instead of the quotient ${G/\Gamma}$, as it becomes slightly easier to modify the automorphy group ${\Gamma}$ when needed. (But because the action of ${\Gamma}$ on ${G}$ is free, one can pass from ${\Gamma}$-automorphic functions on ${G}$ to functions on ${G/\Gamma}$ with very little difficulty.) The main reason to work with polynomial sequences ${n \mapsto g(n)}$ rather than geometric progressions ${n \mapsto g^n g_0 \Gamma}$ is that they form a group, a fact essentially established by by Lazard and Leibman; see Corollary B.4 of this paper of Green, Ziegler, and myself for a proof in the filtered group setting.

It is easy to see that any sequence that is a basic nilsequence of degree at most ${d}$ in the sense of the first definition, is also a basic nilsequence of degree at most ${d}$ in the second definition, since a nilmanifold of degree at most ${d}$ can be filtered using the lower central series, and any linear sequence ${n \mapsto g^n g_0}$ will be a polynomial sequence with respect to that filtration. The converse implication is a little trickier, but still not too hard to show: see Appendix C of this paper of Ben Green, Tamar Ziegler, and myself. There are two key examples of basic nilsequences to keep in mind. The first are the polynomially quasiperiodic sequences

$\displaystyle a(n) = F( P_1(n), \dots, P_k(n) ),$

where ${P_1,\dots,P_k: {\bf Z} \rightarrow {\bf R}}$ are polynomials of degree at most ${d}$, and ${F: {\bf R}^k \rightarrow {\bf C}^m}$ is a ${{\bf Z}^k}$-automorphic (i.e., ${{\bf Z}^k}$-periodic) continuous function. The map ${P: {\bf Z} \rightarrow {\bf R}^k}$ defined by ${P(n) := (P_1(n),\dots,P_k(n))}$ is a polynomial map of degree at most ${d}$, if one filters ${{\bf R}^k}$ by defining ${({\bf R}^k)_i}$ to equal ${{\bf R}^k}$ when ${i \leq d}$, and ${\{0\}}$ for ${i > d}$. The torus ${{\bf R}^k/{\bf Z}^k}$ then becomes a filtered nilmanifold of degree at most ${d}$, and ${a(n)}$ is thus a basic nilsequence of degree at most ${d}$ as per the second definition. It is also possible explicitly describe ${a_n}$ as a basic nilsequence of degree at most ${d}$ as per the first definition, for instance (in the ${k=1}$ case) by taking ${G}$ to be the space of upper triangular unipotent ${d+1 \times d+1}$ real matrices, and ${\Gamma}$ the subgroup with integer coefficients; we leave the details to the interested reader.

The other key example is a sequence of the form

$\displaystyle a(n) = F( \alpha n, \{ \alpha n \} \beta n )$

where ${\alpha,\beta}$ are real numbers, ${\{ \alpha n \} = \alpha n - \lfloor \alpha n \rfloor}$ denotes the fractional part of ${\alpha n}$, and and ${F: {\bf R}^2 \rightarrow {\bf C}^m}$ is a ${{\bf Z}^2}$-automorphic continuous function that vanishes in a neighbourhood of ${{\bf Z} \times {\bf R}}$. To describe this as a nilsequence, we use the nilpotent connected, simply connected degree ${2}$, Heisenberg group

$\displaystyle G := \begin{pmatrix} 1 & {\bf R} & {\bf R} \\ 0 & 1 & {\bf R} \\ 0 & 0 & 1 \end{pmatrix}$

with the lower central series filtration ${G_0=G_1=G}$, ${G_2= [G,G] = \begin{pmatrix} 1 &0 & {\bf R} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}$, and ${G_i = \{ \mathrm{id} \}}$ for ${i > 2}$, ${\Gamma}$ to be the discrete compact subgroup

$\displaystyle \Gamma := \begin{pmatrix} 1 & {\bf Z} & {\bf Z} \\ 0 & 1 & {\bf Z} \\ 0 & 0 & 1 \end{pmatrix},$

${g: {\bf Z} \rightarrow G}$ to be the polynomial sequence

$\displaystyle g(n) := \begin{pmatrix} 1 & \beta n & \alpha \beta n^2 \\ 0 & 1 & \alpha n \\ 0 & 0 & 1 \end{pmatrix}$

and ${\tilde F: G \rightarrow {\bf C}^m}$ to be the ${\Gamma}$-automorphic function

$\displaystyle \tilde F( \begin{pmatrix} 1 & x & y \\ 0 & 1 & z \\ 0 & 0 & 1 \end{pmatrix} ) = F( \{ z \}, y - \lfloor z \rfloor x );$

one easily verifies that this function is indeed ${\Gamma}$-automorphic, and it is continuous thanks to the vanishing properties of ${F}$. Also we have ${a(n) = \tilde F(g(n))}$, so ${a}$ is a basic nilsequence of degree at most ${2}$. One can concoct similar examples with ${\{ \alpha n \} \beta n}$ replaced by other “bracket polynomials” of ${n}$; for instance

$\displaystyle a(n) = F( \alpha n, \{ \alpha n - \frac{1}{2} \} \beta n )$

will be a basic nilsequence if ${F}$ now vanishes in a neighbourhood of ${(\frac{1}{2}+{\bf Z}) \times {\bf R}}$ rather than ${{\bf Z} \times {\bf R}}$. See this paper of Bergelson and Leibman for more discussion of bracket polynomials (also known as generalised polynomials) and their relationship to nilsequences.

A nilsequence of degree at most ${d}$ is defined to be a sequence that is the uniform limit of basic nilsequences of degree at most ${d}$. Thus for instance a sequence is a nilsequence of degree at most ${1}$ if and only if it is almost periodic, while a sequence is a nilsequence of degree at most ${0}$ if and only if it is constant. Such objects arise in higher order recurrence: for instance, if ${h_0,\dots,h_d}$ are integers, ${(X,\mu,T)}$ is a measure-preserving system, and ${f_0,\dots,f_d \in L^\infty(X)}$, then it was shown by Leibman that the sequence

$\displaystyle n \mapsto \int_X f_0(T^{h_0 n} x) \dots f_d(T^{h_d n} x)\ d\mu(x)$

is equal to a nilsequence of degree at most ${d}$, plus a null sequence. (The special case when the measure-preserving system was ergodic and ${h_i = i}$ for ${i=0,\dots,d}$ was previously established by Bergelson, Host, and Kra.) Nilsequences also arise in the inverse theory of the Gowers uniformity norms, as discussed for instance in this previous post.

It is easy to see that a sequence ${a: {\bf Z} \rightarrow {\bf C}^m}$ is a basic nilsequence of degree at most ${d}$ if and only if each of its ${m}$ components are. The scalar basic nilsequences ${a: {\bf Z} \rightarrow {\bf C}}$ of degree ${d}$ are easily seen to form a ${*}$-algebra (that is to say, they are a complex vector space closed under pointwise multiplication and complex conjugation), which implies similarly that vector-valued basic nilsequences ${a: {\bf Z} \rightarrow {\bf C}^m}$ of degree at most ${d}$ form a complex vector space closed under complex conjugation for each ${m}$, and that the tensor product of any two basic nilsequences of degree at most ${d}$ is another basic nilsequence of degree at most ${d}$. Similarly with “basic nilsequence” replaced by “nilsequence” throughout.

Now we turn to the notion of a nilcharacter, as defined in this paper of Ben Green, Tamar Ziegler, and myself:

Definition 3 (Nilcharacters) Let ${d \geq 1}$. A sub-nilcharacter of degree ${d}$ is a basic nilsequence ${\chi: n \mapsto F(g(n))}$ of degree at most ${d}$, such that ${F}$ obeys the additional modulation property

$\displaystyle F( g_d g ) = e( \xi \cdot g_d ) F(g) \ \ \ \ \ (1)$

for all ${g \in G}$ and ${g_d \in G_d}$, where ${\xi: G_d \rightarrow {\bf R}}$ is a continuous homomorphism ${g_d \mapsto \xi \cdot g_d}$. (Note from (1) and ${\Gamma}$-automorphy that unless ${F}$ vanishes identically, ${\xi}$ must map ${\Gamma_d}$ to ${{\bf Z}}$, thus without loss of generality one can view ${\xi}$ as an element of the Pontryagial dual of the torus ${G_d / \Gamma_d}$.) If in addition one has ${\|F(g)\|=1}$ for all ${g \in G}$, we call ${\chi}$ a nilcharacter of degree ${d \geq 1}$.

In the degree one case ${d=1}$, the only sub-nilcharacters are of the form ${\chi(n) = e(\alpha n)}$ for some vector ${c \in {\bf C}^m}$ and ${\alpha \in {\bf R}}$, and this is a nilcharacter if ${c}$ is a unit vector. Similarly, in higher degree, any sequence of the form ${\chi(n) = c e(P(n))}$, where ${c \in {\bf C}^m}$ is a vector and ${P: {\bf Z} \rightarrow {\bf R}}$ is a polynomial of degree at most ${d}$, is a sub-nilcharacter of degree ${d}$, and a character if ${c}$ is a unit vector. A nilsequence of degree at most ${d-1}$ is automatically a sub-nilcharacter of degree ${d}$, and a nilcharacter if it is of magnitude ${1}$. A further example of a nilcharacter is provided by the two-dimensional sequence ${\chi: {\bf Z} \rightarrow {\bf C}^2}$ defined by

$\displaystyle \chi(n) := ( F_0( \alpha n ) e( \{ \alpha n \} \beta n ), F_{1/2}( \alpha n ) e( \{ \alpha n - \frac{1}{2} \} \beta n ) ) \ \ \ \ \ (2)$

where ${F_0, F_{1/2}: {\bf R} \rightarrow {\bf C}}$ are continuous, ${{\bf Z}}$-automorphic functions that vanish on a neighbourhood of ${{\bf Z}}$ and ${\frac{1}{2}+{\bf Z}}$ respectively, and which form a partition of unity in the sense that

$\displaystyle |F_0(x)|^2 + |F_{1/2}(x)|^2 = 1$

for all ${x \in {\bf R}}$. Note that one needs both ${F_0}$ and ${F_{1/2}}$ to be not identically zero in order for all these conditions to be satisfied; it turns out (for topological reasons) that there is no scalar nilcharacter that is “equivalent” to this nilcharacter in a sense to be defined shortly. In some literature, one works exclusively with sub-nilcharacters rather than nilcharacters, however the former space contains zero-divisors, which is a little annoying technically. Nevertheless, both nilcharacters and sub-nilcharacters generate the same set of “symbols” as we shall see later.

We claim that every degree ${d}$ sub-nilcharacter ${f: {\bf Z} \rightarrow {\bf C}^m}$ can be expressed in the form ${f = c \chi}$, where ${\chi: {\bf Z} \rightarrow {\bf C}^{m'}}$ is a degree ${d}$ nilcharacter, and ${c: {\bf C}^{m'} \rightarrow {\bf C}^m}$ is a linear transformation. Indeed, by scaling we may assume ${f(n) = F(g(n))}$ where ${|F| < 1}$ uniformly. Using partitions of unity, one can find further functions ${F_1,\dots,F_m}$ also obeying (1) for the same character ${\xi}$ such that ${|F_1|^2 + \dots + |F_m|^2}$ is non-zero; by dividing out the ${F_1,\dots,F_m}$ by the square root of this quantity, and then multiplying by ${\sqrt{1-|F|^2}}$, we may assume that

$\displaystyle |F|^2 + |F_1|^2 + \dots + |F_m|^2 = 1,$

and then

$\displaystyle \chi(n) := (F(g(n)), F_1(g(n)), \dots, F_m(g(n)))$

becomes a degree ${d}$ nilcharacter that contains ${f(n)}$ amongst its components, giving the claim.

As we shall show below, nilsequences can be approximated uniformly by linear combinations of nilcharacters, in much the same way that quasiperiodic or almost periodic sequences can be approximated uniformly by linear combinations of linear phases. In particular, nilcharacters can be used as “obstructions to uniformity” in the sense of the inverse theory of the Gowers uniformity norms.

The space of degree ${d}$ nilcharacters forms a semigroup under tensor product, with the constant sequence ${1}$ as the identity. One can upgrade this semigroup to an abelian group by quotienting nilcharacters out by equivalence:

Definition 4 Let ${d \geq 1}$. We say that two degree ${d}$ nilcharacters ${\chi: {\bf Z} \rightarrow {\bf C}^m}$, ${\chi': {\bf Z} \rightarrow {\bf C}^{m'}}$ are equivalent if ${\chi \otimes \overline{\chi'}: {\bf Z} \rightarrow {\bf C}^{mm'}}$ is equal (as a sequence) to a basic nilsequence of degree at most ${d-1}$. (We will later show that this is indeed an equivalence relation.) The equivalence class ${[\chi]_{\mathrm{Symb}^d({\bf Z})}}$ of such a nilcharacter will be called the symbol of that nilcharacter (in analogy to the symbol of a differential or pseudodifferential operator), and the collection of such symbols will be denoted ${\mathrm{Symb}^d({\bf Z})}$.

As we shall see below the fold, ${\mathrm{Symb}^d({\bf Z})}$ has the structure of an abelian group, and enjoys some nice “symbol calculus” properties; also, one can view symbols as precisely describing the obstruction to equidistribution for nilsequences. For ${d=1}$, the group is isomorphic to the Ponytragin dual ${\hat {\bf Z} = {\bf R}/{\bf Z}}$ of the integers, and ${\mathrm{Symb}^d({\bf Z})}$ for ${d > 1}$ should be viewed as higher order generalisations of this Pontryagin dual. In principle, this group can be explicitly described for all ${d}$, but the theory rapidly gets complicated as ${d}$ increases (much as the classification of nilpotent Lie groups or Lie algebras of step ${d}$ rapidly gets complicated even for medium-sized ${d}$ such as ${d=3}$ or ${d=4}$). We will give an explicit description of the ${d=2}$ case here. There is however one nice (and non-trivial) feature of ${\mathrm{Symb}^d({\bf Z})}$ for ${d \geq 2}$ – it is not just an abelian group, but is in fact a vector space over the rationals ${{\bf Q}}$!

Tamar Ziegler and I have just uploaded to the arXiv two related papers: “Concatenation theorems for anti-Gowers-uniform functions and Host-Kra characteoristic factors” and “polynomial patterns in primes“, with the former developing a “quantitative Bessel inequality” for local Gowers norms that is crucial in the latter.

We use the term “concatenation theorem” to denote results in which structural control of a function in two or more “directions” can be “concatenated” into structural control in a joint direction. A trivial example of such a concatenation theorem is the following: if a function ${f: {\bf Z} \times {\bf Z} \rightarrow {\bf R}}$ is constant in the first variable (thus ${x \mapsto f(x,y)}$ is constant for each ${y}$), and also constant in the second variable (thus ${y \mapsto f(x,y)}$ is constant for each ${x}$), then it is constant in the joint variable ${(x,y)}$. A slightly less trivial example: if a function ${f: {\bf Z} \times {\bf Z} \rightarrow {\bf R}}$ is affine-linear in the first variable (thus, for each ${y}$, there exist ${\alpha(y), \beta(y)}$ such that ${f(x,y) = \alpha(y) x + \beta(y)}$ for all ${x}$) and affine-linear in the second variable (thus, for each ${x}$, there exist ${\gamma(x), \delta(x)}$ such that ${f(x,y) = \gamma(x)y + \delta(x)}$ for all ${y}$) then ${f}$ is a quadratic polynomial in ${x,y}$; in fact it must take the form

$\displaystyle f(x,y) = \epsilon xy + \zeta x + \eta y + \theta \ \ \ \ \ (1)$

for some real numbers ${\epsilon, \zeta, \eta, \theta}$. (This can be seen for instance by using the affine linearity in ${y}$ to show that the coefficients ${\alpha(y), \beta(y)}$ are also affine linear.)

The same phenomenon extends to higher degree polynomials. Given a function ${f: G \rightarrow K}$ from one additive group ${G}$ to another, we say that ${f}$ is of degree less than ${d}$ along a subgroup ${H}$ of ${G}$ if all the ${d}$-fold iterated differences of ${f}$ along directions in ${H}$ vanish, that is to say

$\displaystyle \partial_{h_1} \dots \partial_{h_d} f(x) = 0$

for all ${x \in G}$ and ${h_1,\dots,h_d \in H}$, where ${\partial_h}$ is the difference operator

$\displaystyle \partial_h f(x) := f(x+h) - f(x).$

(We adopt the convention that the only ${f}$ of degree less than ${0}$ is the zero function.)

We then have the following simple proposition:

Proposition 1 (Concatenation of polynomiality) Let ${f: G \rightarrow K}$ be of degree less than ${d_1}$ along one subgroup ${H_1}$ of ${G}$, and of degree less than ${d_2}$ along another subgroup ${H_2}$ of ${G}$, for some ${d_1,d_2 \geq 1}$. Then ${f}$ is of degree less than ${d_1+d_2-1}$ along the subgroup ${H_1+H_2}$ of ${G}$.

Note the previous example was basically the case when ${G = {\bf Z} \times {\bf Z}}$, ${H_1 = {\bf Z} \times \{0\}}$, ${H_2 = \{0\} \times {\bf Z}}$, ${K = {\bf R}}$, and ${d_1=d_2=2}$.

Proof: The claim is trivial for ${d_1=1}$ or ${d_2=1}$ (in which ${f}$ is constant along ${H_1}$ or ${H_2}$ respectively), so suppose inductively ${d_1,d_2 \geq 2}$ and the claim has already been proven for smaller values of ${d_1-1}$.

We take a derivative in a direction ${h_1 \in H_1}$ along ${h_1}$ to obtain

$\displaystyle T^{-h_1} f = f + \partial_{h_1} f$

where ${T^{-h_1} f(x) = f(x+h_1)}$ is the shift of ${f}$ by ${-h_1}$. Then we take a further shift by a direction ${h_2 \in H_2}$ to obtain

$\displaystyle T^{-h_1-h_2} f = T^{-h_2} f + T^{-h_2} \partial_{h_1} f = f + \partial_{h_2} f + T^{-h_2} \partial_{h_1} f$

$\displaystyle \partial_{h_1+h_2} f = \partial_{h_2} f + T^{-h_2} \partial_{h_1} f.$

Since ${f}$ has degree less than ${d_1}$ along ${H_1}$ and degree less than ${d_2}$ along ${H_2}$, ${\partial_{h_1} f}$ has degree less than ${d_1-1}$ along ${H_1}$ and less than ${d_2}$ along ${H_2}$, so is degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$ by induction hypothesis. Similarly ${\partial_{h_2} f}$ is also of degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$. Combining this with the cocycle equation we see that ${\partial_{h_1+h_2}f}$ is of degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$ for any ${h_1+h_2 \in H_1+H_2}$, and hence ${f}$ is of degree less than ${d_1+d_2-1}$ along ${H_1+H_2}$, as required. $\Box$

While this proposition is simple, it already illustrates some basic principles regarding how one would go about proving a concatenation theorem:

• (i) One should perform induction on the degrees ${d_1,d_2}$ involved, and take advantage of the recursive nature of degree (in this case, the fact that a function is of less than degree ${d}$ along some subgroup ${H}$ of directions iff all of its first derivatives along ${H}$ are of degree less than ${d-1}$).
• (ii) Structure is preserved by operations such as addition, shifting, and taking derivatives. In particular, if a function ${f}$ is of degree less than ${d}$ along some subgroup ${H}$, then any derivative ${\partial_k f}$ of ${f}$ is also of degree less than ${d}$ along ${H}$, even if ${k}$ does not belong to ${H}$.

Here is another simple example of a concatenation theorem. Suppose an at most countable additive group ${G}$ acts by measure-preserving shifts ${T: g \mapsto T^g}$ on some probability space ${(X, {\mathcal X}, \mu)}$; we call the pair ${(X,T)}$ (or more precisely ${(X, {\mathcal X}, \mu, T)}$) a ${G}$-system. We say that a function ${f \in L^\infty(X)}$ is a generalised eigenfunction of degree less than ${d}$ along some subgroup ${H}$ of ${G}$ and some ${d \geq 1}$ if one has

$\displaystyle T^h f = \lambda_h f$

almost everywhere for all ${h \in H}$, and some functions ${\lambda_h \in L^\infty(X)}$ of degree less than ${d-1}$ along ${H}$, with the convention that a function has degree less than ${0}$ if and only if it is equal to ${1}$. Thus for instance, a function ${f}$ is an generalised eigenfunction of degree less than ${1}$ along ${H}$ if it is constant on almost every ${H}$-ergodic component of ${G}$, and is a generalised function of degree less than ${2}$ along ${H}$ if it is an eigenfunction of the shift action on almost every ${H}$-ergodic component of ${G}$. A basic example of a higher order eigenfunction is the function ${f(x,y) := e^{2\pi i y}}$ on the skew shift ${({\bf R}/{\bf Z})^2}$ with ${{\bf Z}}$ action given by the generator ${T(x,y) := (x+\alpha,y+x)}$ for some irrational ${\alpha}$. One can check that ${T^h f = \lambda_h f}$ for every integer ${h}$, where ${\lambda_h: x \mapsto e^{2\pi i \binom{h}{2} \alpha} e^{2\pi i h x}}$ is a generalised eigenfunction of degree less than ${2}$ along ${{\bf Z}}$, so ${f}$ is of degree less than ${3}$ along ${{\bf Z}}$.

We then have

Proposition 2 (Concatenation of higher order eigenfunctions) Let ${(X,T)}$ be a ${G}$-system, and let ${f \in L^\infty(X)}$ be a generalised eigenfunction of degree less than ${d_1}$ along one subgroup ${H_1}$ of ${G}$, and a generalised eigenfunction of degree less than ${d_2}$ along another subgroup ${H_2}$ of ${G}$, for some ${d_1,d_2 \geq 1}$. Then ${f}$ is a generalised eigenfunction of degree less than ${d_1+d_2-1}$ along the subgroup ${H_1+H_2}$ of ${G}$.

The argument is almost identical to that of the previous proposition and is left as an exercise to the reader. The key point is the point (ii) identified earlier: the space of generalised eigenfunctions of degree less than ${d}$ along ${H}$ is preserved by multiplication and shifts, as well as the operation of “taking derivatives” ${f \mapsto \lambda_k}$ even along directions ${k}$ that do not lie in ${H}$. (To prove this latter claim, one should restrict to the region where ${f}$ is non-zero, and then divide ${T^k f}$ by ${f}$ to locate ${\lambda_k}$.)

A typical example of this proposition in action is as follows: consider the ${{\bf Z}^2}$-system given by the ${3}$-torus ${({\bf R}/{\bf Z})^3}$ with generating shifts

$\displaystyle T^{(1,0)}(x,y,z) := (x+\alpha,y,z+y)$

$\displaystyle T^{(0,1)}(x,y,z) := (x,y+\alpha,z+x)$

for some irrational ${\alpha}$, which can be checked to give a ${{\bf Z}^2}$ action

$\displaystyle T^{(n,m)}(x,y,z) := (x+n\alpha, y+m\alpha, z+ny+mx+nm\alpha).$

The function ${f(x,y,z) := e^{2\pi i z}}$ can then be checked to be a generalised eigenfunction of degree less than ${2}$ along ${{\bf Z} \times \{0\}}$, and also less than ${2}$ along ${\{0\} \times {\bf Z}}$, and less than ${3}$ along ${{\bf Z}^2}$. One can view this example as the dynamical systems translation of the example (1) (see this previous post for some more discussion of this sort of correspondence).

The main results of our concatenation paper are analogues of these propositions concerning a more complicated notion of “polynomial-like” structure that are of importance in additive combinatorics and in ergodic theory. On the ergodic theory side, the notion of structure is captured by the Host-Kra characteristic factors ${Z^{ of a ${G}$-system ${X}$ along a subgroup ${H}$. These factors can be defined in a number of ways. One is by duality, using the Gowers-Host-Kra uniformity seminorms (defined for instance here) ${\| \|_{U^d_H(X)}}$. Namely, ${Z^{ is the factor of ${X}$ defined up to equivalence by the requirement that

$\displaystyle \|f\|_{U^d_H(X)} = 0 \iff {\bf E}(f | Z^{

An equivalent definition is in terms of the dual functions ${{\mathcal D}^d_H(f)}$ of ${f}$ along ${H}$, which can be defined recursively by setting ${{\mathcal D}^0_H(f) = 1}$ and

$\displaystyle {\mathcal D}^d_H(f) = {\bf E}_h T^h f {\mathcal D}^{d-1}( f \overline{T^h f} )$

where ${{\bf E}_h}$ denotes the ergodic average along a Følner sequence in ${G}$ (in fact one can also define these concepts in non-amenable abelian settings as per this previous post). The factor ${Z^{ can then be alternately defined as the factor generated by the dual functions ${{\mathcal D}^d_H(f)}$ for ${f \in L^\infty(X)}$.

In the case when ${G=H={\bf Z}}$ and ${X}$ is ${G}$-ergodic, a deep theorem of Host and Kra shows that the factor ${Z^{ is equivalent to the inverse limit of nilsystems of step less than ${d}$. A similar statement holds with ${{\bf Z}}$ replaced by any finitely generated group by Griesmer, while the case of an infinite vector space over a finite field was treated in this paper of Bergelson, Ziegler, and myself. The situation is more subtle when ${X}$ is not ${G}$-ergodic, or when ${X}$ is ${G}$-ergodic but ${H}$ is a proper subgroup of ${G}$ acting non-ergodically, when one has to start considering measurable families of directional nilsystems; see for instance this paper of Austin for some of the subtleties involved (for instance, higher order group cohomology begins to become relevant!).

One of our main theorems is then

Proposition 3 (Concatenation of characteristic factors) Let ${(X,T)}$ be a ${G}$-system, and let ${f}$ be measurable with respect to the factor ${Z^{ and with respect to the factor ${Z^{ for some ${d_1,d_2 \geq 1}$ and some subgroups ${H_1,H_2}$ of ${G}$. Then ${f}$ is also measurable with respect to the factor ${Z^{.

We give two proofs of this proposition in the paper; an ergodic-theoretic proof using the Host-Kra theory of “cocycles of type ${ (along a subgroup ${H}$)”, which can be used to inductively describe the factors ${Z^{, and a combinatorial proof based on a combinatorial analogue of this proposition which is harder to state (but which roughly speaking asserts that a function which is nearly orthogonal to all bounded functions of small ${U^{d_1}_{H_1}}$ norm, and also to all bounded functions of small ${U^{d_2}_{H_2}}$ norm, is also nearly orthogonal to alll bounded functions of small ${U^{d_1+d_2-1}_{H_1+H_2}}$ norm). The combinatorial proof parallels the proof of Proposition 2. A key point is that dual functions ${F := {\mathcal D}^d_H(f)}$ obey a property analogous to being a generalised eigenfunction, namely that

$\displaystyle T^h F = {\bf E}_k \lambda_{h,k} F_k$

where ${F_k := T^k F}$ and ${\lambda_{h,k} := {\mathcal D}^{d-1}( T^h f \overline{T^k f} )}$ is a “structured function of order ${d-1}$” along ${H}$. (In the language of this previous paper of mine, this is an assertion that dual functions are uniformly almost periodic of order ${d}$.) Again, the point (ii) above is crucial, and in particular it is key that any structure that ${F}$ has is inherited by the associated functions ${\lambda_{h,k}}$ and ${F_k}$. This sort of inheritance is quite easy to accomplish in the ergodic setting, as there is a ready-made language of factors to encapsulate the concept of structure, and the shift-invariance and ${\sigma}$-algebra properties of factors make it easy to show that just about any “natural” operation one performs on a function measurable with respect to a given factor, returns a function that is still measurable in that factor. In the finitary combinatorial setting, though, encoding the fact (ii) becomes a remarkably complicated notational nightmare, requiring a huge amount of “epsilon management” and “second-order epsilon management” (in which one manages not only scalar epsilons, but also function-valued epsilons that depend on other parameters). In order to avoid all this we were forced to utilise a nonstandard analysis framework for the combinatorial theorems, which made the arguments greatly resemble the ergodic arguments in many respects (though the two settings are still not equivalent, see this previous blog post for some comparisons between the two settings). Unfortunately the arguments are still rather complicated.

For combinatorial applications, dual formulations of the concatenation theorem are more useful. A direct dualisation of the theorem yields the following decomposition theorem: a bounded function which is small in ${U^{d_1+d_2-1}_{H_1+H_2}}$ norm can be split into a component that is small in ${U^{d_1}_{H_1}}$ norm, and a component that is small in ${U^{d_2}_{H_2}}$ norm. (One may wish to understand this type of result by first proving the following baby version: any function that has mean zero on every coset of ${H_1+H_2}$, can be decomposed as the sum of a function that has mean zero on every ${H_1}$ coset, and a function that has mean zero on every ${H_2}$ coset. This is dual to the assertion that a function that is constant on every ${H_1}$ coset and constant on every ${H_2}$ coset, is constant on every ${H_1+H_2}$ coset.) Combining this with some standard “almost orthogonality” arguments (i.e. Cauchy-Schwarz) give the following Bessel-type inequality: if one has a lot of subgroups ${H_1,\dots,H_k}$ and a bounded function is small in ${U^{2d-1}_{H_i+H_j}}$ norm for most ${i,j}$, then it is also small in ${U^d_{H_i}}$ norm for most ${i}$. (Here is a baby version one may wish to warm up on: if a function ${f}$ has small mean on ${({\bf Z}/p{\bf Z})^2}$ for some large prime ${p}$, then it has small mean on most of the cosets of most of the one-dimensional subgroups of ${({\bf Z}/p{\bf Z})^2}$.)

There is also a generalisation of the above Bessel inequality (as well as several of the other results mentioned above) in which the subgroups ${H_i}$ are replaced by more general coset progressions ${H_i+P_i}$ (of bounded rank), so that one has a Bessel inequailty controlling “local” Gowers uniformity norms such as ${U^d_{P_i}}$ by “global” Gowers uniformity norms such as ${U^{2d-1}_{P_i+P_j}}$. This turns out to be particularly useful when attempting to compute polynomial averages such as

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} f(n) g(n+r^2) h(n+2r^2) \ \ \ \ \ (2)$

for various functions ${f,g,h}$. After repeated use of the van der Corput lemma, one can control such averages by expressions such as

$\displaystyle \sum_{n \leq N} \sum_{h,m,k \leq \sqrt{N}} f(n) f(n+mh) f(n+mk) f(n+m(h+k))$

(actually one ends up with more complicated expressions than this, but let’s use this example for sake of discussion). This can be viewed as an average of various ${U^2}$ Gowers uniformity norms of ${f}$ along arithmetic progressions of the form ${\{ mh: h \leq \sqrt{N}\}}$ for various ${m \leq \sqrt{N}}$. Using the above Bessel inequality, this can be controlled in turn by an average of various ${U^3}$ Gowers uniformity norms along rank two generalised arithmetic progressions of the form ${\{ m_1 h_1 + m_2 h_2: h_1,h_2 \le \sqrt{N}\}}$ for various ${m_1,m_2 \leq \sqrt{N}}$. But for generic ${m_1,m_2}$, this rank two progression is close in a certain technical sense to the “global” interval ${\{ n: n \leq N \}}$ (this is ultimately due to the basic fact that two randomly chosen large integers are likely to be coprime, or at least have a small gcd). As a consequence, one can use the concatenation theorems from our first paper to control expressions such as (2) in terms of global Gowers uniformity norms. This is important in number theoretic applications, when one is interested in computing sums such as

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} \mu(n) \mu(n+r^2) \mu(n+2r^2)$

or

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} \Lambda(n) \Lambda(n+r^2) \Lambda(n+2r^2)$

where ${\mu}$ and ${\Lambda}$ are the Möbius and von Mangoldt functions respectively. This is because we are able to control global Gowers uniformity norms of such functions (thanks to results such as the proof of the inverse conjecture for the Gowers norms, the orthogonality of the Möbius function with nilsequences, and asymptotics for linear equations in primes), but much less control is currently available for local Gowers uniformity norms, even with the assistance of the generalised Riemann hypothesis (see this previous blog post for some further discussion).

By combining these tools and strategies with the “transference principle” approach from our previous paper (as improved using the recent “densification” technique of Conlon, Fox, and Zhao, discussed in this previous post), we are able in particular to establish the following result:

Theorem 4 (Polynomial patterns in the primes) Let ${P_1,\dots,P_k: {\bf Z} \rightarrow {\bf Z}}$ be polynomials of degree at most ${d}$, whose degree ${d}$ coefficients are all distinct, for some ${d \geq 1}$. Suppose that ${P_1,\dots,P_k}$ is admissible in the sense that for every prime ${p}$, there are ${n,r}$ such that ${n+P_1(r),\dots,n+P_k(r)}$ are all coprime to ${p}$. Then there exist infinitely many pairs ${n,r}$ of natural numbers such that ${n+P_1(r),\dots,n+P_k(r)}$ are prime.

Furthermore, we obtain an asymptotic for the number of such pairs ${n,r}$ in the range ${n \leq N}$, ${r \leq N^{1/d}}$ (actually for minor technical reasons we reduce the range of ${r}$ to be very slightly less than ${N^{1/d}}$). In fact one could in principle obtain asymptotics for smaller values of ${r}$, and relax the requirement that the degree ${d}$ coefficients be distinct with the requirement that no two of the ${P_i}$ differ by a constant, provided one had good enough local uniformity results for the Möbius or von Mangoldt functions. For instance, we can obtain an asymptotic for triplets of the form ${n, n+r,n+r^d}$ unconditionally for ${d \leq 5}$, and conditionally on GRH for all ${d}$, using known results on primes in short intervals on average.

The ${d=1}$ case of this theorem was obtained in a previous paper of myself and Ben Green (using the aforementioned conjectures on the Gowers uniformity norm and the orthogonality of the Möbius function with nilsequences, both of which are now proven). For higher ${d}$, an older result of Tamar and myself was able to tackle the case when ${P_1(0)=\dots=P_k(0)=0}$ (though our results there only give lower bounds on the number of pairs ${(n,r)}$, and no asymptotics). Both of these results generalise my older theorem with Ben Green on the primes containing arbitrarily long arithmetic progressions. The theorem also extends to multidimensional polynomials, in which case there are some additional previous results; see the paper for more details. We also get a technical refinement of our previous result on narrow polynomial progressions in (dense subsets of) the primes by making the progressions just a little bit narrower in the case of the density of the set one is using is small.

Note: this post is of a particularly technical nature, in particular presuming familiarity with nilsequences, nilsystems, characteristic factors, etc., and is primarily intended for experts.

As mentioned in the previous post, Ben Green, Tamar Ziegler, and myself proved the following inverse theorem for the Gowers norms:

Theorem 1 (Inverse theorem for Gowers norms) Let ${N \geq 1}$ and ${s \geq 1}$ be integers, and let ${\delta > 0}$. Suppose that ${f: {\bf Z} \rightarrow [-1,1]}$ is a function supported on ${[N] := \{1,\dots,N\}}$ such that

$\displaystyle \frac{1}{N^{s+2}} \sum_{n,h_1,\dots,h_{s+1}} \prod_{\omega \in \{0,1\}^{s+1}} f(n+\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}) \geq \delta.$

Then there exists a filtered nilmanifold ${G/\Gamma}$ of degree ${\leq s}$ and complexity ${O_{s,\delta}(1)}$, a polynomial sequence ${g: {\bf Z} \rightarrow G}$, and a Lipschitz function ${F: G/\Gamma \rightarrow {\bf R}}$ of Lipschitz constant ${O_{s,\delta}(1)}$ such that

$\displaystyle \frac{1}{N} \sum_n f(n) F(g(n) \Gamma) \gg_{s,\delta} 1.$

This result was conjectured earlier by Ben Green and myself; this conjecture was strongly motivated by an analogous inverse theorem in ergodic theory by Host and Kra, which we formulate here in a form designed to resemble Theorem 1 as closely as possible:

Theorem 2 (Inverse theorem for Gowers-Host-Kra seminorms) Let ${s \geq 1}$ be an integer, and let ${(X, T)}$ be an ergodic, countably generated measure-preserving system. Suppose that one has

$\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N^{s+1}} \sum_{h_1,\dots,h_{s+1} \in [N]} \int_X \prod_{\omega \in \{0,1\}^{s+1}} f(T^{\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}}x)\ d\mu(x)$

$\displaystyle > 0$

for all non-zero ${f \in L^\infty(X)}$ (all ${L^p}$ spaces are real-valued in this post). Then ${(X,T)}$ is an inverse limit (in the category of measure-preserving systems, up to almost everywhere equivalence) of ergodic degree ${\leq s}$ nilsystems, that is to say systems of the form ${(G/\Gamma, x \mapsto gx)}$ for some degree ${\leq s}$ filtered nilmanifold ${G/\Gamma}$ and a group element ${g \in G}$ that acts ergodically on ${G/\Gamma}$.

It is a natural question to ask if there is any logical relationship between the two theorems. In the finite field category, one can deduce the combinatorial inverse theorem from the ergodic inverse theorem by a variant of the Furstenberg correspondence principle, as worked out by Tamar Ziegler and myself, however in the current context of ${{\bf Z}}$-actions, the connection is less clear.

One can split Theorem 2 into two components:

Theorem 3 (Weak inverse theorem for Gowers-Host-Kra seminorms) Let ${s \geq 1}$ be an integer, and let ${(X, T)}$ be an ergodic, countably generated measure-preserving system. Suppose that one has

$\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N^{s+1}} \sum_{h_1,\dots,h_{s+1} \in [N]} \int_X \prod_{\omega \in \{0,1\}^{s+1}} T^{\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}} f\ d\mu$

$\displaystyle > 0$

for all non-zero ${f \in L^\infty(X)}$, where ${T^h f := f \circ T^h}$. Then ${(X,T)}$ is a factor of an inverse limit of ergodic degree ${\leq s}$ nilsystems.

Theorem 4 (Pro-nilsystems closed under factors) Let ${s \geq 1}$ be an integer. Then any factor of an inverse limit of ergodic degree ${\leq s}$ nilsystems, is again an inverse limit of ergodic degree ${\leq s}$ nilsystems.

Indeed, it is clear that Theorem 2 implies both Theorem 3 and Theorem 4, and conversely that the two latter theorems jointly imply the former. Theorem 4 is, in principle, purely a fact about nilsystems, and should have an independent proof, but this is not known; the only known proofs go through the full machinery needed to prove Theorem 2 (or the closely related theorem of Ziegler). (However, the fact that a factor of a nilsystem is again a nilsystem was established previously by Parry.)

The purpose of this post is to record a partial implication in reverse direction to the correspondence principle:

Proposition 5 Theorem 1 implies Theorem 3.

As mentioned at the start of the post, a fair amount of familiarity with the area is presumed here, and some routine steps will be presented with only a fairly brief explanation.

A few years ago, Ben Green, Tamar Ziegler, and myself proved the following (rather technical-looking) inverse theorem for the Gowers norms:

Theorem 1 (Discrete inverse theorem for Gowers norms) Let ${N \geq 1}$ and ${s \geq 1}$ be integers, and let ${\delta > 0}$. Suppose that ${f: {\bf Z} \rightarrow [-1,1]}$ is a function supported on ${[N] := \{1,\dots,N\}}$ such that

$\displaystyle \frac{1}{N^{s+2}} \sum_{n,h_1,\dots,h_{s+1}} \prod_{\omega \in \{0,1\}^{s+1}} f(n+\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1}) \geq \delta.$

Then there exists a filtered nilmanifold ${G/\Gamma}$ of degree ${\leq s}$ and complexity ${O_{s,\delta}(1)}$, a polynomial sequence ${g: {\bf Z} \rightarrow G}$, and a Lipschitz function ${F: G/\Gamma \rightarrow {\bf R}}$ of Lipschitz constant ${O_{s,\delta}(1)}$ such that

$\displaystyle \frac{1}{N} \sum_n f(n) F(g(n) \Gamma) \gg_{s,\delta} 1.$

For the definitions of “filtered nilmanifold”, “degree”, “complexity”, and “polynomial sequence”, see the paper of Ben, Tammy, and myself. (I should caution the reader that this blog post will presume a fair amount of familiarity with this subfield of additive combinatorics.) This result has a number of applications, for instance to establishing asymptotics for linear equations in the primes, but this will not be the focus of discussion here.

The purpose of this post is to record the observation that this “discrete” inverse theorem, together with an equidistribution theorem for nilsequences that Ben and I worked out in a separate paper, implies a continuous version:

Theorem 2 (Continuous inverse theorem for Gowers norms) Let ${s \geq 1}$ be an integer, and let ${\delta>0}$. Suppose that ${f: {\bf R} \rightarrow [-1,1]}$ is a measurable function supported on ${[0,1]}$ such that

$\displaystyle \int_{{\bf R}^{s+1}} \prod_{\omega \in \{0,1\}^{s+1}} f(t+\omega_1 h_1 + \dots + \omega_{s+1} h_{s+1})\ dt dh_1 \dots dh_{s+1} \geq \delta. \ \ \ \ \ (1)$

Then there exists a filtered nilmanifold ${G/\Gamma}$ of degree ${\leq s}$ and complexity ${O_{s,\delta}(1)}$, a (smooth) polynomial sequence ${g: {\bf R} \rightarrow G}$, and a Lipschitz function ${F: G/\Gamma \rightarrow {\bf R}}$ of Lipschitz constant ${O_{s,\delta}(1)}$ such that

$\displaystyle \int_{\bf R} f(t) F(g(t) \Gamma)\ dt \gg_{s,\delta} 1.$

The interval ${[0,1]}$ can be easily replaced with any other fixed interval by a change of variables. A key point here is that the bounds are completely uniform in the choice of ${f}$. Note though that the coefficients of ${g}$ can be arbitrarily large (and this is necessary, as can be seen just by considering functions of the form ${f(t) = \cos( \xi t)}$ for some arbitrarily large frequency ${\xi}$).

It is likely that one could prove Theorem 2 by carefully going through the proof of Theorem 1 and replacing all instances of ${{\bf Z}}$ with ${{\bf R}}$ (and making appropriate modifications to the argument to accommodate this). However, the proof of Theorem 1 is quite lengthy. Here, we shall proceed by the usual limiting process of viewing the continuous interval ${[0,1]}$ as a limit of the discrete interval ${\frac{1}{N} \cdot [N]}$ as ${N \rightarrow \infty}$. However there will be some problems taking the limit due to a failure of compactness, and specifically with regards to the coefficients of the polynomial sequence ${g: {\bf N} \rightarrow G}$ produced by Theorem 1, after normalising these coefficients by ${N}$. Fortunately, a factorisation theorem from a paper of Ben Green and myself resolves this problem by splitting ${g}$ into a “smooth” part which does enjoy good compactness properties, as well as “totally equidistributed” and “periodic” parts which can be eliminated using the measurability (and thus, approximate smoothness), of ${f}$.

Szemerédi’s theorem asserts that any subset of the integers of positive upper density contains arbitrarily large arithmetic progressions. Here is an equivalent quantitative form of this theorem:

Theorem 1 (Szemerédi’s theorem) Let ${N}$ be a positive integer, and let ${f: {\bf Z}/N{\bf Z} \rightarrow [0,1]}$ be a function with ${{\bf E}_{x \in {\bf Z}/N{\bf Z}} f(x) \geq \delta}$ for some ${\delta>0}$, where we use the averaging notation ${{\bf E}_{x \in A} f(x) := \frac{1}{|A|} \sum_{x \in A} f(x)}$, ${{\bf E}_{x,r \in A} f(x) := \frac{1}{|A|^2} \sum_{x, r \in A} f(x)}$, etc.. Then for ${k \geq 3}$ we have

$\displaystyle {\bf E}_{x,r \in {\bf Z}/N{\bf Z}} f(x) f(x+r) \dots f(x+(k-1)r) \geq c(k,\delta)$

for some ${c(k,\delta)>0}$ depending only on ${k,\delta}$.

The equivalence is basically thanks to an averaging argument of Varnavides; see for instance Chapter 11 of my book with Van Vu or this previous blog post for a discussion. We have removed the cases ${k=1,2}$ as they are trivial and somewhat degenerate.

There are now many proofs of this theorem. Some time ago, I took an ergodic-theoretic proof of Furstenberg and converted it to a purely finitary proof of the theorem. The argument used some simplifying innovations that had been developed since the original work of Furstenberg (in particular, deployment of the Gowers uniformity norms, as well as a “dual” norm that I called the uniformly almost periodic norm, and an emphasis on van der Waerden’s theorem for handling the “compact extension” component of the argument). But the proof was still quite messy. However, as discussed in this previous blog post, messy finitary proofs can often be cleaned up using nonstandard analysis. Thus, there should be a nonstandard version of the Furstenberg ergodic theory argument that is relatively clean. I decided (after some encouragement from Ben Green and Isaac Goldbring) to write down most of the details of this argument in this blog post, though for sake of brevity I will skim rather quickly over arguments that were already discussed at length in other blog posts. In particular, I will presume familiarity with nonstandard analysis (in particular, the notion of a standard part of a bounded real number, and the Loeb measure construction), see for instance this previous blog post for a discussion.

I’ve just uploaded to the arXiv my paper “Failure of the ${L^1}$ pointwise and maximal ergodic theorems for the free group“, submitted to Forum of Mathematics, Sigma. This paper concerns a variant of the pointwise ergodic theorem of Birkhoff, which asserts that if one has a measure-preserving shift map ${T: X \rightarrow X}$ on a probability space ${X = (X,\mu)}$, then for any ${f \in L^1(X)}$, the averages ${\frac{1}{N} \sum_{n=1}^N f \circ T^{-n}}$ converge pointwise almost everywhere. (In the important case when the shift map ${T}$ is ergodic, the pointwise limit is simply the mean ${\int_X f\ d\mu}$ of the original function ${f}$.)

The pointwise ergodic theorem can be extended to measure-preserving actions of other amenable groups, if one uses a suitably “tempered” Folner sequence of averages; see this paper of Lindenstrauss for more details. (I also wrote up some notes on that paper here, back in 2006 before I had started this blog.) But the arguments used to handle the amenable case break down completely for non-amenable groups, and in particular for the free non-abelian group ${F_2}$ on two generators.

Nevo and Stein studied this problem and obtained a number of pointwise ergodic theorems for ${F_2}$-actions ${(T_g)_{g \in F_2}}$ on probability spaces ${(X,\mu)}$. For instance, for the spherical averaging operators

$\displaystyle {\mathcal A}_n f := \frac{1}{4 \times 3^{n-1}} \sum_{g \in F_2: |g| = n} f \circ T_g^{-1}$

(where ${|g|}$ denotes the length of the reduced word that forms ${g}$), they showed that ${{\mathcal A}_{2n} f}$ converged pointwise almost everywhere provided that ${f}$ was in ${L^p(X)}$ for some ${p>1}$. (The need to restrict to spheres of even radius can be seen by considering the action of ${F_2}$ on the two-element set ${\{0,1\}}$ in which both generators of ${F_2}$ act by interchanging the elements, in which case ${{\mathcal A}_n}$ is determined by the parity of ${n}$.) This result was reproven with a different and simpler proof by Bufetov, who also managed to relax the condition ${f \in L^p(X)}$ to the weaker condition ${f \in L \log L(X)}$.

The question remained open as to whether the pointwise ergodic theorem for ${F_2}$-actions held if one only assumed that ${f}$ was in ${L^1(X)}$. Nevo and Stein were able to establish this for the Cesáro averages ${\frac{1}{N} \sum_{n=1}^N {\mathcal A}_n}$, but not for ${{\mathcal A}_n}$ itself. About six years ago, Assaf Naor and I tried our hand at this problem, and was able to show an associated maximal inequality on ${\ell^1(F_2)}$, but due to the non-amenability of ${F_2}$, this inequality did not transfer to ${L^1(X)}$ and did not have any direct impact on this question, despite a fair amount of effort on our part to attack it.

Inspired by some recent conversations with Lewis Bowen, I returned to this problem. This time around, I tried to construct a counterexample to the ${L^1}$ pointwise ergodic theorem – something Assaf and I had not seriously attempted to do (perhaps due to being a bit too enamoured of our ${\ell^1(F_2)}$ maximal inequality). I knew of an existing counterexample of Ornstein regarding a failure of an ${L^1}$ ergodic theorem for iterates ${P^n}$ of a self-adjoint Markov operator – in fact, I had written some notes on this example back in 2007. Upon revisiting my notes, I soon discovered that the Ornstein construction was adaptable to the ${F_2}$ setting, thus settling the problem in the negative:

Theorem 1 (Failure of ${L^1}$ pointwise ergodic theorem) There exists a measure-preserving ${F_2}$-action on a probability space ${X}$ and a non-negative function ${f \in L^1(X)}$ such that ${\sup_n {\mathcal A}_{2n} f(x) = +\infty}$ for almost every ${x}$.

To describe the proof of this theorem, let me first briefly sketch the main ideas of Ornstein’s construction, which gave an example of a self-adjoint Markov operator ${P}$ on a probability space ${X}$ and a non-negative ${f \in L^1(X)}$ such that ${\sup_n P^n f(x) = +\infty}$ for almost every ${x}$. By some standard manipulations, it suffices to show that for any given ${\alpha > 0}$ and ${\varepsilon>0}$, there exists a self-adjoint Markov operator ${P}$ on a probability space ${X}$ and a non-negative ${f \in L^1(X)}$ with ${\|f\|_{L^1(X)} \leq \alpha}$, such that ${\sup_n P^n f \geq 1-\varepsilon}$ on a set of measure at least ${1-\varepsilon}$. Actually, it will be convenient to replace the Markov chain ${(P^n f)_{n \geq 0}}$ with an ancient Markov chain ${(f_n)_{n \in {\bf Z}}}$ – that is to say, a sequence of non-negative functions ${f_n}$ for both positive and negative ${f}$, such that ${f_{n+1} = P f_n}$ for all ${n \in {\bf Z}}$. The purpose of requiring the Markov chain to be ancient (that is, to extend infinitely far back in time) is to allow for the Markov chain to be shifted arbitrarily in time, which is key to Ornstein’s construction. (Technically, Ornstein’s original argument only uses functions that go back to a large negative time, rather than being infinitely ancient, but I will gloss over this point for sake of discussion, as it turns out that the ${F_2}$ version of the argument can be run using infinitely ancient chains.)

For any ${\alpha>0}$, let ${P(\alpha)}$ denote the claim that for any ${\varepsilon>0}$, there exists an ancient Markov chain ${(f_n)_{n \in {\bf Z}}}$ with ${\|f_n\|_{L^1(X)} = \alpha}$ such that ${\sup_{n \in {\bf Z}} f_n \geq 1-\varepsilon}$ on a set of measure at least ${1-\varepsilon}$. Clearly ${P(1)}$ holds since we can just take ${f_n=1}$ for all ${n}$. Our objective is to show that ${P(\alpha)}$ holds for arbitrarily small ${\alpha}$. The heart of Ornstein’s argument is then the implication

$\displaystyle P(\alpha) \implies P( \alpha (1 - \frac{\alpha}{4}) ) \ \ \ \ \ (1)$

for any ${0 < \alpha \leq 1}$, which upon iteration quickly gives the desired claim.

Let’s see informally how (1) works. By hypothesis, and ignoring epsilons, we can find an ancient Markov chain ${(f_n)_{n \in {\bf Z}}}$ on some probability space ${X}$ of total mass ${\|f_n\|_{L^1(X)} = \alpha}$, such that ${\sup_n f_n}$ attains the value of ${1}$ or greater almost everywhere. Assuming that the Markov process is irreducible, the ${f_n}$ will eventually converge as ${n \rightarrow \infty}$ to the constant value of ${\|f_n\|_{L^1(X)}}$, in particular its final state will essentially stay above ${\alpha}$ (up to small errors).

Now suppose we duplicate the Markov process by replacing ${X}$ with a double copy ${X \times \{1,2\}}$ (giving ${\{1,2\}}$ the uniform probability measure), and using the disjoint sum of the Markov operators on ${X \times \{1\}}$ and ${X \times \{2\}}$ as the propagator, so that there is no interaction between the two components of this new system. Then the functions ${f'_n(x,i) := f_n(x) 1_{i=1}}$ form an ancient Markov chain of mass at most ${\alpha/2}$ that lives solely in the first half ${X \times \{1\}}$ of this copy, and ${\sup_n f'_n}$ attains the value of ${1}$ or greater on almost all of the first half ${X \times \{1\}}$, but is zero on the second half. The final state of ${f'_n}$ will be to stay above ${\alpha}$ in the first half ${X \times \{1\}}$, but be zero on the second half.

Now we modify the above example by allowing an infinitesimal amount of interaction between the two halves ${X \times \{1\}}$, ${X \times \{2\}}$ of the system (I mentally think of ${X \times \{1\}}$ and ${X \times \{2\}}$ as two identical boxes that a particle can bounce around in, and now we wish to connect the boxes by a tiny tube). The precise way in which this interaction is inserted is not terribly important so long as the new Markov process is irreducible. Once one does so, then the ancient Markov chain ${(f'_n)_{n \in {\bf Z}}}$ in the previous example gets replaced by a slightly different ancient Markov chain ${(f''_n)_{n \in {\bf Z}}}$ which is more or less identical with ${f'_n}$ for negative times ${n}$, or for bounded positive times ${n}$, but for very large values of ${n}$ the final state is now constant across the entire state space ${X \times \{1,2\}}$, and will stay above ${\alpha/2}$ on this space.

Finally, we consider an ancient Markov chain ${F_n}$ which is basically of the form

$\displaystyle F_n(x,i) \approx f''_n(x,i) + (1 - \frac{\alpha}{2}) f_{n-M}(x) 1_{i=2}$

for some large parameter ${M}$ and for all ${n \leq M}$ (the approximation becomes increasingly inaccurate for ${n}$ much larger than ${M}$, but never mind this for now). This is basically two copies of the original Markov process in separate, barely interacting state spaces ${X \times \{1\}, X \times \{2\}}$, but with the second copy delayed by a large time delay ${M}$, and also attenuated in amplitude by a factor of ${1-\frac{\alpha}{2}}$. The total mass of this process is now ${\frac{\alpha}{2} + \frac{\alpha}{2} (1 -\frac{\alpha}{2}) = \alpha (1 - \alpha/4)}$. Because of the ${f''_n}$ component of ${F_n}$, we see that ${\sup_n F_n}$ basically attains the value of ${1}$ or greater on the first half ${X \times \{1\}}$. On the second half ${X \times \{2\}}$, we work with times ${n}$ close to ${M}$. If ${M}$ is large enough, ${f''_n}$ would have averaged out to about ${\alpha/2}$ at such times, but the ${(1 - \frac{\alpha}{2}) f_{n-M}(x)}$ component can get as large as ${1-\alpha/2}$ here. Summing (and continuing to ignore various epsilon losses), we see that ${\sup_n F_n}$ can get as large as ${1}$ on almost all of the second half of ${X \times \{2\}}$. This concludes the rough sketch of how one establishes the implication (1).

It was observed by Bufetov that the spherical averages ${{\mathcal A}_n}$ for a free group action can be lifted up to become powers ${P^n}$ of a Markov operator, basically by randomly assigning a “velocity vector” ${s \in \{a,b,a^{-1},b^{-1}\}}$ to one’s base point ${x}$ and then applying the Markov process that moves ${x}$ along that velocity vector (and then randomly changing the velocity vector at each time step to the “reduced word” condition that the velocity never flips from ${s}$ to ${s^{-1}}$). Thus the spherical average problem has a Markov operator interpretation, which opens the door to adapting the Ornstein construction to the setting of ${F_2}$ systems. This turns out to be doable after a certain amount of technical artifice; the main thing is to work with ${F_2}$-measure preserving systems that admit ancient Markov chains that are initially supported in a very small region in the “interior” of the state space, so that one can couple such systems to each other “at the boundary” in the fashion needed to establish the analogue of (1) without disrupting the ancient dynamics of such chains. The initial such system (used to establish the base case ${P(1)}$) comes from basically considering the action of ${F_2}$ on a (suitably renormalised) “infinitely large ball” in the Cayley graph, after suitably gluing together the boundary of this ball to complete the action. The ancient Markov chain associated to this system starts at the centre of this infinitely large ball at infinite negative time ${n=-\infty}$, and only reaches the boundary of this ball at the time ${n=0}$.

An extremely large portion of mathematics is concerned with locating solutions to equations such as

$\displaystyle f(x) = 0$

or

$\displaystyle \Phi(x) = x \ \ \ \ \ (1)$

for ${x}$ in some suitable domain space (either finite-dimensional or infinite-dimensional), and various maps ${f}$ or ${\Phi}$. To solve the fixed point iteration equation (1), the simplest general method available is the fixed point iteration method: one starts with an initial approximate solution ${x_0}$ to (1), so that ${\Phi(x_0) \approx x_0}$, and then recursively constructs the sequence ${x_1, x_2, x_3, \dots}$ by ${x_n := \Phi(x_{n-1})}$. If ${\Phi}$ behaves enough like a “contraction”, and the domain is complete, then one can expect the ${x_n}$ to converge to a limit ${x}$, which should then be a solution to (1). For instance, if ${\Phi: X \rightarrow X}$ is a map from a metric space ${X = (X,d)}$ to itself, which is a contraction in the sense that

$\displaystyle d( \Phi(x), \Phi(y) ) \leq (1-\eta) d(x,y)$

for all ${x,y \in X}$ and some ${\eta>0}$, then with ${x_n}$ as above we have

$\displaystyle d( x_{n+1}, x_n ) \leq (1-\eta) d(x_n, x_{n-1} )$

for any ${n}$, and so the distances ${d(x_n, x_{n-1} )}$ between successive elements of the sequence decay at at least a geometric rate. This leads to the contraction mapping theorem, which has many important consequences, such as the inverse function theorem and the Picard existence theorem.

A slightly more complicated instance of this strategy arises when trying to linearise a complex map ${f: U \rightarrow {\bf C}}$ defined in a neighbourhood ${U}$ of a fixed point. For simplicity we normalise the fixed point to be the origin, thus ${0 \in U}$ and ${f(0)=0}$. When studying the complex dynamics ${f^2 = f \circ f}$, ${f^3 = f \circ f \circ f}$, ${\dots}$ of such a map, it can be useful to try to conjugate ${f}$ to another function ${g = \psi^{-1} \circ f \circ \psi}$, where ${\psi}$ is a holomorphic function defined and invertible near ${0}$ with ${\psi(0)=0}$, since the dynamics of ${g}$ will be conjguate to that of ${f}$. Note that if ${f(0)=0}$ and ${f'(0)=\lambda}$, then from the chain rule any conjugate ${g}$ of ${f}$ will also have ${g(0)=0}$ and ${g'(0)=\lambda}$. Thus, the “simplest” function one can hope to conjugate ${f}$ to is the linear function ${z \mapsto \lambda z}$. Let us say that ${f}$ is linearisable (around ${0}$) if it is conjugate to ${z \mapsto \lambda z}$ in some neighbourhood of ${0}$. Equivalently, ${f}$ is linearisable if there is a solution to the Schröder equation

$\displaystyle f( \psi(z) ) = \psi(\lambda z) \ \ \ \ \ (2)$

for some ${\psi: U' \rightarrow {\bf C}}$ defined and invertible in a neighbourhood ${U'}$ of ${0}$ with ${\psi(0)=0}$, and all ${z}$ sufficiently close to ${0}$. (The Schröder equation is normalised somewhat differently in the literature, but this form is equivalent to the usual form, at least when ${\lambda}$ is non-zero.) Note that if ${\psi}$ solves the above equation, then so does ${z \mapsto \psi(cz)}$ for any non-zero ${c}$, so we may normalise ${\psi'(0)=1}$ in addition to ${\psi(0)=0}$, which also ensures local invertibility from the inverse function theorem. (Note from winding number considerations that ${\psi}$ cannot be invertible near zero if ${\psi'(0)}$ vanishes.)

We have the following basic result of Koenigs:

Theorem 1 (Koenig’s linearisation theorem) Let ${f: U \rightarrow {\bf C}}$ be a holomorphic function defined near ${0}$ with ${f(0)=0}$ and ${f'(0)=\lambda}$. If ${0 < |\lambda| < 1}$ (attracting case) or ${1 < |\lambda| < \infty}$ (repelling case), then ${f}$ is linearisable near zero.

Proof: Observe that if ${f, \psi, \lambda}$ solve (2), then ${f^{-1}, \psi^{-1}, \lambda^{-1}}$ solve (2) also (in a sufficiently small neighbourhood of zero). Thus we may reduce to the attractive case ${0 < |\lambda| < 1}$.

Let ${r>0}$ be a sufficiently small radius, and let ${X}$ denote the space of holomorphic functions ${\psi: B(0,r) \rightarrow {\bf C}}$ on the complex disk ${B(0,r) := \{z \in {\bf C}: |z| < r \}}$ with ${\psi(0)=0}$ and ${\psi'(0)=1}$. We can view the Schröder equation (2) as a fixed point equation

$\displaystyle \psi = \Phi(\psi)$

where ${\Phi: X' \rightarrow X}$ is the partially defined function on ${X}$ that maps a function ${\psi: B(0,r) \rightarrow {\bf C}}$ to the function ${\Phi(\psi): B(0,r) \rightarrow {\bf C}}$ defined by

$\displaystyle \Phi(\psi)(z) := f^{-1}( \psi( \lambda z ) ),$

assuming that ${f^{-1}}$ is well-defined on the range of ${\psi(B(0,\lambda r))}$ (this is why ${\Phi}$ is only partially defined).

We can solve this equation by the fixed point iteration method, if ${r}$ is small enough. Namely, we start with ${\psi_0: B(0,r) \rightarrow {\bf C}}$ being the identity map, and set ${\psi_1 := \Phi(\psi_0), \psi_2 := \Phi(\psi_1)}$, etc. We equip ${X}$ with the uniform metric ${d( \psi, \tilde \psi ) := \sup_{z \in B(0,r)} |\psi(z) - \tilde \psi(z)|}$. Observe that if ${d( \psi, \psi_0 ), d(\tilde \psi, \psi_0) \leq r}$, and ${r}$ is small enough, then ${\psi, \tilde \psi}$ takes values in ${B(0,2r)}$, and ${\Phi(\psi), \Phi(\tilde \psi)}$ are well-defined and lie in ${X}$. Also, since ${f^{-1}}$ is smooth and has derivative ${\lambda^{-1}}$ at ${0}$, we have

$\displaystyle |f^{-1}(z) - f^{-1}(w)| \leq (1+\varepsilon) |\lambda|^{-1} |z-w|$

if ${z, w \in B(0,r)}$, ${\varepsilon>0}$ and ${r}$ is sufficiently small depending on ${\varepsilon}$. This is not yet enough to establish the required contraction (thanks to Mario Bonk for pointing this out); but observe that the function ${\frac{\psi(z)-\tilde \psi(z)}{z^2}}$ is holomorphic on ${B(0,r)}$ and bounded by ${d(\psi,\tilde \psi)/r^2}$ on the boundary of this ball (or slightly within this boundary), so by the maximum principle we see that

$\displaystyle |\frac{\psi(z)-\tilde \psi(z)}{z^2}| \leq \frac{1}{r^2} d(\psi,\tilde \psi)$

on all of ${B(0,r)}$, and in particular

$\displaystyle |\psi(z)-\tilde \psi(z)| \leq |\lambda|^2 d(\psi,\tilde \psi)$

on ${B(0,\lambda r)}$. Putting all this together, we see that

$\displaystyle d( \Phi(\psi), \Phi(\tilde \psi)) \leq (1+\varepsilon) |\lambda| d(\psi, \tilde \psi);$

since ${|\lambda|<1}$, we thus obtain a contraction on the ball ${\{ \psi \in X: d(\psi,\psi_0) \leq r \}}$ if ${\varepsilon}$ is small enough (and ${r}$ sufficiently small depending on ${\varepsilon}$). From this (and the completeness of ${X}$, which follows from Morera’s theorem) we see that the iteration ${\psi_n}$ converges (exponentially fast) to a limit ${\psi \in X}$ which is a fixed point of ${\Phi}$, and thus solves Schröder’s equation, as required. $\Box$

Koenig’s linearisation theorem leaves open the indifferent case when ${|\lambda|=1}$. In the rationally indifferent case when ${\lambda^n=1}$ for some natural number ${n}$, there is an obvious obstruction to linearisability, namely that ${f^n = 1}$ (in particular, linearisation is not possible in this case when ${f}$ is a non-trivial rational function). An obstruction is also present in some irrationally indifferent cases (where ${|\lambda|=1}$ but ${\lambda^n \neq 1}$ for any natural number ${n}$), if ${\lambda}$ is sufficiently close to various roots of unity; the first result of this form is due to Cremer, and the optimal result of this type for quadratic maps was established by Yoccoz. In the other direction, we have the following result of Siegel:

Theorem 2 (Siegel’s linearisation theorem) Let ${f: U \rightarrow {\bf C}}$ be a holomorphic function defined near ${0}$ with ${f(0)=0}$ and ${f'(0)=\lambda}$. If ${|\lambda|=1}$ and one has the Diophantine condition ${\frac{1}{|\lambda^n-1|} \leq C n^C}$ for all natural numbers ${n}$ and some constant ${C>0}$, then ${f}$ is linearisable at ${0}$.

The Diophantine condition can be relaxed to a more general condition involving the rational exponents of the phase ${\theta}$ of ${\lambda = e^{2\pi i \theta}}$; this was worked out by Brjuno, with the condition matching the one later obtained by Yoccoz. Amusingly, while the set of Diophantine numbers (and hence the set of linearisable ${\lambda}$) has full measure on the unit circle, the set of non-linearisable ${\lambda}$ is generic (the complement of countably many nowhere dense sets) due to the above-mentioned work of Cremer, leading to a striking disparity between the measure-theoretic and category notions of “largeness”.

Siegel’s theorem does not seem to be provable using a fixed point iteration method. However, it can be established by modifying another basic method to solve equations, namely Newton’s method. Let us first review how this method works to solve the equation ${f(x)=0}$ for some smooth function ${f: I \rightarrow {\bf R}}$ defined on an interval ${I}$. We suppose we have some initial approximant ${x_0 \in I}$ to this equation, with ${f(x_0)}$ small but not necessarily zero. To make the analysis more quantitative, let us suppose that the interval ${[x_0-r_0,x_0+r_0]}$ lies in ${I}$ for some ${r_0>0}$, and we have the estimates

$\displaystyle |f(x_0)| \leq \delta_0 r_0$

$\displaystyle |f'(x)| \geq \eta_0$

$\displaystyle |f''(x)| \leq \frac{1}{\eta_0 r_0}$

for some ${\delta_0 > 0}$ and ${0 < \eta_0 < 1/2}$ and all ${x \in [x_0-r_0,x_0+r_0]}$ (the factors of ${r_0}$ are present to make ${\delta_0,\eta_0}$ “dimensionless”).

Lemma 3 Under the above hypotheses, we can find ${x_1}$ with ${|x_1 - x_0| \leq \eta_0 r_0}$ such that

$\displaystyle |f(x_1)| \ll \delta_0^2 \eta_0^{-O(1)} r_0.$

In particular, setting ${r_1 := (1-\eta_0) r_0}$, ${\eta_1 := \eta_0/2}$, and ${\delta_1 = O(\delta_0^2 \eta_0^{-O(1)})}$, we have ${[x_1-r_1,x_1+r_1] \subset [x_0-r_0,x_0+r_0] \subset I}$, and

$\displaystyle |f(x_1)| \leq \delta_1 r_1$

$\displaystyle |f'(x)| \geq \eta_1$

$\displaystyle |f''(x)| \leq \frac{1}{\eta_1 r_1}$

for all ${x \in [x_1-r_1,x_1+r_1]}$.

The crucial point here is that the new error ${\delta_1}$ is roughly the square of the previous error ${\delta_0}$. This leads to extremely fast (double-exponential) improvement in the error upon iteration, which is more than enough to absorb the exponential losses coming from the ${\eta_0^{-O(1)}}$ factor.

Proof: If ${\delta_0 > c \eta_0^{C}}$ for some absolute constants ${C,c>0}$ then we may simply take ${x_0=x_1}$, so we may assume that ${\delta_0 \leq c \eta_0^{C}}$ for some small ${c>0}$ and large ${C>0}$. Using the Newton approximation ${f(x_0+h) \approx f(x_0) + h f'(x_0)}$ we are led to the choice

$\displaystyle x_1 := x_0 - \frac{f(x_0)}{f'(x_0)}$

for ${x_1}$. From the hypotheses on ${f}$ and the smallness hypothesis on ${\delta}$ we certainly have ${|x_1-x_0| \leq \eta_0 r_0}$. From Taylor’s theorem with remainder we have

$\displaystyle f(x_1) = f(x_0) - \frac{f(x_0)}{f'(x_0)} f'(x_0) + O( \frac{1}{\eta_0 r_0} |\frac{f(x_0)}{f'(x_0)}|^2 )$

$\displaystyle = O( \frac{1}{\eta_0 r_0} (\frac{\delta_0 r_0}{\eta_0})^2 )$

and the claim follows. $\Box$

We can iterate this procedure; starting with ${x_0,\eta_0,r_0,\delta_0}$ as above, we obtain a sequence of nested intervals ${[x_n-r_n,x_n+r_n]}$ with ${f(x_n)| \leq \delta_n}$, and with ${\eta_n,r_n,\delta_n,x_n}$ evolving by the recursive equations and estimates

$\displaystyle \eta_n = \eta_{n-1} / 2$

$\displaystyle r_n = (1 - \eta_{n-1}) r_{n-1}$

$\displaystyle \delta_n = O( \delta_{n-1}^2 \eta_{n-1}^{-O(1)} )$

$\displaystyle |x_n - x_{n-1}| \leq \eta_{n-1} r_{n-1}.$

If ${\delta_0}$ is sufficiently small depending on ${\eta_0}$, we see that ${\delta_n}$ converges rapidly to zero (indeed, we can inductively obtain a bound of the form ${\delta_n \leq \eta_0^{C (2^n + n)}}$ for some large absolute constant ${C}$ if ${\delta_0}$ is small enough), and ${x_n}$ converges to a limit ${x \in I}$ which then solves the equation ${f(x)=0}$ by the continuity of ${f}$.

As I recently learned from Zhiqiang Li, a similar scheme works to prove Siegel’s theorem, as can be found for instance in this text of Carleson and Gamelin. The key is the following analogue of Lemma 3.

Lemma 4 Let ${\lambda}$ be a complex number with ${|\lambda|=1}$ and ${\frac{1}{|\lambda^n-1|} \ll n^{O(1)}}$ for all natural numbers ${n}$. Let ${r_0>0}$, and let ${f_0: B(0,r_0) \rightarrow {\bf C}}$ be a holomorphic function with ${f_0(0)=0}$, ${f'_0(0)=\lambda}$, and

$\displaystyle |f_0(z) - \lambda z| \leq \delta_0 r_0 \ \ \ \ \ (3)$

for all ${z \in B(0,r_0)}$ and some ${\delta_0>0}$. Let ${0 < \eta_0 \leq 1/2}$, and set ${r_1 := (1-\eta_0) r_0}$. Then there exists an injective holomorphic function ${\psi_0: B(0, r_1) \rightarrow B(0, r_0)}$ and a holomorphic function ${f_1: B(0,r_1) \rightarrow {\bf C}}$ such that

$\displaystyle f_0( \psi_1(z) ) = \psi_1(f_1(z)) \ \ \ \ \ (4)$

for all ${z \in B(0,r_1)}$, and such that

$\displaystyle |\psi_1(z) - z| \ll \delta_0 \eta_0^{-O(1)} r_1$

and

$\displaystyle |f_1(z) - \lambda z| \leq \delta_1 r_1$

for all ${z \in B(0,r_1)}$ and some ${\delta_1 = O(\delta_0^2 \eta_0^{-O(1)})}$.

Proof: By scaling we may normalise ${r_0=1}$. If ${\delta_0 > c \eta_0^C}$ for some constants ${c,C>0}$, then we can simply take ${\psi_1}$ to be the identity and ${f_1=f_0}$, so we may assume that ${\delta_0 \leq c \eta_0^C}$ for some small ${c>0}$ and large ${C>0}$.

To motivate the choice of ${\psi_1}$, we write ${f_0(z) = \lambda z + \hat f_0(z)}$ and ${\psi_1(z) = z + \hat \psi(z)}$, with ${\hat f_0}$ and ${\hat \psi_1}$ viewed as small. We would like to have ${f_0(\psi_1(z)) \approx \psi_1(\lambda z)}$, which expands as

$\displaystyle \lambda z + \lambda \hat \psi_1(z) + \hat f_0( z + \hat \psi_1(z) ) \approx \lambda z + \hat \psi_1(\lambda z).$

As ${\hat f_0}$ and ${\hat \psi}$ are both small, we can heuristically approximate ${\hat f_0(z + \hat \psi_1(z) ) \approx \hat f_0(z)}$ up to quadratic errors (compare with the Newton approximation ${f(x_0+h) \approx f(x_0) + h f'(x_0)}$), and arrive at the equation

$\displaystyle \hat \psi_1(\lambda z) - \lambda \hat \psi_1(z) = \hat f_0(z). \ \ \ \ \ (5)$

This equation can be solved by Taylor series; the function ${\hat f_0}$ vanishes to second order at the origin and thus has a Taylor expansion

$\displaystyle \hat f_0(z) = \sum_{n=2}^\infty a_n z^n$

and then ${\hat \psi_1}$ has a Taylor expansion

$\displaystyle \hat \psi_1(z) = \sum_{n=2}^\infty \frac{a_n}{\lambda^n - \lambda} z^n.$

We take this as our definition of ${\hat \psi_1}$, define ${\psi_1(z) := z + \hat \psi_1(z)}$, and then define ${f_1}$ implicitly via (4).

Let us now justify that this choice works. By (3) and the generalised Cauchy integral formula, we have ${|a_n| \leq \delta_0}$ for all ${n}$; by the Diophantine assumption on ${\lambda}$, we thus have ${|\frac{a_n}{\lambda^n - \lambda}| \ll \delta_0 n^{O(1)}}$. In particular, ${\hat \psi_1}$ converges on ${B(0,1)}$, and on the disk ${B(0, (1-\eta_0/4))}$ (say) we have the bounds

$\displaystyle |\hat \psi_1(z)|, |\hat \psi'_1(z)| \ll \delta_0 \sum_{n=2}^\infty n^{O(1)} (1-\eta_0/4)^n \ll \eta_0^{-O(1)} \delta_0. \ \ \ \ \ (6)$

In particular, as ${\delta_0}$ is so small, we see that ${\psi_1}$ maps ${B(0, (1-\eta_0/4))}$ injectively to ${B(0,1)}$ and ${B(0,1-\eta_0)}$ to ${B(0,1-3\eta_0/4)}$, and the inverse ${\psi_1^{-1}}$ maps ${B(0, (1-\eta_0/2))}$ to ${B(0, (1-\eta_0/4))}$. From (3) we see that ${f_0}$ maps ${B(0,1-3\eta_0/4)}$ to ${B(0,1-\eta_0/2)}$, and so if we set ${f_1: B(0,1-\eta_0) \rightarrow B(0,1-\eta_0/4)}$ to be the function ${f_1 := \psi_1^{-1} \circ f_0 \circ \psi_1}$, then ${f_1}$ is a holomorphic function obeying (4). Expanding (4) in terms of ${\hat f_0}$ and ${\hat \psi_1}$ as before, and also writing ${f_1(z) = \lambda z + \hat f_1(z)}$, we have

$\displaystyle \lambda z + \lambda \hat \psi_1(z) + \hat f_0( z + \hat \psi_1(z) ) = \lambda z + \hat f_1(z) + \hat \psi_1(\lambda z + \hat f_1(z))$

for ${z \in B(0, 1-\eta_0)}$, which by (5) simplifies to

$\displaystyle \hat f_1(z) = \hat f_0( z + \hat \psi_1(z) ) - \hat f_0(z) + \hat \psi_1(\lambda z) - \hat \psi_1(\lambda z + \hat f_1(z)).$

From (6), the fundamental theorem of calculus, and the smallness of ${\delta_0}$ we have

$\displaystyle |\hat \psi_1(\lambda z) - \hat \psi_1(\lambda z + \hat f_1(z))| \leq \frac{1}{2} |\hat f_1(z)|$

and thus

$\displaystyle |\hat f_1(z)| \leq 2 |\hat f_0( z + \hat \psi_1(z) ) - \hat f_0(z)|.$

From (3) and the Cauchy integral formula we have ${\hat f'_0(z) = O( \delta_0 \eta_0^{-O(1)})}$ on (say) ${B(0,1-\eta_0/4)}$, and so from (6) and the fundamental theorem of calculus we conclude that

$\displaystyle |\hat f_1(z)| \ll \delta_0^2 \eta_0^{-O(1)}$

on ${B(0,1-\eta_0)}$, and the claim follows. $\Box$

If we set ${\eta_0 := 1/2}$, ${f_0 := f}$, and ${\delta_0>0}$ to be sufficiently small, then (since ${f(z)-\lambda z}$ vanishes to second order at the origin), the hypotheses of this lemma will be obeyed for some sufficiently small ${r_0}$. Iterating the lemma (and halving ${\eta_0}$ repeatedly), we can then find sequences ${\eta_n, \delta_n, r_n > 0}$, injective holomorphic functions ${\psi_n: B(0,r_n) \rightarrow B(0,r_{n-1})}$ and holomorphic functions ${f_n: B(0,r_n) \rightarrow {\bf C}}$ such that one has the recursive identities and estimates

$\displaystyle \eta_n = \eta_{n-1} / 2$

$\displaystyle r_n = (1 - \eta_{n-1}) r_{n-1}$

$\displaystyle \delta_n = O( \delta_{n-1}^2 \eta_{n-1}^{-O(1)} )$

$\displaystyle |\psi_n(z) - z| \ll \delta_{n-1} \eta_{n-1}^{-O(1)} r_n$

$\displaystyle |f_n(z) - \lambda z| \leq \delta_n r_n$

$\displaystyle f_{n-1}( \psi_n(z) ) = \psi_n(f_n(z))$

for all ${n \geq 1}$ and ${z \in B(0,r_n)}$. By construction, ${r_n}$ decreases to a positive radius ${r_\infty}$ that is a constant multiple of ${r_0}$, while (for ${\delta_0}$ small enough) ${\delta_n}$ converges double-exponentially to zero, so in particular ${f_n(z)}$ converges uniformly to ${\lambda z}$ on ${B(0,r_\infty)}$. Also, ${\psi_n}$ is close enough to the identity, the compositions ${\Psi_n := \psi_1 \circ \dots \circ \psi_n}$ are uniformly convergent on ${B(0,r_\infty/2)}$ with ${\Psi_n(0)=0}$ and ${\Psi'_n(0)=1}$. From this we have

$\displaystyle f( \Psi_n(z) ) = \Psi_n(f_n(z))$

on ${B(0,r_\infty/4)}$, and on taking limits using Morera’s theorem we obtain a holomorphic function ${\Psi}$ defined near ${0}$ with ${\Psi(0)=0}$, ${\Psi'(0)=1}$, and

$\displaystyle f( \Psi(z) ) = \Psi(\lambda z),$

obtaining the required linearisation.

Remark 5 The idea of using a Newton-type method to obtain error terms that decay double-exponentially, and can therefore absorb exponential losses in the iteration, also occurs in KAM theory and in Nash-Moser iteration, presumably due to Siegel’s influence on Moser. (I discuss Nash-Moser iteration in this note that I wrote back in 2006.)