Two weeks ago I was at Oberwolfach, for the Arbeitsgemeinschaft in Ergodic Theory and Combinatorial Number Theory that I was one of the organisers for. At this workshop, I learned the details of a very nice recent convergence result of Miguel Walsh (who, incidentally, is an informal grandstudent of mine, as his advisor, Roman Sasyk, was my informal student), which considerably strengthens and generalises a number of previous convergence results in ergodic theory (including one of my own), with a remarkably simple proof. Walsh’s argument is phrased in a finitary language (somewhat similar, in fact, to the approach used in my paper mentioned previously), and (among other things) relies on the concept of metastability of sequences, a variant of the notion of convergence which is useful in situations in which one does not expect a uniform convergence rate; see this previous blog post for some discussion of metastability. When interpreted in a finitary setting, this concept requires a fair amount of “epsilon management” to manipulate; also, Walsh’s argument uses some other epsilon-intensive finitary arguments, such as a decomposition lemma of Gowers based on the Hahn-Banach theorem. As such, I was tempted to try to rewrite Walsh’s argument in the language of nonstandard analysis to see the extent to which these sorts of issues could be managed. As it turns out, the argument gets cleaned up rather nicely, with the notion of metastability being replaced with the simpler notion of external Cauchy convergence (which we will define below the fold).

Let’s first state Walsh’s theorem. This theorem is a norm convergence theorem in ergodic theory, and can be viewed as a substantial generalisation of one of the most fundamental theorems of this type, namely the mean ergodic theorem:

Theorem 1 (Mean ergodic theorem) Let ${(X,\mu,T)}$ be a measure-preserving system (a probability space ${(X,\mu)}$ with an invertible measure-preserving transformation ${T}$). Then for any ${f \in L^2(X,\mu)}$, the averages ${\frac{1}{N} \sum_{n=1}^N T^n f}$ converge in ${L^2(X,\mu)}$ norm as ${N \rightarrow \infty}$, where ${T^n f(x) := f(T^{-n} x)}$.

In this post, all functions in ${L^2(X,\mu)}$ and similar spaces will be taken to be real instead of complex-valued for simplicity, though the extension to the complex setting is routine.

Actually, we have a precise description of the limit of these averages, namely the orthogonal projection of ${f}$ to the ${T}$-invariant factors. (See for instance my lecture notes on this theorem.) While this theorem ostensibly involves measure theory, it can be abstracted to the more general setting of unitary operators on a Hilbert space:

Theorem 2 (von Neumann mean ergodic theorem) Let ${H}$ be a Hilbert space, and let ${U: H \rightarrow H}$ be a unitary operator on ${H}$. Then for any ${f \in H}$, the averages ${\frac{1}{N} \sum_{n=1}^N U^n f}$ converge strongly in ${H}$ as ${N \rightarrow \infty}$.

Again, see my lecture notes (or just about any text in ergodic theory) for a proof.

Now we turn to Walsh’s theorem.

Theorem 3 (Walsh’s convergence theorem) Let ${(X,\mu)}$ be a measure space with a measure-preserving action of a nilpotent group ${G}$. Let ${g_1,\ldots,g_k: {\bf Z} \rightarrow G}$ be polynomial sequences in ${G}$ (i.e. each ${g_i}$ takes the form ${g_i(n) = a_{i,1}^{p_{i,1}(n)} \ldots a_{i,j}^{p_{i,j}(n)}}$ for some ${a_{i,1},\ldots,a_{i,j} \in G}$ and polynomials ${p_{i,1},\ldots,p_{i,j}: {\bf Z} \rightarrow {\bf Z}}$). Then for any ${f_1,\ldots,f_k \in L^\infty(X,\mu)}$, the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ converge in ${L^2(X,\mu)}$ norm as ${N \rightarrow \infty}$, where ${g(n) f(x) := f(g(n)^{-1} x)}$.

It turns out that this theorem can also be abstracted to some extent, although due to the multiplication in the summand ${(g_1(n) f_1) \ldots (g_k(n) f_k)}$, one cannot work purely with Hilbert spaces as in the von Neumann mean ergodic theorem, but must also work with something like the Banach algebra ${L^\infty(X,\mu)}$. There are a number of ways to formulate this abstraction (which will be of some minor convenience to us, as it will allow us to reduce the need to invoke the nonstandard measure theory of Loeb, discussed for instance in this blog post); we will use the notion of a (real) commutative probability space ${({\mathcal A},\tau)}$, which for us will be a commutative unital algebra ${{\mathcal A}}$ over the reals together with a linear functional ${\tau: {\mathcal A} \rightarrow {\bf R}}$ which maps ${1}$ to ${1}$ and obeys the non-negativity axiom ${\tau(f^2) \ge 0}$ for all ${f}$. The key example to keep in mind here is ${{\mathcal A} = L^\infty(X,\mu)}$ of essentially bounded real-valued measurable functions with the supremum norm, and with the trace ${\tau(f) := \int_X f\ d\mu}$. We will also assume in our definition of commutative probability spaces that all elements ${f}$ of ${{\mathcal A}}$ are bounded in the sense that the spectral radius ${\rho(f) := \lim_{k \rightarrow \infty} \tau(f^{2k})^{1/2k}}$ is finite. (In the concrete case of ${L^\infty(X,\mu)}$, the spectral radius is just the ${L^\infty}$ norm.)

Given a commutative probability space, we can form an inner product ${\langle, \rangle_{L^2(\tau)}}$ on it by the formula

$\displaystyle \langle f, g \rangle_{L^2(\tau)} := \tau(fg).$

This is a positive semi-definite form, and gives a (possibly degenerate) inner product structure on ${{\mathcal A}}$. We could complete this structure into a Hilbert space ${L^2(\tau)}$ (after quotienting out the elements of zero norm), but we will not do so here, instead just viewing ${L^2(\tau)}$ as providing a semi-metric on ${{\mathcal A}}$. For future reference we record the inequalities

$\displaystyle \rho(fg) \leq \rho(f) \rho(g)$

$\displaystyle \rho(f+g) \leq \rho(f) + \rho(g)$

$\displaystyle \| fg\|_{L^2(\tau)} \leq \|f\|_{L^2(\tau)} \rho(g)$

for any ${f,g}$, which we will use in the sequel without further comment; see e.g. these previous blog notes for proofs. (Actually, for the purposes of proving Theorem 3, one can specialise to the ${L^\infty(X,\mu)}$ case (and ultraproducts thereof), in which case these inequalities are just the triangle and Hölder inequalities.)

The abstract version of Theorem 3 is then

Theorem 4 (Walsh’s theorem, abstract version) Let ${({\mathcal A},\tau)}$ be a commutative probability space, and let ${G}$ be a nilpotent group acting on ${{\mathcal A}}$ by isomorphisms (preserving the algebra, conjugation, and trace structure, and thus also preserving the spectral radius and ${L^2(\tau)}$ norm). Let ${g_1,\ldots,g_k: {\bf Z} \rightarrow G}$ be polynomial sequences. Then for any ${f_1,\ldots,f_k \in {\mathcal A}}$, the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ form a Cauchy sequence in ${L^2(\tau)}$ (semi-)norm as ${N \rightarrow \infty}$.

It is easy to see that this theorem generalises Theorem 3. Conversely, one can use the commutative Gelfand-Naimark theorem to deduce Theorem 4 from Theorem 3, although we will not need this implication. Note how we are abandoning all attempts to discern what the limit of the sequence actually is, instead contenting ourselves with demonstrating that it is merely a Cauchy sequence. With this phrasing, it is tempting to ask whether there is any analogue of Walsh’s theorem for noncommutative probability spaces, but unfortunately the answer to that question is negative for all but the simplest of averages, as was worked out in this paper of Austin, Eisner, and myself.

Our proof of Theorem 4 will proceed as follows. Firstly, in order to avoid the epsilon management alluded to earlier, we will take an ultraproduct to rephrase the theorem in the language of nonstandard analysis; for reasons that will be clearer later, we will also convert the convergence problem to a problem of obtaining metastability (external Cauchy convergence). Then, we observe that (the nonstandard counterpart of) the expression ${\|\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)\|_{L^2(\tau)}^2}$ can be viewed as the inner product of (say) ${f_k}$ with a certain type of expression, which we call a dual function. By performing an orthogonal projection to the span of the dual functions, we can split ${f_k}$ into the sum of an expression orthogonal to all dual functions (the “pseudorandom” component), and a function that can be well approximated by finite linear combinations of dual functions (the “structured” component). The contribution of the pseudorandom component is asymptotically negligible, so we can reduce to consideration of the structured component. But by a little bit of rearrangement, this can be viewed as an average of expressions similar to the initial average ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$, except with the polynomials ${g_1,\ldots,g_k}$ replaced by a “lower complexity” set of such polynomials, which can be greater in number, but which have slightly lower degrees in some sense. One can iterate this (using “PET induction”) until all the polynomials become trivial, at which point the claim follows.

— 1. Nonstandard analysis and metastability —

We will assume some familiarity with nonstandard analysis, as covered for instance in these previous blog posts.

As is common practice in nonstandard analysis, we will need to select a non-principal ultrafilter ${p \in \beta {\bf N} \backslash {\bf N}}$. Using this ultrafilter, we can now form the ultraproduct ${\prod_{{\bf n} \rightarrow p} X_{\bf n}}$ of any sequence ${X_{\bf n}}$ of (standard) spaces, defined as the space of all ultralimits ${\lim_{{\bf n} \rightarrow p} x_{\bf n}}$ of sequences ${x_{\bf n}\in X_{\bf n}}$ defined for ${{\bf n}}$ sufficiently close to ${p}$, with two sequences considered to have the same ultralimit iff they agree sufficiently close to ${p}$. Any operation or relation on the standard spaces ${X_{\bf n}}$ can then be defined on the nonstandard space ${X = \prod_{{\bf n} \rightarrow p} X_{\bf n}}$ in a natural fashion. For instance, given a sequence of standard functions ${f_{\bf n}: X_{\bf n} \rightarrow Y_{\bf n}}$, one can form their ultralimit ${f = \lim_{{\bf n} \rightarrow p} f_{\bf n}}$ from the nonstandard space ${X = \prod_{{\bf n} \rightarrow p} X_{\bf n}}$ to the nonstandard space ${Y = \prod_{{\bf n} \rightarrow p} Y_{\bf n}}$ by the formula

$\displaystyle f( \lim_{{\bf n} \rightarrow p} x_{\bf n} ) := \lim_{{\bf n} \rightarrow p} f( x_{\bf n}).$

As usual, we call a nonstandard real ${x \in {}^* {\bf R}}$ bounded if we have ${|x| \leq C}$ for some standard ${C}$, and infinitesimal if we have ${|x| \leq \epsilon}$ for every standard ${\epsilon>0}$, and in the latter case we also write ${x=o(1)}$. Every bounded nonstandard real ${x}$ is infinitesimally close to a unique standard real, called the standard part ${\hbox{st}(x)}$ of ${x}$.

We will need the following fundamental properties about nonstandard analysis:

• (Transfer / Los’s theorem) If for each ${i=1,\ldots,k}$, ${x_{i,{\bf n}}}$ is a sequence of mathematical objects, spaces, or functions, with ultralimit or ultraproduct ${x_i}$, then for any first-order predicate ${P( y_1,\ldots,y_k)}$ involving ${k}$ mathematical objects of the appropriate type, the claim ${P(x_1,\ldots,x_k)}$ is true if and only if ${P(x_1,\ldots,x_{k,{\bf n}})}$.
• (Overspill) If an internal set ${E}$ (an ultraproduct of standard sets, also known as a nonstandard set) of nonstandard numbers contains all unbounded natural numbers, then there exists a standard natural number ${N}$ such that ${E}$ contains all nonstandard numbers larger than ${N}$.
• (Loeb measure, hyperfinite case) If ${A}$ is a non-empty nonstandard finite set (i.e. the ultraproduct ${A = \prod_{{\bf n} \rightarrow p} A_{\bf n}}$ of standard finite sets, also known as a hyperfinite set), and the Loeb ${\sigma}$-algebra is defined as the ${\sigma}$-algebra generated by the internal subsets of ${A}$, then there exists a unique countably additive probability measure ${\mu_A}$ on the Loeb ${\sigma}$-algebra, called Loeb measure, such that for any internal subset ${E =\prod_{{\bf n} \rightarrow p} E_{\bf n}}$ of ${A}$, one has ${\mu_A(E) = \hbox{st} \lim_{{\bf n} \rightarrow p} |E_{\bf n}|/|A_{\bf n}|}$. See e.g. this previous blog post for the details of the construction.

To motivate the discussion that follows, let us recall some equivalent formulations of a Cauchy sequence in a pseudometric space (i.e. a generalisation of a metric space in which some distances are allowed to vanish).

Proposition 5 Let ${(x_n)_{n \in {\bf N}}}$ be a sequence in a pseudometric space ${(X,d)}$ (not necessarily complete). Let ${(x_n)_{n \in {}^* {\bf N}}}$ be the nonstandard extension of ${(x_n)_{n \in {\bf N}}}$, taking values in the nonstandard metric space ${({}^* X, {}^* d)}$. Then the following are equivalent:

1. (standard Cauchy sequence) For every standard ${\epsilon > 0}$, there exists a standard ${N > 0}$ such that ${d(x_n,x_m) \leq \epsilon}$ for all standard ${n,m \geq N}$.
2. (nonstandard Cauchy sequence) For every nonstandard ${\epsilon > 0}$, there exists a nonstandard ${N > 0}$ such that ${{}^* d(x_n,x_m) \leq \epsilon}$ for all nonstandard ${n,m \geq N}$.
3. (standard metastability) For every standard function ${F: {\bf N} \rightarrow {\bf N}}$ and standard ${\epsilon>0}$, there exists a standard ${N>0}$ such that ${d(x_n,x_m) \leq \epsilon}$ for all standard ${N \leq n,m \leq N+F(N)}$.
4. (nonstandard metastability) For every nonstandard function ${F: {\bf N} \rightarrow {\bf N}}$ and nonstandard ${\epsilon>0}$, there exists a nonstandard ${N>0}$ such that ${{}^* d(x_n,x_m) \leq \epsilon}$ for all nonstandard ${N \leq n,m \leq N+F(N)}$.
5. (asymptotic stability) One has ${{}^* d(x_n,x_m) = o(1)}$ for all unbounded ${n,m}$.

Proof: The equivalence of 1 and 2 follows from the transfer principle (or Los’s theorem), as does the equivalence of 3 and 4. The implication of 3 from 1 is also clear. Finally, suppose that 1 failed, then there is an ${\epsilon>0}$ such that for every standard ${N>0}$ we can find a larger number ${N' > N}$ such that ${d(x_N,x_{N'}) > \epsilon}$. Setting ${F(N) := N'-N}$, we see that 3 fails also.

If 1 holds, then from transfer we see that for any unbounded ${n,m}$, one has ${{}^* d(x_n,x_m) \leq \epsilon}$ for every standard ${\epsilon}$, giving 5. Conversely, if 1 fails, then letting ${F(N)}$ be as before, we see from transfer that ${{}^* d(x_N, x_{N+F(N)}) > \epsilon}$ for every nonstandard ${N}$, contradicting 5. $\Box$

Now we consider more general sequences, in which the above notions of convergence begin to diverge:

Definition 6 Let ${(X,d)}$ be a nonstandard pseudometric space (i.e. the ultraproduct of standard pseudometric spaces ${(X_{\bf n},d_{\bf n})}$; in particular, ${d}$ takes values in ${{}^* {\bf R}^+}$ rather than ${{\bf R}^+}$), and let ${x: n \mapsto x_n}$ be an nonstandard sequence (or internal sequence) in ${X}$, that is to say a nonstandard map (or internal map) from ${{}^* {\bf N}}$ to ${X}$ (and thus an ultralimit of maps from ${{\bf N}}$ to ${X_{\bf n}}$).

• We say that the sequence ${x}$ is internally Cauchy if for every nonstandard ${\epsilon>0}$, there exists a nonstandard ${N>0}$ such that ${d(x_n,x_m) \leq \epsilon}$ for all nonstandard ${n,m \geq N}$.
• We say that the sequence ${x}$ is externally Cauchy or metastable if for every standard ${\epsilon>0}$, there exists a standard ${N>0}$ such that ${d(x_n,x_m) \leq \epsilon}$ for all standard ${n,m \geq N}$.
• We say that the sequence ${x}$ is asymptotically stable if ${d(x_n,x_m)=o(1)}$ whenever ${n,m}$ are unbounded.

These three notions are now distinct, even for a simple nonstandard metric space such as the ultrapower ${{}^*[-1,1]}$ of the unit interval with the usual metric, as the following examples demonstrate:

1. If ${N}$ is an unbounded natural number, then the nonstandard sequence ${x_n := (-1)^n \exp(-n/N)}$ is internally Cauchy, but not externally Cauchy or asymptotically stable.
2. If ${N}$ is an unbounded natural number, then the nonstandard sequence ${x_n := 1_{n > N}}$ is internally and externally Cauchy, but not asymptotically stable.
3. If ${N}$ is an unbounded natural number, then the nonstandard sequence ${x_n := (-1)^n 1_{n > N}}$ is externally Cauchy, but not internally Cauchy or asymptotically stable.
4. If ${N}$ is an unbounded natural number, then the nonstandard sequence ${x_n := (-1)^n/N}$ is asymptotically stable and externally Cauchy, but not internally Cauchy.
5. Any monotone bounded nonstandard sequence of nonstandard reals is automatically both externally Cauchy and internally Cauchy, but is not necessarily asymptotically stable, as the example ${1_{n>N}}$ above shows.
6. The property of being externally Cauchy is only dependent on an initial segment of the sequence: if ${x_n}$ is externally Cauchy, and one modifies ${x_n}$ arbitrarily for ${n \geq N}$ and some fixed unbounded ${N}$, then the modified sequence ${x'_n}$ will still be externally Cauchy. The same claim is certainly not true for the notions of internally Cauchy or asymptotically stable, as can be seen by considering examples such as ${0}$, ${1_{n > N}}$ and ${(-1)^n 1_{n>N}}$.
7. The property of being externally Cauchy is closed under (external) uniform limits; if ${x_n}$ is a nonstandard sequence such that for every standard ${\epsilon>0}$ one can find an externally Cauchy sequence ${x'_n}$ with ${d(x_n,x'_n) \leq \epsilon}$ for all ${n}$, then ${x_n}$ is itself externally Cauchy. The same claim holds as well for asymptotically stability, but not for the internal Cauchy property (unless one allows ${\epsilon}$ to be nonstandard).

One can equate these three nonstandard notions of convergence with standard notions as follows:

Proposition 7 Let ${(X,d)}$ be a nonstandard pseudometric space (the ultraproduct of standard pseudometric spaces ${(X_{\bf n},d_{\bf n})}$), and let the nonstandard sequence ${x: {}^* {\bf N} \rightarrow X}$ be the ultralimit of standard sequences ${x_{\bf n}: {\bf N} \rightarrow X_{\bf n}}$.

1. The nonstandard sequence ${x}$ is internally Cauchy if and only if the standard sequences ${x_{\bf n}}$ are Cauchy for all ${{\bf n}}$ sufficiently close to ${p}$.
2. The nonstandard sequence ${x}$ is externally Cauchy if and only if for every standard ${F: {\bf N} \rightarrow {\bf N}}$ and standard ${\epsilon>0}$ , there exists a standard ${N}$ such that ${d_{\bf n}(x_n,x_m) \leq \epsilon}$ for all ${N \leq n,m\leq N+F(N)}$ and all ${{\bf n}}$ sufficiently close to ${p}$.
3. The nonstandard sequence ${x}$ is asymptotically stable if and only if for every standard ${\epsilon>0}$, there exists a standard ${N}$ such that one has ${d_{\bf n}(x_n,x_m) \leq \epsilon}$ for all standard ${n,m \geq N}$ and all ${{\bf n}}$ sufficiently close to ${p}$.
4. The nonstandard sequence ${x}$ is externally Cauchy if and only if there exists an unbounded ${N_0}$ such that ${x}$ is asymptotically stable up to ${N_0}$, in the sense that ${d(x_n,x_m)=o(1)}$ for all unbounded ${n,m \leq N_0}$.

Informally: internally Cauchy sequences are ultralimits of sequences that are Cauchy; externally Cauchy sequences are ultralimits of sequences that are uniformly metastable for an asymptotically infinite period of time; and asymptotically stable sequences are ultralimits of sequences that converge at a uniform rate for an asymptotically infinite period of time.

Proof: The claim 1 follows directly from the transfer principle. Claim 2 follows from the equivalences of parts 1 and 3 of Proposition 5 applied to the standard portion ${(x_n)_{n \in{\bf N}}}$ of the sequence (replacing the nonstandard metric by its standard part). Finally, we verify claim 3. If ${x}$ is asymptotically stable, then for every standard ${\epsilon>0}$, we have ${d(x_n,x_m) < \epsilon}$ for all unbounded ${n,m}$, and so by the overspill principle, there is a standard ${N}$ such that ${d(x_n,x_m) < \epsilon}$ for all ${n,m \geq N}$, which by transfer gives the “only if” portion of Claim 3. Reversing these steps gives the “if” direction.

To show Claim 4, observe that if ${x}$ is externally Cauchy, then for every standard ${\epsilon > 0}$, one has ${d(x_n,x_m) \leq \epsilon}$ for all sufficiently large standard ${n,m}$, and thus by overspill there is an unbounded ${N_\epsilon}$ such that ${d(x_n,x_m) \leq \epsilon}$ for all unbounded ${n,m \leq N_\epsilon}$. By overspill (or countable saturation) one can find an unbounded ${N_0}$ such that ${N_0 \leq N_\epsilon}$ for every standard ${\epsilon>0}$, giving the “only if” direction. The “if” implication follows by reversing the steps. $\Box$

From these equivalences one sees that asymptotic stability implies externally Cauchy, but as the above counterexamples show, there are no other implications between the three concepts.

Of the three notions of convergence for nonstandard sequences, we will focus almost exclusively on the notion of external Cauchy convergence, which at the finitary level corresponds to uniform metastability bounds (as opposed to qualitative convergence, or convergence at a uniform rate). In particular, we will deduce Walsh’s theorem from the following nonstandard version:

Theorem 8 (Walsh’s theorem, nonstandard version) Let ${({\mathcal A},\tau)}$ be a nonstandard commutative probability space (i.e. the ultraproduct of standard commutative probability spaces), and let ${G}$ be a nilpotent nonstandard group acting on ${{\mathcal A}}$ by isomorphisms. Let ${g_1,\ldots,g_k: {}^* {\bf Z} \rightarrow {}^* G}$ be polynomial nonstandard functions (i.e. each ${g_i}$ takes the form ${g_i(n) = a_{i,1}^{p_{i,1}(n)} \ldots a_{i,j}^{p_{i,j}(n)}}$ for some standard ${j}$, some ${a_{i,1},\ldots,a_{i,j} \in G}$ and some standard polynomials ${p_{i,1},\ldots,p_{i,j}: {\bf Z} \rightarrow {\bf Z}}$). Then for any elements ${f_1,\ldots,f_k \in {\mathcal A}}$ which are bounded (in the sense that ${\rho(f_1),\ldots,\rho(f_k)}$ are bounded), the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ form an externally Cauchy sequence with respect to the (non-standard) ${L^2(\tau)}$ pseudometric.

From Proposition 5, Theorem 8 implies Theorem 4 and thus Theorem 3. But it is actually somewhat stronger, in that it gives a uniform metastability on the averages occuring in those latter two theorems. (This uniform metastability was already derived in Walsh’s original paper, and I did something similar in the special case of linear commuting averages.) This uniformity ultimately comes from the fact that in the above theorem, the polynomial sequences ${g_1,\ldots,g_k}$ are allowed to have nonstandard coefficients, rather than just standard ones (and the space ${({\mathcal A},\tau)}$ is a general nonstandard space, rather than an ultrapower).

Remark 1 Since the original appearance of this post, it was essentially observed in this preprint of Avigad and Iovino that the original result in Theorem 4 implies the special case of Theorem 8 in the case when the polynomials ${g_1,\ldots,g_k}$ have standard coefficients, as one can use the standard part construction to project the nonstandard commutative probability space to a standard commutative probability space. As such, Theorem 4 automatically implies a metastable version of itself, in which the metastability bound is allowed to depend on the coefficients of the polynomials as well as the degree. The result in Theorem 8 is then apparently stronger because the metastability bound obtained in the finitary setting is also uniform in the choice of coefficients of the polynomials; however, by lifting the group ${G}$ to a higher rank nilpotent group, one can replace the action of any finite number of polynomials with unbounded coefficients with lifted polynomials with bounded coefficients (this trick dates back to the book of Furstenberg) and so it turns out that Theorem 8 is ultimately equivalent to Theorem 4 (and also Theorem 3, using the structural theory of commutative probability spaces).

From the definition of external Cauchy convergence, it is clear that if ${(x_n)_{n \in {}^* {\bf N}}}$, ${(y_n)_{n \in {}^* {\bf N}}}$ are two externally Cauchy convergent sequences of (nonstandard) reals, then their sum is also externally Cauchy convergent, and more generally any (standard) finite linear combination (with standard real coefficients) of externally Cauchy convergent sequences of nonstandard reals is also externally Cauchy convergent. A key property for us is that external Cauchy convergence is also preserved by hyperfinite averages involving a nonstandardly finite number of sequences:

Proposition 9 (Metastable dominated convergence theorem) Let ${A}$ be a non-empty nonstandard finite set (i.e. the ultraproduct of standard finite sets), and let ${( x_{n,\alpha} )_{n \in {}^* {\bf N}; \alpha \in A}}$ be an internal family of internal sequences ${(x_{n,\alpha})_{n \in {}^* {\bf N}}}$ of bounded elements of a nonstandard normed vector space. If the sequences ${(x_{n,\alpha})_{n \in {}^* {\bf N}}}$ are externally Cauchy convergent for each ${\alpha \in A}$, then the (nonstandardly) averaged sequence ${( \mathop{\bf E}_{\alpha \in A} x_{n,\alpha} )_{n \in {}^* {\bf N}}}$ is also externally Cauchy convergent.

This is an infinitary version of the finitary metastable dominated convergence theorem that first appeared in this paper of mine, which roughly speaking claims that the average of uniformly metastable bounded sequences is again metastable. The proof was infinitary (deducing it from the Lebesgue dominated convergence theorem), and we will take a similar approach here. The argument was eventually finitised (and strengthened) in this paper of Avigad, Dean, and Rute, but the finitary argument is surprisingly non-trivial.

Proof: As each ${x_{n,\alpha}}$ is individually bounded (i.e. smaller in norm than any unbounded natural number), and depends internally on ${n,\alpha}$, we see from overspill that there is a uniform bound ${\|x_{n,\alpha}\| \leq C}$ for some standard natural number ${C}$.

For each standard ${\epsilon>0}$ and standard natural number ${N}$, let ${E_{\epsilon,N}}$ denote the subset of ${A}$ given by the formula

$\displaystyle E_{\epsilon,N} := \{ \alpha \in A: \|x_{n,\alpha}-x_{m,\alpha}\| \leq \epsilon \hbox{ for all } n,m \geq N \}.$

These sets are not internal subsets of ${A}$, but are instead ${\bigwedge}$-internal (i.e. countable intersections of internal sets). In particular, they are still Loeb measurable subsets of ${A}$ and thus have a well-defined Loeb measure ${\mu_A(E_{\epsilon,N})}$.

By hypothesis, we see that for any fixed ${\epsilon}$, the ${E_{\epsilon,N}}$ increase to ${A}$ in the sense that ${E_{\epsilon,N} \subset E_{\epsilon,N+1}}$ and ${A = \bigcup_N E_{\epsilon,N}}$. By monotone convergence, we conclude that there exists a standard ${N}$ such that ${\mu_A(E_{\epsilon,N}) \geq 1-\epsilon}$. We then have for ${n,m \geq N}$ that

$\displaystyle \|\mathop{\bf E}_{\alpha \in A} x_{n,\alpha} - \mathop{\bf E}_{\alpha \in A} x_{m,\alpha}\| \leq \mathop{\bf E}_{\alpha \in A} \| x_{n,\alpha} - x_{m,\alpha} \|$

$\displaystyle = \int_A \hbox{st} \|x_{n,\alpha} - x_{m,\alpha}\|\ d\mu_A(\alpha)$

$\displaystyle \leq \int_{E_{\epsilon,N}} \epsilon\ d\mu_A(\alpha) + \int_{A \backslash E_{\epsilon,N}} 2C\ d\mu_A(\alpha)$

$\displaystyle \leq (2C+1) \epsilon.$

As ${\epsilon}$ can be arbitrarily small, this gives the external Cauchy convergence of ${\mathop{\bf E}_{\alpha \in A} x_{n,\alpha}}$ as desired. $\Box$

Remark 2 This proof is significantly shorter than the finitary proof of Avigad, Dean, and Rute, but the complexity has been concealed in the construction of Loeb measure and the monotone convergence theorem. This is typically how nonstandard analysis arguments work; they are unable to magically make the “hard” component of an argument disappear entirely, but they are often able to efficiently conceal such components in fundamental building blocks which are of independent interest, and which can be usefully applied as a black box to a wide spectrum of problems. (In contrast, a hard argument in a finitary argument often needs to be reworked each time one wishes to apply it to a new problem.)

— 2. A simple case: the von Neumann ergodic theorem —

Before we prove Theorem 8, let us first warm up by establishing an easy case, namely the nonstandard version of the von Neumann ergodic theorem (Theorem 2):

Theorem 10 (Nonstandard von Neumann mean ergodic theorem) Let ${H}$ be a nonstandard inner product space (i.e. the ultraproduct of standard inner product spaces), and let ${U: H \rightarrow H}$ be a be a nonstandard unitary operator on ${H}$. Then for any bounded ${f \in H}$, the averages ${A_N(f) := \frac{1}{N} \sum_{n=1}^N U^n f}$ are externally Cauchy in ${N}$.

We first observe that if one takes the bounded elements of ${H}$ and forms the Hilbert space completion using the standard norm

$\displaystyle \| f \|_{\hbox{st}(H)} := \hbox{st} \|f\|_H,$

one obtains a Hilbert space ${\overline{H}}$. Thanks to the bound

$\displaystyle \|A_N(f) \|_{\hbox{st}(H)} \leq \|f \|_{\hbox{st}(H)}$

for all bounded ${f}$ in ${H}$, we see that these ergodic averages can be defined in ${\hbox{st}(H)}$, and so it suffices to show that the averages ${A_N(f)}$ are externally Cauchy in ${\hbox{st}(H)}$ for all ${f \in \hbox{st}(H)}$.

Let us first investigate a condition that would force ${A_{N}(f)}$ to be asymptotically stable in ${\hbox{st}(H)}$. We expand out

$\displaystyle \|A_{N}(f)\|_{\hbox{st}(H)}^2$

$\displaystyle = \hbox{st} \langle \frac{1}{N} \sum_{n=1}^N A_N(f), U^n f \rangle_H$

(where all expressions have been extended to nonstandard values of ${n}$ or ${N}$ in the usual fashion, and operations are extended from the bounded elements of ${H}$ to ${\hbox{st}(H)}$ by continuity). With a little bit of rearrangement, this expression can be rewritten as ${\langle f, D_N(A_N(f)) \rangle_{\hbox{st} H}}$, where the dual function ${D_N(f_0) = D_{N,f}(f_0) \in \overline{L^2(\tau)}}$ for any ${f_0 \in \hbox{st}(H)}$ is defined by the formula

$\displaystyle D_N(f_0) := \frac{1}{N} \sum_{n=1}^N U^{-n} f_0.$

(The terminology of dual functions originates from this paper of Ben Green and myself.) Thus, if we let ${Z}$ be the linear span in ${\hbox{st}(H)}$ of all functions of the form ${D_N(f_0)}$ with ${N}$ unbounded and ${f_0 \in \hbox{st}(H)}$, and ${f}$ is orthogonal to ${Z}$, then ${A_N(f)}$ vanishes in ${\hbox{st}(H)}$ for any unbounded ${N}$. This makes ${A_{N}(f)}$ asymptotically stable, and thus externally Cauchy, in ${\hbox{st}(H)}$ norm.

In view of this fact, the existence of an orthogonal projection to the closure of ${Z}$, and the linearity of ${A_{N}(f)}$ in ${f}$, it suffices to show that for any ${f}$ in the closure of ${Z}$ in ${\hbox{st}(H)}$, the expression ${A_N(f)}$ is eventually Cauchy in ${\hbox{st}(H)}$.

Remark 3 This step used the existence of orthogonal projections from a Hilbert space to a closed subspace, and is closely related to an analogous use of such projections in the textbook proof of Theorem 2 (see e.g. the proof of Theorem 2 in these blog notes). In the finitary argument of Walsh, one uses instead a decomposition established by Gowers using the Hahn-Banach theorem as a substitute for orthogonal projections. See also the “Hilbert space finite convergence principle” from this blog post for a closely related link between orthogonal projections and quantitative decompositions.

Having eliminated the “pseudorandom case” when ${f}$ is orthogonal to ${Z}$, we have now reduced to the “structured case” when ${f}$ lies in the closure of ${Z}$. By linearity and an approximation argument, we may reduce to the case when ${f}$ is just a projection ${f = D_{N_0}(f_0)}$ of a single dual function ${D_{N_0}(f_0)}$ for some ${f_0 \in \hbox{st}(H)}$ and unbounded ${N_0}$, and by a further density and approximation argument we can assume that ${f_0}$ is a bounded element of ${H}$, rather than a general element of ${\hbox{st}(H)}$. (Incidentally, the non-standard analysis formalism is painlessly skipping over a certain amount of epsilon management here which is much more visible in the finitary version of the argument.) Thus, our task is now to show that the expression ${A_{N}(D_{N_0}(f_0))}$ is externally Cauchy in ${H}$.

We expand

$\displaystyle A_N( D_{N_0}(f_0) ) = \frac{1}{N} \sum_{n=1}^N \frac{1}{N_0} \sum_{n'=1}^{N_0} U^{n-n'} f_0.$

Fixing ${N_0}$, we now restrict to the regime ${N=o(N_0)}$, thus ${N \leq \epsilon N_0}$ for all standard ${\epsilon>0}$. Then ${n = o(N_0)}$ as well. We can then shift the range ${\{1,\ldots,N_0\}}$ by ${n}$ by making a substitution ${n' := n+m}$. If we then return ${m}$ to the range ${\{1,\ldots,N_0\}}$, this creates an error of ${\hbox{st}(H)}$ norm ${\hbox{st} O( n / N_0 ) = 0}$, and so

$\displaystyle A_N( D_{N_0}(f_0) ) = \frac{1}{N} \sum_{n=1}^N \frac{1}{N_0} \sum_{m=1}^{N_0} U^{-m} f_0 = D_{N_0}(f_0)$

in ${\hbox{st}(H)}$ for all ${N = o(N_0)}$. In particular, ${A_N( D_{N_0}(f_0) )}$ is asymptotically stable up to ${o(N_0)}$, and thus externally Cauchy as required.

— 3. Descent —

Theorem 8 is proven by an induction on the “complexity” of the ${g_1,\ldots,g_k}$. Fix ${({\mathcal A},\tau)}$ and the action of the nilpotent nonstandard group ${G}$. Given any finite tuple ${\vec g := (g_1,\ldots,g_k)}$ of internal functions from ${{}^* {\bf Z}}$ to ${G}$ (not necessarily polynomials), let us say that ${\vec g}$ is good if the conclusion of Theorem 8 holds, thus the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ form an external Cauchy sequence in ${L^2(\tau)}$ for all bounded ${f_1,\ldots,f_k \in {\mathcal A}}$. Trivially, any permutation of a good tuple is good, and any tuple that consists entirely of copies of the constant function ${1_G}$ mapping ${{\bf Z}}$ to the group identity ${1_G}$ of ${G}$ is good. Furthermore, if ${\vec g}$ is a finite tuple, and ${\vec g'}$ is obtained from ${\vec g}$ by removing duplicate functions (e.g. converting ${(g_1,g_2,g_1,g_2,g_3)}$ into ${(g_1,g_2,g_3)}$) and also removing all copies of ${1_G}$, then ${\vec g}$ is good if and only if ${\vec g'}$ is good.

If the tuple ${\vec g = (g_1,\ldots,g_k)}$ is non-empty (i.e. ${k \geq 1}$), then for any standard integer ${m}$, we define the ${m}$-reduction ${\vec g^*_m}$ of ${\vec g}$ to be the tuple consisting of the functions ${g_1,\ldots,g_{k-1}}$, together with the function

$\displaystyle n \mapsto g_k(n) g_k(n+m)^{-1}$

and the functions

$\displaystyle n \mapsto g_k(n) g_k(n+m)^{-1} g_i(n+m)$

for ${i=1,\ldots,k-1}$. (We will see why these particular functions arise in the argument shortly.) The key step in proving Theorem 8 is then the following result, reminiscent of the van der Corput lemma in ergodic theory (see e.g. this blog post).

Proposition 11 (Descent) Let ${({\mathcal A},\tau)}$ be a nonstandard commutative probability space, and let ${G}$ be a nonstandard group acting on ${{\mathcal A}}$ by isomorphisms. Let ${\vec g}$ be a non-empty finite tuple of internal functions from ${{}^* {\bf Z}}$ to ${G}$. If ${\vec g^*_m}$ is good for every nonstandard integer ${m}$, then ${\vec g}$ is good.

Once one has this proposition, Theorem 8 will be an immediate consequence of the following combinatorial claim (and the remarks made at the beginning of this section).

Proposition 12 (PET induction) Let ${G}$ be a nilpotent nonstandard group. Then there exists a well-ordered set ${W}$ and a way to assign to each finite tuple ${\vec g}$ of polynomial nonstandard functions from ${{}^* {\bf Z}}$ to a nilpotent nonstandard group ${G}$ a tuple ${\tilde \vec g}$ which is a permutation of ${\vec g}$ after all duplicates and copies of ${1_G}$ have been removed, and a weight ${w(\vec g)}$ in ${W}$, with the following property: if ${\tilde \vec g}$ is non-empty, and ${m}$ is an nonstandard integer, then there is a permutation of ${(\tilde \vec g)^*_m}$ which has a strictly smaller weight than that of ${\vec g}$.

Indeed, once one has this proposition and Proposition 11, Theorem 8 follows by strong induction on the weight ${w(\vec g)}$.

We prove Proposition 11 in this section, and Proposition 12 in the next section.

The proof of Proposition 11 closely mimics the proof of Theorem 10. Fix ${({\mathcal A},\tau)}$, ${G}$, the tuple ${\vec g = (g_1,\ldots,g_k)}$, and bounded elements ${f_1,\ldots,f_k \in {\mathcal A}}$, and assume that ${\vec g^*_m}$ is good for all standard integers ${m}$. We consider the averages

$\displaystyle A_{\vec g,N}(f_1,\ldots,f_k) := \frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k) \ \ \ \ \ (1)$

and our task is to show that the ${A_N(f_1,\ldots,f_k)}$ form an external Cauchy sequence in ${L^2(\tau)}$.

We fix bounded elements ${f_1,\ldots,f_{k-1} \in {\mathcal A}}$, and largely work with manipulation of ${f_k}$. The map ${f_k \mapsto A_{\vec g,N}(f_1,\ldots,f_k)}$ is then an operator from ${{\mathcal A}}$ to ${{\mathcal A}}$. We have the easily verified bound

$\displaystyle \|A_{\vec g,N}(f_1,\ldots,f_k)\|_{L^2(\tau)} \leq \rho(f_1) \ldots \rho(f_{k-1}) \|f_k\|_{L^2(\tau)}.$

Because of this, the linear operator ${f_k \mapsto A_{\vec g,N}(f_1,\ldots,f_k)}$ can be uniquely continuously extended to a linear operator from ${\hbox{st} L^2(\tau)}$ to ${\hbox{st} L^2(\tau)}$, where ${\hbox{st} L^2(\tau)}$ is defined as the Hilbert space completion of the bounded elements of ${{\mathcal A}}$ under the norm

$\displaystyle \|f\|_{\hbox{st} L^2(\tau)} := \hbox{st} \|f\|_{L^2(\tau)} = \hbox{st} \tau( ff^* )^{1/2}.$

In particular ${\hbox{st} L^2(\tau)}$ quotients out all the elements of infinitesimal norm. In this Hilbertian formalism, the problem can now be viewed as one of establishing a weighted variant of Theorem 10.

Let us first investigate a condition that would force ${A_{\vec g,N}(f_1,\ldots,f_k)}$ to be be asymptotically stable in ${\hbox{st} L^2(\tau)}$. We expand out

$\displaystyle \|A_{\vec g,N}(f_1,\ldots,f_k)\|_{\hbox{st} L^2(\tau)}^2$

$\displaystyle = \hbox{st} \tau( \frac{1}{N} \sum_{n=1}^N A_{\vec g,N}(f_1,\ldots,f_k) (g_1(n) f_1) \ldots (g_k(n) f_k) )$

(where all expressions have been extended to nonstandard values of ${n}$ or ${N}$ in the usual fashion, and operations are extended from the bounded elements of ${{\mathcal A}}$ to ${\hbox{st} L^2(\tau)}$ by continuity). With a little bit of rearrangement, this expression can be rewritten as ${\langle f_k, D_N(A_{\vec g,N}(f_1,\ldots,f_k)) \rangle_{\hbox{st} L^2(\tau)}}$, where the dual function ${D_N(f_0) = D_{N,f_1,\ldots,f_{k-1},\vec g}(f_0) \in \overline{L^2(\tau)}}$ for any ${f_0 \in \hbox{st} L^2(\tau)}$ is defined by the formula

$\displaystyle D_N(f_0) := \frac{1}{N} \sum_{n=1}^N (g_k(n)^{-1} f_0) (g_k(n)^{-1} g_1(n) f_1) \ldots (g_k(n)^{-1} g_{k-1}(n) f_{k-1}).$

Thus, if we let ${Z}$ be the linear span in ${\hbox{st} L^2(\tau)}$ of all functions of the form ${D_N(f_0)}$ with ${N}$ unbounded and ${f_0 \in \hbox{st} L^2(\tau)}$, and ${f_k}$ is orthogonal to ${Z}$, then ${A_{\vec g,N}(f_1,\ldots,f_k)}$ vanishes in ${\hbox{st} L^2(\tau)}$ for any unbounded ${N}$. This makes ${A_{\vec g,N}(f_1,\ldots,f_k)}$ asymptotically stable, and thus externally Cauchy, in ${L^2(\tau)}$ norm.

In view of this fact, the existence of an orthogonal projection to the closure of ${Z}$, and the linearity of ${A_{\vec g,N}(f_1,\ldots,f_k)}$ in ${f_k}$, it suffices to show that for any ${f_k}$ in the closure of ${Z}$ in ${\hbox{st} L^2(\tau)}$, the expression ${A_{\vec g,N}(f_1,\ldots,f_k)}$ is eventually Cauchy in ${L^2(\tau)}$.

Having eliminated the “pseudorandom case” when ${f_k}$ is orthogonal to ${Z}$, we have now reduced to the “structured case” when ${f_k}$ lies in the closure of ${Z}$. By linearity and an approximation argument, we may reduce to the case when ${f_k}$ is just a projection ${f_k = D_{N_0}(f_0)}$ of a single dual function ${D_{N_0}(f_0)}$ for some ${f_0 \in \hbox{st} L^2(\tau)}$ and unbounded ${N_0}$, and by a further density and approximation argument we can assume that ${f_0}$ is a bounded element of ${{\mathcal A}}$, rather than a general element of ${\hbox{st} L^2(\tau)}$.

Inspecting the definition (1) of ${A_{\vec g,N}}$, we see that we need to understand the shifts ${g_k(n) D_{N_0}(f_0)}$ for ${1 \leq n \leq N}$. It is here that we perform the “van der Corput” or “Weyl differencing” calculation that is pervasive in multiple recurrence theory. Namely, we expand

$\displaystyle g_k(n) D_{N_0}(f_0) = \frac{1}{N_0} \sum_{n'=1}^{N_0} (g_k(n) g_k(n')^{-1} f_0) (g_k(n) g_k(n')^{-1} g_1(n') f_1) \ldots$

$\displaystyle \ldots (g_k(n) g_k(n')^{-1} g_{k-1}(n') f_{k-1}).$

Fixing ${N_0}$, we now restrict to the regime ${N=o(N_0)}$, thus ${N \leq \epsilon N_0}$ for all standard ${\epsilon>0}$. Then ${n = o(N_0)}$ as well. We can then shift the range ${\{1,\ldots,N_0\}}$ by ${n}$ by making a substitution ${n' := n+m}$. If we then return ${m}$ to the range ${\{1,\ldots,N_0\}}$, this creates an error of ${\hbox{st} L^2(\tau)}$ norm ${\hbox{st} O( n / N_0 ) = 0}$, and so

$\displaystyle g_k(n) D_{N_0}(f_0) = \frac{1}{N_0} \sum_{m=1}^{N_0} (g_k(n) g_k(n+m)^{-1} f_0)$

$\displaystyle (g_k(n) g_k(n+m)^{-1} g_1(n+m) f_1) \ldots$

$\displaystyle \ldots (g_k(n) g_k(n+m)^{-1} g_{k-1}(n+m) f_{k-1})$

(with both sides being interpreted in ${\hbox{st} L^2(\tau)}$). (In the language of Walsh’s paper, this identity asserts that dual functions are reducible.) Substituting this into (1), and recalling the definition of the tuples ${\vec g^*_k}$, we thus obtain the “Weyl differencing identity”

$\displaystyle A_{\vec g,N}(f_1,\ldots,f_{k-1},D_{N_0}(f_0)) \ \ \ \ \ (2)$

$\displaystyle = \frac{1}{N_0} \sum_{m=1}^{N_0} A_{\vec g^*_m,N}( f_1,\ldots,f_{k-1}, f_0, f_1, \ldots, f_{k-1} )$

in ${\hbox{st} L^2(\tau)}$ whenever ${N = o(N_0)}$. In particular, since the property of being externally Cauchy is unaffected by truncation to ${n for any unbounded ${N_1}$ (and in particular to an unbounded ${N_1=o(N_0)}$), we see that the left-hand side of (2) is externally Cauchy in ${L^2(\tau)}$ if and only if the right-hand side is. But by the induction hypothesis, each of the sequences ${A_{\vec g^*_m,N}( f_1,\ldots,f_{k-1}, f_0, f_1, \ldots, f_{k-1} )}$ is externally Cauchy in ${L^2(\tau)}$, and from Proposition 9 we see that ${A_{\vec g,N}(f_1,\ldots,f_{k-1},D_{N_0}(f_0))}$ is externally Cauchy in ${L^2(\tau)}$, and Proposition 11 follows.

— 4. PET induction —

Now we prove Proposition 12, which will follow the general PET induction method first introduced by Bergelson. We prove the claim first for abelian groups (where there is an obvious notion of the “degree” of a polynomial sequence), and indicate at the end of the section how to modify the argument to handle nilpotent groups.

Henceforth the nonstandard abelian group ${G}$ is fixed. In the abelian case, we can take ${W}$ to be the well-ordered set ${\omega^\omega}$ of tuples ${(a_0,a_1,\ldots)}$ of standard non-negative integers with only finitely many of the ${a_i}$ non-zero, with the reverse lexicographical ordering, thus ${(a_0,a_1,\ldots) < (b_0,b_1,\ldots)}$ if there exists ${n}$ such that ${a_n and ${a_m=b_m}$ for all ${m>n}$.

It is easy to see that any polynomial nonstandard function ${g: {}^* {\bf Z} \rightarrow G}$ can be uniquely expressed in the discrete Taylor expansion form

$\displaystyle g(n) = g_0 g_1^n g_2^{\binom{n}{2}} \ldots g_d^{\binom{n}{d}}$

for some finite number of group elements ${g_0,\ldots,g_d}$ with ${g_d}$ non-trivial (or with ${d=-\infty}$ if ${g}$ is trivial). We call ${d}$ the degree of ${g}$; in the case that ${G}$ is the nonstandard integers ${{}^* {\bf N}}$ with the additive group operation, this corresponds to the usual notion of the degree of a polynomial.

We observe the ultratriangle inequality

$\displaystyle \hbox{deg}(g_1 g_2) \leq \max( \hbox{deg}(g_1), \hbox{deg}(g_2) ) \ \ \ \ \ (3)$

with the inequality being equality if ${g_1,g_2}$ have different degree; we also have the symmetry property

$\displaystyle \hbox{deg}(g^{-1}) = \hbox{deg}(g). \ \ \ \ \ (4)$

Also, we observe the key fact that if ${g}$ is a non-trivial polynomial sequence, then for any ${h \in {}^* {\bf Z}}$, the derivative ${\partial_h g: {}^* {\bf Z} \rightarrow G}$ defined by ${\partial_h g(n) := g(n) g(n+h)^{-1}}$ is a polynomial sequence of strictly smaller degree.

We can now place an ultrametric on ${{\mathcal P}}$, with the distance between two polynomials ${g_1,g_2}$ defined as

$\displaystyle d(g_1,g_2) := 2^{\hbox{deg}(g_1^{-1} g_2)}, \ \ \ \ \ (5)$

with the convention that ${2^{-\infty} = 0}$. One easily verifies that the ultrametric axioms are obeyed.

Example 1 If ${G={}^* {\bf Z}}$, and we consider the four polynomials ${g_0(n) := 0}$, ${g_1(n) := 1}$, ${g_2(n) := n}$, ${g_3(n) := n^2}$, then ${g_3}$ is separated from ${g_0,g_1,g_2}$ by a distance of ${4}$, ${g_2}$ is separated from ${g_0,g_1}$ by a distance of ${2}$, and ${g_0,g_1}$ are separated from each other by a distance of ${1}$.

Now let ${(g_0,g_1,g_2,\ldots,g_k)}$ be a finite tuple of polynomials in ${{\mathcal P}}$ for some ${k \geq 0}$. Selecting a reference polynomial ${g_*}$ (not necessarily in the tuple), we say that two polynomials ${g_a,g_b}$ are equivalent relative to ${g_*}$ if ${d(g_a,g_b) \leq d(g_a,g_*)/2}$. From the ultrametric property we see that this is an equivalence relation, and each equivalence class is a constant distance from ${g_*}$. We can then define the weight function ${w = w_{g_0,\ldots,g_k; g_*} \in W}$ of the tuple ${(g_0,\ldots,g_k)}$ relative to ${g_*}$ to equal ${(a_0,a_1,\ldots)}$, where ${a_d}$ is the number of equivalence classes that have distance exactly ${2^d}$ from ${g_i}$.

Example 2 Let ${g_0,g_1,g_2,g_3}$ be as in Example 1. Relative to ${g_3}$, ${g_0,g_1,g_2}$ are all equivalent and at distance ${4}$ from ${g_3}$, so the weight function here is ${(0,0,1,0,\ldots)}$. Relative instead to ${g_0}$, none of the ${g_1,g_2,g_3}$ are equivalent, and at are distances ${1,2,4}$ from ${g_0}$, so the weight function here is ${(1,1,1,0,\ldots)}$. For the tuple ${0,n^2,2n^2}$, the weight function relative to any one of these three polynomials is ${(0,0,1,0,\ldots)}$.

Now let ${\vec g = (g_1,\ldots,g_k)}$ be a non-empty tuple of nonstandard polynomials. We form ${\tilde \vec g = (\tilde g_1,\ldots,\tilde g_{\tilde k})}$ by removing all duplicates and copies of ${1_G}$ from ${\vec g}$(starting from the left and moving right), and if ${\tilde g_{\tilde k}}$ does not already have maximal degree amongst all the ${\tilde g_i}$, permute the tuple ${\tilde \vec g}$ in some arbitrary fashion to make this the case. We define the weight ${w(\vec g)}$ of ${\vec g}$ to be the weight of the augmented tuple ${(1_G, \tilde g_1,\ldots,\tilde g_{\tilde k})}$ relative to the final element ${\tilde g_{\tilde k}}$ of the tuple:

$\displaystyle w(\vec g) := w_{1_G,\tilde g_1,\ldots,\tilde g_{\tilde k}; \tilde g_{\tilde k}}.$

Suppose ${w(\vec g) = (a_0,a_1,\ldots)}$, thus there are ${a_d}$ equivalence classes intersecting ${1_G,\tilde g_1,\ldots,\tilde g_{\tilde k-1}}$ that are at a distance exactly ${2^d}$ from ${\tilde g_{\tilde k}}$. We set ${\tilde g_0 := 1_G}$.

Now let ${m}$ be a nonstandard integer, and consider the ${m}$-reduction

$\displaystyle \vec \tilde g^*_m = (\tilde g_1,\ldots,\tilde g_{\tilde k-1},g'_0,\ldots,g'_{\tilde k-1}),$

where ${g'_i(n) := \tilde g_{\tilde k}(n) \tilde g_{\tilde k}(n+m)^{-1} \tilde g_i(n+m)}$. We first consider the weight of the augmented tuple

$\displaystyle (1_G, \vec \tilde g^*_m) = (\tilde g_0,\tilde g_1,\ldots,\tilde g_{\tilde k-1},g'_0,\ldots,g'_{\tilde k-1}) \ \ \ \ \ (6)$

relative to ${\tilde g_{\tilde k}}$. Observe that for any ${j=0,\ldots,\tilde k-1}$, one has

$\displaystyle d(\tilde g_j,g'_j) = 2^{\hbox{deg}( \tilde g_j(\cdot)^{-1} \tilde g_{\tilde k}(\cdot) \tilde g_{\tilde k}(\cdot+m)^{-1} \tilde g_j(\cdot+m) )}$

$\displaystyle = 2^{\hbox{deg}( \partial_m ( \tilde g_j^{-1} \tilde g_{\tilde k} ) )}$

$\displaystyle < 2^{\hbox{deg}( \tilde g_j^{-1} \tilde g_{\tilde k})}$

$\displaystyle = d( \tilde g_j, \tilde g_{\tilde k} );$

thus, relative to ${\tilde g_{\tilde k}}$, ${\tilde g_j}$ and ${g_j}$ are in the same equivalence class. As such, we see that the weight of the tuple (6) relative to ${\tilde g_{\tilde k}}$ is equal to ${w(\vec g)}$, thus there are still ${a_d}$ equivalence classes intersecting the tuple (6) that are distance exactly ${2^d}$ from ${\tilde g_{\tilde k}}$. We remark for future reference that the abelian nature of ${G}$ was not directly used in the above calculation.

Now let ${g_*}$ be the element of the tuple (6) which has the minimal distance ${2^{d_*} = d(g_*,\tilde g_{\tilde k})}$ to ${\tilde g_{\tilde k}}$, and has maximal degree. The two requirements are compatible, as any element of the tuple has degree less than that of ${\tilde g_{\tilde k}}$ (which has the maximal degree by construction) necessarily has the maximal distance to ${\tilde g_{\tilde k}}$. The weight of (6) relative to ${g_*}$ is then strictly smaller than the weight of (6) relative to ${\tilde g_{\tilde k}}$, because the weight function ${a_{d_*}}$ at ${d_*}$ is decreased by one, while the weight function at all values strictly greater than ${d_*}$ are unchanged. (The weight function at values less than ${d_*}$ can increase dramatically, but with the lexicographical ordering this does not change the validity of the previous assertion.) Because of this, if we then permute ${\vec \tilde g^*_m}$ to place ${g_*}$ at the end, then we see that ${w(\vec \tilde g^*_m) < w(\vec g)}$ (note that removing duplicates and copies of ${1_G}$ from ${\vec \tilde g^*_m}$ only serves to decrease the weight vector, not to increase it), and the claim follows.

Example 3 Suppose we start with the tuple ${(n^2,2n^2)}$, whose weight vector is ${(0,0,2,0,\ldots)}$. Performing an ${1}$-reduction, we obtain

$\displaystyle (n^2, -4n-2, n^2-2n-1),$

with a weight vector now reduced to ${(0,1,1,0,\ldots)}$. Note that ${n^2-2n-1}$ already has maximal degree and has minimal distance to ${2n^2}$, so no additional permutation is needed at this stage. Performing another ${1}$-reduction, we obtain

$\displaystyle (n^2, -4n-2, -2n+1, n^2+2, -6n-5)$

but now we need to permute to move ${n^2+2}$ (which has maximal degree and minimal distance to ${n^2-2n-1}$) to the end, giving

$\displaystyle (n^2, -4n-2, -2n+1, -6n-5, n^2+2)$

with a weight vector now of ${(1,0,1,0,\ldots)}$. Performing another ${1}$-reduction, we obtain

$\displaystyle (n^2, -4n-2, -2n+1, -6n-5, -2n-1, n^2, -6n-5, -4n-2)$

which after eliminating duplicates and moving ${n^2}$ (which has maximal degree and minimal distance to ${n^2+2}$) to the end, gives

$\displaystyle (-4n-2, -2n+1, -6n-5, -2n-1, n^2)$

with a weight vector of ${(0,0,1,0,\ldots)}$. Another ${1}$-reduction then gives

$\displaystyle (-4n-2, -2n+1, -6n-5, -2n-1,$

$\displaystyle -2n-1, -6n-5, -4n-2, -8n-12)$

(note the elimination of all quadratic terms) which after eliminating duplicates becomes

$\displaystyle (-4n-2, -2n+1, -6n-5, -8n-12)$

with a weight vector of ${(0,4,0,\ldots)}$. Performing yet another ${1}$-reduction gives

$\displaystyle (-4n-2, -2n+1, -6n-5, -8, -4n-14, -2n-9, -6n-19)$

with a weight vector of ${(1,3,0,\ldots)}$. Continuing this process, we will see that the linear terms will eventually all be eliminated, leaving only the constant terms, which can then be eliminated one at a time using further reduction until only the empty tuple remains. See also Walsh’s paper for several further examples of this reduction process, as well as some commentary on how the process can be speeded up somewhat if one observes that one can eliminate not only duplicate polynomials, but also polynomials which differ from an existing polynomial by a constant.

Finally, we address the case of a nilpotent group ${G}$, which will be a modification of the previous argument. The main issue is how to define degree properly. If ${g: {}^* {\bf Z} \rightarrow G}$ is a polynomial nonstandard sequence, then by many applications of (discrete analogues of) Baker-Campbell-Hausdorff formula, we can (as before) place ${g}$ uniquely in the Taylor expansion form

$\displaystyle g(n) = g_0 g_1^n g_2^{\binom{n}{2}} \ldots g_d^{\binom{n}{d}} \ \ \ \ \ (7)$

for some (standard) finite number ${g_0,\ldots,g_d}$ of group elements of ${G}$; see e.g. Exercise 11 of this previous blog post. In the abelian case, we used the largest ${d}$ for which ${g_d}$ was non-trivial as the degree of ${g}$. This turns out to not be a good choice in the nilpotent case, because the crucial ultratriangle property (3) does not hold for this concept of degree. For instance, if ${G}$ is a two-step nilpotent group, and ${g,h}$ are non-commuting elements of ${G}$, then the sequences ${n \mapsto g^n, n \mapsto h^n}$ would ostensibly have degree ${1}$ with this definition, but the product

$\displaystyle g^n h^n = (gh)^n [g,h]^{-\binom{n}{2}}$

where ${[g,h] := g^{-1}h^{-1} gh}$ is the commutator of ${g}$ and ${h}$, would then have degree ${2}$, thus contradicting (3). (The symmetry property (4) can also be shown to break down.)

Fortunately, the theory of polynomial sequences in nilpotent groups has been understood since the work of Leibman. The trick is not to view the coefficients ${g_0,\ldots,g_d}$ appearing above as roaming unrestrictedly in the whole ${s}$-step nilpotent group ${G}$, but to restrict some or all of these coefficients to subgroups in the lower central series ${G = G_1 \geq G_2 \geq \ldots \geq G_s}$, defined by setting ${G_1 := G}$ and ${G_{i+1} := [G,G_i]}$ for all ${i \geq 1}$. Given natural numbers ${d_1,\ldots,d_s}$, we then say that a sequence ${g(n)}$ has filtered degree at most ${(d_1,\ldots,d_s)}$ if, when using the Taylor expansion (7), we have ${g_k \in G_i}$ whenever ${k > d_i}$. Thus, for instance, if ${s=2}$, ${g_1 \in G_1}$ and ${g_2 \in G_2}$, then the sequence ${n \mapsto g_1^n g_2^{\binom{n}{2}}}$ has filtered degree at most ${(1,2)}$. A fundamental result of Leibman (proven for instance in this previous post) asserts that if the sequence ${(d_1,\ldots,d_s)}$ is superadditive in the sense that ${d_{i+j} \geq d_i+d_j}$ whenever ${i+j \leq s}$, then the collection of polynomial sequences of filtered degree at most ${(d_1,\ldots,d_s)}$ form a group. A related fact is that if a sequence ${g}$ has filtered degree at most ${(d_1,\ldots,d_s)}$ for some superadditive ${(d_1,\ldots,d_s)}$, then any derivative ${\partial_m g}$ of ${g}$ has filtered degree at most ${((d_1-1)_+,\ldots,(d_s-1)_+)}$ (which is still superadditive).

If we let ${S}$ be the set of all superadditive degree sequences, we can order such sequences lexicographically by declaring ${(d_1,\ldots,d_s) < (d'_1,\ldots,d'_s)}$ if there is an ${i}$ with ${d_i < d'_i}$ and ${d_j=d'_j}$ for all ${j. This makes ${S}$ a well-ordered set, and then we can define the filtered degree ${\hbox{deg}(g)}$ of a polynomial sequence ${g}$ to be the minimal ${(d_1,\ldots,d_s)}$ in ${S}$ for which ${g}$ has filtered degree at at most ${(d_1,\ldots,d_s)}$. Thus, for instance, in an ${s}$-step nilpotent group, a sequence ${n \mapsto g^n}$ with ${g}$ non-trivial would have filtered degree ${(1,2,\ldots,s)}$. From Leibman’s results we then have the key properties (3), (4), and also that any derivative ${\partial_h g}$ of ${g}$ has strictly smaller filtered degree.

Unfortunately, as filtered degrees are not numbers, we cannot define an ultrametric taking values in ${{\bf R}^+}$ in the using the formula (5), but this is not a real difficulty; we simply declare an “ultrametric” taking values in ${\{0\} \cup S}$ instead of ${{\bf R}^+}$, by declaring ${d(g_1,g_2) := \hbox{deg}(g_1^{-1},g_2)}$ if ${g_1,g_2}$ are distinct, and ${d(g_1,g_2)=0}$ otherwise. If we view ${0}$ as being smaller than any element of ${S}$, we see that the ultrametric axioms are still obeyed, and one can still run the argument more or less exactly as given above; we leave the details to the interested reader.