This week I gave a talk at UCLA on some aspects of polymath1 project, and specifically on the three new combinatorial proofs of the density Hales-Jewett theorem that have been obtained as a consequence of that project. (There have been a number of other achievements of this project, including some accomplished on the polymath1 threads hosted on this blog, but I will have to postpone a presentation of those to a later post.) It seems that at least two of the proofs will extend to prove this theorem for arbitrary alphabet lengths, but for this talk I will focus on the ${k=3}$ case. The theorem to prove is then

Theorem 1 (${k=3}$ density Hales-Jewett theorem) Let ${0 < \delta \leq 1}$. Then if ${n}$ is a sufficiently large integer, any subset ${A}$ of the cube ${[3]^n = \{1,2,3\}^n}$ of density ${|A|/3^n}$ at least ${\delta}$ contains at least one combinatorial line ${\{ \ell(1), \ell(2), \ell(3)\}}$, where ${\ell \in \{1,2,3,x\}^n \backslash [3]^n}$ is a string of ${1}$s, ${2}$s, ${3}$s, and ${x}$‘s containing at least one “wildcard” ${x}$, and ${\ell(i)}$ is the string formed from ${\ell}$ by replacing all ${x}$‘s with ${i}$‘s.

The full density Hales-Jewett theorem is the same statement, but with ${[3]}$ replaced by ${[k]}$ for some ${k \geq 1}$. (The case ${k=1}$ is trivial, and the case ${k=2}$ follows from Sperner’s theorem.)

This theorem was first proven by Furstenberg and Katznelson, by first converting it to a statement in ergodic theory; the original paper of Furstenberg-Katznelson argument was for the ${k=3}$ case only, and gave only part of the proof in detail, but in a subsequent paper a full proof in general ${k}$ was provided. The remaining components of the original ${k=3}$ argument were later completed in unpublished notes of McCutcheon. One of the new proofs is essentially a finitary translation of this ${k=3}$ argument; in principle one could also finitise the significantly more complicated argument of Furstenberg and Katznelson for general ${k}$, but this has not been properly carried out yet (the other two proofs are likely to generalise much more easily to higher ${k}$). The result is considered quite deep; for instance, the general ${k}$ case of the density Hales-Jewett theorem already implies Szemerédi’s theorem, which is a highly non-trivial theorem in its own right, as a special case.

Another of the proofs is based primarily on the density increment method that goes back to Roth, and also incorporates some ideas from a paper of Ajtai and Szemerédi establishing what we have called the “corners theorem” (and which is also implied by the ${k=3}$ case of the density Hales-Jewett theorem). A key new idea involved studying the correlations of the original set ${A}$ with special subsets of ${[3]^n}$, such as ${ij}$-insensitive sets, or intersections of ${ij}$-insensitive and ${ik}$-insensitive sets. Work is currently ongoing to extend this argument to higher ${k}$, and it seems that there are no major obstacles in doing so.

This correlations idea inspired a new ergodic proof of the density Hales-Jewett theorem for all values of ${k}$ by Austin, which is in the spirit of the triangle removal lemma (or hypergraph removal lemma) proofs of Roth’s theorem (or the multidimensional Szemerédi theorem). A finitary translation of this argument in the ${k=3}$ case has been sketched out; I believe it also extends in a relatively straightforward manner to the higher ${k}$ case (in analogy with some proofs of the hypergraph removal lemma ).

— 1. Simpler cases of density Hales-Jewett —

In order to motivate the known proofs of the density Hales-Jewett theorem, it is instructive to consider some simpler theorems which are implied by this theorem. The first is the corners theorem of Ajtai and Szemerédi:

Theorem 2 (Corners theorem) Let ${0 < \delta \leq 1}$. Then if ${n}$ is a sufficiently large integer, any subset ${A}$ of the square ${[n]^2}$ of density ${|A|/n^2}$ at least ${\delta}$ contains at least one right-angled triangle (or “corner”) ${\{ (x,y), (x+r,y), (x,y+r) \}}$ with ${r \neq 0}$.

The ${k=3}$ density Hales-Jewett theorem implies the corners theorem; this is proven by utilising the map ${\phi: [3]^n \rightarrow [n]^2}$ from the cube to the square, defined by mapping a string ${x \in [3]^n}$ to a pair ${(a,b)}$, where ${a, b}$ are the number of ${1}$s and ${2}$s respectively in ${x}$. The key point is that ${\phi}$ maps combinatorial lines to corners. (Strictly speaking, this mapping only establishes the corners theorem for dense subsets of ${[n/3 - \sqrt{n}, n/3 + \sqrt{n}]^2}$, but it is not difficult to obtain the general case from this by replacing ${n}$ by ${n^2}$ and using translation-invariance.)

(The corners theorem is also closely related to the problem of finding dense sets of points in a triangular grid without any equilateral triangles, a problem which we have called Fujimura’s problem.

The corners theorem in turn implies

Theorem 3 (Roth’s theorem) Let ${0 < \delta \leq 1}$. Then if ${n}$ is a sufficiently large integer, any subset ${A}$ of the interval ${[n]}$ of density ${|A|/n}$ at least ${\delta}$ contains at least one arithmetic progression ${a, a+r, a+2r}$ of length three.

Roth’s theorem can be deduced from the corners theorem by considering the map ${\psi: [n]^2 \rightarrow [3n]}$ defined by ${\psi(a,b) := a+2b}$; the key point is that ${\psi}$ maps corners to arithmetic progressions of length three.

There are higher ${k}$ analogues of these implications; the general ${k}$ version of the density Hales-Jewett theorem implies a general ${k}$ version of the corners theorem known as the multidimensional Szemerédi theorem, which in term implies a general version of Roth’s theorem known as Szemerédi’s theorem.

— 2. The density increment argument —

The strategy of the density increment argument, which goes back to Roth’s proof of Theorem 3 in 1956, is to perform a downward induction on the density ${\delta}$. Indeed, the theorem is obvious for high enough values of ${\delta}$; for instance, if ${\delta > 2/3}$, then partitioning the cube ${[3]^n}$ into lines and applying the pigeonhole principle will already give a combinatorial line. So the idea is to deduce the claim for a fixed density ${\delta}$ from that of a higher density ${\delta}$.

A key concept here is that of an ${m}$-dimensional combinatorial subspace of ${[3]^n}$ – a set of the form ${\phi([3]^m)}$, where ${\phi \in \{1,2,3,*_1,\ldots,*_m\}^n}$ is a string formed using the base alphabet and ${m}$ wildcards ${*_1,\ldots,*_m}$ (with each wildcard appearing at least opnce), and ${\phi(a_1 \ldots a_m)}$ is the string formed by substituting ${a_i}$ for ${*_i}$ for each ${i}$. (Thus, for instance, a combinatorial line is a combinatorial subspace of dimension ${1}$.) The identification ${\phi}$ between ${[3]^m}$ and the combinatorial space ${\phi([3]^m)}$ maps combinatorial lines to combinatorial lines. Thus, to prove Theorem 1, it suffices to show

Proposition 4 (Lack of lines implies density increment) Let ${0 < \delta \leq 1}$. Then if ${n}$ is a sufficiently large integer, and ${A \subset [3]^n}$ has density at least ${\delta}$ and has no combinatorial lines, then there exists an ${m}$-dimensional subspace ${\phi([3]^m)}$ of ${[3]^n}$ on which ${A}$ has density at least ${\delta + c(\delta)}$, where ${c(\delta) > 0}$ depends only on ${\delta}$ (and is bounded away from zero on any compact range of ${\delta}$), and ${m \geq m_0(n,\delta)}$ for some function ${m_0(n,\delta)}$ that goes to infinity as ${n \rightarrow \infty}$ for fixed ${\delta}$.

It is easy to see that Proposition 4 implies Theorem 1 (for instance, one could consider the infimum of all ${\delta}$ for which the theorem holds, and show that having this infimum non-zero would lead to a contradiction).

Now we have to figure out how to get that density increment. The original argument of Roth relied on Fourier analysis, which in turn relies on an underlying translation-invariant structure which is not present in the density Hales-Jewett setting. (Arithmetic progressions are translation-invariant, but combinatorial lines are not.) It turns out that one can proceed instead by adapting a (modification of) an argument of Ajtai and Szemerédi, which gave the first proof of Theorem 2.

The (modified) Ajtai-Szemerédi argument uses the density increment method, assuming that ${A}$ has no right-angled triangles and showing that ${A}$ has an increased density on a subgrid – a product ${P \times Q}$ of fairly long arithmetic progressions with the same spacing. The argument proceeds in two stages, which we describe slightly informally (in particular, glossing over some technical details regarding quantitative parameters such as ${\varepsilon}$) as follows:

• Step 1. If ${A \subset [n]^2}$ is dense but has no right-angled triangles, then ${A}$ has an increased density on a cartesian product ${U \times V}$ of dense sets ${U, V \subset [n]}$ (which are not necessarily arithmetic progressions).
• Step 2. Any Cartesian product ${U \times V}$ in ${[n]^2}$ can be partitioned into reasonably large grids ${P \times Q}$, plus a remainder term of small density.

From Step 1, Step 2 and the pigeonhole principle we obtain the desired density increment of ${A}$ on a grid ${P \times Q}$, and then the density increment argument gives us the corners theorem.

Step 1 is actually quite easy. If ${A}$ is dense, then it must also be dense on some diagonal ${D = \{ (x,y): x+y = const \}}$, by the pigeonhole principle. Let ${U}$ and ${V}$ denote the rows and columns that ${A \cap D}$ occupies. Every pair of points in ${A \cap D}$ forms the hypotenuse of some corner, whose third vertex lies in ${U \times V}$. Thus, if ${A}$ has no corners, then ${A}$ must avoid all the points formed by ${U \times V}$ (except for those of the diagonal ${D}$). Thus ${A}$ has a significant density decrease on the Cartesian product ${U \times V}$. Dividing the remainder ${[n]^2 \backslash (U \times V)}$ into three further Cartesian products ${U \times ([n]\backslash V)}$, ${([n] \backslash U) \times V}$, ${([n] \backslash U) \times ([n] \backslash V)}$ and using the pigeonhole principle we obtain the claim (after redefining ${U, V}$ appropriately).

Step 2 can be obtained by iterating a one-dimensional version:

• Step 2a. Any set ${U \subset [n]}$ can be partitioned into reasonably long arithmetic progressions ${P}$, plus a remainder term of small density.

Indeed, from Step 2a, one can partition ${U \times [n]}$ into products ${P \times [n]}$ (plus a small remainder), which can be easily repartitioned into grids ${P \times Q}$ (plus small remainder). This partitions ${U \times V}$ into sets ${P \times (V \cap Q)}$ (plus small remainder). Applying Step 2a again, each ${V \cap Q}$ can be partitioned further into progressions ${Q'}$ (plus small remainder), which allows us to partition each ${P \times (V \cap Q)}$ into grids ${P' \times Q'}$ (plus small remainder).

So all one has left to do is establish Step 2a. But this can be done by the greedy algorithm: locate one long arithmetic progression ${P}$ in ${U}$ and remove it from ${U}$, then locate another to remove, and so forth until no further long progressions remain in the set. But Szemerédi’s theorem then tells us the remaining set has low density, and one is done!

This argument has the apparent disadvantage of requiring a deep theorem (Szemerédi’s theorem) in order to complete the proof. However, interestingly enough, when one adapts the argument to the density Hales-Jewett theorem, one gets to replace Szemerédi’s theorem by a more elementary result – one which in fact follows from the (easy) ${k=2}$ version of the density Hales-Jewett theorem, i.e. Sperner’s theorem.

We first need to understand the analogue of the Cartesian products ${U \times V}$. Note that ${U \times V}$ is the intersection of a “vertically insensitive set” ${U \times [n]}$ and a “horizontally insensitive set” ${[n] \times V}$. By “vertically insensitive” we mean that membership of a point ${(x,y)}$ in that set is unaffected if one moves that point in a vertical direction, and similarly for “horizontally insensitive”. In a similar fashion, define a “${12}$-insensitive set” to be a subset of ${[3]^n}$, membership in which is unaffected if one flips a coordinate from a ${1}$ to a ${2}$ or vice versa (e.g. if ${1223}$ lies in the set, then so must ${1213}$, ${1113}$, ${2113}$, etc.). Similarly define the notion of a “${13}$-insensitive set”. We then define a “complexity ${1}$ set” to be the intersection ${E_{12} \cap E_{13}}$ of a ${12}$-insensitive set ${E_{12}}$ and a ${13}$-insensitive set ${E_{13}}$; these are analogous to the Cartesian products ${U \times V}$.

(For technical reasons, one actually has to deal with local versions of insensitive sets and complexity ${1}$ sets, in which one is only allowed to flip a moderately small number of the ${n}$ coordinates rather than all of them. But to simplify the discussion let me ignore this (important) detail, which is also a major issue to address in the other two proofs of this theorem.)

The analogues of Steps 1, 2 for the density Hales-Jewett theorem are then

• Step 1. If ${A \subset [3]^n}$ is dense but has no combinatorial lines, then ${A}$ has an increased density on a (local) complexity ${1}$ set ${E_{12} \cap E_{13}}$.
• Step 2. Any (local) complexity ${1}$ set ${E_{12} \cap E_{13} \subset [3]^n}$ can be partitioned into moderately large combinatorial subspaces (plus a small remainder).

We can sketch how Step 1 works as follows. Given any ${x \in [3]^n}$, let ${\pi_{1 \rightarrow 2}(x)}$ denote the string formed by replacing all ${1}$s with ${2}$s, e.g. ${\pi_{1 \rightarrow 2}(1321) = 2322}$. Similarly define ${\pi_{1 \rightarrow 3}(x)}$. Observe that ${x, \pi_{1 \rightarrow 2}(x), \pi_{1 \rightarrow 3}(x)}$ forms a combinatorial line (except in the rare case when ${x}$ doesn’t contain any ${1}$s). Thus if we let ${E_{12} := \{ x: \pi_{1 \rightarrow 2}(x) \in A\}}$, ${E_{13} := \{ x: \pi_{1 \rightarrow 3}(x) \in A\}}$, we see that ${A}$ must avoid essentially all of ${E_{12} \cap E_{13}}$. On the other hand, observe that ${E_{12}}$ and ${E_{13}}$ are ${12}$-insensitive and ${13}$-insensitive sets respectively. Taking complements and using the same sort of pigeonhole argument as before, we obtain the claim. (Actually, this argument doesn’t quite work because ${E_{12}}$, ${E_{13}}$ could be very sparse; this problem can be fixed, but requires one to use local complexity ${1}$ sets rather than global ones, and also to introduce the concept of “equal-slices measure”; I will not discuss these issues here.)

Step 2 can be reduced, much as before, to the following analogue of Step 2a:

• Step 2a. Any ${12}$-insensitive set ${E_{12} \subset [3]^n}$ can be partitioned into moderately large combinatorial subspaces (plus a small remainder).

Identifying the letters ${1}$ and ${2}$ together, one can quotient ${[3]^n}$ down to ${[2]^n}$; the preimages of this projection are precisely the ${12}$-insensitive sets. Because of this, Step 2a is basically equivalent (modulo some technicalities about measure) to

• Step 2a’. Any ${E \subset [2]^n}$ can be partitioned into moderately large combinatorial subspaces (plus a small remainder).

By the greedy algorithm, we will be able to accomplish this step if we can show that every dense subset of ${[2]^n}$ contains moderately large subspaces. But this turns out to be possible by carefully iterating Sperner’s theorem (which shows that every dense subset of ${[2]^n}$ contains combinatorial lines).

This proof of Theorem 1 seems to extend without major difficulty to the case of higher ${k}$, though details are still being worked out.

— 3. The triangle removal argument —

The triangle removal lemma of Ruzsa and Szemerédi is a graph-theoretic result which implies the corners theorem (and hence Roth’s theorem). It asserts the following:

Lemma 5 (Triangle removal lemma) For every ${\varepsilon > 0}$ there exists ${\delta > 0}$ such that if a graph ${G}$ on ${n}$ vertices has fewer than ${\delta n^3}$ triangles, then the triangles can be deleted entirely by removing at most ${\varepsilon n^2}$ edges.

Let’s see how the triangle removal lemma implies the corners theorem. A corner is, of course, already a triangle in the geometric sense, but we need to convert it to a triangle in the graph-theoretic sense, as follows. Let ${A}$ be a subset of ${[n]^2}$ with no corners; the aim is to show that ${A}$ has small density. Let ${V_h}$ be the set of all horizontal lines in ${[n]^2}$, ${V_v}$ the set of vertical lines, and ${V_d}$ the set of diagonal lines (thus all three sets have size about ${n}$). We create a tripartite graph ${G}$ on the vertex sets ${V_h \cup V_v \cup V_d}$ by joining a horizontal line ${h \in V_h}$ to a vertical line ${v \in V_v}$ whenever ${h}$ and ${v}$ intersect at a point in ${A}$, and similarly connecting ${V_h}$ or ${V_v}$ to ${V_d}$. Observe that a triangle in ${G}$ corresponds either to a corner in ${A}$, or to a “degenerate” corner in which the horizontal, vertical, and diagonal line are all concurrent. In particular, there are very few triangle in ${G}$, which can then be deleted by removing a small number of edges from ${G}$ by the triangle removal lemma. But each edge removed can delete at most one degenerate corner, and the number of degenerate corners is ${|A|}$, and so ${|A|}$ is small as required.

All known proofs of the triangle removal lemma proceed by some version of the following three steps:

• “Regularity lemma step”: Applying tools such as the Szemerédi regularity lemma, one can partition the graph ${G}$ into components ${G_{ij}}$ between cells ${V_i, V_j}$ of vertices, such that most of the ${G_{ij}}$ are “pseudorandom”. One way to define what pseudorandom means is to view each graph component ${G_{ij}}$ as a subset of the Cartesian product ${V_i \times V_j}$, in which case ${G_{ij}}$ is pseudorandom if it does not have a significant density increment on any smaller Cartesian product ${V'_i \times V'_j}$ of non-trivial size.
• ”Counting lemma step”: By exploiting the pseudorandomness property, one shows that if ${G}$ has a triple ${G_{ij}, G_{jk}, G_{ki}}$ of dense pseudorandom graphs between cells ${V_i, V_j, V_k}$ of non-trivial size, then this triple must generate a large number of triangles; hence, if ${G}$ has very few triangles, then one cannot find such a triple of dense pseudorandom graphs.
• ”Cleaning step”: If one then removes all components of ${G}$ which are too sparse or insufficiently pseudorandom, one can thus eliminate all triangles.

Pulling this argument back to the corners theorem, we see that cells such as ${V_i, V_j, V_k}$ will correspond either to horizontally insensitive sets, vertically insensitive sets, or diagonally insensitive sets. Thus this proof of the corners theorem proceeds by partitioning ${[n]^2}$ in three different ways into insensitive sets in such a way that ${A}$ is pseudorandom with respect to many of the cells created by any two of these partitions, counting the corners generated by any triple of large cells in which A is pseudorandom and dense, and cleaning out all the other cells.

It turns out that a variant of this argument can give Theorem 1; this was in fact the original approach studied by the polymath1 project, though it was only after a detour through ergodic theory (as well as the development of the density-increment argument discussed above) that the triangle-removal approach could be properly executed. In particular, an ergodic argument based on the infinitary analogue of the triangle removal lemma (and its hypergraph generalisations) was developed by Austin, which then inspired the combinatorial version sketched here.

The analogue of the vertex cells ${V_i}$ are given by certain ${12}$-insensitive sets ${E_{12}^a}$, ${13}$-insensitive sets ${E_{13}^b}$, and ${23}$-insensitive sets ${E_{23}^c}$. Roughly speaking, a set ${A \subset [3]^n}$ would be said to be pseudorandom with respect to a cell ${E_{12}^a \cap E_{13}^b}$ if ${A \cap E_{12}^a \cap E_{13}^b}$ has no further density increment on any smaller cell ${E'_{12} \cap E'_{13}}$ with ${E'_{12}}$ a ${12}$-insensitive subset of ${E_{12}^a}$, and ${E'_{13}}$ a ${13}$-insensitive subset of ${E_{13}^b}$. (This is an oversimplification, glossing over an important refinement of the concept of pseudorandomness involving the discrepancy between global densities in ${[3]^n}$ and local densities in subspaces of ${[3]^n}$.) There is a similar notion of ${A}$ being pseudorandom with respect to a cell ${E_{13}^b \cap E_{23}^c}$ or ${E_{23}^c \cap E_{12}^a}$.

We briefly describe the “regularity lemma” step. By modifying the proof of the regularity lemma, one can obtain three partitions

$\displaystyle [3]^n = E_{12}^1 \cup \ldots \cup E_{12}^{M_{12}} = E_{13}^1 \cup \ldots \cup E_{13}^{M_{13}} = E_{23}^1 \cup \ldots \cup E_{23}^{M_{23}}$

into ${12}$-insensitive, ${13}$-insensitive, and ${23}$-insensitive components respectively, where ${M_{12}, M_{13}, M_{23}}$ are not too large, and ${A}$ is pseudorandom with respect to most cells ${E_{12}^a \cap E_{13}^b}$, ${E_{13}^b \cap E_{23}^c}$, and ${E_{23}^c \cap E_{12}^a}$.

In order for the counting step to work, one also needs an additional “stationarity” reduction, which is difficult to state precisely, but roughly speaking asserts that the “local” statistics of sets such as ${E_{12}^a}$ on medium-dimensional subspaces are close to the corresponding “global” statistics of such sets; this can be achieved by an additional pigeonholing argument. We will gloss over this issue, pretending that there is no distinction between local statistics and global statistics. (Thus, for instance, if ${E_{12}^a}$ has large global density in ${[3]^n}$, we shall assume that ${E_{12}^a}$ also has large density on most medium-sized subspaces of ${[3]^n}$.)

Now for the “counting lemma” step. Suppose we can find ${a,b,c}$ such that the cells ${E_{12}^a, E_{13}^b, E_{23}^c}$ are large, and that ${A}$ intersects ${E_{12}^a \cap E_{13}^b}$, ${E_{13}^b \cap E_{23}^c}$, and ${E_{23}^c \cap E_{12}^a}$ in a dense pseudorandom manner. We claim that this will force ${A}$ to have a large number of combinatorial lines ${\ell}$, with ${\ell(1)}$ in ${A \cap E_{12}^a \cap E_{13}^b}$, ${\ell(2)}$ in ${A \cap E_{23}^c \cap E_{12}^a}$, and ${\ell(3)}$ in ${A \cap E_{13}^b \cap E_{23}^c}$. Because of the dense pseudorandom nature of ${A}$ in these cells, it turns out that it will suffice to show that there are a lot of lines ${\ell(1)}$ with ${\ell(1) \in E_{12}^a \cap E_{13}^b}$, ${\ell(2) \in E_{23}^c \cap E_{12}^a}$, and ${\ell(3) \in E_{13}^b \cap E_{23}^c}$.

One way to generate a line ${\ell}$ is by taking the triple ${\{ x, \pi_{1 \rightarrow 2}(x), \pi_{1 \rightarrow 3}(x) \}}$, where ${x \in [3]^n}$ is a generic point. (Actually, as we will see below, we would have to to a subspace of ${[3]^n}$ before using this recipe to generate lines.) Then we need to find many ${x}$ obeying the constraints

$\displaystyle x \in E_{12}^a \cap E_{13}^b; \quad \pi_{1 \rightarrow 2}(x) \in E_{23}^c \cap E_{12}^a; \quad \pi_{1 \rightarrow 3}(x) \in E_{13}^b \cap E_{23}^c.$

Because of the various insensitivity properties, many of these conditions are redundant, and we can simplify to

$\displaystyle x \in E_{12}^a \cap E_{13}^b; \quad \pi_{1 \rightarrow 2}(x) \in E_{23}^c.$

Now note that the property “${\pi_{1 \rightarrow 2}(x) \in E_{23}^c}$” is 123-insensitive; it is simultaneously 12-insensitive, 23-insensitive, and 13-insensitive. As ${E_{23}^c}$ is assumed to be large, there will be large combinatorial subspaces on which (a suitably localised version of) this property “${\pi_{1 \rightarrow 2}(x) \in E_{23}^c}$” will be always true. Localising to this space (taking advantage of the stationarity properties alluded to earlier), we are now looking for solutions to

$\displaystyle x \in E_{12}^a \cap E_{13}^b.$

We’ll pick ${x}$ to be of the form ${\pi_{2 \rightarrow 1}(y)}$ for some ${y}$. We can then rewrite the constraints on ${y}$ as

$\displaystyle y \in E_{12}^a; \quad \pi_{2 \rightarrow 1}(y) \in E_{13}^b.$

The property “${\pi_{2 \rightarrow 1}(y) \in E_{13}^b}$” is 123-invariant, and ${E_{13}^b}$ is large, so by arguing as before we can pass to a large subspace where this property is always true. The largeness of ${E_{12}^a}$ then gives us a large number of solutions.

Taking contrapositives, we conclude that if ${A}$ in fact has no combinatorial lines, then there do not exist any triple ${E_{12}^a, E_{13}^b, E_{23}^c}$ of large cells with respect to which ${A}$ is dense and pseudorandom. This forces ${A}$ to be confined either to very small cells, or to very sparse subsets of cells, or to the rare cells which fail to be pseudorandom. None of these cases can contribute much to the density of ${A}$, and so ${A}$ itself is very sparse – contradicting the hypothesis in Theorem 1 that ${A}$ is dense (this is the “cleaning step”). This concludes the sketch of the triangle-removal proof of this theorem.

The ergodic version of this argument, due to Austin, works for all values of ${k}$, so I expect the combinatorial version to do so as well.

— 4. The finitary Furstenberg-Katznelson argument —

In 1989, Furstenberg and Katznelson gave the first proof of Theorem 1, by translating it into a recurrence statement about a certain type of stationary process indexed by an infinite cube ${[3]^\omega := \bigcup_{n=1}^\infty [3]^n}$. This argument was inspired by a long string of other successful proofs of density Ramsey theorems via ergodic means, starting with the initial 1977 paper of Furstenberg giving an ergodic theory proof of Szemeredi’s theorem. The latter proof was transcribed into a finitary language by myself, so it was reasonable to expect that the Furstenberg-Katznelson argument could similarly be translated into a combinatorial framework.

Let us first briefly describe the original strategy of Furstenberg to establish Roth’s theorem, but phrased in an informal, and vaguely combinatorial, language. The basic task is to get a non-trivial lower bound on averages of the form

$\displaystyle {\Bbb E}_{a,r} f(a) f(a+r) f(a+2r) \ \ \ \ \ (1)$

where we will be a bit vague about what ${a,r}$ are ranging over, and where ${f}$ is some non-negative function of positive mean. It is then natural to study more general averages of the form

$\displaystyle {\Bbb E}_{a,r} f(a) g(a+r) h(a+2r). \ \ \ \ \ (2)$

Now, it turns out that certain types of functions ${f,g,h}$ give a negligible contribution to expressions such as (2). In particular, if ${f}$ is weakly mixing, which roughly means that the pair correlations

$\displaystyle {\Bbb E}_a f(a) f(a+r)$

are small for most ${r}$, then the average (2) is small no matter what ${g, h}$ are (so long as they are bounded). This can be established by some applications of the Cauchy-Schwarz inequality (or its close cousin, the van der Corput lemma). As a consequence of this, all weakly mixing components of ${f}$ can essentially be discarded when considering an average such as (1).

After getting rid of the weakly mixing components, what is left? Being weakly mixing is like saying that almost all the shifts ${f(\cdot+r)}$ of ${f}$ are close to orthogonal to each other. At the other extreme is that of periodicity – the shifts ${f(\cdot + r)}$ periodically recur to become equal to ${f}$ again. There is a slightly more general notion of almost periodicity – roughly, this means that the shifts ${f(\cdot + r)}$ don’t have to recur exactly to ${f}$ again, but they are forced to range in a precompact set, which basically means that for every ${\varepsilon > 0}$, that ${f(\cdot+r)}$ lies within ${\varepsilon}$ (in some suitable norm) of some finite-dimensional space. A good example of an almost periodic function is an eigenfunction, in which we have ${f(a+r) = \lambda_r f(a)}$ for each ${r}$ and some quantity ${\lambda_r}$ independent of ${a}$ (e.g. one can take ${f(a) = e^{2\pi i \alpha a}}$ for some ${\alpha \in {\mathbb R}}$). In this case, the finite-dimensional space is simply the scalar multiples of ${f(a)}$ (and one can even take ${\varepsilon=0}$ in this special case).

It is easy to see that non-trivial almost periodic functions are not weakly mixing; more generally, any function which correlates non-trivially with an almost periodic function can also be seen to not be weakly mixing. In the converse direction, it is also fairly easy to show that any function which is not weakly mixing must have non-trivial correlation with an almost periodic function. Because of this, it turns out that one can basically decompose any function into almost periodic and weakly mixing components. For the purposes of getting lower bounds on (1), this allows us to essentially reduce matters to the special case when ${f}$ is almost periodic. But then the shifts ${f(\cdot + r)}$ are almost ranging in a finite-dimensional set, which allows one to essentially assign each shift ${r}$ a colour from a finite range of colours. If one then applies the van der Waerden theorem, one can find many arithmetic progressions ${a,a+r,a+2r}$ which have the same colour, and this can be used to give a non-trivial lower bound on (1). (Thus we see that the role of a compactness property such as almost periodicity is to reduce density Ramsey theorems to colouring Ramsey theorems.)

This type of argument can be extended to more advanced recurrence theorems, but certain things become more complicated. For instance, suppose one wanted to count progressions of length ${4}$; this amounts to lower bounding expressions such as

$\displaystyle {\Bbb E}_{a,r} f(a) f(a+r) f(a+2r) f(a+3r). \ \ \ \ \ (3)$

It turns out that ${f}$ being weakly mixing is no longer enough to give a negligible contribution to expressions such as (3). For that, one needs the stronger property of being weakly mixing relative to almost periodic functions; roughly speaking, this means that for most ${r}$, the expression ${f(\cdot) f(\cdot+r)}$ is not merely of small mean (which is what weak mixing would mean), but that this expression furthermore does not correlate strongly with any almost periodic function (i.e. ${{\Bbb E}_a f(a) f(a+r) g(a)}$ is small for any almost periodic ${g}$). Once one has this stronger weak mixing property, then one can discard all components of ${f}$ which are weakly mixing relative to almost periodic functions.

One then has to figure out what is left after all these components are discarded. Because we strengthened the notion of weak mixing, we have to weaken the notion of almost periodicity to compensate. The correct notion is no longer that of almost periodicity – in which the shifts ${f(\cdot + r)}$ almost take values in a finite-dimensional vector space – but that of almost periodicity relative to almost periodic functions, in which the shifts almost take values in a finite-dimensional module over the algebra of almost periodic functions. A good example of such a beast is that of a quadratic eigenfunction, in which we have ${f(a+r) = \lambda_r(a) f(a)}$ where ${\lambda_r(a)}$ is itself an ordinary eigenfunction, and thus almost periodic in the ordinary sense; here, the relative module is the one-dimensional module formed by almost periodic multiples of ${f}$. (A typical example of a quadratic eigenfunction is ${f(a) = e^{2\pi i \alpha a^2}}$ for some ${\alpha \in {\mathbb R}}$.)

It turns out that one can “relativise” all of the previous arguments to the almost periodic “factor”, and decompose an arbitrary ${f}$ into a component which is weakly mixing relative to almost periodic functions, and another component which is almost periodic relative to almost periodic functions. The former type of components can be discarded. For the latter, we can once again start colouring the shifts ${f(\cdot+r)}$ with a finite number of colours, but with the caveat that the colour assigned is no longer independent of ${a}$, but depends in an almost periodic fashion on ${a}$. Nevertheless, it is still possible to combine the van der Waerden colouring Ramsey theorem with the theory of recurrence for ordinary almost periodic functions to get a lower bound on (3) in this case. One can then iterate this argument to deal with arithmetic progressions of longer length, but one now needs to consider even more intricate notions of almost periodicity, e.g. almost periodicity relative to (almost periodic functions relative to almost periodic functions), etc.

It turns out that these types of ideas can be adapted (with some effort) to the density Hales-Jewett setting. It’s simplest to begin with the ${k=2}$ situation rather than the ${k=3}$ situation. Here, we are trying to obtain non-trivial lower bounds for averages of the form

$\displaystyle {\Bbb E}_{\ell} f(\ell(1)) f(\ell(2)) \ \ \ \ \ (4)$

where ${\ell}$ ranges in some fashion over combinatorial lines in ${[2]^n}$, and ${f}$ is some non-negative function with large mean.

The analogues of weakly mixing and almost periodic in this setting are the ${12}$-uniform and ${12}$-low influence functions respectively. Roughly speaking, a function is ${12}$-low influence if its value usually doesn’t change much if a ${1}$ is flipped to a ${2}$ or vice versa (e.g. the indicator function of a ${12}$-insensitive set is ${12}$-low influence); conversely, a ${12}$-uniform function is a function ${g}$ such that ${{\Bbb E}_{\ell} f(\ell(1)) g(\ell(2))}$ is small for all (bounded) ${f}$. One can show that any function can be decomposed, more or less orthogonally, into a ${12}$-uniform function and a ${12}$-low influence function, with the upshot being that one can basically reduce the task of lower bounding (4) to the case when ${f}$ is ${12}$-low influence. But then ${f(\ell(1))}$ and ${f(\ell(2))}$ are approximately equal to each other, and it is straightforward to get a lower-bound in this case.

Now we turn to the ${k=3}$ setting, where we are looking at lower-bounding expressions such as

$\displaystyle {\Bbb E}_{\ell} f(\ell(1)) g(\ell(2)) h(\ell(3)) \ \ \ \ \ (5)$

with ${f=g=h}$.

It turns out that ${g}$ (say) being ${12}$-uniform is no longer enough to give a negligible contribution to the average (5). Instead, one needs the more complicated notion of ${g}$ being ${12}$-uniform relative to ${23}$-low influence functions; this means that not only are the averages ${{\Bbb E}_{\ell} f(\ell(1)) g(\ell(2))}$ small for all bounded ${f}$, but furthermore ${{\Bbb E}_{\ell} f(\ell(1)) g(\ell(2)) h(\ell)}$ is small for all bounded ${f}$ and all ${23}$-low influence ${h}$ (there is a minor technical point here that ${h}$ is a function of a line rather than of a point, but this should be ignored). Any component of ${g}$ in (5) which is ${12}$-uniform relative to ${23}$-low influence functions are negligible and so can be removed.

One then needs to figure out what is left in ${g}$ when these components are removed. The answer turns out to be functions ${g}$ that are ${12}$-almost periodic relative to ${23}$-low influence. The precise definition of this concept is technical, but very roughly speaking it means that if one flips a digit from a ${1}$ to a ${2}$, then the value of ${g}$ changes in a manner which is controlled by ${23}$-low influence functions. Anyway, the upshot is that one can reduce ${g}$ in (5) from ${f}$ to the components of ${f}$ which are ${12}$-almost periodic relative to ${23}$-low influence. Similarly, one can reduce ${h}$ in (5) from ${f}$ to the components of ${f}$ which are ${13}$-almost periodic relative to ${23}$-low influence.

At this point, one has to use a colouring Ramsey theorem – in this case, the Graham-Rothschild theorem – in conjunction with the relative almost periodicity to locate lots of places in which ${g(\ell(2))}$ is close to ${g(\ell(1))}$ while ${h(\ell(3))}$ is simultaneously close to ${h(\ell(1))}$. This turns (5) into an expression of the form ${{\Bbb E}_x f(x) g(x) h(x)}$, which turns out to be relatively easy to lower bound (because ${g, h}$, being projections of ${f}$, tend to be large wherever ${f}$ is large).