This week I gave a talk at UCLA on some aspects of polymath1 project, and specifically on the three new combinatorial proofs of the density Hales-Jewett theorem that have been obtained as a consequence of that project. (There have been a number of other achievements of this project, including some accomplished on the polymath1 threads hosted on this blog, but I will have to postpone a presentation of those to a later post.) It seems that at least two of the proofs will extend to prove this theorem for arbitrary alphabet lengths, but for this talk I will focus on the case. The theorem to prove is then
Theorem 1 ( density Hales-Jewett theorem) Let . Then if is a sufficiently large integer, any subset of the cube of density at least contains at least one combinatorial line , where is a string of s, s, s, and ‘s containing at least one “wildcard” , and is the string formed from by replacing all ‘s with ‘s.
The full density Hales-Jewett theorem is the same statement, but with replaced by for some . (The case is trivial, and the case follows from Sperner’s theorem.)
This theorem was first proven by Furstenberg and Katznelson, by first converting it to a statement in ergodic theory; the original paper of Furstenberg-Katznelson argument was for the case only, and gave only part of the proof in detail, but in a subsequent paper a full proof in general was provided. The remaining components of the original argument were later completed in unpublished notes of McCutcheon. One of the new proofs is essentially a finitary translation of this argument; in principle one could also finitise the significantly more complicated argument of Furstenberg and Katznelson for general , but this has not been properly carried out yet (the other two proofs are likely to generalise much more easily to higher ). The result is considered quite deep; for instance, the general case of the density Hales-Jewett theorem already implies Szemerédi’s theorem, which is a highly non-trivial theorem in its own right, as a special case.
Another of the proofs is based primarily on the density increment method that goes back to Roth, and also incorporates some ideas from a paper of Ajtai and Szemerédi establishing what we have called the “corners theorem” (and which is also implied by the case of the density Hales-Jewett theorem). A key new idea involved studying the correlations of the original set with special subsets of , such as -insensitive sets, or intersections of -insensitive and -insensitive sets. Work is currently ongoing to extend this argument to higher , and it seems that there are no major obstacles in doing so.
This correlations idea inspired a new ergodic proof of the density Hales-Jewett theorem for all values of by Austin, which is in the spirit of the triangle removal lemma (or hypergraph removal lemma) proofs of Roth’s theorem (or the multidimensional Szemerédi theorem). A finitary translation of this argument in the case has been sketched out; I believe it also extends in a relatively straightforward manner to the higher case (in analogy with some proofs of the hypergraph removal lemma ).
— 1. Simpler cases of density Hales-Jewett —
In order to motivate the known proofs of the density Hales-Jewett theorem, it is instructive to consider some simpler theorems which are implied by this theorem. The first is the corners theorem of Ajtai and Szemerédi:
The density Hales-Jewett theorem implies the corners theorem; this is proven by utilising the map from the cube to the square, defined by mapping a string to a pair , where are the number of s and s respectively in . The key point is that maps combinatorial lines to corners. (Strictly speaking, this mapping only establishes the corners theorem for dense subsets of , but it is not difficult to obtain the general case from this by replacing by and using translation-invariance.)
(The corners theorem is also closely related to the problem of finding dense sets of points in a triangular grid without any equilateral triangles, a problem which we have called Fujimura’s problem.
The corners theorem in turn implies
Roth’s theorem can be deduced from the corners theorem by considering the map defined by ; the key point is that maps corners to arithmetic progressions of length three.
There are higher analogues of these implications; the general version of the density Hales-Jewett theorem implies a general version of the corners theorem known as the multidimensional Szemerédi theorem, which in term implies a general version of Roth’s theorem known as Szemerédi’s theorem.
— 2. The density increment argument —
The strategy of the density increment argument, which goes back to Roth’s proof of Theorem 3 in 1956, is to perform a downward induction on the density . Indeed, the theorem is obvious for high enough values of ; for instance, if , then partitioning the cube into lines and applying the pigeonhole principle will already give a combinatorial line. So the idea is to deduce the claim for a fixed density from that of a higher density .
A key concept here is that of an -dimensional combinatorial subspace of – a set of the form , where is a string formed using the base alphabet and wildcards (with each wildcard appearing at least opnce), and is the string formed by substituting for for each . (Thus, for instance, a combinatorial line is a combinatorial subspace of dimension .) The identification between and the combinatorial space maps combinatorial lines to combinatorial lines. Thus, to prove Theorem 1, it suffices to show
Proposition 4 (Lack of lines implies density increment) Let . Then if is a sufficiently large integer, and has density at least and has no combinatorial lines, then there exists an -dimensional subspace of on which has density at least , where depends only on (and is bounded away from zero on any compact range of ), and for some function that goes to infinity as for fixed .
It is easy to see that Proposition 4 implies Theorem 1 (for instance, one could consider the infimum of all for which the theorem holds, and show that having this infimum non-zero would lead to a contradiction).
Now we have to figure out how to get that density increment. The original argument of Roth relied on Fourier analysis, which in turn relies on an underlying translation-invariant structure which is not present in the density Hales-Jewett setting. (Arithmetic progressions are translation-invariant, but combinatorial lines are not.) It turns out that one can proceed instead by adapting a (modification of) an argument of Ajtai and Szemerédi, which gave the first proof of Theorem 2.
The (modified) Ajtai-Szemerédi argument uses the density increment method, assuming that has no right-angled triangles and showing that has an increased density on a subgrid – a product of fairly long arithmetic progressions with the same spacing. The argument proceeds in two stages, which we describe slightly informally (in particular, glossing over some technical details regarding quantitative parameters such as ) as follows:
- Step 1. If is dense but has no right-angled triangles, then has an increased density on a cartesian product of dense sets (which are not necessarily arithmetic progressions).
- Step 2. Any Cartesian product in can be partitioned into reasonably large grids , plus a remainder term of small density.
From Step 1, Step 2 and the pigeonhole principle we obtain the desired density increment of on a grid , and then the density increment argument gives us the corners theorem.
Step 1 is actually quite easy. If is dense, then it must also be dense on some diagonal , by the pigeonhole principle. Let and denote the rows and columns that occupies. Every pair of points in forms the hypotenuse of some corner, whose third vertex lies in . Thus, if has no corners, then must avoid all the points formed by (except for those of the diagonal ). Thus has a significant density decrease on the Cartesian product . Dividing the remainder into three further Cartesian products , , and using the pigeonhole principle we obtain the claim (after redefining appropriately).
Step 2 can be obtained by iterating a one-dimensional version:
- Step 2a. Any set can be partitioned into reasonably long arithmetic progressions , plus a remainder term of small density.
Indeed, from Step 2a, one can partition into products (plus a small remainder), which can be easily repartitioned into grids (plus small remainder). This partitions into sets (plus small remainder). Applying Step 2a again, each can be partitioned further into progressions (plus small remainder), which allows us to partition each into grids (plus small remainder).
So all one has left to do is establish Step 2a. But this can be done by the greedy algorithm: locate one long arithmetic progression in and remove it from , then locate another to remove, and so forth until no further long progressions remain in the set. But Szemerédi’s theorem then tells us the remaining set has low density, and one is done!
This argument has the apparent disadvantage of requiring a deep theorem (Szemerédi’s theorem) in order to complete the proof. However, interestingly enough, when one adapts the argument to the density Hales-Jewett theorem, one gets to replace Szemerédi’s theorem by a more elementary result – one which in fact follows from the (easy) version of the density Hales-Jewett theorem, i.e. Sperner’s theorem.
We first need to understand the analogue of the Cartesian products . Note that is the intersection of a “vertically insensitive set” and a “horizontally insensitive set” . By “vertically insensitive” we mean that membership of a point in that set is unaffected if one moves that point in a vertical direction, and similarly for “horizontally insensitive”. In a similar fashion, define a “-insensitive set” to be a subset of , membership in which is unaffected if one flips a coordinate from a to a or vice versa (e.g. if lies in the set, then so must , , , etc.). Similarly define the notion of a “-insensitive set”. We then define a “complexity set” to be the intersection of a -insensitive set and a -insensitive set ; these are analogous to the Cartesian products .
(For technical reasons, one actually has to deal with local versions of insensitive sets and complexity sets, in which one is only allowed to flip a moderately small number of the coordinates rather than all of them. But to simplify the discussion let me ignore this (important) detail, which is also a major issue to address in the other two proofs of this theorem.)
The analogues of Steps 1, 2 for the density Hales-Jewett theorem are then
- Step 1. If is dense but has no combinatorial lines, then has an increased density on a (local) complexity set .
- Step 2. Any (local) complexity set can be partitioned into moderately large combinatorial subspaces (plus a small remainder).
We can sketch how Step 1 works as follows. Given any , let denote the string formed by replacing all s with s, e.g. . Similarly define . Observe that forms a combinatorial line (except in the rare case when doesn’t contain any s). Thus if we let , , we see that must avoid essentially all of . On the other hand, observe that and are -insensitive and -insensitive sets respectively. Taking complements and using the same sort of pigeonhole argument as before, we obtain the claim. (Actually, this argument doesn’t quite work because , could be very sparse; this problem can be fixed, but requires one to use local complexity sets rather than global ones, and also to introduce the concept of “equal-slices measure”; I will not discuss these issues here.)
Step 2 can be reduced, much as before, to the following analogue of Step 2a:
- Step 2a. Any -insensitive set can be partitioned into moderately large combinatorial subspaces (plus a small remainder).
Identifying the letters and together, one can quotient down to ; the preimages of this projection are precisely the -insensitive sets. Because of this, Step 2a is basically equivalent (modulo some technicalities about measure) to
- Step 2a’. Any can be partitioned into moderately large combinatorial subspaces (plus a small remainder).
By the greedy algorithm, we will be able to accomplish this step if we can show that every dense subset of contains moderately large subspaces. But this turns out to be possible by carefully iterating Sperner’s theorem (which shows that every dense subset of contains combinatorial lines).
This proof of Theorem 1 seems to extend without major difficulty to the case of higher , though details are still being worked out.
— 3. The triangle removal argument —
The triangle removal lemma of Ruzsa and Szemerédi is a graph-theoretic result which implies the corners theorem (and hence Roth’s theorem). It asserts the following:
Lemma 5 (Triangle removal lemma) For every there exists such that if a graph on vertices has fewer than triangles, then the triangles can be deleted entirely by removing at most edges.
Let’s see how the triangle removal lemma implies the corners theorem. A corner is, of course, already a triangle in the geometric sense, but we need to convert it to a triangle in the graph-theoretic sense, as follows. Let be a subset of with no corners; the aim is to show that has small density. Let be the set of all horizontal lines in , the set of vertical lines, and the set of diagonal lines (thus all three sets have size about ). We create a tripartite graph on the vertex sets by joining a horizontal line to a vertical line whenever and intersect at a point in , and similarly connecting or to . Observe that a triangle in corresponds either to a corner in , or to a “degenerate” corner in which the horizontal, vertical, and diagonal line are all concurrent. In particular, there are very few triangle in , which can then be deleted by removing a small number of edges from by the triangle removal lemma. But each edge removed can delete at most one degenerate corner, and the number of degenerate corners is , and so is small as required.
All known proofs of the triangle removal lemma proceed by some version of the following three steps:
- “Regularity lemma step”: Applying tools such as the Szemerédi regularity lemma, one can partition the graph into components between cells of vertices, such that most of the are “pseudorandom”. One way to define what pseudorandom means is to view each graph component as a subset of the Cartesian product , in which case is pseudorandom if it does not have a significant density increment on any smaller Cartesian product of non-trivial size.
- ”Counting lemma step”: By exploiting the pseudorandomness property, one shows that if has a triple of dense pseudorandom graphs between cells of non-trivial size, then this triple must generate a large number of triangles; hence, if has very few triangles, then one cannot find such a triple of dense pseudorandom graphs.
- ”Cleaning step”: If one then removes all components of which are too sparse or insufficiently pseudorandom, one can thus eliminate all triangles.
Pulling this argument back to the corners theorem, we see that cells such as will correspond either to horizontally insensitive sets, vertically insensitive sets, or diagonally insensitive sets. Thus this proof of the corners theorem proceeds by partitioning in three different ways into insensitive sets in such a way that is pseudorandom with respect to many of the cells created by any two of these partitions, counting the corners generated by any triple of large cells in which A is pseudorandom and dense, and cleaning out all the other cells.
It turns out that a variant of this argument can give Theorem 1; this was in fact the original approach studied by the polymath1 project, though it was only after a detour through ergodic theory (as well as the development of the density-increment argument discussed above) that the triangle-removal approach could be properly executed. In particular, an ergodic argument based on the infinitary analogue of the triangle removal lemma (and its hypergraph generalisations) was developed by Austin, which then inspired the combinatorial version sketched here.
The analogue of the vertex cells are given by certain -insensitive sets , -insensitive sets , and -insensitive sets . Roughly speaking, a set would be said to be pseudorandom with respect to a cell if has no further density increment on any smaller cell with a -insensitive subset of , and a -insensitive subset of . (This is an oversimplification, glossing over an important refinement of the concept of pseudorandomness involving the discrepancy between global densities in and local densities in subspaces of .) There is a similar notion of being pseudorandom with respect to a cell or .
We briefly describe the “regularity lemma” step. By modifying the proof of the regularity lemma, one can obtain three partitions
into -insensitive, -insensitive, and -insensitive components respectively, where are not too large, and is pseudorandom with respect to most cells , , and .
In order for the counting step to work, one also needs an additional “stationarity” reduction, which is difficult to state precisely, but roughly speaking asserts that the “local” statistics of sets such as on medium-dimensional subspaces are close to the corresponding “global” statistics of such sets; this can be achieved by an additional pigeonholing argument. We will gloss over this issue, pretending that there is no distinction between local statistics and global statistics. (Thus, for instance, if has large global density in , we shall assume that also has large density on most medium-sized subspaces of .)
Now for the “counting lemma” step. Suppose we can find such that the cells are large, and that intersects , , and in a dense pseudorandom manner. We claim that this will force to have a large number of combinatorial lines , with in , in , and in . Because of the dense pseudorandom nature of in these cells, it turns out that it will suffice to show that there are a lot of lines with , , and .
One way to generate a line is by taking the triple , where is a generic point. (Actually, as we will see below, we would have to to a subspace of before using this recipe to generate lines.) Then we need to find many obeying the constraints
Because of the various insensitivity properties, many of these conditions are redundant, and we can simplify to
Now note that the property “” is 123-insensitive; it is simultaneously 12-insensitive, 23-insensitive, and 13-insensitive. As is assumed to be large, there will be large combinatorial subspaces on which (a suitably localised version of) this property “” will be always true. Localising to this space (taking advantage of the stationarity properties alluded to earlier), we are now looking for solutions to
We’ll pick to be of the form for some . We can then rewrite the constraints on as
The property “” is 123-invariant, and is large, so by arguing as before we can pass to a large subspace where this property is always true. The largeness of then gives us a large number of solutions.
Taking contrapositives, we conclude that if in fact has no combinatorial lines, then there do not exist any triple of large cells with respect to which is dense and pseudorandom. This forces to be confined either to very small cells, or to very sparse subsets of cells, or to the rare cells which fail to be pseudorandom. None of these cases can contribute much to the density of , and so itself is very sparse – contradicting the hypothesis in Theorem 1 that is dense (this is the “cleaning step”). This concludes the sketch of the triangle-removal proof of this theorem.
The ergodic version of this argument, due to Austin, works for all values of , so I expect the combinatorial version to do so as well.
— 4. The finitary Furstenberg-Katznelson argument —
In 1989, Furstenberg and Katznelson gave the first proof of Theorem 1, by translating it into a recurrence statement about a certain type of stationary process indexed by an infinite cube . This argument was inspired by a long string of other successful proofs of density Ramsey theorems via ergodic means, starting with the initial 1977 paper of Furstenberg giving an ergodic theory proof of Szemeredi’s theorem. The latter proof was transcribed into a finitary language by myself, so it was reasonable to expect that the Furstenberg-Katznelson argument could similarly be translated into a combinatorial framework.
Let us first briefly describe the original strategy of Furstenberg to establish Roth’s theorem, but phrased in an informal, and vaguely combinatorial, language. The basic task is to get a non-trivial lower bound on averages of the form
Now, it turns out that certain types of functions give a negligible contribution to expressions such as (2). In particular, if is weakly mixing, which roughly means that the pair correlations
are small for most , then the average (2) is small no matter what are (so long as they are bounded). This can be established by some applications of the Cauchy-Schwarz inequality (or its close cousin, the van der Corput lemma). As a consequence of this, all weakly mixing components of can essentially be discarded when considering an average such as (1).
After getting rid of the weakly mixing components, what is left? Being weakly mixing is like saying that almost all the shifts of are close to orthogonal to each other. At the other extreme is that of periodicity – the shifts periodically recur to become equal to again. There is a slightly more general notion of almost periodicity – roughly, this means that the shifts don’t have to recur exactly to again, but they are forced to range in a precompact set, which basically means that for every , that lies within (in some suitable norm) of some finite-dimensional space. A good example of an almost periodic function is an eigenfunction, in which we have for each and some quantity independent of (e.g. one can take for some ). In this case, the finite-dimensional space is simply the scalar multiples of (and one can even take in this special case).
It is easy to see that non-trivial almost periodic functions are not weakly mixing; more generally, any function which correlates non-trivially with an almost periodic function can also be seen to not be weakly mixing. In the converse direction, it is also fairly easy to show that any function which is not weakly mixing must have non-trivial correlation with an almost periodic function. Because of this, it turns out that one can basically decompose any function into almost periodic and weakly mixing components. For the purposes of getting lower bounds on (1), this allows us to essentially reduce matters to the special case when is almost periodic. But then the shifts are almost ranging in a finite-dimensional set, which allows one to essentially assign each shift a colour from a finite range of colours. If one then applies the van der Waerden theorem, one can find many arithmetic progressions which have the same colour, and this can be used to give a non-trivial lower bound on (1). (Thus we see that the role of a compactness property such as almost periodicity is to reduce density Ramsey theorems to colouring Ramsey theorems.)
This type of argument can be extended to more advanced recurrence theorems, but certain things become more complicated. For instance, suppose one wanted to count progressions of length ; this amounts to lower bounding expressions such as
It turns out that being weakly mixing is no longer enough to give a negligible contribution to expressions such as (3). For that, one needs the stronger property of being weakly mixing relative to almost periodic functions; roughly speaking, this means that for most , the expression is not merely of small mean (which is what weak mixing would mean), but that this expression furthermore does not correlate strongly with any almost periodic function (i.e. is small for any almost periodic ). Once one has this stronger weak mixing property, then one can discard all components of which are weakly mixing relative to almost periodic functions.
One then has to figure out what is left after all these components are discarded. Because we strengthened the notion of weak mixing, we have to weaken the notion of almost periodicity to compensate. The correct notion is no longer that of almost periodicity – in which the shifts almost take values in a finite-dimensional vector space – but that of almost periodicity relative to almost periodic functions, in which the shifts almost take values in a finite-dimensional module over the algebra of almost periodic functions. A good example of such a beast is that of a quadratic eigenfunction, in which we have where is itself an ordinary eigenfunction, and thus almost periodic in the ordinary sense; here, the relative module is the one-dimensional module formed by almost periodic multiples of . (A typical example of a quadratic eigenfunction is for some .)
It turns out that one can “relativise” all of the previous arguments to the almost periodic “factor”, and decompose an arbitrary into a component which is weakly mixing relative to almost periodic functions, and another component which is almost periodic relative to almost periodic functions. The former type of components can be discarded. For the latter, we can once again start colouring the shifts with a finite number of colours, but with the caveat that the colour assigned is no longer independent of , but depends in an almost periodic fashion on . Nevertheless, it is still possible to combine the van der Waerden colouring Ramsey theorem with the theory of recurrence for ordinary almost periodic functions to get a lower bound on (3) in this case. One can then iterate this argument to deal with arithmetic progressions of longer length, but one now needs to consider even more intricate notions of almost periodicity, e.g. almost periodicity relative to (almost periodic functions relative to almost periodic functions), etc.
It turns out that these types of ideas can be adapted (with some effort) to the density Hales-Jewett setting. It’s simplest to begin with the situation rather than the situation. Here, we are trying to obtain non-trivial lower bounds for averages of the form
where ranges in some fashion over combinatorial lines in , and is some non-negative function with large mean.
The analogues of weakly mixing and almost periodic in this setting are the -uniform and -low influence functions respectively. Roughly speaking, a function is -low influence if its value usually doesn’t change much if a is flipped to a or vice versa (e.g. the indicator function of a -insensitive set is -low influence); conversely, a -uniform function is a function such that is small for all (bounded) . One can show that any function can be decomposed, more or less orthogonally, into a -uniform function and a -low influence function, with the upshot being that one can basically reduce the task of lower bounding (4) to the case when is -low influence. But then and are approximately equal to each other, and it is straightforward to get a lower-bound in this case.
It turns out that (say) being -uniform is no longer enough to give a negligible contribution to the average (5). Instead, one needs the more complicated notion of being -uniform relative to -low influence functions; this means that not only are the averages small for all bounded , but furthermore is small for all bounded and all -low influence (there is a minor technical point here that is a function of a line rather than of a point, but this should be ignored). Any component of in (5) which is -uniform relative to -low influence functions are negligible and so can be removed.
One then needs to figure out what is left in when these components are removed. The answer turns out to be functions that are -almost periodic relative to -low influence. The precise definition of this concept is technical, but very roughly speaking it means that if one flips a digit from a to a , then the value of changes in a manner which is controlled by -low influence functions. Anyway, the upshot is that one can reduce in (5) from to the components of which are -almost periodic relative to -low influence. Similarly, one can reduce in (5) from to the components of which are -almost periodic relative to -low influence.
At this point, one has to use a colouring Ramsey theorem – in this case, the Graham-Rothschild theorem – in conjunction with the relative almost periodicity to locate lots of places in which is close to while is simultaneously close to . This turns (5) into an expression of the form , which turns out to be relatively easy to lower bound (because , being projections of , tend to be large wherever is large).