In logic, there is a subtle but important distinction between the concept of mutual knowledge – information that everyone (or almost everyone) knows – and common knowledge, which is not only knowledge that (almost) everyone knows, but something that (almost) everyone knows that everyone else knows (and that everyone knows that everyone else knows that everyone else knows, and so forth).  A classic example arises from Hans Christian Andersen’s fable of the Emperor’s New Clothes: the fact that the emperor in fact has no clothes is mutual knowledge, but not common knowledge, because everyone (save, eventually, for a small child) is refusing to acknowledge the emperor’s nakedness, thus perpetuating the charade that the emperor is actually wearing some incredibly expensive and special clothing that is only visible to a select few.  My own personal favourite example of the distinction comes from the blue-eyed islander puzzle, discussed previously here, here and here on the blog.  (By the way, I would ask that any commentary about that puzzle be directed to those blog posts, rather than to the current one.)

I believe that there is now a real-life instance of this situation in the US presidential election, regarding the following

Proposition 1.  The presumptive nominee of the Republican Party, Donald Trump, is not even remotely qualified to carry out the duties of the presidency of the United States of America.

Proposition 1 is a statement which I think is approaching the level of mutual knowledge amongst the US population (and probably a large proportion of people following US politics overseas): even many of Trump’s nominal supporters secretly suspect that this proposition is true, even if they are hesitant to say it out loud.  And there have been many prominent people, from both major parties, who have made the case for Proposition 1: for instance Mitt Romney, the Republican presidential nominee in 2012, did so back in March, and just a few days ago Hillary Clinton, the likely Democratic presidential nominee this year, did so in this speech:

I highly recommend watching the entirety of the (35 mins or so) speech, followed by the entirety of Trump’s rebuttal.

However, even if Proposition 1 is approaching the status of “mutual knowledge”, it does not yet seem to be close to the status of “common knowledge”: one may secretly believe that Trump cannot be considered as a serious candidate for the US presidency, but must continue to entertain this possibility, because they feel that others around them, or in politics or the media, appear to be doing so.  To reconcile these views can require taking on some implausible hypotheses that are not otherwise supported by any evidence, such as the hypothesis that Trump’s displays of policy ignorance, pettiness, and other clearly unpresidential behaviour are merely “for show”, and that behind this facade there is actually a competent and qualified presidential candidate; much like the emperor’s new clothes, this alleged competence is supposedly only visible to a select few.  And so the charade continues.

I feel that it is time for the charade to end: Trump is unfit to be president, and everybody knows it.  But more people need to say so, openly.

Important note: I anticipate there will be any number of “tu quoque” responses, asserting for instance that Hillary Clinton is also unfit to be the US president.  I personally do not believe that to be the case (and certainly not to the extent that Trump exhibits), but in any event such an assertion has no logical bearing on the qualification of Trump for the presidency.  As such, any comments that are purely of this “tu quoque” nature, and which do not directly address the validity or epistemological status of Proposition 1, will be deleted as off-topic.  However, there is a legitimate case to be made that there is a fundamental weakness in the current mechanics of the US presidential election, particularly with the “first-past-the-post” voting system, in that (once the presidential primaries are concluded) a voter in the presidential election is effectively limited to choosing between just two viable choices, one from each of the two major parties, or else refusing to vote or making a largely symbolic protest vote. This weakness is particularly evident when at least one of these two major choices is demonstrably unfit for office, as per Proposition 1.  I think there is a serious case for debating the possibility of major electoral reform in the US (I am particularly partial to the Instant Runoff Voting system, used for instance in my home country of Australia, which allows for meaningful votes to third parties), and I would consider such a debate to be on-topic for this post.  But this is very much a longer term issue, as there is absolutely no chance that any such reform would be implemented by the time of the US elections in November (particularly given that any significant reform would almost certainly require, at minimum, a constitutional amendment).

Note: the following is a record of some whimsical mathematical thoughts and computations I had after doing some grading. It is likely that the sort of problems discussed here are in fact well studied in the appropriate literature; I would appreciate knowing of any links to such.

Suppose one assigns ${N}$ true-false questions on an examination, with the answers randomised so that each question is equally likely to have “true” as the correct answer as “false”, with no correlation between different questions. Suppose that the students taking the examination must answer each question with exactly one of “true” or “false” (they are not allowed to skip any question). Then it is easy to see how to grade the exam: one can simply count how many questions each student answered correctly (i.e. each correct answer scores one point, and each incorrect answer scores zero points), and give that number ${k}$ as the final grade of the examination. More generally, one could assign some score of ${A}$ points to each correct answer and some score (possibly negative) of ${B}$ points to each incorrect answer, giving a total grade of ${A k + B(N-k)}$ points. As long as ${A > B}$, this grade is simply an affine rescaling of the simple grading scheme ${k}$ and would serve just as well for the purpose of evaluating the students, as well as encouraging each student to answer the questions as correctly as possible.

In practice, though, a student will probably not know the answer to each individual question with absolute certainty. One can adopt a probabilistic model, where for a given student ${S}$ and a given question ${n}$, the student ${S}$ may think that the answer to question ${n}$ is true with probability ${p_{S,n}}$ and false with probability ${1-p_{S,n}}$, where ${0 \leq p_{S,n} \leq 1}$ is some quantity that can be viewed as a measure of confidence ${S}$ has in the answer (with ${S}$ being confident that the answer is true if ${p_{S,n}}$ is close to ${1}$, and confident that the answer is false if ${p_{S,n}}$ is close to ${0}$); for simplicity let us assume that in ${S}$’s probabilistic model, the answers to each question are independent random variables. Given this model, and assuming that the student ${S}$ wishes to maximise his or her expected grade on the exam, it is an easy matter to see that the optimal strategy for ${S}$ to take is to answer question ${n}$ true if ${p_{S,n} > 1/2}$ and false if ${p_{S,n} < 1/2}$. (If ${p_{S,n}=1/2}$, the student ${S}$ can answer arbitrarily.)

[Important note: here we are not using the term “confidence” in the technical sense used in statistics, but rather as an informal term for “subjective probability”.]

This is fine as far as it goes, but for the purposes of evaluating how well the student actually knows the material, it provides only a limited amount of information; in particular, we do not get to directly see the student’s subjective probabilities ${p_{S,n}}$ for each question. If for instance ${S}$ answered ${7}$ out of ${10}$ questions correctly, was it because he or she actually knew the right answer for seven of the questions, or was it because he or she was making educated guesses for the ten questions that turned out to be slightly better than random chance? There seems to be no way to discern this if the only input the student is allowed to provide for each question is the single binary choice of true/false.

But what if the student were able to give probabilistic answers to any given question? That is to say, instead of being forced to answer just “true” or “false” for a given question ${n}$, the student was allowed to give answers such as “${60\%}$ confident that the answer is true” (and hence ${40\%}$ confident that the answer is false). Such answers would give more insight as to how well the student actually knew the material; in particular, we would theoretically be able to actually see the student’s subjective probabilities ${p_{S,n}}$.

But now it becomes less clear what the right grading scheme to pick is. Suppose for instance we wish to extend the simple grading scheme in which a correct answer given with ${100\%}$ confidence is awarded one point. How many points should one award a correct answer given with ${60\%}$ confidence? How about an incorrect answer given with ${60\%}$ confidence (or equivalently, a correct answer given with ${40\%}$ confidence)?

Mathematically, one could design a grading scheme by selecting some grading function ${f: [0,1] \rightarrow {\bf R}}$ and then awarding a student ${f(p)}$ points whenever they indicate the correct answer with a confidence of ${p}$. For instance, if the student was ${60\%}$ confident that the answer was “true” (and hence ${40\%}$ confident that the answer was “false”), then this grading scheme would award the student ${f(0.6)}$ points if the correct answer actually was “true”, and ${f(0.4)}$ points if the correct answer actually was “false”. One can then ask the question of what functions ${f}$ would be “best” for this scheme?

Intuitively, one would expect that ${f}$ should be monotone increasing – one should be rewarded more for being correct with high confidence, than correct with low confidence. On the other hand, some sort of “partial credit” should still be assigned in the latter case. One obvious proposal is to just use a linear grading function ${f(p) = p}$ – thus for instance a correct answer given with ${60\%}$ confidence might be worth ${0.6}$ points. But is this the “best” option?

To make the problem more mathematically precise, one needs an objective criterion with which to evaluate a given grading scheme. One criterion that one could use here is the avoidance of perverse incentives. If a grading scheme is designed badly, a student may end up overstating or understating his or her confidence in an answer in order to optimise the (expected) grade: the optimal level of confidence ${q_{S,n}}$ for a student ${S}$ to report on a question may differ from that student’s subjective confidence ${p_{S,n}}$. So one could ask to design a scheme so that ${q_{S,n}}$ is always equal to ${p_{S,n}}$, so that the incentive is for the student to honestly report his or her confidence level in the answer.

This turns out to give a precise constraint on the grading function ${f}$. If a student ${S}$ thinks that the answer to a question ${n}$ is true with probability ${p_{S,n}}$ and false with probability ${1-p_{S,n}}$, and enters in an answer of “true” with confidence ${q_{S,n}}$ (and thus “false” with confidence ${1-q_{S,n}}$), then the student would expect a grade of

$\displaystyle p_{S,n} f( q_{S,n} ) + (1-p_{S,n}) f(1 - q_{S,n})$

on average for this question. To maximise this expected grade (assuming differentiability of ${f}$, which is a reasonable hypothesis for a partial credit grading scheme), one performs the usual manoeuvre of differentiating in the independent variable ${q_{S,n}}$ and setting the result to zero, thus obtaining

$\displaystyle p_{S,n} f'( q_{S,n} ) - (1-p_{S,n}) f'(1 - q_{S,n}) = 0.$

In order to avoid perverse incentives, the maximum should occur at ${q_{S,n} = p_{S,n}}$, thus we should have

$\displaystyle p f'(p) - (1-p) f'(1-p) = 0$

for all ${0 \leq p \leq 1}$. This suggests that the function ${p \mapsto p f'(p)}$ should be constant. (Strictly speaking, it only gives the weaker constraint that ${p \mapsto p f'(p)}$ is symmetric around ${p=1/2}$; but if one generalised the problem to allow for multiple-choice questions with more than two possible answers, with a grading scheme that depended only on the confidence assigned to the correct answer, the same analysis would in fact force ${p f'(p)}$ to be constant in ${p}$; we leave this computation to the interested reader.) In other words, ${f(p)}$ should be of the form ${A \log p + B}$ for some ${A,B}$; by monotonicity we expect ${A}$ to be positive. If we make the normalisation ${f(1/2)=0}$ (so that no points are awarded for a ${50-50}$ split in confidence between true and false) and ${f(1)=1}$, one arrives at the grading scheme

$\displaystyle f(p) := \log_2(2p).$

Thus, if a student believes that an answer is “true” with confidence ${p}$ and “false” with confidence ${1-p}$, he or she will be awarded ${\log_2(2p)}$ points when the correct answer is “true”, and ${\log_2(2(1-p))}$ points if the correct answer is “false”. The following table gives some illustrative values for this scheme:

| Confidence that answer is “true” | Points awarded if answer is “true” | Points awarded if answer is “false” |
| --- | --- | --- |
| ${0\%}$ | ${-\infty}$ | ${1.000}$ |
| ${1\%}$ | ${-5.644}$ | ${0.9855}$ |
| ${2\%}$ | ${-4.644}$ | ${0.9709}$ |
| ${5\%}$ | ${-3.322}$ | ${0.9260}$ |
| ${10\%}$ | ${-2.322}$ | ${0.8480}$ |
| ${20\%}$ | ${-1.322}$ | ${0.6781}$ |
| ${30\%}$ | ${-0.737}$ | ${0.4854}$ |
| ${40\%}$ | ${-0.322}$ | ${0.2630}$ |
| ${50\%}$ | ${0.000}$ | ${0.000}$ |
| ${60\%}$ | ${0.2630}$ | ${-0.322}$ |
| ${70\%}$ | ${0.4854}$ | ${-0.737}$ |
| ${80\%}$ | ${0.6781}$ | ${-1.322}$ |
| ${90\%}$ | ${0.8480}$ | ${-2.322}$ |
| ${95\%}$ | ${0.9260}$ | ${-3.322}$ |
| ${98\%}$ | ${0.9709}$ | ${-4.644}$ |
| ${99\%}$ | ${0.9855}$ | ${-5.644}$ |
| ${100\%}$ | ${1.000}$ | ${-\infty}$ |

Note the large penalties for being extremely confident of an answer that ultimately turns out to be incorrect; in particular, answers of ${100\%}$ confidence should be avoided unless one really is absolutely certain as to the correctness of one’s answer.
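As a quick numerical illustration of the incentive criterion, here is a minimal Python sketch (the function names and the grid search are my own choices, not part of the discussion above). It compares the linear scheme ${f(p)=p}$ with the logarithmic scheme ${f(p)=\log_2(2p)}$ by searching over a grid of reported confidences ${q}$ for the one that maximises the expected grade ${p f(q) + (1-p) f(1-q)}$.

```python
import math

def expected_grade(f, p, q):
    """Expected grade when the student believes 'true' with probability p
    but reports confidence q, under the grading function f."""
    return p * f(q) + (1 - p) * f(1 - q)

def log_grade(q):
    # f(q) = log_2(2q), with the convention f(0) = -infinity
    return math.log2(2 * q) if q > 0 else float("-inf")

def linear_grade(q):
    # the "obvious" proposal f(q) = q
    return q

def best_report(f, p, steps=1000):
    """Grid search for the reported confidence q that maximises the expected grade."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_grade(f, p, q))

for p in (0.6, 0.75, 0.9):
    print(p, best_report(linear_grade, p), best_report(log_grade, p))
```

Under the linear scheme the expected grade is affine in ${q}$, so a student with ${p > 1/2}$ does best by reporting ${100\%}$ confidence; under the logarithmic scheme the grid search should return the honest report ${q=p}$.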

The total grade given under such a scheme to a student ${S}$ who answers each question ${n}$ to be “true” with confidence ${p_{S,n}}$, and “false” with confidence ${1-p_{S,n}}$, is

$\displaystyle \sum_{n: \hbox{ ans is true}} \log_2(2 p_{S,n} ) + \sum_{n: \hbox{ ans is false}} \log_2(2(1-p_{S,n})).$

This grade can also be written as

$\displaystyle N + \frac{1}{\log 2} \log {\mathcal L}$

where

$\displaystyle {\mathcal L} := \prod_{n: \hbox{ ans is true}} p_{S,n} \times \prod_{n: \hbox{ ans is false}} (1-p_{S,n})$

is the likelihood of the student ${S}$’s subjective probability model, given the outcome of the correct answers. Thus the grade system here has another natural interpretation, as being an affine rescaling of the log-likelihood. The incentive is thus for the student to maximise the likelihood of his or her own subjective model, which aligns well with standard practices in statistics. From the perspective of Bayesian probability, the grade given to a student can then be viewed as a measurement (in logarithmic scale) of how much the posterior probability that the student’s model was correct has improved over the prior probability.
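The identity between the total grade and the rescaled log-likelihood can be checked numerically. The following is a small sketch on made-up data (the confidences and answers below are purely hypothetical), computing the grade both as a sum of ${\log_2}$ terms and as ${N + \frac{1}{\log 2} \log {\mathcal L}}$.

```python
import math

def total_grade(confidences, answers):
    """Sum of log_2(2p) over questions whose answer is true,
    and log_2(2(1-p)) over questions whose answer is false."""
    return sum(math.log2(2 * (p if ans else 1 - p))
               for p, ans in zip(confidences, answers))

def likelihood(confidences, answers):
    """Likelihood of the student's subjective model given the correct answers."""
    result = 1.0
    for p, ans in zip(confidences, answers):
        result *= p if ans else 1 - p
    return result

# Hypothetical exam: reported confidence in "true", and the actual answers.
confidences = [0.9, 0.6, 0.2, 0.75, 0.5]
answers = [True, False, False, True, True]

N = len(answers)
grade = total_grade(confidences, answers)
via_likelihood = N + math.log(likelihood(confidences, answers)) / math.log(2)
print(grade, via_likelihood)  # the two values should agree up to rounding
```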

One could propose using the above grading scheme to evaluate predictions of binary events, such as an upcoming election with only two viable candidates, to see in hindsight just how effective each predictor was in calling these events. One difficulty in doing so is that many predictions do not come with explicit probabilities attached to them, and attaching a default confidence level of ${100\%}$ to any prediction made without any such qualification would result in an automatic grade of ${-\infty}$ if even one of these predictions turned out to be incorrect. But perhaps if a predictor refuses to attach a confidence level to his or her predictions, one can assign some default level ${p}$ of confidence to these predictions, and then (using some suitable set of predictions from this predictor as “training data”) find the value of ${p}$ that maximises this predictor’s grade. This level can then be used going forward as the default level of confidence to apply to any future predictions from this predictor.
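Here is one way the fitting of a default confidence level might be sketched in Python, assuming the training data is simply a list recording whether each past prediction came true (the record below is invented for illustration). If ${k}$ of ${n}$ predictions were correct, the grade ${k \log_2(2p) + (n-k) \log_2(2(1-p))}$ is maximised at ${p = k/n}$, the usual maximum likelihood estimate; the grid search is only there to confirm this.

```python
import math

def grade_with_default(p, outcomes):
    """Total grade if every prediction is assigned the same confidence p.
    outcomes[i] is True when the i-th prediction turned out to be correct."""
    k = sum(outcomes)          # number of correct predictions
    n = len(outcomes)

    def term(count, prob):
        if count == 0:
            return 0.0         # no predictions of this type, no contribution
        return count * math.log2(2 * prob) if prob > 0 else float("-inf")

    return term(k, p) + term(n - k, 1 - p)

def fit_default_confidence(outcomes, steps=1000):
    """Grid search for the default confidence maximising the grade."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: grade_with_default(p, outcomes))

# Hypothetical track record: 7 correct predictions out of 10.
outcomes = [True] * 7 + [False] * 3
p_hat = fit_default_confidence(outcomes)
print(p_hat, sum(outcomes) / len(outcomes))
```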

The above grading scheme extends easily enough to multiple-choice questions. But one question I had trouble with was how to deal with uncertainty, in which the student does not know enough about a question to venture even a probability of being true or false. Here, it is natural to allow a student to leave a question blank (i.e. to answer “I don’t know”); a more advanced option would be to allow the student to enter his or her confidence level as an interval range (e.g. “I am between ${50\%}$ and ${70\%}$ confident that the answer is “true””). But now I do not have a good proposal for a grading scheme; once there is uncertainty in the student’s subjective model, the problem of that student maximising his or her expected grade becomes ill-posed due to the “unknown unknowns”, and so the previous criterion of avoiding perverse incentives becomes far less useful.

A capset in the vector space ${{\bf F}_3^n}$ over the finite field ${{\bf F}_3}$ of three elements is a subset ${A}$ of ${{\bf F}_3^n}$ that does not contain any lines ${\{ x,x+r,x+2r\}}$, where ${x,r \in {\bf F}_3^n}$ and ${r \neq 0}$. A basic problem in additive combinatorics (discussed in one of the very first posts on this blog) is to obtain good upper and lower bounds for the maximal size of a capset in ${{\bf F}_3^n}$.
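For very small ${n}$ the definition can be explored by brute force. The sketch below (with function names of my own choosing) tests the capset property using the fact that three distinct points of ${{\bf F}_3^n}$ form a line exactly when they sum to zero, and then searches all subsets for the largest capset; this is only feasible for ${n \leq 2}$ or so.

```python
from itertools import combinations, product

def is_capset(points):
    """True if the set contains no line {x, x+r, x+2r} with r != 0.
    Three distinct points of F_3^n form a line iff they sum to zero,
    so for each pair x != y we check whether z = -x-y is also present."""
    pts = set(points)
    for x, y in combinations(pts, 2):
        z = tuple((-a - b) % 3 for a, b in zip(x, y))
        if z in pts:  # z is automatically distinct from x and y when x != y
            return False
    return True

def max_capset_size(n):
    """Largest capset in F_3^n by exhaustive search (tiny n only)."""
    space = list(product(range(3), repeat=n))
    for size in range(len(space), 0, -1):
        if any(is_capset(c) for c in combinations(space, size)):
            return size
    return 0

print(max_capset_size(1), max_capset_size(2))
```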

Trivially, one has ${|A| \leq 3^n}$. Using Fourier methods (and the density increment argument of Roth), the bound of ${|A| \leq O( 3^n / n )}$ was obtained by Meshulam, and improved only as late as 2012 to ${O( 3^n /n^{1+c})}$ for some absolute constant ${c>0}$ by Bateman and Katz. But in a very recent breakthrough, Ellenberg (and independently Gijswijt) obtained the exponentially superior bound ${|A| \leq O( 2.756^n )}$, using a version of the polynomial method recently introduced by Croot, Lev, and Pach. (In the converse direction, a construction of Edel gives capsets as large as ${(2.2174)^n}$.) Given the success of the polynomial method in superficially similar problems such as the finite field Kakeya problem (discussed in this previous post), it was natural to wonder whether this method could be applicable to the cap set problem (see for instance this MathOverflow comment of mine on this from 2010), but it took a surprisingly long time before Croot, Lev, and Pach were able to identify the precise variant of the polynomial method that would actually work here.

The proof of the capset bound is very short (Ellenberg’s and Gijswijt’s preprints are both 3 pages long, and Croot-Lev-Pach is 6 pages), but I thought I would present a slight reformulation of the argument which treats the three points on a line in ${{\bf F}_3}$ symmetrically (as opposed to treating the third point differently from the first two, as is done in the Ellenberg and Gijswijt papers; Croot-Lev-Pach also treat the middle point of a three-term arithmetic progression differently from the two endpoints, although this is a very natural thing to do in their context of ${({\bf Z}/4{\bf Z})^n}$). The basic starting point is this: if ${A}$ is a capset, then one has the identity

$\displaystyle \delta_{0^n}( x+y+z ) = \sum_{a \in A} \delta_a(x) \delta_a(y) \delta_a(z) \ \ \ \ \ (1)$

for all ${(x,y,z) \in A^3}$, where ${\delta_a(x) := 1_{a=x}}$ is the Kronecker delta function, which we view as taking values in ${{\bf F}_3}$. Indeed, (1) reflects the fact that the equation ${x+y+z=0}$ has solutions precisely when ${x,y,z}$ are either all equal, or form a line, and the latter is ruled out precisely when ${A}$ is a capset.
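The identity (1) can be sanity-checked by direct enumeration. In the sketch below, the two test sets in ${{\bf F}_3^2}$ are my own examples: a four-element capset, for which (1) should hold on all of ${A^3}$, and a full line, for which it should fail.

```python
from itertools import product

def add(*vectors):
    """Coordinatewise sum in F_3^n."""
    return tuple(sum(cs) % 3 for cs in zip(*vectors))

def identity_holds(A):
    """Check delta_0(x+y+z) == sum_a delta_a(x) delta_a(y) delta_a(z) on A^3,
    with both sides viewed as elements of F_3."""
    A = list(A)
    zero = (0,) * len(A[0])
    for x, y, z in product(A, repeat=3):
        lhs = 1 if add(x, y, z) == zero else 0
        rhs = sum(1 for a in A if a == x == y == z) % 3
        if lhs != rhs:
            return False
    return True

capset = [(0, 0), (0, 1), (1, 0), (1, 1)]   # contains no line
full_line = [(0, 0), (0, 1), (0, 2)]        # a line, hence not a capset
print(identity_holds(capset), identity_holds(full_line))
```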

To exploit (1), we will show that the left-hand side of (1) is “low rank” in some sense, while the right-hand side is “high rank”. Recall that a function ${F: A \times A \rightarrow {\bf F}}$ taking values in a field ${{\bf F}}$ is of rank one if it is non-zero and of the form ${(x,y) \mapsto f(x) g(y)}$ for some ${f,g: A \rightarrow {\bf F}}$, and that the rank of a general function ${F: A \times A \rightarrow {\bf F}}$ is the least number of rank one functions needed to express ${F}$ as a linear combination. More generally, if ${k \geq 2}$, we define the rank of a function ${F: A^k \rightarrow {\bf F}}$ to be the least number of “rank one” functions of the form

$\displaystyle (x_1,\dots,x_k) \mapsto f(x_i) g(x_1,\dots,x_{i-1},x_{i+1},\dots,x_k)$

for some ${i=1,\dots,k}$ and some functions ${f: A \rightarrow {\bf F}}$, ${g: A^{k-1} \rightarrow {\bf F}}$, that are needed to generate ${F}$ as a linear combination. For instance, when ${k=3}$, the rank one functions take the form ${(x,y,z) \mapsto f(x) g(y,z)}$, ${(x,y,z) \mapsto f(y) g(x,z)}$, ${(x,y,z) \mapsto f(z) g(x,y)}$, and linear combinations of ${r}$ such rank one functions will give a function of rank at most ${r}$.

It is a standard fact in linear algebra that the rank of a diagonal matrix is equal to the number of non-zero entries. This phenomenon extends to higher dimensions:

Lemma 1 (Rank of diagonal hypermatrices) Let ${k \geq 2}$, let ${A}$ be a finite set, let ${{\bf F}}$ be a field, and for each ${a \in A}$, let ${c_a \in {\bf F}}$ be a coefficient. Then the rank of the function

$\displaystyle (x_1,\dots,x_k) \mapsto \sum_{a \in A} c_a \delta_a(x_1) \dots \delta_a(x_k) \ \ \ \ \ (2)$

is equal to the number of non-zero coefficients ${c_a}$.

Proof: We induct on ${k}$. As mentioned above, the case ${k=2}$ follows from standard linear algebra, so suppose now that ${k>2}$ and the claim has already been proven for ${k-1}$.

It is clear that the function (2) has rank at most equal to the number of non-zero ${c_a}$ (since the summands on the right-hand side are rank one functions), so it suffices to establish the lower bound. By deleting from ${A}$ those elements ${a \in A}$ with ${c_a=0}$ (which cannot increase the rank), we may assume without loss of generality that all the ${c_a}$ are non-zero. Now suppose for contradiction that (2) has rank at most ${|A|-1}$, then we obtain a representation

$\displaystyle \sum_{a \in A} c_a \delta_a(x_1) \dots \delta_a(x_k)$

$\displaystyle = \sum_{i=1}^k \sum_{\alpha \in I_i} f_{i,\alpha}(x_i) g_{i,\alpha}( x_1,\dots,x_{i-1},x_{i+1},\dots,x_k) \ \ \ \ \ (3)$

for some sets ${I_1,\dots,I_k}$ of cardinalities adding up to at most ${|A|-1}$, and some functions ${f_{i,\alpha}: A \rightarrow {\bf F}}$ and ${g_{i,\alpha}: A^{k-1} \rightarrow {\bf F}}$.

Consider the space of functions ${h: A \rightarrow {\bf F}}$ that are orthogonal to all the ${f_{k,\alpha}}$, ${\alpha \in I_k}$ in the sense that

$\displaystyle \sum_{x \in A} f_{k,\alpha}(x) h(x) = 0$

for all ${\alpha \in I_k}$. This space is a vector space whose dimension ${d}$ is at least ${|A| - |I_k|}$. A basis of this space generates a ${d \times |A|}$ coordinate matrix of full rank, which implies that there is at least one non-singular ${d \times d}$ minor. This implies that there exists a function ${h: A \rightarrow {\bf F}}$ in this space which is nowhere vanishing on some subset ${A'}$ of ${A}$ of cardinality at least ${|A|-|I_k|}$.

If we multiply (3) by ${h(x_k)}$ and sum in ${x_k}$, we conclude that

$\displaystyle \sum_{a \in A} c_a h(a) \delta_a(x_1) \dots \delta_a(x_{k-1})$

$\displaystyle = \sum_{i=1}^{k-1} \sum_{\alpha \in I_i} f_{i,\alpha}(x_i)\tilde g_{i,\alpha}( x_1,\dots,x_{i-1},x_{i+1},\dots,x_{k-1})$

where

$\displaystyle \tilde g_{i,\alpha}(x_1,\dots,x_{i-1},x_{i+1},\dots,x_{k-1})$

$\displaystyle := \sum_{x_k \in A} g_{i,\alpha}(x_1,\dots,x_{i-1},x_{i+1},\dots,x_k) h(x_k).$

The right-hand side has rank at most ${|A|-1-|I_k|}$, since the summands are rank one functions. On the other hand, since the coefficients ${c_a h(a)}$ are non-zero for all ${a \in A'}$, the induction hypothesis shows that the left-hand side has rank at least ${|A'| \geq |A|-|I_k|}$, giving the required contradiction. $\Box$

On the other hand, we have the following (symmetrised version of a) beautifully simple observation of Croot, Lev, and Pach:

Lemma 2 On ${({\bf F}_3^n)^3}$, the rank of the function ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ is at most ${3N}$, where

$\displaystyle N := \sum_{a,b,c \geq 0: a+b+c=n, b+2c \leq 2n/3} \frac{n!}{a!b!c!}.$

Proof: Using the identity ${\delta_0(x) = 1 - x^2}$ for ${x \in {\bf F}_3}$, we have

$\displaystyle \delta_{0^n}(x+y+z) = \prod_{i=1}^n (1 - (x_i+y_i+z_i)^2).$

The right-hand side is clearly a polynomial of degree ${2n}$ in ${x,y,z}$, which is then a linear combination of monomials

$\displaystyle x_1^{i_1} \dots x_n^{i_n} y_1^{j_1} \dots y_n^{j_n} z_1^{k_1} \dots z_n^{k_n}$

with ${i_1,\dots,i_n,j_1,\dots,j_n,k_1,\dots,k_n \in \{0,1,2\}}$ with

$\displaystyle i_1 + \dots + i_n + j_1 + \dots + j_n + k_1 + \dots + k_n \leq 2n.$

In particular, from the pigeonhole principle, at least one of ${i_1 + \dots + i_n, j_1 + \dots + j_n, k_1 + \dots + k_n}$ is at most ${2n/3}$.

Consider the contribution of the monomials for which ${i_1 + \dots + i_n \leq 2n/3}$. We can regroup this contribution as

$\displaystyle \sum_\alpha f_\alpha(x) g_\alpha(y,z)$

where ${\alpha}$ ranges over those ${(i_1,\dots,i_n) \in \{0,1,2\}^n}$ with ${i_1 + \dots + i_n \leq 2n/3}$, ${f_\alpha}$ is the monomial

$\displaystyle f_\alpha(x_1,\dots,x_n) := x_1^{i_1} \dots x_n^{i_n}$

and ${g_\alpha: {\bf F}_3^n \times {\bf F}_3^n \rightarrow {\bf F}_3}$ is some explicitly computable function whose exact form will not be of relevance to our argument. The number of such ${\alpha}$ is equal to ${N}$, so this contribution has rank at most ${N}$. The remaining contributions arising from the cases ${j_1 + \dots + j_n \leq 2n/3}$ and ${k_1 + \dots + k_n \leq 2n/3}$ similarly have rank at most ${N}$ (grouping the monomials so that each monomial is only counted once), so the claim follows. $\Box$
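The polynomial identity used at the start of the proof of Lemma 2 is easy to confirm by enumeration for small ${n}$. The following sketch (names are illustrative) compares ${\delta_{0^n}(x+y+z)}$ with ${\prod_{i=1}^n (1 - (x_i+y_i+z_i)^2)}$, computed modulo ${3}$, over all triples in ${({\bf F}_3^2)^3}$.

```python
from itertools import product

def delta_zero(v):
    """Kronecker delta at the origin of F_3^n (coordinates taken mod 3)."""
    return 1 if all(c % 3 == 0 for c in v) else 0

def poly_side(x, y, z):
    """prod_i (1 - (x_i + y_i + z_i)^2), reduced modulo 3."""
    result = 1
    for xi, yi, zi in zip(x, y, z):
        result = (result * (1 - (xi + yi + zi) ** 2)) % 3
    return result

n = 2
space = list(product(range(3), repeat=n))
polynomial_identity_holds = all(
    delta_zero(tuple(a + b + c for a, b, c in zip(x, y, z))) == poly_side(x, y, z)
    for x in space for y in space for z in space
)
print(polynomial_identity_holds)
```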

Upon restricting from ${({\bf F}_3^n)^3}$ to ${A^3}$, the rank of ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ is still at most ${3N}$. The two lemmas then combine to give the Ellenberg-Gijswijt bound

$\displaystyle |A| \leq 3N.$
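To get a feel for this bound before doing any asymptotics, one can compute ${N}$ exactly. The sketch below (function names are mine) evaluates the multinomial sum defining ${N}$, cross-checks it for small ${n}$ against a direct count of exponent vectors in ${\{0,1,2\}^n}$ with sum at most ${2n/3}$, and prints ${3N}$ next to the trivial bound ${3^n}$ together with the ${n^{th}}$ root of ${3N}$.

```python
from itertools import product
from math import factorial

def N_bound(n):
    """Sum of n!/(a! b! c!) over a+b+c = n with b + 2c <= 2n/3."""
    total = 0
    for c in range(n + 1):
        for b in range(n - c + 1):
            a = n - b - c
            if 3 * (b + 2 * c) <= 2 * n:   # integer form of b + 2c <= 2n/3
                total += factorial(n) // (factorial(a) * factorial(b) * factorial(c))
    return total

def N_brute(n):
    """Direct count of vectors in {0,1,2}^n with coordinate sum <= 2n/3."""
    return sum(1 for v in product(range(3), repeat=n) if 3 * sum(v) <= 2 * n)

for n in (1, 2, 3, 4, 5, 6, 50, 200):
    bound = 3 * N_bound(n)
    print(n, bound, 3 ** n, bound ** (1 / n))
```

For very small ${n}$ the bound ${3N}$ is no better than the trivial bound; the exponential gain only becomes visible as ${n}$ grows.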

All that remains is to compute the asymptotic behaviour of ${N}$. This can be done using the general tool of Cramer’s theorem, but can also be derived from Stirling’s formula (discussed in this previous post). Indeed, if ${a = (\alpha+o(1)) n}$, ${b = (\beta+o(1)) n}$, ${c = (\gamma+o(1)) n}$ for some ${\alpha,\beta,\gamma \geq 0}$ summing to ${1}$, Stirling’s formula gives

$\displaystyle \frac{n!}{a!b!c!} = \exp( n (h(\alpha,\beta,\gamma) + o(1)) )$

where ${h}$ is the entropy function

$\displaystyle h(\alpha,\beta,\gamma) = \alpha \log \frac{1}{\alpha} + \beta \log \frac{1}{\beta} + \gamma \log \frac{1}{\gamma}.$

We then have

$\displaystyle N = \exp( n (X + o(1)) )$

where ${X}$ is the maximum entropy ${h(\alpha,\beta,\gamma)}$ subject to the constraints

$\displaystyle \alpha,\beta,\gamma \geq 0; \alpha+\beta+\gamma=1; \beta+2\gamma \leq 2/3.$

A routine Lagrange multiplier computation shows that the maximum occurs when

$\displaystyle \alpha = \frac{32}{3(15 + \sqrt{33})}$

$\displaystyle \beta = \frac{4(\sqrt{33}-1)}{3(15+\sqrt{33})}$

$\displaystyle \gamma = \frac{(\sqrt{33}-1)^2}{6(15+\sqrt{33})}$

and ${h(\alpha,\beta,\gamma)}$ is approximately ${1.013455}$, giving rise to the claimed bound of ${O( 2.756^n )}$.
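The stated maximiser can be checked numerically. This short sketch simply plugs the closed-form values of ${\alpha,\beta,\gamma}$ into the constraints and the entropy function; it does not rederive the Lagrange multiplier computation.

```python
import math

s = math.sqrt(33)
alpha = 32 / (3 * (15 + s))
beta = 4 * (s - 1) / (3 * (15 + s))
gamma = (s - 1) ** 2 / (6 * (15 + s))

# Entropy h(alpha, beta, gamma) with natural logarithms.
entropy = -sum(t * math.log(t) for t in (alpha, beta, gamma))

print("alpha + beta + gamma =", alpha + beta + gamma)  # should be 1
print("beta + 2*gamma       =", beta + 2 * gamma)      # should be 2/3
print("entropy              =", entropy)
print("exp(entropy)         =", math.exp(entropy))
```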

Remark 3 As noted in the Ellenberg and Gijswijt papers, the above argument extends readily to other fields than ${{\bf F}_3}$ to control the maximal size of a subset of ${{\bf F}^n}$ that has no non-trivial solutions to the equation ${ax+by+cz=0}$, where ${a,b,c \in {\bf F}}$ are non-zero constants that sum to zero. Of course one replaces the function ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ in Lemma 2 by ${(x,y,z) \mapsto \delta_{0^n}(ax+by+cz)}$ in this case.

Remark 4 This symmetrised formulation suggests that one possible way to improve slightly on the numerical quantity ${2.756}$ is to find a more efficient way to decompose ${\delta_{0^n}(x+y+z)}$ into rank one functions; however, I was not able to do so (though such improvements are reminiscent of the Strassen type algorithms for fast matrix multiplication).

Remark 5 It is tempting to see if this method can get non-trivial upper bounds for sets ${A}$ with no length ${4}$ progressions, in (say) ${{\bf F}_5^n}$. One can run the above arguments, replacing the function

$\displaystyle (x,y,z) \mapsto \delta_{0^n}(x+y+z)$

with

$\displaystyle (x,y,z,w) \mapsto \delta_{0^n}(x-2y+z) \delta_{0^n}(y-2z+w);$

this leads to the bound ${|A| \leq 4N}$ where

$\displaystyle N := \sum_{a,b,c,d,e \geq 0: a+b+c+d+e=n, b+2c+3d+4e \leq 2n} \frac{n!}{a!b!c!d!e!}.$

Unfortunately, ${N}$ is asymptotic to ${\frac{1}{2} 5^n}$ and so this bound is in fact slightly worse than the trivial bound ${|A| \leq 5^n}$! However, there is a slim chance that there is a more efficient way to decompose ${\delta_{0^n}(x-2y+z) \delta_{0^n}(y-2z+w)}$ into rank one functions that would give a non-trivial bound on ${A}$. I experimented with a few possible such decompositions but unfortunately without success.
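The failure of the bound in Remark 5 can be seen numerically. The multinomial sum defining ${N}$ there counts vectors in ${\{0,1,2,3,4\}^n}$ with coordinate sum at most ${2n}$, which the sketch below computes by a simple dynamic programme (the function name and parameters are my own); the ratio ${N/5^n}$ should then hover just above ${1/2}$.

```python
def count_low_weight(n, q=5, max_weight=None):
    """Number of vectors in {0,...,q-1}^n with coordinate sum <= max_weight
    (default 2n, matching the constraint b + 2c + 3d + 4e <= 2n)."""
    if max_weight is None:
        max_weight = 2 * n
    counts = [1]  # counts[s] = number of vectors of the current length with sum s
    for _ in range(n):
        new = [0] * (len(counts) + q - 1)
        for s, c in enumerate(counts):
            for d in range(q):
                new[s + d] += c
        counts = new
    return sum(counts[:max_weight + 1])

for n in (1, 2, 10, 40, 200):
    N = count_low_weight(n)
    print(n, N / 5 ** n, 4 * N > 5 ** n)
```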

Remark 6 Return now to the capset problem. Since Lemma 1 is valid for any field ${{\bf F}}$, one could perhaps hope to get better bounds by viewing the Kronecker delta function ${\delta}$ as taking values in another field than ${{\bf F}_3}$, such as the complex numbers ${{\bf C}}$. However, as soon as one works in a field of characteristic other than ${3}$, one can adjoin a cube root ${\omega}$ of unity, and one now has the Fourier decomposition

$\displaystyle \delta_{0^n}(x+y+z) = \frac{1}{3^n} \sum_{\xi \in {\bf F}_3^n} \omega^{\xi \cdot x} \omega^{\xi \cdot y} \omega^{\xi \cdot z}.$

Moving to the Fourier basis, we conclude from Lemma 1 that the function ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ on ${{\bf F}_3^n}$ now has rank exactly ${3^n}$, and so one cannot improve upon the trivial bound of ${|A| \leq 3^n}$ by this method using fields of characteristic other than three as the range field. So it seems one has to stick with ${{\bf F}_3}$ (or the algebraic closure thereof).
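The Fourier decomposition in Remark 6 can be verified numerically over the complex numbers. The sketch below (illustrative names, floating point arithmetic) takes ${\omega = e^{2\pi i/3}}$ and compares both sides over all triples in ${({\bf F}_3^2)^3}$.

```python
import cmath
from itertools import product

OMEGA = cmath.exp(2j * cmath.pi / 3)  # primitive cube root of unity in C

def fourier_side(x, y, z):
    """3^{-n} * sum over xi in F_3^n of omega^{xi.x} omega^{xi.y} omega^{xi.z}."""
    n = len(x)
    total = 0
    for xi in product(range(3), repeat=n):
        exponent = sum(c * (a + b + d) for c, a, b, d in zip(xi, x, y, z)) % 3
        total += OMEGA ** exponent
    return total / 3 ** n

def delta_side(x, y, z):
    return 1 if all((a + b + c) % 3 == 0 for a, b, c in zip(x, y, z)) else 0

space = list(product(range(3), repeat=2))
max_error = max(abs(fourier_side(x, y, z) - delta_side(x, y, z))
                for x in space for y in space for z in space)
print(max_error)  # should be at the level of floating point rounding
```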

Thanks to Jordan Ellenberg and Ben Green for helpful discussions.

I’ve just uploaded to the arXiv my paper “Equivalence of the logarithmically averaged Chowla and Sarnak conjectures”, submitted to the Festschrift “Number Theory – Diophantine problems, uniform distribution and applications” in honour of Robert F. Tichy. This paper is a spinoff of my previous paper establishing a logarithmically averaged version of the Chowla (and Elliott) conjectures in the two-point case. In that paper, the estimate

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+h)}{n} = o( \log x )$

as ${x \rightarrow \infty}$ was demonstrated, where ${h}$ was any positive integer and ${\lambda}$ denoted the Liouville function. The proof proceeded using a method I call the “entropy decrement argument”, which ultimately reduced matters to establishing a bound of the form

$\displaystyle \sum_{n \leq x} \frac{|\sum_{h \leq H} \lambda(n+h) e( \alpha h)|}{n} = o( H \log x )$

whenever ${H}$ was a slowly growing function of ${x}$. This was in turn established in a previous paper of Matomaki, Radziwill, and myself, using the recent breakthrough of Matomaki and Radziwill.
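For readers who wish to experiment, here is a minimal Python sketch (my own, not taken from any of the papers) that computes the Liouville function with a smallest-prime-factor sieve and evaluates the logarithmically averaged two-point sum ${\sum_{n \leq x} \frac{\lambda(n) \lambda(n+h)}{n}}$ next to ${\log x}$. Numerics at such small scales can of course only illustrate the statement, not provide evidence about the ${o(\log x)}$ asymptotics.

```python
import math

def liouville_up_to(limit):
    """Return a list lam with lam[n] = (-1)^Omega(n) for 1 <= n <= limit
    (limit >= 1), using a smallest-prime-factor sieve."""
    spf = list(range(limit + 1))            # spf[m] = smallest prime factor of m
    for p in range(2, math.isqrt(limit) + 1):
        if spf[p] == p:                     # p is prime
            for m in range(p * p, limit + 1, p):
                if spf[m] == m:
                    spf[m] = p
    lam = [0] * (limit + 1)
    lam[1] = 1
    for n in range(2, limit + 1):
        lam[n] = -lam[n // spf[n]]          # removing one prime factor flips the sign
    return lam

def log_averaged_correlation(x, h):
    """sum_{n <= x} lambda(n) lambda(n+h) / n."""
    lam = liouville_up_to(x + h)
    return sum(lam[n] * lam[n + h] / n for n in range(1, x + 1))

for x in (10 ** 3, 10 ** 4, 10 ** 5):
    print(x, log_averaged_correlation(x, 1), math.log(x))
```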

It is natural to see to what extent the arguments can be adapted to attack the higher-point cases of the logarithmically averaged Chowla conjecture (ignoring for this post the more general Elliott conjecture for other bounded multiplicative functions than the Liouville function). That is to say, one would like to prove that

$\displaystyle \sum_{n \leq x} \frac{\lambda(n+h_1) \dots \lambda(n+h_k)}{n} = o( \log x )$

as ${x \rightarrow \infty}$ for any fixed distinct integers ${h_1,\dots,h_k}$. As it turns out (and as is detailed in the current paper), the entropy decrement argument extends to this setting (after using some known facts about linear equations in primes), and allows one to reduce the above estimate to an estimate of the form

$\displaystyle \sum_{n \leq x} \frac{1}{n} \| \lambda \|_{U^d[n, n+H]} = o( \log x )$

for ${H}$ a slowly growing function of ${x}$ and some fixed ${d}$ (in fact we can take ${d=k-1}$ for ${k \geq 3}$), where ${U^d}$ is the (normalised) local Gowers uniformity norm. (In the case ${k=3}$, ${d=2}$, this becomes the Fourier-uniformity conjecture discussed in this previous post.) If one then applied the (now proven) inverse conjecture for the Gowers norms, this estimate is in turn equivalent to the more complicated looking assertion

$\displaystyle \sum_{n \leq x} \frac{1}{n} \sup |\sum_{h \leq H} \lambda(n+h) F( g^h x )| = o( \log x ) \ \ \ \ \ (1)$

where the supremum is over all possible choices of nilsequences ${h \mapsto F(g^h x)}$ of controlled step and complexity (see the paper for definitions of these terms).

The main novelty in the paper (elaborating upon a previous comment I had made on this blog) is to observe that this latter estimate in turn follows from the logarithmically averaged form of Sarnak’s conjecture (discussed in this previous post), namely that

$\displaystyle \sum_{n \leq x} \frac{1}{n} \lambda(n) F( T^n x )= o( \log x )$

whenever ${n \mapsto F(T^n x)}$ is a zero entropy (i.e. deterministic) sequence. Morally speaking, this follows from the well-known fact that nilsequences have zero entropy, but the presence of the supremum in (1) means that we need a little bit more; roughly speaking, we need the class of nilsequences of a given step and complexity to have “uniformly zero entropy” in some sense.

On the other hand, it was already known (see previous post) that the Chowla conjecture implied the Sarnak conjecture, and similarly for the logarithmically averaged form of the two conjectures. Putting all these implications together, we obtain the pleasant fact that the logarithmically averaged Sarnak and Chowla conjectures are equivalent, which is the main result of the current paper. There have been a large number of special cases of the Sarnak conjecture worked out (when the deterministic sequence involved came from a special dynamical system), so these results can now also be viewed as partial progress towards the Chowla conjecture (at least with logarithmic averaging). However, my feeling is that the full resolution of these conjectures will not come from these sorts of special cases; instead, conjectures like the Fourier-uniformity conjecture in this previous post look more promising to attack.

It would also be nice to get rid of the pesky logarithmic averaging, but this seems to be an inherent requirement of the entropy decrement argument method, so one would probably have to find a way to avoid that argument if one were to remove the log averaging.

When teaching mathematics, the traditional method of lecturing in front of a blackboard is still hard to improve upon, despite all the advances in modern technology.  However, there are some nice things one can do in an electronic medium, such as this blog.  Here, I would like to experiment with the ability to animate images, which I think can convey some mathematical concepts in ways that cannot be easily replicated by traditional static text and images. Given that many readers may find these animations annoying, I am placing the rest of the post below the fold.

Throughout this post we shall always work in the smooth category, thus all manifolds, maps, coordinate charts, and functions are assumed to be smooth unless explicitly stated otherwise.

A (real) manifold ${M}$ can be defined in at least two ways. On one hand, one can define the manifold extrinsically, as a subset of some standard space such as a Euclidean space ${{\bf R}^d}$. On the other hand, one can define the manifold intrinsically, as a topological space equipped with an atlas of coordinate charts. The fundamental embedding theorems show that, under reasonable assumptions, the intrinsic and extrinsic approaches give the same classes of manifolds (up to isomorphism in various categories). For instance, we have the following (special case of the) Whitney embedding theorem:

Theorem 1 (Whitney embedding theorem) Let ${M}$ be a compact manifold. Then there exists an embedding ${u: M \rightarrow {\bf R}^d}$ from ${M}$ to a Euclidean space ${{\bf R}^d}$.

In fact, if ${M}$ is ${n}$-dimensional, one can take ${d}$ to equal ${2n}$, which is often best possible (easy examples include the circle ${{\bf R}/{\bf Z}}$ which embeds into ${{\bf R}^2}$ but not ${{\bf R}^1}$, or the Klein bottle that embeds into ${{\bf R}^4}$ but not ${{\bf R}^3}$). One can also relax the compactness hypothesis on ${M}$ to second countability, but we will not pursue this extension here. We give a “cheap” proof of this theorem below the fold which allows one to take ${d}$ equal to ${2n+1}$.

A significant strengthening of the Whitney embedding theorem is (a special case of) the Nash embedding theorem:

Theorem 2 (Nash embedding theorem) Let ${(M,g)}$ be a compact Riemannian manifold. Then there exists an isometric embedding ${u: M \rightarrow {\bf R}^d}$ from ${M}$ to a Euclidean space ${{\bf R}^d}$.

In order to obtain the isometric embedding, the dimension ${d}$ has to be a bit larger than what is needed for the Whitney embedding theorem; in this article of Gunther the bound

$\displaystyle d = \max( n(n+5)/2, n(n+3)/2 + 5) \ \ \ \ \ (1)$

is attained, which I believe is still the record for large ${n}$. (In the converse direction, one cannot do better than ${d = \frac{n(n+1)}{2}}$, basically because this is the number of degrees of freedom in the Riemannian metric ${g}$.) Nash’s original proof of this theorem used what is now known as the Nash-Moser inverse function theorem, but a subsequent simplification of Gunther allowed one to proceed using just the ordinary inverse function theorem (in Banach spaces).
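To get a feel for the sizes involved, one can tabulate the bound (1) against the lower bound ${n(n+1)/2}$ and the Whitney dimension ${2n}$. This is just arithmetic on the formulas quoted above; the function names are illustrative.

```python
def gunther_dimension(n):
    """Target dimension from bound (1): max(n(n+5)/2, n(n+3)/2 + 5)."""
    # n(n+5) and n(n+3) are always even, so integer division is exact.
    return max(n * (n + 5) // 2, n * (n + 3) // 2 + 5)

def metric_degrees_of_freedom(n):
    """Number of independent components of a Riemannian metric: n(n+1)/2."""
    return n * (n + 1) // 2

if __name__ == "__main__":
    print("n  Whitney(2n)  lower bound  bound (1)")
    for n in range(1, 9):
        print(n, 2 * n, metric_degrees_of_freedom(n), gunther_dimension(n))
```

The two terms inside the maximum agree at ${n=5}$; for larger ${n}$ the first term ${n(n+5)/2}$ dominates.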

I recently had the need to invoke the Nash embedding theorem to establish a blowup result for a nonlinear wave equation, which motivated me to go through the proof of the theorem more carefully. Below the fold I give a proof of the theorem that does not attempt to give an optimal value of ${d}$, but which hopefully isolates the main ideas of the argument (as simplified by Gunther). One advantage of not optimising in ${d}$ is that it allows one to freely exploit the very useful tool of pairing together two maps ${u_1: M \rightarrow {\bf R}^{d_1}}$, ${u_2: M \rightarrow {\bf R}^{d_2}}$ to form a combined map ${(u_1,u_2): M \rightarrow {\bf R}^{d_1+d_2}}$ that can be closer to an embedding or an isometric embedding than the original maps ${u_1,u_2}$. This lets one perform a “divide and conquer” strategy in which one first starts with the simpler problem of constructing some “partial” embeddings of ${M}$ and then pairs them together to form a “better” embedding.

In preparing these notes, I found the articles of Deane Yang and of Siyuan Lu to be helpful.

Over the last few years, a large group of mathematicians have been developing an online database to systematically collect the known facts, numerical data, and algorithms concerning some of the most central types of objects in modern number theory, namely the L-functions associated to various number fields, curves, and modular forms, as well as further data about these modular forms.  This of course includes the most famous examples of L-functions and modular forms respectively, namely the Riemann zeta function $\zeta(s)$ and the discriminant modular form $\Delta(q)$, but there are countless other examples of both. The connections between these classes of objects lie at the heart of the Langlands programme.

As of today, the “L-functions and modular forms database” is now out of beta, and open to the public; at present the database is mostly geared towards specialists in computational number theory, but will hopefully develop into a more broadly useful resource over time.  An article by John Cremona summarising the purpose of the database can be found here.

(Thanks to Andrew Sutherland and Kiran Kedlaya for the information.)

The International Mathematical Union (with the assistance of the Friends of the International Mathematical Union and The World Academy of Sciences, and supported by Ian Agol, Simon Donaldson, Maxim Kontsevich, Jacob Lurie, Richard Taylor, and myself) has just launched the Graduate Breakout Fellowships, which will offer highly qualified students from developing countries a full scholarship to study for a PhD in mathematics at an institution that is also located in a developing country.  Nominations for this fellowship (which should be from a sponsoring mathematician, preferably a mentor of the nominee) have just opened (with an application deadline of June 22); details on the nomination process and eligibility requirements can be found at this page.

In functional analysis, it is common to endow various (infinite-dimensional) vector spaces with a variety of topologies. For instance, a normed vector space can be given the strong topology as well as the weak topology; if the vector space has a predual, it also has a weak-* topology. Similarly, spaces of operators have a number of useful topologies on them, including the operator norm topology, strong operator topology, and the weak operator topology. For function spaces, one can use topologies associated to various modes of convergence, such as uniform convergence, pointwise convergence, locally uniform convergence, or convergence in the sense of distributions. (A small minority of such modes are not topologisable, though, the most common of which is pointwise almost everywhere convergence; see Exercise 8 of this previous post).

Some of these topologies are much stronger than others (in that they contain many more open sets, or equivalently that they have many fewer convergent sequences and nets). However, even the weakest topologies used in analysis (e.g. convergence in distributions) tend to be Hausdorff, since this at least ensures the uniqueness of limits of sequences and nets, which is a fundamentally useful feature for analysis. On the other hand, some Hausdorff topologies used are “better” than others in that many more analysis tools are available for those topologies. In particular, topologies that come from Banach space norms are particularly valued, as such topologies (and their attendant norm and metric structures) grant access to many convenient additional results such as the Baire category theorem, the uniform boundedness principle, the open mapping theorem, and the closed graph theorem.

Of course, most topologies placed on a vector space will not come from Banach space norms. For instance, if one takes the space ${C_0({\bf R})}$ of continuous functions on ${{\bf R}}$ that converge to zero at infinity, the topology of uniform convergence comes from a Banach space norm on this space (namely, the uniform norm ${\| \|_{L^\infty}}$), but the topology of pointwise convergence does not; and indeed all the other usual modes of convergence one could use here (e.g. ${L^1}$ convergence, locally uniform convergence, convergence in measure, etc.) do not arise from Banach space norms.

I recently realised (while teaching a graduate class in real analysis) that the closed graph theorem provides a quick explanation for why Banach space topologies are so rare:

Proposition 1 Let ${V = (V, {\mathcal F})}$ be a Hausdorff topological vector space. Then, up to equivalence of norms, there is at most one norm ${\| \|}$ one can place on ${V}$ so that ${(V,\| \|)}$ is a Banach space whose topology is at least as strong as ${{\mathcal F}}$. In particular, there is at most one topology stronger than ${{\mathcal F}}$ that comes from a Banach space norm.

Proof: Suppose one had two norms ${\| \|_1, \| \|_2}$ on ${V}$ such that ${(V, \| \|_1)}$ and ${(V, \| \|_2)}$ were both Banach spaces with topologies stronger than ${{\mathcal F}}$. Now consider the graph of the identity function ${\hbox{id}: V \rightarrow V}$ from the Banach space ${(V, \| \|_1)}$ to the Banach space ${(V, \| \|_2)}$. This graph is closed; indeed, if ${(x_n,x_n)}$ is a sequence in this graph that converged in the product topology to ${(x,y)}$, then ${x_n}$ converges to ${x}$ in ${\| \|_1}$ norm and hence in ${{\mathcal F}}$, and similarly ${x_n}$ converges to ${y}$ in ${\| \|_2}$ norm and hence in ${{\mathcal F}}$. But limits are unique in the Hausdorff topology ${{\mathcal F}}$, so ${x=y}$. Applying the closed graph theorem (see also previous discussions on this theorem), we see that the identity map is continuous from ${(V, \| \|_1)}$ to ${(V, \| \|_2)}$; similarly for the inverse. Thus the norms ${\| \|_1, \| \|_2}$ are equivalent as claimed. $\Box$

By using various generalisations of the closed graph theorem, one can generalise the above proposition to Fréchet spaces, or even to F-spaces. The proposition can fail if one drops the requirement that the norms be stronger than a specified Hausdorff topology; indeed, if ${V}$ is infinite dimensional, one can use a Hamel basis of ${V}$ to construct a linear bijection on ${V}$ that is unbounded with respect to a given Banach space norm ${\| \|}$, and which can then be used to give an inequivalent Banach space structure on ${V}$.

One can interpret Proposition 1 as follows: once one equips a vector space with some “weak” (but still Hausdorff) topology, there is a canonical choice of “strong” topology one can place on that space that is stronger than the “weak” topology but arises from a Banach space structure (or at least a Fréchet or F-space structure), provided that at least one such structure exists. In the case of function spaces, one can usually use the topology of convergence in distribution as the “weak” Hausdorff topology for this purpose, since this topology is weaker than almost all of the other topologies used in analysis. This helps justify the common practice of describing a Banach or Fréchet function space just by giving the set of functions that belong to that space (e.g. ${{\mathcal S}({\bf R}^n)}$ is the space of Schwartz functions on ${{\bf R}^n}$) without bothering to specify the precise topology to serve as the “strong” topology, since it is usually understood that one is using the canonical such topology (e.g. the Fréchet space structure on ${{\mathcal S}({\bf R}^n)}$ given by the usual Schwartz space seminorms).

Of course, there are still some topological vector spaces which have no “strong topology” arising from a Banach space at all. Consider for instance the space ${c_c({\bf N})}$ of finitely supported sequences. A weak, but still Hausdorff, topology to place on this space is the topology of pointwise convergence. But there is no norm ${\| \|}$ stronger than this topology that makes this space a Banach space. For, if there were, then letting ${e_1,e_2,e_3,\dots}$ be the standard basis of ${c_c({\bf N})}$, the series ${\sum_{n=1}^\infty 2^{-n} e_n / \| e_n \|}$ would have to converge in ${\| \|}$, and hence pointwise, to an element of ${c_c({\bf N})}$, but the only available pointwise limit for this series lies outside of ${c_c({\bf N})}$. But I do not know if there is an easily checkable criterion to test whether a given vector space (equipped with a Hausdorff “weak” topology) can be equipped with a stronger Banach space (or Fréchet space or ${F}$-space) topology.

Tamar Ziegler and I have just uploaded to the arXiv two related papers: “Concatenation theorems for anti-Gowers-uniform functions and Host-Kra characteristic factors” and “Polynomial patterns in primes”, with the former developing a “quantitative Bessel inequality” for local Gowers norms that is crucial in the latter.

We use the term “concatenation theorem” to denote results in which structural control of a function in two or more “directions” can be “concatenated” into structural control in a joint direction. A trivial example of such a concatenation theorem is the following: if a function ${f: {\bf Z} \times {\bf Z} \rightarrow {\bf R}}$ is constant in the first variable (thus ${x \mapsto f(x,y)}$ is constant for each ${y}$), and also constant in the second variable (thus ${y \mapsto f(x,y)}$ is constant for each ${x}$), then it is constant in the joint variable ${(x,y)}$. A slightly less trivial example: if a function ${f: {\bf Z} \times {\bf Z} \rightarrow {\bf R}}$ is affine-linear in the first variable (thus, for each ${y}$, there exist ${\alpha(y), \beta(y)}$ such that ${f(x,y) = \alpha(y) x + \beta(y)}$ for all ${x}$) and affine-linear in the second variable (thus, for each ${x}$, there exist ${\gamma(x), \delta(x)}$ such that ${f(x,y) = \gamma(x)y + \delta(x)}$ for all ${y}$) then ${f}$ is a quadratic polynomial in ${x,y}$; in fact it must take the form

$\displaystyle f(x,y) = \epsilon xy + \zeta x + \eta y + \theta \ \ \ \ \ (1)$

for some real numbers ${\epsilon, \zeta, \eta, \theta}$. (This can be seen for instance by using the affine linearity in ${y}$ to show that the coefficients ${\alpha(y), \beta(y)}$ are also affine linear.)

The same phenomenon extends to higher degree polynomials. Given a function ${f: G \rightarrow K}$ from one additive group ${G}$ to another, we say that ${f}$ is of degree less than ${d}$ along a subgroup ${H}$ of ${G}$ if all the ${d}$-fold iterated differences of ${f}$ along directions in ${H}$ vanish, that is to say

$\displaystyle \partial_{h_1} \dots \partial_{h_d} f(x) = 0$

for all ${x \in G}$ and ${h_1,\dots,h_d \in H}$, where ${\partial_h}$ is the difference operator

$\displaystyle \partial_h f(x) := f(x+h) - f(x).$

(We adopt the convention that the only ${f}$ of degree less than ${0}$ is the zero function.)

We then have the following simple proposition:

Proposition 1 (Concatenation of polynomiality) Let ${f: G \rightarrow K}$ be of degree less than ${d_1}$ along one subgroup ${H_1}$ of ${G}$, and of degree less than ${d_2}$ along another subgroup ${H_2}$ of ${G}$, for some ${d_1,d_2 \geq 1}$. Then ${f}$ is of degree less than ${d_1+d_2-1}$ along the subgroup ${H_1+H_2}$ of ${G}$.

Note the previous example was basically the case when ${G = {\bf Z} \times {\bf Z}}$, ${H_1 = {\bf Z} \times \{0\}}$, ${H_2 = \{0\} \times {\bf Z}}$, ${K = {\bf R}}$, and ${d_1=d_2=2}$.
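As a sanity check on the definitions, one can test the vanishing of iterated differences for the quadratic example (1) on a finite sample of points and directions. The sketch below is illustrative only: the helper names are hypothetical, and a finite sample can refute a degree bound but cannot prove one.

```python
from itertools import product

def diff(f, h):
    """Difference operator on Z^2: (d_h f)(p) = f(p + h) - f(p)."""
    return lambda p: f((p[0] + h[0], p[1] + h[1])) - f(p)

def iterated_diff(f, directions):
    """Apply d_{h_1} ... d_{h_d} for the given sequence of directions."""
    for h in directions:
        f = diff(f, h)
    return f

def has_degree_less_than(f, d, directions, points):
    """Check that all d-fold differences along the sampled directions vanish."""
    return all(iterated_diff(f, hs)(p) == 0
               for hs in product(directions, repeat=d)
               for p in points)

# The example (1) with (epsilon, zeta, eta, theta) = (3, 2, -5, 7).
f = lambda p: 3 * p[0] * p[1] + 2 * p[0] - 5 * p[1] + 7

points = list(product(range(-2, 3), repeat=2))
H1 = [(1, 0), (2, 0), (-1, 0)]             # sample directions in Z x {0}
H2 = [(0, 1), (0, 3)]                      # sample directions in {0} x Z
H12 = [(1, 1), (2, -1), (1, 0), (0, 1)]    # sample directions in Z x Z

if __name__ == "__main__":
    print(has_degree_less_than(f, 2, H1, points))   # affine-linear in x
    print(has_degree_less_than(f, 2, H2, points))   # affine-linear in y
    print(has_degree_less_than(f, 3, H12, points))  # degree < 2 + 2 - 1 jointly
    print(has_degree_less_than(f, 2, H12, points))  # fails because of the xy term
```

The last check shows that the exponent ${d_1+d_2-1}$ cannot be lowered in this example: the second difference of ${3xy}$ in the direction ${(1,1)}$ is the nonzero constant ${6}$.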

Proof: The claim is trivial for ${d_1=1}$ or ${d_2=1}$ (in which ${f}$ is constant along ${H_1}$ or ${H_2}$ respectively), so suppose inductively ${d_1,d_2 \geq 2}$ and the claim has already been proven for smaller values of ${d_1+d_2}$.

We take a derivative in a direction ${h_1 \in H_1}$ to obtain

$\displaystyle T^{-h_1} f = f + \partial_{h_1} f$

where ${T^{-h_1} f(x) = f(x+h_1)}$ is the shift of ${f}$ by ${-h_1}$. Then we take a further shift by a direction ${h_2 \in H_2}$ to obtain

$\displaystyle T^{-h_1-h_2} f = T^{-h_2} f + T^{-h_2} \partial_{h_1} f = f + \partial_{h_2} f + T^{-h_2} \partial_{h_1} f$

leading to the cocycle equation

$\displaystyle \partial_{h_1+h_2} f = \partial_{h_2} f + T^{-h_2} \partial_{h_1} f.$

Since ${f}$ has degree less than ${d_1}$ along ${H_1}$ and degree less than ${d_2}$ along ${H_2}$, ${\partial_{h_1} f}$ has degree less than ${d_1-1}$ along ${H_1}$ and less than ${d_2}$ along ${H_2}$, so is degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$ by induction hypothesis. Similarly ${\partial_{h_2} f}$ is also of degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$. Combining this with the cocycle equation we see that ${\partial_{h_1+h_2}f}$ is of degree less than ${d_1+d_2-2}$ along ${H_1+H_2}$ for any ${h_1+h_2 \in H_1+H_2}$, and hence ${f}$ is of degree less than ${d_1+d_2-1}$ along ${H_1+H_2}$, as required. $\Box$
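The cocycle identity ${\partial_{h_1+h_2} f = \partial_{h_2} f + T^{-h_2} \partial_{h_1} f}$ used in this proof holds for arbitrary functions, not just polynomials, and is easy to confirm on examples. Here is a small sketch on ${G = {\bf Z}}$ with hypothetical helper names:

```python
def shift(f, h):
    """The shift T^{-h} f, given by (T^{-h} f)(x) = f(x + h)."""
    return lambda x: f(x + h)

def diff(f, h):
    """The difference operator (d_h f)(x) = f(x + h) - f(x)."""
    return lambda x: f(x + h) - f(x)

def cocycle_holds(f, h1, h2, points):
    """Check d_{h1+h2} f = d_{h2} f + T^{-h2} d_{h1} f at the given points."""
    lhs = diff(f, h1 + h2)
    rhs = lambda x: diff(f, h2)(x) + shift(diff(f, h1), h2)(x)
    return all(lhs(x) == rhs(x) for x in points)

if __name__ == "__main__":
    g = lambda x: x ** 3 - 2 * x + (x % 5)   # deliberately not a polynomial
    print(cocycle_holds(g, 3, -7, range(-20, 21)))
```

The identity is a telescoping sum: the term ${f(x+h_2)}$ appears with opposite signs in the two summands on the right-hand side.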

While this proposition is simple, it already illustrates some basic principles regarding how one would go about proving a concatenation theorem:

• (i) One should perform induction on the degrees ${d_1,d_2}$ involved, and take advantage of the recursive nature of degree (in this case, the fact that a function is of degree less than ${d}$ along some subgroup ${H}$ of directions iff all of its first derivatives along ${H}$ are of degree less than ${d-1}$).
• (ii) Structure is preserved by operations such as addition, shifting, and taking derivatives. In particular, if a function ${f}$ is of degree less than ${d}$ along some subgroup ${H}$, then any derivative ${\partial_k f}$ of ${f}$ is also of degree less than ${d}$ along ${H}$, even if ${k}$ does not belong to ${H}$.

Here is another simple example of a concatenation theorem. Suppose an at most countable additive group ${G}$ acts by measure-preserving shifts ${T: g \mapsto T^g}$ on some probability space ${(X, {\mathcal X}, \mu)}$; we call the pair ${(X,T)}$ (or more precisely ${(X, {\mathcal X}, \mu, T)}$) a ${G}$-system. We say that a function ${f \in L^\infty(X)}$ is a generalised eigenfunction of degree less than ${d}$ along some subgroup ${H}$ of ${G}$ and some ${d \geq 1}$ if one has

$\displaystyle T^h f = \lambda_h f$

almost everywhere for all ${h \in H}$, and some functions ${\lambda_h \in L^\infty(X)}$ of degree less than ${d-1}$ along ${H}$, with the convention that a function has degree less than ${0}$ if and only if it is equal to ${1}$. Thus for instance, a function ${f}$ is a generalised eigenfunction of degree less than ${1}$ along ${H}$ if it is constant on almost every ${H}$-ergodic component of ${X}$, and is a generalised eigenfunction of degree less than ${2}$ along ${H}$ if it is an eigenfunction of the shift action on almost every ${H}$-ergodic component of ${X}$. A basic example of a higher order eigenfunction is the function ${f(x,y) := e^{2\pi i y}}$ on the skew shift ${({\bf R}/{\bf Z})^2}$ with ${{\bf Z}}$ action given by the generator ${T(x,y) := (x+\alpha,y+x)}$ for some irrational ${\alpha}$. One can check that ${T^h f = \lambda_h f}$ for every integer ${h}$, where ${\lambda_h: x \mapsto e^{2\pi i \binom{h}{2} \alpha} e^{2\pi i h x}}$ is a generalised eigenfunction of degree less than ${2}$ along ${{\bf Z}}$, so ${f}$ is of degree less than ${3}$ along ${{\bf Z}}$.
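The skew shift identity ${T^h f = \lambda_h f}$ can be confirmed numerically. The sketch below makes two assumptions that are not in the text: it interprets ${T^h f}$ as ${f \circ T^h}$, and it only tests non-negative ${h}$ in floating point arithmetic with ${\alpha = \sqrt{2}-1}$.

```python
import cmath
import math

ALPHA = math.sqrt(2) - 1  # a convenient irrational rotation number

def e(t):
    """The standard character e(t) = exp(2 pi i t)."""
    return cmath.exp(2j * math.pi * t)

def skew_shift(point, alpha=ALPHA):
    """The generator T(x, y) = (x + alpha, y + x) on the 2-torus."""
    x, y = point
    return ((x + alpha) % 1.0, (y + x) % 1.0)

def f(point):
    """The function f(x, y) = e(y)."""
    return e(point[1])

def lam(h, x, alpha=ALPHA):
    """The multiplier lambda_h(x) = e(binom(h, 2) alpha) e(h x)."""
    return e(math.comb(h, 2) * alpha) * e(h * x)

def eigen_defect(point, h):
    """Return |f(T^h p) - lambda_h(x) f(p)|, which should be near zero."""
    q = point
    for _ in range(h):
        q = skew_shift(q)
    return abs(f(q) - lam(h, point[0]) * f(point))

if __name__ == "__main__":
    for h in range(0, 8):
        print(h, eigen_defect((0.3, 0.7), h))
```

The binomial coefficient arises because the ${y}$ coordinate after ${h}$ steps is ${y + hx + (0+1+\dots+(h-1))\alpha}$.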

We then have

Proposition 2 (Concatenation of higher order eigenfunctions) Let ${(X,T)}$ be a ${G}$-system, and let ${f \in L^\infty(X)}$ be a generalised eigenfunction of degree less than ${d_1}$ along one subgroup ${H_1}$ of ${G}$, and a generalised eigenfunction of degree less than ${d_2}$ along another subgroup ${H_2}$ of ${G}$, for some ${d_1,d_2 \geq 1}$. Then ${f}$ is a generalised eigenfunction of degree less than ${d_1+d_2-1}$ along the subgroup ${H_1+H_2}$ of ${G}$.

The argument is almost identical to that of the previous proposition and is left as an exercise to the reader. The key point is the point (ii) identified earlier: the space of generalised eigenfunctions of degree less than ${d}$ along ${H}$ is preserved by multiplication and shifts, as well as the operation of “taking derivatives” ${f \mapsto \lambda_k}$ even along directions ${k}$ that do not lie in ${H}$. (To prove this latter claim, one should restrict to the region where ${f}$ is non-zero, and then divide ${T^k f}$ by ${f}$ to locate ${\lambda_k}$.)

A typical example of this proposition in action is as follows: consider the ${{\bf Z}^2}$-system given by the ${3}$-torus ${({\bf R}/{\bf Z})^3}$ with generating shifts

$\displaystyle T^{(1,0)}(x,y,z) := (x+\alpha,y,z+y)$

$\displaystyle T^{(0,1)}(x,y,z) := (x,y+\alpha,z+x)$

for some irrational ${\alpha}$, which can be checked to give a ${{\bf Z}^2}$ action

$\displaystyle T^{(n,m)}(x,y,z) := (x+n\alpha, y+m\alpha, z+ny+mx+nm\alpha).$

The function ${f(x,y,z) := e^{2\pi i z}}$ can then be checked to be a generalised eigenfunction of degree less than ${2}$ along ${{\bf Z} \times \{0\}}$, and also less than ${2}$ along ${\{0\} \times {\bf Z}}$, and less than ${3}$ along ${{\bf Z}^2}$. One can view this example as the dynamical systems translation of the example (1) (see this previous post for some more discussion of this sort of correspondence).
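The claim that the two generating shifts commute and produce the stated formula for ${T^{(n,m)}}$ is a purely algebraic identity, so it can be checked in exact arithmetic. In the sketch below a rational ${\alpha}$ stands in for the irrational one (irrationality plays no role in this particular identity), only ${n,m \geq 0}$ are tested, and the function names are illustrative.

```python
from fractions import Fraction

ALPHA = Fraction(1, 7)  # rational stand-in; irrationality is not needed here

def t10(p, a=ALPHA):
    """T^{(1,0)}(x, y, z) = (x + alpha, y, z + y) on the 3-torus."""
    x, y, z = p
    return ((x + a) % 1, y % 1, (z + y) % 1)

def t01(p, a=ALPHA):
    """T^{(0,1)}(x, y, z) = (x, y + alpha, z + x) on the 3-torus."""
    x, y, z = p
    return (x % 1, (y + a) % 1, (z + x) % 1)

def closed_form(n, m, p, a=ALPHA):
    """The claimed formula for T^{(n,m)}."""
    x, y, z = p
    return ((x + n * a) % 1, (y + m * a) % 1,
            (z + n * y + m * x + n * m * a) % 1)

def iterate(n, m, p):
    """Apply T^{(1,0)} n times and then T^{(0,1)} m times."""
    for _ in range(n):
        p = t10(p)
    for _ in range(m):
        p = t01(p)
    return p

if __name__ == "__main__":
    p = (Fraction(1, 3), Fraction(2, 5), Fraction(1, 11))
    print(t10(t01(p)) == t01(t10(p)))
    print(all(iterate(n, m, p) == closed_form(n, m, p)
              for n in range(6) for m in range(6)))
```

The cross term ${nm\alpha}$ in the third coordinate is what prevents ${f}$ from being an ordinary eigenfunction of the joint action.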

The main results of our concatenation paper are analogues of these propositions concerning a more complicated notion of “polynomial-like” structure that is of importance in additive combinatorics and in ergodic theory. On the ergodic theory side, the notion of structure is captured by the Host-Kra characteristic factors ${Z^{<d}_H(X)}$ of a ${G}$-system ${X}$ along a subgroup ${H}$. These factors can be defined in a number of ways. One is by duality, using the Gowers-Host-Kra uniformity seminorms (defined for instance here) ${\| \|_{U^d_H(X)}}$. Namely, ${Z^{<d}_H(X)}$ is the factor of ${X}$ defined up to equivalence by the requirement that

$\displaystyle \|f\|_{U^d_H(X)} = 0 \iff {\bf E}(f | Z^{<d}_H(X) ) = 0.$

An equivalent definition is in terms of the dual functions ${{\mathcal D}^d_H(f)}$ of ${f}$ along ${H}$, which can be defined recursively by setting ${{\mathcal D}^0_H(f) = 1}$ and

$\displaystyle {\mathcal D}^d_H(f) = {\bf E}_h T^h f {\mathcal D}^{d-1}_H( f \overline{T^h f} )$

where ${{\bf E}_h}$ denotes the ergodic average along a Følner sequence in ${G}$ (in fact one can also define these concepts in non-amenable abelian settings as per this previous post). The factor ${Z^{<d}_H(X)}$ can then be alternately defined as the factor generated by the dual functions ${{\mathcal D}^d_H(f)}$ for ${f \in L^\infty(X)}$.

In the case when ${G=H={\bf Z}}$ and ${X}$ is ${G}$-ergodic, a deep theorem of Host and Kra shows that the factor ${Z^{<d}_H(X)}$ is equivalent to the inverse limit of nilsystems of step less than ${d}$. A similar statement holds with ${{\bf Z}}$ replaced by any finitely generated group by Griesmer, while the case of an infinite vector space over a finite field was treated in this paper of Bergelson, Ziegler, and myself. The situation is more subtle when ${X}$ is not ${G}$-ergodic, or when ${X}$ is ${G}$-ergodic but ${H}$ is a proper subgroup of ${G}$ acting non-ergodically, when one has to start considering measurable families of directional nilsystems; see for instance this paper of Austin for some of the subtleties involved (for instance, higher order group cohomology begins to become relevant!).

One of our main theorems is then

Proposition 3 (Concatenation of characteristic factors) Let ${(X,T)}$ be a ${G}$-system, and let ${f}$ be measurable with respect to the factor ${Z^{<d_1}_{H_1}(X)}$ and with respect to the factor ${Z^{<d_2}_{H_2}(X)}$ for some ${d_1,d_2 \geq 1}$ and some subgroups ${H_1,H_2}$ of ${G}$. Then ${f}$ is also measurable with respect to the factor ${Z^{<d_1+d_2-1}_{H_1+H_2}(X)}$.

We give two proofs of this proposition in the paper; an ergodic-theoretic proof using the Host-Kra theory of “cocycles of type ${<d}$ (along a subgroup ${H}$)”, which can be used to inductively describe the factors ${Z^{<d}_H(X)}$, and a combinatorial proof based on a combinatorial analogue of this proposition which is harder to state (but which roughly speaking asserts that a function which is nearly orthogonal to all bounded functions of small ${U^{d_1}_{H_1}}$ norm, and also to all bounded functions of small ${U^{d_2}_{H_2}}$ norm, is also nearly orthogonal to all bounded functions of small ${U^{d_1+d_2-1}_{H_1+H_2}}$ norm). The combinatorial proof parallels the proof of Proposition 2. A key point is that dual functions ${F := {\mathcal D}^d_H(f)}$ obey a property analogous to being a generalised eigenfunction, namely that

$\displaystyle T^h F = {\bf E}_k \lambda_{h,k} F_k$

where ${F_k := T^k F}$ and ${\lambda_{h,k} := {\mathcal D}^{d-1}( T^h f \overline{T^k f} )}$ is a “structured function of order ${d-1}$” along ${H}$. (In the language of this previous paper of mine, this is an assertion that dual functions are uniformly almost periodic of order ${d}$.) Again, the point (ii) above is crucial, and in particular it is key that any structure that ${F}$ has is inherited by the associated functions ${\lambda_{h,k}}$ and ${F_k}$. This sort of inheritance is quite easy to accomplish in the ergodic setting, as there is a ready-made language of factors to encapsulate the concept of structure, and the shift-invariance and ${\sigma}$-algebra properties of factors make it easy to show that just about any “natural” operation one performs on a function measurable with respect to a given factor, returns a function that is still measurable in that factor. In the finitary combinatorial setting, though, encoding the fact (ii) becomes a remarkably complicated notational nightmare, requiring a huge amount of “epsilon management” and “second-order epsilon management” (in which one manages not only scalar epsilons, but also function-valued epsilons that depend on other parameters). In order to avoid all this we were forced to utilise a nonstandard analysis framework for the combinatorial theorems, which made the arguments greatly resemble the ergodic arguments in many respects (though the two settings are still not equivalent, see this previous blog post for some comparisons between the two settings). Unfortunately the arguments are still rather complicated.

For combinatorial applications, dual formulations of the concatenation theorem are more useful. A direct dualisation of the theorem yields the following decomposition theorem: a bounded function which is small in ${U^{d_1+d_2-1}_{H_1+H_2}}$ norm can be split into a component that is small in ${U^{d_1}_{H_1}}$ norm, and a component that is small in ${U^{d_2}_{H_2}}$ norm. (One may wish to understand this type of result by first proving the following baby version: any function that has mean zero on every coset of ${H_1+H_2}$, can be decomposed as the sum of a function that has mean zero on every ${H_1}$ coset, and a function that has mean zero on every ${H_2}$ coset. This is dual to the assertion that a function that is constant on every ${H_1}$ coset and constant on every ${H_2}$ coset, is constant on every ${H_1+H_2}$ coset.) Combining this with some standard “almost orthogonality” arguments (i.e. Cauchy-Schwarz) gives the following Bessel-type inequality: if one has a lot of subgroups ${H_1,\dots,H_k}$ and a bounded function is small in ${U^{2d-1}_{H_i+H_j}}$ norm for most ${i,j}$, then it is also small in ${U^d_{H_i}}$ norm for most ${i}$. (Here is a baby version one may wish to warm up on: if a function ${f}$ has small mean on ${({\bf Z}/p{\bf Z})^2}$ for some large prime ${p}$, then it has small mean on most of the cosets of most of the one-dimensional subgroups of ${({\bf Z}/p{\bf Z})^2}$.)
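The baby decomposition mentioned above can be made concrete in the simplest case ${G = ({\bf Z}/p{\bf Z})^2}$, ${H_1 = {\bf Z}/p{\bf Z} \times \{0\}}$, ${H_2 = \{0\} \times {\bf Z}/p{\bf Z}}$, so that ${H_1+H_2 = G}$. The sketch below uses one natural choice of splitting (subtracting the average along ${H_1}$ cosets); this choice is an assumption for illustration and is not claimed to be the construction used in the paper.

```python
from fractions import Fraction

def decompose(f):
    """Split a p x p array f (indexed f[x][y]) of total mean zero as g1 + g2.

    g1 has mean zero on every H1 coset (fixed y, varying x);
    g2 has mean zero on every H2 coset (fixed x, varying y).
    """
    p = len(f)
    # Average of f along each H1 coset {(x, y) : x in Z/pZ}.
    coset_mean = [sum(f[x][y] for x in range(p)) / Fraction(p)
                  for y in range(p)]
    g1 = [[f[x][y] - coset_mean[y] for y in range(p)] for x in range(p)]
    g2 = [[coset_mean[y] for y in range(p)] for x in range(p)]
    return g1, g2

if __name__ == "__main__":
    f = [[1, 2, -3], [4, -5, 0], [0, 2, -1]]  # entries sum to zero
    g1, g2 = decompose(f)
    print([sum(g1[x][y] for x in range(3)) for y in range(3)])  # H1 coset sums
    print([sum(g2[x][y] for y in range(3)) for x in range(3)])  # H2 coset sums
```

The second component has mean zero on each ${H_2}$ coset because averaging first along ${H_1}$ and then along ${H_2}$ is the same as averaging over all of ${G}$, where ${f}$ has mean zero by hypothesis.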

There is also a generalisation of the above Bessel inequality (as well as several of the other results mentioned above) in which the subgroups ${H_i}$ are replaced by more general coset progressions ${H_i+P_i}$ (of bounded rank), so that one has a Bessel inequality controlling “local” Gowers uniformity norms such as ${U^d_{P_i}}$ by “global” Gowers uniformity norms such as ${U^{2d-1}_{P_i+P_j}}$. This turns out to be particularly useful when attempting to compute polynomial averages such as

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} f(n) g(n+r^2) h(n+2r^2) \ \ \ \ \ (2)$

for various functions ${f,g,h}$. After repeated use of the van der Corput lemma, one can control such averages by expressions such as

$\displaystyle \sum_{n \leq N} \sum_{h,m,k \leq \sqrt{N}} f(n) f(n+mh) f(n+mk) f(n+m(h+k))$

(actually one ends up with more complicated expressions than this, but let’s use this example for sake of discussion). This can be viewed as an average of various ${U^2}$ Gowers uniformity norms of ${f}$ along arithmetic progressions of the form ${\{ mh: h \leq \sqrt{N}\}}$ for various ${m \leq \sqrt{N}}$. Using the above Bessel inequality, this can be controlled in turn by an average of various ${U^3}$ Gowers uniformity norms along rank two generalised arithmetic progressions of the form ${\{ m_1 h_1 + m_2 h_2: h_1,h_2 \le \sqrt{N}\}}$ for various ${m_1,m_2 \leq \sqrt{N}}$. But for generic ${m_1,m_2}$, this rank two progression is close in a certain technical sense to the “global” interval ${\{ n: n \leq N \}}$ (this is ultimately due to the basic fact that two randomly chosen large integers are likely to be coprime, or at least have a small gcd). As a consequence, one can use the concatenation theorems from our first paper to control expressions such as (2) in terms of global Gowers uniformity norms. This is important in number theoretic applications, when one is interested in computing sums such as

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} \mu(n) \mu(n+r^2) \mu(n+2r^2)$

or

$\displaystyle \sum_{n \leq N} \sum_{r \leq \sqrt{N}} \Lambda(n) \Lambda(n+r^2) \Lambda(n+2r^2)$

where ${\mu}$ and ${\Lambda}$ are the Möbius and von Mangoldt functions respectively. This is because we are able to control global Gowers uniformity norms of such functions (thanks to results such as the proof of the inverse conjecture for the Gowers norms, the orthogonality of the Möbius function with nilsequences, and asymptotics for linear equations in primes), but much less control is currently available for local Gowers uniformity norms, even with the assistance of the generalised Riemann hypothesis (see this previous blog post for some further discussion).
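
For readers who want to experiment numerically, the Möbius sum above can be evaluated by brute force for small ${N}$. The following Python sketch is only an illustration and plays no role in the proofs; the function names `mobius_table` and `mobius_polynomial_sum` are made up for this example, and it assumes that ${n}$ and ${r}$ range over positive integers with ${r \leq \lfloor \sqrt{N} \rfloor}$.

```python
from math import isqrt


def mobius_table(limit):
    """Return a list mu with mu[m] equal to the Moebius function at m, for 0 <= m <= limit.

    mu[0] is set to 0 as a placeholder; it is never used below.
    """
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        # Each prime factor p flips the sign of mu at its multiples.
        for m in range(p, limit + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] *= -1
        # Multiples of p^2 are not squarefree, so mu vanishes there.
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu


def mobius_polynomial_sum(N):
    """Brute-force sum of mu(n) mu(n+r^2) mu(n+2r^2) over 1 <= n <= N, 1 <= r <= sqrt(N)."""
    R = isqrt(N)
    mu = mobius_table(N + 2 * R * R)
    return sum(
        mu[n] * mu[n + r * r] * mu[n + 2 * r * r]
        for n in range(1, N + 1)
        for r in range(1, R + 1)
    )


if __name__ == "__main__":
    for N in (100, 1000, 10000):
        total = mobius_polynomial_sum(N)
        # The trivial bound is N * sqrt(N); cancellation should make the ratio small.
        print(N, total, total / (N * isqrt(N)))
```

The printed ratio compares the sum against the trivial bound ${N \sqrt{N}}$. A brute-force computation of this kind costs about ${N^{3/2}}$ operations, so it is only practical for modest ${N}$.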

By combining these tools and strategies with the “transference principle” approach from our previous paper (as improved using the recent “densification” technique of Conlon, Fox, and Zhao, discussed in this previous post), we are able in particular to establish the following result:

Theorem 4 (Polynomial patterns in the primes) Let ${P_1,\dots,P_k: {\bf Z} \rightarrow {\bf Z}}$ be polynomials of degree at most ${d}$, whose degree ${d}$ coefficients are all distinct, for some ${d \geq 1}$. Suppose that ${P_1,\dots,P_k}$ is admissible in the sense that for every prime ${p}$, there are ${n,r}$ such that ${n+P_1(r),\dots,n+P_k(r)}$ are all coprime to ${p}$. Then there exist infinitely many pairs ${n,r}$ of natural numbers such that ${n+P_1(r),\dots,n+P_k(r)}$ are prime.
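
The admissibility hypothesis is easy to test by machine. Here is a minimal Python sketch, under the extra assumption that the ${P_i}$ have integer coefficients (so that ${P_i(r)}$ mod ${p}$ depends only on ${r}$ mod ${p}$); the helper names are hypothetical. Only the primes ${p \leq k}$ need to be checked: for ${p > k}$ and any fixed ${r}$, the conditions ${n \neq -P_i(r)}$ mod ${p}$ forbid at most ${k}$ residue classes, so some ${n}$ always survives.

```python
def primes_up_to(m):
    """Return the list of primes p <= m by trial division (m is small here)."""
    return [p for p in range(2, m + 1)
            if all(p % q != 0 for q in range(2, int(p ** 0.5) + 1))]


def is_admissible(polys):
    """Check admissibility of a list of integer-coefficient polynomials.

    `polys` is a list of Python callables r -> P_i(r).  For each prime p <= k
    we search residues n, r mod p for a pair making every n + P_i(r) nonzero mod p.
    """
    k = len(polys)
    for p in primes_up_to(k):
        found = any(
            all((n + P(r)) % p != 0 for P in polys)
            for n in range(p)
            for r in range(p)
        )
        if not found:
            return False
    return True


if __name__ == "__main__":
    # The pattern n, n + r^2, n + 2r^2: admissible (take n = 1, r = 0 mod p).
    print(is_admissible([lambda r: 0, lambda r: r * r, lambda r: 2 * r * r]))
    # r^2 + r is always even and 3r^2 + r + 1 is always odd, so one of
    # n + P_1(r), n + P_2(r) is always even: not admissible at p = 2.
    print(is_admissible([lambda r: r * r + r, lambda r: 3 * r * r + r + 1]))
```

The second example has distinct degree ${2}$ coefficients, so it shows that the admissibility condition is not implied by the other hypothesis of the theorem.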

Furthermore, we obtain an asymptotic for the number of such pairs ${n,r}$ in the range ${n \leq N}$, ${r \leq N^{1/d}}$ (actually for minor technical reasons we reduce the range of ${r}$ to be very slightly less than ${N^{1/d}}$). In fact one could in principle obtain asymptotics for smaller values of ${r}$, and relax the requirement that the degree ${d}$ coefficients be distinct to the requirement that no two of the ${P_i}$ differ by a constant, provided one had good enough local uniformity results for the Möbius or von Mangoldt functions. For instance, we can obtain an asymptotic for triplets of the form ${n, n+r,n+r^d}$ unconditionally for ${d \leq 5}$, and conditionally on GRH for all ${d}$, using known results on primes in short intervals on average.
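
As a sanity check on the kind of quantity being counted, one can enumerate prime triplets ${n, n+r, n+r^d}$ directly for small ${N}$. The Python sketch below is purely illustrative: the name `count_prime_triples` is invented here, ${n}$ and ${r}$ are assumed to be positive integers, and the full range ${r \leq N^{1/d}}$ is used rather than the slightly reduced range from the paper. It makes no attempt to compute the predicted main term.

```python
def prime_table(limit):
    """Sieve of Eratosthenes: is_prime[m] is True exactly when m is prime, 0 <= m <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = False
    if limit >= 1:
        is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return is_prime


def count_prime_triples(N, d):
    """Count pairs (n, r) with 1 <= n <= N, 1 <= r, r^d <= N and n, n+r, n+r^d all prime."""
    # Largest integer R with R^d <= N, found without floating point roots.
    R = 1
    while (R + 1) ** d <= N:
        R += 1
    is_prime = prime_table(N + R ** d)
    return sum(
        1
        for n in range(1, N + 1)
        for r in range(1, R + 1)
        if is_prime[n] and is_prime[n + r] and is_prime[n + r ** d]
    )


if __name__ == "__main__":
    for N in (10, 100, 1000, 10000):
        print(N, count_prime_triples(N, 2))
```

For ${N=10}$ and ${d=2}$ the count is ${3}$, coming from the pairs ${(n,r) = (2,1), (3,2), (2,3)}$; note that ${r=1}$ gives the degenerate case ${n+r = n+r^d}$.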

The ${d=1}$ case of this theorem was obtained in a previous paper of myself and Ben Green (using the aforementioned conjectures on the Gowers uniformity norm and the orthogonality of the Möbius function with nilsequences, both of which are now proven). For higher ${d}$, an older result of Tamar and myself was able to tackle the case when ${P_1(0)=\dots=P_k(0)=0}$ (though our results there only give lower bounds on the number of pairs ${(n,r)}$, and no asymptotics). Both of these results generalise my older theorem with Ben Green on the primes containing arbitrarily long arithmetic progressions. The theorem also extends to multidimensional polynomials, in which case there are some additional previous results; see the paper for more details. We also get a technical refinement of our previous result on narrow polynomial progressions in (dense subsets of) the primes by making the progressions just a little bit narrower in the case when the density of the set one is using is small.

The Bessel-type inequality described above is particularly useful in combinatorial and number-theoretic applications, as it allows one to convert “global” Gowers uniformity norm control (basically, bounds on norms such as ${U^{2d-1}_{H_i+H_j}}$) into “local” Gowers uniformity norm control.