
The twin prime conjecture, still unsolved, asserts that there are infinitely many primes ${p}$ such that ${p+2}$ is also prime. A more precise form of this conjecture is (a special case of) the Hardy-Littlewood prime tuples conjecture, which asserts that

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) = (2\Pi_2+o(1)) x \ \ \ \ \ (1)$

as ${x \rightarrow \infty}$, where ${\Lambda}$ is the von Mangoldt function and ${\Pi_2 = 0.6601\dots}$ is the twin prime constant

$\displaystyle \prod_{p>2} (1 - \frac{1}{(p-1)^2}).$

Because ${\Lambda}$ is almost entirely supported on the primes, it is not difficult to see that (1) implies the twin prime conjecture.
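As a sanity check on the shape of (1), one can compare both sides numerically for moderate ${x}$. The following is only a sketch: the function names are illustrative choices, the Euler product for ${\Pi_2}$ is truncated at a finite prime bound, and agreement at a single ${x}$ is of course no evidence of anything beyond plausibility.

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes: list of all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [n for n in range(limit + 1) if is_prime[n]]

def von_mangoldt_table(limit):
    """lam[n] = log p if n is a power of a prime p, and 0 otherwise."""
    lam = [0.0] * (limit + 1)
    for p in primes_up_to(limit):
        log_p = math.log(p)
        q = p
        while q <= limit:
            lam[q] = log_p
            q *= p
    return lam

def twin_prime_constant(prime_bound):
    """Truncation of prod_{p>2} (1 - 1/(p-1)^2) to primes <= prime_bound."""
    product = 1.0
    for p in primes_up_to(prime_bound):
        if p > 2:
            product *= 1.0 - 1.0 / (p - 1) ** 2
    return product

def twin_correlation(x):
    """Left-hand side of (1): sum of Lambda(n) * Lambda(n+2) over n <= x."""
    lam = von_mangoldt_table(x + 2)
    return sum(lam[n] * lam[n + 2] for n in range(1, x + 1))

x = 10**5
print("empirical ratio :", twin_correlation(x) / x)
print("2 * Pi_2 approx :", 2 * twin_prime_constant(10**5))
```

The two printed numbers should be of comparable size; the discrepancy between them is the ${o(1)}$ term in (1), which is not controlled by anything in this post.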

One can give a heuristic justification of the asymptotic (1) (and hence the twin prime conjecture) via sieve theoretic methods. Recall that the von Mangoldt function can be decomposed as a Dirichlet convolution

$\displaystyle \Lambda(n) = \sum_{d|n} \mu(d) \log \frac{n}{d}$

where ${\mu}$ is the Möbius function. Because of this, we can rewrite the left-hand side of (1) as

$\displaystyle \sum_{d \leq x} \mu(d) \sum_{n \leq x: d|n} \log\frac{n}{d} \Lambda(n+2). \ \ \ \ \ (2)$
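The convolution identity ${\Lambda(n) = \sum_{d|n} \mu(d) \log \frac{n}{d}}$ used above can be checked directly for small ${n}$. A minimal sketch (the helper names are illustrative, and the comparison is in floating point, so agreement is only up to rounding):

```python
import math

def mobius_table(limit):
    """mu[n] for 0 <= n <= limit (with mu[0] set to 0 as a placeholder)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        for m in range(p, limit + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] = -mu[m]          # one sign flip per prime divisor
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0               # not squarefree
    return mu

def von_mangoldt_direct(n):
    """Lambda(n) from the definition: log p if n = p^j with j >= 1, else 0."""
    if n < 2:
        return 0.0
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n)  # n itself is prime

def von_mangoldt_by_convolution(limit):
    """Lambda(n) computed as sum_{d | n} mu(d) * log(n / d)."""
    mu = mobius_table(limit)
    lam = [0.0] * (limit + 1)
    for d in range(1, limit + 1):
        if mu[d] == 0:
            continue
        for q in range(1, limit // d + 1):
            lam[d * q] += mu[d] * math.log(q)
    return lam

lam = von_mangoldt_by_convolution(1000)
worst = max(abs(lam[n] - von_mangoldt_direct(n)) for n in range(1, 1001))
print("largest discrepancy for n <= 1000:", worst)
```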

To compute this double sum, it is thus natural to consider sums such as

$\displaystyle \sum_{n \leq x: d|n} \log \frac{n}{d} \Lambda(n+2)$

or (to simplify things by removing the logarithm)

$\displaystyle \sum_{n \leq x: d|n} \Lambda(n+2).$

The prime number theorem in arithmetic progressions suggests that one has an asymptotic of the form

$\displaystyle \sum_{n \leq x: d|n} \Lambda(n+2) \approx \frac{g(d)}{d} x \ \ \ \ \ (3)$

where ${g}$ is the multiplicative function with ${g(d)=0}$ for ${d}$ even and

$\displaystyle g(d) := \frac{d}{\phi(d)} = \prod_{p|d} (1-\frac{1}{p})^{-1}$

for ${d}$ odd. Summing by parts, one then expects

$\displaystyle \sum_{n \leq x: d|n} \Lambda(n+2)\log \frac{n}{d} \approx \frac{g(d)}{d} x \log \frac{x}{d}$

and so we heuristically have

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) \approx x \sum_{d \leq x} \frac{\mu(d) g(d)}{d} \log \frac{x}{d}.$
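The approximation (3) is easy to test numerically for small moduli. The sketch below (with illustrative function names, and ${x}$ and ${d}$ chosen arbitrarily) compares the left-hand side of (3) with ${\frac{g(d)}{d} x}$; for even ${d}$ the left-hand side only sees powers of two and so is tiny, consistent with ${g(d)=0}$.

```python
import math

def von_mangoldt_table(limit):
    """lam[n] = log p if n is a power of a prime p, and 0 otherwise."""
    lam = [0.0] * (limit + 1)
    is_composite = [False] * (limit + 1)
    for p in range(2, limit + 1):
        if is_composite[p]:
            continue
        for m in range(p * p, limit + 1, p):
            is_composite[m] = True
        q = p
        while q <= limit:
            lam[q] = math.log(p)
            q *= p
    return lam

def g(d):
    """g(d) = 0 for even d, and d / phi(d) = prod_{p | d} (1 - 1/p)^{-1} for odd d."""
    if d % 2 == 0:
        return 0.0
    result, m, p = 1.0, d, 3
    while p * p <= m:
        if m % p == 0:
            result *= p / (p - 1)
            while m % p == 0:
                m //= p
        p += 2
    if m > 1:
        result *= m / (m - 1)
    return result

def progression_sum(x, d, lam):
    """Left-hand side of (3): sum of Lambda(n+2) over n <= x with d | n."""
    return sum(lam[n + 2] for n in range(d, x + 1, d))

x = 10**5
lam = von_mangoldt_table(x + 2)
for d in (1, 2, 3, 5, 15):
    print(d, progression_sum(x, d, lam), g(d) / d * x)
```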

The Dirichlet series

$\displaystyle \sum_n \frac{\mu(n) g(n)}{n^s}$

has the Euler product factorisation

$\displaystyle \sum_n \frac{\mu(n) g(n)}{n^s} = \prod_p (1 - \frac{g(p)}{p^s})$

for ${\hbox{Re} s > 1}$; comparing this with the Euler product factorisation

$\displaystyle \zeta(s) = \prod_p (1 - \frac{1}{p^s})^{-1}$

for the Riemann zeta function, and recalling that ${\zeta}$ has a simple pole of residue ${1}$ at ${s=1}$, we see that

$\displaystyle \sum_n \frac{\mu(n) g(n)}{n^s} = \frac{1}{\zeta(s)} \prod_p \frac{1-g(p)/p^s}{1-1/p^s}$

has a simple zero at ${s=1}$ with first derivative

$\displaystyle \prod_p \frac{1 - g(p)/p}{1-1/p} = 2 \Pi_2.$

From this and standard multiplicative number theory manipulations, one can calculate the asymptotic

$\displaystyle \sum_{d \leq x} \frac{\mu(d) g(d)}{d} \log \frac{x}{d} = 2 \Pi_2 + o(1)$

which concludes the heuristic justification of (1).
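The last asymptotic can also be observed numerically. Since ${g(d) = d/\phi(d)}$ for odd ${d}$ and ${g(d)=0}$ for even ${d}$, the summand ${\frac{\mu(d) g(d)}{d}}$ is just ${\mu(d)/\phi(d)}$ on odd ${d}$. A rough sketch (function names are illustrative, and the cutoff ${x = 10^5}$ is an arbitrary choice):

```python
import math

def mobius_and_totient(limit):
    """Return (mu, phi) as lists indexed by 0..limit (index 0 is a placeholder)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    phi = list(range(limit + 1))
    for p in range(2, limit + 1):
        if phi[p] != p:
            continue  # p is composite: a smaller prime has already reduced phi[p]
        for m in range(p, limit + 1, p):
            phi[m] -= phi[m] // p
            mu[m] = -mu[m]
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu, phi

def heuristic_sum(x):
    """sum_{d <= x} mu(d) g(d) / d * log(x / d), with g as in the text."""
    mu, phi = mobius_and_totient(x)
    total = 0.0
    for d in range(1, x + 1, 2):  # even d contribute nothing since g(d) = 0
        if mu[d] != 0:
            total += mu[d] / phi[d] * math.log(x / d)
    return total

print(heuristic_sum(10**5))  # to be compared with 2 * Pi_2 = 1.3203...
```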

What prevents us from making the above heuristic argument rigorous, and thus proving (1) and the twin prime conjecture? Note that the variable ${d}$ in (2) ranges to be as large as ${x}$. On the other hand, the prime number theorem in arithmetic progressions (3) is not expected to hold for ${d}$ anywhere that large (for instance, the left-hand side of (3) vanishes as soon as ${d}$ exceeds ${x}$). The best unconditional result known of the type (3) is the Siegel-Walfisz theorem, which allows ${d}$ to be as large as ${\log^{O(1)} x}$. Even the powerful generalised Riemann hypothesis (GRH) only lets one prove an estimate of the form (3) for ${d}$ up to about ${x^{1/2-o(1)}}$.

However, because of the averaging effect of the summation in ${d}$ in (2), we don’t need the asymptotic (3) to be true for all ${d}$ in a particular range; having it true for almost all ${d}$ in that range would suffice. Here the situation is much better; the celebrated Bombieri-Vinogradov theorem (sometimes known as “GRH on the average”) implies, roughly speaking, that the approximation (3) is valid for almost all ${d \leq x^{1/2-\varepsilon}}$ for any fixed ${\varepsilon>0}$. While this is not enough to control (2) or (1), the Bombieri-Vinogradov theorem can at least be used to control variants of (1) such as

$\displaystyle \sum_{n \leq x} (\sum_{d|n} \lambda_d) \Lambda(n+2)$

for various sieve weights ${\lambda_d}$ whose associated divisor function ${\sum_{d|n} \lambda_d}$ is supposed to approximate the von Mangoldt function ${\Lambda}$, although that theorem only lets one do this when the weights ${\lambda_d}$ are supported on the range ${d \leq x^{1/2-\varepsilon}}$. This is still enough to obtain some partial results towards (1); for instance, by selecting weights according to the Selberg sieve, one can use the Bombieri-Vinogradov theorem to establish the upper bound

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) \leq (4+o(1)) 2 \Pi_2 x, \ \ \ \ \ (4)$

which is off from (1) by a factor of about ${4}$. See for instance this blog post for details.

It has been difficult to improve upon the Bombieri-Vinogradov theorem in its full generality, although there are various improvements to certain restricted versions of the Bombieri-Vinogradov theorem, for instance in the famous work of Zhang on bounded gaps between primes. Nevertheless, it is believed that the Elliott-Halberstam conjecture (EH) holds, which roughly speaking would mean that (3) now holds for almost all ${d \leq x^{1-\varepsilon}}$ for any fixed ${\varepsilon>0}$. (Unfortunately, the ${\varepsilon}$ factor cannot be removed, as investigated in a series of papers by Friedlander, Granville, and also Hildebrand and Maier.) This comes tantalisingly close to having enough distribution to control all of (1). Unfortunately, it still falls short. Using this conjecture in place of the Bombieri-Vinogradov theorem leads to various improvements to sieve theoretic bounds; for instance, the factor of ${4+o(1)}$ in (4) can now be improved to ${2+o(1)}$.

In two papers from the 1970s (which can be found online here and here respectively, the latter starting on page 255 of the pdf), Bombieri developed what is now known as the Bombieri asymptotic sieve to clarify the situation more precisely. First, he showed that on the Elliott-Halberstam conjecture, while one still could not establish the asymptotic (1), one could prove the generalised asymptotic

$\displaystyle \sum_{n \leq x} \Lambda_k(n) \Lambda(n+2) = (2\Pi_2+o(1)) k x \log^{k-1} x \ \ \ \ \ (5)$

for all natural numbers ${k \geq 2}$, where the generalised von Mangoldt functions ${\Lambda_k}$ are defined by the formula

$\displaystyle \Lambda_k(n) := \sum_{d|n} \mu(d) \log^k \frac{n}{d}.$

These functions behave like the von Mangoldt function, but are concentrated on ${k}$-almost primes (numbers with at most ${k}$ prime factors) rather than primes. The right-hand side of (5) corresponds to what one would expect if one ran the same heuristics used to justify (1). Sadly, the ${k=1}$ case of (5), which is just (1), is just barely excluded from Bombieri’s analysis.
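To get a feel for the functions ${\Lambda_k}$, one can tabulate them from the defining formula. The sketch below (illustrative names, floating point arithmetic) checks for ${k=2}$ that ${\Lambda_2(pq) = 2 \log p \log q}$ for distinct primes ${p,q}$, and that ${\Lambda_2}$ vanishes up to rounding on numbers with three or more distinct prime factors.

```python
import math

def mobius_table(limit):
    """mu[n] for 0 <= n <= limit (with mu[0] set to 0 as a placeholder)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        for m in range(p, limit + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] = -mu[m]
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu

def generalised_von_mangoldt(k, limit):
    """Table of Lambda_k(n) = sum_{d | n} mu(d) * log(n / d)^k for n <= limit."""
    mu = mobius_table(limit)
    lam_k = [0.0] * (limit + 1)
    for d in range(1, limit + 1):
        if mu[d] == 0:
            continue
        for q in range(1, limit // d + 1):
            lam_k[d * q] += mu[d] * math.log(q) ** k
    return lam_k

def omega(n):
    """Number of distinct prime factors of n."""
    count, p = 0, 2
    while p * p <= n:
        if n % p == 0:
            count += 1
            while n % p == 0:
                n //= p
        p += 1
    return count + (1 if n > 1 else 0)

lam2 = generalised_von_mangoldt(2, 1000)
print(lam2[6], 2 * math.log(2) * math.log(3))  # Lambda_2(2 * 3)
print(max(abs(lam2[n]) for n in range(2, 1001) if omega(n) >= 3))
```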

More generally, on the assumption of EH, the Bombieri asymptotic sieve provides the asymptotic

$\displaystyle \sum_{n \leq x} \Lambda_{(k_1,\dots,k_r)}(n) \Lambda(n+2) \ \ \ \ \ (6)$

$\displaystyle = (2\Pi_2+o(1)) \frac{\prod_{i=1}^r k_i!}{(k_1+\dots+k_r-1)!} x \log^{k_1+\dots+k_r-1} x$

for any fixed ${r \geq 1}$ and any tuple ${(k_1,\dots,k_r)}$ of natural numbers other than ${(1,\dots,1)}$, where

$\displaystyle \Lambda_{(k_1,\dots,k_r)} := \Lambda_{k_1} * \dots * \Lambda_{k_r}$

is a further generalisation of the von Mangoldt function (now concentrated on ${k_1+\dots+k_r}$-almost primes). By combining these asymptotics with some elementary identities involving the ${\Lambda_{(k_1,\dots,k_r)}}$, together with the Weierstrass approximation theorem, Bombieri was able to control a wide family of sums including (1), except for one undetermined scalar ${\delta_x \in [0,2]}$. Namely, he was able to show (again on EH) that for any fixed ${r \geq 1}$ and any continuous function ${g_r}$ on the simplex ${\Delta_r := \{ (t_1,\dots,t_r) \in {\bf R}^r: t_1+\dots+t_r = 1; 0 \leq t_1 \leq \dots \leq t_r\}}$ that had suitable vanishing at the boundary, the sum

$\displaystyle \sum_{n \leq x: n=p_1 \dots p_r} g_r( \frac{\log p_1}{\log n}, \dots, \frac{\log p_r}{\log n} ) \Lambda(n+2)$

was equal to

$\displaystyle (\delta_x+o(1)) \int_{\Delta_r} g_r \frac{x}{\log x} \ \ \ \ \ (7)$

when ${r}$ was odd and

$\displaystyle (2-\delta_x+o(1)) \int_{\Delta_r} g_r \frac{x}{\log x} \ \ \ \ \ (8)$

when ${r}$ was even, where the integral on ${\Delta_r}$ is with respect to the measure ${\frac{dt_1 \dots dt_{r-1}}{t_1 \dots t_r}}$ (this is Dirac measure in the case ${r=1}$). In particular, we have

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) = (\delta_x + o(1)) 2 \Pi_2 x$

and the twin prime conjecture would be proved if one could show that ${\delta_x}$ is bounded away from zero, while (1) is equivalent to the assertion that ${\delta_x}$ is equal to ${1+o(1)}$. Unfortunately, no additional bound beyond the inequalities ${0 \leq \delta_x \leq 2}$ provided by the Bombieri asymptotic sieve is known, even if one assumes all the major conjectures in number theory other than the prime tuples conjecture and its variants (e.g. GRH, GEH, GUE, abc, Chowla, …).

To put it another way, the Bombieri asymptotic sieve is able (on EH) to compute asymptotics for sums

$\displaystyle \sum_{n \leq x} f(n) \Lambda(n+2) \ \ \ \ \ (9)$

without needing to know the unknown scalar ${\delta_x}$, when ${f}$ is a function supported on almost primes of the form

$\displaystyle f(p_1 \dots p_r) = g_r( \frac{\log p_1}{\log n}, \dots, \frac{\log p_r}{\log n} )$

for ${1 \leq r \leq r_*}$ and some fixed ${r_*}$, with ${f}$ vanishing elsewhere and for some continuous (symmetric) functions ${g_r: \Delta_r \rightarrow {\bf C}}$ obeying some vanishing at the boundary, so long as the parity condition

$\displaystyle \sum_{r \hbox{ odd}} \int_{\Delta_r} g_r = \sum_{r \hbox{ even}} \int_{\Delta_r} g_r$

is obeyed (informally: ${f}$ gives the same weight to products of an odd number of primes as to products of an even number of primes, or to put it another way, ${f}$ is asymptotically orthogonal to the Möbius function ${\mu}$). But when ${f}$ violates the parity condition, the asymptotic involves the unknown ${\delta_x}$. This scalar ${\delta_x}$ thus embodies the “parity problem” for the twin prime conjecture (discussed in these previous blog posts).

Because the obstruction to the parity problem is only one-dimensional (on EH), one can replace any parity-violating weight (such as ${\Lambda}$) with any other parity-violating weight and obtain a logically equivalent estimate. For instance, to prove the twin prime conjecture on EH, it would suffice to show that

$\displaystyle \sum_{p_1 p_2 p_3 \leq x: p_1,p_2,p_3 \geq x^\alpha} \Lambda(p_1 p_2 p_3 + 2) \gg \frac{x}{\log x}$

for some fixed ${\alpha>0}$, or equivalently that there are ${\gg \frac{x}{\log^2 x}}$ solutions to the equation ${p - p_1 p_2 p_3 = 2}$ in primes with ${p \leq x}$ and ${p_1,p_2,p_3 \geq x^\alpha}$. (In some cases, this sort of reduction can also be made using other sieves than the Bombieri asymptotic sieve, as was observed by Ng.) As another example, the Bombieri asymptotic sieve can be used to show that the asymptotic (1) is equivalent to the asymptotic

$\displaystyle \sum_{n \leq x} \mu(n) 1_R(n) \Lambda(n+2) = o( \frac{x}{\log x})$

where ${R}$ is the set of numbers that are rough in the sense that they have no prime factors less than ${x^\alpha}$ for some fixed ${\alpha>0}$ (the function ${\mu 1_R}$ clearly correlates with ${\mu}$ and so must violate the parity condition). One can replace ${1_R}$ with similar sieve weights (e.g. a Selberg sieve) that concentrate on almost primes if desired.

As it turns out, if one is willing to strengthen the assumption of the Elliott-Halberstam (EH) conjecture to the assumption of the generalised Elliott-Halberstam (GEH) conjecture (as formulated for instance in Claim 2.6 of the Polymath8b paper), one can also swap the ${\Lambda(n+2)}$ factor in the above asymptotics with other parity-violating weights and obtain a logically equivalent estimate, as the Bombieri asymptotic sieve also applies to weights such as ${\mu 1_R}$ under the assumption of GEH. For instance, on GEH one can use two such applications of the Bombieri asymptotic sieve to show that the twin prime conjecture would follow if one could show that there are ${\gg \frac{x}{\log^2 x}}$ solutions to the equation

$\displaystyle p_1 p_2 - p_3 p_4 = 2$

in primes with ${p_1,p_2,p_3,p_4 \geq x^\alpha}$ and ${p_1 p_2 \leq x}$, for some ${\alpha > 0}$. Similarly, on GEH the asymptotic (1) is equivalent to the asymptotic

$\displaystyle \sum_{n \leq x} \mu(n) 1_R(n) \mu(n+2) 1_R(n+2) = o( \frac{x}{\log^2 x})$

for some fixed ${\alpha>0}$, and similarly with ${1_R}$ replaced by other sieves. This form of the quantitative twin primes conjecture is appealingly similar to the (special case)

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+2) = o(x)$

of the Chowla conjecture, for which there has been some recent progress (discussed for instance in these recent posts). Informally, the Bombieri asymptotic sieve lets us (on GEH) view the twin prime conjecture as a sort of Chowla conjecture restricted to almost primes. Unfortunately, the recent progress on the Chowla conjecture relies heavily on the multiplicativity of ${\mu}$ at small primes, which is completely destroyed by inserting a weight such as ${1_R}$, so this does not yet yield a viable path towards the twin prime conjecture even assuming GEH. Still, the similarity is striking, and one can hope that further ways to attack the Chowla conjecture may emerge that could impact the twin prime conjecture. (Alternatively, if one assumes a sufficiently optimistic version of the GEH, one could perhaps relax the notion of “almost prime” to the extent that one could start usefully using multiplicativity at smallish primes, though this seems rather wishful at present, particularly since the most optimistic versions of GEH are known to be false.)

The Bombieri asymptotic sieve is already well explained in the original two papers of Bombieri; there is also a slightly different treatment of the sieve by Friedlander and Iwaniec, as well as a simplified version in the book of Friedlander and Iwaniec (in which the distribution hypothesis is strengthened in order to shorten the arguments). I’ve decided though to write up my own notes on the sieve below the fold; this is primarily for my own benefit, but may be useful to some readers also. I largely follow the treatment of Bombieri, with the one idiosyncratic twist of replacing the usual “elementary” Selberg sieve with the “analytic” Selberg sieve used in particular in many of the breakthrough works in small gaps between primes; I prefer working with the latter due to its Fourier-analytic flavour.

— 1. Controlling generalised von Mangoldt sums —

To prove (5), we shall first generalise it, by replacing the sequence ${\Lambda(n+2)}$ by a more general sequence ${a_n}$ obeying the following axioms:

• (i) (Non-negativity) One has ${a_n \geq 0}$ for all ${n}$.
• (ii) (Crude size bound) One has ${a_n \ll \tau(n)^{O(1)} \log^{O(1)} n}$ for all ${n}$, where ${\tau}$ is the divisor function.
• (iii) (Size) We have ${\sum_{n \leq x} a_n = (C+o(1)) x}$ for some constant ${C>0}$.
• (iv) (Elliott-Halberstam type conjecture) For any ${\varepsilon,A>0}$, one has

$\displaystyle \sum_{d \leq x^{1-\varepsilon}} |\sum_{n \leq x: d|n} a_n - C x \frac{g(d)}{d}| \ll_{\varepsilon,A} x \log^{-A} x$

where ${g}$ is a multiplicative function with ${g(p^j) = 1 + O(1/p)}$ for all primes ${p}$ and ${j \geq 1}$.

These axioms are a little bit stronger than what is actually needed to make the Bombieri asymptotic sieve work, but we will not attempt to work with the weakest possible axioms here.

We introduce the function

$\displaystyle G(s) := \prod_p \frac{1-g(p)/p^s}{1-1/p^s}$

which is analytic for ${\hbox{Re}(s) > 0}$; in particular it can be evaluated at ${s=1}$ to yield

$\displaystyle G(1) = \prod_p \frac{1-g(p)/p}{1-1/p}.$

There are two model examples of data ${a_n, C, g}$ to keep in mind. The first, discussed in the introduction, is when ${a_n =\Lambda(n+2)}$, in which case ${C = 1}$ (by the prime number theorem), ${g}$ is as in the introduction, and ${G(1) = 2 \Pi_2}$; one of course needs EH to justify axiom (iv) in this case. The other is when ${a_n=1}$, in which case ${C=1}$ and ${g(n)=1}$ for all ${n}$. We will later take advantage of the second example to avoid doing some (routine, but messy) main term computations.
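For concreteness, the value ${G(1)}$ in the two model examples can be approximated by truncating the Euler product. This is only a sketch: the function `G_at_one` and its arguments are names chosen here for illustration, and the truncation point is arbitrary.

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes: list of all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [n for n in range(limit + 1) if is_prime[n]]

def G_at_one(g_at_prime, prime_bound):
    """Truncation of G(1) = prod_p (1 - g(p)/p) / (1 - 1/p) to primes <= prime_bound."""
    product = 1.0
    for p in primes_up_to(prime_bound):
        product *= (1.0 - g_at_prime(p) / p) / (1.0 - 1.0 / p)
    return product

def g_twin(p):
    """g(p) for the example a_n = Lambda(n+2): zero at p = 2, p/(p-1) at odd primes."""
    return 0.0 if p == 2 else p / (p - 1)

def g_trivial(p):
    """g(p) for the example a_n = 1."""
    return 1.0

print(G_at_one(g_twin, 10**5))     # should be close to 2 * Pi_2 = 1.3203...
print(G_at_one(g_trivial, 10**5))  # every Euler factor equals 1 here
```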

The main result of this section is then

Theorem 1 Let ${a_n, g, C, G}$ be as above. Let ${\vec k = (k_1,\dots,k_r)}$ be a tuple of natural numbers (independent of ${x}$) that is not equal to ${(1,\dots,1)}$. Then one has the asymptotic

$\displaystyle \sum_{n \leq x} \Lambda_{\vec k}(n) a_n = (G(1)+o(1)) \frac{\prod_{i=1}^r k_i!}{(|\vec k|-1)!} C x \log^{|\vec k|-1} x$

as ${x \rightarrow \infty}$, where ${|\vec k| := k_1 + \dots + k_r}$.

Note that this recovers (5) (on EH) as a special case.

We now begin the proof of this theorem. Henceforth we allow implied constants in the ${O()}$ or ${\ll}$ notation to depend on ${r, \vec k}$ and ${g,G}$.

It will be convenient to replace the range ${n \leq x}$ by a shorter range by the following standard localisation trick. Let ${B}$ be a large quantity depending on ${r, \vec k}$ to be chosen later, and let ${I}$ denote the interval ${\{ n: x - x \log^{-B} x \leq n \leq x \}}$. We will show the estimate

$\displaystyle \sum_{n \in I} \Lambda_{\vec k}(n) a_n = (G(1)+o(1)) \frac{\prod_{i=1}^r k_i!}{(|\vec k|-1)!} C |I| \log^{|\vec k|-1} x \ \ \ \ \ (10)$

from which the original claim follows by a routine summation argument. Observe from axiom (iv) and the triangle inequality that

$\displaystyle \sum_{d \leq x^{1-\varepsilon}: \mu^2(d)=1} |\sum_{n \in I: d|n} a_n - C |I| \frac{g(d)}{d}| \ll_{\varepsilon,A} x \log^{-A} x$

for any ${\varepsilon,A > 0}$.

Write ${L}$ for the logarithm function ${L(n) := \log n}$, thus ${\Lambda_k = \mu * L^k}$ for any ${k}$. Without loss of generality we may assume that ${k_r > 1}$; we then factor ${\Lambda_{\vec k} = \mu_{\vec k} * L^{k_r}}$, where

$\displaystyle \mu_{\vec k} := \Lambda_{k_1} * \dots * \Lambda_{k_{r-1}} * \mu.$

This function is just ${\mu}$ when ${r=1}$. When ${r>1}$ the function is more complicated, but we at least have the following crude bound:

Lemma 2 One has the pointwise bound ${|\mu_{\vec k}| \leq L^{|\vec k|-k_r}}$.

Proof: We induct on ${r}$. The case ${r=1}$ is obvious, so suppose ${r>1}$ and the claim has already been proven for ${r-1}$. Since ${\mu_{\vec k} = \Lambda_{k_1} * \mu_{(k_2,\dots,k_r)}}$, we see from the induction hypothesis and the triangle inequality that

$\displaystyle |\mu_{\vec k}| \leq \Lambda_{k_1} * L^{|\vec k| - k_r - k_1} \leq L^{|\vec k| - k_r - k_1} (\Lambda_{k_1} * 1).$

Since ${\Lambda_{k_1}*1 = L^{k_1}}$ by Möbius inversion, the claim follows. $\Box$
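Lemma 2 can be spot-checked numerically for small tuples. The sketch below builds ${\mu_{\vec k}}$ by repeated Dirichlet convolution and tests the pointwise bound up to a small floating point tolerance; the function names and the tolerance are choices made here, and a finite check like this is of course not a substitute for the proof.

```python
import math

def mobius_table(limit):
    """mu[n] for 0 <= n <= limit (with mu[0] set to 0 as a placeholder)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        for m in range(p, limit + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] = -mu[m]
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu

def dirichlet_convolve(f, g, limit):
    """(f * g)(n) = sum_{d | n} f(d) g(n / d) for 1 <= n <= limit."""
    h = [0.0] * (limit + 1)
    for d in range(1, limit + 1):
        if f[d] == 0:
            continue
        for q in range(1, limit // d + 1):
            h[d * q] += f[d] * g[q]
    return h

def mu_vec(ks, limit):
    """mu_k = Lambda_{k_1} * ... * Lambda_{k_{r-1}} * mu for ks = (k_1, ..., k_r)."""
    mu = [float(v) for v in mobius_table(limit)]
    result = mu
    for k in ks[:-1]:
        log_power = [0.0] + [math.log(n) ** k for n in range(1, limit + 1)]
        lam_k = dirichlet_convolve(mu, log_power, limit)  # Lambda_k = mu * L^k
        result = dirichlet_convolve(lam_k, result, limit)
    return result

def lemma2_holds(ks, limit, tol=1e-9):
    """Check |mu_k(n)| <= log(n)^(|k| - k_r) for all 1 <= n <= limit."""
    exponent = sum(ks) - ks[-1]
    values = mu_vec(ks, limit)
    return all(abs(values[n]) <= math.log(n) ** exponent + tol
               for n in range(1, limit + 1))

for ks in [(2,), (1, 2), (2, 1, 2)]:
    print(ks, lemma2_holds(ks, 1000))
```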

We can write

$\displaystyle \Lambda_{\vec k}(n) = \sum_{d|n} \mu_{\vec k}(d) \log^{k_r} \frac{n}{d}.$

In the region ${n \in I}$, we have ${\log^{k_r} \frac{n}{d} = \log^{k_r} \frac{x}{d} + O( \log^{-B+O(1)} x )}$. Thus

$\displaystyle \Lambda_{\vec k}(n) = \sum_{d|n} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d} + O( \tau(n) \log^{-B+O(1)} x )$

for ${n \in I}$. The contribution of the error term ${O( \tau(n) \log^{-B+O(1)} x )}$ to (10) is easily seen to be negligible if ${B}$ is large enough, so we may freely replace ${\Lambda_{\vec k}(n)}$ with ${\sum_{d|n} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d}}$.

If we insert this replacement directly into the left-hand side of (10) and rearrange, we get

$\displaystyle \sum_{d \leq x} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d} \sum_{n \in I: d|n} a_n.$

We can’t quite control this using axiom (iv) because the range of ${d}$ is a bit too big, as explained in the introduction. So let us introduce a truncated function

$\displaystyle \Lambda_{\vec k,\varepsilon}(n) := \sum_{d|n} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d} \eta_\varepsilon( \frac{\log d}{\log x} ) \ \ \ \ \ (11)$

where ${\varepsilon>0}$ is a small quantity to be chosen later, and ${\eta_\varepsilon: {\bf R} \rightarrow [0,1]}$ is a smooth function that equals ${1}$ on ${(-\infty,1-4\varepsilon)}$ and equals ${0}$ on ${(1-3\varepsilon,+\infty)}$. Suppose one could establish the following two estimates for any fixed ${\varepsilon>0}$:

$\displaystyle \sum_{n \in I} \Lambda_{\vec k}(n) a_n = \sum_{n \in I} \Lambda_{\vec k,\varepsilon}(n) a_n + O( (\varepsilon+o(1)) C |I| \log^{|\vec k|-1} x ) \ \ \ \ \ (12)$

and

$\displaystyle \sum_{n \in I} \Lambda_{\vec k,\varepsilon}(n) a_n = C Q_{\varepsilon,x} G(1) + o( |I| \log^{|\vec k|-1} x ) \ \ \ \ \ (13)$

where ${Q_{\varepsilon,x}}$ is a quantity that depends on ${\varepsilon, \eta_\varepsilon, \vec k, B, x}$ but not on ${C, g,G}$. Then on combining the two estimates we would have

$\displaystyle \sum_{n \in I} \Lambda_{\vec k}(n) a_n = C Q_{\varepsilon,x} G(1) + (O(\varepsilon) + o(1)) C |I| \log^{|\vec k|-1} x. \ \ \ \ \ (14)$

One could in principle compute ${Q_{\varepsilon,x}}$ explicitly from the proof of (13), but one can avoid doing so by the following comparison trick. In the special case ${a_n=1}$, standard multiplicative number theory (noting that the Dirichlet series ${\sum_n \frac{\Lambda_{\vec k}(n)}{n^s}}$ has a pole of order ${|\vec k|}$ at ${s=1}$, with top Laurent coefficient ${\prod_{j=1}^r k_j!}$) gives the asymptotic

$\displaystyle \sum_{n \in I} \Lambda_{\vec k}(n) a_n = (\frac{\prod_{i=1}^r k_i!}{(|\vec k|-1)!} + o(1)) |I| \log^{|\vec k|-1} x$

which when compared with (14) for ${a_n=1}$ (recalling that ${G(1)=C=1}$ in this case) gives the formula

$\displaystyle Q_{\varepsilon,x} = (\frac{\prod_{i=1}^r k_i!}{(|\vec k|-1)!} + O(\varepsilon) + o(1)) |I| \log^{|\vec k|-1} x.$

Inserting this back into (14) and recalling that ${\varepsilon>0}$ can be made arbitrarily small, we obtain (10).
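The model case ${a_n=1}$ used in the comparison trick can be observed numerically in its simplest instance ${\vec k = (2)}$, where the predicted main term over the full range ${n \leq x}$ is ${2 x \log x}$. The sketch below (illustrative names) computes the ratio of ${\sum_{n \leq x} \Lambda_2(n)}$ to this main term. One should expect the ratio to approach ${1}$ only slowly, since the lower order terms are of relative size about ${1/\log x}$.

```python
import math

def mobius_table(limit):
    """mu[n] for 0 <= n <= limit (with mu[0] set to 0 as a placeholder)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        for m in range(p, limit + 1, p):
            if m > p:
                is_prime[m] = False
            mu[m] = -mu[m]
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu

def summatory_lambda_k(k, x):
    """sum_{n <= x} Lambda_k(n), with Lambda_k(n) = sum_{d | n} mu(d) log(n/d)^k."""
    mu = mobius_table(x)
    log_power = [0.0] + [math.log(n) ** k for n in range(1, x + 1)]
    total = 0.0
    for d in range(1, x + 1):
        if mu[d] == 0:
            continue
        # Contribution of the divisor d to all multiples n = d * q <= x.
        total += mu[d] * sum(log_power[q] for q in range(1, x // d + 1))
    return total

x = 10**5
ratio = summatory_lambda_k(2, x) / (2 * x * math.log(x))
print("ratio to predicted main term:", ratio)
```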

As it turns out, the estimate (13) is easy to establish, but the estimate (12) is not, roughly speaking because the typical number ${n}$ in ${I}$ has too many divisors ${d}$ in the range ${[x^{1-4\varepsilon},x]}$, each of which gives a contribution to the error term. (In the book of Friedlander and Iwaniec, the estimate (12) is established anyway, but only after assuming a stronger version of (iv), roughly speaking in which ${d}$ is allowed to be as large as ${x \exp( -\log^{1/4} x)}$.) To resolve this issue, we will insert a preliminary sieve ${\nu_\varepsilon}$ that will remove most of the potential divisors ${d}$ in the range ${[x^{1-4\varepsilon},x]}$ (leaving only about ${O(1)}$ such divisors on the average for typical ${n}$), making the analogue of (12) easier to prove (at the cost of making the analogue of (13) more difficult). Namely, if one can find a function ${\nu_\varepsilon: {\bf N} \rightarrow {\bf R}}$ for which one has the estimates

$\displaystyle \sum_{n \in I} \Lambda_{\vec k}(n) a_n = \sum_{n \in I} \Lambda_{\vec k}(n) \nu_\varepsilon(n) a_n + O( (\varepsilon+o(1)) C |I| \log^{|\vec k|-1} x ), \ \ \ \ \ (15)$

$\displaystyle \sum_{n \in I} \Lambda_{\vec k}(n) \nu_\varepsilon(n) a_n$

$\displaystyle = \sum_{n \in I} \Lambda_{\vec k,\varepsilon}(n) \nu_\varepsilon(n) a_n + O( (\varepsilon+o(1)) C |I| \log^{|\vec k|-1} x ) \ \ \ \ \ (16)$

and

$\displaystyle \sum_{n \in I} \Lambda_{\vec k,\varepsilon}(n) \nu_\varepsilon(n) a_n = C Q'_{\varepsilon,x} G(1) + o( |I| \log^{|\vec k|-1} x ) \ \ \ \ \ (17)$

for some quantity ${Q'_{\varepsilon,x}}$ that depends on ${\varepsilon, \eta_\varepsilon, \vec k, B, x}$ but not on ${C, g, G}$, then by repeating the previous arguments we will again be able to establish (10).

The key estimate is (16). As we shall see, when comparing ${\Lambda_{\vec k}(n) \nu_\varepsilon(n)}$ with ${\Lambda_{\vec k,\varepsilon}(n) \nu_\varepsilon(n)}$, the weight ${\nu_\varepsilon}$ will cost us a factor of ${1/\varepsilon}$, but the ${\log^{k_r} \frac{x}{d}}$ term in the definitions of ${\Lambda_{\vec k}}$ and ${\Lambda_{\vec k,\varepsilon}}$ will recover a factor of ${\varepsilon^{k_r}}$, which will give the desired bound since we are assuming ${k_r > 1}$.

One has some flexibility in how to select the weight ${\nu_\varepsilon}$: basically any standard sieve that uses divisors of size at most ${x^{2\varepsilon}}$ to localise (at least approximately) to numbers that are rough in the sense that they have no (or at least very few) factors less than ${x^\varepsilon}$, will do. We will use the analytic Selberg sieve choice

$\displaystyle \nu_\varepsilon(n) := (\sum_{d|n} \mu(d) \psi( \frac{\log d}{\varepsilon \log x} ))^2 \ \ \ \ \ (18)$

where ${\psi: {\bf R} \rightarrow [0,1]}$ is a smooth function supported on ${[-1,1]}$ that equals ${1}$ on ${[-1/2,1/2]}$.
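To illustrate the definition (18), here is a small sketch of the weight ${\nu_\varepsilon}$ for one particular choice of cutoff ${\psi}$. The specific ${\psi}$ below (built from the standard smooth step based on ${e^{-1/t}}$) is an assumption made for illustration; the text only requires smoothness, support in ${[-1,1]}$, and ${\psi = 1}$ on ${[-1/2,1/2]}$. The function names and the parameters ${x = 10^6}$, ${\varepsilon = 1/4}$ are likewise arbitrary.

```python
import itertools
import math

def smooth_step(t):
    """Smooth function equal to 0 for t <= 0 and to 1 for t >= 1."""
    def f(s):
        return math.exp(-1.0 / s) if s > 0 else 0.0
    a, b = f(t), f(1.0 - t)
    return a / (a + b)  # a + b > 0 for every real t

def psi(u):
    """One admissible cutoff: 1 on [-1/2, 1/2], 0 outside (-1, 1), smooth in between."""
    return 1.0 - smooth_step(2.0 * abs(u) - 1.0)

def distinct_prime_factors(n):
    """Sorted list of the distinct primes dividing n."""
    primes, p = [], 2
    while p * p <= n:
        if n % p == 0:
            primes.append(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        primes.append(n)
    return primes

def selberg_weight(n, x, eps):
    """nu_eps(n) = (sum_{d | n} mu(d) psi(log d / (eps log x)))^2, as in (18)."""
    scale = eps * math.log(x)
    primes = distinct_prime_factors(n)
    total = 0.0
    # Only squarefree d contribute, and mu(d) = (-1)^(number of prime factors of d).
    for r in range(len(primes) + 1):
        for subset in itertools.combinations(primes, r):
            d = math.prod(subset)
            total += (-1) ** r * psi(math.log(d) / scale)
    return total ** 2

x, eps = 10**6, 0.25  # so x^eps is about 31.6
for n in (2, 6, 37, 74, 41 * 43, 2 * 3 * 5 * 7):
    print(n, selberg_weight(n, x, eps))
```

With these parameters, any ${n}$ whose prime factors all exceed ${x^\varepsilon}$ receives weight exactly ${1}$ (only ${d=1}$ survives the cutoff), while numbers divisible by a prime below ${x^{\varepsilon/2}}$ such as ${2}$ have the ${d}$ and ${2d}$ terms cancelling wherever ${\psi = 1}$.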

It remains to establish the bounds (15), (16), (17). To warm up and introduce the various methods needed, we begin with the standard bound

$\displaystyle \sum_{n \in I} \nu_\varepsilon(n) a_n = \frac{C|I|}{\varepsilon \log x} ((\int_0^1 \psi'(u)^2\ du) G(1) + o(1)), \ \ \ \ \ (19)$

where ${\psi'}$ denotes the derivative of ${\psi}$. Note the loss of ${1/\varepsilon}$ that had previously been pointed out. In the arguments that follow I will be a little brief with the details, as they are standard (see e.g. this previous post).

We now prove (19). The left-hand side can be expanded as

$\displaystyle \sum_{d_1,d_2} \mu(d_1) \mu(d_2) \psi( \frac{\log d_1}{\varepsilon \log x} ) \psi( \frac{\log d_2}{\varepsilon \log x} ) \sum_{n \in I: [d_1,d_2]|n} a_n$

where ${[d_1,d_2]}$ denotes the least common multiple of ${d_1}$ and ${d_2}$. From the support of ${\psi}$ we see that the summand is only non-vanishing when ${[d_1,d_2] \leq x^{2\varepsilon}}$. We now use axiom (iv) and split the left-hand side into a main term

$\displaystyle \sum_{d_1,d_2} \mu(d_1) \mu(d_2) \psi( \frac{\log d_1}{\varepsilon \log x} ) \psi( \frac{\log d_2}{\varepsilon \log x} ) \frac{g([d_1,d_2])}{[d_1,d_2]} C |I|$

and an error term that is at most

$\displaystyle O_\varepsilon( \sum_{d \leq x^{2\varepsilon}} \tau(d)^{O(1)} | \sum_{n \in I: d|n} a_n - \frac{g(d)}{d} C |I|| ). \ \ \ \ \ (20)$

From axiom (ii) and elementary multiplicative number theory, we have the bound

$\displaystyle \sum_{d \leq x} \tau(d)^{O(1)} | \sum_{n \in I: d|n} a_n - \frac{g(d)}{d} C |I| | \ll C |I| \log^{O(1)} x$

so from axiom (iv) and Cauchy-Schwarz we see that the error term (20) is acceptable. Thus it will suffice to establish the bound

$\displaystyle \sum_{d_1,d_2} \mu(d_1) \mu(d_2) \psi( \frac{\log d_1}{\varepsilon \log x} ) \psi( \frac{\log d_2}{\varepsilon \log x} ) \frac{g([d_1,d_2])}{[d_1,d_2]}$

$\displaystyle = \frac{1}{\varepsilon \log x} (\int_0^1 \psi'(u)^2\ du) G(1) + o(\frac{1}{\log x}). \ \ \ \ \ (21)$

The summand here is almost, but not quite, multiplicative in ${d_1,d_2}$. To make it genuinely multiplicative, we perform a (shifted) Fourier expansion

$\displaystyle \psi(u) = \int_{\bf R} e^{-(1+it)u} \Psi(t)\ dt \ \ \ \ \ (22)$

for some rapidly decreasing function ${\Psi}$ (essentially the Fourier transform of ${e^u \psi(u)}$). Thus

$\displaystyle \psi( \frac{\log d}{\varepsilon \log x} ) = \int_{\bf R} \frac{1}{d^{\frac{1+it}{\varepsilon \log x}}} \Psi(t)\ dt,$

and so the left-hand side of (21) can be rearranged using Fubini’s theorem as

$\displaystyle \int_{\bf R} \int_{\bf R} E(\frac{1+it_1}{\varepsilon \log x},\frac{1+it_2}{\varepsilon \log x})\ \Psi(t_1) \Psi(t_2) dt_1 dt_2 \ \ \ \ \ (23)$

where

$\displaystyle E(s_1,s_2) := \sum_{d_1,d_2} \frac{\mu(d_1) \mu(d_2)}{d_1^{s_1}d_2^{s_2}} \frac{g([d_1,d_2])}{[d_1,d_2]}.$

We can factorise ${E(s_1,s_2)}$ as an Euler product:

$\displaystyle E(s_1,s_2) = \prod_p (1 - \frac{g(p)}{p^{1+s_1}} - \frac{g(p)}{p^{1+s_2}} + \frac{g(p)}{p^{1+s_1+s_2}}).$

Taking absolute values and using Mertens’ theorem leads to the crude bound

$\displaystyle E(\frac{1+it_1}{\varepsilon \log x},\frac{1+it_2}{\varepsilon \log x}) \ll_\varepsilon \log^{O(1)} x$

which when combined with the rapid decrease of ${\Psi}$, allows us to restrict the region of integration in (23) to the square ${\{ |t_1|, |t_2| \leq \sqrt{\log x} \}}$ (say) with negligible error. Next, we use the Euler product

$\displaystyle \zeta(s) = \prod_p (1-\frac{1}{p^s})^{-1}$

for ${\hbox{Re} s > 1}$ to factorise

$\displaystyle E(s_1,s_2) = \frac{\zeta(1+s_1+s_2)}{\zeta(1+s_1) \zeta(1+s_2)} \prod_p E_p(s_1,s_2)$

where

$\displaystyle E_p(s_1,s_2) := \frac{(1 - \frac{g(p)}{p^{1+s_1}} - \frac{g(p)}{p^{1+s_2}} + \frac{g(p)}{p^{1+s_1+s_2}})(1 - \frac{1}{p^{1+s_1+s_2}})}{(1-\frac{1}{p^{1+s_1}})(1-\frac{1}{p^{1+s_2}})}.$

For ${s_1,s_2=o(1)}$ with nonnegative real part, one has

$\displaystyle E_p(s_1,s_2) = 1 + O(1/p^2)$

and so by the Weierstrass ${M}$-test, ${\prod_p E_p(s_1,s_2)}$ is continuous at ${s_1=s_2=0}$. Since

$\displaystyle \prod_p E_p(0,0) = G(1)$

we thus have

$\displaystyle \prod_p E_p(s_1,s_2) = G(1) + o(1)$

Also, since ${\zeta}$ has a pole of order ${1}$ at ${s=1}$ with residue ${1}$, we have

$\displaystyle \frac{\zeta(1+s_1+s_2)}{\zeta(1+s_1) \zeta(1+s_2)} = (1+o(1)) \frac{s_1 s_2}{s_1+s_2}$

and thus

$\displaystyle E(s_1,s_2) = (G(1)+o(1)) \frac{s_1s_2}{s_1+s_2}.$

The quantity (23) can thus be written, up to errors of ${o(\frac{1}{\log x})}$, as

$\displaystyle \frac{G(1)}{\varepsilon \log x} \int_{|t_1|, |t_2| \leq \sqrt{\log x}} \frac{(1+it_1)(1+it_2)}{1+it_1+1+it_2} \Psi(t_1) \Psi(t_2)\ dt_1 dt_2.$

Using the rapid decrease of ${\Psi}$, we may remove the restriction on ${t_1,t_2}$, and it will now suffice to prove the identity

$\displaystyle \int_{\bf R} \int_{\bf R} \frac{(1+it_1)(1+it_2)}{1+it_1+1+it_2} \Psi(t_1) \Psi(t_2)\ dt_1 dt_2 = \int_0^1 \psi'(u)^2\ du.$

But on differentiating and then squaring (22) we have

$\displaystyle \psi'(u)^2 = \int_{\bf R} \int_{\bf R} (1+it_1)(1+it_2) e^{-(1+it_1+1+it_2)u}\Psi(t_1) \Psi(t_2)\ dt_1 dt_2$

and the claim follows by integrating in ${u}$ from zero to infinity (noting that ${\psi'}$ vanishes for ${u>1}$).

We have the following variant of (19):

Lemma 3 For any ${d \leq x^{1-3\varepsilon}}$, one has

$\displaystyle \sum_{n \in I: d|n} \nu_\varepsilon(n) a_n \ll \frac{C|I|}{\varepsilon \log x} \frac{\prod_{p|d} O( \min( \frac{\log p}{\varepsilon \log x}, 1 )^2 )}{d} + R_d \ \ \ \ \ (24)$

where the ${R_d}$ are such that

$\displaystyle \sum_{d \leq x^{1-3\varepsilon}} R_d \ll_A |I| \log^{-A} x \ \ \ \ \ (25)$

for any ${A>0}$. We also have the variant

$\displaystyle \sum_{n \in I: d|n} \nu_\varepsilon(n/d) a_n \ll \frac{C|I|}{\varepsilon \log x} \frac{\prod_{p|d} O(1)}{d} + R_d. \ \ \ \ \ (26)$

If in addition ${d}$ has no prime factors less than ${x^\delta}$ for some fixed ${\delta>0}$, one has

$\displaystyle \sum_{n \in I: d|n} \nu_\varepsilon(n) a_n$

$\displaystyle = \frac{1+o(1)}{d} \frac{C|I|}{\varepsilon \log x} (\int_0^1 \psi'(u)^2\ du) G(1) + O(R_d). \ \ \ \ \ (27)$

Roughly speaking, the above estimates assert that ${\nu_\varepsilon}$ is concentrated on those numbers ${n}$ with no prime factors much less than ${x^\varepsilon}$, but factors ${d}$ without such small prime divisors occur with about the same relative density as they do in the integers.

Proof: The left-hand side of (24) can be expanded as

$\displaystyle \sum_{d_1,d_2} \mu(d_1) \mu(d_2) \psi( \frac{\log d_1}{\varepsilon \log x} ) \psi( \frac{\log d_2}{\varepsilon \log x} ) \sum_{n \in I: [d_1,d_2,d]|n} a_n.$

If we define

$\displaystyle R_d := \sum_{d' \leq x^{1-\varepsilon}: d|d'} \tau(d')^2 |\sum_{n \in I:d'|n} a_n - \frac{g(d')}{d'} C|I||$

then the previous expression can be written as

$\displaystyle \sum_{d_1,d_2} \mu(d_1) \mu(d_2) \psi( \frac{\log d_1}{\varepsilon \log x} ) \psi( \frac{\log d_2}{\varepsilon \log x} ) \frac{g([d_1,d_2,d])}{[d_1,d_2,d]} C|I| + O(R_d),$

while one has

$\displaystyle \sum_{d \leq x^{1-3\varepsilon}} R_d \leq \sum_{d' \leq x^{1-\varepsilon}} \tau(d')^3 |\sum_{n \in I:d'|n} a_n - \frac{g(d')}{d'} C|I||$

which gives (25) from Axiom (iv). To prove (24), it now suffices to show that

$\displaystyle \sum_{d_1,d_2} \mu(d_1) \mu(d_2) \psi( \frac{\log d_1}{\varepsilon \log x} ) \psi( \frac{\log d_2}{\varepsilon \log x} ) \frac{g([d_1,d_2,d])}{[d_1,d_2,d]}$

$\displaystyle \ll \frac{1}{\varepsilon \log x} \frac{\prod_{p|d} O( \min( \frac{\log p}{\varepsilon \log x}, 1 )^2 )}{d}. \ \ \ \ \ (28)$

Arguing as before, the left-hand side is

$\displaystyle \int_{\bf R} \int_{\bf R} E^{(d)}(\frac{1+it_1}{\varepsilon \log x},\frac{1+it_2}{\varepsilon \log x})\ \Psi(t_1) \Psi(t_2) dt_1 dt_2$

where

$\displaystyle E^{(d)}(s_1,s_2) := \sum_{d_1,d_2} \frac{\mu(d_1) \mu(d_2)}{d_1^{s_1}d_2^{s_2}} \frac{g([d_1,d_2,d])}{[d_1,d_2,d]}.$

From Mertens’ theorem we have

$\displaystyle E^{(d)}(s_1,s_2) \ll_\varepsilon \frac{\prod_{p|d} O(1)}{d} \log^{O(1)} x$

when ${\hbox{Re} s_1, \hbox{Re} s_2 = \frac{1}{\varepsilon \log x}}$, so the contribution of the terms where ${|t_1|, |t_2| \geq \sqrt{\log x}}$ can be absorbed into the ${R_d}$ error (after increasing that error slightly). For the remaining contributions, we see that

$\displaystyle E^{(d)}(s_1,s_2) = \frac{\zeta(1+s_1+s_2)}{\zeta(1+s_1) \zeta(1+s_2)} \prod_p E^{(d)}_p(s_1,s_2)$

where ${E^{(d)}_p(s_1,s_2) = E_p(s_1,s_2)}$ if ${p}$ does not divide ${d}$, and

$\displaystyle E^{(d)}_p(s_1,s_2) = \frac{g(p^j)}{p^j} \frac{(1 - \frac{1}{p^{s_1}}) (1 - \frac{1}{p^{s_2}}) (1 - \frac{1}{p^{1+s_1+s_2}})}{(1-\frac{1}{p^{1+s_1}})(1-\frac{1}{p^{1+s_2}})}$

if ${p}$ divides ${d}$ ${j}$ times for some ${j \geq 1}$. In the latter case, Taylor expansion gives the bounds

$\displaystyle |E^{(d)}_p(\frac{1+it_1}{\varepsilon \log x},\frac{1+it_2}{\varepsilon \log x})| \lesssim (1+|t_1|+|t_2|)^{O(1)} \frac{\min( \frac{\log p}{\varepsilon \log x}, 1 )^2}{p}$

and the claim (28) follows. When ${p \geq x^\delta}$ and ${|t_1|, |t_2| \leq \sqrt{\log x}}$ we have

$\displaystyle E^{(d)}_p(\frac{1+it_1}{\varepsilon \log x},\frac{1+it_2}{\varepsilon \log x}) = \frac{1+o(1)}{p^j}$

and (27) follows by repeating the previous calculations. Finally, (26) is proven similarly to (24) (using ${d[d_1,d_2]}$ in place of ${[d_1,d_2,d]}$). $\Box$

Now we can prove (15), (16), (17). We begin with (15). Using the Leibniz rule ${L(f*g) = (Lf)*g + f*(Lg)}$ applied to the identity ${\mu = \mu * 1 * \mu}$ and using ${\Lambda = \mu*L}$ and Möbius inversion (and the associativity and commutativity of Dirichlet convolution) we see that

$\displaystyle L\mu = - \mu * \Lambda. \ \ \ \ \ (29)$

Next, by applying the Leibniz rule to ${\Lambda_k = \mu * L^k}$ for some ${k \geq 1}$ and using (29) we see that

$\displaystyle L \Lambda_k = L \mu * L^k + \mu * L^{k+1}$

$\displaystyle = - \mu * \Lambda * L^k + \Lambda_{k+1}$

and hence we have the recursive identity

$\displaystyle \Lambda_{k+1} = L \Lambda_k + \Lambda *\Lambda_k. \ \ \ \ \ (30)$
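The identities (29) and (30) are easy to test numerically for small ${n}$. The following Python sketch does this by brute force, taking ${L}$ to be multiplication by ${\log n}$ and ${\Lambda_k = \mu * L^k}$ as above; the helper names (`dirichlet`, `gen_mangoldt`, `check_identities`) and the cutoffs are illustrative choices rather than part of the argument.

```python
import math

def mobius(n):
    """Moebius function mu(n), by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0          # n has a square factor
            result = -result
        p += 1
    if n > 1:
        result = -result          # one remaining prime factor
    return result

def von_mangoldt(n):
    """Lambda(n) = log p if n is a power of a prime p, and 0 otherwise."""
    if n < 2:
        return 0.0
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n)            # n itself is prime

def dirichlet(f, g, n):
    """Dirichlet convolution (f*g)(n) = sum over d | n of f(d) g(n/d)."""
    return sum(f(d) * g(n // d) for d in range(1, n + 1) if n % d == 0)

def gen_mangoldt(k, n):
    """Generalised von Mangoldt function Lambda_k = mu * L^k."""
    return dirichlet(mobius, lambda m: math.log(m) ** k, n)

def check_identities(limit=200, kmax=3, tol=1e-8):
    """Test (29) and (30) for all n <= limit and 1 <= k <= kmax."""
    for n in range(1, limit + 1):
        log_n = math.log(n)
        # (29): L mu = - mu * Lambda
        if abs(log_n * mobius(n) + dirichlet(mobius, von_mangoldt, n)) > tol:
            return False
        for k in range(1, kmax + 1):
            # (30): Lambda_{k+1} = L Lambda_k + Lambda * Lambda_k
            lhs = gen_mangoldt(k + 1, n)
            rhs = log_n * gen_mangoldt(k, n) + dirichlet(
                von_mangoldt, lambda m: gen_mangoldt(k, m), n)
            if abs(lhs - rhs) > tol:
                return False
    return True
```

One can also use `gen_mangoldt` to observe the support property numerically: for instance `gen_mangoldt(2, 30)` vanishes up to rounding error, since ${30}$ has three distinct prime factors.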

In particular, from induction we see that ${\Lambda_k}$ is supported on numbers with at most ${k}$ distinct prime factors, and hence ${\Lambda_{\vec k}}$ is supported on numbers with at most ${|\vec k|}$ distinct prime factors. In particular, from (18) we see that ${\nu_\varepsilon(n) = O(1)}$ on the support of ${\Lambda_{\vec k}}$. Thus it will suffice to show that

$\displaystyle \sum_{n \in I: \nu_\varepsilon(n) \neq 1} \Lambda_{\vec k}(n) a_n \ll (\varepsilon+o(1)) C |I| \log^{|\vec k|-1} x.$

If ${\nu_\varepsilon(n) \neq 1}$ and ${\Lambda_{\vec k}(n) \neq 0}$, then ${n}$ has at most ${|\vec k|}$ distinct prime factors ${p_1 < p_2 < \dots < p_r}$, with ${p_1 \leq x^\varepsilon}$. If we factor ${n = n_1 n_2}$, where ${n_1}$ is the contribution of those ${p_i}$ with ${p_i \leq x^{1/10|\vec k|}}$, and ${n_2}$ is the contribution of those ${p_i}$ with ${p_i > x^{1/10|\vec k|}}$, then at least one of the following two statements holds:

• (a) ${n_1}$ (and hence ${n}$) is divisible by a square number of size at least ${x^{1/10}}$.
• (b) ${n_1 \leq x^{1/5}}$.

The contribution of case (a) is easily seen to be acceptable by axiom (ii). For case (b), we observe from (30) and induction that

$\displaystyle \Lambda_k(n) \ll \log^{|\vec k|} x \prod_{j=1}^k \frac{\log p_j}{\log x}$

and so it will suffice to show that

$\displaystyle \sum_{n_1} (\prod_{p|n_1} \frac{\log p}{\log x}) \sum_{n \in I: n_1 | n} 1_R(n/n_1) a_n \ll (\varepsilon + o(1)) C |I| \log^{-1} x$

where ${n_1}$ ranges over numbers bounded by ${x^{1/5}}$ with at most ${|\vec k|}$ distinct prime factors, the smallest of which is at most ${x^\varepsilon}$, and ${R}$ consists of those numbers with no prime factor less than or equal to ${x^{1/10|\vec k|}}$. Applying (26) (with ${\varepsilon}$ replaced by ${1/10|\vec k|}$) gives the bound

$\displaystyle \sum_{n \in I: n_1|n} 1_R(n/n_1) a_n \ll \frac{C|I|}{\log x} \frac{1}{n_1} + R_{n_1}$

so by (25) it suffices to show that

$\displaystyle \sum_{n_1} (\prod_{p|n_1} \frac{\log p}{\log x}) \frac{1}{n_1} \ll \varepsilon$

subject to the same constraints on ${n_1}$ as before. The contribution of those ${n_1}$ with ${r}$ distinct prime factors can be bounded by

$\displaystyle O(\sum_{p_1 \leq x^\varepsilon} \frac{\log p_1}{p_1 \log x}) \times O(\sum_{p \leq x^{1/5}} \frac{\log p}{p\log x})^{r-1};$

applying Mertens’ theorem and summing over ${1 \leq r \leq |\vec k|}$, one obtains the claim.

Now we show (16). As discussed previously in this section, we can replace ${\Lambda_{\vec k}(n)}$ by ${\sum_{d|n} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d}}$ with negligible error. Comparing this with (16) and (11), we see that it suffices to show that

$\displaystyle \sum_{n \in I} \sum_{d|n} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d} (1 - \eta_\varepsilon(\frac{\log d}{\log x})) \nu_\varepsilon(n) a_n \ll (\varepsilon+o(1)) C |I| \log^{|\vec k|-1} x.$

From the support of ${\eta_\varepsilon}$, the summand on the left-hand side is only non-zero when ${d \geq x^{1-4\varepsilon}}$, which makes ${\log^{k_r} \frac{x}{d} \ll \varepsilon^{k_r} \log^{k_r} x \leq \varepsilon^2 \log^{k_r} x}$, where we use the crucial hypothesis ${k_r > 1}$ to gain enough powers of ${\varepsilon}$ to make the argument here work. Applying Lemma 2, we reduce to showing that

$\displaystyle \sum_{n \in I} \sum_{d|n: d \geq x^{1-4\varepsilon}} \nu_\varepsilon(n) a_n \ll \frac{1+o(1)}{\varepsilon \log x} C |I|.$

We can make the change of variables ${d \mapsto n/d}$ to flip the sum

$\displaystyle \sum_{d|n: d \geq x^{1-4\varepsilon}} 1 \leq \sum_{d|n: d \leq x^{4\varepsilon}} 1$

and then swap the sums to reduce to showing that

$\displaystyle \sum_{d \leq x^{4\varepsilon}} \sum_{n \in I: d|n} \nu_\varepsilon(n) a_n \ll \frac{1+o(1)}{\varepsilon \log x} C |I|.$

By Lemma 3, it suffices to show that

$\displaystyle \sum_{d \leq x^{4\varepsilon}} \frac{\prod_{p|d} O( \min( \frac{\log p}{\varepsilon \log x}, 1 )^2 )}{d} \ll 1.$

To prove this, we use the Rankin trick, bounding the implied weight ${1_{d \leq x^{4\varepsilon}}}$ by ${O( \frac{1}{d^{1/\varepsilon \log x}} )}$. We can then bound the left-hand side by the Euler product

$\displaystyle \prod_p (1 + O( \frac{\min( \frac{\log p}{\varepsilon \log x}, 1 )^2}{p^{1+1/\varepsilon \log x}} ))$

which can be bounded by

$\displaystyle \exp( O( \sum_p \frac{\min( \frac{\log p}{\varepsilon \log x}, 1 )^2}{p^{1+1/\varepsilon \log x}} ) )$

and the claim follows from Mertens’ theorem.

Finally, we show (17). By (11), the left-hand side expands as

$\displaystyle \sum_{d \leq x^{1-3\varepsilon}} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d} \eta_\varepsilon(\frac{\log d}{\log x}) \sum_{n \in I: d|n} \nu_\varepsilon(n) a_n.$

We let ${\delta>0}$ be a small constant to be chosen later. We divide the outer sum into two ranges, depending on whether ${d}$ only has prime factors greater than ${x^\delta}$ or not. In the former case, we can apply (27) to write this contribution as

$\displaystyle \sum_{d \leq x^{1-3\varepsilon}} \mu_{\vec k}(d) \log^{k_r} \frac{x}{d} \eta_\varepsilon(\frac{\log d}{\log x}) \frac{1+o(1)}{d} \frac{C|I|}{\varepsilon \log x} (\int_0^1 \psi'(u)^2\ du) G(1)$

plus a negligible error, where the ${d}$ is implicitly restricted to numbers with all prime factors greater than ${x^\delta}$. The main term is messy, but it is of the required form ${C Q'_{\varepsilon,x} G(1)}$ up to an acceptable error, so there is no need to compute it any further. It remains to consider those ${d}$ that have at least one prime factor less than ${x^\delta}$. Here we use (24) instead of (27) as well as Lemma 2 to dominate this contribution by

$\displaystyle \sum_{d \leq x^{1-3\varepsilon}} O( \log^{|\vec k|} x \frac{C|I|}{\varepsilon \log x} \frac{\prod_{p|d} O( \min( \frac{\log p}{\varepsilon \log x}, 1 )^2 )}{d} )$

up to negligible errors, where ${d}$ is now restricted to have at least one prime factor less than ${x^\delta}$. This forces at least one of the factors ${\min( \frac{\log p}{\varepsilon \log x}, 1 )}$ to be at most ${O_\varepsilon(\delta)}$. A routine application of Rankin’s trick shows that

$\displaystyle \sum_{d \leq x^{1-3\varepsilon}} \frac{\prod_{p|d} O( \min( \frac{\log p}{\varepsilon \log x}, 1 ) )}{d} \ll_\varepsilon 1$

and so the total contribution of this case is ${O_\varepsilon((\delta+o(1)) |I| \log^{|\vec k|-1} x)}$. Since ${\delta>0}$ can be made arbitrarily small, (17) follows.

— 2. Weierstrass approximation —

Having proved Theorem 1, we now take linear combinations of this theorem, combined with the Weierstrass approximation theorem, to give the asymptotics (7), (8) described in the introduction.

Let ${a_n}$, ${g}$, ${C}$, ${G}$ be as in that theorem. It will be convenient to normalise the weights ${\Lambda_{\vec k}}$ by ${L^{1-|\vec k|}}$ to make their mean value comparable to ${1}$. From Theorem 1 and summation by parts we have

$\displaystyle \sum_{n \leq x} L^{1-|\vec k|} \Lambda_{\vec k}(n) a_n = (G(1)+o(1)) \frac{\prod_{i=1}^r k_i!}{(|\vec k|-1)!} C x \ \ \ \ \ (31)$

whenever ${\vec k}$ does not consist entirely of ones.

We now take a closer look at what happens when ${\vec k}$ does consist entirely of ones. Let ${1^r}$ denote the ${r}$-tuple ${(1,\dots,1)}$. Convolving the ${k=1}$ case of (30) with ${r-1}$ copies of ${\Lambda}$ for some ${r \geq 1}$ and using the Leibniz rule, we see that

$\displaystyle \Lambda_{(1^{r-1}, 2)} = \frac{1}{r} L \Lambda_{1^r} + \Lambda_{1^{r+1}}$

and hence

$\displaystyle L^{-r} \Lambda_{1^{r+1}} = L^{-r} \Lambda_{(1^{r-1},2)} - \frac{1}{r} L^{1-r} \Lambda_{1^r}.$

Multiplying by ${a_n}$ and summing over ${n \leq x}$, and using (31) to control the ${\Lambda_{(1^{r-1},2)}}$ term, one has

$\displaystyle \sum_{n \leq x} L^{-r} \Lambda_{1^{r+1}}(n) a_n = (G(1)+o(1)) \frac{2}{r!} C x - \frac{1}{r} \sum_{n \leq x} L^{1-r} \Lambda_{1^{r}}(n) a_n.$

If we define ${\delta_x}$ (up to an error of ${o(1)}$) by the formula

$\displaystyle \sum_{n \leq x} \Lambda(n) a_n = (\delta_x G(1) + o(1)) C x$

then an induction shows that

$\displaystyle \sum_{n \leq x} L^{1-r} \Lambda_{1^r}(n) a_n = \frac{1}{(r-1)!} (\delta_x G(1) + o(1)) C x$

for odd ${r}$, and

$\displaystyle \sum_{n \leq x} L^{1-r} \Lambda_{1^r}(n) a_n = \frac{1}{(r-1)!} ((2-\delta_x) G(1) + o(1)) C x$

for even ${r}$. In particular, after adjusting ${\delta_x}$ by ${o(1)}$ if necessary, we have ${0 \leq \delta_x \leq 2}$ since the left-hand sides are non-negative.
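The induction here is a short bookkeeping exercise. Writing ${c_r}$ for the coefficient of ${G(1) C x}$ in ${\sum_{n \leq x} L^{1-r} \Lambda_{1^r}(n) a_n}$ (ignoring ${o(1)}$ errors), the recursion above reads ${c_{r+1} = \frac{2}{r!} - \frac{1}{r} c_r}$ with ${c_1 = \delta_x}$. The following Python sketch iterates this in exact arithmetic, storing ${c_r = a + b \delta_x}$ as a pair ${(a,b)}$; the function name `coefficients` is just an illustrative choice.

```python
from fractions import Fraction
from math import factorial

def coefficients(rmax):
    """Return {r: (a, b)} with c_r = a + b*delta, obtained from c_1 = delta
    and the recursion c_{r+1} = 2/r! - c_r / r."""
    a, b = Fraction(0), Fraction(1)          # c_1 = 0 + 1*delta
    out = {1: (a, b)}
    for r in range(1, rmax):
        a, b = Fraction(2, factorial(r)) - a / r, -b / r
        out[r + 1] = (a, b)
    return out

def expected(r):
    """Closed form: delta/(r-1)! for odd r, (2 - delta)/(r-1)! for even r."""
    f = Fraction(1, factorial(r - 1))
    return (Fraction(0), f) if r % 2 == 1 else (2 * f, -f)
```

Comparing `coefficients(rmax)[r]` with `expected(r)` for each ${r}$ confirms the alternation between ${\delta_x}$ and ${2-\delta_x}$.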

If we now define the comparison sequence ${b_n := C G(1) (1 + (1-\delta_x) \mu(n))}$, standard multiplicative number theory shows that the above estimates also hold when ${a_n}$ is replaced by ${b_n}$; thus

$\displaystyle \sum_{n \leq x} L^{1-r} \Lambda_{1^r}(n) a_n = \sum_{n \leq x} L^{1-r} \Lambda_{1^r}(n) b_n + o( x )$

for both odd and even ${r}$. The bound (31) also holds for ${b_n}$ when ${\vec k}$ does not consist entirely of ones, and hence

$\displaystyle \sum_{n \leq x} L^{1-|\vec k|} \Lambda_{\vec k}(n) a_n = \sum_{n \leq x} L^{1-|\vec k|} \Lambda_{\vec k}(n) b_n + o( x )$

for any fixed ${\vec k}$ (which may or may not consist entirely of ones).

Next, from induction (on ${j_1+\dots+j_r}$), the Leibniz rule, and (30), we see that for any ${r \geq 1}$ and ${j_1,\dots,j_r \geq 0}$, ${k_1,\dots,k_r}$, the function

$\displaystyle L^{1-j_1-\dots-j_r-|\vec k|} ((L^{j_1} \Lambda_{k_1}) * \dots * (L^{j_r} \Lambda_{k_r})) \ \ \ \ \ (32)$

is a finite linear combination of functions of the form ${L^{1-|\vec k'|} \Lambda_{\vec k'}}$ for tuples ${\vec k'}$ that may possibly consist entirely of ones. We thus have

$\displaystyle \sum_{n \leq x} f(n) a_n = \sum_{n \leq x}f(n) b_n + o( x )$

whenever ${f}$ is one of these functions (32). Specialising to the case ${k_1=\dots=k_r=1}$, we thus have

$\displaystyle \sum_{n_1 \dots n_r \leq x} a_{n} \log^{1-r} n \prod_{i=1}^r (\log n_i/\log n)^{j_i} \Lambda(n_i)$

$\displaystyle = \sum_{n_1 \dots n_r \leq x} b_{n} \log^{1-r} n \prod_{i=1}^r (\log n_i/\log n)^{j_i} \Lambda(n_i) + o(x )$

where ${n := n_1 \dots n_r}$. The contribution of those ${n_i}$ that are powers of primes can be easily seen to be negligible, leading to

$\displaystyle \sum_{p_1 \dots p_r \leq x} a_{n} \log n \prod_{i=1}^r (\log p_i/\log n)^{j_i+1}$

$\displaystyle = \sum_{p_1 \dots p_r \leq x} b_{n} \log n \prod_{i=1}^r (\log p_i/\log n)^{j_i+1} + o(x)$

where now ${n := p_1 \dots p_r}$. The contribution of the case where two of the primes ${p_i}$ agree can also be seen to be negligible, as can the error when replacing ${\log n}$ with ${\log x}$, and then by symmetry

$\displaystyle \sum_{p_1 \dots p_r \leq x: p_1 < \dots < p_r} a_{n} \prod_{i=1}^r (\log p_i/\log n)^{j_i+1}$

$\displaystyle = \sum_{p_1 \dots p_r \leq x: p_1 < \dots < p_r} b_{n} \prod_{i=1}^r (\log p_i/\log n)^{j_i+1} + o(x / \log x).$

By linearity, this implies that

$\displaystyle \sum_{p_1 \dots p_r \leq x: p_1 < \dots < p_r} a_{n} P( \log p_1/\log n, \dots, \log p_r/\log n)$

$\displaystyle = \sum_{p_1 \dots p_r \leq x: p_1 < \dots < p_r} b_{n} P( \log p_1/\log n, \dots, \log p_r/\log n) + o(x / \log x)$

for any polynomial ${P(t_1,\dots,t_r)}$ that vanishes on the coordinate hyperplanes ${t_i=0}$. The right-hand side can also be evaluated by Mertens’ theorem as

$\displaystyle CG(1) \delta_x \int_{\Delta_r} P x + o(x)$

when ${r}$ is odd and

$\displaystyle CG(1) (2-\delta_x) \int_{\Delta_r} P x + o(x)$

when ${r}$ is even. Using the Weierstrass approximation theorem, we then have

$\displaystyle \sum_{p_1 \dots p_r \leq x: p_1 < \dots < p_r} a_{n} g_r( \log p_1/\log n, \dots, \log p_r/\log n)$

$\displaystyle = \sum_{p_1 \dots p_r \leq x: p_1 < \dots < p_r} b_{n} g_r( \log p_1/\log n, \dots, \log p_r/\log n) + o(x / \log x)$

for any continuous function ${g_r}$ that is compactly supported in the interior of ${\Delta_r}$. Computing the right-hand side using Mertens’ theorem as before, we obtain the claimed asymptotics (7), (8).

Remark 4 The Bombieri asymptotic sieve has to use the full power of EH (or GEH); there are constructions due to Ford that show that if one only has a distributional hypothesis up to ${x^{1-c}}$ for some fixed constant ${c>0}$, then the asymptotics of sums such as (5), or more generally (9), are not determined by a single scalar parameter ${\delta_x}$, but can also vary in other ways as well. Thus the Bombieri asymptotic sieve really is asymptotic; in order to get ${o(1)}$ type error terms one needs the level ${1-\varepsilon}$ of distribution to be asymptotically equal to ${1}$ as ${x \rightarrow \infty}$. Related to this, the quantitative decay of the ${o(1)}$ error terms in the Bombieri asymptotic sieve is extremely poor; in particular, it depends on the dependence of the implied constant in axiom (iv) on the parameters ${\varepsilon,A}$, for which there is no consensus on what one should conjecturally expect.

A capset in the vector space ${{\bf F}_3^n}$ over the finite field ${{\bf F}_3}$ of three elements is a subset ${A}$ of ${{\bf F}_3^n}$ that does not contain any lines ${\{ x,x+r,x+2r\}}$, where ${x,r \in {\bf F}_3^n}$ and ${r \neq 0}$. A basic problem in additive combinatorics (discussed in one of the very first posts on this blog) is to obtain good upper and lower bounds for the maximal size of a capset in ${{\bf F}_3^n}$.

Trivially, one has ${|A| \leq 3^n}$. Using Fourier methods (and the density increment argument of Roth), the bound of ${|A| \leq O( 3^n / n )}$ was obtained by Meshulam, and improved only as late as 2012 to ${O( 3^n /n^{1+c})}$ for some absolute constant ${c>0}$ by Bateman and Katz. But in a very recent breakthrough, Ellenberg (and independently Gijswijt) obtained the exponentially superior bound ${|A| \leq O( 2.756^n )}$, using a version of the polynomial method recently introduced by Croot, Lev, and Pach. (In the converse direction, a construction of Edel gives capsets as large as ${(2.2174)^n}$.) Given the success of the polynomial method in superficially similar problems such as the finite field Kakeya problem (discussed in this previous post), it was natural to wonder whether this method could be applicable to the cap set problem (see for instance this MathOverflow comment of mine on this from 2010), but it took a surprisingly long time before Croot, Lev, and Pach were able to identify the precise variant of the polynomial method that would actually work here.

The proof of the capset bound is very short (Ellenberg’s and Gijswijt’s preprints are both 3 pages long, and Croot-Lev-Pach is 6 pages), but I thought I would present a slight reformulation of the argument which treats the three points on a line in ${{\bf F}_3}$ symmetrically (as opposed to treating the third point differently from the first two, as is done in the Ellenberg and Gijswijt papers; Croot-Lev-Pach also treat the middle point of a three-term arithmetic progression differently from the two endpoints, although this is a very natural thing to do in their context of ${({\bf Z}/4{\bf Z})^n}$). The basic starting point is this: if ${A}$ is a capset, then one has the identity

$\displaystyle \delta_{0^n}( x+y+z ) = \sum_{a \in A} \delta_a(x) \delta_a(y) \delta_a(z) \ \ \ \ \ (1)$

for all ${(x,y,z) \in A^3}$, where ${\delta_a(x) := 1_{a=x}}$ is the Kronecker delta function, which we view as taking values in ${{\bf F}_3}$. Indeed, (1) reflects the fact that the equation ${x+y+z=0}$ has solutions precisely when ${x,y,z}$ are either all equal, or form a line, and the latter is ruled out precisely when ${A}$ is a capset.
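The equivalence between (1) and the capset property can be confirmed by brute force in small dimension. Here is a minimal Python sketch, representing elements of ${{\bf F}_3^n}$ as tuples of integers mod ${3}$; the function names are hypothetical and the code is only meant as a sanity check of the identity.

```python
from itertools import product

def add(*vs):
    """Coordinate-wise sum of vectors in F_3^n."""
    return tuple(sum(c) % 3 for c in zip(*vs))

def is_capset(A):
    """True if the non-empty set A contains no line {x, x+r, x+2r}, r != 0."""
    A = set(A)
    n = len(next(iter(A)))
    for x in A:
        for r in product(range(3), repeat=n):
            if any(r) and add(x, r) in A and add(x, r, r) in A:
                return False
    return True

def identity_holds(A):
    """Check delta_0(x+y+z) = sum_a delta_a(x) delta_a(y) delta_a(z) on A^3."""
    A = list(A)
    zero = (0,) * len(A[0])
    for x, y, z in product(A, repeat=3):
        lhs = 1 if add(x, y, z) == zero else 0
        rhs = 1 if x == y == z else 0   # only the term a = x = y = z survives
        if lhs != rhs:
            return False
    return True
```

For example, the four-point set `[(0, 0), (0, 1), (1, 0), (1, 1)]` in ${{\bf F}_3^2}$ passes both tests, while the line `[(0, 0), (1, 1), (2, 2)]` fails both.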

To exploit (1), we will show that the left-hand side of (1) is “low rank” in some sense, while the right-hand side is “high rank”. Recall that a function ${F: A \times A \rightarrow {\bf F}}$ taking values in a field ${{\bf F}}$ is of rank one if it is non-zero and of the form ${(x,y) \mapsto f(x) g(y)}$ for some ${f,g: A \rightarrow {\bf F}}$, and that the rank of a general function ${F: A \times A \rightarrow {\bf F}}$ is the least number of rank one functions needed to express ${F}$ as a linear combination. More generally, if ${k \geq 2}$, we define the rank of a function ${F: A^k \rightarrow {\bf F}}$ to be the least number of “rank one” functions of the form

$\displaystyle (x_1,\dots,x_k) \mapsto f(x_i) g(x_1,\dots,x_{i-1},x_{i+1},\dots,x_k)$

for some ${i=1,\dots,k}$ and some functions ${f: A \rightarrow {\bf F}}$, ${g: A^{k-1} \rightarrow {\bf F}}$, that are needed to generate ${F}$ as a linear combination. For instance, when ${k=3}$, the rank one functions take the form ${(x,y,z) \mapsto f(x) g(y,z)}$, ${(x,y,z) \mapsto f(y) g(x,z)}$, ${(x,y,z) \mapsto f(z) g(x,y)}$, and linear combinations of ${r}$ such rank one functions will give a function of rank at most ${r}$.

It is a standard fact in linear algebra that the rank of a diagonal matrix is equal to the number of non-zero entries. This phenomenon extends to higher dimensions:

Lemma 1 (Rank of diagonal hypermatrices) Let ${k \geq 2}$, let ${A}$ be a finite set, let ${{\bf F}}$ be a field, and for each ${a \in A}$, let ${c_a \in {\bf F}}$ be a coefficient. Then the rank of the function

$\displaystyle (x_1,\dots,x_k) \mapsto \sum_{a \in A} c_a \delta_a(x_1) \dots \delta_a(x_k) \ \ \ \ \ (2)$

is equal to the number of non-zero coefficients ${c_a}$.

Proof: We induct on ${k}$. As mentioned above, the case ${k=2}$ follows from standard linear algebra, so suppose now that ${k>2}$ and the claim has already been proven for ${k-1}$.

It is clear that the function (2) has rank at most equal to the number of non-zero ${c_a}$ (since the summands on the right-hand side are rank one functions), so it suffices to establish the lower bound. By deleting from ${A}$ those elements ${a \in A}$ with ${c_a=0}$ (which cannot increase the rank), we may assume without loss of generality that all the ${c_a}$ are non-zero. Now suppose for contradiction that (2) has rank at most ${|A|-1}$; then we obtain a representation

$\displaystyle \sum_{a \in A} c_a \delta_a(x_1) \dots \delta_a(x_k)$

$\displaystyle = \sum_{i=1}^k \sum_{\alpha \in I_i} f_{i,\alpha}(x_i) g_{i,\alpha}( x_1,\dots,x_{i-1},x_{i+1},\dots,x_k) \ \ \ \ \ (3)$

for some sets ${I_1,\dots,I_k}$ of cardinalities adding up to at most ${|A|-1}$, and some functions ${f_{i,\alpha}: A \rightarrow {\bf F}}$ and ${g_{i,\alpha}: A^{k-1} \rightarrow {\bf F}}$.

Consider the space of functions ${h: A \rightarrow {\bf F}}$ that are orthogonal to all the ${f_{k,\alpha}}$, ${\alpha \in I_k}$ in the sense that

$\displaystyle \sum_{x \in A} f_{k,\alpha}(x) h(x) = 0$

for all ${\alpha \in I_k}$. This space is a vector space whose dimension ${d}$ is at least ${|A| - |I_k|}$. A basis of this space generates a ${d \times |A|}$ coordinate matrix of full rank, which implies that there is at least one non-singular ${d \times d}$ minor. This implies that there exists a function ${h: A \rightarrow {\bf F}}$ in this space which is nowhere vanishing on some subset ${A'}$ of ${A}$ of cardinality at least ${|A|-|I_k|}$.

If we multiply (3) by ${h(x_k)}$ and sum in ${x_k}$, we conclude that

$\displaystyle \sum_{a \in A} c_a h(a) \delta_a(x_1) \dots \delta_a(x_{k-1})$

$\displaystyle = \sum_{i=1}^{k-1} \sum_{\alpha \in I_i} f_{i,\alpha}(x_i)\tilde g_{i,\alpha}( x_1,\dots,x_{i-1},x_{i+1},\dots,x_{k-1})$

where

$\displaystyle \tilde g_{i,\alpha}(x_1,\dots,x_{i-1},x_{i+1},\dots,x_{k-1})$

$\displaystyle := \sum_{x_k \in A} g_{i,\alpha}(x_1,\dots,x_{i-1},x_{i+1},\dots,x_k) h(x_k).$

The right-hand side has rank at most ${|A|-1-|I_k|}$, since the summands are rank one functions. On the other hand, the coefficients ${c_a h(a)}$ are non-zero for every ${a \in A'}$, so from the induction hypothesis the left-hand side has rank at least ${|A|-|I_k|}$, giving the required contradiction. $\Box$

On the other hand, we have the following (symmetrised version of a) beautifully simple observation of Croot, Lev, and Pach:

Lemma 2 On ${({\bf F}_3^n)^3}$, the rank of the function ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ is at most ${3N}$, where

$\displaystyle N := \sum_{a,b,c \geq 0: a+b+c=n, b+2c \leq 2n/3} \frac{n!}{a!b!c!}.$

Proof: Using the identity ${\delta_0(x) = 1 - x^2}$ for ${x \in {\bf F}_3}$, we have

$\displaystyle \delta_{0^n}(x+y+z) = \prod_{i=1}^n (1 - (x_i+y_i+z_i)^2).$

The right-hand side is clearly a polynomial of degree ${2n}$ in ${x,y,z}$, which is then a linear combination of monomials

$\displaystyle x_1^{i_1} \dots x_n^{i_n} y_1^{j_1} \dots y_n^{j_n} z_1^{k_1} \dots z_n^{k_n}$

with ${i_1,\dots,i_n,j_1,\dots,j_n,k_1,\dots,k_n \in \{0,1,2\}}$ with

$\displaystyle i_1 + \dots + i_n + j_1 + \dots + j_n + k_1 + \dots + k_n \leq 2n.$

In particular, from the pigeonhole principle, at least one of ${i_1 + \dots + i_n, j_1 + \dots + j_n, k_1 + \dots + k_n}$ is at most ${2n/3}$.

Consider the contribution of the monomials for which ${i_1 + \dots + i_n \leq 2n/3}$. We can regroup this contribution as

$\displaystyle \sum_\alpha f_\alpha(x) g_\alpha(y,z)$

where ${\alpha}$ ranges over those ${(i_1,\dots,i_n) \in \{0,1,2\}^n}$ with ${i_1 + \dots + i_n \leq 2n/3}$, ${f_\alpha}$ is the monomial

$\displaystyle f_\alpha(x_1,\dots,x_n) := x_1^{i_1} \dots x_n^{i_n}$

and ${g_\alpha: {\bf F}_3^n \times {\bf F}_3^n \rightarrow {\bf F}_3}$ is some explicitly computable function whose exact form will not be of relevance to our argument. The number of such ${\alpha}$ is equal to ${N}$, so this contribution has rank at most ${N}$. The remaining contributions arising from the cases ${j_1 + \dots + j_n \leq 2n/3}$ and ${k_1 + \dots + k_n \leq 2n/3}$ similarly have rank at most ${N}$ (grouping the monomials so that each monomial is only counted once), so the claim follows.

Upon restricting from ${({\bf F}_3^n)^3}$ to ${A^3}$, the rank of ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ is still at most ${3N}$. The two lemmas then combine to give the Ellenberg-Gijswijt bound

$\displaystyle |A| \leq 3N.$
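For small ${n}$ the quantity ${N}$ can be computed exactly. The Python sketch below evaluates the multinomial sum defining ${N}$ and, as a consistency check, compares it with a direct count of the exponent vectors ${(i_1,\dots,i_n) \in \{0,1,2\}^n}$ with ${i_1+\dots+i_n \leq 2n/3}$; the names `N` and `N_brute` are illustrative.

```python
from itertools import product
from math import comb

def N(n):
    """Sum of n!/(a! b! c!) over a+b+c = n with b + 2c <= 2n/3."""
    total = 0
    for b in range(n + 1):
        for c in range(n - b + 1):
            if 3 * (b + 2 * c) <= 2 * n:
                # multinomial coefficient n!/(a! b! c!) with a = n - b - c
                total += comb(n, b) * comb(n - b, c)
    return total

def N_brute(n):
    """Count exponent vectors in {0,1,2}^n whose entries sum to at most 2n/3."""
    return sum(1 for e in product(range(3), repeat=n) if 3 * sum(e) <= 2 * n)

def delta_identity_holds():
    """Check that delta_0(x) = 1 - x^2 for every x in F_3."""
    return all((1 - x * x) % 3 == (1 if x == 0 else 0) for x in range(3))
```

Note that for very small ${n}$ the bound ${3N}$ is not yet useful: for instance ${N = 10}$ when ${n=3}$, so that ${3N = 30}$ exceeds the trivial bound ${3^3 = 27}$. The gain is exponential in ${n}$ and only becomes visible for larger ${n}$.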

All that remains is to compute the asymptotic behaviour of ${N}$. This can be done using the general tool of Cramer’s theorem, but can also be derived from Stirling’s formula (discussed in this previous post). Indeed, if ${a = (\alpha+o(1)) n}$, ${b = (\beta+o(1)) n}$, ${c = (\gamma+o(1)) n}$ for some ${\alpha,\beta,\gamma \geq 0}$ summing to ${1}$, Stirling’s formula gives

$\displaystyle \frac{n!}{a!b!c!} = \exp( n (h(\alpha,\beta,\gamma) + o(1)) )$

where ${h}$ is the entropy function

$\displaystyle h(\alpha,\beta,\gamma) = \alpha \log \frac{1}{\alpha} + \beta \log \frac{1}{\beta} + \gamma \log \frac{1}{\gamma}.$

We then have

$\displaystyle N = \exp( n (X + o(1)) )$

where ${X}$ is the maximum entropy ${h(\alpha,\beta,\gamma)}$ subject to the constraints

$\displaystyle \alpha,\beta,\gamma \geq 0; \alpha+\beta+\gamma=1; \beta+2\gamma \leq 2/3.$

A routine Lagrange multiplier computation shows that the maximum occurs when

$\displaystyle \alpha = \frac{32}{3(15 + \sqrt{33})}$

$\displaystyle \beta = \frac{4(\sqrt{33}-1)}{3(15+\sqrt{33})}$

$\displaystyle \gamma = \frac{(\sqrt{33}-1)^2}{6(15+\sqrt{33})}$

and ${h(\alpha,\beta,\gamma)}$ is approximately ${1.013455}$, giving rise to the claimed bound of ${O( 2.756^n )}$.
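These numerical values are easy to double-check. The following Python sketch evaluates the stated ${\alpha,\beta,\gamma}$, confirms that they satisfy the constraints (with ${\beta+2\gamma = 2/3}$ holding with equality), and compares the resulting entropy against a crude grid search over the constraint region; the grid resolution and function names are arbitrary illustrative choices.

```python
import math

def entropy(p):
    """Entropy h(p) = sum of p_i log(1/p_i), with the convention 0 log(1/0) = 0."""
    return sum(-q * math.log(q) for q in p if q > 0)

def optimum():
    """The claimed maximiser (alpha, beta, gamma)."""
    s = math.sqrt(33)
    alpha = 32 / (3 * (15 + s))
    beta = 4 * (s - 1) / (3 * (15 + s))
    gamma = (s - 1) ** 2 / (6 * (15 + s))
    return alpha, beta, gamma

def grid_max(steps=600):
    """Largest entropy over grid points with beta + 2*gamma <= 2/3."""
    best = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            beta, gamma = i / steps, j / steps
            if beta + 2 * gamma <= 2 / 3 + 1e-12:
                best = max(best, entropy((1 - beta - gamma, beta, gamma)))
    return best
```

Evaluating `math.exp(entropy(optimum()))` gives a value just above ${2.755}$, consistent with the bound ${O(2.756^n)}$.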

Remark 3 As noted in the Ellenberg and Gijswijt papers, the above argument extends readily to other fields than ${{\bf F}_3}$ to control the maximal size of a subset of ${{\bf F}^n}$ that has no non-trivial solutions to the equation ${ax+by+cz=0}$, where ${a,b,c \in {\bf F}}$ are non-zero constants that sum to zero. Of course one replaces the function ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ in Lemma 2 by ${(x,y,z) \mapsto \delta_{0^n}(ax+by+cz)}$ in this case.

Remark 4 This symmetrised formulation suggests one possible way to improve slightly on the numerical quantity ${2.756}$, namely by finding a more efficient way to decompose ${\delta_{0^n}(x+y+z)}$ into rank one functions; however, I was not able to do so (though such improvements are reminiscent of the Strassen type algorithms for fast matrix multiplication).

Remark 5 It is tempting to see if this method can get non-trivial upper bounds for sets ${A}$ with no length ${4}$ progressions, in (say) ${{\bf F}_5^n}$. One can run the above arguments, replacing the function

$\displaystyle (x,y,z) \mapsto \delta_{0^n}(x+y+z)$

with

$\displaystyle (x,y,z,w) \mapsto \delta_{0^n}(x-2y+z) \delta_{0^n}(y-2z+w);$

this leads to the bound ${|A| \leq 4N}$ where

$\displaystyle N := \sum_{a,b,c,d,e \geq 0: a+b+c+d+e=n, b+2c+3d+4e \leq 2n} \frac{n!}{a!b!c!d!e!}.$

Unfortunately, ${N}$ is asymptotic to ${\frac{1}{2} 5^n}$ and so this bound is in fact slightly worse than the trivial bound ${|A| \leq 5^n}$! However, there is a slim chance that there is a more efficient way to decompose ${\delta_{0^n}(x-2y+z) \delta_{0^n}(y-2z+w)}$ into rank one functions that would give a non-trivial bound on ${A}$. I experimented with a few possible such decompositions but unfortunately without success.
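The asymptotic ${N \sim \frac{1}{2} 5^n}$ comes from a symmetry: the reflection ${i \mapsto 4-i}$ on each exponent sends the total degree ${s}$ to ${4n-s}$, so exponent vectors of degree below ${2n}$ and above ${2n}$ are equinumerous, and ${N}$ is exactly half of ${5^n}$ plus half the number of vectors of degree exactly ${2n}$. This can be confirmed numerically with the short Python sketch below, which counts exponent vectors by dynamic programming (the function name and signature are illustrative assumptions).

```python
def count_low_degree(n, q, bound):
    """Number of exponent vectors in {0,...,q-1}^n with coordinate sum <= bound."""
    counts = [1]   # counts[s] = number of vectors of the current length with sum s
    for _ in range(n):
        new = [0] * (len(counts) + q - 1)
        for s, c in enumerate(counts):
            for e in range(q):
                new[s + e] += c      # append one more coordinate with exponent e
        counts = new
    return sum(c for s, c in enumerate(counts) if s <= bound)

def ratio_F5(n):
    """N / 5^n for the quantity N of this remark."""
    return count_low_degree(n, 5, 2 * n) / 5 ** n
```

The same routine with `q = 3` and `bound = 2 * n // 3` recovers the quantity ${N}$ from Lemma 2.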

Remark 6 Return now to the capset problem. Since Lemma 1 is valid for any field ${{\bf F}}$, one could perhaps hope to get better bounds by viewing the Kronecker delta function ${\delta}$ as taking values in another field than ${{\bf F}_3}$, such as the complex numbers ${{\bf C}}$. However, as soon as one works in a field of characteristic other than ${3}$, one can adjoin a cube root ${\omega}$ of unity, and one now has the Fourier decomposition

$\displaystyle \delta_{0^n}(x+y+z) = \frac{1}{3^n} \sum_{\xi \in {\bf F}_3^n} \omega^{\xi \cdot x} \omega^{\xi \cdot y} \omega^{\xi \cdot z}.$

Moving to the Fourier basis, we conclude from Lemma 1 that the function ${(x,y,z) \mapsto \delta_{0^n}(x+y+z)}$ on ${{\bf F}_3^n}$ now has rank exactly ${3^n}$, and so one cannot improve upon the trivial bound of ${|A| \leq 3^n}$ by this method using fields of characteristic other than three as the range field. So it seems one has to stick with ${{\bf F}_3}$ (or the algebraic completion thereof).
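The Fourier decomposition above can be checked numerically over ${{\bf C}}$, taking ${\omega = e^{2\pi i/3}}$. The Python sketch below evaluates the right-hand side by brute force for vectors in ${{\bf F}_3^n}$ represented as tuples of integers mod ${3}$; the function name is hypothetical and floating point arithmetic is used, so equality only holds up to rounding error.

```python
import cmath
from itertools import product

def delta_via_characters(x, y, z):
    """(1/3^n) * sum over xi in F_3^n of omega^(xi.x) omega^(xi.y) omega^(xi.z)."""
    n = len(x)
    omega = cmath.exp(2j * cmath.pi / 3)   # primitive cube root of unity

    def chi(xi, v):
        # the character v -> omega^(xi . v)
        return omega ** (sum(a * b for a, b in zip(xi, v)) % 3)

    total = sum(chi(xi, x) * chi(xi, y) * chi(xi, z)
                for xi in product(range(3), repeat=n))
    return total / 3 ** n
```

Each of the ${3^n}$ summands is a product of a function of ${x}$, a function of ${y}$, and a function of ${z}$, which is the diagonal form to which Lemma 1 applies after changing basis.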

Thanks to Jordan Ellenberg and Ben Green for helpful discussions.

When teaching mathematics, the traditional method of lecturing in front of a blackboard is still hard to improve upon, despite all the advances in modern technology.  However, there are some nice things one can do in an electronic medium, such as this blog.  Here, I would like to experiment with the ability to animate images, which I think can convey some mathematical concepts in ways that cannot be easily replicated by traditional static text and images. Given that many readers may find these animations annoying, I am placing the rest of the post below the fold.

Throughout this post we shall always work in the smooth category, thus all manifolds, maps, coordinate charts, and functions are assumed to be smooth unless explicitly stated otherwise.

A (real) manifold ${M}$ can be defined in at least two ways. On one hand, one can define the manifold extrinsically, as a subset of some standard space such as a Euclidean space ${{\bf R}^d}$. On the other hand, one can define the manifold intrinsically, as a topological space equipped with an atlas of coordinate charts. The fundamental embedding theorems show that, under reasonable assumptions, the intrinsic and extrinsic approaches give the same classes of manifolds (up to isomorphism in various categories). For instance, we have the following (special case of) the Whitney embedding theorem:

Theorem 1 (Whitney embedding theorem) Let ${M}$ be a compact manifold. Then there exists an embedding ${u: M \rightarrow {\bf R}^d}$ from ${M}$ to a Euclidean space ${{\bf R}^d}$.

In fact, if ${M}$ is ${n}$-dimensional, one can take ${d}$ to equal ${2n}$, which is often best possible (easy examples include the circle ${{\bf R}/{\bf Z}}$ which embeds into ${{\bf R}^2}$ but not ${{\bf R}^1}$, or the Klein bottle that embeds into ${{\bf R}^4}$ but not ${{\bf R}^3}$). One can also relax the compactness hypothesis on ${M}$ to second countability, but we will not pursue this extension here. We give a “cheap” proof of this theorem below the fold which allows one to take ${d}$ equal to ${2n+1}$.

A significant strengthening of the Whitney embedding theorem is (a special case of) the Nash embedding theorem:

Theorem 2 (Nash embedding theorem) Let ${(M,g)}$ be a compact Riemannian manifold. Then there exists an isometric embedding ${u: M \rightarrow {\bf R}^d}$ from ${M}$ to a Euclidean space ${{\bf R}^d}$.

In order to obtain the isometric embedding, the dimension ${d}$ has to be a bit larger than what is needed for the Whitney embedding theorem; in this article of Gunther the bound

$\displaystyle d = \max( n(n+5)/2, n(n+3)/2 + 5) \ \ \ \ \ (1)$

is attained, which I believe is still the record for large ${n}$. (In the converse direction, one cannot do better than ${d = \frac{n(n+1)}{2}}$, basically because this is the number of degrees of freedom in the Riemannian metric ${g}$.) Nash’s original proof of the theorem used what is now known as the Nash-Moser inverse function theorem, but a subsequent simplification of Gunther allowed one to proceed using just the ordinary inverse function theorem (in Banach spaces).
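To get a feel for the sizes involved, one can tabulate the bound (1) against the lower bound ${\frac{n(n+1)}{2}}$ for small ${n}$. The following Python sketch does this; the function names are purely illustrative.

```python
def gunther_dimension(n):
    """The target dimension d in (1) for an n-dimensional manifold."""
    # n(n+5) and n(n+3) are always even, so integer division is exact.
    return max(n * (n + 5) // 2, n * (n + 3) // 2 + 5)

def metric_degrees_of_freedom(n):
    """The lower bound n(n+1)/2 coming from the components of the metric g."""
    return n * (n + 1) // 2

def table(nmax):
    """Rows (n, lower bound, d from (1)) for n = 1, ..., nmax."""
    return [(n, metric_degrees_of_freedom(n), gunther_dimension(n))
            for n in range(1, nmax + 1)]
```

The first expression in the maximum in (1) dominates once ${n \geq 5}$, and the gap between (1) and the lower bound then equals ${2n}$.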

I recently had the need to invoke the Nash embedding theorem to establish a blowup result for a nonlinear wave equation, which motivated me to go through the proof of the theorem more carefully. Below the fold I give a proof of the theorem that does not attempt to give an optimal value of ${d}$, but which hopefully isolates the main ideas of the argument (as simplified by Gunther). One advantage of not optimising in ${d}$ is that it allows one to freely exploit the very useful tool of pairing together two maps ${u_1: M \rightarrow {\bf R}^{d_1}}$, ${u_2: M \rightarrow {\bf R}^{d_2}}$ to form a combined map ${(u_1,u_2): M \rightarrow {\bf R}^{d_1+d_2}}$ that can be closer to an embedding or an isometric embedding than the original maps ${u_1,u_2}$. This lets one perform a “divide and conquer” strategy in which one first starts with the simpler problem of constructing some “partial” embeddings of ${M}$ and then pairs them together to form a “better” embedding.

In preparing these notes, I found the articles of Deane Yang and of Siyuan Lu to be helpful.

In functional analysis, it is common to endow various (infinite-dimensional) vector spaces with a variety of topologies. For instance, a normed vector space can be given the strong topology as well as the weak topology; if the vector space has a predual, it also has a weak-* topology. Similarly, spaces of operators have a number of useful topologies on them, including the operator norm topology, strong operator topology, and the weak operator topology. For function spaces, one can use topologies associated to various modes of convergence, such as uniform convergence, pointwise convergence, locally uniform convergence, or convergence in the sense of distributions. (A small minority of such modes are not topologisable, though, the most common of which is pointwise almost everywhere convergence; see Exercise 8 of this previous post).

Some of these topologies are much stronger than others (in that they contain many more open sets, or equivalently that they have many fewer convergent sequences and nets). However, even the weakest topologies used in analysis (e.g. convergence in distributions) tend to be Hausdorff, since this at least ensures the uniqueness of limits of sequences and nets, which is a fundamentally useful feature for analysis. On the other hand, some Hausdorff topologies used are “better” than others in that many more analysis tools are available for those topologies. In particular, topologies that come from Banach space norms are particularly valued, as such topologies (and their attendant norm and metric structures) grant access to many convenient additional results such as the Baire category theorem, the uniform boundedness principle, the open mapping theorem, and the closed graph theorem.

Of course, most topologies placed on a vector space will not come from Banach space norms. For instance, if one takes the space ${C_0({\bf R})}$ of continuous functions on ${{\bf R}}$ that converge to zero at infinity, the topology of uniform convergence comes from a Banach space norm on this space (namely, the uniform norm ${\| \|_{L^\infty}}$), but the topology of pointwise convergence does not; and indeed all the other usual modes of convergence one could use here (e.g. ${L^1}$ convergence, locally uniform convergence, convergence in measure, etc.) do not arise from Banach space norms.

I recently realised (while teaching a graduate class in real analysis) that the closed graph theorem provides a quick explanation for why Banach space topologies are so rare:

Proposition 1 Let ${V = (V, {\mathcal F})}$ be a Hausdorff topological vector space. Then, up to equivalence of norms, there is at most one norm ${\| \|}$ one can place on ${V}$ so that ${(V,\| \|)}$ is a Banach space whose topology is at least as strong as ${{\mathcal F}}$. In particular, there is at most one topology stronger than ${{\mathcal F}}$ that comes from a Banach space norm.

Proof: Suppose one had two norms ${\| \|_1, \| \|_2}$ on ${V}$ such that ${(V, \| \|_1)}$ and ${(V, \| \|_2)}$ were both Banach spaces with topologies stronger than ${{\mathcal F}}$. Now consider the graph of the identity function ${\hbox{id}: V \rightarrow V}$ from the Banach space ${(V, \| \|_1)}$ to the Banach space ${(V, \| \|_2)}$. This graph is closed; indeed, if ${(x_n,x_n)}$ is a sequence in this graph that converged in the product topology to ${(x,y)}$, then ${x_n}$ converges to ${x}$ in ${\| \|_1}$ norm and hence in ${{\mathcal F}}$, and similarly ${x_n}$ converges to ${y}$ in ${\| \|_2}$ norm and hence in ${{\mathcal F}}$. But limits are unique in the Hausdorff topology ${{\mathcal F}}$, so ${x=y}$. Applying the closed graph theorem (see also previous discussions on this theorem), we see that the identity map is continuous from ${(V, \| \|_1)}$ to ${(V, \| \|_2)}$; similarly for the inverse. Thus the norms ${\| \|_1, \| \|_2}$ are equivalent as claimed. $\Box$

By using various generalisations of the closed graph theorem, one can generalise the above proposition to Fréchet spaces, or even to F-spaces. The proposition can fail if one drops the requirement that the norms be stronger than a specified Hausdorff topology; indeed, if ${V}$ is infinite dimensional, one can use a Hamel basis of ${V}$ to construct a linear bijection on ${V}$ that is unbounded with respect to a given Banach space norm ${\| \|}$, and which can then be used to give an inequivalent Banach space structure on ${V}$.

One can interpret Proposition 1 as follows: once one equips a vector space with some “weak” (but still Hausdorff) topology, there is a canonical choice of “strong” topology one can place on that space that is stronger than the “weak” topology but arises from a Banach space structure (or at least a Fréchet or F-space structure), provided that at least one such structure exists. In the case of function spaces, one can usually use the topology of convergence in distribution as the “weak” Hausdorff topology for this purpose, since this topology is weaker than almost all of the other topologies used in analysis. This helps justify the common practice of describing a Banach or Fréchet function space just by giving the set of functions that belong to that space (e.g. ${{\mathcal S}({\bf R}^n)}$ is the space of Schwartz functions on ${{\bf R}^n}$) without bothering to specify the precise topology to serve as the “strong” topology, since it is usually understood that one is using the canonical such topology (e.g. the Fréchet space structure on ${{\mathcal S}({\bf R}^n)}$ given by the usual Schwartz space seminorms).

Of course, there are still some topological vector spaces which have no “strong topology” arising from a Banach space at all. Consider for instance the space ${c_c({\bf N})}$ of finitely supported sequences. A weak, but still Hausdorff, topology to place on this space is the topology of pointwise convergence. But there is no norm ${\| \|}$ stronger than this topology that makes this space a Banach space. For, if there were, then letting ${e_1,e_2,e_3,\dots}$ be the standard basis of ${c_c({\bf N})}$, the series ${\sum_{n=1}^\infty 2^{-n} e_n / \| e_n \|}$ would have to converge in ${\| \|}$, and hence pointwise, to an element of ${c_c({\bf N})}$, but the only available pointwise limit for this series lies outside of ${c_c({\bf N})}$. But I do not know if there is an easily checkable criterion to test whether a given vector space (equipped with a Hausdorff “weak” topology) can be equipped with a stronger Banach space (or Fréchet space or ${F}$-space) topology.

There is a very nice recent paper by Lemke Oliver and Soundararajan (complete with a popular science article about it by the consistently excellent Erica Klarreich for Quanta) about a surprising (but now satisfactorily explained) bias in the distribution of pairs of consecutive primes ${p_n, p_{n+1}}$ when reduced to a small modulus ${q}$.

This phenomenon is superficially similar to the more well known Chebyshev bias concerning the reduction of a single prime ${p_n}$ to a small modulus ${q}$, but is in fact a rather different (and much stronger) bias than the Chebyshev bias, and seems to arise from a completely different source. The Chebyshev bias asserts, roughly speaking, that a randomly selected prime ${p}$ of a large magnitude ${x}$ will typically (though not always) be slightly more likely to be a quadratic non-residue modulo ${q}$ than a quadratic residue, but the bias is small (the difference in probabilities is only about ${O(1/\sqrt{x})}$ for typical choices of ${x}$), and certainly consistent with known or conjectured positive results such as Dirichlet’s theorem or the generalised Riemann hypothesis. The reason for the Chebyshev bias can be traced back to the von Mangoldt explicit formula which relates the distribution of the von Mangoldt function ${\Lambda}$ modulo ${q}$ with the zeroes of the ${L}$-functions with period ${q}$. This formula predicts (assuming some standard conjectures like GRH) that the von Mangoldt function ${\Lambda}$ is quite unbiased modulo ${q}$. The von Mangoldt function is mostly concentrated in the primes, but it also has a medium-sized contribution coming from squares of primes, which are of course all located in the quadratic residues modulo ${q}$. (Cubes and higher powers of primes also make a small contribution, but these are quite negligible asymptotically.) To balance everything out, the contribution of the primes must then exhibit a small preference towards quadratic non-residues, and this is the Chebyshev bias. (See this article of Rubinstein and Sarnak for a more technical discussion of the Chebyshev bias, and this survey of Granville and Martin for an accessible introduction. The story of the Chebyshev bias is also related to Skewes’ number, once considered the largest explicit constant to naturally appear in a mathematical argument.)

The paper of Lemke Oliver and Soundararajan considers instead the distribution of the pairs ${(p_n \hbox{ mod } q, p_{n+1} \hbox{ mod } q)}$ for small ${q}$ and for large consecutive primes ${p_n, p_{n+1}}$, say drawn at random from the primes comparable to some large ${x}$. For sake of discussion let us just take ${q=3}$. Then all primes ${p_n}$ larger than ${3}$ are either ${1 \hbox{ mod } 3}$ or ${2 \hbox{ mod } 3}$; Chebyshev’s bias gives a very slight preference to the latter (of order ${O(1/\sqrt{x})}$, as discussed above), but apart from this, we expect the primes to be more or less equally distributed in both classes. For instance, assuming GRH, the probability that ${p_n}$ lands in ${1 \hbox{ mod } 3}$ would be ${1/2 + O( x^{-1/2+o(1)} )}$, and similarly for ${2 \hbox{ mod } 3}$.

In view of this, one would expect that up to errors of ${O(x^{-1/2+o(1)})}$ or so, the pair ${(p_n \hbox{ mod } 3, p_{n+1} \hbox{ mod } 3)}$ should be equally distributed amongst the four options ${(1 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$, ${(1 \hbox{ mod } 3, 2 \hbox{ mod } 3)}$, ${(2 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$, ${(2 \hbox{ mod } 3, 2 \hbox{ mod } 3)}$, thus for instance the probability that this pair is ${(1 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$ would naively be expected to be ${1/4 + O(x^{-1/2+o(1)})}$, and similarly for the other three tuples. These assertions are not yet proven (although some non-trivial upper and lower bounds for such probabilities can be obtained from recent work of Maynard).

However, Lemke Oliver and Soundararajan argue (backed by both plausible heuristic arguments (based ultimately on the Hardy-Littlewood prime tuples conjecture), as well as substantial numerical evidence) that there is a significant bias away from the tuples ${(1 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$ and ${(2 \hbox{ mod } 3, 2 \hbox{ mod } 3)}$ – informally, adjacent primes don’t like being in the same residue class! For instance, they predict that the probability of attaining ${(1 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$ is in fact

$\displaystyle \frac{1}{4} - \frac{1}{8} \frac{\log\log x}{\log x} + O( \frac{1}{\log x} )$

with similar predictions for the other three pairs (in fact they give a somewhat more precise prediction than this). The magnitude of this bias, being comparable to ${\log\log x / \log x}$, is significantly stronger than the Chebyshev bias of ${O(1/\sqrt{x})}$.

One consequence of this prediction is that the prime gaps ${p_{n+1}-p_n}$ are slightly less likely to be divisible by ${3}$ than naive random models of the primes would predict. Indeed, if the four options ${(1 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$, ${(1 \hbox{ mod } 3, 2 \hbox{ mod } 3)}$, ${(2 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$, ${(2 \hbox{ mod } 3, 2 \hbox{ mod } 3)}$ all occurred with equal probability ${1/4}$, then ${p_{n+1}-p_n}$ should equal ${0 \hbox{ mod } 3}$ with probability ${1/2}$, and ${1 \hbox{ mod } 3}$ and ${2 \hbox{ mod } 3}$ with probability ${1/4}$ each (as would be the case when taking the difference of two random numbers drawn from those integers not divisible by ${3}$); but the Lemke Oliver-Soundararajan bias predicts that the probability of ${p_{n+1}-p_n}$ being divisible by three should be slightly lower, being approximately ${1/2 - \frac{1}{4} \frac{\log\log x}{\log x}}$.
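The bias is already visible in small-scale numerics. The following is a minimal sketch (not the computation from the paper) that counts the residue pairs of consecutive primes up to a hypothetical cutoff `limit`; the helper names are illustrative.

```python
from collections import Counter
from math import isqrt


def primes_up_to(limit: int) -> list[int]:
    """Sieve of Eratosthenes returning all primes <= limit."""
    if limit < 2:
        return []
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, isqrt(limit) + 1):
        if sieve[p]:
            # Cross out p*p, p*p + p, ... in one slice assignment.
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [i for i, flag in enumerate(sieve) if flag]


def consecutive_pair_counts(limit: int, q: int = 3) -> Counter:
    """Count the pairs (p_n mod q, p_{n+1} mod q) over primes q < p_n < p_{n+1} <= limit."""
    primes = [p for p in primes_up_to(limit) if p > q]
    return Counter((a % q, b % q) for a, b in zip(primes, primes[1:]))


def same_class_fraction(limit: int, q: int = 3) -> float:
    """Fraction of consecutive prime gaps p_{n+1} - p_n that are divisible by q."""
    counts = consecutive_pair_counts(limit, q)
    same = sum(c for (a, b), c in counts.items() if a == b)
    return same / sum(counts.values())


if __name__ == "__main__":
    limit = 10**6  # illustrative cutoff
    for pair, count in sorted(consecutive_pair_counts(limit).items()):
        print(pair, count)
    print("fraction of gaps divisible by 3:", same_class_fraction(limit))
```

Note that the counts for ${(1 \hbox{ mod } 3, 2 \hbox{ mod } 3)}$ and ${(2 \hbox{ mod } 3, 1 \hbox{ mod } 3)}$ can differ by at most one for trivial reasons (switches between the two classes must alternate), so the interesting comparison is between the diagonal and off-diagonal pairs.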

Below the fold we will give a somewhat informal justification of (a simplified version of) this phenomenon, based on the Lemke Oliver-Soundararajan calculation using the prime tuples conjecture.

I’ve been meaning to return to fluids for some time now, in order to build upon my construction two years ago of a solution to an averaged Navier-Stokes equation that exhibited finite time blowup. (I recently spoke on this work at the conference in Princeton in honour of Sergiu Klainerman; my slides for that talk are here.)

One of the biggest deficiencies with my previous result is the fact that the averaged Navier-Stokes equation does not enjoy any good equation for the vorticity ${\omega = \nabla \times u}$, in contrast to the true Navier-Stokes equations which, when written in vorticity-stream formulation, become

$\displaystyle \partial_t \omega + (u \cdot \nabla) \omega = (\omega \cdot \nabla) u + \nu \Delta \omega$

$\displaystyle u = (-\Delta)^{-1} (\nabla \times \omega).$

(Throughout this post we will be working in three spatial dimensions ${{\bf R}^3}$.) So one of my main near-term goals in this area is to exhibit an equation resembling Navier-Stokes as much as possible which enjoys a vorticity equation, and for which there is finite time blowup.

Heuristically, this task should be easier for the Euler equations (i.e. the zero viscosity case ${\nu=0}$ of Navier-Stokes) than the viscous Navier-Stokes equation, as one expects the viscosity to only make it easier for the solution to stay regular. Indeed, morally speaking, the assertion that finite time blowup solutions of Navier-Stokes exist should be roughly equivalent to the assertion that finite time blowup solutions of Euler exist which are “Type I” in the sense that all Navier-Stokes-critical and Navier-Stokes-subcritical norms of this solution go to infinity (which, as explained in the above slides, heuristically means that the effects of viscosity are negligible when compared against the nonlinear components of the equation). In vorticity-stream formulation, the Euler equations can be written as

$\displaystyle \partial_t \omega + (u \cdot \nabla) \omega = (\omega \cdot \nabla) u$

$\displaystyle u = (-\Delta)^{-1} (\nabla \times \omega).$

As discussed in this previous blog post, a natural generalisation of this system of equations is the system

$\displaystyle \partial_t \omega + (u \cdot \nabla) \omega = (\omega \cdot \nabla) u \ \ \ \ \ (1)$

$\displaystyle u = T (-\Delta)^{-1} (\nabla \times \omega).$

where ${T}$ is a linear operator on divergence-free vector fields that is “zeroth order” in some sense; ideally it should also be invertible, self-adjoint, and positive definite (in order to have a Hamiltonian that is comparable to the kinetic energy ${\frac{1}{2} \int_{{\bf R}^3} |u|^2}$). (In the previous blog post, it was observed that the surface quasi-geostrophic (SQG) equation could be embedded in a system of the form (1).) The system (1) has many features in common with the Euler equations; for instance vortex lines are transported by the velocity field ${u}$, and Kelvin’s circulation theorem is still valid.

So far, I have not been able to fully achieve this goal. However, I have the following partial result, stated somewhat informally:

Theorem 1 There is a “zeroth order” linear operator ${T}$ (which, unfortunately, is not invertible, self-adjoint, or positive definite) for which the system (1) exhibits smooth solutions that blow up in finite time.

The operator ${T}$ constructed is not quite a zeroth-order pseudodifferential operator; it is instead merely in the “forbidden” symbol class ${S^0_{1,1}}$, and more precisely it takes the form

$\displaystyle T v = \sum_{j \in {\bf Z}} 2^{3j} \langle v, \phi_j \rangle \psi_j \ \ \ \ \ (2)$

for some compactly supported divergence-free ${\phi,\psi}$ of mean zero with

$\displaystyle \phi_j(x) := \phi(2^j x); \quad \psi_j(x) := \psi(2^j x)$

being ${L^2}$ rescalings of ${\phi,\psi}$. This operator is still bounded on all ${L^p({\bf R}^3)}$ spaces ${1 < p < \infty}$, and so is arguably still a zeroth order operator, though not as convincingly as I would like. Another, less significant, issue with the result is that the solution constructed does not have good spatial decay properties, but this is mostly for convenience and it is likely that the construction can be localised to give solutions that have reasonable decay in space. But the biggest drawback of this theorem is the fact that ${T}$ is not invertible, self-adjoint, or positive definite, so in particular there is no non-negative Hamiltonian for this equation. It may be that some modification of the arguments below can fix these issues, but I have so far been unable to do so. Still, the construction does show that the circulation theorem is insufficient by itself to prevent blowup.

We sketch the proof of the above theorem as follows. We use the barrier method, introducing the time-varying hyperboloid domains

$\displaystyle \Omega(t) := \{ (r,\theta,z): r^2 \leq 1-t + z^2 \}$

for ${t>0}$ (expressed in cylindrical coordinates ${(r,\theta,z)}$). We will select initial data ${\omega(0)}$ to be ${\omega(0,r,\theta,z) = (0,0,\eta(r))}$ for some non-negative even bump function ${\eta}$ supported on ${[-1,1]}$, normalised so that

$\displaystyle \int\int \eta(r)\ r dr d\theta = 1;$

in particular ${\omega(0)}$ is divergence-free and supported in ${\Omega(0)}$, with vortex lines connecting ${z=-\infty}$ to ${z=+\infty}$. Suppose for contradiction that we have a smooth solution ${\omega}$ to (1) with this initial data; to simplify the discussion we assume that the solution behaves well at spatial infinity (this can be justified with the choice (2) of vorticity-stream operator, but we will not do so here). Since the domains ${\Omega(t)}$ disconnect ${z=-\infty}$ from ${z=+\infty}$ at time ${t=1}$, there must exist a time ${0 < T_* < 1}$ which is the first time where the support of ${\omega(T_*)}$ touches the boundary of ${\Omega(T_*)}$, with ${\omega(t)}$ supported in ${\Omega(t)}$ for all ${0 \leq t \leq T_*}$.

From (1) we see that the support of ${\omega(t)}$ is transported by the velocity field ${u(t)}$. Thus, at the point of contact of the support of ${\omega(T_*)}$ with the boundary of ${\Omega(T_*)}$, the inward component of the velocity field ${u(T_*)}$ cannot exceed the inward velocity of ${\Omega(T_*)}$. We will construct the functions ${\phi,\psi}$ so that this is not the case, leading to the desired contradiction. (Geometrically, what is going on here is that the operator ${T}$ is pinching the flow to pass through the narrow cylinder ${\{ z, r = O( \sqrt{1-t} )\}}$, leading to a singularity by time ${t=1}$ at the latest.)

First we observe from conservation of circulation, and from the fact that ${\omega(t)}$ is supported in ${\Omega(t)}$, that the integrals

$\displaystyle \int\int \omega_z(t,r,\theta,z) \ r dr d\theta$

are constant in both space and time for ${0 \leq t \leq T_*}$. From the choice of initial data we thus have

$\displaystyle \int\int \omega_z(t,r,\theta,z) \ r dr d\theta = 1$

for all ${t \leq T_*}$ and all ${z}$. On the other hand, if ${T}$ is of the form (2) with ${\phi = \nabla \times \eta}$ for some bump function ${\eta = (0,0,\eta_z)}$ that only has ${z}$-components, then ${\phi}$ is divergence-free with mean zero, and

$\displaystyle \langle (-\Delta)^{-1} (\nabla \times \omega), \phi_j \rangle = 2^{-j} \langle (-\Delta)^{-1} (\nabla \times \omega), \nabla \times \eta_j \rangle$

$\displaystyle = 2^{-j} \langle \omega, \eta_j \rangle$

$\displaystyle = 2^{-j} \int\int\int \omega_z(t,r,\theta,z) \eta_z(2^j r, \theta, 2^j z)\ r dr d\theta dz,$

where ${\eta_j(x) := \eta(2^j x)}$. We choose ${\eta_z}$ to be supported in the slab ${\{ C \leq z \leq 2C\}}$ for some large constant ${C}$, and to equal a function ${f(z)}$ depending only on ${z}$ on the cylinder ${\{ C \leq z \leq 2C; r \leq 10C \}}$, normalised so that ${\int f(z)\ dz = 1}$. If ${C/2^j \geq (1-t)^{1/2}}$, then ${\Omega(t)}$ passes through this cylinder, and we conclude that

$\displaystyle \langle (-\Delta)^{-1} (\nabla \times \omega), \phi_j \rangle = 2^{-j} \int f(2^j z)\ dz$

$\displaystyle = 2^{-2j}.$

Inserting this into (2) and (1), we conclude that

$\displaystyle u = \sum_{j: C/2^j \geq (1-t)^{1/2}} 2^j \psi_j + \sum_{j: C/2^j < (1-t)^{1/2}} c_j(t) \psi_j$

for some coefficients ${c_j(t)}$. We will not be able to control these coefficients ${c_j(t)}$, but fortunately we only need to understand ${u}$ on the boundary ${\partial \Omega(t)}$, for which ${r+|z| \gg (1-t)^{1/2}}$. So, if ${\psi}$ happens to be supported on an annulus ${1 \ll r+|z| \ll 1}$, then the ${\psi_j}$ with ${C/2^j < (1-t)^{1/2}}$ vanish on ${\partial \Omega(t)}$ if ${C}$ is large enough. We then have

$\displaystyle u = \sum_j 2^j \psi_j$

on the boundary ${\partial \Omega(t)}$.

Let ${\Phi(r,\theta,z)}$ be a function of the form

$\displaystyle \Phi(r,\theta,z) = C z \varphi(z/r)$

where ${\varphi}$ is a bump function supported on ${[-2,2]}$ that equals ${1}$ on ${[-1,1]}$. We can perform a dyadic decomposition ${\Phi = \sum_j \Psi_j}$ where

$\displaystyle \Psi_j(r,\theta,z) = \Phi(r,\theta,z) a(2^j r)$

where ${a}$ is a bump function supported on ${[1/2,2]}$ with ${\sum_j a(2^j r) = 1}$. If we then set

$\displaystyle \psi_j = \frac{2^{-j}}{r} (-\partial_z \Psi_j, 0, \partial_r \Psi_j)$

then one can check that ${\psi_j(x) = \psi(2^j x)}$ for a function ${\psi}$ that is divergence-free and mean zero, and supported on the annulus ${1 \ll r+|z| \ll 1}$, and

$\displaystyle \sum_j 2^j \psi_j = \frac{1}{r} (-\partial_z \Phi, 0, \partial_r \Phi)$

so on ${\partial \Omega(t)}$ (where ${|z| \leq r}$) we have

$\displaystyle u = (-\frac{C}{r}, 0, 0 ).$

One can manually check that the inward velocity of this vector on ${\partial \Omega(t)}$ exceeds the inward velocity of ${\Omega(t)}$ if ${C}$ is large enough, and the claim follows.
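The manual check can be sketched numerically as follows. This is only an illustration under conventions assumed here: the boundary is written as the zero set of ${F = r^2 - z^2 - (1-t)}$, the outward normal is taken along ${\nabla F}$, and a point stays on the moving boundary when ${\nabla F \cdot v + \partial_t F = 0}$ with ${\partial_t F = 1}$. The function names are hypothetical.

```python
import math


def boundary_radius(t: float, z: float) -> float:
    """Radius r of the boundary of Omega(t) at height z, from r^2 = 1 - t + z^2."""
    return math.sqrt(1.0 - t + z * z)


def grad_F(r: float, z: float) -> tuple[float, float]:
    """(r, z) components of the gradient of F = r^2 - z^2 - (1 - t)."""
    return (2.0 * r, -2.0 * z)


def inward_speed_of_boundary(r: float, z: float) -> float:
    """Inward normal speed of the boundary: (dF/dt) / |grad F| with dF/dt = 1."""
    gr, gz = grad_F(r, z)
    return 1.0 / math.hypot(gr, gz)


def inward_speed_of_flow(r: float, z: float, C: float) -> float:
    """Inward normal component of u = (-C/r, 0, 0) at a boundary point."""
    gr, gz = grad_F(r, z)
    u_r, u_z = -C / r, 0.0
    return -(u_r * gr + u_z * gz) / math.hypot(gr, gz)


if __name__ == "__main__":
    C = 1.0  # illustrative value of the constant
    for t in (0.0, 0.5, 0.9):
        for z in (-2.0, 0.0, 1.0):
            r = boundary_radius(t, z)
            print(t, z, inward_speed_of_flow(r, z, C), inward_speed_of_boundary(r, z))
```

With these conventions the flow's inward normal speed is ${2C/|\nabla F|}$ and the boundary's is ${1/|\nabla F|}$, so in this sketch any ${C > 1/2}$ already suffices.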

Remark 2 The type of blowup suggested by this construction, where a unit amount of circulation is squeezed into a narrow cylinder, is of “Type II” with respect to the Navier-Stokes scaling, because Navier-Stokes-critical norms such as ${L^3({\bf R}^3)}$ (or at least ${L^{3,\infty}({\bf R}^3)}$) look like they stay bounded during this squeezing procedure (the velocity field is of size about ${2^j}$ in cylinders of radius and length about ${2^{-j}}$). So even if the various issues with ${T}$ are repaired, it does not seem likely that this construction can be directly adapted to obtain a corresponding blowup for a Navier-Stokes type equation. To get a “Type I” blowup that is consistent with Kelvin’s circulation theorem, it seems that one needs to coil the vortex lines around a loop multiple times in order to get increased circulation in a small space. This seems possible to pull off to me – there don’t appear to be any unavoidable obstructions coming from topology, scaling, or conservation laws – but would require a more complicated construction than the one given above.

In this blog post, I would like to specialise the arguments of Bourgain, Demeter, and Guth from the previous post to the two-dimensional case of the Vinogradov main conjecture, namely

Theorem 1 (Two-dimensional Vinogradov main conjecture) One has

$\displaystyle \int_{[0,1]^2} |\sum_{j=0}^N e( j x + j^2 y)|^6\ dx dy \ll N^{3+o(1)}$

as ${N \rightarrow \infty}$.

This particular case of the main conjecture has a classical proof using some elementary number theory. Indeed, the left-hand side can be viewed as the number of solutions to the system of equations

$\displaystyle j_1 + j_2 + j_3 = k_1 + k_2 + k_3$

$\displaystyle j_1^2 + j_2^2 + j_3^2 = k_1^2 + k_2^2 + k_3^2$

with ${j_1,j_2,j_3,k_1,k_2,k_3 \in \{0,\dots,N\}}$. These two equations can combine (using the algebraic identity ${(a+b-c)^2 - (a^2+b^2-c^2) = 2 (a-c)(b-c)}$ applied to ${(a,b,c) = (j_1,j_2,k_3), (k_1,k_2,j_3)}$) to imply the further equation

$\displaystyle (j_1 - k_3) (j_2 - k_3) = (k_1 - j_3) (k_2 - j_3)$

which, when combined with the divisor bound, shows that each ${k_1,k_2,j_3}$ is associated to ${O(N^{o(1)})}$ choices of ${j_1,j_2,k_3}$ excluding diagonal cases when two of the ${j_1,j_2,j_3,k_1,k_2,k_3}$ collide, and this easily yields Theorem 1. However, the Bourgain-Demeter-Guth argument (which, in the two dimensional case, is essentially contained in a previous paper of Bourgain and Demeter) does not require the divisor bound, and extends for instance to the more general case where ${j}$ ranges in a ${1}$-separated set of reals between ${0}$ and ${N}$.
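For small ${N}$ the solution count can be computed by brute force, which gives a quick sanity check on both the ${N^{3+o(1)}}$ growth and the product identity. The following is a minimal sketch with illustrative function names; it groups triples by their first two power sums rather than using anything from the classical proof.

```python
from collections import Counter, defaultdict
from itertools import product


def power_sums(triple: tuple[int, int, int]) -> tuple[int, int]:
    """Return (a + b + c, a^2 + b^2 + c^2) for a triple (a, b, c)."""
    a, b, c = triple
    return (a + b + c, a * a + b * b + c * c)


def count_solutions(N: int) -> int:
    """Number of (j1,j2,j3,k1,k2,k3) in {0,...,N}^6 solving both equations."""
    groups = Counter(power_sums(t) for t in product(range(N + 1), repeat=3))
    # Two triples solve the system exactly when they share both power sums.
    return sum(m * m for m in groups.values())


def product_identity_holds(N: int) -> bool:
    """Check (j1-k3)(j2-k3) = (k1-j3)(k2-j3) on every solution with entries <= N."""
    groups = defaultdict(list)
    for t in product(range(N + 1), repeat=3):
        groups[power_sums(t)].append(t)
    for group in groups.values():
        for j1, j2, j3 in group:
            for k1, k2, k3 in group:
                if (j1 - k3) * (j2 - k3) != (k1 - j3) * (k2 - j3):
                    return False
    return True


if __name__ == "__main__":
    for N in (1, 2, 4, 8, 16, 32):
        total = count_solutions(N)
        # Ratio against the (N+1)^3 diagonal-type lower bound.
        print(N, total, total / (N + 1) ** 3)
```

Taking ${(k_1,k_2,k_3) = (j_1,j_2,j_3)}$ always gives a solution, so the count is at least ${(N+1)^3}$; the printed ratio measures how far the true count is from this trivial lower bound.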

In this special case, the Bourgain-Demeter argument simplifies, as the lower dimensional inductive hypothesis becomes a simple ${L^2}$ almost orthogonality claim, and the multilinear Kakeya estimate needed is also easy (collapsing to just Fubini’s theorem). Also one can work entirely in the context of the Vinogradov main conjecture, and not turn to the increased generality of decoupling inequalities (though this additional generality is convenient in higher dimensions). As such, I am presenting this special case as an introduction to the Bourgain-Demeter-Guth machinery.

We now give the specialisation of the Bourgain-Demeter argument to Theorem 1. It will suffice to establish the bound

$\displaystyle \int_{[0,1]^2} |\sum_{j=0}^N e( j x + j^2 y)|^p\ dx dy \ll N^{p/2+o(1)}$

for all ${4 < p < 6}$ (where we keep ${p}$ fixed and send ${N}$ to infinity), as the ${L^6}$ bound then follows by combining the above bound with the trivial bound ${|\sum_{j=0}^N e( j x + j^2 y)| \ll N}$. Accordingly, for any ${\eta > 0}$ and ${4 < p < 6}$, we let ${P(p,\eta)}$ denote the claim that

$\displaystyle \int_{[0,1]^2} |\sum_{j=0}^N e( j x + j^2 y)|^p\ dx dy \ll N^{p/2+\eta+o(1)}$

as ${N \rightarrow \infty}$. Clearly, for any fixed ${p}$, ${P(p,\eta)}$ holds for some large ${\eta}$, and it will suffice to establish

Proposition 2 Let ${4 < p < 6}$, and let ${\eta>0}$ be such that ${P(p,\eta)}$ holds. Then there exists ${0 < \eta' < \eta}$ such that ${P(p,\eta')}$ holds.

Indeed, this proposition shows that for ${4 < p < 6}$, the infimum of the ${\eta}$ for which ${P(p,\eta)}$ holds is zero.

We prove the proposition below the fold, using a simplified form of the methods discussed in the previous blog post. To simplify the exposition we will be a bit cavalier with the uncertainty principle, for instance by essentially ignoring the tails of rapidly decreasing functions.

Given any finite collection of elements ${(f_i)_{i \in I}}$ in some Banach space ${X}$, the triangle inequality tells us that

$\displaystyle \| \sum_{i \in I} f_i \|_X \leq \sum_{i \in I} \|f_i\|_X.$

However, when the ${f_i}$ all “oscillate in different ways”, one expects to improve substantially upon the triangle inequality. For instance, if ${X}$ is a Hilbert space and the ${f_i}$ are mutually orthogonal, we have the Pythagorean theorem

$\displaystyle \| \sum_{i \in I} f_i \|_X = (\sum_{i \in I} \|f_i\|_X^2)^{1/2}.$

For sake of comparison, from the triangle inequality and Cauchy-Schwarz one has the general inequality

$\displaystyle \| \sum_{i \in I} f_i \|_X \leq (\# I)^{1/2} (\sum_{i \in I} \|f_i\|_X^2)^{1/2} \ \ \ \ \ (1)$

for any finite collection ${(f_i)_{i \in I}}$ in any Banach space ${X}$, where ${\# I}$ denotes the cardinality of ${I}$. Thus orthogonality in a Hilbert space yields “square root cancellation”, saving a factor of ${(\# I)^{1/2}}$ or so over the trivial bound coming from the triangle inequality.
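As a toy illustration of these three bounds, one can compare them for vectors in a finite-dimensional Euclidean space. This is a minimal sketch with hand-picked example vectors; all names are illustrative.

```python
import math


def norm(v: list[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def vector_sum(vectors: list[list[float]]) -> list[float]:
    """Coordinate-wise sum of a list of vectors of equal length."""
    return [sum(coords) for coords in zip(*vectors)]


def triangle_bound(vectors: list[list[float]]) -> float:
    """Right-hand side of the triangle inequality: the sum of the norms."""
    return sum(norm(v) for v in vectors)


def square_function(vectors: list[list[float]]) -> float:
    """The Pythagorean quantity: square root of the sum of squared norms."""
    return math.sqrt(sum(norm(v) ** 2 for v in vectors))


def cauchy_schwarz_bound(vectors: list[list[float]]) -> float:
    """Right-hand side of (1): (#I)^{1/2} times the square function."""
    return math.sqrt(len(vectors)) * square_function(vectors)


if __name__ == "__main__":
    orthogonal = [[3.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 12.0]]
    aligned = [[1.0, 0.0, 0.0]] * 4
    for name, vs in (("orthogonal", orthogonal), ("aligned", aligned)):
        print(name, norm(vector_sum(vs)), square_function(vs),
              triangle_bound(vs), cauchy_schwarz_bound(vs))
```

In the orthogonal example the norm of the sum matches the square function, while in the aligned example it matches the right-hand side of (1), so neither bound can be improved in general.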

More generally, let us somewhat informally say that a collection ${(f_i)_{i \in I}}$ exhibits decoupling in ${X}$ if one has the Pythagorean-like inequality

$\displaystyle \| \sum_{i \in I} f_i \|_X \ll_\varepsilon (\# I)^\varepsilon (\sum_{i \in I} \|f_i\|_X^2)^{1/2}$

for any ${\varepsilon>0}$, thus one obtains almost the full square root cancellation in the ${X}$ norm. The theory of almost orthogonality can then be viewed as the theory of decoupling in Hilbert spaces such as ${L^2({\bf R}^n)}$. In ${L^p}$ spaces for ${p < 2}$ one usually does not expect this sort of decoupling; for instance, if the ${f_i}$ are disjointly supported one has

$\displaystyle \| \sum_{i \in I} f_i \|_{L^p} = (\sum_{i \in I} \|f_i\|_{L^p}^p)^{1/p}$

and the right-hand side can be much larger than ${(\sum_{i \in I} \|f_i\|_{L^p}^2)^{1/2}}$ when ${p < 2}$. At the opposite extreme, one usually does not expect to get decoupling in ${L^\infty}$, since one could conceivably align the ${f_i}$ to all attain a maximum magnitude at the same location with the same phase, at which point the triangle inequality in ${L^\infty}$ becomes sharp.

However, in some cases one can get decoupling for certain ${2 < p < \infty}$. For instance, suppose we are in ${L^4}$, and that ${f_1,\dots,f_N}$ are bi-orthogonal in the sense that the products ${f_i f_j}$ for ${1 \leq i < j \leq N}$ are pairwise orthogonal in ${L^2}$. Then we have

$\displaystyle \| \sum_{i = 1}^N f_i \|_{L^4}^2 = \| (\sum_{i=1}^N f_i)^2 \|_{L^2}$

$\displaystyle = \| \sum_{1 \leq i,j \leq N} f_i f_j \|_{L^2}$

$\displaystyle \ll (\sum_{1 \leq i,j \leq N} \|f_i f_j \|_{L^2}^2)^{1/2}$

$\displaystyle = \| (\sum_{1 \leq i,j \leq N} |f_i f_j|^2)^{1/2} \|_{L^2}$

$\displaystyle = \| \sum_{i=1}^N |f_i|^2 \|_{L^2}$

$\displaystyle \leq \sum_{i=1}^N \| |f_i|^2 \|_{L^2}$

$\displaystyle = \sum_{i=1}^N \|f_i\|_{L^4}^2$

giving decoupling in ${L^4}$. (Similarly if each of the ${f_i f_j}$ is orthogonal to all but ${O_\varepsilon( N^\varepsilon )}$ of the other ${f_{i'} f_{j'}}$.) A similar argument also gives ${L^6}$ decoupling when one has tri-orthogonality (with the ${f_i f_j f_k}$ mostly orthogonal to each other), and so forth. As a slight variant, Khintchine’s inequality also indicates that decoupling should occur for any fixed ${2 < p < \infty}$ if one multiplies each of the ${f_i}$ by an independent random sign ${\epsilon_i \in \{-1,+1\}}$.
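A simple instance of bi-orthogonality (a toy example chosen here, not taken from the Bourgain-Demeter papers) is ${f_i(x) = e(n_i x)}$ on ${[0,1]}$ with distinct integer frequencies ${n_i}$ whose pairwise sums are all distinct, such as powers of two. Expanding the fourth power shows that ${\| \sum_i f_i \|_{L^4}^4}$ is exactly the number of quadruples with ${n_a + n_b = n_c + n_d}$, so the ${L^4}$ norm can be computed by counting. The sketch below compares such a set with an interval of frequencies, where bi-orthogonality fails.

```python
from collections import Counter
from itertools import product


def l4_norm_fourth_power(frequencies: list[int]) -> int:
    """Exact value of ||sum_i e(n_i x)||_{L^4[0,1]}^4 for distinct integers n_i.

    This equals the number of quadruples (a, b, c, d) with n_a + n_b = n_c + n_d.
    """
    sums = Counter(a + b for a, b in product(frequencies, repeat=2))
    return sum(m * m for m in sums.values())


def decoupling_ratio(frequencies: list[int]) -> float:
    """Ratio of ||sum_i f_i||_{L^4} to (sum_i ||f_i||_{L^4}^2)^{1/2}.

    Each f_i = e(n_i x) has L^4 norm 1, so the denominator is sqrt(N).
    """
    n_terms = len(frequencies)
    return l4_norm_fourth_power(frequencies) ** 0.25 / n_terms ** 0.5


if __name__ == "__main__":
    for n_terms in (4, 8, 16, 32, 64):
        powers_of_two = [2 ** k for k in range(n_terms)]
        interval = list(range(1, n_terms + 1))
        print(n_terms, decoupling_ratio(powers_of_two), decoupling_ratio(interval))
```

For the powers of two the count is ${2N^2 - N}$, so the ratio stays below ${2^{1/4}}$ uniformly in ${N}$, whereas for the interval ${\{1,\dots,N\}}$ the count is of order ${N^3}$ and the ratio grows like ${N^{1/4}}$.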

In recent years, Bourgain and Demeter have been establishing decoupling theorems in ${L^p({\bf R}^n)}$ spaces for various key exponents of ${2 < p < \infty}$, in the “restriction theory” setting in which the ${f_i}$ are Fourier transforms of measures supported on different portions of a given surface or curve; this builds upon the earlier decoupling theorems of Wolff. In a recent paper with Guth, they established the following decoupling theorem for the curve ${\gamma({\bf R}) \subset {\bf R}^n}$ parameterised by the polynomial curve

$\displaystyle \gamma: t \mapsto (t, t^2, \dots, t^n).$

For any ball ${B = B(x_0,r)}$ in ${{\bf R}^n}$, let ${w_B: {\bf R}^n \rightarrow {\bf R}^+}$ denote the weight

$\displaystyle w_B(x) := \frac{1}{(1 + \frac{|x-x_0|}{r})^{100n}},$

which should be viewed as a smoothed out version of the indicator function ${1_B}$ of ${B}$. In particular, the space ${L^p(w_B) = L^p({\bf R}^n, w_B(x)\ dx)}$ can be viewed as a smoothed out version of the space ${L^p(B)}$. For future reference we observe a fundamental self-similarity of the curve ${\gamma({\bf R})}$: any arc ${\gamma(I)}$ in this curve, with ${I}$ a compact interval, is affinely equivalent to the standard arc ${\gamma([0,1])}$.

Theorem 1 (Decoupling theorem) Let ${n \geq 1}$. Subdivide the unit interval ${[0,1]}$ into ${N}$ equal subintervals ${I_i}$ of length ${1/N}$, and for each such ${I_i}$, let ${f_i: {\bf R}^n \rightarrow {\bf C}}$ be the Fourier transform

$\displaystyle f_i(x) = \int_{\gamma(I_i)} e(x \cdot \xi)\ d\mu_i(\xi)$

of a finite Borel measure ${\mu_i}$ on the arc ${\gamma(I_i)}$, where ${e(\theta) := e^{2\pi i \theta}}$. Then the ${f_i}$ exhibit decoupling in ${L^{n(n+1)}(w_B)}$ for any ball ${B}$ of radius ${N^n}$.

Orthogonality gives the ${n=1}$ case of this theorem. The bi-orthogonality type arguments sketched earlier only give decoupling in ${L^p}$ up to the range ${2 \leq p \leq 2n}$; the point here is that we can now get a much larger value of ${p}$. The ${n=2}$ case of this theorem was previously established by Bourgain and Demeter (who obtained in fact an analogous theorem for any curved hypersurface). The exponent ${n(n+1)}$ (and the radius ${N^n}$) is best possible, as can be seen by the following basic example. If

$\displaystyle f_i(x) := \int_{I_i} e(x \cdot \gamma(\xi)) g_i(\xi)\ d\xi$

where ${g_i}$ is a bump function adapted to ${I_i}$, then standard Fourier-analytic computations show that ${f_i}$ will be comparable to ${1/N}$ on a rectangular box of dimensions ${N \times N^2 \times \dots \times N^n}$ (and thus volume ${N^{n(n+1)/2}}$) centred at the origin, and exhibit decay away from this box, with ${\|f_i\|_{L^{n(n+1)}(w_B)}}$ comparable to

$\displaystyle 1/N \times (N^{n(n+1)/2})^{1/(n(n+1))} = 1/\sqrt{N}.$

On the other hand, ${\sum_{i=1}^N f_i}$ is comparable to ${1}$ on a ball of radius comparable to ${1}$ centred at the origin, so ${\|\sum_{i=1}^N f_i\|_{L^{n(n+1)}(w_B)}}$ is ${\gg 1}$, which is just barely consistent with decoupling. This calculation shows that decoupling will fail if ${n(n+1)}$ is replaced by any larger exponent, and also if the radius of the ball ${B}$ is reduced to be significantly smaller than ${N^n}$.

This theorem has the following consequence of importance in analytic number theory:

Corollary 2 (Vinogradov main conjecture) Let ${s, n, N \geq 1}$ be integers, and let ${\varepsilon > 0}$. Then

$\displaystyle \int_{[0,1]^n} |\sum_{j=1}^N e( j x_1 + j^2 x_2 + \dots + j^n x_n)|^{2s}\ dx_1 \dots dx_n$

$\displaystyle \ll_{\varepsilon,s,n} N^{s+\varepsilon} + N^{2s - \frac{n(n+1)}{2}+\varepsilon}.$

Proof: By the Hölder inequality (and the trivial bound of ${N}$ for the exponential sum), it suffices to treat the critical case ${s = n(n+1)/2}$, that is to say to show that

$\displaystyle \int_{[0,1]^n} |\sum_{j=1}^N e( j x_1 + j^2 x_2 + \dots + j^n x_n)|^{n(n+1)}\ dx_1 \dots dx_n \ll_{\varepsilon,n} N^{\frac{n(n+1)}{2}+\varepsilon}.$

We can rescale this as

$\displaystyle \int_{[0,N] \times [0,N^2] \times \dots \times [0,N^n]} |\sum_{j=1}^N e( x \cdot \gamma(j/N) )|^{n(n+1)}\ dx \ll_{\varepsilon,n} N^{n(n+1)+\varepsilon}.$

As the integrand is periodic along the lattice ${N{\bf Z} \times N^2 {\bf Z} \times \dots \times N^n {\bf Z}}$, this is equivalent to

$\displaystyle \int_{[0,N^n]^n} |\sum_{j=1}^N e( x \cdot \gamma(j/N) )|^{n(n+1)}\ dx \ll_{\varepsilon,n} N^{\frac{n(n+1)}{2}+n^2+\varepsilon}.$

The left-hand side may be bounded by ${\ll \| \sum_{j=1}^N f_j \|_{L^{n(n+1)}(w_B)}^{n(n+1)}}$, where ${B := B(0,N^n)}$ and ${f_j(x) := e(x \cdot \gamma(j/N))}$. Since

$\displaystyle \| f_j \|_{L^{n(n+1)}(w_B)} \ll (N^{n^2})^{\frac{1}{n(n+1)}},$

the claim now follows from the decoupling theorem and a brief calculation. $\Box$
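
To spell out the "brief calculation", here is a small exponent-bookkeeping sketch in Python. It assumes the usual normalisation of decoupling, namely ${\|\sum_j f_j\|_{L^p(w_B)} \ll_\varepsilon N^\varepsilon (\sum_j \|f_j\|_{L^p(w_B)}^2)^{1/2}}$, ignores all ${N^\varepsilon}$ losses, and uses illustrative function names; it only tracks powers of ${N}$.

```python
from fractions import Fraction


def exponent_from_decoupling(n):
    """Power of N in the bound for ||sum_j f_j||^p, with p = n(n+1).

    Assumes ||f_j|| << N^{n^2/p} for each of the N pieces, and the
    decoupling inequality ||sum f_j|| << (sum ||f_j||^2)^{1/2} (epsilons dropped).
    """
    p = n * (n + 1)
    piece = Fraction(n * n, p)            # exponent of each ||f_j||
    l2_sum = Fraction(1, 2) + piece       # (N * N^{2*piece})^{1/2}
    return p * l2_sum


def target_exponent(n):
    """Power of N required on the cube [0, N^n]^n: n(n+1)/2 + n^2."""
    return Fraction(n * (n + 1), 2) + n * n


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, exponent_from_decoupling(n), target_exponent(n))
```

The two columns agree for every ${n}$, which is exactly the statement that decoupling at the exponent ${n(n+1)}$ on a ball of radius ${N^n}$ yields the required bound.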

Using the Plancherel formula, one may equivalently (when ${s}$ is an integer) write the Vinogradov main conjecture in terms of the number of solutions ${j_1,\dots,j_s,k_1,\dots,k_s \in \{1,\dots,N\}}$ to the system of equations

$\displaystyle j_1^i + \dots + j_s^i = k_1^i + \dots + k_s^i \quad \forall i=1,\dots,n,$

but we will not use this formulation here.
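
For small parameters this solution count can be computed by brute force, which gives a concrete feel for the two terms ${N^s}$ and ${N^{2s - n(n+1)/2}}$ in Corollary 2. The following Python sketch (the name `vinogradov_count` is just for illustration) groups the ${s}$-tuples by their vector of power sums and then sums the squares of the group sizes.

```python
from collections import Counter
from itertools import product


def vinogradov_count(s, n, N):
    """Count (j_1..j_s, k_1..k_s) in {1..N}^(2s) with
    j_1^i + ... + j_s^i = k_1^i + ... + k_s^i for all i = 1..n."""
    power_sums = Counter(
        tuple(sum(j ** i for j in js) for i in range(1, n + 1))
        for js in product(range(1, N + 1), repeat=s)
    )
    # Each pair of s-tuples sharing a power-sum vector is one solution.
    return sum(c * c for c in power_sums.values())


if __name__ == "__main__":
    s, n = 3, 2  # the critical case s = n(n+1)/2 for n = 2
    for N in (4, 8, 16):
        count = vinogradov_count(s, n, N)
        print(N, count, count / (N ** s + N ** (2 * s - n * (n + 1) // 2)))
```

For instance, when ${s=n=2}$ the only solutions are the trivial ones in which ${(k_1,k_2)}$ is a permutation of ${(j_1,j_2)}$, so the count is ${2N^2-N}$. The enumeration takes ${N^s}$ steps, so this is only practical for very small ${s}$ and ${N}$.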

A history of the Vinogradov main conjecture may be found in this survey of Wooley; prior to the Bourgain-Demeter-Guth theorem, the conjecture was solved completely for ${n \leq 3}$, or for ${n > 3}$ and ${s}$ either below ${n(n+1)/2 - n/3 + O(n^{2/3})}$ or above ${n(n-1)}$, with the bulk of recent progress coming from the efficient congruencing technique of Wooley. It has numerous applications to exponential sums, Waring’s problem, and the zeta function; to give just one application, the main conjecture implies the predicted asymptotic for the number of ways to express a large number as the sum of ${23}$ fifth powers (the previous best result required ${28}$ fifth powers). The Bourgain-Demeter-Guth approach to the Vinogradov main conjecture, based on decoupling, is ostensibly very different from the efficient congruencing technique, which relies heavily on the arithmetic structure of the problem, but it appears (as I have been told from second-hand sources) that the two methods are actually closely related, with the former being a sort of “Archimedean” version of the latter (with the intervals ${I_i}$ in the decoupling theorem being analogous to congruence classes in the efficient congruencing method); hopefully there will be some future work making this connection more precise. One advantage of the decoupling approach is that it generalises to non-arithmetic settings in which the set ${\{1,\dots,N\}}$ that ${j}$ is drawn from is replaced by some other similarly separated set of real numbers. (A random thought – could this allow the Vinogradov-Korobov bounds on the zeta function to extend to Beurling zeta functions?)

Below the fold we sketch the Bourgain-Demeter-Guth argument proving Theorem 1.

I thank Jean Bourgain and Andrew Granville for helpful discussions.

Let ${\lambda}$ denote the Liouville function. The prime number theorem is equivalent to the estimate

$\displaystyle \sum_{n \leq x} \lambda(n) = o(x)$

as ${x \rightarrow \infty}$, that is to say that ${\lambda}$ exhibits cancellation on large intervals such as ${[1,x]}$. This result can be improved to give cancellation on shorter intervals. For instance, using the known zero density estimates for the Riemann zeta function, one can establish that

$\displaystyle \int_X^{2X} |\sum_{x \leq n \leq x+H} \lambda(n)|\ dx = o( HX ) \ \ \ \ \ (1)$

as ${X \rightarrow \infty}$ if ${X^{1/6+\varepsilon} \leq H \leq X}$ for some fixed ${\varepsilon>0}$; I believe this result is due to Ramachandra (see also Exercise 21 of this previous blog post), and in fact one could obtain a better error term on the right-hand side that for instance gained an arbitrary power of ${\log X}$. On the Riemann hypothesis (or the weaker density hypothesis), it was known that the ${X^{1/6+\varepsilon}}$ could be lowered to ${X^\varepsilon}$.
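
To get a numerical feel for these cancellation statements, one can tabulate ${\lambda}$ with a sieve and evaluate a discrete stand-in for the normalised left-hand side of (1), with the integral over ${x}$ replaced by a sum over integers. The Python sketch below does this; the function names are illustrative, and of course a computation at small ${X}$ says nothing rigorous about the asymptotic regime.

```python
def liouville_sieve(limit):
    """Return lam with lam[n] = (-1)^Omega(n) for 1 <= n <= limit (lam[0] unused)."""
    spf = list(range(limit + 1))  # smallest prime factor of each n
    for p in range(2, int(limit ** 0.5) + 1):
        if spf[p] == p:
            for m in range(p * p, limit + 1, p):
                if spf[m] == m:
                    spf[m] = p
    lam = [0] * (limit + 1)
    if limit >= 1:
        lam[1] = 1
    for n in range(2, limit + 1):
        lam[n] = -lam[n // spf[n]]  # removing one prime factor flips the sign
    return lam


def short_interval_average(lam, X, H):
    """(1/(H X)) * sum_{X <= x < 2X} |sum_{x <= n <= x+H} lam[n]|.

    Requires len(lam) > 2*X - 1 + H.
    """
    prefix = [0]
    for value in lam[1:]:
        prefix.append(prefix[-1] + value)  # prefix[m] = sum_{n <= m} lam[n]
    total = sum(abs(prefix[x + H] - prefix[x - 1]) for x in range(X, 2 * X))
    return total / (H * X)


if __name__ == "__main__":
    X = 10_000
    lam = liouville_sieve(3 * X)
    print("full sum ratio:", sum(lam[1:X + 1]) / X)
    for H in (10, 100, 1000):
        print(H, short_interval_average(lam, X, H))
```

The trivial bound for the averaged quantity is ${(H+1)/H}$, so values well below ${1}$ indicate cancellation on most intervals of length ${H}$.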

Early this year, there was a major breakthrough by Matomaki and Radziwill, who (among other things) showed that the asymptotic (1) was in fact valid for any ${H = H(X)}$ with ${H \leq X}$ that went to infinity as ${X \rightarrow \infty}$, thus yielding cancellation on extremely short intervals. This has many further applications; for instance, this estimate, or more precisely its extension to other “non-pretentious” bounded multiplicative functions, was a key ingredient in my recent solution of the Erdös discrepancy problem, as well as in obtaining logarithmically averaged cases of Chowla’s conjecture, such as

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n} = o(\log x). \ \ \ \ \ (2)$

It is of interest to twist the above estimates by phases such as the linear phase ${n \mapsto e(\alpha n) := e^{2\pi i \alpha n}}$. In 1937, Davenport showed that

$\displaystyle \sup_\alpha |\sum_{n \leq x} \lambda(n) e(\alpha n)| \ll_A x \log^{-A} x$

which of course improves the prime number theorem. Recently with Matomaki and Radziwill, we obtained a common generalisation of this estimate with (1), showing that

$\displaystyle \sup_\alpha \int_X^{2X} |\sum_{x \leq n \leq x+H} \lambda(n) e(\alpha n)|\ dx = o(HX) \ \ \ \ \ (3)$

as ${X \rightarrow \infty}$, for any ${H = H(X) \leq X}$ that went to infinity as ${X \rightarrow \infty}$. We were able to use this estimate to obtain an averaged form of Chowla’s conjecture.

In that paper, we asked whether one could improve this estimate further by moving the supremum inside the integral, that is to say to establish the bound

$\displaystyle \int_X^{2X} \sup_\alpha |\sum_{x \leq n \leq x+H} \lambda(n) e(\alpha n)|\ dx = o(HX) \ \ \ \ \ (4)$

as ${X \rightarrow \infty}$, for any ${H = H(X) \leq X}$ that went to infinity as ${X \rightarrow \infty}$. This bound is asserting that ${\lambda}$ is locally Fourier-uniform on most short intervals; it can be written equivalently in terms of the “local Gowers ${U^2}$ norm” as

$\displaystyle \int_X^{2X} \sum_{1 \leq a \leq H} |\sum_{x \leq n \leq x+H} \lambda(n) \lambda(n+a)|^2\ dx = o( H^3 X )$

from which one can see that this is another averaged form of Chowla’s conjecture (stronger than the one I was able to prove with Matomaki and Radziwill, but a consequence of the unaveraged Chowla conjecture). If one inserted such a bound into the machinery I used to solve the Erdös discrepancy problem, it should lead to further averaged cases of Chowla’s conjecture, such as

$\displaystyle \sum_{n \leq x} \frac{\lambda(n) \lambda(n+1) \lambda(n+2)}{n} = o(\log x), \ \ \ \ \ (5)$

though I have not fully checked the details of this implication. It should also have a number of new implications for sign patterns of the Liouville function, though we have not explored these in detail yet.
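
The link between the supremum over ${\alpha}$ in (4) and the correlation sums in the local ${U^2}$ formulation rests on a Parseval-type identity on a single window: if ${F(\alpha) := \sum_n f(n) e(\alpha n)}$ for a real sequence ${f}$ supported on an interval, then ${\int_0^1 |F(\alpha)|^4\ d\alpha = \sum_a |\sum_n f(n) f(n+a)|^2}$. The following Python sketch checks this numerically for an arbitrary real sequence; the function names are illustrative, and the integral is replaced by an average over ${M = 2L}$ equally spaced frequencies, which is exact for a sequence of length ${L}$ because ${|F|^4}$ is a trigonometric polynomial of degree at most ${2(L-1)}$.

```python
import cmath


def e(theta):
    """e(theta) = exp(2 pi i theta)."""
    return cmath.exp(2j * cmath.pi * theta)


def fourier_values(f, M):
    """Values of F(alpha) = sum_n f[n] e(alpha n) at alpha = k/M, k = 0..M-1."""
    return [sum(fn * e(k * n / M) for n, fn in enumerate(f)) for k in range(M)]


def fourth_moment(f):
    """Integral of |F|^4 over [0,1], computed as an exact average over 2L points."""
    M = 2 * len(f)
    return sum(abs(v) ** 4 for v in fourier_values(f, M)) / M


def autocorrelations(f):
    """c(a) = sum_n f[n] f[n+a] for all shifts a with overlapping support."""
    L = len(f)
    return {a: sum(f[n] * f[n + a] for n in range(max(0, -a), min(L, L - a)))
            for a in range(-(L - 1), L)}


if __name__ == "__main__":
    f = [1, -1, -1, 1, -1, 1, -1, -1, 1, 1]  # lambda(1), ..., lambda(10)
    corr_energy = sum(c * c for c in autocorrelations(f).values())
    print(fourth_moment(f), corr_energy)
```

Combining this identity with the elementary bounds ${\int_0^1 |F|^4 \leq \sup_\alpha |F(\alpha)|^2 \int_0^1 |F|^2}$ and ${\sup_\alpha |F(\alpha)|^2 \leq \sum_a |\sum_n f(n) f(n+a)|}$ is what allows one to pass back and forth between the two formulations, at the cost of changing the precise form of the ${o()}$ decay.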

One can write (4) equivalently in the form

$\displaystyle \int_X^{2X} \sum_{x \leq n \leq x+H} \lambda(n) e( \alpha(x) n + \beta(x) )\ dx = o(HX) \ \ \ \ \ (6)$

uniformly for all ${x}$-dependent phases ${\alpha(x), \beta(x)}$. In contrast, (3) is equivalent to the subcase of (6) when the linear phase coefficient ${\alpha(x)}$ is independent of ${x}$. This dependency of ${\alpha(x)}$ on ${x}$ seems to necessitate some highly nontrivial additive combinatorial analysis of the function ${x \mapsto \alpha(x)}$ in order to establish (4) when ${H}$ is small. To date, this analysis has proven to be elusive, but I would like to record what one can do with more classical methods like Vaughan’s identity, namely:

Proposition 1 The estimate (4) (or equivalently (6)) holds in the range ${X^{2/3+\varepsilon} \leq H \leq X}$ for any fixed ${\varepsilon>0}$. (In fact one can improve the right-hand side by an arbitrary power of ${\log X}$ in this case.)

The values of ${H}$ in this range are far too large to yield implications such as new cases of the Chowla conjecture, but it appears that the ${2/3}$ exponent is the limit of “classical” methods (at least as far as I was able to apply them), in the sense that one does not do any combinatorial analysis on the function ${x \mapsto \alpha(x)}$, nor does one use modern equidistribution results on “Type III sums” that require deep estimates on Kloosterman-type sums. The latter may shave a little bit off of the ${2/3}$ exponent, but I don’t see how one would ever hope to go below ${1/2}$ without doing some non-trivial combinatorics on the function ${x \mapsto \alpha(x)}$. UPDATE: I have come across this paper of Zhan which uses mean-value theorems for L-functions to lower the ${2/3}$ exponent to ${5/8}$.

Let me now sketch the proof of the proposition, omitting many of the technical details. We first remark that known estimates on sums of the Liouville function (or similar functions such as the von Mangoldt function) in short arithmetic progressions, based on zero-density estimates for Dirichlet ${L}$-functions, can handle the “major arc” case of (4) (or (6)) where ${\alpha}$ is restricted to be of the form ${\alpha = \frac{a}{q} + O( X^{-1/6-\varepsilon} )}$ for ${q = O(\log^{O(1)} X)}$ (the exponent here being of the same numerology as the ${X^{1/6+\varepsilon}}$ exponent in the classical result of Ramachandra, tied to the best zero density estimates currently available); for instance a modification of the arguments in this recent paper of Koukoulopoulos would suffice. Thus we can restrict attention to “minor arc” values of ${\alpha}$ (or ${\alpha(x)}$, using the interpretation of (6)).

Next, one breaks up ${\lambda}$ (or the closely related Möbius function) into Dirichlet convolutions using one of the standard identities (e.g. Vaughan’s identity or Heath-Brown’s identity), as discussed for instance in this previous post (which is focused more on the von Mangoldt function, but analogous identities exist for the Liouville and Möbius functions). The exact choice of identity is not terribly important, but the upshot is that ${\lambda(n)}$ can be decomposed into ${\log^{O(1)} X}$ terms, each of which is either of the “Type I” form

$\displaystyle \sum_{d \sim D; m \sim M: dm=n} a_d$

for some coefficients ${a_d}$ that are roughly of logarithmic size on the average, and scales ${D, M}$ with ${D \ll X^{2/3}}$ and ${DM \sim X}$, or else of the “Type II” form

$\displaystyle \sum_{d \sim D; m \sim M: dm=n} a_d b_m$

for some coefficients ${a_d, b_m}$ that are roughly of logarithmic size on the average, and scales ${D,M}$ with ${X^{1/3} \ll D,M \ll X^{2/3}}$ and ${DM \sim X}$. As discussed in the previous post, the ${2/3}$ exponent is a natural barrier in these identities if one is unwilling to also consider “Type III” type terms which are roughly of the shape of the third divisor function ${\tau_3(n) := \sum_{d_1d_2d_3=n} 1}$.

A Type I sum makes a contribution to ${ \sum_{x \leq n \leq x+H} \lambda(n) e( \alpha(x) n + \beta(x) )}$ that can be bounded (via Cauchy-Schwarz) in terms of an expression such as

$\displaystyle \sum_{d \sim D} | \sum_{x/d \leq m \leq x/d+H/d} e(\alpha(x) dm )|^2.$

The inner sum exhibits a lot of cancellation unless ${\alpha(x) d}$ is within ${O(D/H)}$ of an integer. (Here, “a lot” should be loosely interpreted as “gaining many powers of ${\log X}$ over the trivial bound”.) Since ${H}$ is significantly larger than ${D}$, standard Vinogradov-type manipulations (see e.g. Lemma 13 of these previous notes) show that this bad case occurs for many ${d}$ only when ${\alpha}$ is “major arc”, which is the case we have specifically excluded. This lets us dispose of the Type I contributions.
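
The cancellation in the inner sum is just the geometric series bound: a sum of ${L}$ consecutive values of ${m \mapsto e(\theta m)}$ has magnitude at most ${\min(L, \frac{1}{2\|\theta\|})}$, where ${\|\theta\|}$ is the distance from ${\theta}$ to the nearest integer. With ${\theta = \alpha(x) d}$ and ${L \approx H/d}$ this is non-trivial precisely when ${\|\alpha(x) d\|}$ is much larger than ${D/H}$. A quick numerical illustration in Python (all names are illustrative):

```python
import cmath
import math


def dist_to_integer(theta):
    """Distance from theta to the nearest integer."""
    return abs(theta - round(theta))


def linear_phase_sum(theta, start, length):
    """sum_{start <= m < start + length} e(theta * m)."""
    return sum(cmath.exp(2j * math.pi * theta * m)
               for m in range(start, start + length))


def geometric_bound(theta, length):
    """min(length, 1 / (2 ||theta||)), the standard geometric series bound."""
    d = dist_to_integer(theta)
    return length if d == 0 else min(length, 1 / (2 * d))


if __name__ == "__main__":
    length = 200
    for theta in (0.0005, 0.01, 0.1, 0.37):
        size = abs(linear_phase_sum(theta, 1000, length))
        print(theta, round(size, 3), round(geometric_bound(theta, length), 3))
```

The bound follows from the closed form ${|\sin(\pi L \theta)/\sin(\pi \theta)|}$ for the magnitude of the sum, together with ${|\sin(\pi\theta)| \geq 2\|\theta\|}$.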

A Type II sum makes a contribution to ${ \sum_{x \leq n \leq x+H} \lambda(n) e( \alpha(x) n + \beta(x) )}$ roughly of the form

$\displaystyle \sum_{d \sim D} | \sum_{x/d \leq m \leq x/d+H/d} b_m e(\alpha(x) dm)|.$

We can break this up into a number of sums roughly of the form

$\displaystyle \sum_{d = d_0 + O( H / M )} | \sum_{x/d_0 \leq m \leq x/d_0 + H/D} b_m e(\alpha(x) dm)|$

for ${d_0 \sim D}$; note that the ${d}$ range is non-trivial because ${H}$ is much larger than ${M}$. Applying the usual bilinear sum Cauchy-Schwarz methods (e.g. Theorem 14 of these notes) we conclude that there is a lot of cancellation unless one has ${\alpha(x) = a/q + O( \frac{X \log^{O(1)} X}{H^2} )}$ for some ${q = O(\log^{O(1)} X)}$. But with ${H \geq X^{2/3+\varepsilon}}$, ${X \log^{O(1)} X/H^2}$ is well below the threshold ${X^{-1/6-\varepsilon}}$ for the definition of major arc, so we can exclude this case and obtain the required cancellation.
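
The exponent comparison at the end can be checked mechanically. Writing ${H = X^h}$ and ignoring logarithmic factors, the Type II exceptional set has width ${X^{1-2h}}$, while the major arcs as defined above have width ${X^{-1/6-\varepsilon}}$. The Python sketch below (with illustrative function names) compares the two exponents in exact arithmetic.

```python
from fractions import Fraction


def type_ii_width_exponent(h):
    """Exponent of X in X / H^2 when H = X^h (log factors ignored)."""
    return 1 - 2 * h


def major_arc_width_exponent(eps):
    """Exponent of X in the major arc width X^(-1/6 - eps)."""
    return -Fraction(1, 6) - eps


if __name__ == "__main__":
    eps = Fraction(1, 100)
    h = Fraction(2, 3) + eps
    print(type_ii_width_exponent(h), major_arc_width_exponent(eps))
```

With ${h = 2/3+\varepsilon}$ the first exponent is ${-1/3-2\varepsilon}$, comfortably below ${-1/6-\varepsilon}$, so this particular comparison is not the bottleneck; the ${2/3}$ constraint comes instead from the need for ${H}$ to exceed the scales ${D, M \ll X^{2/3}}$ in the Type I and Type II decompositions.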