In this final set of lecture notes for this course, we discuss one of the motivating applications of the theory developed thus far, namely to count solutions to linear equations in primes {{\mathcal P} = \{2,3,5,7,\ldots\}} (or in dense subsets {A} of primes {{\mathcal P}}). Unfortunately, the most famous linear equations in primes – the twin prime equation {p_2 - p_1 = 2} and the even Goldbach equation {p_1+p_2=N} – remain out of reach of this technology (because the relevant affine linear forms involved are commensurate, and thus have infinite complexity with respect to the Gowers norms), but most other systems of equations, in particular that of arithmetic progressions {p_i = n+ir} for {i=0,\ldots,k-1} (or equivalently, {p_i + p_{i+2} = 2p_{i+1}} for {i=0,\ldots,k-2}), as well as the odd Goldbach equation {p_1+p_2+p_3=N}, are tractable.

To illustrate the main ideas, we will focus on the following result of Green:

Theorem 1 (Roth’s theorem in the primes) Let {A \subset {\mathcal P}} be a subset of primes whose upper density {\limsup_{N \rightarrow \infty} |A \cap [N]|/|{\mathcal P} \cap [N]|} is positive. Then {A} contains infinitely many arithmetic progressions of length three.

This should be compared with Roth’s theorem in the integers (Notes 2), which is the same statement but with the primes {{\mathcal P}} replaced by the integers {{\bf Z}} (or natural numbers {{\bf N}}). Indeed, Roth’s theorem for the primes is proven by transferring Roth’s theorem for the integers to the prime setting; the latter theorem is used as a “black box”. The key difficulty here in performing this transference is that the primes have zero density inside the integers; indeed, from the prime number theorem we have {|{\mathcal P} \cap [N]| = (1+o(1)) \frac{N}{\log N} = o(N)}.

There are a number of generalisations of this transference technique. In a paper of Green and myself, we extended the above theorem to progressions of longer length (thus transferring Szemerédi’s theorem to the primes). In a series of papers of Green, myself, and Ziegler (culminating in a paper to appear shortly), related methods are also used to obtain an asymptotic for the number of solutions in the primes to any system of linear equations of bounded complexity. This latter result uses the full power of higher order Fourier analysis, in particular relying heavily on the inverse conjecture for the Gowers norms; in contrast, Roth’s theorem and Szemerédi’s theorem in the primes are “softer” results that do not need this conjecture.

To transfer results from the integers to the primes, there are three basic steps:

  • A general transference principle, that transfers certain types of additive combinatorial results from dense subsets of the integers to dense subsets of a suitably “pseudorandom set” of integers (or more precisely, to the integers weighted by a suitably “pseudorandom measure”);
  • An application of sieve theory to show that the primes (or more precisely, an affine modification of the primes) lie inside a suitably pseudorandom set of integers (or more precisely, have significant mass with respect to a suitably pseudorandom measure);
  • If one is seeking asymptotics for patterns in the primes, and not simply lower bounds, one also needs to control correlations between the primes (or proxies for the primes, such as the Möbius function) with various objects that arise from higher order Fourier analysis, such as nilsequences.

The first step can be accomplished in a number of ways. For progressions of length three (and more generally, for controlling linear patterns of complexity at most one), transference can be accomplished by Fourier-analytic methods. For more complicated patterns, one can use techniques inspired by ergodic theory; more recently, simplified and more efficient methods based on duality (the Hahn-Banach theorem) have also been used. No number theory is used in this step. (In the case of transference to genuinely random sets, rather than pseudorandom sets, similar ideas appeared earlier in the graph theory setting; see this paper of Kohayakawa, Luczak, and Rodl.)

The second step is accomplished by fairly standard sieve theory methods (e.g. the Selberg sieve, or the slight variants of this sieve used by Goldston and Yildirim). Remarkably, very little of the formidable apparatus of modern analytic number theory is needed for this step; for instance, the only fact about the Riemann zeta function that is truly needed is that it has a simple pole at {s=1}, and no knowledge of L-functions is needed.

The third step does draw more significantly on analytic number theory techniques and results (most notably, the method of Vinogradov to compute oscillatory sums over the primes, and also the Siegel-Walfisz theorem that gives a good error term on the prime number theorem in arithmetic progressions). As these techniques are somewhat orthogonal to the main topic of this course, we shall only touch briefly on this aspect of the transference strategy.

— 1. Transference —

The transference principle is not a single theorem, but is instead a family of related results with a common purpose, namely to show that a sufficiently pseudorandom set, measure, or probability distribution will be “indistinguishable” from the whole set (or the uniform measure or probability distribution) in certain statistical senses. A key tool in this regard is a dense model theorem that allows one to approximate or model any set or function that is dense with respect to a pseudorandom measure, by a set or function which is dense with respect to the uniform measure. It turns out that one can do this as long as the approximation is made with respect to a sufficiently weak topology; for the applications to counting arithmetic patterns, it turns out that the topology given by the Gowers norms is the right one to use. The somewhat complicated nature of these norms, though, does make the verification of the required pseudorandomness properties slightly tricky.

We illustrate these themes with Roth’s theorem, though the general strategy applies to several other results in additive combinatorics. We begin with Roth’s theorem in a cyclic group {{\bf Z}/N{\bf Z}}, which we phrase as follows:

Theorem 2 (Roth’s theorem in {{\bf Z}/N{\bf Z}}) Let {N} be odd. If {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} is a function obeying the pointwise bound {0 \leq f \leq 1} and the lower bound {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} f(n) \geq \delta > 0}, then one has {\Lambda(f,f,f) \geq c(\delta)} for some {c(\delta)>0}, where {\Lambda(f,g,h) := \mathop{\bf E}_{n,r \in {\bf Z}/N{\bf Z}} f(n) g(n+r) h(n+2r)}.

We assume this theorem as a “black box”, in that we will not care as to how this theorem is proven. As noted in previous notes, this theorem easily implies the existence of non-trivial arithmetic progressions of length three in any subset {A} of {[N/3]} (say) with {|A| \geq \delta N}, as long as {N} is sufficiently large depending on {\delta}, as it provides a non-trivial lower bound on {\Lambda(1_A,1_A,1_A)}.
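To make this concrete, here is a quick brute-force evaluation of {\Lambda(1_A,1_A,1_A)} (a minimal Python sketch; the values of {N} and {A} are arbitrary illustrative choices, not taken from the text). The trivial progressions with {r=0} contribute only {|A|/N^2}, so any appreciably larger value certifies a non-trivial progression:

```python
# Brute-force evaluation of Lambda(f,g,h) = E_{n,r in Z/NZ} f(n) g(n+r) h(n+2r).
# N and A below are arbitrary illustrative choices.
N = 101  # an odd modulus

def Lam(f, g, h):
    total = sum(f[n] * g[(n + r) % N] * h[(n + 2 * r) % N]
                for n in range(N) for r in range(N))
    return total / N ** 2

A = set(range(0, N // 3, 2))            # a dense subset of [N/3]
f = [1.0 if n in A else 0.0 for n in range(N)]
print(Lam(f, f, f), len(A) / N ** 2)    # first value far exceeds the trivial r=0 term
```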

Now we generalise the above theorem. We view {N} as an (odd) parameter going off to infinity, and use {o_{N \rightarrow \infty}(1)} to denote any quantity that goes to zero as {N \rightarrow \infty}. We define a measure (or more precisely, a weight function) to be a non-negative function {\nu: {\bf Z}/N{\bf Z} \rightarrow {\bf R}^+} depending on {N}, such that {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} \nu(n) = 1 + o_{N \rightarrow\infty}(1)}, thus {\nu} is basically the density function of a probability distribution on {{\bf Z}/N{\bf Z}}. We say that {\nu} is Roth-pseudorandom if for every {\delta > 0} (independent of {N}) there exists {c_\nu(\delta) > 0} such that one has the lower bound

\displaystyle  \Lambda(f,f,f) \geq c_\nu(\delta) + o_{N \rightarrow \infty;\delta}(1)

whenever {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} is a function obeying the pointwise bound {0 \leq f \leq \nu} and the lower bound {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} f \geq \delta}, and {o_{N \rightarrow \infty;\delta}(1)} goes to zero as {N \rightarrow \infty} for any fixed {\delta}. Thus, Roth’s theorem asserts that the uniform measure {1} is Roth-pseudorandom. Observe that if {\nu} is Roth-pseudorandom, then any subset {A} of {[N/3]} whose weighted density {\nu(A) := \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} 1_A(n) \nu(n)} is at least {\delta} will contain a non-trivial arithmetic progression of length three, if {N} is sufficiently large depending on {\delta}, as we once again obtain a non-trivial lower bound on {\Lambda(1_A,1_A,1_A)} in this case. Thus it is of interest to establish Roth-pseudorandomness for a wide class of measures.

Exercise 1 Show that if {\nu} is Roth-pseudorandom, and {\eta} is another measure which is “uniformly absolutely continuous” with respect to {\nu} in the sense that one has the bound {\eta(A) \leq f(\nu(A)) + o_{N \rightarrow \infty}(1)} for all {A \subset {\bf Z}/N{\bf Z}} and some function {f: {\bf R}^+ \rightarrow {\bf R}^+} with {f(x) \rightarrow 0} as {x \rightarrow 0}, then {\eta} is also Roth-pseudorandom.

In view of the above exercise, the case of measures that are absolutely continuous with respect to the uniform distribution is uninteresting: the important case is instead when {\eta} is “singular” with respect to the uniform measure, in the sense that it is concentrated on a set of density {o_{N \rightarrow \infty}(1)} with respect to uniform measure, as this will allow us to detect progressions of length three in sparse sets.

A model example to keep in mind of a candidate for a Roth-pseudorandom measure is a random sparse measure of some small density {0 < p \ll 1}, in which each {\nu(n)} is an independent random variable that equals {1/p} with probability {p} and {0} otherwise. The case {p = 1/\log N} can be thought of as a crude model for the primes (cf. Cramér’s random model for the primes).
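This random model is easy to simulate. The following sketch (in Python, with illustrative values of {N}, {p} and the random seed) checks that the model is indeed a measure in the above sense, and that its non-zero Fourier coefficients are small (anticipating the Fourier-pseudorandomness condition (3) below):

```python
import cmath, math, random

# Simulation of the random sparse measure: nu(n) = 1/p with probability p, 0 otherwise.
# N and the seed are illustrative choices; p = 1/log N mimics the density of the primes.
random.seed(0)
N = 10007
p = 1 / math.log(N)
nu = [1 / p if random.random() < p else 0.0 for _ in range(N)]

print(sum(nu) / N)   # E nu = 1 + o(1) with high probability

def nu_hat(xi):      # normalised Fourier coefficient E_n nu(n) e(-xi n / N)
    return sum(nu[n] * cmath.exp(-2j * math.pi * xi * n / N) for n in range(N)) / N

print(max(abs(nu_hat(xi)) for xi in range(1, 40)))   # small at non-zero frequencies
```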

Recall that the form {\Lambda(f,g,h)} is controlled by the {U^2} norm in the sense that one has the inequality

\displaystyle  |\Lambda(f,g,h)| \leq \|f\|_{U^2({\bf Z}/N{\bf Z})}

whenever {f, g, h: {\bf Z}/N{\bf Z} \rightarrow {\bf C}} are bounded in magnitude by {1}, and similarly for permutations. Actually one has the slightly more precise inequality

\displaystyle  |\Lambda(f,g,h)| \leq \|f\|_{u^2({\bf Z}/N{\bf Z})}

where

\displaystyle  \|f\|_{u^2({\bf Z}/N{\bf Z})} := \sup_{\xi \in {\bf Z}/N{\bf Z}} |\hat f(\xi)|

as can easily be seen from the identity

\displaystyle  \Lambda(f,g,h) = \sum_{\xi \in {\bf Z}/N{\bf Z}} \hat f(\xi) \hat g(-2\xi) \hat h(\xi), \ \ \ \ \ (1)

Hölder’s inequality, and the Plancherel identity.
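As a sanity check, the identity (1) can be verified numerically; the following sketch uses random test functions and the normalisation {\hat f(\xi) := \mathop{\bf E}_n f(n) e(-\xi n/N)} (all parameters illustrative):

```python
import cmath, math, random

# Numerical check of identity (1): Lambda(f,g,h) = sum_xi fhat(xi) ghat(-2 xi) hhat(xi).
random.seed(1)
N = 51  # odd

f, g, h = ([random.uniform(-1, 1) for _ in range(N)] for _ in range(3))

def hat(F):  # normalised Fourier transform on Z/NZ
    return [sum(F[n] * cmath.exp(-2j * math.pi * xi * n / N) for n in range(N)) / N
            for xi in range(N)]

fh, gh, hh = hat(f), hat(g), hat(h)
lhs = sum(f[n] * g[(n + r) % N] * h[(n + 2 * r) % N]
          for n in range(N) for r in range(N)) / N ** 2
rhs = sum(fh[xi] * gh[(-2 * xi) % N] * hh[xi] for xi in range(N))
print(lhs, rhs.real)  # the two agree up to rounding error
```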

This suggests a strategy to establish the Roth-pseudorandomness of a measure by showing that functions {f} dominated by that measure can be approximated in {u^2} norm by functions dominated instead by the uniform measure {1}. Indeed, we have

Lemma 3 (Criterion for Roth-pseudorandomness) Suppose we have a measure {\nu} with the following properties:

  • (Control by {u^2}) For any {f,g,h: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} with the pointwise bound {|f|, |g|, |h| \leq \nu+1}, one has {|\Lambda(f,g,h)| \leq \alpha(\|f\|_{u^2({\bf Z}/N{\bf Z})}) + o_{N \rightarrow \infty}(1)}, where {\alpha: {\bf R}^+ \rightarrow {\bf R}^+} is a function with {\alpha(x) \rightarrow 0} as {x \rightarrow 0}, and similarly for permutations.
  • (Approximation in {u^2}) For any {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} with the pointwise bound {0 \leq f \leq \nu}, and any {\epsilon > 0}, there exists {g: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} with the pointwise bound {0 \leq g \leq 1 + o_{N \rightarrow \infty;\epsilon}(1)} such that {\|f-g\|_{u^2({\bf Z}/N{\bf Z})} \leq \epsilon + o_{N \rightarrow \infty; \epsilon}(1)}.

Then {\nu} is Roth-pseudorandom.

Proof: Let {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} be such that {0 \leq f \leq \nu} and {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} f \geq \delta}. Let {\epsilon > 0} be a small number to be chosen later. We then use the approximation property to split {f = g + (f-g)} with {g} as in the second hypothesis. Since

\displaystyle  |\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} f(n) - g(n) | \leq \|f-g\|_{u^2({\bf Z}/N{\bf Z})} \leq \epsilon + o_{N \rightarrow \infty;\epsilon}(1)

we have from the triangle inequality that

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} g(n) \geq \delta - \epsilon - o_{N \rightarrow \infty;\epsilon}(1)

and in particular

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} g(n) \geq \delta/2

for {N} large enough. Similarly we have {0 \leq g \leq 2} (say) for {N} large enough. From Roth’s theorem we conclude that

\displaystyle  \Lambda(g,g,g) \gg c(\delta/4)

for {N} large enough. On the other hand, by the first hypothesis, the other seven terms in

\displaystyle  \Lambda(f,f,f) = \Lambda( g+(f-g), g+(f-g), g+(f-g) )

are {O( \alpha( O( \epsilon ) ) ) + o_{N \rightarrow \infty;\epsilon}(1)} for {N} large enough. If {\epsilon} is sufficiently small depending on {\delta}, we obtain the claim. \Box

Note that this argument in fact gives a value of {c_\nu(\delta)} that is essentially the same as {c(\delta)}. Also, we see that the {u^2} norm here could be replaced by the {U^2} norm, or indeed by any other quantity which is strong enough for the control hypothesis to hold, and also weak enough for the approximation property to hold.

So now we need to find some conditions on {\nu} that will allow us to obtain both the control and approximation properties. We begin with the control property. One way to accomplish this is via a restriction estimate:

Lemma 4 (Restriction estimate implies control) Let {\nu} be a measure. Suppose there exists an exponent {2 < q < 3} such that one has the restriction estimate

\displaystyle  \|\hat f \|_{\ell^q({\bf Z}/N{\bf Z})} \leq C \ \ \ \ \ (2)

whenever {f: {\bf Z}/N{\bf Z} \rightarrow {\bf C}} obeys the pointwise bound {|f| \leq \nu}, where {C} is independent of {N}. Then {\nu} enjoys the control in {u^2} property from Lemma 3.

Proof: From Plancherel’s theorem, we see that (2) already holds if we have {|f| \leq 1}, so by the triangle inequality it also holds (with a slightly different value of {C}) if {|f| \leq \nu+1}.

Now suppose that {|f|, |g|, |h| \leq \nu+1}. From (1) and Hölder’s inequality one has

\displaystyle  |\Lambda(f,g,h)| \leq \|\hat f\|_{\ell^q({\bf Z}/N{\bf Z})}^{q-2} \|\hat f \|_{\ell^\infty({\bf Z}/N{\bf Z})}^{3-q} \|\hat g\|_{\ell^q({\bf Z}/N{\bf Z})} \|\hat h\|_{\ell^q({\bf Z}/N{\bf Z})}

and thus by (2)

\displaystyle  |\Lambda(f,g,h)| \leq C^q \|f \|_{u^2({\bf Z}/N{\bf Z})}^{3-q}

and the claim follows. \Box

Exercise 2 Show that the estimate (2) for {q \leq 2} can only hold when {\nu} is bounded uniformly in {N}; this explains the presence of the hypothesis {q>2} in the above condition.

Exercise 3 Show that the estimate (2) is equivalent to the estimate

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} |\sum_{\xi \in {\bf Z}/N{\bf Z}} g(\xi) e(\xi n/N)| \nu(n) \leq C \|g\|_{\ell^{q'}({\bf Z}/N{\bf Z})}

for all {g: {\bf Z}/N{\bf Z} \rightarrow {\bf C}}, where {q' := q/(q-1)} is the dual exponent to {q}. Informally, this asserts that a Fourier series with {\ell^{q'}} coefficients can be “restricted” to the support of {\nu} in a uniformly absolutely integrable manner (relative to {\nu}). Historically, this is the origin of the term “restriction theorem” (in the context where {{\bf Z}/N{\bf Z}} is replaced with a Euclidean space such as {{\bf R}^n}, and {\nu} is surface measure on a manifold such as the sphere {S^{n-1}}). See for instance my notes on this topic.

We will briefly discuss the standard method for establishing restriction estimates, namely the Tomas-Stein argument or large sieve inequality, later in these notes.

Now we turn to the approximation property. The approximation {g} to {f} needs to be close in {u^2} norm, i.e. the Fourier coefficients need to be uniformly close. One attempt to accomplish this is hard thresholding: one simply discards all Fourier coefficients in the Fourier expansion

\displaystyle  f(n) = \sum_{\xi \in {\bf Z}/N{\bf Z}} \hat f(\xi) e(n \xi / N )

of {f} that are too small, thus setting {g} equal to something like

\displaystyle  g(n) = \sum_{\xi \in {\bf Z}/N{\bf Z}: |\hat f(\xi)| \geq \epsilon} \hat f(\xi) e(n \xi / N ).

The main problem with this choice is that there is no guarantee that the non-negativity of {f} will transfer over to the non-negativity of {g}; also, there is no particular reason why {g} would be bounded.

But a small modification of this idea does work, as follows. Let {S := \{ \xi \in {\bf Z}/N{\bf Z}: |\hat f(\xi)| \geq \epsilon \}} denote the large Fourier coefficients of {f}. The function {g} proposed above can be viewed as a convolution {f*K}, where {K(n) := \sum_{\xi \in S} e(n \xi/N)} and {f*K(n) := \mathop{\bf E}_{m \in {\bf Z}/N{\bf Z}} f(m) K(n-m)}. The inability to get good pointwise bounds on {f*K} can be traced back to the oscillatory nature of the convolution kernel {K} (which can be viewed as a generalised Dirichlet kernel).

But experience with Fourier analysis tells us that the behaviour of such convolutions improves if one replaces the Dirichlet-type kernels with something more like a Fejér type kernel instead. With that in mind, we try

\displaystyle  g(n) := \mathop{\bf E}_{m_1, m_2 \in B} f(n+m_1-m_2)

where {B} is the Bohr set

\displaystyle  B := \{ n \in {\bf Z}/N{\bf Z}: |e( n \xi / N) - 1| \leq \epsilon \hbox{ for all } \xi \in S \}.

Clearly, if {f} is non-negative, then {g} is also. Now we look at upper bounds on {g}. Clearly

\displaystyle  g(n) \leq \mathop{\bf E}_{m_1, m_2 \in B} \nu(n+m_1-m_2)

so by Fourier expansion

\displaystyle  \|g\|_{L^\infty({\bf Z}/N{\bf Z})} \leq \sum_{\xi \in {\bf Z}/N{\bf Z}} |\mathop{\bf E}_{m \in B} e(\xi m/N)|^2 |\hat \nu(\xi)|.

Let us make the Fourier-pseudorandomness assumption

\displaystyle  \sup_{\xi \neq 0} |\hat \nu(\xi)| = o_{N \rightarrow \infty}(1). \ \ \ \ \ (3)

Evaluating the {\xi=0} term on the RHS separately, we conclude

\displaystyle  \|g\|_{L^\infty({\bf Z}/N{\bf Z})} \leq 1 +o_{N \rightarrow \infty}( \sum_{\xi \in {\bf Z}/N{\bf Z}} |\mathop{\bf E}_{m \in B} e(\xi m/N)|^2 ).

By Plancherel’s theorem we have

\displaystyle  \sum_{\xi \in {\bf Z}/N{\bf Z}} |\mathop{\bf E}_{m \in B} e(\xi m/N)|^2 = N/|B|.

From the Kronecker approximation theorem we have

\displaystyle  |B| / N \gg (\epsilon/10)^{|S|}

(say). Finally, if we assume (2) we have {|S| \ll \epsilon^{-q}}. Putting this all together we obtain the pointwise bound

\displaystyle  g \leq 1 + o_{N \rightarrow \infty; q,\epsilon}(1).

Finally, we see how {g} approximates {f}. From Fourier analysis one has

\displaystyle  \hat g(\xi) = \hat f(\xi) |\mathop{\bf E}_{m \in B} e(\xi m/N)|^2

and so

\displaystyle  \|f-g\|_{u^2({\bf Z}/N{\bf Z})} = \sup_{\xi \in {\bf Z}/N{\bf Z}} |\hat f(\xi)| (1 - |\mathop{\bf E}_{m \in B} e(\xi m/N)|^2).

The frequencies {\xi} that lie outside {S} give a contribution of at most {\epsilon} by the definition of {S}, so we look now at the terms where {\xi \in S}. From the definition of {B} and the triangle inequality we have

\displaystyle  |\mathop{\bf E}_{m \in B} e(\xi m/N) - 1| \leq \epsilon

in such cases, while from the measure nature of {\nu} we have

\displaystyle  |\hat f(\xi)| \leq \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} \nu(n) = 1 + o_{N \rightarrow \infty}(1).

Putting this all together, we obtain

\displaystyle  \|f-g\|_{u^2({\bf Z}/N{\bf Z})} \ll \epsilon + o_{N \rightarrow \infty}(1).
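The whole construction is easy to run numerically. The following Python sketch does so for a toy function with a planted large Fourier coefficient; for simplicity we take {\nu = 1} here, and {N}, {\epsilon} and the planted frequency are illustrative choices:

```python
import cmath, math, random

# Toy run of the Bohr-set smoothing g(n) = E_{m1,m2 in B} f(n + m1 - m2).
# nu = 1 here for simplicity; N, eps and the planted frequency xi0 are illustrative.
random.seed(2)
N, eps, xi0 = 101, 0.1, 7
f = [0.5 + 0.4 * math.cos(2 * math.pi * xi0 * n / N)
     + random.uniform(-0.05, 0.05) for n in range(N)]          # 0 <= f <= 1

def hat(F):
    return [sum(F[n] * cmath.exp(-2j * math.pi * xi * n / N) for n in range(N)) / N
            for xi in range(N)]

fh = hat(f)
S = [xi for xi in range(N) if abs(fh[xi]) >= eps]              # large spectrum
B = [m for m in range(N)
     if all(abs(cmath.exp(2j * math.pi * m * xi / N) - 1) <= eps for xi in S)]
g = [sum(f[(n + m1 - m2) % N] for m1 in B for m2 in B) / len(B) ** 2
     for n in range(N)]

gh = hat(g)
print(min(g), max(g))                                  # g stays non-negative and bounded
print(max(abs(fh[xi] - gh[xi]) for xi in range(N)))    # ||f - g||_{u^2} = O(eps)
```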

To summarise, we have the following result, which essentially appears in this paper of Green and myself:

Theorem 5 (Criterion for Roth-pseudorandomness) Let {\nu} be a measure obeying the Fourier-pseudorandomness assumption (3) and the restriction estimate (2) for some {2 < q < 3}. Then {\nu} is Roth-pseudorandom.

This turns out to be a fairly tractable criterion for establishing the Roth-pseudorandomness of various measures, which in turn can be used to detect progressions of length three (and related patterns) on various sparse sets, such as the primes; see the next section.

The above arguments to establish Roth-pseudorandomness relied heavily on linear Fourier analysis. Now we give an alternate approach that avoids Fourier analysis entirely; it is less efficient and a bit messier, but will extend in a fairly straightforward (but notationally intensive) manner to higher order patterns. To do this, we replace the {u^2} norm in Lemma 3 with the {U^2} norm, so we now have to verify a control by {U^2} hypothesis and an approximation by {U^2} hypothesis.

We begin with the control by {U^2} hypothesis. Instead of Fourier analysis, we will rely solely on the Cauchy-Schwarz inequality, using a weighted version of the arguments from Notes 3 that first appeared in this paper of Green and myself. We wish to control the expression

\displaystyle  \Lambda(f,g,h)=\mathop{\bf E}_{n,r \in {\bf Z}/N{\bf Z}} f(n) g(n+r) h(n+2r)

where {f,g,h} are bounded in magnitude by {\nu+1}. For simplicity we will just assume that {f,g,h} are bounded in magnitude by {\nu}; the more general case is similar but a little bit messier. For brevity we will also omit the domain {{\bf Z}/N{\bf Z}} in the averages, and also abbreviate {o_{N \rightarrow \infty}(1)} as {o(1)}. We make the change of variables {(n,r) = (b+2c,-a-b-c)} to write this expression as

\displaystyle  \mathop{\bf E}_{a,b,c} f(b+2c) g(c-a) h(-2a-b)

the point being that each term involves only two of the three variables {a,b,c}.

We can pointwise bound {h} by {\nu} and estimate the above expression in magnitude by

\displaystyle  \mathop{\bf E}_{a,b} |\mathop{\bf E}_c f(b+2c) g(c-a)| \nu(-2a-b).

Since {\mathop{\bf E} \nu = 1+o(1)}, we can use Cauchy-Schwarz and bound this by

\displaystyle  (1+o(1)) (\mathop{\bf E}_{a,b} |\mathop{\bf E}_c f(b+2c) g(c-a)|^2 \nu(-2a-b))^{1/2}

which we rewrite as

\displaystyle  (1+o(1)) (\mathop{\bf E}_{a,b,c,c'} f(b+2c) f(b+2c') g(c-a) g(c'-a) \nu(-2a-b))^{1/2}.

We now bound {g} by {\nu}, to obtain

\displaystyle  (1+o(1)) (\mathop{\bf E}_{a,c,c'} \nu(c-a) \nu(c'-a) |\mathop{\bf E}_b f(b+2c) f(b+2c') \nu(-2a-b)|)^{1/2}.

If we make the hypothesis

\displaystyle  \mathop{\bf E}_{a,c,c'} \nu(c-a) \nu(c'-a) = 1+o(1) \ \ \ \ \ (4)

(which is a variant of (3), as can be seen by expanding out using Fourier analysis), followed by Cauchy-Schwarz, we can bound this by

\displaystyle  (1+o(1)) (\mathop{\bf E}_{a,c,c'} \nu(c-a) \nu(c'-a) |\mathop{\bf E}_b f(b+2c) f(b+2c') \nu(-2a-b)|^2)^{1/4}.

We expand this out as

\displaystyle  (1+o(1)) |\mathop{\bf E}_{b,b',c,c'} f(b+2c) f(b'+2c) f(b+2c') f(b'+2c') F(b,b',c,c')|^{1/4}

where

\displaystyle  F(b,b',c,c') := \mathop{\bf E}_a \nu(c-a)\nu(c'-a) \nu(-2a-b) \nu(-2a-b').

If the {F} factor could be replaced by {1}, then the expression inside the absolute values would just be {\|f\|_{U^2({\bf Z}/N{\bf Z})}^4}, which is what we wanted. Applying the triangle inequality and bounding {f} by {\nu}, we can thus bound the previous expression by

\displaystyle  \ll ( \|f\|_{U^2({\bf Z}/N{\bf Z})}^4 + \mathop{\bf E}_{b,b',c,c'} \nu(b+2c) \nu(b'+2c) \nu(b+2c') \nu(b'+2c') |F(b,b',c,c')-1| )^{1/4}.

If we make the hypotheses

\displaystyle  \mathop{\bf E}_{b,b',c,c'} \nu(b+2c) \nu(b'+2c) \nu(b+2c') \nu(b'+2c') F(b,b',c,c')^i = 1+o(1) \ \ \ \ \ (5)

for {i=0,1,2}, then another application of Cauchy-Schwarz gives

\displaystyle  \mathop{\bf E}_{b,b',c,c'} \nu(b+2c) \nu(b'+2c) \nu(b+2c') \nu(b'+2c') |F(b,b',c,c')-1| = o(1)

and so we have obtained the control in {U^2} hypothesis (at least for {f}, and for functions bounded by {\nu} rather than {\nu+1}), assuming the conditions (4), (5). We refer to such conditions (involving the product of {\nu} evaluated at distinct linear forms on the left-hand side, and a {1+o(1)} on the right-hand side) as linear forms conditions. Generalising to the case of functions bounded by {\nu+1}, and permuting {f,g,h}, we soon obtain the following result (stated somewhat informally):

Lemma 6 (Generalised von Neumann theorem) If {\nu} obeys a certain finite list of linear forms conditions, then the control by {U^2} hypothesis in Lemma 3 holds.

Now we turn to the approximation in {U^2} property. It is possible to establish this approximation property by an energy increment method, analogous to the energy increment proof of Roth’s theorem in Notes 2; see my paper with Green for details. However, this argument turns out to be rather complicated. We give here a simpler approach based on duality (and more precisely, the Hahn-Banach theorem) that yields the same result, due independently to Gowers and to Reingold-Trevisan-Tulsiani-Vadhan. This approach also has the benefit of giving somewhat sharper quantitative refinements.

The first task is to represent the {U^2} norm in a dual formulation. The starting point is that the expression

\displaystyle  \|f\|_{U^2({\bf Z}/N{\bf Z})}^4 = \mathop{\bf E}_{n,a,b} f(n) f(n+a) f(n+b) f(n+a+b)

whenever {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}}, can be rewritten as

\displaystyle  \|f\|_{U^2({\bf Z}/N{\bf Z})}^4 = \langle f, {\mathcal D} f \rangle_{L^2({\bf Z}/N{\bf Z})}

where the dual function {{\mathcal D} f = {\mathcal D}_2 f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} is defined by

\displaystyle  {\mathcal D} f(n) := \mathop{\bf E}_{a,b} f(n+a) f(n+b) f(n+a+b).
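The identity {\|f\|_{U^2({\bf Z}/N{\bf Z})}^4 = \langle f, {\mathcal D} f \rangle_{L^2({\bf Z}/N{\bf Z})}} is easy to check numerically; a minimal sketch (with illustrative parameters):

```python
import random

# Check that ||f||_{U^2}^4 = <f, Df> with Df(n) = E_{a,b} f(n+a) f(n+b) f(n+a+b).
random.seed(3)
N = 31
f = [random.uniform(-1, 1) for _ in range(N)]

Df = [sum(f[(n + a) % N] * f[(n + b) % N] * f[(n + a + b) % N]
          for a in range(N) for b in range(N)) / N ** 2
      for n in range(N)]

u2_fourth = sum(f[n] * f[(n + a) % N] * f[(n + b) % N] * f[(n + a + b) % N]
                for n in range(N) for a in range(N) for b in range(N)) / N ** 3
inner = sum(f[n] * Df[n] for n in range(N)) / N
print(u2_fourth, inner)  # equal up to rounding error
```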

Define a basic anti-uniform function to be any function of the form {{\mathcal D} F}, where {F: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} obeys the pointwise bound {|F| \leq \nu+1}. To obtain the approximation property, it thus suffices to show that for every {\epsilon > 0}, for {N} sufficiently large depending on {\epsilon}, and any {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} with {0 \leq f \leq \nu}, one can decompose {f = f_1 + f_2} where {0 \leq f_1 \leq 1} and {|\langle f_2, {\mathcal D} F \rangle| \leq \epsilon^4} for all basic anti-uniform functions {{\mathcal D} F}. Indeed, if one sets {F := f_2}, the latter bound gives {\|f_2\|_{U^2({\bf Z}/N{\bf Z})}^4 \leq \epsilon^4}, and the desired decomposition follows.

In order to apply the Hahn-Banach theorem properly, it is convenient to symmetrise and convexify the space of basic anti-uniform functions. Define an averaged anti-uniform function to be any convex combination of basic anti-uniform functions and their negations, and denote the space of all such averaged anti-uniform functions as {B}. Thus {B} is a compact convex symmetric subset of the finite-dimensional real vector space {L^2({\bf Z}/N{\bf Z})} that contains a neighbourhood of the origin; equivalently, it defines a norm on {L^2({\bf Z}/N{\bf Z})}. Our task is then to show (for fixed {\epsilon} and large {N}) that for any {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} with {0 \leq f \leq \nu}, the sets

\displaystyle  U := \{ (f_1,f_2) \in L^2({\bf Z}/N{\bf Z}) \times L^2({\bf Z}/N{\bf Z}): f_1 + f_2 = f \}

and

\displaystyle  V := \{ (f_1,f_2) \in L^2({\bf Z}/N{\bf Z}) \times L^2({\bf Z}/N{\bf Z}): 0 \leq f_1 \leq 1; \langle f_2, \phi \rangle \leq \epsilon^4 \hbox{ for all } \phi \in B \}

have non-empty intersection.

The point of phrasing things this way is that {U} and {V} are both closed convex subsets of the finite-dimensional vector space {L^2({\bf Z}/N{\bf Z}) \times L^2({\bf Z}/N{\bf Z})}, and so the Hahn-Banach theorem is applicable. (One could also use closely related results, such as the Farkas lemma: see this earlier blog post for more discussion.) Indeed, suppose that there was some {f} for which {U} and {V} were disjoint. Then, by the Hahn-Banach theorem, there must exist some linear functional

\displaystyle  (f_1,f_2) \mapsto \langle f_1, \phi_1 \rangle_{L^2({\bf Z}/N{\bf Z})} + \langle f_2, \phi_2 \rangle_{L^2({\bf Z}/N{\bf Z})}

which separates the two sets, in the sense that

\displaystyle  \langle f_1, \phi_1 \rangle_{L^2({\bf Z}/N{\bf Z})} + \langle f_2, \phi_2 \rangle_{L^2({\bf Z}/N{\bf Z})} > c

for all {(f_1,f_2) \in U}, and

\displaystyle  \langle f_1, \phi_1 \rangle_{L^2({\bf Z}/N{\bf Z})} + \langle f_2, \phi_2 \rangle_{L^2({\bf Z}/N{\bf Z})} \leq c

for all {(f_1,f_2) \in V}, where {c} is a real number.

From the form of {U}, we see that we must have {\phi_1=\phi_2}. In particular, we may normalise {\phi = \phi_1 = \phi_2} to be on the boundary of {B}. As all finite-dimensional spaces are reflexive, we see in that case that {\langle f_2, \phi \rangle} can be as large as {\epsilon^4} on {V}, and independently {\langle f_1, \phi \rangle} can be as large as {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} \max(\phi,0)}. We conclude that

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} \max(\phi,0) + \epsilon^4 \leq \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} f \phi.

As {0 \leq f \leq \nu}, we see that {f \phi \leq \nu \max(\phi,0)}, and thus

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} (\nu-1) \max(\phi,0) \geq \epsilon^4.

We now make the hypothesis that the dual function {{\mathcal D}(\nu+1)} of {\nu+1} is uniformly bounded:

\displaystyle  {\mathcal D}(\nu+1) \leq C. \ \ \ \ \ (6)

We remark that the linear forms condition (which we have not specified explicitly) will give this bound with {C = {\mathcal D}(1+1) + o(1) = 2^{2^2-1} + o(1)}.

Since {\phi} is a convex combination of functions of the form {\pm {\mathcal D} F} and {|F| \leq \nu+1}, this implies that {\phi} is bounded uniformly as well: {|\phi| \leq C}. Applying the Weierstrass approximation theorem to the function {\max(x,0)} for {|x| \leq C} (and noting that the {L^1} norm of {\nu-1} is {O(1)}) we conclude that there exists a polynomial {P: {\bf R} \rightarrow {\bf R}} (depending only on {\epsilon} and {C}) such that

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} (\nu-1) P(\phi) \geq \epsilon^4/2

(say). Breaking {P} into monomials, and using the pigeonhole principle, we conclude that there exists a non-negative integer {k = O_{\epsilon,C}(1)} such that

\displaystyle  |\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} (\nu-1) \phi^k| \gg_{\epsilon,C} 1;

since {\phi} was a convex combination of functions of the form {\pm {\mathcal D} F}, we thus conclude that there exist {F_1,\ldots,F_k} with {|F_1|,\ldots,|F_k| \leq \nu+1} such that

\displaystyle  |\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} (\nu-1) ({\mathcal D} F_1) \ldots ({\mathcal D} F_k)| \gg_{\epsilon,C} 1.

We shall contradict this by fiat, making the hypothesis that

\displaystyle  |\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} (\nu-1) ({\mathcal D} F_1) \ldots ({\mathcal D} F_k)| = o_{N \rightarrow \infty;k}(1) \ \ \ \ \ (7)

for all {k \geq 1} and all {F_1,\ldots,F_k} bounded in magnitude by {\nu+1}.

We summarise this discussion as follows:

Theorem 7 (Dense model theorem) If (6), (7) hold, then the approximation in {U^2} hypothesis in Lemma 3 holds.

There is nothing too special about the {U^2} norm here; one could work with higher Gowers norms, or indeed with any other norm for which one has a reasonably explicit description of the dual.

The abstract version of this theorem was first (implicitly) proven in my paper with Green, and made more explicit in this paper of Ziegler and myself. The methods there were different (and somewhat more complicated). To prove approximation, the basic idea was to write {g = \mathop{\bf E}(f|{\mathcal B})} for some carefully chosen {\sigma}-algebra {{\mathcal B}} (built out of dual functions that correlated with things like the residual {f - \mathop{\bf E}(f|{\mathcal B})}). This automatically gave the non-negativity of {g}; the upper bound on {g} came from the bound {\mathop{\bf E}(f|{\mathcal B}) \leq \mathop{\bf E}(\nu|{\mathcal B})}, with the latter expression then being bounded by the Weierstrass approximation theorem and (7).

To summarise, in order to establish the Roth-pseudorandomness of a measure {\nu}, we have at least two options. The first (which relies on Fourier analysis, and is thus largely restricted to complexity {1} problems) is to establish the Fourier pseudorandomness bound (3) and the restriction estimate (2). The other (which does not require Fourier analysis) is to establish a finite number of linear forms conditions, as well as the estimate (7).

Next, we informally sketch how one can deduce (7) from a finite number of linear forms conditions, as well as a crude estimate

\displaystyle  \nu = O( N^{o(1)}) \ \ \ \ \ (8)

and a condition known as the correlation condition. At the cost of oversimplifying slightly, we express this condition as the assertion that

\displaystyle  \mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} \nu(n+h_1) \ldots \nu(n+h_k) \ll_k 1 \ \ \ \ \ (9)

whenever {h_1,\ldots,h_k \in {\bf Z}/N{\bf Z}} are distinct, thus the {k}-point correlation function of {\nu} is bounded for each {k}. For the number-theoretic applications, one needs to replace the {1} on the right-hand side by a more complicated expression, but we will defer this technicality to the exercises. We remark that for each fixed {k}, the correlation condition would be implied by the linear forms condition, but it is important that we can make {k} arbitrarily large.
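For the random sparse model discussed earlier, the correlation condition (9) is easy to test empirically; a sketch with illustrative parameters:

```python
import math, random

# Empirical k-point correlations E_n nu(n+h_1)...nu(n+h_k) for the random sparse
# model nu(n) = 1/p with probability p; these concentrate near 1 for distinct shifts.
random.seed(4)
N = 100003
p = 1 / math.log(N)
nu = [1 / p if random.random() < p else 0.0 for _ in range(N)]

def corr(shifts):
    return sum(math.prod(nu[(n + h) % N] for h in shifts) for n in range(N)) / N

print(corr([0]), corr([0, 2]), corr([0, 2, 6]))  # each close to 1 (with high probability)
```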

For simplicity of notation we assume that the {F_j} are bounded in magnitude by {\nu} rather than by {\nu+1}. We begin by expanding out (7) as

\displaystyle  |\mathop{\bf E}_{n,h_{1,1},\ldots,h_{2,k}} (\nu(n)-1) \prod_{j=1}^k F_j( n + h_{1,j} ) F_j( n + h_{2,j} ) F_j( n + h_{1,j} + h_{2,j} )|.

Shifting {h_{i,j}} by {h_i} for some {h_1,h_2} and re-averaging, we can rewrite this as

\displaystyle  |\mathop{\bf E}_{h_{1,1},\ldots,h_{2,k}} \mathop{\bf E}_{n,h_1,h_2} (\nu(n)-1) F_{\vec h_1}( n + h_{1} ) F_{\vec h_2}( n + h_{2} ) F_{\vec h_1 + \vec h_2}( n + h_{1} + h_{2} )|

where {\vec h_i := (h_{i,1},\ldots,h_{i,k})} for {i=1,2} and

\displaystyle  F_{(v_1,\ldots,v_k)}(n) := \prod_{j=1}^k F_j(n+v_j).

The inner expectation is the Gowers inner product of {\nu-1}, {F_{\vec h_1}}, {F_{\vec h_2}}, and {F_{\vec h_1+\vec h_2}}. Using the linear forms condition we may assume that

\displaystyle  \|\nu-1\|_{U^2({\bf Z}/N{\bf Z})} = o(1)

and so it will suffice by the Cauchy-Schwarz-Gowers inequality, followed by the Hölder inequality, to show that

\displaystyle  \mathop{\bf E}_{h_{1,1},\ldots,h_{2,k}} \| F_{\vec h_1} \|_{U^2({\bf Z}/N{\bf Z})}^4 \ll_k 1

and similarly for {\vec h_2} and {\vec h_1+\vec h_2}.

We just prove the claim for {\vec h_1}, as the other two cases are similar. We expand the left-hand side as

\displaystyle  |\mathop{\bf E}_{n,a,b,h_1,\ldots,h_k} \prod_{j=1}^k F_j( n + h_j ) F_j( n + h_j + a ) F_j( n + h_j + b ) F_j( n + h_j + a + b )|

which we can upper bound by

\displaystyle  |\mathop{\bf E}_{n,a,b,h_1,\ldots,h_k} \prod_{j=1}^k \nu( n + h_j ) \nu( n + h_j + a ) \nu( n + h_j + b ) \nu( n + h_j + a + b )|

We can factorise this as

\displaystyle  \mathop{\bf E}_{a,b} |\mathop{\bf E}_n \nu( n ) \nu( n + a ) \nu( n + b ) \nu( n + a + b )|^k.

Using (9), we see that the inner expectation is {O_k(1)} as long as {0,a,b,a+b} are distinct; in all other cases they are {O(N^{o(1)})}, by (8). Combining these two cases we obtain the claim.

Exercise 4 Show that (7) also follows from a finite number of linear forms conditions and (9), if the {F_j} are only assumed to be bounded in magnitude by {\nu+1} rather than {\nu}, and the right-hand side of (9) is weakened to {\sum_{1 \leq i < j \leq k} \tau(h_i-h_j)}, where {\tau:{\bf Z}/N{\bf Z} \rightarrow {\bf R}^+} is a function obeying the moment bounds {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} \tau(n)^q \ll_q 1} for each {q \geq 1}.

The above machinery was geared to getting Roth-type lower bounds on {\Lambda(f,f,f)}; but it also can be used to give more precise asymptotics:

Exercise 5 Suppose that {\nu} obeys the hypotheses of Lemma 3 (with the {u^2} norm). Let {f: {\bf Z}/N{\bf Z} \rightarrow {\bf R}} obey the pointwise bound {0 \leq f \leq \nu} and have mean {\mathop{\bf E}_{n \in {\bf Z}/N{\bf Z}} f(n) = \delta}; suppose also that one has the pseudorandomness bound {\sup_{\xi \in {\bf Z}/N{\bf Z} \backslash 0}|\hat f(\xi)| = o_{N \rightarrow \infty}(1)}. Show that {\Lambda(f,f,f) = \delta^3 + o_{N \rightarrow \infty}(1)}.

Exercise 6 Repeat the previous exercise, but with the {u^2} norm replaced by the {U^2} norm.

Informally, the above exercises show that if one wants to obtain asymptotics for three-term progressions in a set {A} which has positive relative density with respect to a Roth-pseudorandom measure, then it suffices to obtain a non-trivial bound on the exponential sums {\sum_{n \in A} e( \xi n )} for non-zero frequencies {\xi}.

For longer progressions, one uses higher-order Gowers norms, and a similar argument (using the inverse conjecture for the Gowers norms) shows (roughly speaking) that to obtain asymptotics for {k}-term progressions (or more generally, linear patterns of complexity {k-1}) in a {U^{k-1}}-pseudorandom measure (by which we mean that the analogue of Lemma 3 for the {U^{k-1}} norm holds) it suffices to obtain a non-trivial bound on sums of the form {\sum_{n \in A} F( g(n) \Gamma )} for {k-2}-step nilsequences {F(g(n) \Gamma)}. See this paper of Green and myself for further discussion.

— 2. A brief discussion of sieve theory —

In order to apply the above theory to find patterns in the primes, we need to build a measure {\nu} with respect to which the primes have a positive density, and for which one can verify conditions such as the Fourier pseudorandomness condition (3), the restriction estimate (2), linear forms conditions, and the correlation condition (9).

There is an initial problem with this, namely that the primes themselves are not uniformly distributed with respect to small moduli. For instance, all primes are coprime to two (with one exception). In contrast, any measure {\nu} obeying the Fourier pseudorandomness condition (3) (which is implied by the condition {\|\nu-1\|_{U^2}=o(1)}, which would follow in turn from the linear forms condition), must be evenly distributed in both odd and even residue classes up to {o(1)} errors; this forces the density of the primes in {\nu} to be at most {1/2+o(1)}. A similar argument using all the prime moduli less than some parameter {w} shows in fact that the density of primes in {\nu} is at most {\prod_{p<w} (1-\frac{1}{p}) + o_{N \rightarrow \infty;w}(1)}. Since {\sum_p \frac{1}{p}} diverges to {+\infty}, {\prod_p (1-\frac{1}{p})} diverges to zero, and so we see that the primes cannot in fact have a positive density with respect to any pseudorandom measure.

This difficulty can be overcome by a simple affine change of variables known as the {W}-trick, where we replace the primes {{\mathcal P} = \{ 2, 3, 5, \ldots \}} by the modified set {{\mathcal P}_{W,b} := \{ n \in {\bf N}: Wn+b \in {\mathcal P} \}}, where {W := \prod_{p<w} p} is the product of all the primes less than {w}, and {1 \leq b < W} is a residue class coprime to {W}. In practice, {w} (and {W}) are slowly growing functions of {N}, e.g. one could take {w = \log \log \log N}. By the pigeonhole principle, for any given {N} and {W} there will exist a {b} for which {{\mathcal P}_{W,b}} is large (of cardinality {\gg \frac{N}{\phi(W) \log N}}, where {\phi(W)} is the number of residue classes coprime to {W}); indeed, thanks to the prime number theorem in arithmetic progressions, any such {b} would work (e.g. one can take {b=1}). Note that every arithmetic progression in {{\mathcal P}_{W,b}} is associated to a corresponding arithmetic progression in {{\mathcal P}}. Thus, for the task of locating arithmetic progressions at least, we may as well work with {{\mathcal P}_{W,b}}; a similar claim also holds for more complicated tasks, such as counting the number of linear patterns in {{\mathcal P}}, though one now has to work with several residue classes at once. The point of passing from {{\mathcal P}} to {{\mathcal P}_{W,b}} is that the latter set no longer has any particular bias to favour or disfavour any residue class with modulus less than {w}; there are still biases at higher moduli, but as long as {w} goes to infinity with {N}, the effect of such biases will end up being negligible (ultimately contributing {o(1)} terms to things like the linear forms condition).
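In code, the {W}-trick is just an affine rescaling; a minimal sketch (the choices of {w}, {b} and the range {N} below are purely illustrative):

```python
# The W-trick: replace the primes P by P_{W,b} = { n : W*n + b prime }.
# The parameters w, b, N below are illustrative choices.
def primes_upto(M):
    s = bytearray([1]) * (M + 1)
    s[0:2] = b"\x00\x00"
    for i in range(2, int(M ** 0.5) + 1):
        if s[i]:
            s[i * i :: i] = bytearray(len(range(i * i, M + 1, i)))
    return {i for i in range(M + 1) if s[i]}

w = 5
W = 2 * 3          # product of the primes below w
b = 1              # a residue class coprime to W
N = 1000
P = primes_upto(W * N + b)
P_Wb = [n for n in range(1, N) if W * n + b in P]
print(len(P_Wb), P_Wb[:10])
# A progression n, n+r, n+2r in P_Wb lifts to the prime progression
# W*n+b, W*(n+r)+b, W*(n+2r)+b, so it suffices to find progressions in P_Wb.
```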

To simplify the exposition a bit, though, let us ignore the {W}-trick and pretend that we are working with the primes themselves rather than the affine-shifted primes. We will also ignore the technical distinctions between the interval {[N]} and the cyclic group {{\bf Z}/N{\bf Z}}.

The most natural candidate for the measure {\nu} is the von Mangoldt function {\Lambda: {\bf N} \rightarrow {\bf R}^+}, defined by setting {\Lambda(n) :=\log p} when {n=p^j} is the power of a prime {p} for some {j \geq 1}, and {\Lambda(n)=0} otherwise. One hint as to the significance of this function is provided by the identity

\displaystyle  \log n = \sum_{d|n} \Lambda(d)

for all natural numbers {n}, which can be viewed as a generating function of the fundamental theorem of arithmetic.
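This identity is simple to verify directly; a minimal sketch:

```python
import math

# Check of the identity log n = sum_{d | n} Lambda(d) for a few values of n.
def mangoldt(n):
    """Lambda(n) = log p if n = p^j for a prime p and j >= 1, and 0 otherwise."""
    if n < 2:
        return 0.0
    p = next(q for q in range(2, n + 1) if n % q == 0)  # least prime factor
    while n % p == 0:
        n //= p
    return math.log(p) if n == 1 else 0.0

for n in (12, 30, 97):
    print(n, math.log(n), sum(mangoldt(d) for d in range(1, n + 1) if n % d == 0))
```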

The prime number theorem tells us that {\Lambda} is indeed a measure: {\mathop{\bf E}_{n \in [N]} \Lambda(n) = 1 + o(1)}. And the primes have full density with respect to this function: {\mathop{\bf E}_{n \in [N]} 1_{\mathcal P}(n) \Lambda(n) = 1 + o(1)}. Furthermore, the von Mangoldt function has good Fourier pseudorandomness properties (after applying the {W}-trick), thanks to the classical techniques of Hardy-Littlewood and Vinogradov. Indeed, to control exponential sums such as {\mathop{\bf E}_{n \in [N]} \Lambda(n) e( \xi n )} for some {\xi \in {\bf R}}, one can use tools such as the Siegel-Walfisz theorem (a quantitative version of the prime number theorem in arithmetic progressions) to control such sums in the “major arc” case when {\xi} is close to a rational of small height, while in the “minor arc” case when {\xi} behaves irrationally, one can use the standard identity

\displaystyle  \Lambda(n) = \sum_{d|n} \mu(d) \log\frac{n}{d}, \ \ \ \ \ (10)

where {\mu} is the Möbius function, to re-express such a sum in terms of expressions roughly of the form

\displaystyle  \sum_{d,m} \mu(d) \log m e( \xi dm )

where we are intentionally vague as to what range the {d,m} parameters are being summed over. The idea is then to eliminate the {\mu} factor by tools such as the triangle inequality or the Cauchy-Schwarz inequality, leading to expressions such as

\displaystyle  \sum_{d} |\sum_m \log m e( \xi dm )|;

the point is that the inner sum does not contain any number-theoretic factors such as {\Lambda} or {\mu}, but is still oscillatory (at least if {\xi} is sufficiently irrational), and so one can extract useful cancellation from here. Actually, the situation is more complicated than this, because there are regions of the range of {(d,m)} for which this method provides insufficient cancellation, in which case one has to rearrange the sum further using more arithmetic identities such as (10) (for instance, using a truncated version of (10) known as Vaughan’s identity). We will not discuss this further here, but any advanced analytic number theory text (e.g. Iwaniec-Kowalski) will cover this material.

Unfortunately, while the Fourier-pseudorandomness of {\Lambda} is well-understood, the linear forms and correlation conditions are essentially equivalent to (and in fact slightly harder than) the original problem of obtaining asymptotics for linear patterns in primes, and so using {\Lambda} for the pseudorandom measure would result in a circular argument. Furthermore, correlations such as

\displaystyle  \mathop{\bf E}_{n \in [N]} \Lambda(n) \Lambda(n+2)

(which essentially counts the number of twin primes up to {N}) are notoriously difficult to compute. For instance, if one tries to expand the above sum using (10), one ends up with expressions such as

\displaystyle  \sum_{d,d' \leq N} \mu(d) \mu(d') \sum_{n \leq N: d|n, d' | n+2} \log \frac{n}{d} \log \frac{n+2}{d'}.

By the Chinese remainder theorem, the two residue conditions {d|n, d'|n+2} can be combined to a single residue condition for {n} modulo the least common multiple {lcm(d,d')} of {d} and {d'}. If {d} and {d'} are both small, e.g. {d, d' \leq N^{1/10}}, then this least common multiple is much less than {N}, and in such a case one can compute the inner sum very precisely; as it turns out, the main term in this estimate is multiplicative in {d,d'}, which allows the outer sum to be estimated using the techniques of multiplicative number theory (and in particular, using the theory of the Riemann zeta function). Unfortunately, for the bulk of the above sum, {d} and {d'} are instead comparable to {N}, and the least common multiple is typically of size {N^2}, and then it becomes extraordinarily difficult to estimate the inner sum (and hence the entire sum).

However, if we truncate the divisor sum (10) to restrict {d} to a range such as {d \leq N^{1/10}}, then the situation improves substantially. This leads to expressions such as

\displaystyle  \nu(n) := \frac{1}{\log R} ( \sum_{d|n; d<R} \mu(d) \log\frac{R}{d} )^2 \ \ \ \ \ (11)

or more generally

\displaystyle  \nu(n) := \log R ( \sum_{d|n} \mu(d) \psi( \frac{\log d}{\log R} ) )^2 \ \ \ \ \ (12)

for some cutoff function {\psi}, where {R} is a small power of {N} (e.g. {R = N^{1/10}}); the expression (11) corresponds to the case {\psi(x) := \max(1-x,0)}. The presence of the square is to ensure that {\nu} is non-negative, and the presence of the {\frac{1}{\log R}} is a normalisation factor to ensure that {\nu} has mean close to {1}. Such expressions were essentially introduced by Selberg (as part of what is now known as the Selberg sieve), although the sieve weight factors {\psi(\frac{\log d}{\log R})} are usually modified slightly for the Selberg sieve (see this paper of Green and myself for further discussion). The correlation properties of the particular expression (11) were studied intensively by Goldston and Yildirim, and have particularly sharp estimates, although for applications discussed here, one can work instead with a smoother choice of cutoff {\psi}, which makes the required correlation estimates on {\nu} easier to prove (but with slightly worse bounds). Indeed, the required linear forms and correlation conditions can be verified for (12) (or more precisely, a variant of {\nu} in which the {W}-trick is applied) by a moderately lengthy, but elementary and straightforward calculation, based ultimately on the Chinese remainder theorem, an analysis of the local problem (working mod {q} for small {q}), and the fundamental fact that the Riemann zeta function {\zeta(s)} is approximately equal to {1/(s-1)} for {s} close to {1}. See for instance this unpublished note of mine for more discussion. (The exact power of {N} that one sets {R} equal to will depend on the complexity of the linear forms and correlation conditions that one needs to verify.)

If one uses (11), then we see that {\nu(n)} is equal to {\log R} when {n} is any prime larger than {R}; if {\log R} is comparable to {\log N}, we thus see (from the prime number theorem) that the primes in {[N]} do indeed have positive density relative to {\nu}. This is then enough to be able to invoke the transference principle and extend results such as Szemerédi’s theorem to the primes, establishing in particular that the primes contain arbitrarily long arithmetic progressions; see this paper of Green and myself for details.
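A toy computation with (11) illustrates both facts, namely that {\nu} has mean close to {1} and that {\nu(p) = \log R} at primes {p > R} (the choices of {N} and {R} below are arbitrary, and far from the regime {R = N^{1/10}} used in applications):

```python
import math

# The truncated divisor sum (11): nu(n) = (1/log R) (sum_{d|n, d<R} mu(d) log(R/d))^2.
# N and R are illustrative choices only.
N, R = 100000, 30

def mu(n):  # Moebius function, by trial division
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result

def nu(n):
    s = sum(mu(d) * math.log(R / d) for d in range(1, R) if n % d == 0)
    return s * s / math.log(R)

print(sum(nu(n) for n in range(1, N + 1)) / N)  # mean close to 1
print(nu(101), math.log(R))                     # nu(p) = log R at primes p > R
```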

To use the Fourier-analytic approach, it turns out to be convenient to replace the above measures {\nu} by a slight variant which looks more complicated in the spatial domain, but is easier to manipulate in the frequency domain. More specifically, the expression (11) or (12) is replaced with a variant such as

\displaystyle  \nu(n) := \frac{1}{\log R} ( \sum_{d|n; d \leq R} \mu(d) \frac{d}{\phi(d)} \sum_{q \leq R/d; (q,d)=1} \frac{1}{\phi(q)} )^2

where {\phi(d)} is the Euler totient function (the number of integers from {1} to {d} that are coprime to {d}). Some standard multiplicative number theory shows that the weights {\frac{d}{\phi(d)} \sum_{q \leq R/d; (q,d)=1} \frac{1}{\phi(q)}} are approximately equal to {\log \frac{R}{d}} in some sense. With such a representation, it turns out that the Fourier coefficients of {\nu} can be computed more or less explicitly, and {\hat \nu} is essentially supported on those frequencies of the form {a/q} with {q \leq R^2}. This makes it easy to verify the required Fourier-pseudorandomness hypothesis (3) (once one applies the {W}-trick). As for the restriction estimate (2), the first step is to use Exercise 3 and the Cauchy-Schwarz inequality to reduce matters to showing an estimate of the shape

\displaystyle  \mathop{\bf E}_{n} |\sum_{\xi} g(\xi) e(\xi n/N)|^2 \nu(n) \ll \|g\|_{\ell^{q'}}^2.

The left-hand side can be rearranged to be of the shape

\displaystyle  \sum_{\xi, \xi'} g(\xi) \overline{g(\xi')} \hat \nu(\xi - \xi').

It is then possible to use the good pointwise control on the Fourier transform {\hat \nu} of {\nu} (in particular, the fact that it “decays” quite rapidly away from the major arcs) to get a good restriction estimate. See this paper of Green and myself for further discussion.

As discussed in the previous section, to get asymptotics for patterns in the primes we also need to control exponential sums such as

\displaystyle  \sum_{p \leq N} e( \xi p )

and more generally (for higher complexity patterns)

\displaystyle  \sum_{p \leq N} F( g(p) \Gamma )

for various nilsequences {n \mapsto F(g(n)\Gamma)}. Again, it is convenient to use the von Mangoldt function {\Lambda} as a proxy for the primes, thus leading to expressions such as

\displaystyle  \sum_{n \leq N} \Lambda(n) F( g(n) \Gamma ).

Actually, for technical reasons it is convenient to use identities such as (10) to replace this type of expression with expressions such as

\displaystyle  \sum_{n \leq N} \mu(n) F( g(n) \Gamma ),

because the Möbius function {\mu} enjoys better boundedness and equidistribution properties than {\Lambda}. (For instance, {\Lambda} strongly favours odd numbers over even numbers, whereas the Möbius function has no preference.) It turns out that these expressions can be controlled by a generalisation of the method of Vinogradov used to compute exponential sums over primes, using the equidistribution theory of nilsequences as a substitute for the classical theory of exponential sums over integers. See this paper of Green and myself for details.
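As a closing illustration, the cancellation present in such Möbius sums is easy to observe numerically (the frequency {\xi} below is an arbitrary irrational choice, and the range {N} is illustrative):

```python
import cmath, math

# Cancellation in the exponential sum sum_{n <= N} mu(n) e(xi n) for irrational xi.
def mu(n):
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    return -result if n > 1 else result

N = 20000
xi = math.sqrt(2)
S = sum(mu(n) * cmath.exp(2j * math.pi * xi * n) for n in range(1, N + 1))
print(abs(S), N)  # |S| is far below the trivial bound N
```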