A fundamental and recurring problem in analytic number theory is to demonstrate the presence of cancellation in an oscillating sum, a typical example of which might be a correlation

\displaystyle  \sum_{n} f(n) \overline{g(n)} \ \ \ \ \ (1)

between two arithmetic functions {f: {\bf N} \rightarrow {\bf C}} and {g: {\bf N} \rightarrow {\bf C}}, which to avoid technicalities we will assume to be finitely supported (or that the {n} variable is localised to a finite range, such as {\{ n: n \leq x \}}). A key example to keep in mind for the purposes of this set of notes is the twisted von Mangoldt summatory function

\displaystyle  \sum_{n \leq x} \Lambda(n) \overline{\chi(n)} \ \ \ \ \ (2)

that measures the correlation between the primes and a Dirichlet character {\chi}. One can get a “trivial” bound on such sums from the triangle inequality

\displaystyle  |\sum_{n} f(n) \overline{g(n)}| \leq \sum_{n} |f(n)| |g(n)|;

for instance, from the triangle inequality and the prime number theorem we have

\displaystyle  |\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}| \leq x + o(x) \ \ \ \ \ (3)

as {x \rightarrow \infty}. But the triangle inequality is insensitive to the phase oscillations of the summands, and often we expect (e.g. from the probabilistic heuristics from Supplement 4) to be able to improve upon the trivial triangle inequality bound by a substantial amount; in the best case scenario, one typically expects a “square root cancellation” that gains a factor that is roughly the square root of the number of summands. (For instance, for Dirichlet characters {\chi} of conductor {O(x^{O(1)})}, it is expected from probabilistic heuristics that the left-hand side of (3) should in fact be {O_\varepsilon(x^{1/2+\varepsilon})} for any {\varepsilon>0}.)
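One can illustrate the square root cancellation heuristic numerically with a toy model (this is purely illustrative, with arbitrarily chosen parameters): take {f(n)} to be random signs and {g(n)=1}, and compare the triangle inequality bound with the actual size of the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10**5
# Toy model: f(n) = random signs, g(n) = 1.  The triangle inequality bounds
# |sum_n f(n) conj(g(n))| by N, but the actual sum is typically only of size
# about sqrt(N), exhibiting square root cancellation.
signs = rng.choice([-1.0, 1.0], size=N)
actual = abs(signs.sum())
trivial = float(N)
print(f"trivial bound {trivial:.0f}, actual {actual:.0f}, sqrt(N) = {np.sqrt(N):.0f}")
```

With overwhelming probability the actual sum is within a few multiples of {\sqrt{N}}, far below the trivial bound.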

It has proven surprisingly difficult, however, to establish significant cancellation in many of the sums of interest in analytic number theory, particularly if the sums do not have a strong amount of algebraic structure (e.g. multiplicative structure) which allows for the deployment of specialised techniques (such as multiplicative number theory techniques). In fact, we are forced to rely (to an embarrassingly large extent) on (many variations of) a single basic tool to capture at least some cancellation, namely the Cauchy-Schwarz inequality. Indeed, in many cases the classical case

\displaystyle  |\sum_n f(n) \overline{g(n)}| \leq (\sum_n |f(n)|^2)^{1/2} (\sum_n |g(n)|^2)^{1/2}, \ \ \ \ \ (4)

considered by Cauchy, where at least one of {f, g: {\bf N} \rightarrow {\bf C}} is finitely supported, suffices for applications. Roughly speaking, the Cauchy-Schwarz inequality replaces the task of estimating a cross-correlation between two different functions {f,g} with that of measuring self-correlations of {f} with itself, or {g} with itself, which are usually easier to compute (albeit at the cost of capturing less cancellation). Note that the Cauchy-Schwarz inequality requires almost no hypotheses on the functions {f} or {g}, making it a very widely applicable tool.

There is however some skill required to decide exactly how to deploy the Cauchy-Schwarz inequality (and in particular, how to select {f} and {g}); if applied blindly, one loses all cancellation and can even end up with a worse estimate than the trivial bound. For instance, if one tries to bound (2) directly by applying Cauchy-Schwarz with the functions {\Lambda} and {\chi}, one obtains the bound

\displaystyle  |\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}| \leq (\sum_{n \leq x} \Lambda(n)^2)^{1/2} (\sum_{n \leq x} |\chi(n)|^2)^{1/2}.

The right-hand side may be bounded by {\ll x \log^{1/2} x}, which is worse than the trivial bound (3) by a factor of about {\log^{1/2} x}. This can be “blamed” on the fact that {\Lambda} and {\chi} are concentrated on rather different sets ({\Lambda} is concentrated on primes, while {\chi} is more or less uniformly distributed amongst the natural numbers); but even if one corrects for this (e.g. by weighting Cauchy-Schwarz with some suitable “sieve weight” that is more concentrated on primes), one still does not do any better than (3). Indeed, the Cauchy-Schwarz inequality suffers from the same key weakness as the triangle inequality: it is insensitive to the phase oscillation of the factors {f, g}.

While the Cauchy-Schwarz inequality can be poor at estimating a single correlation such as (1), its power improves when considering an average (or sum, or square sum) of multiple correlations. In this set of notes, we will focus on one such situation of this type, namely that of trying to estimate a square sum

\displaystyle  (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \ \ \ \ \ (5)

that measures the correlations of a single function {f: {\bf N} \rightarrow {\bf C}} with multiple other functions {g_j: {\bf N} \rightarrow {\bf C}}. One should think of the situation in which {f} is a “complicated” function, such as the von Mangoldt function {\Lambda}, but the {g_j} are relatively “simple” functions, such as Dirichlet characters. In the case when the {g_j} are orthonormal functions, we of course have the classical Bessel inequality:

Lemma 1 (Bessel inequality) Let {g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}} be finitely supported functions obeying the orthonormality relationship

\displaystyle  \sum_n g_j(n) \overline{g_{j'}(n)} = 1_{j=j'}

for all {1 \leq j,j' \leq J}. Then for any function {f: {\bf N} \rightarrow {\bf C}}, we have

\displaystyle  (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \leq (\sum_n |f(n)|^2)^{1/2}.

For sake of comparison, if one were to apply the Cauchy-Schwarz inequality (4) separately to each summand in (5), one would obtain the bound of {J^{1/2} (\sum_n |f(n)|^2)^{1/2}}, which is significantly inferior to the Bessel bound when {J} is large. Geometrically, what is going on is this: the Cauchy-Schwarz inequality (4) is only close to sharp when {f} and {g} are close to parallel in the Hilbert space {\ell^2({\bf N})}. But if {g_1,\dots,g_J} are orthonormal, then it is not possible for any other vector {f} to be simultaneously close to parallel to too many of these orthonormal vectors, and so the inner products of {f} with most of the {g_j} should be small. (See this previous blog post for more discussion of this principle.) One can view the Bessel inequality as formalising a repulsion principle: if {f} correlates too much with some of the {g_j}, then it does not have enough “energy” to have large correlation with the rest of the {g_j}.
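As a concrete sanity check of Lemma 1, the snippet below takes the {g_j} to be normalised additive characters on a cyclic group (which are orthonormal) and a random {f}; the parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 64, 16
n = np.arange(N)
# g_j(n) = e(jn/N)/sqrt(N) for j = 0..J-1: an orthonormal family in ell^2(Z/NZ)
G = np.exp(2j * np.pi * np.outer(np.arange(J), n) / N) / np.sqrt(N)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
# correlations sum_n f(n) conj(g_j(n))
corr = G.conj() @ f
lhs = np.sqrt(np.sum(np.abs(corr) ** 2))   # left-hand side of Bessel
rhs = np.sqrt(np.sum(np.abs(f) ** 2))      # the energy of f
print(lhs, rhs)
```

When {J = N} the characters form a complete orthonormal basis and Bessel's inequality becomes the Plancherel identity.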

In analytic number theory applications, it is useful to generalise the Bessel inequality to the situation in which the {g_j} are not necessarily orthonormal. This can be accomplished via the Cauchy-Schwarz inequality:

Proposition 2 (Generalised Bessel inequality) Let {g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}} be finitely supported functions, let {\nu: {\bf N} \rightarrow {\bf R}^+} be a non-negative function, and let {f: {\bf N} \rightarrow {\bf C}} be a function that vanishes whenever {\nu} vanishes. Then

\displaystyle  (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2} \ \ \ \ \ (6)

\displaystyle  \times ( \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)} )^{1/2}

for some sequence {c_1,\dots,c_J} of complex numbers with {\sum_{j=1}^J |c_j|^2 = 1}, with the convention that {|f(n)|^2/\nu(n)} vanishes whenever {f(n), \nu(n)} both vanish.

Note by relabeling that we may replace the domain {{\bf N}} here by any other at most countable set, such as the integers {{\bf Z}}. (Indeed, one can give an analogue of this lemma on arbitrary measure spaces, but we will not do so here.) This result first appears in this paper of Boas.

Proof: We use the method of duality to replace the role of the function {f} by a dual sequence {c_1,\dots,c_J}. By the converse to Cauchy-Schwarz, we may write the left-hand side of (6) as

\displaystyle  \sum_{j=1}^J \overline{c_j} \sum_{n} f(n) \overline{g_j(n)}

for some complex numbers {c_1,\dots,c_J} with {\sum_{j=1}^J |c_j|^2 = 1}. Indeed, if all of the {\sum_{n} f(n) \overline{g_j(n)}} vanish, we can set the {c_j} arbitrarily, otherwise we set {(c_1,\dots,c_J)} to be the unit vector formed by dividing {(\sum_{n} f(n) \overline{g_j(n)})_{j=1}^J} by its length. We can then rearrange this expression as

\displaystyle  \sum_n f(n) \overline{\sum_{j=1}^J c_j g_j(n)}.

Applying Cauchy-Schwarz (dividing the first factor by {\nu(n)^{1/2}} and multiplying the second by {\nu(n)^{1/2}}, after first removing those {n} for which {\nu(n)} vanishes), this is bounded by

\displaystyle  (\sum_n |f(n)|^2 / \nu(n))^{1/2} (\sum_n \nu(n) |\sum_{j=1}^J c_j g_j(n)|^2)^{1/2},

and the claim follows by expanding out the second factor. \Box
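The proof is easy to test by machine: the sketch below (random data, with {\nu} bounded away from zero so that no vanishing issues arise) computes the dual coefficients {c_j} exactly as in the duality step, and checks (6).

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 50, 8
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
G = rng.standard_normal((J, N)) + 1j * rng.standard_normal((J, N))  # rows are g_j
nu = rng.uniform(0.5, 2.0, size=N)    # strictly positive weight

corr = G.conj() @ f                   # correlations sum_n f(n) conj(g_j(n))
lhs = np.sqrt(np.sum(np.abs(corr) ** 2))
c = corr / np.linalg.norm(corr)       # the dual unit vector from the proof
gram = (G * nu) @ G.conj().T          # gram[j,j'] = sum_n nu(n) g_j(n) conj(g_j'(n))
quad = np.real(c @ gram @ np.conj(c)) # sum_{j,j'} c_j conj(c_j') gram[j,j']
rhs = np.sqrt(np.sum(np.abs(f) ** 2 / nu)) * np.sqrt(quad)
print(lhs, rhs)
```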

Observe that Lemma 1 is a special case of Proposition 2 when {\nu=1} and the {g_j} are orthonormal. In general, one can expect Proposition 2 to be useful when the {g_j} are almost orthogonal relative to {\nu}, in that the correlations {\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}} tend to be small when {j,j'} are distinct. In that case, one can hope for the diagonal term {j=j'} in the right-hand side of (6) to dominate, in which case one can obtain estimates of comparable strength to the classical Bessel inequality. The flexibility to choose different weights {\nu} in the above proposition has some technical advantages; for instance, if {f} is concentrated in a sparse set (such as the primes), it is sometimes useful to tailor {\nu} to a comparable set (e.g. the almost primes) in order not to lose too much in the first factor {\sum_n |f(n)|^2 / \nu(n)}. Also, it can be useful to choose a fairly “smooth” weight {\nu}, in order to make the weighted correlations {\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}} small.

Remark 3 In harmonic analysis, the use of tools such as Proposition 2 is known as the method of almost orthogonality, or the {TT^*} method. The explanation for the latter name is as follows. For sake of exposition, suppose that {\nu} is never zero (or we remove all {n} from the domain for which {\nu(n)} vanishes). Given a family of finitely supported functions {g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}}, consider the linear operator {T: \ell^2(\nu^{-1}) \rightarrow \ell^2(\{1,\dots,J\})} defined by the formula

\displaystyle  T f := ( \sum_{n} f(n) \overline{g_j(n)} )_{j=1}^J.

This is a bounded linear operator, and the left-hand side of (6) is nothing other than the {\ell^2(\{1,\dots,J\})} norm of {Tf}. Without any further information on the function {f} other than its {\ell^2(\nu^{-1})} norm {(\sum_n |f(n)|^2 / \nu(n))^{1/2}}, the best estimate one can obtain on (6) here is clearly

\displaystyle  (\sum_n |f(n)|^2 / \nu(n))^{1/2} \times \|T\|_{op},

where {\|T\|_{op}} denotes the operator norm of {T}.

The adjoint {T^*: \ell^2(\{1,\dots,J\}) \rightarrow \ell^2(\nu^{-1})} is easily computed to be

\displaystyle  T^* (c_j)_{j=1}^J := (\sum_{j=1}^J c_j \nu(n) g_j(n) )_{n \in {\bf N}}.

The composition {TT^*: \ell^2(\{1,\dots,J\}) \rightarrow \ell^2(\{1,\dots,J\})} of {T} and its adjoint is then given by

\displaystyle  TT^* (c_j)_{j=1}^J := (\sum_{j'=1}^J c_{j'} \sum_n \nu(n) g_{j'}(n) \overline{g_j(n)} )_{j=1}^J.

From the spectral theorem (or singular value decomposition), one sees that the operator norms of {T} and {TT^*} are related by the identity

\displaystyle  \|T\|_{op} = \|TT^*\|_{op}^{1/2},

and as {TT^*} is a self-adjoint, positive semi-definite operator, the operator norm {\|TT^*\|_{op}} is also the supremum of the quantity

\displaystyle  \langle TT^* (c_j)_{j=1}^J, (c_j)_{j=1}^J \rangle_{\ell^2(\{1,\dots,J\})} = \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}

where {(c_j)_{j=1}^J} ranges over unit vectors in {\ell^2(\{1,\dots,J\})}. Putting these facts together, we obtain Proposition 2; furthermore, we see from this analysis that the bound here is essentially optimal if the only information one is allowed to use about {f} is its {\ell^2(\nu^{-1})} norm.

For further discussion of almost orthogonality methods from a harmonic analysis perspective, see Chapter VII of this text of Stein.
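The identity {\|T\|_{op} = \|TT^*\|_{op}^{1/2}} is easy to verify numerically; the sketch below takes {\nu = 1} for simplicity, so that both norms are ordinary matrix (spectral) norms.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J = 40, 6
G = rng.standard_normal((J, N)) + 1j * rng.standard_normal((J, N))  # rows are g_j
T = G.conj()                     # (Tf)_j = sum_n f(n) conj(g_j(n)), with nu = 1
TTstar = T @ T.conj().T          # J x J, self-adjoint positive semi-definite
op_T = np.linalg.norm(T, 2)      # operator norm = largest singular value
op_TTstar = np.linalg.norm(TTstar, 2)
print(op_T, np.sqrt(op_TTstar))
```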

Exercise 4 Under the same hypotheses as Proposition 2, show that

\displaystyle  \sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}| \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2}

\displaystyle  \times ( \sum_{j=1}^J \sum_{j'=1}^J |\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}| )^{1/2}

as well as the variant inequality

\displaystyle  |\sum_{j=1}^J \sum_{n} f(n) \overline{g_j(n)}| \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2}

\displaystyle  \times | \sum_{j=1}^J \sum_{j'=1}^J \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}|^{1/2}.

Proposition 2 has many applications in analytic number theory; for instance, we will use it in later notes to control the large values of Dirichlet series such as the Riemann zeta function. One of the key benefits is that it largely eliminates the need to consider further correlations of the function {f} (other than its self-correlation {\sum_n |f(n)|^2 / \nu(n)} relative to {\nu^{-1}}, which is usually fairly easy to compute or estimate as {\nu} is usually chosen to be relatively simple); this is particularly useful if {f} is a function which is significantly more complicated to analyse than the functions {g_j}. Of course, the tradeoff for this is that one now has to deal with the coefficients {c_j}, which if anything are even less understood than {f}, since literally the only thing we know about these coefficients is their square sum {\sum_{j=1}^J |c_j|^2 = 1}. However, as long as there is enough almost orthogonality between the {g_j}, one can estimate the {c_j} by fairly crude estimates (e.g. the triangle inequality or Cauchy-Schwarz) and still obtain reasonably good estimates.

In this set of notes, we will use Proposition 2 to prove some versions of the large sieve inequality, which controls a square-sum of correlations

\displaystyle  \sum_n f(n) e( -\xi_j n )

of an arbitrary finitely supported function {f: {\bf Z} \rightarrow {\bf C}} with various additive characters {n \mapsto e( \xi_j n)} (where {e(x) := e^{2\pi i x}}), or alternatively a square-sum of correlations

\displaystyle  \sum_n f(n) \overline{\chi_j(n)}

of {f} with various primitive Dirichlet characters {\chi_j}; it turns out that one can prove a (slightly sub-optimal) version of this inequality quite quickly from Proposition 2 if one first prepares the sum by inserting a smooth cutoff with well-behaved Fourier transform. The large sieve inequality has many applications (as the name suggests, it has particular utility within sieve theory). For the purposes of this set of notes, though, the main application we will need it for is the Bombieri-Vinogradov theorem, which in a very rough sense gives a prime number theorem in arithmetic progressions that, “on average”, is of strength comparable to the results provided by the Generalised Riemann Hypothesis (GRH), but has the great advantage of being unconditional (it does not require any unproven hypotheses such as GRH). It can be viewed as a significant extension of the Siegel-Walfisz theorem from Notes 2. As we shall see in later notes, the Bombieri-Vinogradov theorem is a very useful ingredient in sieve-theoretic problems involving the primes.

There is however one additional important trick, beyond the large sieve, which we will need in order to establish the Bombieri-Vinogradov theorem. As it turns out, after some basic manipulations (and the deployment of some multiplicative number theory, and specifically the Siegel-Walfisz theorem), the task of proving the Bombieri-Vinogradov theorem is reduced to that of getting a good estimate on sums that are roughly of the form

\displaystyle  \sum_{j=1}^J |\sum_n \Lambda(n) \overline{\chi_j}(n)| \ \ \ \ \ (7)

for some primitive Dirichlet characters {\chi_j}. This looks like the type of sum that can be controlled by the large sieve (or by Proposition 2), except that this is an ordinary sum rather than a square sum (i.e., an {\ell^1} norm rather than an {\ell^2} norm). One could of course try to control such a sum in terms of the associated square-sum through the Cauchy-Schwarz inequality, but this turns out to be very wasteful (it loses a factor of about {J^{1/2}}). Instead, one should try to exploit the special structure of the von Mangoldt function {\Lambda}, in particular the fact that it can be expressed as a Dirichlet convolution {\alpha * \beta} of two further arithmetic sequences {\alpha,\beta} (or as a finite linear combination of such Dirichlet convolutions). The reason for introducing this convolution structure is the basic identity

\displaystyle  (\sum_n \alpha*\beta(n) \overline{\chi_j}(n)) = (\sum_n \alpha(n) \overline{\chi_j}(n)) (\sum_n \beta(n) \overline{\chi_j}(n)) \ \ \ \ \ (8)

for any finitely supported sequences {\alpha,\beta: {\bf N} \rightarrow {\bf C}}, as can be easily seen by multiplying everything out and using the completely multiplicative nature of {\chi_j}. (This is the multiplicative analogue of the well-known relationship {\widehat{f*g}(\xi) = \hat f(\xi) \hat g(\xi)} between ordinary convolution and Fourier coefficients.) This factorisation, together with yet another application of the Cauchy-Schwarz inequality, lets one control (7) by square-sums of the sort that can be handled by the large sieve inequality.
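The identity (8) is easily verified by machine; the following sketch uses the Dirichlet character mod {5} with {\chi(2)=i} (any completely multiplicative function would serve) and random finitely supported sequences with arbitrarily chosen support.

```python
import numpy as np

# The Dirichlet character mod 5 with chi(2) = i, extended by periodicity;
# it is completely multiplicative and vanishes on multiples of 5.
table = {0: 0, 1: 1, 2: 1j, 3: -1j, 4: -1}
def chi(n):
    return table[n % 5]

rng = np.random.default_rng(4)
A = 30
alpha = {n: rng.standard_normal() for n in range(1, A + 1)}
beta = {n: rng.standard_normal() for n in range(1, A + 1)}

# Dirichlet convolution alpha*beta(n) = sum_{de = n} alpha(d) beta(e)
conv = {}
for d, ad in alpha.items():
    for e, be in beta.items():
        conv[d * e] = conv.get(d * e, 0) + ad * be

lhs = sum(cn * np.conj(chi(n)) for n, cn in conv.items())
rhs = sum(ad * np.conj(chi(d)) for d, ad in alpha.items()) * \
      sum(be * np.conj(chi(e)) for e, be in beta.items())
print(lhs, rhs)
```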

As we have seen in Notes 1, the von Mangoldt function {\Lambda} does indeed admit several factorisations of Dirichlet convolution type, such as the factorisation {\Lambda = \mu * L}. One can try directly inserting this factorisation into the above strategy; it almost works, but there turns out to be a problem when considering the contribution of the portion of {\mu} or {L} that is supported at very small natural numbers, as the large sieve loses any gain over the trivial bound in such settings. Because of this, there is a need for more sophisticated decompositions of {\Lambda} into Dirichlet convolutions {\alpha * \beta} which are non-degenerate in the sense that {\alpha,\beta} are supported away from small values. (As a non-example, the trivial factorisation {\Lambda = \Lambda * \delta} would be totally inappropriate for this purpose.) Fortunately, it turns out that through some elementary combinatorial manipulations, some satisfactory decompositions of this type are available, such as the Vaughan identity and the Heath-Brown identity. By using one of these identities we will be able to complete the proof of the Bombieri-Vinogradov theorem. (These identities are also useful for other applications in which one wishes to control correlations between the von Mangoldt function {\Lambda} and some other sequence; we will see some examples of this in later notes.)
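For the record, Vaughan's identity in one standard form (for parameters {U, V \geq 1}) reads {\Lambda = \Lambda_{\leq V} + \mu_{\leq U} * L - \mu_{\leq U} * \Lambda_{\leq V} * 1 + \mu_{>U} * \Lambda_{>V} * 1}, where subscripts denote restriction of support to the indicated range; the brute-force Python check below (with small, arbitrary cutoffs, and no attempt at efficiency) verifies this identity term by term.

```python
import math

def mobius_sieve(N):
    """Compute mu(1..N) with a linear sieve."""
    mu = [0] * (N + 1)
    mu[1] = 1
    primes, is_comp = [], [False] * (N + 1)
    for i in range(2, N + 1):
        if not is_comp[i]:
            primes.append(i)
            mu[i] = -1
        for p in primes:
            if i * p > N:
                break
            is_comp[i * p] = True
            if i % p == 0:
                mu[i * p] = 0
                break
            mu[i * p] = -mu[i]
    return mu

def mangoldt_sieve(N):
    """Compute Lambda(1..N): log p on prime powers p^k, else 0."""
    lam = [0.0] * (N + 1)
    for p in range(2, N + 1):
        if all(p % q for q in range(2, int(p ** 0.5) + 1)):  # p prime
            pk = p
            while pk <= N:
                lam[pk] = math.log(p)
                pk *= p
    return lam

def dconv(a, b, N):
    """Dirichlet convolution of two arrays indexed 1..N."""
    c = [0.0] * (N + 1)
    for d in range(1, N + 1):
        if a[d]:
            for m in range(d, N + 1, d):
                c[m] += a[d] * b[m // d]
    return c

N, U, V = 300, 8, 8   # arbitrary small cutoffs
mu, lam = mobius_sieve(N), mangoldt_sieve(N)
L = [0.0] + [math.log(n) for n in range(1, N + 1)]
one = [0.0] + [1.0] * N
mu_small = [v if i <= U else 0 for i, v in enumerate(mu)]
mu_big = [v if i > U else 0 for i, v in enumerate(mu)]
lam_small = [v if i <= V else 0.0 for i, v in enumerate(lam)]
lam_big = [v if i > V else 0.0 for i, v in enumerate(lam)]

vaughan = [a + b - c + d for a, b, c, d in zip(
    lam_small,
    dconv(mu_small, L, N),
    dconv(dconv(mu_small, lam_small, N), one, N),
    dconv(dconv(mu_big, lam_big, N), one, N))]
err = max(abs(vaughan[n] - lam[n]) for n in range(1, N + 1))
print("max deviation from Lambda:", err)
```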

For further reading on these topics, including a significantly larger number of examples of the large sieve inequality, see Chapters 7 and 17 of Iwaniec and Kowalski.

Remark 5 We caution that the presentation given in this set of notes is highly ahistorical; we are using modern streamlined proofs of results that were first obtained by more complicated arguments.

— 1. The large sieve inequality —

We begin with a (slightly weakened) form of the large sieve inequality for additive characters, also known as the analytic large sieve inequality, first extracted explicitly by Davenport and Halberstam from previous work on the large sieve, and then refined further by many authors (see Remark 7 below).

Proposition 6 (Analytic large sieve inequality) Let {f: {\bf Z} \rightarrow {\bf C}} be a function supported on an interval {[M,M+N]} for some {M \in {\bf R}} and {N > 0}, and let {\xi_1,\dots,\xi_J \in {\bf R}/{\bf Z}} be {\delta}-separated for some {\delta > 0} (thus {\|\xi_i - \xi_j\|_{{\bf R}/{\bf Z}} \ge \delta} for all {1 \leq i < j \leq J}, where {\|\xi\|_{{\bf R}/{\bf Z}}} denotes the distance from {\xi} to the nearest integer). Then

\displaystyle  \sum_{j=1}^J |\sum_n f(n) e( - \xi_j n )|^2 \ll (N + \frac{1}{\delta}) \sum_n |f(n)|^2. \ \ \ \ \ (9)

One can view this proposition as a variant of the Plancherel identity

\displaystyle  \sum_{j=1}^N |\sum_{n=1}^N f(n) e( - jn / N )|^2 = N \sum_{n=1}^N |f(n)|^2

associated to the Fourier transform on a cyclic group {{\bf Z}/N{\bf Z}}. This identity also shows that apart from the implied constant, the bound (9) is essentially best possible.
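One can also test (9) numerically, for instance taking the frequencies to be Farey fractions of height {Q} (which are {1/Q^2}-separated on {{\bf R}/{\bf Z}}). The sketch below, with arbitrary random data, checks the inequality with the implied constant set to {1}, which is legitimate for a function supported on {N} consecutive integers by the sharp Montgomery-Vaughan form discussed in Remark 7 below.

```python
import numpy as np
from math import gcd

rng = np.random.default_rng(5)
M, N, Q = 1000, 200, 10
n = np.arange(M, M + N)     # N consecutive integers
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Farey fractions a/q with q <= Q are delta-separated for delta = 1/Q^2
xis = sorted({a / q for q in range(1, Q + 1) for a in range(q) if gcd(a, q) == 1})
delta = 1.0 / Q ** 2
lhs = sum(abs(np.sum(f * np.exp(-2j * np.pi * xi * n))) ** 2 for xi in xis)
rhs = (N + 1 / delta) * np.sum(np.abs(f) ** 2)
print(lhs, rhs)
```

For generic {f} the left-hand side is far below the right-hand side; near-extremal examples require {f} to correlate strongly with one of the characters.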

Proof: By increasing {N} (if {N < 1/\delta}) or decreasing {\delta} (if {1/\delta < N}), we can reduce to the case {N = \frac{1}{\delta}} without making the hypotheses any stronger. Thus the {\xi_j} are now {1/N}-separated, and our task is now to show that

\displaystyle  \sum_{j=1}^J |\sum_n f(n) e( - \xi_j n )|^2 \ll N \sum_n |f(n)|^2.

We now wish to apply Proposition 2 (with {{\bf N}} relabeled by {{\bf Z}}), but we need to choose a suitable weight function {\nu: {\bf Z} \rightarrow {\bf R}^+}. It turns out that it is advantageous for technical reasons to select a weight that has good Fourier-analytic properties.

Fix a smooth non-negative function {\psi: {\bf R} \rightarrow {\bf R}} supported on {[-1/4,1/4]} and not identically zero; we allow implied constants to depend on {\psi}. (Actually, the smoothness of {\psi} is not absolutely necessary for this argument; one could take {\psi = 1_{[-1/4,1/4]}} below if desired.) Consider the weight {\nu: {\bf R} \rightarrow {\bf R}^+} defined by

\displaystyle  \nu(n) := |\int_{\bf R} \psi( x ) e( x \frac{n-M}{N} )\ dx|^2.

Observe that {\nu \gg 1} for all {n \in [M,M+N]}. Thus, by Proposition 2 with this choice of {\nu}, we reduce to showing that

\displaystyle  \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) e( (\xi_j - \xi_{j'}) n ) \ll N \ \ \ \ \ (10)

whenever {c_1,\dots,c_J} are complex numbers with {\sum_{j=1}^J |c_j|^2 = 1}.

We first consider the diagonal contribution {j=j'}. We can write {\nu} in terms of the Fourier transform {\hat \psi(\xi) := \int_{\bf R} \psi(x) e^{i x\xi}\ dx} of {\psi} as

\displaystyle  \nu(t) = |\hat \psi( 2\pi (t-M) / N )|^2,

and hence by Exercise 28 of Supplement 2, we have the bounds

\displaystyle  \nu(n) \ll \frac{1}{(1 + |n-M|/N)^2} \ \ \ \ \ (11)

(say) and thus

\displaystyle  \sum_n \nu(n) \ll N.

Thus the diagonal contribution {j=j'} of (10) is acceptable.

Now we consider an off-diagonal term {\sum_n \nu(n) e( (\xi_j - \xi_{j'}) n )} with {j \neq j'}. By the Poisson summation formula (Theorem 34 from Supplement 2), we can rewrite this as

\displaystyle  \sum_m \hat \nu( 2\pi (m + \xi_j - \xi_{j'}) ).

Now from the Fourier inversion formula, the function {\hat \psi} has Fourier transform supported on {[-1/4,1/4]}, and so {\nu} has Fourier transform supported in {[-\frac{\pi}{N}, \frac{\pi}{N}]}. Since {\xi_j - \xi_{j'}} is at least {1/N} from the nearest integer, we conclude that {\hat \nu( 2\pi (m + \xi_j - \xi_{j'}) )=0} for all integers {m}, and hence all the off-diagonal terms in fact vanish! The claim follows. \Box

Remark 7 If {M,N} are integers, one can in fact obtain the sharper bound

\displaystyle  \sum_{j=1}^J |\sum_n f(n) e( - \xi_j n )|^2 \leq (N + \frac{1}{\delta}) \sum_n |f(n)|^2,

a result of Montgomery and Vaughan (with {N} replaced by {N+1}, although it was observed subsequently by Paul Cohen that the additional {+1} term could be deleted by an amplification trick similar to those discussed in this previous post). See this survey of Montgomery for these results and for the (somewhat complicated) evolution of the large sieve, starting with the pioneering work of Linnik. However, in our applications the cruder form of the analytic large sieve inequality given by Proposition 6 will suffice.

Exercise 8 Let {f: {\bf N} \rightarrow {\bf C}} be a function supported on an interval {[M,M+N]}. Show that for any {A \geq 1}, one has

\displaystyle  |\sum_n f(n) e( - \xi n)| \leq A (\sum_n |f(n)|^2)^{1/2}

for all {\xi \in {\bf R}/{\bf Z}} outside of the union of {O( \frac{N}{A^2} )} intervals (or arcs) of length {1/N}. In particular, we have

\displaystyle  |\sum_n f(n) e( - \xi n)| \ll \sqrt{N} (\sum_n |f(n)|^2)^{1/2}

for all {\xi \in {\bf R}/{\bf Z}}, and the significantly superior estimate

\displaystyle  |\sum_n f(n) e( - \xi n)| \ll (\sum_n |f(n)|^2)^{1/2}

for most {\xi \in {\bf R}/{\bf Z}} (outside of at most (say) {N/2} intervals of length {1/N}).

Exercise 9 (Continuous large sieve inequality) Let {[M,M+N]} be an interval for some {M \in {\bf R}} and {N > 0}, and let {\xi_1,\dots,\xi_J \in {\bf R}} be {\delta}-separated.

  • (i) For any complex numbers {c_1,\dots,c_J}, show that

    \displaystyle  \int_M^{M+N} |\sum_{j=1}^J c_j e( \xi_j t )|^2\ dt \ll (N + \frac{1}{\delta}) \sum_{j=1}^J |c_j|^2.

    (Hint: replace the restriction of {t} to {[M,M+N]} with the weight {\nu(t)} used to prove Proposition 6.)

  • (ii) For any continuous {f: [M,M+N] \rightarrow {\bf C}}, show that

    \displaystyle  \sum_{j=1}^J |\int_M^{M+N} f(t) e( - \xi_j t )\ dt|^2 \ll (N + \frac{1}{\delta}) \int_M^{M+N} |f(t)|^2\ dt.

Now we establish a variant of the analytic large sieve inequality, involving Dirichlet characters, due to Bombieri and Davenport.

Proposition 10 (Large sieve inequality for characters) Let {f: {\bf Z} \rightarrow {\bf C}} be a function supported on an interval {[M,M+N]} for some {M \in {\bf R}} and {N > 0}, and let {Q \geq 1}. Then

\displaystyle  \sum_{q \leq Q} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \ll (N + Q^2) \sum_n |f(n)|^2, \ \ \ \ \ (12)

where {\sum^*_{\chi\ (q)}} denotes the sum over all primitive Dirichlet characters of modulus {q}.
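Before turning to the proof, here is a numerical sanity check of (12), restricted for simplicity to odd prime moduli (for which every non-principal character is primitive and can be constructed from a primitive root); the constant in the bound is taken to be {1}, as is permitted by the sharp form in Remark 12 below.

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, Q = 50, 100, 13
n = np.arange(M, M + N)          # support on N consecutive integers
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

def primitive_root(q):
    # smallest generator of (Z/qZ)^* for prime q (brute force)
    for g in range(2, q):
        if len({pow(g, a, q) for a in range(q - 1)}) == q - 1:
            return g

lhs = 0.0
for q in [3, 5, 7, 11, 13]:      # odd primes <= Q; here phi(q) = q - 1
    g = primitive_root(q)
    dlog = {pow(g, a, q): a for a in range(q - 1)}   # discrete logarithm table
    for k in range(1, q - 1):    # the non-principal (hence primitive) characters
        chi = np.array([np.exp(2j * np.pi * k * dlog[m % q] / (q - 1))
                        if m % q else 0 for m in n])
        lhs += q / (q - 1) * abs(np.sum(f * np.conj(chi))) ** 2
energy = np.sum(np.abs(f) ** 2)
rhs = (N + Q ** 2) * energy
print(lhs, rhs)
```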

Proof: By increasing {N} (if {N < Q^2}) or increasing {Q} (if {N > Q^2}), we may assume that {N=Q^2}, so {N \geq 1} and our task is now to show that

\displaystyle  \sum_{q \leq \sqrt{N}} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \ll N \sum_n |f(n)|^2.

Let {\nu} be the weight from Proposition 6. By Proposition 2 (and some relabeling), using the functions {(\frac{q}{\phi(q)})^{1/2} \chi}, it suffices to show that

\displaystyle  \sum_{q,q' \leq \sqrt{N}} \left(\frac{q}{\phi(q)}\right)^{1/2} \left(\frac{q'}{\phi(q')}\right)^{1/2} \sum^*_{\chi\ (q)} \sum^*_{\chi'\ (q')} c_\chi \overline{c_{\chi'}} \sum_n \nu(n) \chi(n) \overline{\chi'(n)} \ \ \ \ \ (13)

\displaystyle  \ll N

whenever {c_\chi} are complex numbers with

\displaystyle  \sum_{q \leq \sqrt{N}} \sum^*_{\chi\ (q)} |c_\chi|^2 = 1. \ \ \ \ \ (14)

We first establish that the diagonal contribution {(q,\chi)=(q',\chi')} to (13) is acceptable, that is to say that

\displaystyle  \sum_{q \leq \sqrt{N}} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |c_\chi|^2 \sum_n \nu(n) |\chi(n)|^2 \ll N.

The function {|\chi(n)|^2} is the principal character of modulus {q}; it is periodic with period {q} and has mean value {\phi(q)/q}. In particular, since {q \leq \sqrt{N} \leq N}, we see that {\sum_{M' \leq n \leq M'+N} |\chi(n)|^2 \ll \frac{\phi(q)}{q} N} for any interval {[M',M'+N]} of length {N}. From (11) and a partition into intervals of length {N}, we see that

\displaystyle  \sum_n \nu(n) |\chi(n)|^2 \ll \frac{\phi(q)}{q} N

and the claim then follows from (14).

Now consider an off-diagonal term {\sum_n \nu(n) \chi(n) \overline{\chi'(n)}} with {\chi \neq \chi'}. As {\chi,\chi'} are primitive, this implies that {\chi \overline{\chi'}} is a non-principal character and thus has mean zero. Let {r} be the modulus of this character; by Fourier expansion we may write {\chi \overline{\chi'}} as a linear combination of the additive characters {n \mapsto e( k n / r )} for {1 \leq k < r}. (We can obtain explicit coefficients for this expansion by invoking Lemma 48 of Supplement 3, but we will not need those coefficients here.) Thus {\sum_n \nu(n) \chi(n) \overline{\chi'(n)}} is a linear combination of the quantities {\sum_n \nu(n) e( kn/r)} for {1 \leq k < r}. But the modulus {r} is the least common multiple of the moduli of {\chi,\chi'}, so in particular {r \leq N}, while as observed in the proof of Proposition 6, we have {\sum_n \nu(n) e( \theta n ) = 0} whenever {\|\theta\|_{{\bf R}/{\bf Z}} \geq 1/(2N)}. Since {\|k/r\|_{{\bf R}/{\bf Z}} \geq 1/r \geq 1/N} for {1 \leq k < r}, the off-diagonal terms all vanish, and the claim follows. \Box

One can also derive Proposition 10 from Proposition 6:

Exercise 11 Let {f: {\bf Z} \rightarrow {\bf C}} be a finitely supported sequence.

  • (i) For any natural number {q}, establish the identity

    \displaystyle  \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\sum_n f(n) e(an/q)|^2

    \displaystyle = \frac{1}{\phi(q)} \sum_{\chi\ (q)} |\sum_n f(n) \sum_{a \in {\bf Z}/q{\bf Z}} \chi(a) e(an/q)|^2

    where {({\bf Z}/q{\bf Z})^\times} is the set of congruence classes {a\ (q)} coprime to {q}, and {\sum_{\chi\ (q)}} is the sum over all characters (not necessarily primitive) of modulus {q}.

  • (ii) For any natural number {q}, establish the inequality

    \displaystyle  \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \leq \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\sum_n f(n) e(an/q)|^2

    and use this and Proposition 6 to derive Proposition 10. (Hint: use Lemma 48 from Supplement 3.)

Remark 12 By combining the arguments in the above exercise with the results in Remark 7, one can sharpen Proposition 10 to

\displaystyle  \sum_{q \leq Q} \frac{q}{\phi(q)} \sum^*_{\chi\ (q)} |\sum_n f(n) \overline{\chi(n)}|^2 \leq (N + Q^2) \sum_n |f(n)|^2,

that is to say one can delete the implied constant. See this paper of Montgomery and Vaughan for some further refinements of this inequality.

— 2. The Barban-Davenport-Halberstam theorem —

We now apply the large sieve inequality for characters to obtain an analogous inequality for arithmetic progressions, due independently to Barban, and to Davenport and Halberstam; we state a slightly weakened form of that theorem here. For any finitely supported arithmetic function {f: {\bf N} \rightarrow {\bf C}} and any primitive residue class {a\ (q)}, we introduce the discrepancy

\displaystyle  \Delta(f; a\ (q)) := \sum_{n: n = a\ (q)} f(n) - \frac{1}{\phi(q)} \sum_{n: (n,q)=1} f(n).

This quantity measures the extent to which {f} is well distributed among the primitive residue classes modulo {q}. From multiplicative Fourier inversion (see Theorem 69 from Notes 1) we have the identity

\displaystyle  \Delta(f; a\ (q)) = \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} \chi(a) \sum_n f(n) \overline{\chi(n)} \ \ \ \ \ (15)

where the sum is over non-principal characters {\chi} of modulus {q}.
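The identity (15) can be confirmed numerically; the toy check below uses the modulus {q=5}, with the character group generated by the character sending {2 \mapsto i}, and a random {f}.

```python
import numpy as np

rng = np.random.default_rng(7)
q, phi_q = 5, 4
ns = np.arange(1, 101)
f = rng.standard_normal(len(ns))      # random f supported on [1, 100]

dlog = {1: 0, 2: 1, 4: 2, 3: 3}       # discrete logarithms base 2 mod 5
def chi(k, m):
    # the character chi_k with chi_1(2) = i; chi_0 is principal
    r = m % q
    return 0 if r == 0 else np.exp(2j * np.pi * k * dlog[r] / phi_q)

for a in [1, 2, 3, 4]:
    # discrepancy Delta(f; a (q)) computed directly from the definition...
    direct = f[ns % q == a].sum() - f[np.gcd(ns, q) == 1].sum() / phi_q
    # ...and via the character expansion (15), over non-principal characters
    via_chars = sum(chi(k, a) * np.sum(f * np.conj([chi(k, m) for m in ns]))
                    for k in [1, 2, 3]) / phi_q
    assert abs(direct - via_chars) < 1e-9
print("identity (15) verified for q = 5")
```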

Theorem 13 (Barban-Davenport-Halberstam) Let {x > 2}, and let {f: {\bf N} \rightarrow {\bf C}} be a function supported on {[1,x]} with the property that

\displaystyle  \sum_n |f(n)|^2 \ll x \log^{O(1)} x \ \ \ \ \ (16)

and obeying the Siegel-Walfisz property

\displaystyle  \Delta(f 1_{(\cdot,s)=1}; a\ (r)) \ll_A x \log^{-A} x \ \ \ \ \ (17)

for any fixed {A>0}, any primitive residue class {a\ (r)}, and any {1 \leq s \leq x}. Then one has

\displaystyle  \sum_{q \leq Q} \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(f; a\ (q))|^2 \ll_A x^2 \log^{-A} x \ \ \ \ \ (18)

for any {A > 0}, provided that {Q \leq x \log^{-B} x} for some sufficiently large {B = B(A)} depending only on {A}.

Informally, (18) is asserting that

\displaystyle  \Delta(f; a\ (q)) \ll_A \frac{x}{\phi(q)} \log^{-A} x

for “most” primitive residue classes {a\ (q)} with {q} much smaller than {x}; in most applications, the trivial bounds on {\Delta(f; a\ (q))} are of the type {O( \frac{x}{\phi(q)} \log^{O(1)} x )}, so this represents a savings of an arbitrary power of a logarithm on the average. Note that a direct application of (17) only gives (18) for {Q} of size {\log^{O(1)} x}; it is the large sieve which allows for the significant enlargement of {Q}.

Proof: Let {x, f, Q, A} be as above, with {Q \leq x \log^{-B} x} for some large {B} to be chosen later. From (15) and the Plancherel identity one has

\displaystyle  \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(f; a\ (q))|^2 = \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n f(n) \overline{\chi(n)}|^2

so our task is to show that

\displaystyle  \sum_{q \leq Q} \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n f(n) \overline{\chi(n)}|^2 \ll_A x^2 \log^{-A} x. \ \ \ \ \ (19)

We cannot apply the large sieve inequality yet, because the characters {\chi} here are not necessarily primitive. But we may express any non-principal character {\chi} of modulus {q} as {\chi(n) = \tilde \chi(n) 1_{(n,s)=1}} for some primitive character {\tilde \chi} of conductor {r>1}, where {r,s,t} are natural numbers with {rst = q} (for instance, one can take {s := q/r} and {t := 1}). In particular, {r,s,t \leq Q} and {\frac{1}{\phi(q)} \leq \frac{1}{\phi(s)} \frac{1}{\phi(r)} \frac{1}{\phi(t)}}. Thus we may (somewhat crudely) upper bound the left-hand side of (19) by

\displaystyle  \sum_{t \leq Q} \frac{1}{\phi(t)} \sum_{s \leq Q} \frac{1}{\phi(s)} \sum_{1 < r \leq Q} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2.

From Theorem 27 of Notes 1 we have {\sum_{s \leq Q} \frac{1}{\phi(s)} \ll \log x}, and similarly {\sum_{t \leq Q} \frac{1}{\phi(t)} \ll \log x}, so we may bound the above by

\displaystyle  (\log x)^2 \sup_{s \leq Q} \sum_{1 < r \leq Q} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2.

By dyadic decomposition (and adjusting {A} slightly), it thus suffices to show that

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2 \ll_A x^2 \log^{-A} x \ \ \ \ \ (20)

for any {1 \leq R \leq Q} and {s \leq Q}.

From Proposition 10 and (16), we may bound the left-hand side of (20) by {\frac{1}{R} (x + R^2) x \log^{O(1)} x}. If {R \geq \log^B x} and {R \leq Q \leq x \log^{-B} x}, then we obtain (20) if {B} is sufficiently large depending on {A}. The only remaining case to consider is when {R < \log^B x}. But from the Siegel-Walfisz hypothesis (17) we easily see that

\displaystyle  \sum_n f(n) 1_{(n,s)=1} \overline{\tilde \chi(n)} \ll_{A'} R x \log^{-A'} x

for any {A' > 0} and any primitive character {\tilde \chi} of conductor {R < r \leq 2R}. Since the total number of primitive characters appearing in (20) is {O(R^2) = O( \log^{2B} x)}, the claim follows by taking {A'} large enough. \Box

One can specialise this to the von Mangoldt function:

Exercise 14 Use the Barban-Davenport-Halberstam theorem and the Siegel-Walfisz theorem (Exercise 64 from Notes 2) to conclude that

\displaystyle  \sum_{q \leq Q} \sum_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))|^2 \ll_A x^2 \log^{-A} x \ \ \ \ \ (21)

for any {A > 0}, provided that {Q \leq x \log^{-B} x} for some sufficiently large {B = B(A)} depending only on {A}. Obtain a similar claim with {\Lambda} replaced by the Möbius function.
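As a crude numerical illustration of (21) (with hypothetical and unrealistically small parameters {x = 10^4} and {Q = 20}; nothing at this scale probes the logarithmic savings, only the order of magnitude), one can sieve {\Lambda} and compute the left-hand side of (21) directly:

```python
import math

x, Q = 10_000, 20

# sieve the von Mangoldt function: Lambda(p^k) = log p, Lambda(n) = 0 otherwise
lam = [0.0] * (x + 1)
is_prime = [True] * (x + 1)
for p in range(2, x + 1):
    if is_prime[p]:
        for m in range(2 * p, x + 1, p):
            is_prime[m] = False
        pk = p
        while pk <= x:
            lam[pk] = math.log(p)
            pk *= p

lhs = 0.0
for q in range(2, Q + 1):
    units = [a for a in range(q) if math.gcd(a, q) == 1]
    class_sums = [0.0] * q
    for n in range(1, x + 1):
        class_sums[n % q] += lam[n]
    # 1/phi(q) times the sum of Lambda(n) over n coprime to q
    avg = sum(class_sums[a] for a in units) / len(units)
    lhs += sum((class_sums[a] - avg) ** 2 for a in units)

print(f"LHS of (21) = {lhs:.1f}; trivial scale x^2 = {x * x}")
```

One should find a value far below {x^2}.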

Remark 15 Recall that the implied constants in the Siegel-Walfisz theorem depended on {A} in an ineffective fashion. As such, the implied constants in (21) also depend ineffectively on {A}. However, if one replaces Siegel’s theorem by an effective substitute such as Tatuzawa’s theorem (see Theorem 62 of Notes 2) or the Landau-Page theorem (Theorem 53 of Notes 2), one can obtain an effective version of the Siegel-Walfisz theorem for all moduli {q} that are not multiples of a single exceptional modulus {q_*}. One can then obtain an effective version of (21) if one restricts to moduli {q} that are not multiples of {q_*}. Similarly for the Bombieri-Vinogradov theorem in the next section. Such variants of the Barban-Davenport-Halberstam theorem or Bombieri-Vinogradov theorem can be used as a substitute in some applications to remove any ineffective dependence of constants, at the cost of making the argument slightly more convoluted.

— 3. The Bombieri-Vinogradov theorem —

The Barban-Davenport-Halberstam theorem controls the discrepancy {\Delta(f; a\ (q))} after averaging in both the modulus {q} and the residue class {a}. For many problems in sieve theory, it turns out to be more important to control the discrepancy with an averaging only in the modulus {q}, with the residue class {a} being allowed to vary in {q} in the “worst-case” fashion. Specifically, one often wishes to control expressions of the form

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(f; a\ (q))| \ \ \ \ \ (22)

for some finitely supported {f: {\bf N} \rightarrow {\bf C}} and {Q>1}. This expression is difficult to control for arbitrary {f}, but it turns out that one can obtain a good bound if {f} is expressible as a Dirichlet convolution {f = \alpha*\beta} for some suitably “non-degenerate” sequences {\alpha,\beta}. More precisely, we have the following general form of the Bombieri-Vinogradov theorem, first articulated by Motohashi:

Theorem 16 (General Bombieri-Vinogradov theorem) Let {x > 2}, let {M,N \geq 1} be such that {MN \ll x}, and let {\alpha,\beta: {\bf N} \rightarrow {\bf C}} be arithmetic functions supported on {[1,M]}, {[1,N]} respectively, with

\displaystyle  \sum_m |\alpha(m)|^2 \ll M \log^{O(1)} x \ \ \ \ \ (23)


and

\displaystyle  \sum_n |\beta(n)|^2 \ll N \log^{O(1)} x. \ \ \ \ \ (24)

Suppose that {\beta} obeys the Siegel-Walfisz property

\displaystyle  \Delta(\beta 1_{(\cdot,s)=1}; a\ (r)) \ll_A N \log^{-A} x \ \ \ \ \ (25)

for all {A > 0}, all primitive residue classes {a\ (r)}, and all {1 \leq s \leq x}. Then one has

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\alpha * \beta; a\ (q))| \ll_{A} x \log^{-A} x \ \ \ \ \ (26)

for any {A > 0}, provided that {Q \leq x^{1/2} \log^{-B} x} and {M,N \geq \log^B x} for some sufficiently large {B = B(A)} depending on {A}.

Proof: We adapt the arguments of the previous section. From (15) and the triangle inequality, we have

\displaystyle  \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\alpha*\beta; a\ (q))| \leq \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n \alpha*\beta(n) \overline{\chi(n)}|

and so we can upper bound the left-hand side of (26) by

\displaystyle  \sum_{q \leq Q} \frac{1}{\phi(q)} \sum_{\chi\ (q): \chi \neq \chi_0} |\sum_n \alpha*\beta(n) \overline{\chi(n)}|.

As in the previous section, we may reduce {\chi} to primitive characters and bound this expression by

\displaystyle  \ll (\log x)^2 \sup_{s \leq Q} \sum_{1 < r \leq Q} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n \alpha*\beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|.

By dyadic decomposition (and adjusting {A} slightly), it thus suffices to show that

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n \alpha*\beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}| \ll_A x \log^{-A} x \ \ \ \ \ (27)

for all {1 \leq R \leq Q} and {s \leq Q}, and any {A>1}, assuming {Q \leq x^{1/2} \log^{-B} x} and {M,N \geq \log^B x} with {B} sufficiently large depending on {A}.

We cannot yet easily apply the large sieve inequality, because the character sums here are not squared. But we now crucially exploit the Dirichlet convolution structure using the identity (8), to factor {\sum_n \alpha*\beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}} as the product of {\sum_m \alpha(m) 1_{(m,s)=1} \overline{\tilde \chi(m)}} and {\sum_n \beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}}. From the Cauchy-Schwarz inequality, we may thus bound (27) by the geometric mean of

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_m \alpha(m) 1_{(m,s)=1} \overline{\tilde \chi(m)}|^2 \ \ \ \ \ (28)


and

\displaystyle  \sum_{R < r \leq 2R} \frac{1}{\phi(r)} \sum_{\tilde \chi\ (r)}^* |\sum_n \beta(n) 1_{(n,s)=1} \overline{\tilde \chi(n)}|^2. \ \ \ \ \ (29)

Now we have the all-important square needed in the large sieve inequality. From (23), (24) and Proposition 10, we may bound (28) by

\displaystyle  \ll \frac{1}{R} (M + R^2) M \log^{O(1)} x \ \ \ \ \ (30)

and (29) by

\displaystyle  \ll \frac{1}{R} (N + R^2) N \log^{O(1)} x

and so (27) is bounded by

\displaystyle  \ll ( \frac{MN}{R} + M N^{1/2} + M^{1/2} N + R (MN)^{1/2} ) \log^{O(1)} x.

Since {MN \leq x}, we can write this as

\displaystyle  \ll ( \frac{1}{R} + \frac{1}{N^{1/2}} + \frac{1}{M^{1/2}} + \frac{R}{x^{1/2}} ) x \log^{O(1)} x.

Since {N,M \geq \log^B x} and {R \leq Q \leq x^{1/2} \log^{-B} x}, we obtain (27) if {R \geq \log^B x} and {B} is sufficiently large depending on {A}. The only remaining case to handle is when {R \leq \log^B x}. In this case, we can use the Siegel-Walfisz hypothesis (25) as in the previous section to bound (29) by {O_{A'}( N^2 \log^{-A'} x)} for any {A'>0}. Meanwhile, from (30), (28) is bounded by {O( M^2 \log^{O(B+1)} x )}. By taking {A'} sufficiently large, we conclude (27) in this case also. \Box

In analogy with Exercise 14, we would like to apply this general result to specific arithmetic functions, such as the von Mangoldt function {\Lambda} or the Möbius function {\mu}, and in particular to prove the following famous result of Bombieri and of A. I. Vinogradov (not to be confused with the better known number theorist I. M. Vinogradov):

Theorem 17 (Bombieri-Vinogradov theorem) Let {x \geq 2}. Then one has

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_{A} x \log^{-A} x \ \ \ \ \ (31)

for any {A > 0}, provided that {Q \leq x^{1/2} \log^{-B} x} for some sufficiently large {B = B(A)} depending on {A}.

Informally speaking, the Bombieri-Vinogradov theorem asserts that for “almost all” moduli {q} that are significantly less than {x^{1/2}}, one has

\displaystyle  |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_A \frac{x}{\phi(q)} \log^{-A} x

for all primitive residue classes {a\ (q)} to this modulus. This should be compared with the Generalised Riemann Hypothesis (GRH), which gives the bound

\displaystyle  |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll x^{1/2} \log^2 x

for all {q \leq x^{1/2}}; see Exercise 48 of Notes 2. Thus one can view the Bombieri-Vinogradov theorem as an assertion that the GRH holds (with a slightly weaker error term) “on average”, at least insofar as the impact of GRH on the prime number theorem in arithmetic progressions is concerned.

The initial arguments of Bombieri and Vinogradov were somewhat complicated, in particular involving the explicit formula for {L}-functions (Exercise 45 of Notes 2); the modern proof of the Bombieri-Vinogradov theorem avoids this and proceeds instead through Theorem 16 (or a close cousin thereof). Note that this theorem generalises the Siegel-Walfisz theorem (Exercise 64 of Notes 2), which is equivalent to the special case of Theorem 17 when {Q = \log^{O(1)} x}.

The obvious thing to try when proving Theorem 17 using Theorem 16 is to use one of the basic factorisations of such functions into Dirichlet convolutions, e.g. {\Lambda = \mu * L}, and then to decompose that convolution into pieces {\alpha*\beta} of the form required in Theorem 16; we will refer to such convolutions as Type II convolutions, loosely following the terminology of Vaughan. However, one runs into a problem coming from the components of the factors {\mu, L} supported at small numbers (of size {n = O(\log^{O(1)} x)}), as the parameters {M,N} associated to those components cannot obey the conditions {MN \ll x}, {M, N \geq \log^B x}. Indeed, observe that any Type II convolution {\alpha * \beta} will necessarily vanish at primes of size comparable to {x}, and so one cannot possibly represent functions such as {\Lambda} or {\mu} purely in terms of such Type II convolutions.

However, it turns out that we can still decompose functions such as {\Lambda,\mu} into two types of convolutions: not just the Type II convolutions considered above, but also a further class of Type I convolutions {\alpha * \beta}, in which one of the factors, say {\beta}, is very slowly varying (or “smooth”) and supported on a very long interval, e.g. {\beta = 1_{[N,2N]}} for some large {N}; in such convolutions {\alpha} is permitted to be concentrated arbitrarily close to {n=1}, and in particular the Type I convolution can be non-zero on primes comparable to {x}. It turns out that bounding the discrepancy of Type I convolutions is relatively easy, and leads to a proof of Theorem 17.

We turn to the details. There are a number of decompositions of {\Lambda} or {\mu} that one could use to accomplish the desired task. One popular choice of decomposition is the Vaughan identity, which may be compared with the decompositions appearing in the Dirichlet hyperbola method (see Section 3 of Notes 1):

Lemma 18 (Vaughan identity) For any {U,V > 1}, one has

\displaystyle  \Lambda = \Lambda_{\leq V} + \mu_{\leq U} * L - \mu_{\leq U} * \Lambda_{\leq V} * 1 + \mu_{>U} * \Lambda_{>V} * 1 \ \ \ \ \ (32)

where {\Lambda_{\leq V}(n) :=\Lambda(n) 1_{n \leq V}}, {\Lambda_{> V}(n) :=\Lambda(n) 1_{n > V}}, {\mu_{\leq U}(n) :=\mu(n) 1_{n \leq U}}, and {\mu_{> U}(n) :=\mu(n) 1_{n > U}}.

In this decomposition, {U} and {V} are typically two small powers of {x} (e.g. {U=V=x^{1/5}}), although the exact choice of {U,V} is often not of critical importance. The terms {\mu_{\leq U} * L} and {\mu_{\leq U} * \Lambda_{\leq V} * 1} are “Type I” convolutions, while the term {\mu_{>U} * \Lambda_{>V} * 1} should be considered a “Type II” convolution. The term {\Lambda_{\leq V}} is a lower order error that is usually disposed of quite quickly. The Vaughan identity is already strong enough for many applications, but in some more advanced applications (particularly those in which one exploits the structure of triple or higher convolutions) it becomes convenient to use more sophisticated identities such as the Heath-Brown identity, which we will discuss in later notes.

Proof: Since {\mu = \mu_{\leq U} + \mu_{> U}} and {\Lambda = \Lambda_{\leq V} + \Lambda_{>V}}, we have

\displaystyle  \mu * \Lambda = \mu * \Lambda_{\leq V} + \mu_{\leq U} * \Lambda - \mu_{\leq U} * \Lambda_{\leq V} + \mu_{>U}* \Lambda_{>V}.

Convolving both sides by {1} and using the identities {\mu*1 =\delta} and {\Lambda*1=L}, we obtain the claim. \Box
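Since (32) is an exact identity, it can also be checked mechanically. Here is a self-contained Python sketch (the cutoffs {U=V=5} and the range {n \leq 300} are hypothetical choices) that evaluates both sides of the Vaughan identity pointwise via Dirichlet convolutions:

```python
import math

N, U, V = 300, 5, 5

def mobius(n):
    # Moebius function by trial division
    res = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # square factor
            res = -res
        p += 1
    return -res if n > 1 else res

def von_mangoldt(n):
    # log p if n is a power of the prime p, else 0
    for p in range(2, n + 1):
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
    return 0.0

def conv(f, g):
    # Dirichlet convolution, truncated to [1, N]
    h = [0.0] * (N + 1)
    for d in range(1, N + 1):
        if f[d]:
            for m in range(d, N + 1, d):
                h[m] += f[d] * g[m // d]
    return h

mu = [0] + [mobius(n) for n in range(1, N + 1)]
lam = [0.0] + [von_mangoldt(n) for n in range(1, N + 1)]
L = [0.0] + [math.log(n) for n in range(1, N + 1)]
one = [0.0] + [1.0] * N
mu_le = [v if n <= U else 0 for n, v in enumerate(mu)]
mu_gt = [v if n > U else 0 for n, v in enumerate(mu)]
lam_le = [v if n <= V else 0.0 for n, v in enumerate(lam)]
lam_gt = [v if n > V else 0.0 for n, v in enumerate(lam)]

t2 = conv(mu_le, L)
t3 = conv(conv(mu_le, lam_le), one)
t4 = conv(conv(mu_gt, lam_gt), one)
max_err = max(abs(lam[n] - (lam_le[n] + t2[n] - t3[n] + t4[n]))
              for n in range(1, N + 1))
print("max pointwise error in (32):", max_err)
```

(The truncation to {[1,N]} is harmless here, since the value of a Dirichlet convolution at {n} only involves the factors at divisors of {n}.)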

Armed with this identity and Theorem 16 we may now finish off the proof of the Bombieri-Vinogradov theorem. We may assume that {x} is large (depending on {A}) as the claim is trivial otherwise. We apply the Vaughan identity (32) with {U=V=x^{1/5}} (actually for our argument below, any choice of {U,V} with {U,V \geq \log^B x} and {UV \leq x^{1/2} \log^{-B} x} would have sufficed). By the triangle inequality, it now suffices to establish (31) with {\Lambda} replaced by {\Lambda_{\leq V}}, {\mu_{\leq U} * L}, {\mu_{\leq U} * \Lambda_{\leq V}*1}, and {\mu_{>U}*\Lambda_{>V}*1}.

The term {\Lambda_{\leq V}} is easily disposed of: from the triangle inequality (and crudely bounding {\Lambda_{\leq V}} by {\log x 1_{\leq V}}) we see that

\displaystyle  \Delta( \Lambda_{\leq V} 1_{[1,x]}; a\ (q)) \ll \frac{V}{\phi(q)} \log x

and the claim follows since {\sum_{q \leq Q} \frac{1}{\phi(q)} \ll \log x} and {V = x^{1/5}}.

Next, we deal with the Type II convolution {\mu_{>U} *\Lambda_{>V}*1}. The presence of the {1_{[1,x]}} cutoff is slightly annoying (it prevents one from directly applying Proposition 2), but we will deal with this by using the following finer-than-dyadic decomposition trick, originally due to Fouvry and to Fouvry-Iwaniec. We may replace {\mu_{>U} * 1} by {(\mu_{>U}*1) 1_{(U,x]}}, since the portion of {\mu_{>U}*1} on {(x,\infty)} has no contribution. We may similarly replace {\Lambda_{>V}} by {\Lambda 1_{(V,x]}}. Next, we set {\lambda := 1 + \log^{-A-100} x}, and decompose {(\mu_{>U}*1) 1_{(U,x]}} into {O(\log^{A+101} x)} components {\alpha}, each of which is supported in an interval {[M, \lambda M]} and bounded in magnitude by {\tau} for some {M \geq U}. We similarly decompose {\Lambda 1_{(V,x]}} into {O(\log^{A+101} x)} components {\beta} supported in an interval {[N, \lambda N]} and bounded in magnitude by {\log x} for some {N \geq V}. Thus {(\mu_{>U} * \Lambda_{>V} * 1) 1_{[1,x]}} can be decomposed into {O( \log^{2A+202} x )} terms of the form {(\alpha*\beta)1_{[1,x]}} for various {\alpha,\beta} that are components of {\mu_{>U}*1} and {\Lambda_{>V}} respectively.

If {MN > x} then {(\alpha*\beta)1_{[1,x]}} vanishes, so we may assume that {MN \leq x}. By construction we also have {M,N \geq x^{1/5}}, so in particular {M,N \geq \log^B x} if {B} depends only on {A} (recall we are assuming {x} to be large). If {MN < \lambda^{-2} x}, then {(\alpha*\beta)1_{[1,x]} = \alpha*\beta}. The bounds (23), (24) are clear (bounding {\mu_{>U}*1} in magnitude by {\tau}), and from the Siegel-Walfisz theorem we see that the {\beta} components obey the hypothesis (25). Thus by applying Theorem 16 (with {A} replaced by {3A+202}) we see that the total contribution of all the {\alpha*\beta} terms with {MN < \lambda^{-2} x} is acceptable.

It remains to control the total contribution of the {\alpha*\beta} terms with {\lambda^{-2} x \leq MN \leq x}. Note that for each {\alpha} there are only {O(1)} choices of {\beta} that are in this range, so there are only {O( \log^{A+101} x )} such terms {\alpha*\beta} to deal with. We then crudely bound

\displaystyle  \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta( (\alpha*\beta) 1_{[1,x]}; a\ (q) )| \ll \sup_{a \in ({\bf Z}/q{\bf Z})^\times} \sum_{n = a\ (q)} |\alpha| * |\beta|(n)

\displaystyle  \ll (\log x) \sup_{a \in ({\bf Z}/q{\bf Z})^\times} \sum_{M \leq m \leq \lambda M, N \leq n \leq \lambda N: mn = a\ (q)} \tau(m).

Since {MN \geq \lambda^{-2} x} and {q \leq x^{1/2}}, one has either {M \gg q} or {N \gg q}. If {N \gg q}, we observe that for each fixed {m} in the above sum, there are {O( \frac{N}{q} \log^{-A-100} x )} choices of {n} that contribute, so the double sum is {O( \frac{MN}{q} \log^{-2A-199} x ) = O( \frac{x}{q} \log^{-2A-199} x )} (using the mean value bounds for {\tau}). Thus we see that the total contribution of this case is at most

\displaystyle  \ll \log^{A+101} x \times \log x \times \log^{-2A-199} x \times \sum_{q \leq Q} \frac{x}{q}

which is acceptable.

Now we consider the case when {M \gg q}. For fixed {n} in the above sum, the sum in {m} can be bounded by {\log^{65} x \times \frac{M}{q} \log^{-A-100} x}, thanks to Corollary 26 below (taking, say, {c_1 =1/2} and {c_2 = 5/6}). The total contribution of this sum is then

\displaystyle  \ll \log^{A+101} x \times \log x \times \log^{65} x \times \log^{-A-100} x \times \log^{-A-100} x \times \sum_{q \leq Q} \frac{x}{q}

which is also acceptable.

Now let us consider a Type I term {\mu_{\leq U}*L}. From the triangle inequality we have

\displaystyle  |\Delta((\mu_{\leq U} * L) 1_{[1,x]}; a\ (q))| \leq \sum_{d: (d,q)=1} |\mu_{\leq U}(d)| |\Delta( L 1_{[1,x/d]}; a/d\ (q) )|.

We exploit the monotonicity of {L} via the following simple fact:

Exercise 19 Let {f: [y,x] \rightarrow {\bf R}} be a monotone function. Show that

\displaystyle  |\Delta( f 1_{[y,x]}; a\ (q) )| \ll |f(y)| + |f(x)|

for all primitive residue classes {a\ (q)}. (Hint: use Lemma 2 from Notes 1 and a change of variables.)
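As a numerical illustration of this exercise (with hypothetical parameters, and checking the bound only up to a generous constant factor), one can take the monotone function {f = \log} on {[10, 10^4]} and the primitive class {3\ (7)}:

```python
import math

y, x, q, a, phi_q = 10, 10_000, 7, 3, 6

# discrepancy of f = log on [y, x] at the class a (q), directly from the definition
in_class = sum(math.log(n) for n in range(y, x + 1) if n % q == a)
coprime = sum(math.log(n) for n in range(y, x + 1) if n % q != 0)
delta = in_class - coprime / phi_q
bound = math.log(y) + math.log(x)  # |f(y)| + |f(x)|
print(f"|Delta| = {abs(delta):.2f} versus |f(y)| + |f(x)| = {bound:.2f}")
```

Note the contrast with the size of the individual sums, which here are of order {\frac{x}{q} \log x}.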

Applying this exercise, we see that the contribution of this term to (31) is {O( \sum_{q \leq Q} \sum_d |\mu_{\leq U}(d)| \log x ) = O( Q U \log x )}, which is acceptable since {Q \leq x^{1/2}} and {U = x^{1/5}}.

A similar argument for the Type I term {\mu_{\leq U} * \Lambda_{\leq V} * 1} term (with {\mu_{\leq U} * \Lambda_{\leq V}} and {1} replacing {\mu_{\leq U}} and {L}) gives a contribution to (31) of

\displaystyle  \ll \sum_{q \leq Q} \sum_d |\mu_{\leq U} * \Lambda_{\leq V}|(d)

\displaystyle  \ll Q \left(\sum_d |\mu_{\leq U}|(d)\right) \left(\sum_m \Lambda_{\leq V}(m)\right)

\displaystyle  \ll Q U V

which is also acceptable since {Q \leq x^{1/2}} and {U=V=x^{1/5}}. This concludes the proof of Theorem 17.

Exercise 20 Strengthen the Bombieri-Vinogradov theorem by showing that

\displaystyle  \sum_{q \leq Q} \sup_{y \leq x} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,y]}; a\ (q))| \ll_{A} x \log^{-A} x

for all {x \geq 2} and {A>0}, if {Q \leq x^{1/2} \log^{-B} x} for some sufficiently large {B} depending on {A}. (Hint: at present {y} ranges over an uncountable number of values, but if one can round {y} to the nearest multiple of (say) {x \log^{-A-10} x} then there are only {O(\log^{A+10} x)} values of {y} in the supremum that need to be considered. Then use the original Bombieri-Vinogradov theorem as a black box.)

Exercise 21 For any {U, V > 1}, establish the Vaughan-type identity

\displaystyle  \mu = \mu_{\leq V} + \mu_{\leq U} - \mu_{\leq U} * \mu_{\leq V} * 1 + \mu_{>U} * \mu_{>V} * 1

and use this to show that the Bombieri-Vinogradov theorem continues to hold when {\Lambda} is replaced by {\mu}.
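Before attempting the exercise, one can at least verify the displayed identity numerically in exact integer arithmetic; the following sketch uses the hypothetical choices {U=4}, {V=6} and checks all {n \leq 400}:

```python
N, U, V = 400, 4, 6

def mobius(n):
    # Moebius function by trial division
    res = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # square factor
            res = -res
        p += 1
    return -res if n > 1 else res

def conv(f, g):
    # Dirichlet convolution, truncated to [1, N]; all entries are integers
    h = [0] * (N + 1)
    for d in range(1, N + 1):
        if f[d]:
            for m in range(d, N + 1, d):
                h[m] += f[d] * g[m // d]
    return h

mu = [0] + [mobius(n) for n in range(1, N + 1)]
one = [0] + [1] * N
mu_leU = [v if n <= U else 0 for n, v in enumerate(mu)]
mu_gtU = [v if n > U else 0 for n, v in enumerate(mu)]
mu_leV = [v if n <= V else 0 for n, v in enumerate(mu)]
mu_gtV = [v if n > V else 0 for n, v in enumerate(mu)]

t3 = conv(conv(mu_leU, mu_leV), one)
t4 = conv(conv(mu_gtU, mu_gtV), one)
violations = sum(1 for n in range(1, N + 1)
                 if mu[n] != mu_leV[n] + mu_leU[n] - t3[n] + t4[n])
print("violations of the identity for n <= 400:", violations)
```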

Exercise 22 Let us say that {0 < \theta < 1} is a level of distribution for the von Mangoldt function {\Lambda} if one has

\displaystyle  \sum_{q \leq Q} \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_{A,\theta} x \log^{-A} x

whenever {A>0} and {Q \leq x^\theta}; thus, for instance, the Bombieri-Vinogradov theorem implies that every {0 < \theta < 1/2} is a level of distribution for {\Lambda}. Use the Cramér random model (see Section 1 of Supplement 4) to predict that every {0 < \theta < 1} is a level of distribution for {\Lambda}; this claim is known as the Elliott-Halberstam conjecture, and would have a number of consequences in sieve theory. Unfortunately, no level of distribution above {1/2} (or even at {1/2}) is currently known; however, there are weaker versions of the Elliott-Halberstam conjecture known with levels above {1/2} which do have some interesting number-theoretic consequences, and we will return to this point in later notes. For now, we will just remark that {1/2} appears to be the limit of what one can do by using the large sieve and Dirichlet character methods in this set of notes, and all the advances beyond {1/2} have had to rely on other tools (such as exponential sum estimates), although the Cauchy-Schwarz inequality remains an indispensable tool in all of these results.

Exercise 23 Strengthen the Bombieri-Vinogradov theorem by showing that

\displaystyle  \sum_{q \leq Q} \tau(q)^C \sup_{a \in ({\bf Z}/q{\bf Z})^\times} |\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll_{A,C} x \log^{-A} x

for all {x \geq 2}, {C \geq 1}, and {A>0}, if {Q \leq x^{1/2} \log^{-B} x} for some sufficiently large {B} depending on {A,C}. (Hint: use the Cauchy-Schwarz inequality and the trivial bound {|\Delta(\Lambda 1_{[1,x]}; a\ (q))| \ll \frac{x}{q} \log x}, together with elementary estimates on such sums as {\sum_{q \leq Q} \frac{\tau(q)^{2C}}{q}}.)

Exercise 24 Show that the Bombieri-Vinogradov theorem continues to hold if the von Mangoldt function {\Lambda} is replaced by the indicator function {1_{\mathcal P}} of the set of primes {{\mathcal P}}.

— 4. Appendix: the divisor function in arithmetic progressions —

We begin with a lemma of Landreau that controls the divisor function by a short divisor sum.

Lemma 25 For any {\theta > 0} one has

\displaystyle  \tau(n) \leq 2^{2/\theta} \sum_{d|n: d \leq n^\theta} \tau(d)^{2/\theta}

for all natural numbers {n}.

Proof: We write {n = qm}, where {q} is the product of all the prime factors of {n} that are greater than or equal to {n^{\theta/2}}, and {m} is the product of all the prime factors of {n} that are less than {n^{\theta/2}}, counting multiplicity of course; in particular {q} has at most {2/\theta} prime factors, so that {\tau(q) \leq 2^{2/\theta}}. By a greedy algorithm, we can repeatedly pull out factors of {m} of size between {n^{\theta/2}} and {n^\theta} until the remaining portion of {m} falls below {n^\theta}, yielding a factorisation of the form {m = n_1 \dots n_r} where {n_1,\dots,n_{r-1}} lie between {n^{\theta/2}} and {n^\theta}, and {n_r} is at most {n^\theta}. The lower bounds on {n_1,\dots,n_{r-1}} imply that {r \leq \frac{2}{\theta}+1}. By the trivial inequality {\tau(ab) \leq \tau(a) \tau(b)} we have

\displaystyle  \tau(n) \leq \tau(q) \tau(m) \leq 2^{2/\theta} \tau(n_1) \ldots \tau(n_r)

and thus by the pigeonhole principle one has

\displaystyle  \tau(n) \leq 2^{2/\theta} \tau(n_j)^r

for some {j}. Since {n_j} is a factor of {n} that is at most {n^\theta}, this gives the claim as long as {r \leq 2/\theta}. If {r > 2/\theta}, then since each of {n_1,\dots,n_{r-1}} is at least {n^{\theta/2}}, we see that {r-1 \leq 2/\theta}, and {n_{r-1} n_r} cannot exceed {n^\theta} (for otherwise {m > n^{(r-2)\theta/2} n^\theta = n^{r\theta/2} \geq n}, contradicting {m \leq n}). If we now use the inequality

\displaystyle  \tau(n) \leq 2^{2/\theta} \tau(n_1) \dots \tau(n_{r-2}) \tau(n_{r-1} n_r)

and repeat the pigeonholing argument, we again obtain the claim. \Box
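Lemma 25 is easily tested by brute force; the following Python sketch (a sanity check, not a proof) verifies the case {\theta = 1/2}, so that {2/\theta = 4} and the constant is {2^4 = 16}, for all {n \leq 3000}:

```python
import math

LIMIT = 3000

# divisor-counting table tau[n] via a sieve
tau = [0] * (LIMIT + 1)
for d in range(1, LIMIT + 1):
    for m in range(d, LIMIT + 1, d):
        tau[m] += 1

def landreau_bound(n):
    # 2^(2/theta) * sum of tau(d)^(2/theta) over divisors d <= n^theta, theta = 1/2
    return 16 * sum(tau[d] ** 4 for d in range(1, math.isqrt(n) + 1) if n % d == 0)

violations = sum(1 for n in range(1, LIMIT + 1) if tau[n] > landreau_bound(n))
print("violations of Lemma 25 (theta = 1/2) for n <= 3000:", violations)
```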

Corollary 26 Let {0 < c_1 < c_2 < 1}, let {x \geq 1}, let {x^{c_2} \leq y \leq x}, and let {a\ (q)} be a primitive residue class with {q \leq x^{c_1}}. Then

\displaystyle  \sum_{x \leq n\leq x+y: n=a\ (q)} \tau(n) \ll_{c_1,c_2} \frac{y}{q} \log^{2^{\frac{2}{c_2-c_1}}+1} x. \ \ \ \ \ (33)

One can lower the exponent of the logarithm here to {\log x} (consistent with the heuristic that the average value of {\tau(n)} for {n=O(x)} is {O(\log x)}), but this requires additional arguments; see for instance this previous blog post. In contrast, the divisor bound {\tau(n)=O(n^{o(1)})} only gives an upper bound of {x^{o(1)} \frac{y}{q}} here.

Proof: We allow implied constants to depend on {c_1,c_2}. Set {\theta := c_2-c_1}. Using Lemma 25 we have

\displaystyle  \tau(n) \ll \sum_{d|n: d \leq (x+y)^\theta} \tau(d)^{2/\theta}

for all {n \leq x+y}, and so the left-hand side of (33) is bounded by

\displaystyle  \ll \sum_{d \leq (x+y)^\theta} \tau(d)^{2/\theta} \sum_{x \leq n\leq x+y: n=a\ (q); d|n} 1.

The conditions {n=a\ (q)}, {d|n} constrain {n} to either the empty set, or an arithmetic progression of modulus {qd \leq x^{c_1} (x+y)^\theta \ll y}, so the inner sum is {O(\frac{y}{qd})}. On the other hand, from Theorem 27 (or Exercise 29) of Notes 1 we have

\displaystyle  \sum_{d \leq (x+y)^\theta} \frac{\tau(d)^{2/\theta}}{d} \ll \log^{2^{2/\theta} + 1} x,

and the claim follows. \Box
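A quick numerical experiment is consistent with Corollary 26; in the sketch below the parameters {x = 10^5}, {y = 10^4}, {q = 7}, {a = 3} are hypothetical choices, and the comparison is against the heuristic scale {(y/q) \log x} rather than the precise logarithmic exponent of the corollary:

```python
import math

x, y, q, a = 100_000, 10_000, 7, 3

# divisor-counting table up to x + y via a sieve
LIMIT = x + y
tau = [0] * (LIMIT + 1)
for d in range(1, LIMIT + 1):
    for m in range(d, LIMIT + 1, d):
        tau[m] += 1

# divisor sum over the progression n = a (q) in the short interval [x, x + y]
s = sum(tau[n] for n in range(x, x + y + 1) if n % q == a)
scale = (y / q) * math.log(x)
print(f"divisor sum over the progression = {s}; (y/q) log x = {scale:.0f}")
```

This is consistent with the heuristic mentioned above that the average value of {\tau(n)} for {n = O(x)} is {O(\log x)}.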

Exercise 27 Under the same assumptions as in Corollary 26, establish the bound

\displaystyle  \sum_{x \leq n\leq x+y: n=a\ (q)} \tau^k(n) \ll_{c_1,c_2,k} \frac{y}{q} \log^{O_{c_1,c_2,k}(1)} x

for any {k \geq 1}.