You are currently browsing the tag archive for the ‘large sieve’ tag.

Klaus Roth, who made fundamental contributions to analytic number theory, died this Tuesday, aged 90.

I never met or communicated with Roth personally, but was certainly influenced by his work; he wrote relatively few papers, but they tended to have outsized impact. For instance, he was one of the key people (together with Bombieri) to work on simplifying and generalising the large sieve, taking it from the technically formidable original formulation of Linnik and Rényi to the clean and general almost orthogonality principle that we have today (discussed for instance in these lecture notes of mine). The paper of Roth that had the most impact on my own personal work was his three-page paper proving what is now known as Roth’s theorem on arithmetic progressions:

Theorem 1 (Roth’s theorem on arithmetic progressions) Let ${A}$ be a set of natural numbers of positive upper density (thus ${\limsup_{N \rightarrow\infty} |A \cap \{1,\dots,N\}|/N > 0}$). Then ${A}$ contains infinitely many arithmetic progressions ${a,a+r,a+2r}$ of length three (with ${r}$ non-zero of course).

At the heart of Roth’s elegant argument was the following (surprising at the time) dichotomy: if ${A}$ had some moderately large density within some arithmetic progression ${P}$, either one could use Fourier-analytic methods to detect the presence of an arithmetic progression of length three inside ${A \cap P}$, or else one could locate a long subprogression ${P'}$ of ${P}$ on which ${A}$ had increased density. Iterating this dichotomy by an argument now known as the density increment argument, one eventually obtains Roth’s theorem, no matter which side of the dichotomy actually holds. This argument (and the many descendants of it), based on various “dichotomies between structure and randomness”, became essential in many other results of this type, most famously perhaps in Szemerédi’s proof of his celebrated theorem on arithmetic progressions that generalised Roth’s theorem to progressions of arbitrary length. More recently, my recent work on the Chowla and Elliott conjectures that was a crucial component of the solution of the Erdös discrepancy problem, relies on an entropy decrement argument which was directly inspired by the density increment argument of Roth.

The Erdös discrepancy problem also is connected with another well known theorem of Roth:

Theorem 2 (Roth’s discrepancy theorem for arithmetic progressions) Let ${f(1),\dots,f(n)}$ be a sequence in ${\{-1,+1\}}$. Then there exists an arithmetic progression ${a+r, a+2r, \dots, a+kr}$ in ${\{1,\dots,n\}}$ with ${r}$ positive such that

$\displaystyle |\sum_{j=1}^k f(a+jr)| \geq c n^{1/4}$

for an absolute constant ${c>0}$.

In fact, Roth proved a stronger estimate regarding mean square discrepancy, which I am not writing down here; as with the Roth theorem in arithmetic progressions, his proof was short and Fourier-analytic in nature (although non-Fourier-analytic proofs have since been found, for instance the semidefinite programming proof of Lovasz). The exponent ${1/4}$ is known to be sharp (a result of Matousek and Spencer).

As a particular corollary of the above theorem, for an infinite sequence ${f(1), f(2), \dots}$ of signs, the sums ${|\sum_{j=1}^k f(a+jr)|}$ are unbounded in ${a,r,k}$. The Erdös discrepancy problem asks whether the same statement holds when ${a}$ is restricted to be zero. (Roth also established discrepancy theorems for other sets, such as rectangles, which will not be discussed here.)

Finally, one has to mention Roth’s most famous result, cited for instance in his Fields medal citation:

Theorem 3 (Roth’s theorem on Diophantine approximation) Let ${\alpha}$ be an irrational algebraic number. Then for any ${\varepsilon > 0}$ there is a quantity ${c_{\alpha,\varepsilon}}$ such that

$\displaystyle |\alpha - \frac{a}{q}| > \frac{c_{\alpha,\varepsilon}}{q^{2+\varepsilon}}.$

From the Dirichlet approximation theorem (or from the theory of continued fractions) we know that the exponent ${2+\varepsilon}$ in the denominator cannot be reduced to ${2}$ or below. A classical and easy theorem of Liouville gives the claim with the exponent ${2+\varepsilon}$ replaced by the degree of the algebraic number ${\alpha}$; work of Thue and Siegel reduced this exponent, but Roth was the one who obtained the near-optimal result. An important point is that the constant ${c_{\alpha,\varepsilon}}$ is ineffective – it is a major open problem in Diophantine approximation to produce any bound significantly stronger than Liouville’s theorem with effective constants. This is because the proof of Roth’s theorem does not exclude any single rational ${a/q}$ from being close to ${\alpha}$, but instead very ingeniously shows that one cannot have two different rationals ${a/q}$, ${a'/q'}$ that are unusually close to ${\alpha}$, even when the denominators ${q,q'}$ are very different in size. (I refer to this sort of argument as a “dueling conspiracies” argument; they are strangely prevalent throughout analytic number theory.)

A fundamental and recurring problem in analytic number theory is to demonstrate the presence of cancellation in an oscillating sum, a typical example of which might be a correlation

$\displaystyle \sum_{n} f(n) \overline{g(n)} \ \ \ \ \ (1)$

between two arithmetic functions ${f: {\bf N} \rightarrow {\bf C}}$ and ${g: {\bf N} \rightarrow {\bf C}}$, which to avoid technicalities we will assume to be finitely supported (or that the ${n}$ variable is localised to a finite range, such as ${\{ n: n \leq x \}}$). A key example to keep in mind for the purposes of this set of notes is the twisted von Mangoldt summatory function

$\displaystyle \sum_{n \leq x} \Lambda(n) \overline{\chi(n)} \ \ \ \ \ (2)$

that measures the correlation between the primes and a Dirichlet character ${\chi}$. One can get a “trivial” bound on such sums from the triangle inequality

$\displaystyle |\sum_{n} f(n) \overline{g(n)}| \leq \sum_{n} |f(n)| |g(n)|;$

for instance, from the triangle inequality and the prime number theorem we have

$\displaystyle |\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}| \leq x + o(x) \ \ \ \ \ (3)$

as ${x \rightarrow \infty}$. But the triangle inequality is insensitive to the phase oscillations of the summands, and often we expect (e.g. from the probabilistic heuristics from Supplement 4) to be able to improve upon the trivial triangle inequality bound by a substantial amount; in the best case scenario, one typically expects a “square root cancellation” that gains a factor that is roughly the square root of the number of summands. (For instance, for Dirichlet characters ${\chi}$ of conductor ${O(x^{O(1)})}$, it is expected from probabilistic heuristics that the left-hand side of (3) should in fact be ${O_\varepsilon(x^{1/2+\varepsilon})}$ for any ${\varepsilon>0}$.)

It has proven surprisingly difficult, however, to establish significant cancellation in many of the sums of interest in analytic number theory, particularly if the sums do not have a strong amount of algebraic structure (e.g. multiplicative structure) which allow for the deployment of specialised techniques (such as multiplicative number theory techniques). In fact, we are forced to rely (to an embarrassingly large extent) on (many variations of) a single basic tool to capture at least some cancellation, namely the Cauchy-Schwarz inequality. In fact, in many cases the classical case

$\displaystyle |\sum_n f(n) \overline{g(n)}| \leq (\sum_n |f(n)|^2)^{1/2} (\sum_n |g(n)|^2)^{1/2}, \ \ \ \ \ (4)$

considered by Cauchy, where at least one of ${f, g: {\bf N} \rightarrow {\bf C}}$ is finitely supported, suffices for applications. Roughly speaking, the Cauchy-Schwarz inequality replaces the task of estimating a cross-correlation between two different functions ${f,g}$, to that of measuring self-correlations between ${f}$ and itself, or ${g}$ and itself, which are usually easier to compute (albeit at the cost of capturing less cancellation). Note that the Cauchy-Schwarz inequality requires almost no hypotheses on the functions ${f}$ or ${g}$, making it a very widely applicable tool.

There is however some skill required to decide exactly how to deploy the Cauchy-Schwarz inequality (and in particular, how to select ${f}$ and ${g}$); if applied blindly, one loses all cancellation and can even end up with a worse estimate than the trivial bound. For instance, if one tries to bound (2) directly by applying Cauchy-Schwarz with the functions ${\Lambda}$ and ${\chi}$, one obtains the bound

$\displaystyle |\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}| \leq (\sum_{n \leq x} \Lambda(n)^2)^{1/2} (\sum_{n \leq x} |\chi(n)|^2)^{1/2}.$

The right-hand side may be bounded by ${\ll x \log^{1/2} x}$, but this is worse than the trivial bound (3) by a logarithmic factor. This can be “blamed” on the fact that ${\Lambda}$ and ${\chi}$ are concentrated on rather different sets (${\Lambda}$ is concentrated on primes, while ${\chi}$ is more or less uniformly distributed amongst the natural numbers); but even if one corrects for this (e.g. by weighting Cauchy-Schwarz with some suitable “sieve weight” that is more concentrated on primes), one still does not do any better than (3). Indeed, the Cauchy-Schwarz inequality suffers from the same key weakness as the triangle inequality: it is insensitive to the phase oscillation of the factors ${f, g}$.

While the Cauchy-Schwarz inequality can be poor at estimating a single correlation such as (1), its power improves when considering an average (or sum, or square sum) of multiple correlations. In this set of notes, we will focus on one such situation of this type, namely that of trying to estimate a square sum

$\displaystyle (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \ \ \ \ \ (5)$

that measures the correlations of a single function ${f: {\bf N} \rightarrow {\bf C}}$ with multiple other functions ${g_j: {\bf N} \rightarrow {\bf C}}$. One should think of the situation in which ${f}$ is a “complicated” function, such as the von Mangoldt function ${\Lambda}$, but the ${g_j}$ are relatively “simple” functions, such as Dirichlet characters. In the case when the ${g_j}$ are orthonormal functions, we of course have the classical Bessel inequality:

Lemma 1 (Bessel inequality) Let ${g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}}$ be finitely supported functions obeying the orthonormality relationship

$\displaystyle \sum_n g_j(n) \overline{g_{j'}(n)} = 1_{j=j'}$

for all ${1 \leq j,j' \leq J}$. Then for any function ${f: {\bf N} \rightarrow {\bf C}}$, we have

$\displaystyle (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \leq (\sum_n |f(n)|^2)^{1/2}.$

For sake of comparison, if one were to apply the Cauchy-Schwarz inequality (4) separately to each summand in (5), one would obtain the bound of ${J^{1/2} (\sum_n |f(n)|^2)^{1/2}}$, which is significantly inferior to the Bessel bound when ${J}$ is large. Geometrically, what is going on is this: the Cauchy-Schwarz inequality (4) is only close to sharp when ${f}$ and ${g}$ are close to parallel in the Hilbert space ${\ell^2({\bf N})}$. But if ${g_1,\dots,g_J}$ are orthonormal, then it is not possible for any other vector ${f}$ to be simultaneously close to parallel to too many of these orthonormal vectors, and so the inner products of ${f}$ with most of the ${g_j}$ should be small. (See this previous blog post for more discussion of this principle.) One can view the Bessel inequality as formalising a repulsion principle: if ${f}$ correlates too much with some of the ${g_j}$, then it does not have enough “energy” to have large correlation with the rest of the ${g_j}$.

In analytic number theory applications, it is useful to generalise the Bessel inequality to the situation in which the ${g_j}$ are not necessarily orthonormal. This can be accomplished via the Cauchy-Schwarz inequality:

Proposition 2 (Generalised Bessel inequality) Let ${g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}}$ be finitely supported functions, and let ${\nu: {\bf N} \rightarrow {\bf R}^+}$ be a non-negative function. Let ${f: {\bf N} \rightarrow {\bf C}}$ be such that ${f}$ vanishes whenever ${\nu}$ vanishes, we have

$\displaystyle (\sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}|^2)^{1/2} \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2} \ \ \ \ \ (6)$

$\displaystyle \times ( \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)} )^{1/2}$

for some sequence ${c_1,\dots,c_J}$ of complex numbers with ${\sum_{j=1}^J |c_j|^2 = 1}$, with the convention that ${|f(n)|^2/\nu(n)}$ vanishes whenever ${f(n), \nu(n)}$ both vanish.

Note by relabeling that we may replace the domain ${{\bf N}}$ here by any other at most countable set, such as the integers ${{\bf Z}}$. (Indeed, one can give an analogue of this lemma on arbitrary measure spaces, but we will not do so here.) This result first appears in this paper of Boas.

Proof: We use the method of duality to replace the role of the function ${f}$ by a dual sequence ${c_1,\dots,c_J}$. By the converse to Cauchy-Schwarz, we may write the left-hand side of (6) as

$\displaystyle \sum_{j=1}^J \overline{c_j} \sum_{n} f(n) \overline{g_j(n)}$

for some complex numbers ${c_1,\dots,c_J}$ with ${\sum_{j=1}^J |c_j|^2 = 1}$. Indeed, if all of the ${\sum_{n} f(n) \overline{g_j(n)}}$ vanish, we can set the ${c_j}$ arbitrarily, otherwise we set ${(c_1,\dots,c_J)}$ to be the unit vector formed by dividing ${(\sum_{n} f(n) \overline{g_j(n)})_{j=1}^J}$ by its length. We can then rearrange this expression as

$\displaystyle \sum_n f(n) \overline{\sum_{j=1}^J c_j g_j(n)}.$

Applying Cauchy-Schwarz (dividing the first factor by ${\nu(n)^{1/2}}$ and multiplying the second by ${\nu(n)^{1/2}}$, after first removing those ${n}$ for which ${\nu(n)}$ vanish), this is bounded by

$\displaystyle (\sum_n |f(n)|^2 / \nu(n))^{1/2} (\sum_n \nu(n) |\sum_{j=1}^J c_j g_j(n)|^2)^{1/2},$

and the claim follows by expanding out the second factor. $\Box$

Observe that Lemma 1 is a special case of Proposition 2 when ${\nu=1}$ and the ${g_j}$ are orthonormal. In general, one can expect Proposition 2 to be useful when the ${g_j}$ are almost orthogonal relative to ${\nu}$, in that the correlations ${\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}}$ tend to be small when ${j,j'}$ are distinct. In that case, one can hope for the diagonal term ${j=j'}$ in the right-hand side of (6) to dominate, in which case one can obtain estimates of comparable strength to the classical Bessel inequality. The flexibility to choose different weights ${\nu}$ in the above proposition has some technical advantages; for instance, if ${f}$ is concentrated in a sparse set (such as the primes), it is sometimes useful to tailor ${\nu}$ to a comparable set (e.g. the almost primes) in order not to lose too much in the first factor ${\sum_n |f(n)|^2 / \nu(n)}$. Also, it can be useful to choose a fairly “smooth” weight ${\nu}$, in order to make the weighted correlations ${\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}}$ small.

Remark 3 In harmonic analysis, the use of tools such as Proposition 2 is known as the method of almost orthogonality, or the ${TT^*}$ method. The explanation for the latter name is as follows. For sake of exposition, suppose that ${\nu}$ is never zero (or we remove all ${n}$ from the domain for which ${\nu(n)}$ vanishes). Given a family of finitely supported functions ${g_1,\dots,g_J: {\bf N} \rightarrow {\bf C}}$, consider the linear operator ${T: \ell^2(\nu^{-1}) \rightarrow \ell^2(\{1,\dots,J\})}$ defined by the formula

$\displaystyle T f := ( \sum_{n} f(n) \overline{g_j(n)} )_{j=1}^J.$

This is a bounded linear operator, and the left-hand side of (6) is nothing other than the ${\ell^2(\{1,\dots,J\})}$ norm of ${Tf}$. Without any further information on the function ${f}$ other than its ${\ell^2(\nu^{-1})}$ norm ${(\sum_n |f(n)|^2 / \nu(n))^{1/2}}$, the best estimate one can obtain on (6) here is clearly

$\displaystyle (\sum_n |f(n)|^2 / \nu(n))^{1/2} \times \|T\|_{op},$

where ${\|T\|_{op}}$ denotes the operator norm of ${T}$.

The adjoint ${T^*: \ell^2(\{1,\dots,J\}) \rightarrow \ell^2(\nu^{-1})}$ is easily computed to be

$\displaystyle T^* (c_j)_{j=1}^J := (\sum_{j=1}^J c_j \nu(n) g_j(n) )_{n \in {\bf N}}.$

The composition ${TT^*: \ell^2(\{1,\dots,J\}) \rightarrow \ell^2(\{1,\dots,J\})}$ of ${T}$ and its adjoint is then given by

$\displaystyle TT^* (c_j)_{j=1}^J := (\sum_{j=1}^J c_j \sum_n \nu(n) g_j(n) \overline{g_{j'}}(n) )_{j=1}^J.$

From the spectral theorem (or singular value decomposition), one sees that the operator norms of ${T}$ and ${TT^*}$ are related by the identity

$\displaystyle \|T\|_{op} = \|TT^*\|_{op}^{1/2},$

and as ${TT^*}$ is a self-adjoint, positive semi-definite operator, the operator norm ${\|TT^*\|_{op}}$ is also the supremum of the quantity

$\displaystyle \langle TT^* (c_j)_{j=1}^J, (c_j)_{j=1}^J \rangle_{\ell^2(\{1,\dots,J\})} = \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}$

where ${(c_j)_{j=1}^J}$ ranges over unit vectors in ${\ell^2(\{1,\dots,J\})}$. Putting these facts together, we obtain Proposition 2; furthermore, we see from this analysis that the bound here is essentially optimal if the only information one is allowed to use about ${f}$ is its ${\ell^2(\nu^{-1})}$ norm.

For further discussion of almost orthogonality methods from a harmonic analysis perspective, see Chapter VII of this text of Stein.

Exercise 4 Under the same hypotheses as Proposition 2, show that

$\displaystyle \sum_{j=1}^J |\sum_{n} f(n) \overline{g_j(n)}| \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2}$

$\displaystyle \times ( \sum_{j=1}^J \sum_{j'=1}^J |\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}| )^{1/2}$

as well as the variant inequality

$\displaystyle |\sum_{j=1}^J \sum_{n} f(n) \overline{g_j(n)}| \leq (\sum_n |f(n)|^2 / \nu(n))^{1/2}$

$\displaystyle \times | \sum_{j=1}^J \sum_{j'=1}^J \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}|^{1/2}.$

Proposition 2 has many applications in analytic number theory; for instance, we will use it in later notes to control the large value of Dirichlet series such as the Riemann zeta function. One of the key benefits is that it largely eliminates the need to consider further correlations of the function ${f}$ (other than its self-correlation ${\sum_n |f(n)|^2 / \nu(n)}$ relative to ${\nu^{-1}}$, which is usually fairly easy to compute or estimate as ${\nu}$ is usually chosen to be relatively simple); this is particularly useful if ${f}$ is a function which is significantly more complicated to analyse than the functions ${g_j}$. Of course, the tradeoff for this is that one now has to deal with the coefficients ${c_j}$, which if anything are even less understood than ${f}$, since literally the only thing we know about these coefficients is their square sum ${\sum_{j=1}^J |c_j|^2}$. However, as long as there is enough almost orthogonality between the ${g_j}$, one can estimate the ${c_j}$ by fairly crude estimates (e.g. triangle inequality or Cauchy-Schwarz) and still get reasonably good estimates.

In this set of notes, we will use Proposition 2 to prove some versions of the large sieve inequality, which controls a square-sum of correlations

$\displaystyle \sum_n f(n) e( -\xi_j n )$

of an arbitrary finitely supported function ${f: {\bf Z} \rightarrow {\bf C}}$ with various additive characters ${n \mapsto e( \xi_j n)}$ (where ${e(x) := e^{2\pi i x}}$), or alternatively a square-sum of correlations

$\displaystyle \sum_n f(n) \overline{\chi_j(n)}$

of ${f}$ with various primitive Dirichlet characters ${\chi_j}$; it turns out that one can prove a (slightly sub-optimal) version of this inequality quite quickly from Proposition 2 if one first prepares the sum by inserting a smooth cutoff with well-behaved Fourier transform. The large sieve inequality has many applications (as the name suggests, it has particular utility within sieve theory). For the purposes of this set of notes, though, the main application we will need it for is the Bombieri-Vinogradov theorem, which in a very rough sense gives a prime number theorem in arithmetic progressions, which, “on average”, is of strength comparable to the results provided by the Generalised Riemann Hypothesis (GRH), but has the great advantage of being unconditional (it does not require any unproven hypotheses such as GRH); it can be viewed as a significant extension of the Siegel-Walfisz theorem from Notes 2. As we shall see in later notes, the Bombieri-Vinogradov theorem is a very useful ingredient in sieve-theoretic problems involving the primes.

There is however one additional important trick, beyond the large sieve, which we will need in order to establish the Bombieri-Vinogradov theorem. As it turns out, after some basic manipulations (and the deployment of some multiplicative number theory, and specifically the Siegel-Walfisz theorem), the task of proving the Bombieri-Vinogradov theorem is reduced to that of getting a good estimate on sums that are roughly of the form

$\displaystyle \sum_{j=1}^J |\sum_n \Lambda(n) \overline{\chi_j}(n)| \ \ \ \ \ (7)$

for some primitive Dirichlet characters ${\chi_j}$. This looks like the type of sum that can be controlled by the large sieve (or by Proposition 2), except that this is an ordinary sum rather than a square sum (i.e., an ${\ell^1}$ norm rather than an ${\ell^2}$ norm). One could of course try to control such a sum in terms of the associated square-sum through the Cauchy-Schwarz inequality, but this turns out to be very wasteful (it loses a factor of about ${J^{1/2}}$). Instead, one should try to exploit the special structure of the von Mangoldt function ${\Lambda}$, in particular the fact that it can be expressible as a Dirichlet convolution ${\alpha * \beta}$ of two further arithmetic sequences ${\alpha,\beta}$ (or as a finite linear combination of such Dirichlet convolutions). The reason for introducing this convolution structure is through the basic identity

$\displaystyle (\sum_n \alpha*\beta(n) \overline{\chi_j}(n)) = (\sum_n \alpha(n) \overline{\chi_j}(n)) (\sum_n \beta(n) \overline{\chi_j}(n)) \ \ \ \ \ (8)$

for any finitely supported sequences ${\alpha,\beta: {\bf N} \rightarrow {\bf C}}$, as can be easily seen by multiplying everything out and using the completely multiplicative nature of ${\chi_j}$. (This is the multiplicative analogue of the well-known relationship ${\widehat{f*g}(\xi) = \hat f(\xi) \hat g(\xi)}$ between ordinary convolution and Fourier coefficients.) This factorisation, together with yet another application of the Cauchy-Schwarz inequality, lets one control (7) by square-sums of the sort that can be handled by the large sieve inequality.

As we have seen in Notes 1, the von Mangoldt function ${\Lambda}$ does indeed admit several factorisations into Dirichlet convolution type, such as the factorisation ${\Lambda = \mu * L}$. One can try directly inserting this factorisation into the above strategy; it almost works, however there turns out to be a problem when considering the contribution of the portion of ${\mu}$ or ${L}$ that is supported at very small natural numbers, as the large sieve loses any gain over the trivial bound in such settings. Because of this, there is a need for a more sophisticated decomposition of ${\Lambda}$ into Dirichlet convolutions ${\alpha * \beta}$ which are non-degenerate in the sense that ${\alpha,\beta}$ are supported away from small values. (As a non-example, the trivial factorisation ${\Lambda = \Lambda * \delta}$ would be a totally inappropriate factorisation for this purpose.) Fortunately, it turns out that through some elementary combinatorial manipulations, some satisfactory decompositions of this type are available, such as the Vaughan identity and the Heath-Brown identity. By using one of these identities we will be able to complete the proof of the Bombieri-Vinogradov theorem. (These identities are also useful for other applications in which one wishes to control correlations between the von Mangoldt function ${\Lambda}$ and some other sequence; we will see some examples of this in later notes.)

For further reading on these topics, including a significantly larger number of examples of the large sieve inequality, see Chapters 7 and 17 of Iwaniec and Kowalski.

Remark 5 We caution that the presentation given in this set of notes is highly ahistorical; we are using modern streamlined proofs of results that were first obtained by more complicated arguments.

One of the most fundamental principles in Fourier analysis is the uncertainty principle. It does not have a single canonical formulation, but one typical informal description of the principle is that if a function ${f}$ is restricted to a narrow region of physical space, then its Fourier transform ${\hat f}$ must be necessarily “smeared out” over a broad region of frequency space. Some versions of the uncertainty principle are discussed in this previous blog post.

In this post I would like to highlight a useful instance of the uncertainty principle, due to Hugh Montgomery, which is useful in analytic number theory contexts. Specifically, suppose we are given a complex-valued function ${f: {\bf Z} \rightarrow {\bf C}}$ on the integers. To avoid irrelevant issues at spatial infinity, we will assume that the support ${\hbox{supp}(f) := \{ n \in {\bf Z}: f(n) \neq 0 \}}$ of this function is finite (in practice, we will only work with functions that are supported in an interval ${[M+1,M+N]}$ for some natural numbers ${M,N}$). Then we can define the Fourier transform ${\hat f: {\bf R}/{\bf Z} \rightarrow {\bf C}}$ by the formula

$\displaystyle \hat f(\xi) := \sum_{n \in {\bf Z}} f(n) e(-n\xi)$

where ${e(x) := e^{2\pi i x}}$. (In some literature, the sign in the exponential phase is reversed, but this will make no substantial difference to the arguments below.)

The classical uncertainty principle, in this context, asserts that if ${f}$ is localised in an interval of length ${N}$, then ${\hat f}$ must be “smeared out” at a scale of at least ${1/N}$ (and essentially constant at scales less than ${1/N}$). For instance, if ${f}$ is supported in ${[M+1,M+N]}$, then we have the Plancherel identity

$\displaystyle \int_{{\bf R}/{\bf Z}} |\hat f(\xi)|^2\ d\xi = \| f \|_{\ell^2({\bf Z})}^2 \ \ \ \ \ (1)$

while from the Cauchy-Schwarz inequality we have

$\displaystyle |\hat f(\xi)| \leq N^{1/2} \| f \|_{\ell^2({\bf Z})}$

for each frequency ${\xi}$, and in particular that

$\displaystyle \int_I |\hat f(\xi)|^2\ d\xi \leq N |I| \| f \|_{\ell^2({\bf Z})}^2$

for any arc ${I}$ in the unit circle (with ${|I|}$ denoting the length of ${I}$). In particular, an interval of length significantly less than ${1/N}$ can only capture a fraction of the ${L^2}$ energy of the Fourier transform of ${\hat f}$, which is consistent with the above informal statement of the uncertainty principle.

Another manifestation of the classical uncertainty principle is the large sieve inequality. A particularly nice formulation of this inequality is due independently to Montgomery and Vaughan and Selberg: if ${f}$ is supported in ${[M+1,M+N]}$, and ${\xi_1,\ldots,\xi_J}$ are frequencies in ${{\bf R}/{\bf Z}}$ that are ${\delta}$-separated for some ${\delta>0}$, thus ${\| \xi_i-\xi_j\|_{{\bf R}/{\bf Z}} \geq \delta}$ for all ${1 \leq i (where ${\|\xi\|_{{\bf R}/{\bf Z}}}$ denotes the distance of ${\xi}$ to the origin in ${{\bf R}/{\bf Z}}$), then

$\displaystyle \sum_{j=1}^J |\hat f(\xi_j)|^2 \leq (N + \frac{1}{\delta}) \| f \|_{\ell^2({\bf Z})}^2. \ \ \ \ \ (2)$

The reader is encouraged to see how this inequality is consistent with the Plancherel identity (1) and the intuition that ${\hat f}$ is essentially constant at scales less than ${1/N}$. The factor ${N + \frac{1}{\delta}}$ can in fact be amplified a little bit to ${N + \frac{1}{\delta} - 1}$, which is essentially optimal, by using a neat dilation trick of Paul Cohen, in which one dilates ${[M+1,M+N]}$ to ${[MK+K, MK+NK]}$ (and replaces each frequency ${\xi_j}$ by their ${K^{th}}$ roots), and then sending ${K \rightarrow \infty}$ (cf. the tensor product trick); see this survey of Montgomery for details. But we will not need this refinement here.

In the above instances of the uncertainty principle, the concept of narrow support in physical space was formalised in the Archimedean sense, using the standard Archimedean metric ${d_\infty(n,m) := |n-m|}$ on the integers ${{\bf Z}}$ (in particular, the parameter ${N}$ is essentially the Archimedean diameter of the support of ${f}$). However, in number theory, the Archimedean metric is not the only metric of importance on the integers; the ${p}$-adic metrics play an equally important role; indeed, it is common to unify the Archimedean and ${p}$-adic perspectives together into a unified adelic perspective. In the ${p}$-adic world, the metric balls are no longer intervals, but are instead residue classes modulo some power of ${p}$. Intersecting these balls from different ${p}$-adic metrics together, we obtain residue classes with respect to various moduli ${q}$ (which may be either prime or composite). As such, another natural manifestation of the concept of “narrow support in physical space” is “vanishes on many residue classes modulo ${q}$“. This notion of narrowness is particularly common in sieve theory, when one deals with functions supported on thin sets such as the primes, which naturally tend to avoid many residue classes (particularly if one throws away the first few primes).

In this context, the uncertainty principle is this: the more residue classes modulo ${q}$ that ${f}$ avoids, the more the Fourier transform ${\hat f}$ must spread out along multiples of ${1/q}$. To illustrate a very simple example of this principle, let us take ${q=2}$, and suppose that ${f}$ is supported only on odd numbers (thus it completely avoids the residue class ${0 \mod 2}$). We write out the formulae for ${\hat f(\xi)}$ and ${\hat f(\xi+1/2)}$:

$\displaystyle \hat f(\xi) = \sum_n f(n) e(-n\xi)$

$\displaystyle \hat f(\xi+1/2) = \sum_n f(n) e(-n\xi) e(-n/2).$

If ${f}$ is supported on the odd numbers, then ${e(-n/2)}$ is always equal to ${-1}$ on the support of ${f}$, and so we have ${\hat f(\xi+1/2)=-\hat f(\xi)}$. Thus, whenever ${\hat f}$ has a significant presence at a frequency ${\xi}$, it also must have an equally significant presence at the frequency ${\xi+1/2}$; there is a spreading out across multiples of ${1/2}$. Note that one has a similar effect if ${f}$ was supported instead on the even integers instead of the odd integers.

A little more generally, suppose now that ${f}$ avoids a single residue class modulo a prime ${p}$; for sake of argument let us say that it avoids the zero residue class ${0 \mod p}$, although the situation for the other residue classes is similar. For any frequency ${\xi}$ and any ${j=0,\ldots,p-1}$, one has

$\displaystyle \hat f(\xi+j/p) = \sum_n f(n) e(-n\xi) e(-jn/p).$

From basic Fourier analysis, we know that the phases ${e(-jn/p)}$ sum to zero as ${j}$ ranges from ${0}$ to ${p-1}$ whenever ${n}$ is not a multiple of ${p}$. We thus have

$\displaystyle \sum_{j=0}^{p-1} \hat f(\xi+j/p) = 0.$

In particular, if ${\hat f(\xi)}$ is large, then one of the other ${\hat f(\xi+j/p)}$ has to be somewhat large as well; using the Cauchy-Schwarz inequality, we can quantify this assertion in an ${\ell^2}$ sense via the inequality

$\displaystyle \sum_{j=1}^{p-1} |\hat f(\xi+j/p)|^2 \geq \frac{1}{p-1} |\hat f(\xi)|^2. \ \ \ \ \ (3)$

Let us continue this analysis a bit further. Now suppose that ${f}$ avoids ${\omega(p)}$ residue classes modulo a prime ${p}$, for some ${0 \leq \omega(p) < p}$. (We exclude the case ${\omega(p)=p}$ as it is clearly degenerates by forcing ${f}$ to be identically zero.) Let ${\nu(n)}$ be the function that equals ${1}$ on these residue classes and zero away from these residue classes, then

$\displaystyle \sum_n f(n) e(-n\xi) \nu(n) = 0.$

Using the periodic Fourier transform, we can write

$\displaystyle \nu(n) = \sum_{j=0}^{p-1} c(j) e(-jn/p)$

for some coefficients ${c(j)}$, thus

$\displaystyle \sum_{j=0}^{p-1} \hat f(\xi+j/p) c(j) = 0.$

Some Fourier-analytic computations reveal that

$\displaystyle c(0) = \frac{\omega(p)}{p}$

and

$\displaystyle \sum_{j=0}^{p-1} |c(j)|^2 = \frac{\omega(p)}{p}$

and so after some routine algebra and the Cauchy-Schwarz inequality, we obtain a generalisation of (3):

$\displaystyle \sum_{j=1}^{p-1} |\hat f(\xi+j/p)|^2 \geq \frac{\omega(p)}{p-\omega(p)} |\hat f(\xi)|^2.$

Thus we see that the more residue classes mod ${p}$ we exclude, the more Fourier energy has to disperse along multiples of ${1/p}$. It is also instructive to consider the extreme case ${\omega(p)=p-1}$, in which ${f}$ is supported on just a single residue class ${a \mod p}$; in this case, one clearly has ${\hat f(\xi+j/p) = e(-aj/p) \hat f(\xi)}$, and so ${\hat f}$ spreads its energy completely evenly along multiples of ${1/p}$.

In 1968, Montgomery observed the following useful generalisation of the above calculation to arbitrary modulus:

Proposition 1 (Montgomery’s uncertainty principle) Let ${f: {\bf Z} \rightarrow{\bf C}}$ be a finitely supported function which, for each prime ${p}$, avoids ${\omega(p)}$ residue classes modulo ${p}$ for some ${0 \leq \omega(p) < p}$. Then for each natural number ${q}$, one has

$\displaystyle \sum_{1 \leq a \leq q: (a,q)=1} |\hat f(\xi+\frac{a}{q})|^2 \geq h(q) |\hat f(\xi)|^2$

where ${h(q)}$ is the quantity

$\displaystyle h(q) := \mu(q)^2 \prod_{p|q} \frac{\omega(p)}{p-\omega(p)} \ \ \ \ \ (4)$

where ${\mu}$ is the Möbius function.

We give a proof of this proposition below the fold.

Following the “adelic” philosophy, it is natural to combine this uncertainty principle with the large sieve inequality to take simultaneous advantage of localisation both in the Archimedean sense and in the ${p}$-adic senses. This leads to the following corollary:

Corollary 2 (Arithmetic large sieve inequality) Let ${f: {\bf Z} \rightarrow {\bf C}}$ be a function supported on an interval ${[M+1,M+N]}$ which, for each prime ${p}$, avoids ${\omega(p)}$ residue classes modulo ${p}$ for some ${0 \leq \omega(p) < p}$. Let ${\xi_1,\ldots,\xi_J \in {\bf R}/{\bf Z}}$, and let ${{\mathcal Q}}$ be a finite set of natural numbers. Suppose that the frequencies ${\xi_j + \frac{a}{q}}$ with ${1 \leq j \leq J}$, ${q \in {\mathcal Q}}$, and ${(a,q)=1}$ are ${\delta}$-separated. Then one has

$\displaystyle \sum_{j=1}^J |\hat f(\xi_j)|^2 \leq \frac{N + \frac{1}{\delta}}{\sum_{q \in {\mathcal Q}} h(q)} \| f \|_{\ell^2({\bf Z})}^2$

where ${h(q)}$ was defined in (4).

Indeed, from the large sieve inequality one has

$\displaystyle \sum_{j=1}^J \sum_{q \in {\mathcal Q}} \sum_{1 \leq a \leq q: (a,q)=1} |\hat f(\xi_j+\frac{a}{q})|^2 \leq (N + \frac{1}{\delta}) \| f \|_{\ell^2({\bf Z})}^2$

while from Proposition 1 one has

$\displaystyle \sum_{q \in {\mathcal Q}} \sum_{1 \leq a \leq q: (a,q)=1} |\hat f(\xi_j+\frac{a}{q})|^2 \geq (\sum_{q \in {\mathcal Q}} h(q)) |\hat f(\xi_j)|^2$

whence the claim.

There is a great deal of flexibility in the above inequality, due to the ability to select the set ${{\mathcal Q}}$, the frequencies ${\xi_1,\ldots,\xi_J}$, the omitted classes ${\omega(p)}$, and the separation parameter ${\delta}$. Here is a typical application concerning the original motivation for the large sieve inequality, namely in bounding the size of sets which avoid many residue classes:

Corollary 3 (Large sieve) Let ${A}$ be a set of integers contained in ${[M+1,M+N]}$ which avoids ${\omega(p)}$ residue classes modulo ${p}$ for each prime ${p}$, and let ${R>0}$. Then

$\displaystyle |A| \leq \frac{N+R^2}{G(R)}$

where

$\displaystyle G(R) := \sum_{q \leq R} h(q).$

Proof: We apply Corollary 2 with ${f = 1_A}$, ${J=1}$, ${\delta=1/R^2}$, ${\xi_1=0}$, and ${{\mathcal Q} := \{ q: q \leq R\}}$. The key point is that the Farey sequence of fractions ${a/q}$ with ${q \leq R}$ and ${(a,q)=1}$ is ${1/R^2}$-separated, since

$\displaystyle \| \frac{a}{q}-\frac{a'}{q'} \|_{{\bf R}/{\bf Z}} \geq \frac{1}{qq'} \geq \frac{1}{R^2}$

whenever ${a/q, a'/q'}$ are distinct fractions in this sequence. $\Box$

If, for instance, ${A}$ is the set of all primes in ${[M+1,M+N]}$ larger than ${R}$, then one can set ${\omega(p)=1}$ for all ${p \leq R}$, which makes ${h(q) = \frac{\mu^2(q)}{\phi(q)}}$, where ${\phi}$ is the Euler totient function. It is a classical estimate that

$\displaystyle G(R) = \log R + O(1).$

Using this fact and optimising in ${R}$, we obtain (a special case of) the Brun-Titchmarsh inequality

$\displaystyle \pi(M+N)-\pi(M) \leq (2+o_{N \rightarrow \infty}(1)) \frac{N}{\log N}$

where ${\pi(x)}$ is the prime counting function; a variant of the same argument gives the more general Brun-Titchmarsh inequality

$\displaystyle \pi(M+N;a,q)-\pi(M;a,q) \leq (2+o_{N \rightarrow \infty;q}(1)) \frac{q}{\phi(q)} \frac{N}{\log N}$

for any primitive residue class ${a \mod q}$, where ${\pi(M;a,q)}$ is the number of primes less than or equal to ${M}$ that are congruent to ${a \mod q}$. By performing a more careful optimisation using a slightly sharper version of the large sieve inequality (2) that exploits the irregular spacing of the Farey sequence, Montgomery and Vaughan were able to delete the error term in the Brun-Titchmarsh inequality, thus establishing the very nice inequality

$\displaystyle \pi(M+N;a,q)-\pi(M;a,q) \leq 2 \frac{q}{\phi(q)} \frac{N}{\log N}$

for any natural numbers ${M,N,a,q}$ with ${N>1}$. This is a particularly useful inequality in non-asymptotic analytic number theory (when one wishes to study number theory at explicit orders of magnitude, rather than the number theory of sufficiently large numbers), due to the absence of asymptotic notation.

I recently realised that Corollary 2 also establishes a stronger version of the “restriction theorem for the Selberg sieve” that Ben Green and I proved some years ago (indeed, one can view Corollary 2 as a “restriction theorem for the large sieve”). I’m placing the details below the fold.