Klaus Roth, who made fundamental contributions to analytic number theory, died this Tuesday, aged 90.
I never met or communicated with Roth personally, but was certainly influenced by his work; he wrote relatively few papers, but they tended to have outsized impact. For instance, he was one of the key people (together with Bombieri) to work on simplifying and generalising the large sieve, taking it from the technically formidable original formulation of Linnik and Rényi to the clean and general almost orthogonality principle that we have today (discussed for instance in these lecture notes of mine). The paper of Roth that had the most impact on my own personal work was his three-page paper proving what is now known as Roth’s theorem on arithmetic progressions:
Theorem 1 (Roth’s theorem on arithmetic progressions) Let $A \subset \mathbb{N}$ be a set of natural numbers of positive upper density (thus $\limsup_{N \to \infty} |A \cap \{1,\dots,N\}|/N > 0$). Then $A$ contains infinitely many arithmetic progressions $a, a+r, a+2r$ of length three (with $r$ non-zero of course).
At the heart of Roth’s elegant argument was the following (surprising at the time) dichotomy: if $A$ had some moderately large density within some arithmetic progression $P$, either one could use Fourier-analytic methods to detect the presence of an arithmetic progression of length three inside $A \cap P$, or else one could locate a long subprogression $P'$ of $P$ on which $A$ had increased density. Iterating this dichotomy by an argument now known as the density increment argument, one eventually obtains Roth’s theorem, no matter which side of the dichotomy actually holds. This argument (and the many descendants of it), based on various “dichotomies between structure and randomness”, became essential in many other results of this type, most famously perhaps in Szemerédi’s proof of his celebrated theorem on arithmetic progressions that generalised Roth’s theorem to progressions of arbitrary length. More recently, my work on the Chowla and Elliott conjectures, which was a crucial component of the solution of the Erdős discrepancy problem, relies on an entropy decrement argument which was directly inspired by the density increment argument of Roth.
The Erdős discrepancy problem is also connected with another well-known theorem of Roth:
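Theorem 2 (Roth’s discrepancy theorem for arithmetic progressions) Let $f(1),\dots,f(n)$ be a sequence in $\{-1,+1\}$. Then there exists an arithmetic progression $a+r, a+2r, \dots, a+kr$ in $\{1,\dots,n\}$ with $r$ positive such that
$$\Big|\sum_{j=1}^k f(a+jr)\Big| \geq c n^{1/4}$$
for an absolute constant $c > 0$.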
In fact, Roth proved a stronger estimate regarding mean square discrepancy, which I am not writing down here; as with the Roth theorem in arithmetic progressions, his proof was short and Fourier-analytic in nature (although non-Fourier-analytic proofs have since been found, for instance the semidefinite programming proof of Lovász). The exponent $1/4$ is known to be sharp (a result of Matoušek and Spencer).
As a particular corollary of the above theorem, for an infinite sequence $f(1), f(2), \dots$ of signs, the sums $|\sum_{j=1}^k f(a+jr)|$ are unbounded in $a, r, k$. The Erdős discrepancy problem asks whether the same statement holds when $a$ is restricted to be zero. (Roth also established discrepancy theorems for other sets, such as rectangles, which will not be discussed here.)
Finally, one has to mention Roth’s most famous result, cited for instance in his Fields medal citation:
Theorem 3 (Roth’s theorem on Diophantine approximation) Let $\alpha$ be an irrational algebraic number. Then for any $\varepsilon > 0$ there is a quantity $c_{\alpha,\varepsilon} > 0$ such that
$$\Big|\alpha - \frac{a}{q}\Big| > \frac{c_{\alpha,\varepsilon}}{q^{2+\varepsilon}}$$
for every rational number $\frac{a}{q}$.
From the Dirichlet approximation theorem (or from the theory of continued fractions) we know that the exponent $2+\varepsilon$ in the denominator cannot be reduced to $2$ or below. A classical and easy theorem of Liouville gives the claim with the exponent $2+\varepsilon$ replaced by the degree $d$ of the algebraic number $\alpha$; work of Thue and Siegel reduced this exponent, but Roth was the one who obtained the near-optimal result. An important point is that the constant $c_{\alpha,\varepsilon}$ is ineffective: it is a major open problem in Diophantine approximation to produce any bound significantly stronger than Liouville’s theorem with effective constants. This is because the proof of Roth’s theorem does not exclude any single rational $a/q$ from being close to $\alpha$, but instead very ingeniously shows that one cannot have two different rationals $a/q$, $a'/q'$ that are unusually close to $\alpha$, even when the denominators $q, q'$ are very different in size. (I refer to this sort of argument as a “dueling conspiracies” argument; they are strangely prevalent throughout analytic number theory.)
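Consider a correlation
$$\sum_n f(n) \overline{g(n)} \qquad (1)$$
between two arithmetic functions $f: \mathbb{N} \to \mathbb{C}$ and $g: \mathbb{N} \to \mathbb{C}$, which to avoid technicalities we will assume to be finitely supported (or that the $n$ variable is localised to a finite range, such as $\{n: n \leq x\}$). A key example to keep in mind for the purposes of this set of notes is the twisted von Mangoldt summatory function
$$\sum_{n \leq x} \Lambda(n) \overline{\chi(n)} \qquad (2)$$
that measures the correlation between the primes and a Dirichlet character $\chi$. One can get a “trivial” bound on such sums from the triangle inequality
$$\Big|\sum_n f(n) \overline{g(n)}\Big| \leq \sum_n |f(n)| |g(n)|;$$
for instance, from the triangle inequality and the prime number theorem we have
$$\Big|\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}\Big| \leq x + o(x) \qquad (3)$$
as $x \to \infty$. But the triangle inequality is insensitive to the phase oscillations of the summands, and often we expect (e.g. from the probabilistic heuristics from Supplement 4) to be able to improve upon the trivial triangle inequality bound by a substantial amount; in the best case scenario, one typically expects a “square root cancellation” that gains a factor that is roughly the square root of the number of summands. (For instance, for Dirichlet characters $\chi$ of conductor $O(x^{O(1)})$, it is expected from probabilistic heuristics that the left-hand side of (3) should in fact be $O_\varepsilon(x^{1/2+\varepsilon})$ for any $\varepsilon > 0$.)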
It has proven surprisingly difficult, however, to establish significant cancellation in many of the sums of interest in analytic number theory, particularly if the sums do not have a strong amount of algebraic structure (e.g. multiplicative structure) which allows for the deployment of specialised techniques (such as multiplicative number theory techniques). In fact, we are forced to rely (to an embarrassingly large extent) on (many variations of) a single basic tool to capture at least some cancellation, namely the Cauchy-Schwarz inequality. In fact, in many cases the classical case
$$\Big|\sum_n f(n) \overline{g(n)}\Big| \leq \Big(\sum_n |f(n)|^2\Big)^{1/2} \Big(\sum_n |g(n)|^2\Big)^{1/2} \qquad (4)$$
considered by Cauchy, where at least one of $f, g: \mathbb{N} \to \mathbb{C}$ is finitely supported, suffices for applications. Roughly speaking, the Cauchy-Schwarz inequality replaces the task of estimating a cross-correlation between two different functions $f, g$ with that of measuring self-correlations between $f$ and itself, or $g$ and itself, which are usually easier to compute (albeit at the cost of capturing less cancellation). Note that the Cauchy-Schwarz inequality requires almost no hypotheses on the functions $f$ or $g$, making it a very widely applicable tool.
There is however some skill required to decide exactly how to deploy the Cauchy-Schwarz inequality (and in particular, how to select $f$ and $g$); if applied blindly, one loses all cancellation and can even end up with a worse estimate than the trivial bound. For instance, if one tries to bound (2) directly by applying Cauchy-Schwarz with the functions $\Lambda$ and $\chi$, one obtains the bound
$$\Big|\sum_{n \leq x} \Lambda(n) \overline{\chi(n)}\Big| \leq \Big(\sum_{n \leq x} \Lambda(n)^2\Big)^{1/2} \Big(\sum_{n \leq x} |\chi(n)|^2\Big)^{1/2}.$$
The right-hand side may be bounded by $\ll x \log^{1/2} x$, but this is worse than the trivial bound (3) by a logarithmic factor. This can be “blamed” on the fact that $\Lambda$ and $\chi$ are concentrated on rather different sets ($\Lambda$ is concentrated on primes, while $\chi$ is more or less uniformly distributed amongst the natural numbers); but even if one corrects for this (e.g. by weighting Cauchy-Schwarz with some suitable “sieve weight” that is more concentrated on primes), one still does not do any better than (3). Indeed, the Cauchy-Schwarz inequality suffers from the same key weakness as the triangle inequality: it is insensitive to the phase oscillation of the factors $f, g$.
While the Cauchy-Schwarz inequality can be poor at estimating a single correlation such as (1), its power improves when considering an average (or sum, or square sum) of multiple correlations. In this set of notes, we will focus on one such situation of this type, namely that of trying to estimate a square sum
$$\Big(\sum_{j=1}^J \Big|\sum_n f(n) \overline{g_j(n)}\Big|^2\Big)^{1/2} \qquad (5)$$
that measures the correlations of a single function $f: \mathbb{N} \to \mathbb{C}$ with multiple other functions $g_j: \mathbb{N} \to \mathbb{C}$. One should think of the situation in which $f$ is a “complicated” function, such as the von Mangoldt function $\Lambda$, but the $g_j$ are relatively “simple” functions, such as Dirichlet characters. In the case when the $g_j$ are orthonormal functions, we of course have the classical Bessel inequality:
Lemma 1 (Bessel inequality) Let $g_1,\dots,g_J: \mathbb{N} \to \mathbb{C}$ be finitely supported functions obeying the orthonormality relationship
$$\sum_n g_j(n) \overline{g_{j'}(n)} = 1_{j=j'}$$
for all $1 \leq j, j' \leq J$. Then for any function $f: \mathbb{N} \to \mathbb{C}$, we have
$$\Big(\sum_{j=1}^J \Big|\sum_n f(n) \overline{g_j(n)}\Big|^2\Big)^{1/2} \leq \Big(\sum_n |f(n)|^2\Big)^{1/2}.$$
For sake of comparison, if one were to apply the Cauchy-Schwarz inequality (4) separately to each summand in (5), one would obtain the bound of $J^{1/2} (\sum_n |f(n)|^2)^{1/2}$, which is significantly inferior to the Bessel bound when $J$ is large. Geometrically, what is going on is this: the Cauchy-Schwarz inequality (4) is only close to sharp when $f$ and $g$ are close to parallel in the Hilbert space $\ell^2(\mathbb{N})$. But if $g_1,\dots,g_J$ are orthonormal, then it is not possible for any other vector $f$ to be simultaneously close to parallel to too many of these orthonormal vectors, and so the inner products of $f$ with most of the $g_j$ should be small. (See this previous blog post for more discussion of this principle.) One can view the Bessel inequality as formalising a repulsion principle: if $f$ correlates too much with some of the $g_j$, then it does not have enough “energy” to have large correlation with the rest of the $g_j$.
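The gap between the Bessel bound and termwise Cauchy-Schwarz can be seen numerically. The following is a minimal sketch, not part of the original notes: it assumes the orthonormal family is the normalised additive characters $g_k(n) = e(kn/N)/\sqrt{N}$ on $\{0,\dots,N-1\}$, takes a random test function, and compares the squares of the quantities involved. The helper names (`correlation`, `additive_characters`, `compare_bounds`) are hypothetical.

```python
import cmath
import random

def correlation(f, g):
    """Return sum_n f(n) * conj(g(n)) for two equal-length sequences."""
    return sum(a * b.conjugate() for a, b in zip(f, g))

def additive_characters(N, J):
    """First J of the orthonormal functions g_k(n) = e(kn/N) / sqrt(N) on {0,...,N-1}."""
    return [[cmath.exp(2j * cmath.pi * k * n / N) / N ** 0.5 for n in range(N)]
            for k in range(J)]

def compare_bounds(f, gs):
    """Return (square sum of correlations, Bessel bound, termwise Cauchy-Schwarz bound).

    All three quantities are squared: the Bessel bound is sum_n |f(n)|^2 and the
    termwise bound is J times that, since each g in gs has unit norm.
    """
    energy = sum(abs(x) ** 2 for x in f)
    square_sum = sum(abs(correlation(f, g)) ** 2 for g in gs)
    return square_sum, energy, len(gs) * energy

random.seed(0)
N, J = 64, 16
f = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
square_sum, bessel, termwise = compare_bounds(f, additive_characters(N, J))
print(square_sum, bessel, termwise)
```

The first printed number never exceeds the second, while the third is larger than the second by the factor $J$, which is the loss incurred by ignoring orthogonality.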
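In analytic number theory applications, it is useful to generalise the Bessel inequality to the situation in which the $g_j$ are not necessarily orthonormal. This can be accomplished via the Cauchy-Schwarz inequality:
Proposition 2 (Generalised Bessel inequality) Let $g_1,\dots,g_J: \mathbb{N} \to \mathbb{C}$ be finitely supported functions, and let $\nu: \mathbb{N} \to \mathbb{R}^+$ be a non-negative function. Let $f: \mathbb{N} \to \mathbb{C}$ be such that $f$ vanishes whenever $\nu$ vanishes. Then we have
$$\Big(\sum_{j=1}^J \Big|\sum_n f(n) \overline{g_j(n)}\Big|^2\Big)^{1/2} \leq \Big(\sum_n \frac{|f(n)|^2}{\nu(n)}\Big)^{1/2} \Big(\sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}\Big)^{1/2} \qquad (6)$$
for some sequence $c_1,\dots,c_J$ of complex numbers with $\sum_{j=1}^J |c_j|^2 = 1$, with the convention that $|f(n)|^2/\nu(n)$ vanishes whenever $f(n), \nu(n)$ both vanish.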
Note by relabeling that we may replace the domain $\mathbb{N}$ here by any other at most countable set, such as the integers $\mathbb{Z}$. (Indeed, one can give an analogue of this lemma on arbitrary measure spaces, but we will not do so here.) This result first appears in this paper of Boas.
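Proof: We use the method of duality to replace the role of the function $f$ by a dual sequence $c_1,\dots,c_J$. By the converse to Cauchy-Schwarz, we may write the left-hand side of (6) as
$$\sum_{j=1}^J \overline{c_j} \sum_n f(n) \overline{g_j(n)}$$
for some complex numbers $c_1,\dots,c_J$ with $\sum_{j=1}^J |c_j|^2 = 1$. Indeed, if all of the $\sum_n f(n) \overline{g_j(n)}$ vanish, we can set the $c_j$ arbitrarily, otherwise we set $(c_1,\dots,c_J)$ to be the unit vector formed by dividing $(\sum_n f(n) \overline{g_j(n)})_{j=1}^J$ by its length. We can then rearrange this expression as
$$\sum_n f(n) \overline{\sum_{j=1}^J c_j g_j(n)}.$$
Applying Cauchy-Schwarz (dividing the first factor by $\nu(n)^{1/2}$ and multiplying the second by $\nu(n)^{1/2}$, after first removing those $n$ for which $\nu(n)$ vanishes), this is bounded by
$$\Big(\sum_n \frac{|f(n)|^2}{\nu(n)}\Big)^{1/2} \Big(\sum_n \nu(n) \Big|\sum_{j=1}^J c_j g_j(n)\Big|^2\Big)^{1/2},$$
and the claim follows by expanding out the second factor.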
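The duality step in this proof is concrete enough to check numerically. The sketch below is an illustration under stated assumptions, not the notes' own code: it takes the dual sequence $c$ to be the normalised vector of correlations $\sum_n f(n)\overline{g_j(n)}$, assumes a strictly positive weight $\nu$, and compares the square sum of correlations with the weighted Cauchy-Schwarz bound $(\sum_n |f(n)|^2/\nu(n))^{1/2} (\sum_n \nu(n) |\sum_j c_j g_j(n)|^2)^{1/2}$. The function names are hypothetical.

```python
import random

def correlation(f, g):
    """Return sum_n f(n) * conj(g(n)) for two equal-length sequences."""
    return sum(a * b.conjugate() for a, b in zip(f, g))

def generalised_bessel_sides(f, gs, nu):
    """Return (lhs, rhs) of the generalised Bessel inequality.

    Assumes nu(n) > 0 for every n (an assumption of this sketch), so that no
    terms need to be removed before dividing by nu.
    """
    S = [correlation(f, g) for g in gs]
    lhs = sum(abs(s) ** 2 for s in S) ** 0.5
    # Dual unit vector: the correlations divided by their length
    # (any unit vector will do if all correlations vanish).
    c = [s / lhs for s in S] if lhs > 0 else [1.0] + [0.0] * (len(gs) - 1)
    first = sum(abs(x) ** 2 / w for x, w in zip(f, nu)) ** 0.5
    second = sum(w * abs(sum(cj * g[n] for cj, g in zip(c, gs))) ** 2
                 for n, w in enumerate(nu)) ** 0.5
    return lhs, first * second

random.seed(2)
N, J = 40, 6
rand_c = lambda: complex(random.gauss(0, 1), random.gauss(0, 1))
f = [rand_c() for _ in range(N)]
gs = [[rand_c() for _ in range(N)] for _ in range(J)]   # not orthogonal
nu = [random.uniform(0.5, 2.0) for _ in range(N)]        # positive weight
lhs, rhs = generalised_bessel_sides(f, gs, nu)
print(lhs, rhs)
```

With random (far from orthogonal) $g_j$ the bound is typically loose; it tightens as the weighted correlations between distinct $g_j$ shrink.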
Observe that Lemma 1 is a special case of Proposition 2 when $\nu = 1$ and the $g_j$ are orthonormal. In general, one can expect Proposition 2 to be useful when the $g_j$ are almost orthogonal relative to $\nu$, in that the correlations $\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}$ tend to be small when $j, j'$ are distinct. In that case, one can hope for the diagonal term $j = j'$ in the right-hand side of (6) to dominate, in which case one can obtain estimates of comparable strength to the classical Bessel inequality. The flexibility to choose different weights $\nu$ in the above proposition has some technical advantages; for instance, if $f$ is concentrated in a sparse set (such as the primes), it is sometimes useful to tailor $\nu$ to a comparable set (e.g. the almost primes) in order not to lose too much in the first factor $\sum_n |f(n)|^2/\nu(n)$. Also, it can be useful to choose a fairly “smooth” weight $\nu$, in order to make the weighted correlations $\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}$ small.
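Remark 3 In harmonic analysis, the use of tools such as Proposition 2 is known as the method of almost orthogonality, or the $TT^*$ method. The explanation for the latter name is as follows. For sake of exposition, suppose that $\nu$ is never zero (or we remove all $n$ from the domain for which $\nu(n)$ vanishes). Given a family of finitely supported functions $g_1,\dots,g_J: \mathbb{N} \to \mathbb{C}$, consider the linear operator $T: \ell^2(\nu^{-1}) \to \ell^2(\{1,\dots,J\})$ defined by the formula
$$T f := \Big(\sum_n f(n) \overline{g_j(n)}\Big)_{j=1}^J.$$
This is a bounded linear operator, and the left-hand side of (6) is nothing other than the $\ell^2(\{1,\dots,J\})$ norm of $Tf$. Without any further information on the function $f$ other than its $\ell^2(\nu^{-1})$ norm $(\sum_n |f(n)|^2/\nu(n))^{1/2}$, the best estimate one can obtain on (6) here is clearly
$$\Big(\sum_n \frac{|f(n)|^2}{\nu(n)}\Big)^{1/2} \times \|T\|_{op},$$
where $\|T\|_{op}$ denotes the operator norm of $T$.
The adjoint $T^*: \ell^2(\{1,\dots,J\}) \to \ell^2(\nu^{-1})$ is easily computed to be
$$T^* (c_j)_{j=1}^J := \Big(\sum_{j=1}^J c_j \nu(n) g_j(n)\Big)_{n \in \mathbb{N}}.$$
The composition $TT^*: \ell^2(\{1,\dots,J\}) \to \ell^2(\{1,\dots,J\})$ of $T$ and its adjoint is then given by
$$TT^* (c_j)_{j=1}^J := \Big(\sum_{j=1}^J c_j \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}\Big)_{j'=1}^J.$$
From the spectral theorem (or singular value decomposition), one sees that the operator norms of $T$ and $TT^*$ are related by the identity
$$\|T\|_{op} = \|TT^*\|_{op}^{1/2},$$
and as $TT^*$ is a self-adjoint, positive semi-definite operator, the operator norm $\|TT^*\|_{op}$ is also the supremum of the quantity
$$\langle TT^* (c_j)_{j=1}^J, (c_j)_{j=1}^J \rangle_{\ell^2(\{1,\dots,J\})} = \sum_{j=1}^J \sum_{j'=1}^J c_j \overline{c_{j'}} \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}$$
where $(c_j)_{j=1}^J$ ranges over unit vectors in $\ell^2(\{1,\dots,J\})$. Putting these facts together, we obtain Proposition 2; furthermore, we see from this analysis that the bound here is essentially optimal if the only information one is allowed to use about $f$ is its $\ell^2(\nu^{-1})$ norm.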
For further discussion of almost orthogonality methods from a harmonic analysis perspective, see Chapter VII of this text of Stein.
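Exercise 4 Under the same hypotheses as Proposition 2, show that
$$\sum_{j=1}^J \Big|\sum_n f(n) \overline{g_j(n)}\Big| \leq \Big(\sum_n \frac{|f(n)|^2}{\nu(n)}\Big)^{1/2} \Big(\sum_{j=1}^J \sum_{j'=1}^J \Big|\sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}\Big|\Big)^{1/2}$$
as well as the variant inequality
$$\Big|\sum_{j=1}^J \sum_n f(n) \overline{g_j(n)}\Big| \leq \Big(\sum_n \frac{|f(n)|^2}{\nu(n)}\Big)^{1/2} \Big|\sum_{j=1}^J \sum_{j'=1}^J \sum_n \nu(n) g_j(n) \overline{g_{j'}(n)}\Big|^{1/2}.$$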
Proposition 2 has many applications in analytic number theory; for instance, we will use it in later notes to control the large values of Dirichlet series such as the Riemann zeta function. One of the key benefits is that it largely eliminates the need to consider further correlations of the function $f$ (other than its self-correlation $\sum_n |f(n)|^2/\nu(n)$ relative to $\nu^{-1}$, which is usually fairly easy to compute or estimate as $\nu$ is usually chosen to be relatively simple); this is particularly useful if $f$ is a function which is significantly more complicated to analyse than the functions $g_j$. Of course, the tradeoff for this is that one now has to deal with the coefficients $c_j$, which if anything are even less understood than $f$, since literally the only thing we know about these coefficients is their square sum $\sum_{j=1}^J |c_j|^2$. However, as long as there is enough almost orthogonality between the $g_j$, one can estimate the $c_j$ by fairly crude estimates (e.g. triangle inequality or Cauchy-Schwarz) and still get reasonably good estimates.
In this set of notes, we will use Proposition 2 to prove some versions of the large sieve inequality, which controls a square-sum of correlations
$$\sum_n f(n) e(-\xi_j n)$$
of an arbitrary finitely supported function $f$ with various additive characters $n \mapsto e(\xi_j n)$ (where $e(x) := e^{2\pi i x}$), or alternatively a square-sum of correlations
$$\sum_n f(n) \overline{\chi_j(n)}$$
of $f$ with various primitive Dirichlet characters $\chi_j$; it turns out that one can prove a (slightly sub-optimal) version of this inequality quite quickly from Proposition 2 if one first prepares the sum by inserting a smooth cutoff with well-behaved Fourier transform. The large sieve inequality has many applications (as the name suggests, it has particular utility within sieve theory). For the purposes of this set of notes, though, the main application we will need it for is the Bombieri-Vinogradov theorem, which in a very rough sense gives a prime number theorem in arithmetic progressions, which, “on average”, is of strength comparable to the results provided by the Generalised Riemann Hypothesis (GRH), but has the great advantage of being unconditional (it does not require any unproven hypotheses such as GRH); it can be viewed as a significant extension of the Siegel-Walfisz theorem from Notes 2. As we shall see in later notes, the Bombieri-Vinogradov theorem is a very useful ingredient in sieve-theoretic problems involving the primes.
There is however one additional important trick, beyond the large sieve, which we will need in order to establish the Bombieri-Vinogradov theorem. As it turns out, after some basic manipulations (and the deployment of some multiplicative number theory, and specifically the Siegel-Walfisz theorem), the task of proving the Bombieri-Vinogradov theorem is reduced to that of getting a good estimate on sums that are roughly of the form
$$\sum_{j=1}^J \Big|\sum_n \Lambda(n) \overline{\chi_j(n)}\Big| \qquad (7)$$
for some primitive Dirichlet characters $\chi_j$. This looks like the type of sum that can be controlled by the large sieve (or by Proposition 2), except that this is an ordinary sum rather than a square sum (i.e., an $\ell^1$ norm rather than an $\ell^2$ norm). One could of course try to control such a sum in terms of the associated square-sum through the Cauchy-Schwarz inequality, but this turns out to be very wasteful (it loses a factor of about $J^{1/2}$). Instead, one should try to exploit the special structure of the von Mangoldt function $\Lambda$, in particular the fact that it is expressible as a Dirichlet convolution $\alpha * \beta$ of two further arithmetic sequences $\alpha, \beta$ (or as a finite linear combination of such Dirichlet convolutions). The reason for introducing this convolution structure is through the basic identity
$$\sum_n \alpha * \beta(n) \overline{\chi_j(n)} = \Big(\sum_n \alpha(n) \overline{\chi_j(n)}\Big) \Big(\sum_n \beta(n) \overline{\chi_j(n)}\Big) \qquad (8)$$
for any finitely supported sequences $\alpha, \beta: \mathbb{N} \to \mathbb{C}$, as can be easily seen by multiplying everything out and using the completely multiplicative nature of $\chi_j$. (This is the multiplicative analogue of the well-known relationship between ordinary convolution and Fourier coefficients.) This factorisation, together with yet another application of the Cauchy-Schwarz inequality, lets one control (7) by square-sums of the sort that can be handled by the large sieve inequality.
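The factorisation of a character-twisted sum of a Dirichlet convolution can be verified directly on small examples. The following sketch is illustrative only: it hard-codes one particular Dirichlet character modulo $5$ (determined by $\chi(2) = i$), represents finitely supported sequences as dictionaries, and checks that $\sum_n (\alpha * \beta)(n) \overline{\chi(n)}$ equals the product of the two twisted sums. The sequences `alpha` and `beta` are arbitrary test data.

```python
def chi(n):
    """A Dirichlet character mod 5: chi(2^k mod 5) = i^k, and chi(n) = 0 if 5 | n."""
    table = {0: 0j, 1: 1 + 0j, 2: 1j, 4: -1 + 0j, 3: -1j}
    return table[n % 5]

def dirichlet_convolution(alpha, beta):
    """(alpha * beta)(n) = sum over factorisations n = d * m of alpha(d) * beta(m)."""
    out = {}
    for d, a in alpha.items():
        for m, b in beta.items():
            out[d * m] = out.get(d * m, 0) + a * b
    return out

def twisted_sum(seq):
    """Return sum_n seq(n) * conj(chi(n)) for a finitely supported sequence."""
    return sum(v * chi(n).conjugate() for n, v in seq.items())

alpha = {1: 2.0, 2: -1.0, 3: 0.5, 7: 3.0}
beta = {1: 1.0, 4: 2.0, 6: -1.0, 11: 1j}

left = twisted_sum(dirichlet_convolution(alpha, beta))
right = twisted_sum(alpha) * twisted_sum(beta)
print(left, right)
```

The check relies only on $\chi(dm) = \chi(d)\chi(m)$, which is why the identity fails for twists that are not completely multiplicative.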
As we have seen in Notes 1, the von Mangoldt function $\Lambda$ does indeed admit several factorisations into Dirichlet convolution type, such as the factorisation $\Lambda = \mu * L$ (with $L(n) := \log n$). One can try directly inserting this factorisation into the above strategy; it almost works, however there turns out to be a problem when considering the contribution of the portion of $\mu$ or $L$ that is supported at very small natural numbers, as the large sieve loses any gain over the trivial bound in such settings. Because of this, there is a need for a more sophisticated decomposition of $\Lambda$ into Dirichlet convolutions $\alpha * \beta$ which are non-degenerate in the sense that $\alpha, \beta$ are supported away from small values. (As a non-example, the trivial factorisation $\Lambda = \Lambda * \delta$ would be a totally inappropriate factorisation for this purpose.) Fortunately, it turns out that through some elementary combinatorial manipulations, some satisfactory decompositions of this type are available, such as the Vaughan identity and the Heath-Brown identity. By using one of these identities we will be able to complete the proof of the Bombieri-Vinogradov theorem. (These identities are also useful for other applications in which one wishes to control correlations between the von Mangoldt function $\Lambda$ and some other sequence; we will see some examples of this in later notes.)
For further reading on these topics, including a significantly larger number of examples of the large sieve inequality, see Chapters 7 and 17 of Iwaniec and Kowalski.
Remark 5 We caution that the presentation given in this set of notes is highly ahistorical; we are using modern streamlined proofs of results that were first obtained by more complicated arguments.
One of the most fundamental principles in Fourier analysis is the uncertainty principle. It does not have a single canonical formulation, but one typical informal description of the principle is that if a function is restricted to a narrow region of physical space, then its Fourier transform must be necessarily “smeared out” over a broad region of frequency space. Some versions of the uncertainty principle are discussed in this previous blog post.
In this post I would like to highlight an instance of the uncertainty principle, due to Hugh Montgomery, which is useful in analytic number theory contexts. Specifically, suppose we are given a complex-valued function $f: \mathbb{Z} \to \mathbb{C}$ on the integers. To avoid irrelevant issues at spatial infinity, we will assume that the support $\{n \in \mathbb{Z}: f(n) \neq 0\}$ of this function is finite (in practice, we will only work with functions that are supported in an interval $[M+1, M+N]$ for some natural numbers $M, N$). Then we can define the Fourier transform $\hat f: \mathbb{R}/\mathbb{Z} \to \mathbb{C}$ by the formula
$$\hat f(\xi) := \sum_{n \in \mathbb{Z}} f(n) e(-n\xi)$$
where $e(x) := e^{2\pi i x}$. (In some literature, the sign in the exponential phase is reversed, but this will make no substantial difference to the arguments below.)
The classical uncertainty principle, in this context, asserts that if $f$ is localised in an interval of length $N$, then $\hat f$ must be “smeared out” at a scale of at least $1/N$ (and essentially constant at scales less than $1/N$). For instance, if $f$ is supported in $[M+1, M+N]$, then we have the Plancherel identity
$$\int_{\mathbb{R}/\mathbb{Z}} |\hat f(\xi)|^2\ d\xi = \|f\|_{\ell^2(\mathbb{Z})}^2 \qquad (1)$$
while from the Cauchy-Schwarz inequality we have
$$|\hat f(\xi)| \leq N^{1/2} \|f\|_{\ell^2(\mathbb{Z})}$$
for each frequency $\xi$, and in particular that
$$\int_I |\hat f(\xi)|^2\ d\xi \leq N |I| \|f\|_{\ell^2(\mathbb{Z})}^2$$
for any arc $I$ in the unit circle (with $|I|$ denoting the length of $I$). In particular, an interval of length significantly less than $1/N$ can only capture a fraction of the $L^2$ energy of the Fourier transform of $f$, which is consistent with the above informal statement of the uncertainty principle.
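Another manifestation of the classical uncertainty principle is the large sieve inequality. A particularly nice formulation of this inequality is due independently to Montgomery and Vaughan and Selberg: if $f$ is supported in $[M+1, M+N]$, and $\xi_1,\dots,\xi_J$ are frequencies in $\mathbb{R}/\mathbb{Z}$ that are $\delta$-separated for some $\delta > 0$, thus $\|\xi_i - \xi_j\|_{\mathbb{R}/\mathbb{Z}} \geq \delta$ for all $1 \leq i < j \leq J$ (where $\|\xi\|_{\mathbb{R}/\mathbb{Z}}$ denotes the distance of $\xi$ to the origin in $\mathbb{R}/\mathbb{Z}$), then
$$\sum_{j=1}^J |\hat f(\xi_j)|^2 \leq \Big(N + \frac{1}{\delta}\Big) \|f\|_{\ell^2(\mathbb{Z})}^2. \qquad (2)$$
The reader is encouraged to see how this inequality is consistent with the Plancherel identity (1) and the intuition that $\hat f$ is essentially constant at scales less than $1/N$. The factor $N + \frac{1}{\delta}$ can in fact be amplified a little bit to $N + \frac{1}{\delta} - 1$, which is essentially optimal, by using a neat dilation trick of Paul Cohen, in which one dilates $[M+1, M+N]$ to $[MK+K, MK+NK]$ (and replaces each frequency $\xi_j$ by their $K^{th}$ roots), and then sending $K \to \infty$ (cf. the tensor product trick); see this survey of Montgomery for details. But we will not need this refinement here.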
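A quick numerical sanity check of the large sieve inequality is easy to set up. The sketch below is illustrative and makes its own choices: it takes the Fourier transform to be $\hat f(\xi) = \sum_n f(n) e(-n\xi)$, stores $f$ as a dictionary supported on $[M+1, M+N]$, computes the separation $\delta$ of a hand-picked list of frequencies, and compares $\sum_j |\hat f(\xi_j)|^2$ with $(N + 1/\delta) \sum_n |f(n)|^2$. All names and test values are hypothetical.

```python
import cmath
import random

def e(x):
    """The standard character e(x) = exp(2 pi i x)."""
    return cmath.exp(2j * cmath.pi * x)

def fourier(f, xi):
    """hat f(xi) = sum_n f(n) e(-n xi), for f given as a dict {n: f(n)}."""
    return sum(v * e(-n * xi) for n, v in f.items())

def circle_distance(x):
    """Distance from x to the nearest integer, i.e. the norm on R/Z."""
    x %= 1.0
    return min(x, 1.0 - x)

def large_sieve_sides(f, freqs, N):
    """Return (sum_j |hat f(xi_j)|^2, (N + 1/delta) * ||f||^2) for at least two frequencies."""
    delta = min(circle_distance(a - b)
                for i, a in enumerate(freqs) for b in freqs[:i])
    lhs = sum(abs(fourier(f, xi)) ** 2 for xi in freqs)
    rhs = (N + 1.0 / delta) * sum(abs(v) ** 2 for v in f.values())
    return lhs, rhs

random.seed(3)
M, N = 10, 30
f = {n: complex(random.gauss(0, 1), random.gauss(0, 1))
     for n in range(M + 1, M + N + 1)}
freqs = [0.03, 0.21, 0.40, 0.58, 0.77, 0.95]   # separation is about 0.08
lhs, rhs = large_sieve_sides(f, freqs, N)
print(lhs, rhs)
```

For random data the left-hand side is usually far below the bound; near-extremal examples need $f$ to correlate with many of the chosen frequencies at once.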
In the above instances of the uncertainty principle, the concept of narrow support in physical space was formalised in the Archimedean sense, using the standard Archimedean metric $d_\infty(n,m) := |n-m|$ on the integers $\mathbb{Z}$ (in particular, the parameter $N$ is essentially the Archimedean diameter of the support of $f$). However, in number theory, the Archimedean metric is not the only metric of importance on the integers; the $p$-adic metrics play an equally important role; indeed, it is common to unify the Archimedean and $p$-adic perspectives together into a unified adelic perspective. In the $p$-adic world, the metric balls are no longer intervals, but are instead residue classes modulo some power of $p$. Intersecting these balls from different $p$-adic metrics together, we obtain residue classes with respect to various moduli $q$ (which may be either prime or composite). As such, another natural manifestation of the concept of “narrow support in physical space” is “vanishes on many residue classes modulo $q$”. This notion of narrowness is particularly common in sieve theory, when one deals with functions supported on thin sets such as the primes, which naturally tend to avoid many residue classes (particularly if one throws away the first few primes).
In this context, the uncertainty principle is this: the more residue classes modulo $q$ that $f$ avoids, the more the Fourier transform $\hat f$ must spread out along multiples of $1/q$. To illustrate a very simple example of this principle, let us take $q = 2$, and suppose that $f$ is supported only on odd numbers (thus it completely avoids the residue class $0 \mod 2$). We write out the formulae for $\hat f(\xi)$ and $\hat f(\xi + 1/2)$:
$$\hat f(\xi) = \sum_n f(n) e(-n\xi)$$
$$\hat f(\xi + 1/2) = \sum_n f(n) e(-n\xi) e(-n/2).$$
If $f$ is supported on the odd numbers, then $e(-n/2)$ is always equal to $-1$ on the support of $f$, and so we have $\hat f(\xi + 1/2) = -\hat f(\xi)$. Thus, whenever $\hat f$ has a significant presence at a frequency $\xi$, it also must have an equally significant presence at the frequency $\xi + 1/2$; there is a spreading out across multiples of $1/2$. Note that one has a similar effect if $f$ was supported on the even integers instead of the odd integers.
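The modulus-two example can be confirmed in a few lines. This is a minimal sketch with arbitrary test data, taking the Fourier transform to be $\hat f(\xi) = \sum_n f(n) e(-n\xi)$: for a function supported on odd integers, shifting the frequency by $1/2$ flips the sign of the transform, while for a function supported on even integers the shift leaves it unchanged.

```python
import cmath

def e(x):
    """The standard character e(x) = exp(2 pi i x)."""
    return cmath.exp(2j * cmath.pi * x)

def fourier(f, xi):
    """hat f(xi) = sum_n f(n) e(-n xi), for f given as a dict {n: f(n)}."""
    return sum(v * e(-n * xi) for n, v in f.items())

f_odd = {1: 2.0, 3: -1.0 + 0.5j, 7: 4.0, 11: 1j}   # supported on odd n
f_even = {2: 1.0, 6: -3.0, 10: 0.5j}                # supported on even n
xi = 0.137                                          # arbitrary test frequency

print(fourier(f_odd, xi), fourier(f_odd, xi + 0.5))
print(fourier(f_even, xi), fourier(f_even, xi + 0.5))
```

In both cases the magnitude at $\xi + 1/2$ equals the magnitude at $\xi$, which is the spreading along multiples of $1/2$ described above.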
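A little more generally, suppose now that $f$ avoids a single residue class modulo a prime $p$; for sake of argument let us say that it avoids the zero residue class $0 \mod p$, although the situation for the other residue classes is similar. For any frequency $\xi$ and any $j = 0,\dots,p-1$, one has
$$\hat f(\xi + j/p) = \sum_n f(n) e(-n\xi) e(-jn/p).$$
From basic Fourier analysis, we know that the phases $e(-jn/p)$ sum to zero as $j$ ranges from $0$ to $p-1$ whenever $n$ is not a multiple of $p$. We thus have
$$\sum_{j=0}^{p-1} \hat f(\xi + j/p) = 0.$$
In particular, if $\hat f(\xi)$ is large, then one of the other $\hat f(\xi + j/p)$ has to be somewhat large as well; using the Cauchy-Schwarz inequality, we can quantify this assertion in an $\ell^2$ sense via the inequality
$$\sum_{j=1}^{p-1} |\hat f(\xi + j/p)|^2 \geq \frac{1}{p-1} |\hat f(\xi)|^2. \qquad (3)$$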
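Let us continue this analysis a bit further. Now suppose that $f$ avoids $\omega(p)$ residue classes modulo a prime $p$, for some $0 \leq \omega(p) < p$. (We exclude the case $\omega(p) = p$ as it clearly degenerates by forcing $f$ to be identically zero.) Let $\nu(n)$ be the function that equals $1$ on these residue classes and zero away from these residue classes, then
$$\sum_n f(n) e(-n\xi) \nu(n) = 0.$$
Using the periodic Fourier transform, we can write
$$\nu(n) = \sum_{j=0}^{p-1} c(j) e(-jn/p)$$
for some coefficients $c(j)$, thus
$$\sum_{j=0}^{p-1} \hat f(\xi + j/p) c(j) = 0.$$
Some Fourier-analytic computations reveal that
$$c(0) = \frac{\omega(p)}{p}$$
and
$$\sum_{j=0}^{p-1} |c(j)|^2 = \frac{\omega(p)}{p}$$
and so after some routine algebra and the Cauchy-Schwarz inequality, we obtain a generalisation of (3):
$$\sum_{j=1}^{p-1} |\hat f(\xi + j/p)|^2 \geq \frac{\omega(p)}{p - \omega(p)} |\hat f(\xi)|^2.$$
Thus we see that the more residue classes mod $p$ we exclude, the more Fourier energy has to disperse along multiples of $1/p$. It is also instructive to consider the extreme case $\omega(p) = p-1$, in which $f$ is supported on just a single residue class $a \mod p$; in this case, one clearly has $\hat f(\xi + j/p) = e(-aj/p) \hat f(\xi)$, and so $\hat f$ spreads its energy completely evenly along multiples of $1/p$.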
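The dispersion of Fourier energy along multiples of $1/p$ can be tested numerically. The sketch below is an illustration under explicit assumptions: it writes $\omega$ for the number of residue classes mod $p$ that $f$ avoids, takes $\hat f(\xi) = \sum_n f(n) e(-n\xi)$, and compares the ratio $\sum_{j=1}^{p-1} |\hat f(\xi + j/p)|^2 / |\hat f(\xi)|^2$ with the lower bound $\omega/(p - \omega)$ that the Cauchy-Schwarz computation gives. The function `spread_ratio` and the test data are hypothetical.

```python
import cmath
import random

def e(x):
    """The standard character e(x) = exp(2 pi i x)."""
    return cmath.exp(2j * cmath.pi * x)

def fourier(f, xi):
    """hat f(xi) = sum_n f(n) e(-n xi), for f given as a dict {n: f(n)}."""
    return sum(v * e(-n * xi) for n, v in f.items())

def spread_ratio(f, p, xi):
    """(sum_{j=1}^{p-1} |hat f(xi + j/p)|^2) / |hat f(xi)|^2; needs hat f(xi) != 0."""
    shifted = sum(abs(fourier(f, xi + j / p)) ** 2 for j in range(1, p))
    return shifted / abs(fourier(f, xi)) ** 2

random.seed(1)
p, avoided = 7, {0, 2, 3}                      # f avoids omega = 3 classes mod 7
omega = len(avoided)
f = {n: complex(random.gauss(0, 1), random.gauss(0, 1))
     for n in range(1, 60) if n % p not in avoided}
xi = 0.3
print(spread_ratio(f, p, xi), omega / (p - omega))

# Extreme case: support in the single class 3 mod 7, so omega = p - 1.
g = {n: 1.0 for n in range(3, 60, 7)}
print(spread_ratio(g, 7, 0.01))                # close to (p - 1) / 1 = 6
```

In the extreme case every shifted value has the same magnitude as $\hat f(\xi)$, so the ratio is exactly $p - 1$ up to rounding, matching the bound with equality.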
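In 1968, Montgomery observed the following useful generalisation of the above calculation to arbitrary modulus:
Proposition 1 (Montgomery’s uncertainty principle) Let $f: \mathbb{Z} \to \mathbb{C}$ be a finitely supported function which, for each prime $p$, avoids $\omega(p)$ residue classes modulo $p$ for some $0 \leq \omega(p) < p$. Then for each natural number $q$, one has
$$\sum_{1 \leq a \leq q: (a,q)=1} \Big|\hat f\Big(\xi + \frac{a}{q}\Big)\Big|^2 \geq h(q) |\hat f(\xi)|^2$$
where $h(q)$ is the quantity
$$h(q) := \mu(q)^2 \prod_{p | q} \frac{\omega(p)}{p - \omega(p)} \qquad (4)$$
where $\mu$ is the Möbius function.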
We give a proof of this proposition below the fold.
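Following the “adelic” philosophy, it is natural to combine this uncertainty principle with the large sieve inequality to take simultaneous advantage of localisation both in the Archimedean sense and in the $p$-adic senses. This leads to the following corollary:
Corollary 2 (Arithmetic large sieve inequality) Let $f: \mathbb{Z} \to \mathbb{C}$ be a function supported on an interval $[M+1, M+N]$ which, for each prime $p$, avoids $\omega(p)$ residue classes modulo $p$ for some $0 \leq \omega(p) < p$. Let $\xi_1,\dots,\xi_J \in \mathbb{R}/\mathbb{Z}$, and let $\mathcal{Q}$ be a finite set of natural numbers. Suppose that the frequencies $\xi_j + \frac{a}{q}$ with $1 \leq j \leq J$, $q \in \mathcal{Q}$, and $(a,q) = 1$ are $\delta$-separated. Then one has
$$\sum_{j=1}^J |\hat f(\xi_j)|^2 \leq \frac{N + \frac{1}{\delta}}{\sum_{q \in \mathcal{Q}} h(q)} \|f\|_{\ell^2(\mathbb{Z})}^2$$
where $h(q)$ was defined in (4).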
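Indeed, from the large sieve inequality one has
$$\sum_{j=1}^J \sum_{q \in \mathcal{Q}} \sum_{1 \leq a \leq q: (a,q)=1} \Big|\hat f\Big(\xi_j + \frac{a}{q}\Big)\Big|^2 \leq \Big(N + \frac{1}{\delta}\Big) \|f\|_{\ell^2(\mathbb{Z})}^2$$
while from Proposition 1 one has
$$\sum_{q \in \mathcal{Q}} \sum_{1 \leq a \leq q: (a,q)=1} \Big|\hat f\Big(\xi_j + \frac{a}{q}\Big)\Big|^2 \geq \Big(\sum_{q \in \mathcal{Q}} h(q)\Big) |\hat f(\xi_j)|^2$$
whence the claim.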
There is a great deal of flexibility in the above inequality, due to the ability to select the set $\mathcal{Q}$, the frequencies $\xi_1,\dots,\xi_J$, the omitted classes $\omega(p)$, and the separation parameter $\delta$. Here is a typical application concerning the original motivation for the large sieve inequality, namely in bounding the size of sets which avoid many residue classes:
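Corollary 3 (Large sieve) Let $A$ be a set of integers contained in $[M+1, M+N]$ which avoids $\omega(p)$ residue classes modulo $p$ for each prime $p$, and let $R > 0$. Then
$$|A| \leq \frac{N + R^2}{G(R)}$$
where
$$G(R) := \sum_{q \leq R} h(q).$$
Proof: We apply Corollary 2 with $f = 1_A$, $J = 1$, $\delta = 1/R^2$, $\xi_1 = 0$, and $\mathcal{Q} := \{q: q \leq R\}$. The key point is that the Farey sequence of fractions $a/q$ with $q \leq R$ and $(a,q) = 1$ is $1/R^2$-separated, since
$$\Big\|\frac{a}{q} - \frac{a'}{q'}\Big\|_{\mathbb{R}/\mathbb{Z}} \geq \frac{1}{qq'} \geq \frac{1}{R^2}$$
whenever $a/q, a'/q'$ are distinct fractions in this sequence.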
If, for instance, $A$ is the set of all primes in $[M+1, M+N]$ larger than $R$, then one can set $\omega(p) = 1$ for all $p \leq R$, which makes $h(q) = \frac{\mu^2(q)}{\phi(q)}$, where $\phi$ is the Euler totient function. It is a classical estimate that
$$G(R) = \sum_{q \leq R} \frac{\mu^2(q)}{\phi(q)} = \log R + O(1).$$
Using this fact and optimising in $R$, we obtain (a special case of) the Brun-Titchmarsh inequality
$$\pi(M+N) - \pi(M) \leq (2 + o_{N \to \infty}(1)) \frac{N}{\log N}$$
where $\pi(x)$ is the prime counting function; a variant of the same argument gives the more general Brun-Titchmarsh inequality
$$\pi(M+N; a, q) - \pi(M; a, q) \leq (2 + o_{N \to \infty; q}(1)) \frac{N}{\phi(q) \log N}$$
for any primitive residue class $a \mod q$, where $\pi(M; a, q)$ is the number of primes less than or equal to $M$ that are congruent to $a \mod q$. By performing a more careful optimisation using a slightly sharper version of the large sieve inequality (2) that exploits the irregular spacing of the Farey sequence, Montgomery and Vaughan were able to delete the error term in the Brun-Titchmarsh inequality, thus establishing the very nice inequality
$$\pi(M+N) - \pi(M) \leq 2 \frac{N}{\log N}$$
for any natural numbers $M, N$ with $N > 1$. This is a particularly useful inequality in non-asymptotic analytic number theory (when one wishes to study number theory at explicit orders of magnitude, rather than the number theory of sufficiently large numbers), due to the absence of asymptotic notation.
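Because the Montgomery-Vaughan form of the Brun-Titchmarsh inequality, $\pi(M+N) - \pi(M) \leq 2N/\log N$ for natural numbers $M, N$ with $N > 1$, contains no asymptotic notation, it can be tested directly on small ranges. The following sketch (with hypothetical helper names, and ranges chosen only to keep the run short) counts primes in $[M+1, M+N]$ with a sieve of Eratosthenes and reports the worst ratio of the count to the bound.

```python
import math

def prime_flags(limit):
    """Sieve of Eratosthenes: flags[n] is True exactly when n is prime, 0 <= n <= limit."""
    flags = [True] * (limit + 1)
    flags[0:2] = [False, False]
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            for m in range(p * p, limit + 1, p):
                flags[m] = False
    return flags

def primes_in_interval(flags, M, N):
    """Number of primes in the interval [M+1, M+N]."""
    return sum(flags[M + 1:M + N + 1])

def worst_ratio(max_M, max_N):
    """Largest value of (pi(M+N) - pi(M)) * log(N) / (2N) over 1 <= M <= max_M, 2 <= N <= max_N."""
    flags = prime_flags(max_M + max_N)
    return max(primes_in_interval(flags, M, N) * math.log(N) / (2 * N)
               for M in range(1, max_M + 1) for N in range(2, max_N + 1))

print(worst_ratio(200, 200))
```

A printed value of at most $1$ is what the inequality predicts; a small search like this is only a sanity check on the statement, not evidence for it beyond the range examined.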
I recently realised that Corollary 2 also establishes a stronger version of the “restriction theorem for the Selberg sieve” that Ben Green and I proved some years ago (indeed, one can view Corollary 2 as a “restriction theorem for the large sieve”). I’m placing the details below the fold.