- “A sharp square function estimate for the cone in $\mathbb{R}^3$”, by Larry Guth, Hong Wang, and Ruixiang Zhang. This paper establishes an optimal (up to epsilon losses) square function estimate for the three-dimensional light cone that was essentially conjectured by Mockenhaupt, Seeger, and Sogge. This estimate has a number of other consequences, including Sogge’s local smoothing conjecture for the wave equation in two spatial dimensions, which in turn implies the (already known) Bochner-Riesz, restriction, and Kakeya conjectures in two dimensions. Interestingly, modern techniques such as polynomial partitioning and decoupling estimates are not used in this argument; instead, the authors mostly rely on an induction on scales argument and Kakeya type estimates. Many previous authors (including myself) were able to get weaker estimates of this type by an induction on scales method, but there were always significant inefficiencies in doing so; in particular, knowing the sharp square function estimate at smaller scales did not imply the sharp square function estimate at the given larger scale. The authors here get around this issue by finding an even stronger estimate that implies the square function estimate, but behaves significantly better with respect to induction on scales.
- “On the Chowla and twin primes conjectures over $\mathbb{F}_q[T]$”, by Will Sawin and Mark Shusterman. This paper resolves a number of well known open conjectures in analytic number theory, such as the Chowla conjecture and the twin prime conjecture (in the strong form conjectured by Hardy and Littlewood), in the case of function fields $\mathbb{F}_q[T]$ where the order $q = p^j$ of the field is a prime power which is fixed (in contrast to a number of existing results in the “large $q$” limit) but has a large exponent $j$. The techniques here are orthogonal to those used in recent progress towards the Chowla conjecture over the integers (e.g., in this previous paper of mine); the starting point is an algebraic observation that in certain function fields, the Möbius function behaves like a quadratic Dirichlet character along certain arithmetic progressions. In principle, this reduces problems such as Chowla’s conjecture to problems about estimating sums of Dirichlet characters, for which more is known; but the task is still far from trivial.
- “Bounds for sets with no polynomial progressions”, by Sarah Peluse. This paper can be viewed as part of a larger project to obtain quantitative density Ramsey theorems of Szemeredi type. For instance, Gowers famously established a relatively good quantitative bound for Szemeredi’s theorem that all dense subsets of integers contain arbitrarily long arithmetic progressions $a, a+r, \dots, a+(k-1)r$. The corresponding question for polynomial progressions $a+P_1(r), \dots, a+P_k(r)$ is considered more difficult for a number of reasons. One of them is that dilation invariance is lost; a dilation of an arithmetic progression is again an arithmetic progression, but a dilation of a polynomial progression will in general not be a polynomial progression with the same polynomials $P_1, \dots, P_k$. Another issue is that the ranges of the two parameters $a, r$ are now at different scales. Peluse gets around these difficulties in the case when all the polynomials have distinct degrees, which is in some sense the opposite case to that considered by Gowers (in particular, thanks to a degree lowering argument that is available in this case, she avoids the need to obtain quantitative inverse theorems for high order Gowers norms; such inverse theorems were recently obtained in this integer setting by Manners, but with bounds that are probably not strong enough for the bounds in Peluse’s results). To resolve the first difficulty one has to make all the estimates rather uniform in the coefficients of the polynomials $P_j$, so that one can still run a density increment argument efficiently. To resolve the second difficulty one needs to find a quantitative concatenation theorem for Gowers uniformity norms. Many of these ideas were developed in previous papers of Peluse and Peluse-Prendiville in simpler settings.
- “On blow up for the energy super critical defocusing non linear Schrödinger equations”, by Frank Merle, Pierre Raphael, Igor Rodnianski, and Jeremie Szeftel. This paper (when combined with two companion papers) resolves a long-standing problem as to whether finite time blowup occurs for the defocusing supercritical nonlinear Schrödinger equation (at least in certain dimensions and nonlinearities). I had a previous paper establishing a result like this if one “cheated” by replacing the nonlinear Schrödinger equation by a system of such equations, but remarkably they are able to tackle the original equation itself without any such cheating. Given the very analogous situation with Navier-Stokes, where again one can create finite time blowup by “cheating” and modifying the equation, it does raise hope that finite time blowup for the incompressible Navier-Stokes and Euler equations can be established… In fact the connection may not just be at the level of analogy; a surprising key ingredient in the proofs here is the observation that a certain blowup ansatz for the nonlinear Schrödinger equation is governed by solutions to the (compressible) Euler equation, and finite time blowup examples for the latter can be used to construct finite time blowup examples for the former.

(From a differential geometry viewpoint, it would be more accurate (especially in other dimensions than three) to define the vorticity as the exterior derivative of the musical isomorphism of the Euclidean metric applied to the velocity field $u$; see these previous lecture notes. However, we will not need this geometric formalism in this post.)

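Assuming suitable regularity and decay hypotheses on the velocity field $u$, it is possible to recover the velocity from the vorticity $\omega = \nabla \times u$ as follows. From the general vector identity $\nabla \times (\nabla \times X) = \nabla(\nabla \cdot X) - \Delta X$ applied to the (divergence-free) velocity field $u$, we see that

$$\nabla \times \omega = -\Delta u$$

and thus (by the commutativity of all the differential operators involved)

$$u = \nabla \times (-\Delta)^{-1} \omega.$$

Using the Newton potential formula

$$(-\Delta)^{-1} \omega(x) = \frac{1}{4\pi} \int_{\mathbb{R}^3} \frac{\omega(y)}{|x-y|}\, dy$$

and formally differentiating under the integral sign, we obtain the Biot-Savart law

$$u(x) = \frac{1}{4\pi} \int_{\mathbb{R}^3} \frac{\omega(y) \times (x-y)}{|x-y|^3}\, dy. \qquad (1)$$

This law is of fundamental importance in the study of incompressible fluid equations, such as the Euler equations

$$\partial_t u + (u \cdot \nabla) u = -\nabla p, \qquad \nabla \cdot u = 0,$$

since on applying the curl operator one obtains the vorticity equation

$$\partial_t \omega + (u \cdot \nabla) \omega = (\omega \cdot \nabla) u \qquad (2)$$

and then by substituting (1) one gets an autonomous equation for the vorticity field $\omega$. Unfortunately, this equation is non-local, due to the integration present in (1).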

In a recent work, it was observed by Elgindi that in a certain regime, the Biot-Savart law can be approximated by a more “low rank” law, which makes the non-local effects significantly simpler in nature. This simplification was carried out in spherical coordinates, and hinged on a study of the invertibility properties of a certain second order linear differential operator in the latitude variable; however, in this post I would like to observe that the approximation can also be seen directly in Cartesian coordinates from the classical Biot-Savart law (1). As a consequence one can also begin Elgindi’s analysis in constructing somewhat regular solutions to the Euler equations that exhibit self-similar blowup in finite time, though I have not attempted to execute the entirety of the analysis in this setting.

Elgindi’s approximation applies under the following hypotheses:

- (i) (Axial symmetry without swirl) The velocity field is assumed to take the form
for some functions of the cylindrical radial variable and the vertical coordinate . As a consequence, the vorticity field takes the form

- (ii) (Odd symmetry) We assume that and , so that .

A model example of a divergence-free vector field obeying these properties (but without good decay at infinity) is the linear vector field

which is of the form (3) with and . The associated vorticity vanishes.
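
As a quick sanity check on the model example, the following Python sketch uses central finite differences to confirm numerically that a linear field of this type has vanishing divergence and vanishing curl. It assumes that the field (5) is $x \mapsto (x_1, x_2, -2x_3)$; the displayed formula is not reproduced above, so this form, the helper names `linear_field`, `partial`, `divergence` and `curl`, and the step size `h` are all assumptions made purely for illustration.

```python
def linear_field(x):
    """Assumed form of the model field (5): (x1, x2, -2*x3)."""
    x1, x2, x3 = x
    return (x1, x2, -2.0 * x3)


def partial(field, x, component, axis, h=1e-5):
    """Central finite-difference approximation to d(field[component]) / d(x[axis])."""
    forward = list(x)
    backward = list(x)
    forward[axis] += h
    backward[axis] -= h
    return (field(forward)[component] - field(backward)[component]) / (2.0 * h)


def divergence(field, x):
    """Approximate divergence: sum of the diagonal partial derivatives."""
    return sum(partial(field, x, i, i) for i in range(3))


def curl(field, x):
    """Approximate curl (the vorticity of the field) at the point x."""
    return (
        partial(field, x, 2, 1) - partial(field, x, 1, 2),
        partial(field, x, 0, 2) - partial(field, x, 2, 0),
        partial(field, x, 1, 0) - partial(field, x, 0, 1),
    )


if __name__ == "__main__":
    point = (0.3, -1.2, 0.7)
    print("divergence:", divergence(linear_field, point))
    print("curl:", curl(linear_field, point))
```

Since the field is linear, the central differences are exact up to rounding error, so both printed quantities should be zero to within floating-point precision.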

We can now give an illustration of Elgindi’s approximation:

Proposition 1 (Elgindi’s approximation) Under the above hypotheses (and assuming suitable regularity and decay), we have the pointwise bounds

for any , where is the vector field (5), and is the scalar function

Thus under the hypotheses (i), (ii), and assuming that is slowly varying, we expect to behave like the linear vector field modulated by a radial scalar function. In applications one needs to control the error in various function spaces instead of pointwise, and with similarly controlled in other function space norms than the norm, but this proposition already gives a flavour of the approximation. If one uses spherical coordinates

then we have (using the spherical change of variables formula and the odd nature of )

where

is the operator introduced in Elgindi’s paper.

*Proof:* By a limiting argument we may assume that is non-zero, and we may normalise . From the triangle inequality we have

and hence by (1)

In the regime we may perform the Taylor expansion

Since

we see from the triangle inequality that the error term contributes to . We thus have

where is the constant term

and are the linear term

By the hypotheses (i), (ii), we have the symmetries

The even symmetry (8) ensures that the integrand in is odd, so vanishes. The symmetry (6) or (7) similarly ensures that , so vanishes. Since , we conclude that

Using (4), the right-hand side is

where . Because of the odd nature of , only those terms with one factor of give a non-vanishing contribution to the integral. Using the rotation symmetry we also see that any term with a factor of also vanishes. We can thus simplify the above expression as

Using the rotation symmetry again, we see that the term in the first component can be replaced by or by , and similarly for the term in the second component. Thus the above expression is

giving the claim.

Example 2 Consider the divergence-free vector field , where the vector potential takes the form

for some bump function supported in . We can then calculate

and

In particular the hypotheses (i), (ii) are satisfied with

One can then calculate

If we take the specific choice

where is a fixed bump function supported in some interval and is a small parameter (so that is spread out over the range ), then we see that

(with implied constants allowed to depend on ),

and

which is completely consistent with Proposition 1.

One can use this approximation to extract a plausible ansatz for a self-similar blowup to the Euler equations. We let be a small parameter and let be a time-dependent vorticity field obeying (i), (ii) of the form

where and is a smooth field to be chosen later. Admittedly the signum function is not smooth at , but let us ignore this issue for now (to rigorously make an ansatz one will have to smooth out this function a little bit; Elgindi uses the choice , where ). With this ansatz one may compute

By Proposition 1, we thus expect to have the approximation

We insert this into the vorticity equation (2). The transport term will be expected to be negligible because , and hence , is slowly varying (the discontinuity of will not be encountered because the vector field is parallel to this singularity). The modulating function is similarly slowly varying, so derivatives falling on this function should be lower order. Neglecting such terms, we arrive at the approximation

and so in the limit we expect to obtain a simple model equation for the evolution of the vorticity envelope :

If we write for the logarithmic primitive of , then we have and hence

which integrates to the Riccati equation

which can be explicitly solved as

where is any function of that one pleases. (In Elgindi’s work a time dilation is used to remove the unsightly factor of appearing here in the denominator.) If for instance we set , we obtain the self-similar solution

and then on applying

Thus, we expect to be able to construct a self-similar blowup to the Euler equations with a vorticity field approximately behaving like

and velocity field behaving like

In particular, would be expected to be of regularity (and smooth away from the origin), and blows up in (say) norm at time , and one has the self-similarity

and

A self-similar solution of this approximate shape is in fact constructed rigorously in Elgindi’s paper (using spherical coordinates instead of the Cartesian approach adopted here), using a nonlinear stability analysis of the above ansatz. It seems plausible that one could also carry out this stability analysis using this Cartesian coordinate approach, although I have not tried to do this in detail.

- The Möbius function $\mu$;
- The Liouville function $\lambda$;
- “Archimedean” characters $n \mapsto n^{it}$ for a real number $t$ (which I call Archimedean because they are pullbacks of a Fourier character on the multiplicative group $\mathbb{R}^+$, which has the Archimedean property);
- Dirichlet characters (or “non-Archimedean” characters) $\chi$ (which are essentially pullbacks of Fourier characters on a multiplicative cyclic group with the discrete (non-Archimedean) metric);
- Hybrid characters $n \mapsto \chi(n) n^{it}$.

The space of $1$-bounded multiplicative functions is also closed under multiplication and complex conjugation.

Given a multiplicative function $f$, we are often interested in the asymptotics of long averages such as

$$\frac{1}{x} \sum_{n \leq x} f(n)$$

for large values of $x$, as well as short sums

$$\frac{1}{H} \sum_{x \leq n \leq x+H} f(n)$$

where $H$ and $x$ are both large, but $H$ is significantly smaller than $x$. (Throughout these notes we will try to normalise most of the sums and integrals appearing here as averages that are trivially bounded by $O(1)$; note that other normalisations are preferred in some of the literature cited here.) For instance, as we established in Theorem 58 of Notes 1, the prime number theorem is equivalent to the assertion that

$$\frac{1}{x} \sum_{n \leq x} \mu(n) = o(1) \qquad (1)$$

as $x \to \infty$. The Liouville function behaves almost identically to the Möbius function, in that estimates for one function almost always imply analogous estimates for the other:

Exercise 1 Without using the prime number theorem, show that (1) is also equivalent to

$$\frac{1}{x} \sum_{n \leq x} \lambda(n) = o(1) \qquad (2)$$

as $x \to \infty$.

Henceforth we shall focus our discussion more on the Liouville function, and turn our attention to averages on shorter intervals. From (2) one has

$$\frac{1}{H} \sum_{x \leq n \leq x+H} \lambda(n) = o(1) \qquad (3)$$

as $x \to \infty$ if $H = H(x)$ is such that $H \geq \varepsilon x$ for some fixed $\varepsilon > 0$. However it is significantly more difficult to understand what happens when $H$ grows much slower than this. By using the techniques based on zero density estimates discussed in Notes 6, it was shown by Motohashi and by Ramachandra that one can also establish (3) for $H \geq x^{7/12+\varepsilon}$. On the Riemann Hypothesis, Maier and Montgomery lowered the threshold to $H \geq x^{1/2} \log^C x$ for an absolute constant $C$ (the bound $H \geq x^{1/2+\varepsilon}$ is more classical, following from Exercise 33 of Notes 2). On the other hand, the randomness heuristics from Supplement 4 suggest that $H$ should be able to be taken as small as $x^\varepsilon$, and perhaps even $\log^{1+\varepsilon} x$ if one is particularly optimistic about the accuracy of these probabilistic models. However, the Chowla conjecture (mentioned for instance in Supplement 4) predicts that $H$ cannot be taken arbitrarily slowly growing in $x$, due to the conjectured existence of arbitrarily long strings of consecutive numbers where the Liouville function does not change sign (and in fact one can already show from the known partial results towards the Chowla conjecture that (3) fails for some sequence $x = x_n \to \infty$ and some sufficiently slowly growing $H = H(x)$, by modifying the arguments in these papers of mine).
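
To get a feel for the long and short averages discussed above, one can tabulate the Liouville function with a sieve and compute both kinds of average directly. The following is only a minimal numerical sketch: the function names `liouville_sieve`, `long_average` and `short_average` are hypothetical, the short average is taken over $x \leq n \leq x+H$ and normalised by $H$ as an assumed convention, and of course no finite computation says anything rigorous about the asymptotic statements.

```python
def liouville_sieve(limit):
    """Return a list lam with lam[n] = lambda(n) for 1 <= n <= limit (lam[0] is unused)."""
    # smallest_prime_factor[n] will hold the least prime dividing n.
    smallest_prime_factor = list(range(limit + 1))
    p = 2
    while p * p <= limit:
        if smallest_prime_factor[p] == p:  # p is prime
            for m in range(p * p, limit + 1, p):
                if smallest_prime_factor[m] == m:
                    smallest_prime_factor[m] = p
        p += 1
    lam = [0] * (limit + 1)
    if limit >= 1:
        lam[1] = 1
    # lambda is completely multiplicative with lambda(p) = -1,
    # so lambda(n) = -lambda(n / p) for any prime p dividing n.
    for n in range(2, limit + 1):
        lam[n] = -lam[n // smallest_prime_factor[n]]
    return lam


def long_average(lam, x):
    """Average of lambda(n) over 1 <= n <= x."""
    return sum(lam[1:x + 1]) / x


def short_average(lam, x, H):
    """Sum of lambda(n) over x <= n <= x + H, normalised by H."""
    return sum(lam[x:x + H + 1]) / H


if __name__ == "__main__":
    x, H = 10**5, 10**3
    lam = liouville_sieve(x + H)
    print("long average :", long_average(lam, x))
    print("short average:", short_average(lam, x, H))
```

Varying `H` relative to `x` in this sketch gives a rough impression of how much noisier the short averages are than the long ones.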

The situation is better when one asks to understand the mean value on *almost all* short intervals, rather than all intervals. There are several equivalent ways to formulate this question:

Exercise 2 Let $H$ be a function of $X$ such that $H \leq X$ and $H \to \infty$ as $X \to \infty$. Let $f$ be a $1$-bounded function. Show that the following assertions are equivalent:

As it turns out the second moment formulation in (iii) will be the most convenient for us to work with in this set of notes, as it is well suited to Fourier-analytic techniques (and in particular the Plancherel theorem).

Using zero density methods, for instance, it was shown by Ramachandra that

whenever and . With this quality of bound (saving arbitrary powers of over the trivial bound of ), this is still the lowest value of one can reach unconditionally. However, in a striking recent breakthrough, it was shown by Matomaki and Radziwill that as long as one is willing to settle for weaker bounds (saving a small power of or , or just a qualitative decay of ), one can obtain non-trivial estimates on far shorter intervals. For instance, they show

Theorem 3 (Matomaki-Radziwill theorem for Liouville) For any , one has

for some absolute constant .

In fact they prove a slightly more precise result: see Theorem 1 of that paper. In particular, they obtain the asymptotic (4) for *any* function that goes to infinity as , no matter how slowly! This ability to let grow slowly with is important for several applications; for instance, in order to combine this type of result with the entropy decrement methods from Notes 9, it is essential that be allowed to grow more slowly than . See also this survey of Soundararajan for further discussion.

Exercise 4 In this exercise you may use Theorem 3 freely.

- (i) Establish the lower bound

for some absolute constant and all sufficiently large . (Hint: if this bound failed, then would hold for almost all ; use this to create many intervals for which is extremely large.)
- (ii) Show that Theorem 3 also holds with replaced by , where is the principal character of period . (Use the fact that for all .) Use this to establish the corresponding upper bound

to (i).

(There is a curious asymmetry to the difficulty level of these bounds; the upper bound in (ii) was established much earlier by Harman, Pintz, and Wolke, but the lower bound in (i) was only established in the Matomaki-Radziwill paper.)

The techniques discussed previously were highly complex-analytic in nature, relying in particular on the fact that functions such as $\mu$ or $\lambda$ have Dirichlet series $\sum_n \frac{\mu(n)}{n^s} = \frac{1}{\zeta(s)}$, $\sum_n \frac{\lambda(n)}{n^s} = \frac{\zeta(2s)}{\zeta(s)}$ that extend meromorphically into the critical strip. In contrast, the Matomaki-Radziwill theorem does *not* rely on such meromorphic continuations, and in fact holds for more general classes of $1$-bounded multiplicative functions $f$, for which one typically does not expect any meromorphic continuation into the strip. Instead, one can view the Matomaki-Radziwill theory as following the philosophy of a slightly different approach to multiplicative number theory, namely the *pretentious multiplicative number theory* of Granville and Soundararajan (as presented for instance in their draft monograph). A basic notion here is the *pretentious distance* between two $1$-bounded multiplicative functions $f, g$ (at a given scale $X$), which informally measures the extent to which $f$ “pretends” to be like $g$ (or vice versa). The precise definition is

Definition 5 (Pretentious distance) Given two $1$-bounded multiplicative functions $f, g$, and a threshold $X > 0$, the pretentious distance $\mathbb{D}(f,g;X)$ between $f$ and $g$ up to scale $X$ is given by the formula

$$\mathbb{D}(f,g;X) := \left( \sum_{p \leq X} \frac{1 - \mathrm{Re}(f(p) \overline{g(p)})}{p} \right)^{1/2}.$$

Note that one can also define an infinite version $\mathbb{D}(f,g;\infty)$ of this distance by removing the constraint $p \leq X$, though in such cases the pretentious distance may then be infinite. The pretentious distance is not quite a metric (because $\mathbb{D}(f,f;X)$ can be non-zero, and furthermore $\mathbb{D}(f,g;X)$ can vanish without $f,g$ being equal), but it is still quite close to behaving like a metric; in particular it obeys the triangle inequality (see Exercise 16 below). The philosophy of pretentious multiplicative number theory is that two $1$-bounded multiplicative functions $f,g$ will exhibit similar behaviour at scale $X$ if their pretentious distance $\mathbb{D}(f,g;X)$ is bounded, but will become uncorrelated from each other if this distance becomes large. A simple example of this philosophy is given by the following “weak Halasz theorem”, proven in Section 2:
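
The pretentious distance is easy to experiment with numerically, since it only depends on the values of the two functions at primes. The sketch below assumes the standard Granville-Soundararajan form $\mathbb{D}(f,g;X)^2 = \sum_{p \leq X} (1 - \mathrm{Re}\, f(p) \overline{g(p)})/p$; the helper names (`primes_up_to`, `pretentious_distance`, `archimedean`, and so on) are hypothetical, and each function is represented simply by a Python callable giving its value at a prime.

```python
import cmath
import math


def primes_up_to(limit):
    """Sieve of Eratosthenes: list of all primes p <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False] * min(2, limit + 1)
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [p for p in range(2, limit + 1) if is_prime[p]]


def pretentious_distance(f, g, X):
    """D(f, g; X), where f and g are callables giving values at primes."""
    total = 0.0
    for p in primes_up_to(X):
        total += (1.0 - (f(p) * complex(g(p)).conjugate()).real) / p
    return math.sqrt(max(total, 0.0))  # clamp tiny negative rounding errors


def one(p):
    """The constant function 1."""
    return 1


def liouville_at_prime(p):
    """The Liouville (or Mobius) function at a prime is -1."""
    return -1


def archimedean(t):
    """The Archimedean character n -> n^{it}, evaluated at primes."""
    return lambda p: cmath.exp(1j * t * math.log(p))


if __name__ == "__main__":
    X = 10**5
    print("D(lambda, 1)      :", pretentious_distance(liouville_at_prime, one, X))
    print("D(lambda, n^{i})  :", pretentious_distance(liouville_at_prime, archimedean(1.0), X))
    print("D(1, n^{i})       :", pretentious_distance(one, archimedean(1.0), X))
```

For the Liouville function against the constant function $1$ every summand equals $2/p$, so the squared distance is exactly twice the sum of reciprocals of the primes up to $X$, which grows like $2 \log\log X$.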

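Proposition 6 (Logarithmically averaged version of Halasz) Let $X$ be sufficiently large. Then for any $1$-bounded multiplicative functions $f$, one has

$$\frac{1}{\log X} \sum_{n \leq X} \frac{f(n)}{n} \ll \exp\left( - c\, \mathbb{D}(f, 1; X)^2 \right)$$

for an absolute constant $c > 0$.

In particular, if $f$ does not pretend to be $1$, then the logarithmic average $\frac{1}{\log X} \sum_{n \leq X} \frac{f(n)}{n}$ will be small. This condition is basically necessary, since of course $\frac{1}{\log X} \sum_{n \leq X} \frac{1}{n} = 1 + o(1)$.

If one works with non-logarithmic averages $\frac{1}{X} \sum_{n \leq X} f(n)$, then not pretending to be $1$ is insufficient to establish decay, as was already observed in Exercise 11 of Notes 1: if $f$ is an Archimedean character $f(n) = n^{it}$ for some non-zero real $t$, then $\frac{1}{\log X} \sum_{n \leq X} \frac{f(n)}{n}$ goes to zero as $X \to \infty$ (which is consistent with Proposition 6), but $\frac{1}{X} \sum_{n \leq X} f(n)$ does not go to zero. However, this is in some sense the “only” obstruction to these averages decaying to zero, as quantified by the following basic result: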
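
The contrast between the two kinds of average for an Archimedean character can be seen numerically. The sketch below computes both the plain average and the logarithmic average of $n \mapsto n^{it}$ up to $X$; the function name `archimedean_averages` and the normalisations (dividing by $X$ and by $\log X$ respectively) are assumptions chosen for illustration.

```python
import cmath
import math


def archimedean_averages(t, X):
    """Return (plain average, logarithmic average) of n -> n^{it} over 1 <= n <= X."""
    plain = 0j
    logarithmic = 0j
    for n in range(1, X + 1):
        value = cmath.exp(1j * t * math.log(n))  # n^{it}
        plain += value
        logarithmic += value / n
    return plain / X, logarithmic / math.log(X)


if __name__ == "__main__":
    for X in (10**3, 10**4, 10**5):
        plain, logarithmic = archimedean_averages(1.0, X)
        print(X, abs(plain), abs(logarithmic))
```

Comparing the sum with the integral of $y^{it}$ suggests that the plain average has magnitude close to $1/|1+it|$ for large $X$ (so it oscillates rather than decays), while the logarithmic average decays like $O_t(1/\log X)$.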

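Theorem 7 (Halasz’s theorem) Let $X$ be sufficiently large. Then for any $1$-bounded multiplicative function $f$, one has

$$\frac{1}{X} \sum_{n \leq X} f(n) \ll \exp\left( - c \min_{|t| \leq T} \mathbb{D}(f, n \mapsto n^{it}; X)^2 \right) + \frac{1}{T}$$

for an absolute constant $c > 0$ and any $T > 0$.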

Informally, we refer to a $1$-bounded multiplicative function as “pretentious” if it pretends to be a character such as $n \mapsto n^{it}$, and “non-pretentious” otherwise. The precise distinction is rather malleable, as the precise class of characters that one views as “obstructions” varies from situation to situation. For instance, in Proposition 6 it is just the trivial character $1$ which needs to be considered, but in Theorem 7 it is the characters $n \mapsto n^{it}$ with $|t| \leq T$. In other contexts one may also need to add Dirichlet characters $\chi(n)$ or hybrid characters such as $\chi(n) n^{it}$ to the list of characters that one might pretend to be. The division into pretentious and non-pretentious functions in multiplicative number theory is faintly analogous to the division into major and minor arcs in the circle method applied to additive number theory problems; see Notes 8. The Möbius and Liouville functions are model examples of non-pretentious functions; see Exercise 24.

In the contrapositive, Halasz’ theorem can be formulated as the assertion that if one has a large mean

for some , then one has the pretentious property

for some . This has the flavour of an “inverse theorem”, of the type often found in arithmetic combinatorics.

Among other things, Halasz’s theorem gives yet another proof of the prime number theorem (1); see Section 2.

We now give a version of the Matomaki-Radziwill theorem for general (non-pretentious) multiplicative functions that is formulated in a similar contrapositive (or “inverse theorem”) fashion, though to simplify the presentation we only state a qualitative version that does not give explicit bounds.

Theorem 8 ((Qualitative) Matomaki-Radziwill theorem) Let , and let , with sufficiently large depending on . Suppose that is a $1$-bounded multiplicative function such that

Then one has

for some .

The condition is basically optimal, as the following example shows:

Exercise 9 Let be a sufficiently small constant, and let be such that . Let be the Archimedean character for some . Show that

Combining Theorem 8 with standard non-pretentiousness facts about the Liouville function (see Exercise 24), we recover Theorem 3 (but with a decay rate of only rather than ). We refer the reader to the original paper of Matomaki-Radziwill (as well as this followup paper with myself) for the quantitative version of Theorem 8 that is strong enough to recover the full version of Theorem 3, and which can also handle real-valued pretentious functions.

With our current state of knowledge, the only arguments that can establish the full strength of Halasz and Matomaki-Radziwill theorems are Fourier analytic in nature, relating sums involving an arithmetic function with its Dirichlet series

which one can view as a discrete Fourier transform of (or more precisely of the measure , if one evaluates the Dirichlet series on the right edge of the critical strip). In this aspect, the techniques resemble the complex-analytic methods from Notes 2, but with the key difference that no analytic or meromorphic continuation into the strip is assumed. The key identity that allows us to pass to Dirichlet series is the following variant of Proposition 7 of Notes 2:

Proposition 10 (Parseval type identity) Let be finitely supported arithmetic functions, and let be a Schwartz function. Then

where is the Fourier transform of . (Note that the finite support of and the Schwartz nature of ensure that both sides of the identity are absolutely convergent.)

The restriction that be finitely supported will be slightly annoying in places, since most multiplicative functions will fail to be finitely supported, but this technicality can usually be overcome by suitably truncating the multiplicative function, and taking limits if necessary.

*Proof:* By expanding out the Dirichlet series, it suffices to show that

for any natural numbers . But this follows from the Fourier inversion formula applied at .

For applications to Halasz type theorems, one sets equal to the Kronecker delta , producing weighted integrals of of “” type. For applications to Matomaki-Radziwill theorems, one instead sets , and more precisely uses the following corollary of the above proposition, to obtain weighted integrals of of “” type:

Exercise 11 (Plancherel type identity) If is finitely supported, and is a Schwartz function, establish the identity

In contrast, information about the non-pretentious nature of a multiplicative function will give “pointwise” or “” type control on the Dirichlet series , as is suggested from the Euler product factorisation of .

It will be convenient to formalise the notion of , , and control of the Dirichlet series , which as previously mentioned can be viewed as a sort of “Fourier transform” of :

Definition 12 (Fourier norms) Let be finitely supported, and let be a bounded measurable set. We define the Fourier $L^\infty$ norm

the Fourier $L^2$ norm

and the Fourier $L^1$ norm

One could more generally define norms for other exponents , but we will only need the exponents in this current set of notes. It is clear that all the above norms are in fact (semi-)norms on the space of finitely supported arithmetic functions.

As mentioned above, Halasz’s theorem gives good control on the Fourier norm for restrictions of non-pretentious functions to intervals:

Exercise 13 (Fourier control via Halasz) Let be a $1$-bounded multiplicative function, let be an interval in for some , let , and let be a bounded measurable set. Show that

(Hint: you will need to use summation by parts (or an equivalent device) to deal with a weight.)

Meanwhile, the Plancherel identity in Exercise 11 gives good control on the Fourier norm for functions on long intervals (compare with Exercise 2 from Notes 6):

Exercise 14 ($L^2$ mean value theorem) Let , and let be finitely supported. Show that

Conclude in particular that if is supported in for some and , then

In the simplest case of the logarithmically averaged Halasz theorem (Proposition 6), Fourier estimates are already sufficient to obtain decent control on the (weighted) Fourier type expressions that show up. However, these estimates are not enough by themselves to establish the full Halasz theorem or the Matomaki-Radziwill theorem. To get from Fourier control to Fourier or control more efficiently, the key trick is to use Hölder’s inequality, which when combined with the basic Dirichlet series identity

The strategy is then to factor (or approximately factor) the original function as a Dirichlet convolution (or average of convolutions) of various components, each of which enjoys reasonably good Fourier or estimates on various regions , and then combine them using the Hölder inequalities (5), (6) and the triangle inequality. For instance, to prove Halasz’s theorem, we will split into the Dirichlet convolution of three factors, one of which will be estimated in using the non-pretentiousness hypothesis, and the other two being estimated in using Exercise 14. For the Matomaki-Radziwill theorem, one uses a significantly more complicated decomposition of into a variety of Dirichlet convolutions of factors, and also splits up the Fourier domain into several subregions depending on whether the Dirichlet series associated to some of these components are large or small. In each region and for each component of these decompositions, all but one of the factors will be estimated in , and the other in ; but the precise way in which this is done will vary from component to component. For instance, in some regions a key factor will be small in by construction of the region; in other places, the control will come from Exercise 13. Similarly, in some regions, satisfactory control is provided by Exercise 14, but in other regions one must instead use “large value” theorems (in the spirit of Proposition 9 from Notes 6), or amplify the power of the standard mean value theorems by combining the Dirichlet series with other Dirichlet series that are known to be large in this region.

There are several ways to achieve the desired factorisation. In the case of Halasz’s theorem, we can simply work with a crude version of the Euler product factorisation, dividing the primes into three categories (“small”, “medium”, and “large” primes) and expressing as a triple Dirichlet convolution accordingly. For the Matomaki-Radziwill theorem, one instead exploits the Turan-Kubilius phenomenon (Section 5 of Notes 1, or Lemma 2 of Notes 9) that for various moderately wide ranges of primes, the number of prime divisors of a large number in the range is almost always close to . Thus, if we introduce the arithmetic functions

and more generally we have a twisted approximation

for multiplicative functions . (Actually, for technical reasons it will be convenient to work with a smoothed out version of these functions; see Section 3.) Informally, these formulas suggest that the “ energy” of a multiplicative function is concentrated in those regions where is extremely large in a sense. Iterations of this formula (or variants of this formula, such as an identity due to Ramaré) will then give the desired (approximate) factorisation of .
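
The Turan-Kubilius phenomenon invoked above is easy to observe empirically. The following sketch counts, for each $n \leq N$, the number of prime divisors of $n$ lying in a range $[P,Q]$, and compares the empirical mean and mean square deviation with the sum of $1/p$ over primes $P \leq p \leq Q$. The function names and the parameters `N`, `P`, `Q` are hypothetical choices for illustration, and the computation is not part of the argument in these notes.

```python
import math


def primes_in_range(P, Q):
    """List of primes p with P <= p <= Q, via a sieve of Eratosthenes up to Q."""
    is_prime = [True] * (Q + 1)
    is_prime[0:2] = [False] * min(2, Q + 1)
    for p in range(2, math.isqrt(Q) + 1):
        if is_prime[p]:
            for m in range(p * p, Q + 1, p):
                is_prime[m] = False
    return [p for p in range(max(P, 2), Q + 1) if is_prime[p]]


def prime_divisor_statistics(N, P, Q):
    """Return (empirical mean, predicted mean, mean square deviation from prediction)
    for the number of prime divisors in [P, Q] of the integers 1 <= n <= N."""
    primes = primes_in_range(P, Q)
    counts = [0] * (N + 1)
    for p in primes:
        for m in range(p, N + 1, p):
            counts[m] += 1  # p divides m
    predicted = sum(1.0 / p for p in primes)
    mean = sum(counts[1:]) / N
    deviation = sum((c - predicted) ** 2 for c in counts[1:]) / N
    return mean, predicted, deviation


if __name__ == "__main__":
    mean, predicted, deviation = prime_divisor_statistics(10**5, 10, 10**3)
    print("empirical mean        :", mean)
    print("sum of 1/p            :", predicted)
    print("mean square deviation :", deviation)
```

The empirical mean agrees with the prediction up to an error of at most (number of primes in the range)/N, and the mean square deviation is of comparable size to the prediction itself, so the count concentrates once the sum of $1/p$ is large.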

** — 1. Pretentious distance — **

In this section we explore the notion of pretentious distance. The following Hilbert space lemma will be useful for establishing the triangle inequality for this distance:

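Lemma 15 (Triangle inequality) Let $u, v, w$ be vectors in a real Hilbert space $H$ with $\|u\|, \|v\|, \|w\| \leq 1$. Then

$$\sqrt{1 - \langle u, w \rangle} \leq \sqrt{1 - \langle u, v \rangle} + \sqrt{1 - \langle v, w \rangle}.$$

*Proof:* First suppose that $u, v, w$ are unit vectors: $\|u\| = \|v\| = \|w\| = 1$. Then by the cosine rule $1 - \langle u, w \rangle = \frac{1}{2} \|u - w\|^2$, and similarly for $1 - \langle u, v \rangle$ and $1 - \langle v, w \rangle$. The claim now follows from the usual triangle inequality $\|u - w\| \leq \|u - v\| + \|v - w\|$.

Now suppose we are in the general case when $\|u\|, \|v\|, \|w\| \leq 1$. In this case we extend $u, v, w$ to unit vectors by working in the product of $H$ with the Euclidean space $\mathbb{R}^3$ and applying the previous inequality to the extended unit vectors

$$\tilde u := (u, \sqrt{1 - \|u\|^2}, 0, 0), \quad \tilde v := (v, 0, \sqrt{1 - \|v\|^2}, 0), \quad \tilde w := (w, 0, 0, \sqrt{1 - \|w\|^2}),$$

observing that the extensions have the same inner products as the original vectors.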
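
One can sanity-check this Hilbert space inequality by random sampling. The sketch below assumes the lemma has the form $\sqrt{1 - \langle u, w \rangle} \leq \sqrt{1 - \langle u, v \rangle} + \sqrt{1 - \langle v, w \rangle}$ for vectors of norm at most one, draws random triples from the unit ball of a finite-dimensional Euclidean space, and records the smallest observed value of the right-hand side minus the left-hand side. The names `defect`, `random_vector_in_unit_ball` and `worst_triangle_gap` are hypothetical.

```python
import math
import random


def defect(u, v):
    """sqrt(1 - <u, v>) for vectors u, v of norm at most one."""
    inner = sum(a * b for a, b in zip(u, v))
    return math.sqrt(max(0.0, 1.0 - inner))  # clamp rounding errors


def random_vector_in_unit_ball(dim, rng):
    """A random vector with a Gaussian direction and a norm drawn from [0, 1)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in direction))
    radius = rng.random()
    return [radius * a / norm for a in direction]


def worst_triangle_gap(trials, dim, seed=0):
    """Minimum over random triples of defect(u,v) + defect(v,w) - defect(u,w)."""
    rng = random.Random(seed)
    worst = math.inf
    for _ in range(trials):
        u = random_vector_in_unit_ball(dim, rng)
        v = random_vector_in_unit_ball(dim, rng)
        w = random_vector_in_unit_ball(dim, rng)
        worst = min(worst, defect(u, v) + defect(v, w) - defect(u, w))
    return worst


if __name__ == "__main__":
    print("smallest observed gap:", worst_triangle_gap(10000, 5))
```

A non-negative output (up to rounding) is consistent with the lemma; a random search of this kind is of course no substitute for the proof above.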

Exercise 16 (Basic properties of pretentious distance) Let be $1$-bounded multiplicative functions, and let .

- (i) (Metric type properties) Show that , with equality if and only if and for all primes . Furthermore, show that and . (Hint: for the last property, apply Lemma 15 to a suitable Hilbert space .)
- (ii) (Alternate triangle inequality) Show that .
- (iii) (Bounds) One has
and if , then

- (iv) (Invariance) One has , and
In particular, if for all , then .

Exercise 17 If are Dirichlet characters of periods respectively induced from the same primitive character, and , show that for some absolute constant (the only purpose of which is to keep the triple logarithm positive). (Hint: control the contributions of the primes in each dyadic block separately for .)

Next, we relate pretentious distance to the value of Dirichlet series just to the right of the critical strip. There is an annoying minor technicality that the prime has to be treated separately, but this will not cause too much trouble.

Lemma 18 (Dirichlet series and pretentious distance) Let be a $1$-bounded multiplicative function, , and . Then

In particular, we always have the upper bound

and if one imposes the technical condition that either for all or for all , then

If for all and , then we may delete the terms in the above claims.

*Proof:* By replacing with we may assume without loss of generality that . We begin with the first claim (8). By expanding out the Euler product, the left-hand side of (8) is equal to

and from Definition 5 and Mertens’ theorem we have

and so it will suffice on canceling the factor and taking logarithms to show that

For , the quantity differs from by at most . Also we have

and hence by Taylor expansion

By the triangle inequality, it thus suffices to show that

But the first bound follows from the mean value estimate and Mertens’ theorems, while the second bound follows from summing the bounds

that also arise from Mertens’ theorems.

The quantity is bounded in magnitude by , giving (9). Under either of the two technical conditions listed, this quantity is equal to either or , and in either case it is comparable in magnitude to , giving (10).

If for and , we may repeat the above arguments with the terms deleted, since we no longer need to control the tail contribution .

Now we explore the geometry of the Archimedean characters with respect to pretentious distance.

Proposition 19 If is sufficiently large, then

for . In particular one has

for .

The precise exponent here is not of particular significance; any constant between and would work for our application to the Matomaki-Radziwill theorem, with the most important feature being that grows significantly faster than any fixed power of . The ability to raise the exponent beyond will be provided by the Vinogradov estimates for the zeta function. (For Halasz’s theorem one only needs these bounds in the easier range , which does not require the Vinogradov estimates.) As a particular corollary of this proposition and Exercise 16(iii), we see that

whenever ; thus the Archimedean characters do not pretend to be like each other at all once the parameter is changed by at least a unit distance (but not changed by an enormous amount).
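To get a feel for the geometry described above, one can tabulate the squared distance from the constant function 1 to an Archimedean character numerically. The following is a rough sketch only: it assumes the standard formula in which this squared distance is the sum over primes p up to x of (1 - cos(t log p))/p, and the function names and the particular values of t and x are illustrative choices, not taken from the notes.

```python
import math

def primes_up_to(x):
    """Return all primes p <= x via the sieve of Eratosthenes (assumes x >= 2)."""
    sieve = [True] * (x + 1)
    sieve[0] = sieve[1] = False
    for n in range(2, math.isqrt(x) + 1):
        if sieve[n]:
            for m in range(n * n, x + 1, n):
                sieve[m] = False
    return [p for p in range(2, x + 1) if sieve[p]]

def squared_distance_to_archimedean(t, x):
    """Assumed form: D(1, n^{it}; x)^2 = sum_{p<=x} (1 - cos(t log p)) / p."""
    return sum((1 - math.cos(t * math.log(p))) / p for p in primes_up_to(x))

x = 10**4
# Compare each squared distance against log log x, the size of sum_{p<=x} 1/p.
for t in (0.0, 0.01, 0.1, 1.0, 10.0):
    print(t, squared_distance_to_archimedean(t, x), math.log(math.log(x)))
```

At this small scale the table is only suggestive, but it shows the squared distance vanishing at t = 0, staying small for tiny t, and becoming comparable to log log x once t is of unit size.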

*Proof:* By Definition 5, our task is to show that

We begin with the upper bound. For , the claim follows from Mertens’ theorems and the triangle inequality. For , we bound

and the claim again follows from Mertens’ theorems (note that in this case). For , we bound by for and by for , and the claim once again follows from Mertens’ theorems.

Now we establish the lower bound. We first work in the range . In this case we have a matching lower bound

for and some small absolute constant , and hence

giving the lower bound. Now suppose that . Applying Lemma 18 with and replaced by some , we have

and thus

or equivalently by Mertens’ theorem

Applying this bound with replaced by and by we conclude that

and hence by Mertens’ theorem and the triangle inequality (for small enough)

giving the claim.

Finally, assume that . From Mertens’ theorems we have

so by the triangle inequality it will suffice to show that

Taking logarithms in (11) for we have

and also

hence by the fundamental theorem of calculus

for some . However, from the Vinogradov-Korobov estimates (Exercise 43 of Notes 2) we have

whenever ; since we are assuming , the claim follows.

Exercise 20 Assume the Riemann hypothesis. Establish a bound of the form

for some absolute constant whenever for a sufficiently large absolute constant . (Hint: use Perron’s formula and shift the contour to within of the critical line.) Use this to conclude that the upper bound in Proposition 19 can be relaxed (assuming RH) to .

Exercise 21 Let be a -bounded multiplicative function with for all . For any , show that

Thus some sort of upper bound on in Proposition 19 is necessary.

Exercise 22 Let be a non-principal character of modulus , and let be sufficiently large depending on . Show that

for all . (One will need to adapt the Vinogradov-Korobov theory to Dirichlet -functions.)

Proposition 19 measures how close the function lies to the Archimedean characters . Using the triangle inequality, one can then lower bound the distance of any other -bounded multiplicative function to these characters:

Proposition 23 Let be sufficiently large. Then for any -bounded multiplicative function , there exists a real number with such that

whenever . In particular we have

if and . If is real-valued, one can take .

*Proof:* For the first claim, choose to minimize among all real numbers with . Then for any other , we see from the triangle inequality that

But from Proposition 19 we have

giving the first claim. When is real valued, we can similarly use the triangle inequality to bound

which gives

giving the second claim.

We can now quantify the non-pretentious nature of the Möbius and Liouville functions.

Exercise 24

- (i) If is sufficiently large, and is a real-valued -bounded multiplicative function, show that
whenever .

- (ii) Show that
whenever and is sufficiently large.

- (iii) If is a Dirichlet character of some period , show that
whenever and is sufficiently large depending on .

** — 2. Halasz’s inequality — **

We now prove Halasz’s inequality. As a warm up, we prove Proposition 6:

*Proof:* (Proof of Proposition 6) By Exercise 16(iv) we may normalise . We may assume that when , since the value of on these primes has no impact on the sum or on . In particular, from Euler products we now have the absolute convergence . Let be a small quantity to be optimized later, and let be a smooth compactly supported function on that equals one on with the derivative bounds , on , so that on integration by parts we see that the Fourier transform obeys the bounds

for any and . From the triangle inequality we have

Applying Proposition 10 to a finite truncation of , and then using the absolute convergence of and dominated convergence to eliminate the truncation (or by using Proposition 7 of Notes 2 and then shifting the contour), we can write the right-hand side as

which after rescaling by gives

Now from Lemma 18 one has

where , and thus we can bound (12) by

if is chosen to be a sufficiently large absolute constant. The claim then follows by optimising in .

We remark that an elementary proof of Proposition 6 with was given in Proposition 1.2.6 of .

It was observed that we can also sharpen Proposition 6 when is non-negative by purely elementary methods:

Proposition 25 Let , and suppose that is multiplicative, -bounded, and non-negative. Then

*Proof:* From Definition 5 and Mertens’ theorems the estimate is equivalent to

For the upper bound, we may assume that vanishes for since these primes make no contribution to either side. We can then bound

For the lower bound, we let be the -bounded multiplicative function with and for and all primes . Then we observe the pointwise bound for all , hence

By the upper bound just obtained and Mertens’ theorems, we have

and the claim follows.

Now we can prove Theorem 7.

*Proof:* (Proof of Theorem 7) We may assume that , since the claim is trivial otherwise. On the other hand, for for a sufficiently large , the second term on the right-hand side is dominated by the first, so the estimate does not become any stronger as one increases beyond , and hence we may assume without loss of generality that .

We abbreviate , thus we wish to show that

It is convenient to remove some exceptional values of . Let be a small quantity to be chosen later, subject to the restriction

From standard sieves (e.g., Theorem 32 from Notes 4), we see that the proportion of numbers in that do not have a “large” prime factor in , or do not have a “medium” prime factor in , is . Thus by paying an error of , we may restrict to numbers that have at least one “large” prime factor in and at least one “medium” prime factor in (and no prime factor larger than ). This is the same as replacing with the Dirichlet convolution

where is the restriction of to numbers with all prime factors in the “small” range , is the restriction of to numbers in with all prime factors in the “medium” range , and is the restriction of to numbers in with all prime factors in the “large” range . We can thus write

This we can write in turn as

where . It is not advantageous to immediately apply Proposition 10 due to the rough nature of (which is not even Schwartz). But if we let be a Schwartz function of total mass whose Fourier transform is supported on , and define the mollified function

then one easily checks that

which from the triangle inequality soon gives the bounds

Hence we may write

Now we apply Proposition 10 and the triangle inequality to bound this by

But we may factor

and we may rather crudely bound , hence we have

Using the Hölder inequalities (5), (6) we have

From Exercise 14 we have

Note that is supported on and hence is effectively also restricted to the range . From standard sieves (using (15)), we have

and thus

Similarly

Finally, from Lemma 18 one has

Putting all this together, we conclude that

Setting for some sufficiently small constant (which in particular will ensure (15) since ), we obtain the claim.

One can optimise this argument to make the constant in Theorem 7 arbitrarily close to ; see this previous post. With an even more refined argument, one can prove the sharper estimate

with , a result initially due to Montgomery and Tenenbaum; see Theorem 2.3.1 of this text of Granville and Soundararajan. In the case of non-negative , an elementary argument gives the stronger bound ; see Corollary 1.2.3 of . However, the slightly weaker estimates in Theorem \ref Let be sufficiently large. Then for any real-valued -bounded multiplicative function , one has

for an absolute constant .

Thus for instance, setting , we can use Wirsing’s theorem and Exercise 24 (or Mertens’ theorem) to recover a form of the prime number theorem with a modestly decaying error term, in that

for all large and some absolute constant . (Admittedly, we did use the far stronger Vinogradov-Korobov estimates earlier in this set of notes; but a careful inspection reveals that those estimates were not used in the proof of (16), so this is a non-circular proof of the prime number theorem.)

** — 3. The Matomaki-Radziwill theorem — **

We now give the proof of the Matomaki-Radziwill theorem, though we will leave several of the details to exercises. We first make a small but convenient reduction:

Exercise 26 Show that to prove Theorem 8, it suffices to do so for functions that are completely multiplicative. (This is similar to Exercise 1.)

Now we use Exercise 11 to phrase the theorem in an equivalent Fourier form:

Theorem 27 (Matomaki-Radziwill theorem, Fourier form) Let , and let , with sufficiently large depending on . Let be a fixed smooth compactly supported function, and set . Suppose that is a -bounded completely multiplicative function such that

for some .

Let us assume Theorem 27 for the moment and see how it implies Theorem 8. In the latter theorem we may assume without loss of generality that is small. We may assume that , since the case follows easily from Theorem 7.

Let be a smooth compactly supported function with on . By hypothesis, we have

Let be a Schwartz function of mean whose Fourier transform is supported on . For any , we consider the expression

A routine calculation using the rapid decrease of shows that

and thus the expression (19) can be estimated as

By Cauchy-Schwarz we then have

Averaging this for and using Fubini’s theorem, we have

and thus from (18) and the triangle inequality we have

On the other hand, from Exercise 11 we have

Applying Theorem 27 (with a slightly smaller value of and ), we obtain the claim.

Exercise 28 In the converse direction, show that Theorem 27 is a consequence of Theorem 8.

Exercise 29 Let be supported on , and let . Show that

(Hint: use summation by parts to express as a suitable linear combination of sums and , then use the Cauchy-Schwarz inequality and the Fubini-Tonelli theorem.) Conclude in particular that

where . (This argument is due to Saffari and Vaughan.)

It remains to establish Theorem 27. As before we may assume that is small. Let us call a finitely supported arithmetic function *large* on some subset of if

and *small* on if

Note that a function cannot be simultaneously large and small on the same set ; and if a function is large on some subset , then it remains large on after modifying by any small error (assuming is small enough, and adjusting the implied constants appropriately). From the hypothesis (17) we know that is large on . As discussed in the introduction, the strategy is now to decompose into various regions, and on each of these regions split (up to small errors) as an average of Dirichlet convolutions of other factors which enjoy either good estimates or good estimates on the given region.

We will need the following ranges:

- (i) is the interval
- (ii) is the interval
- (iii) is the interval

We will be able to cover the range just using arguments involving the zeroth interval ; the range can be covered using arguments involving the zeroth interval and the first interval ; and the range can be covered using arguments involving all three intervals . Coverage of the remaining ranges of can be done by an extension of the methods given here and will be left to the exercises at the end of the notes.

We introduce some weight functions and some exceptional sets. For any , let be a bump function on of total mass , and let denote the arithmetic function

supported on the primes . We then define the following subsets of :

- (i) is the set of those such that
for some dyadic (i.e., is restricted to be a power of ).

- (ii) is the set of those such that
for some dyadic .

- (iii) is the set of those such that
for some dyadic .

We will establish the following claims:

Proposition 30

- (i) If for some , then is small on .
- (ii) is small on .
- (iii) is small on .
- (iv) If is large on , then one has for some .

Note that parts (i) (with ) and (iv) of the claims are already enough to treat the case ; parts (i) (with ), (ii), and (iv) are enough to treat the case ; and parts (i) (with ), (ii), (iii), and (iv) are enough to treat the case .

We first prove (i). For , let denote the function

This function is a variant of the function introduced in (7). A key point is that the convolutions stay close to :

Exercise 31 (Turan-Kubilius inequalities) For , show that

(Hint: use the second moment method, as in the proof of Lemma 2 of Notes 9.)

Let . Inserting the bounded factor in the above estimates, and applying Exercise 14, we conclude in particular that the expression

is small on . Since is completely multiplicative, we can write this expression as

We now perform some technical manipulations to move the cutoff to a more convenient location. From (20) we have

We would like to approximate by . A brief triangle inequality calculation using the smoothness of , the -boundedness of , and the narrow support of shows that

where is defined similarly to but with a slightly larger choice of initial cutoff . Integrating this we conclude that

Using Exercise 31 and Exercise 14, the error term is small on . Thus we conclude that

is small on , and hence also on . Thus by the triangle inequality, it will suffice to show that

is small on for each . But by construction we definitely have

while from Exercise 14 and the hypothesis we have

and the claim (i) now follows from (6).

We now jump to (iv). The first observation is that the set is quite small. To quantify this we use the following bound:

Proposition 32 (Large values of ) Let be sufficiently large depending on , and let . Then for any , the set

has measure at most .

*Proof:* We use the high moment method. Let be a natural number to be optimised later, and let be the convolution of copies of . Then on we have . Thus by Markov’s inequality, the measure of is at most

To bound this we use Exercise 14. If we choose to be the first integer for which , then this exercise gives us the bound

From the fundamental theorem of arithmetic we see that , hence

From the prime number theorem we have . Putting all this together, we conclude that the measure of is at most

Since , we obtain the claim.

Applying this proposition with ranging between and and , and applying the union bound, we see that the measure of is at most . To exploit this, we will need some bounds of Vinogradov-Korobov type:

Exercise 33 (Vinogradov bounds) For , establish the bound

for any and . (Hint: replace with a weighted version of the von Mangoldt function, apply Proposition 7 of Notes 2, and shift the contour, using the zero free region from Exercise 43 of Notes 2.)

Now we use the following variant of the Montgomery-Halasz large values theorem (cf. Proposition 9 from Notes 6):

Proposition 34 (Montgomery-Halasz for primes) Let , let , and have measure . Then for any -bounded function , one has

*Proof:* By duality, we may write

for some measurable function with . We can rearrange the right-hand side as

which by Cauchy-Schwarz is bounded in magnitude by

We can rearrange this as

which by the elementary inequality is bounded by

(cf. Lemma 6 from Notes 9). By Exercise 33 we have

The claim follows.

Now we can prove (iv). By hypothesis, is large on . On the other hand, the function (21) (with ) is small on . By the triangle inequality, we conclude that

is large on , hence by the pigeonhole principle, there exists such that

is large on . On the other hand, from Proposition 32 and Proposition 34 we have

hence by (6)

By the pigeonhole principle, this implies that

for some interval . The claim (iv) now follows from Exercise 13. Note that this already concludes the argument in the range .

Now we establish (ii). Here the set is not as well controlled in size as , but is still quite small. Indeed, from applying Proposition 32 with ranging between and and , and applying the union bound, we see that the measure of is at most . This is too large of a bound to apply Proposition 34, but we may instead apply a different bound:

Exercise 35 (Montgomery-Halasz for integers) Let , and let have measure . For any -bounded function supported on , show that

(It is possible to remove the logarithmic loss here by being careful, but this loss will be acceptable for our arguments. One can either repeat the arguments used to prove Proposition 34, or else appeal to Proposition 9 from Notes 6.)

The point here is that we can get good bounds even when the function is supported at narrower scales (such as ) than the Fourier interval under consideration (such as or ). In particular, this exercise will serve as a replacement for Exercise 14, which will not give good estimates in this case.

As before, the function (21) is small on , so it will suffice by the triangle inequality to show that

is small on for all . From Exercise 35, we have

while from the definition of we have

and the claim now follows from (6). Note that this already concludes the argument in the range .

Finally, we establish (iii). The function (21) (with ) is small on , so by the triangle inequality as before it suffices to show that

is small on for all . On the one hand, the definition of gives a good bound on :

To conclude using (6) we now need a good bound for . Unfortunately, the function is now supported on too short of an interval for Exercise 14 to give good estimates, and is too large for Exercise 35 to be applicable either.

But from definition we see that for , we have

for at least one . We can use this to amplify the power of Exercise 14:

*Proof:* For each , let denote the set of where (23) holds. Since there are values of , and , it suffices by the union bound to show that

(say) for each . Let be the first integer for which the quantity , thus

From (23) we have the pointwise bound

and hence by Exercise 14

where . As is -bounded, and the summand only vanishes when , we can bound the right-hand side by

where denotes the set of primes in the interval .

Suppose has prime factors in this interval (counting multiplicity). Then vanishes unless , in which case we can bound

and

Thus we may bound the above sum by

By the prime number theorem, has elements, so by double counting we have

and thus the previous bound becomes

which sums to

Since

we thus have

so that , we obtain the claim.

Combining this proposition with (22) and (6), we conclude part (iii) of Proposition 30. This establishes Theorem 27 up to the range .

Exercise 37 Show that for any fixed , Theorem 27 holds in the range

where denotes the -fold iterated logarithm of . (Hint: this is already accomplished for . For higher , one has to introduce additional exceptional intervals and extend Proposition 30 appropriately.)


Exercise 38 Establish Theorem 27 (and hence Theorem 8) in full generality. (This is the preceding exercise, but now with potentially as large as , where the inverse tower exponential function is defined as the least for which . Now one has to start tracking dependence on of all the arguments in the above analysis; in particular, the convenient notation of arithmetic functions being “large” or “small” needs to be replaced with something more precise.)

Theorem 1 (Eigenvector-eigenvalue identity) Let be an Hermitian matrix, with eigenvalues . Let be a unit eigenvector corresponding to the eigenvalue , and let be the component of . Then

where is the Hermitian matrix formed by deleting the row and column from .
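For readers who would like to see the identity in action, here is a small check in the simplest non-trivial case of a real symmetric 2 x 2 matrix, where everything can be computed in closed form. This is only a sketch under an assumption: it takes the identity in its usual form, in which the squared modulus of the j-th component of the i-th unit eigenvector, multiplied by the product of the gaps between the i-th eigenvalue and the other eigenvalues of the matrix, equals the product of the gaps between the i-th eigenvalue and the eigenvalues of the minor with the j-th row and column deleted. The function names and the sample matrix are hypothetical.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenpairs (ascending eigenvalues, unit eigenvectors) of [[a, b], [b, d]], b != 0."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (mean - radius, mean + radius):
        v = (b, lam - a)                 # solves (A - lam I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def identity_sides(a, b, d, i, j):
    """Both sides of the (assumed) identity for eigenvalue index i and component j."""
    pairs = eig_sym_2x2(a, b, d)
    lam_i, v_i = pairs[i]
    lam_other = pairs[1 - i][0]
    # Deleting row and column j leaves a 1 x 1 minor whose only eigenvalue is its entry.
    minor_eigenvalue = d if j == 0 else a
    lhs = v_i[j] ** 2 * (lam_i - lam_other)
    rhs = lam_i - minor_eigenvalue
    return lhs, rhs

for i in range(2):
    for j in range(2):
        print(i, j, identity_sides(2.0, 1.0, -1.0, i, j))
```

In this 2 x 2 case each product has a single factor, so the check reduces to the elementary relation between the characteristic polynomial and the diagonal entries; larger matrices would need a numerical eigensolver.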

When we posted the first version of this paper, we were unaware of previous appearances of this identity in the literature; a related identity had been used by Erdos-Schlein-Yau and by myself and Van Vu for applications to random matrix theory, but to our knowledge this specific identity appeared to be new. Even two months after our preprint first appeared on the arXiv in August, we had only learned of one other place in the literature where the identity showed up (by Forrester and Zhang, who also cite an earlier paper of Baryshnikov).

The situation changed rather dramatically with the publication of a popular science article in Quanta on this identity in November, which gave this result significantly more exposure. Within a few weeks we became informed (through private communication, online discussion, and exploration of the citation tree around the references we were alerted to) of over three dozen places where the identity, or some other closely related identity, had previously appeared in the literature, in such areas as numerical linear algebra, various aspects of graph theory (graph reconstruction, chemical graph theory, and walks on graphs), inverse eigenvalue problems, random matrix theory, and neutrino physics. As a consequence, we have decided to completely rewrite our article in order to collate this crowdsourced information, and survey the history of this identity, all the known proofs (we collect seven distinct ways to prove the identity (or generalisations thereof)), and all the applications of it that we are currently aware of. The citation graph of the literature that this *ad hoc* crowdsourcing effort produced is only very weakly connected, which we found surprising:

The earliest explicit appearance of the eigenvector-eigenvalue identity we are now aware of is in a 1966 paper of Thompson, although this paper is only cited (directly or indirectly) by a fraction of the known literature, and also there is a precursor identity of Löwner from 1934 that can be shown to imply the identity as a limiting case. At the end of the paper we speculate on some possible reasons why this identity only achieved a modest amount of recognition and dissemination prior to the November 2019 Quanta article.

Let be a discrete group. A *(concrete) measure-preserving action* of on is a group homomorphism from to , thus is the identity map and for all . A large portion of ergodic theory is concerned with the study of such measure-preserving actions, especially in the classical case when is the integers (with the additive group law).

Let be a compact Hausdorff abelian group, which we can endow with the Borel -algebra . A *(concrete measurable) –cocycle* is a collection of concrete measurable maps obeying the *cocycle equation*

for -almost every . (Here we are glossing over a measure-theoretic subtlety that we will return to later in this post – see if you can spot it before then!) Cocycles arise naturally in the theory of group extensions of dynamical systems; in particular (and ignoring the aforementioned subtlety), each cocycle induces a measure-preserving action on (which we endow with the product of with Haar probability measure on ), defined by

This connection with group extensions was the original motivation for our study of measurable cohomology, but is not the focus of the current paper.

A special case of a -valued cocycle is a *(concrete measurable) -valued coboundary*, in which for each takes the special form

for -almost every , where is some measurable function; note that (ignoring the aforementioned subtlety), every function of this form is automatically a concrete measurable -valued cocycle. One of the first basic questions in measurable cohomology is to try to characterize which -valued cocycles are in fact -valued coboundaries. This is a difficult question in general. However, there is a general result of Moore and Schmidt that at least allows one to reduce to the model case when is the unit circle , by taking advantage of the Pontryagin dual group of characters , that is to say the collection of continuous homomorphisms to the unit circle. More precisely, we have

Theorem 1 (Countable Moore-Schmidt theorem) Let be a discrete group acting in a concrete measure-preserving fashion on a probability space . Let be a compact Hausdorff abelian group. Assume the following additional hypotheses:

- (i) is at most countable.
- (ii) is a standard Borel space.
- (iii) is metrisable.
Then a -valued concrete measurable cocycle is a concrete coboundary if and only if for each character , the -valued cocycles are concrete coboundaries.
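To make the definitions of cocycle and coboundary concrete, and to illustrate the easy direction of Theorem 1 (a coboundary composed with a character is again a coboundary), here is a toy finite model in Python. Everything in it is an illustrative assumption rather than part of the paper: the acting group and the space are both the cyclic group Z/6 acting on itself by translation, the target group is Z/4, the function `F` is arbitrary made-up data, and the cocycle equation is taken in the form rho_{g+h}(x) = rho_g(T^h x) + rho_h(x).

```python
import cmath

N, M = 6, 4   # Gamma = X = Z/N (acting on itself by translation), K = Z/M

def act(gamma, x):
    """The measure-preserving action T^gamma x = x + gamma on X = Z/N."""
    return (x + gamma) % N

F = [0, 3, 1, 1, 2, 0]   # an arbitrary function F: X -> K (hypothetical data)

def rho(gamma, x):
    """Coboundary rho_gamma(x) = F(T^gamma x) - F(x), computed in K = Z/M."""
    return (F[act(gamma, x)] - F[x]) % M

def is_cocycle(r):
    """Check r_{g+h}(x) = r_g(T^h x) + r_h(x) for all g, h in Gamma and x in X."""
    return all(r((g + h) % N, x) == (r(g, act(h, x)) + r(h, x)) % M
               for g in range(N) for h in range(N) for x in range(N))

def character(k):
    """The character chi_k(y) = exp(2 pi i k y / M) of K = Z/M."""
    return lambda y: cmath.exp(2j * cmath.pi * k * y / M)

def character_is_coboundary(k):
    """Check chi_k(rho_gamma(x)) = (chi_k o F)(T^gamma x) / (chi_k o F)(x)."""
    chi = character(k)
    return all(abs(chi(rho(g, x)) - chi(F[act(g, x)]) / chi(F[x])) < 1e-9
               for g in range(N) for x in range(N))

print(is_cocycle(rho), all(character_is_coboundary(k) for k in range(M)))
```

In a finite model like this there are no null sets to worry about, so the sketch says nothing about the measure-theoretic subtleties or about the hard direction of the theorem; it only illustrates the algebra.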

The hypotheses (i), (ii), (iii) are saying in some sense that the data are not too “large”; in all three cases they are saying in some sense that the data are only “countably complicated”. For instance, (iii) is equivalent to being second countable, and (ii) is equivalent to being modeled by a complete separable metric space. It is because of this restriction that we refer to this result as a “countable” Moore-Schmidt theorem. This theorem is a useful tool in several other applications, such as the Host-Kra structure theorem for ergodic systems; I hope to return to these subsequent applications in a future post.

Let us very briefly sketch the main ideas of the proof of Theorem 1. Ignore for now issues of measurability, and pretend that something that holds almost everywhere in fact holds everywhere. The hard direction is to show that if each is a coboundary, then so is . By hypothesis, we then have an equation of the form

for all and some functions , and our task is then to produce a function for which

for all .

Comparing the two equations, the task would be easy if we could find an for which

for all . However there is an obstruction to this: the left-hand side of (3) is additive in , so the right-hand side would have to be also in order to obtain such a representation. In other words, for this strategy to work, one would have to first establish the identity

for all . On the other hand, the good news is that if we somehow manage to obtain the equation, then we can obtain a function obeying (3), thanks to Pontryagin duality, which gives a one-to-one correspondence between and the homomorphisms of the (discrete) group to .

Now, it turns out that one cannot derive the equation (4) directly from the given information (2). However, the left-hand side of (2) is additive in , so the right-hand side must be also. Manipulating this fact, we eventually arrive at

In other words, we don’t get to show that the left-hand side of (4) vanishes, but we do at least get to show that it is -invariant. Now let us assume for sake of argument that the action of is ergodic, which (ignoring issues about sets of measure zero) basically asserts that the only -invariant functions are constant. So now we get a weaker version of (4), namely

for some constants .

Now we need to eliminate the constants. This can be done by the following group-theoretic projection. Let denote the space of concrete measurable maps from to , up to almost everywhere equivalence; this is an abelian group where the various terms in (5) naturally live. Inside this group we have the subgroup of constant functions (up to almost everywhere equivalence); this is where the right-hand side of (5) lives. Because is a divisible group, there is an application of Zorn’s lemma (a good exercise for those who are not acquainted with these things) to show that there exists a retraction , that is to say a group homomorphism that is the identity on the subgroup . We can use this retraction, or more precisely the complement , to eliminate the constant in (5). Indeed, if we set

then from (5) we see that

while from (2) one has

and now the previous strategy works with replaced by . This concludes the sketch of proof of Theorem 1.

In making the above argument rigorous, the hypotheses (i)-(iii) are used in several places. For instance, to reduce to the ergodic case one relies on the ergodic decomposition, which requires the hypothesis (ii). Also, most of the above equations only hold outside of a set of measure zero, and the hypotheses (i) and (iii) (the latter being equivalent to being at most countable) are needed to avoid the problem that an uncountable union of sets of measure zero could have positive measure (or fail to be measurable at all).

My co-author Asgar Jamneshan and I are working on a long-term project to extend many results in ergodic theory (such as the aforementioned Host-Kra structure theorem) to “uncountable” settings in which hypotheses analogous to (i)-(iii) are omitted; thus we wish to consider actions of uncountable groups on spaces that are not standard Borel, and cocycles taking values in groups that are not metrisable. Such uncountable contexts naturally arise when trying to apply ergodic theory techniques to combinatorial problems (such as the inverse conjecture for the Gowers norms), as one often relies on the ultraproduct construction (or something similar) to generate an ergodic theory translation of these problems, and these constructions usually give “uncountable” objects rather than “countable” ones. (For instance, the ultraproduct of finite groups is a hyperfinite group, which is usually uncountable.) This paper marks the first step in this project by extending the Moore-Schmidt theorem to the uncountable setting.

If one simply drops the hypotheses (i)-(iii) and tries to prove the Moore-Schmidt theorem, several serious difficulties arise. We have already mentioned the loss of the ergodic decomposition and the possibility that one has to control an uncountable union of null sets. But there is in fact a more basic problem when one deletes (iii): the addition operation , while still continuous, can fail to be measurable as a map from to ! Thus for instance the sum of two measurable functions need not remain measurable, which makes even the very definition of a measurable cocycle or measurable coboundary problematic (or at least unnatural). This phenomenon is known as the *Nedoma pathology*. A standard example arises when is the uncountable torus , endowed with the product topology. Crucially, the Borel -algebra generated by this uncountable product is *not* the product of the factor Borel -algebras (the discrepancy ultimately arises from the fact that topologies permit uncountable unions, but -algebras do not); relating to this, the product -algebra is *not* the same as the Borel -algebra , but is instead a strict sub-algebra. If the group operations on were measurable, then the diagonal set

would be measurable in . But it is an easy exercise in manipulation of -algebras to show that if are any two measurable spaces and is measurable in , then the fibres of are contained in some countably generated subalgebra of . Thus if were -measurable, then all the points of would lie in a single countably generated -algebra. But the cardinality of such an algebra is at most while the cardinality of is , and Cantor’s theorem then gives a contradiction.

To resolve this problem, we give a coarser -algebra than the Borel -algebra, namely the *Baire -algebra* , thus coarsening the measurable space structure on to a new measurable space . In the case of compact Hausdorff abelian groups, can be defined as the -algebra generated by the characters ; for more general compact abelian groups, one can define as the -algebra generated by all continuous maps into metric spaces. This -algebra is equal to when is metrisable but can be smaller for other . With this measurable structure, becomes a measurable group; it seems that, once one leaves the metrisable world, is a superior (or at least equally good) space to work with than for analysis, as it avoids the Nedoma pathology. (For instance, from Plancherel’s theorem, we see that if is the Haar probability measure on , then (thus, every -measurable set is equivalent modulo -null sets to a -measurable set), so there is no damage to Plancherel caused by passing to the Baire -algebra.)

Passing to the Baire -algebra fixes the most severe problems with an uncountable Moore-Schmidt theorem, but one is still faced with an issue of having to potentially take an uncountable union of null sets. To avoid this sort of problem, we pass to the framework of *abstract measure theory*, in which we remove explicit mention of “points” and can easily delete all null sets at a very early stage of the formalism. In this setup, the category of concrete measurable spaces is replaced with the larger category of *abstract measurable spaces*, which we formally define as the opposite category of the category of -algebras (with Boolean algebra homomorphisms). Thus, we define an *abstract measurable space* to be an object of the form , where is an (abstract) -algebra and is a formal placeholder symbol that signifies use of the opposite category, and an *abstract measurable map* is an object of the form , where is a Boolean algebra homomorphism and is again used as a formal placeholder; we call the *pullback map* associated to . [UPDATE: It turns out that this definition of a measurable map led to technical issues. In a forthcoming revision of the paper we also impose the requirement that the abstract measurable map be -complete (i.e., it respects countable joins).] The composition of two abstract measurable maps , is defined by the formula , or equivalently .

Every concrete measurable space can be identified with an abstract counterpart , and similarly every concrete measurable map can be identified with an abstract counterpart , where is the pullback map . Thus the category of concrete measurable spaces can be viewed as a subcategory of the category of abstract measurable spaces. The advantage of working in the abstract setting is that it gives us access to more spaces that could not be directly defined in the concrete setting. Most importantly for us, we have a new abstract space, the *opposite measure algebra* of , defined as where is the ideal of null sets in . Informally, is the space with all the null sets removed; there is a canonical abstract embedding map , which allows one to convert any concrete measurable map into an abstract one . One can then define the notion of an abstract action, abstract cocycle, and abstract coboundary by replacing every occurrence of the category of concrete measurable spaces with their abstract counterparts, and replacing with the opposite measure algebra ; see the paper for details. Our main theorem is then

Theorem 2 (Uncountable Moore-Schmidt theorem) Let be a discrete group acting abstractly on a $\sigma$-finite measure space . Let be a compact Hausdorff abelian group. Then a -valued abstract measurable cocycle is an abstract coboundary if and only if for each character , the -valued cocycles are abstract coboundaries.

With the abstract formalism, the proof of the uncountable Moore-Schmidt theorem is almost identical to the countable one (in fact we were able to make some simplifications, such as avoiding the use of the ergodic decomposition). A key tool is what we call a “conditional Pontryagin duality” theorem, which asserts that if one has an abstract measurable map for each obeying the identity for all , then there is an abstract measurable map such that for all . This is derived from the usual Pontryagin duality and some other tools, most notably the completeness of the -algebra of , and the Sikorski extension theorem.

We feel that it is natural to stay within the abstract measure theory formalism whenever dealing with uncountable situations. However, it is still an interesting question as to when one can guarantee that the abstract objects constructed in this formalism are representable by concrete analogues. The basic questions in this regard are:

- (i) Suppose one has an abstract measurable map into a concrete measurable space. Does there exist a representation of by a concrete measurable map ? Is it unique up to almost everywhere equivalence?
- (ii) Suppose one has a concrete cocycle that is an abstract coboundary. When can it be represented by a concrete coboundary?

For (i) the answer is somewhat interesting (as I learned after posing this MathOverflow question):

- If does not separate points, or is not compact metrisable or Polish, there can be counterexamples to uniqueness. If is not compact or Polish, there can be counterexamples to existence.
- If is a compact metric space or a Polish space, then one always has existence and uniqueness.
- If is a compact Hausdorff abelian group, one always has existence.
- If is a complete measure space, then one always has existence (from a theorem of Maharam).
- If is the unit interval with the Borel -algebra and Lebesgue measure, then one has existence for all compact Hausdorff assuming the continuum hypothesis (from a theorem of von Neumann) but existence can fail under other extensions of ZFC (from a theorem of Shelah, using the method of forcing).
- For more general , existence for all compact Hausdorff is equivalent to the existence of a lifting from the -algebra to (or, in the language of abstract measurable spaces, the existence of an abstract retraction from to ).
- It is a long-standing open question (posed for instance by Fremlin) whether it is relatively consistent with ZFC that existence holds whenever is compact Hausdorff.

Our understanding of (ii) is much less complete:

- If is metrisable, the answer is “always” (which among other things establishes the countable Moore-Schmidt theorem as a corollary of the uncountable one).
- If is at most countable and is a complete measure space, then the answer is again “always”.

In view of the answers to (i), I would not be surprised if the full answer to (ii) was also sensitive to axioms of set theory. However, such set theoretic issues seem to be almost completely avoided if one sticks with the abstract formalism throughout; they only arise when trying to pass back and forth between the abstract and concrete categories.

The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

$$\sum_{n \leq x} \mu(n) = o(x) \qquad (1)$$

of the prime number theorem (where we take the limit $x \to \infty$). One can interpret this estimate probabilistically as

$$\mathbb{E} \mu(\mathbf{n}) = o(1) \qquad (2)$$

where $\mathbf{n}$ is a random variable drawn uniformly from the natural numbers up to $x$, and $\mathbb{E}$ denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

$$\sum_{n \leq x} \mu(n) \mu(n+1).$$

After dividing by $x$, this is essentially

$$\mathbb{E} \mu(\mathbf{n}) \mu(\mathbf{n}+1).$$

With probabilistic intuition, one may expect the random variables $\mu(\mathbf{n}), \mu(\mathbf{n}+1)$ to be approximately independent (there is no obvious relationship between the number of prime factors of $\mathbf{n}$, and of $\mathbf{n}+1$), and so the above average would be expected to be approximately equal to

$$(\mathbb{E} \mu(\mathbf{n})) (\mathbb{E} \mu(\mathbf{n}+1))$$

which by (2) is equal to $o(1)$. Thus we are led to the prediction

$$\sum_{n \leq x} \mu(n) \mu(n+1) = o(x). \qquad (3)$$

The asymptotic (3) is widely believed (it is a special case of the *Chowla conjecture*, which we will discuss in later notes; while there has been recent progress towards establishing it rigorously, it remains open for now).

How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “$L^2$” measure of independence known as covariance, and the “$L \log L$” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of $L^2$ type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of $L \log L$ methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as $L^2$ methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

$$\sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = o(\log x) \qquad (4)$$

of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound $\sum_{n \leq x} \frac{\mu(n)}{n} = o(\log x)$, but the latter bound is much easier to establish than the former).
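As a quick numerical sanity check of the heuristic behind (3), here is a minimal Python sketch. It assumes that the arithmetic function in (1)-(3) is the Möbius function $\mu$ and that the shift is $1$; the helper `mobius_sieve` and the cutoff `x` are hypothetical choices made for illustration and are not taken from these notes. It only reports empirical averages at a single finite scale, which of course says nothing rigorous about the limit.

```python
def mobius_sieve(limit):
    """Return a list mu with mu[n] equal to the Moebius function of n for 1 <= n <= limit."""
    mu = [1] * (limit + 1)
    mu[0] = 0  # index 0 is unused
    is_composite = [False] * (limit + 1)
    for p in range(2, limit + 1):
        if is_composite[p]:
            continue
        # p is prime: flip the sign of mu on every multiple of p
        for m in range(p, limit + 1, p):
            if m > p:
                is_composite[m] = True
            mu[m] = -mu[m]
        # mu vanishes on every multiple of p^2
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu


x = 100_000  # illustrative cutoff, chosen arbitrarily
mu = mobius_sieve(x + 1)

# Empirical version of the expectation in (2)
mean_mu = sum(mu[n] for n in range(1, x + 1)) / x
# Normalised version of the sum appearing in (3)
corr = sum(mu[n] * mu[n + 1] for n in range(1, x + 1)) / x

print(f"x = {x}")
print(f"average of mu(n):          {mean_mu:+.5f}")
print(f"average of mu(n) mu(n+1):  {corr:+.5f}")
```

Both averages should come out small compared to $1$, which is consistent with (but much weaker than) the predictions (2) and (3).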

As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them *on average*, rather than uniformly. For instance, given two random variables and of number-theoretic origin (such as the random variables and mentioned previously), it can often be extremely difficult to determine the extent to which behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if are a large collection of “independent” random variables, and is a further random variable that is “not too large” in some sense, then must necessarily be nearly independent (or conditionally independent) of many of the , even if one cannot pinpoint precisely which of the the variable is independent of. In the case of the second moment method, this allows us to compute correlations such as for “most” . The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to make non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities for “most” thanks to tools such as the Pinsker inequality.

** — 1. Second moment methods — **

In this section we discuss probabilistic techniques of an “$L^2$” nature. We fix a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ to model all of our random variables; thus for instance we shall model a complex random variable $\mathbf{X}$ in these notes by a measurable function $\mathbf{X}: \Omega \to \mathbb{C}$. (Strictly speaking, there is a subtle distinction one can maintain between a random variable and its various measure-theoretic models, which becomes relevant if one later decides to modify the probability space $\Omega$, but this distinction will not be so important in these notes and so we shall ignore it. See this previous set of notes for more discussion.)

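We will focus here on the space $L^2 = L^2(\Omega, \mathcal{F}, \mathbb{P})$ of complex random variables $\mathbf{X}$ (that is to say, measurable maps $\mathbf{X}: \Omega \to \mathbb{C}$) whose *second moment*

$$\mathbb{E}(|\mathbf{X}|^2)$$

of $\mathbf{X}$ is finite. In many number-theoretic applications the finiteness of the second moment will be automatic because $\mathbf{X}$ will only take finitely many values. As is well known, the space $L^2$ has the structure of a complex Hilbert space, with inner product

$$\langle \mathbf{X}, \mathbf{Y} \rangle_{L^2} := \mathbb{E}(\mathbf{X} \overline{\mathbf{Y}})$$

and norm

$$\|\mathbf{X}\|_{L^2} := (\mathbb{E} |\mathbf{X}|^2)^{1/2}$$

for $\mathbf{X}, \mathbf{Y} \in L^2$. By slight abuse of notation, the complex numbers $\mathbb{C}$ can be viewed as a subset of $L^2$, by viewing any given complex number $z$ as a constant (deterministic) random variable. Then $\mathbb{C}$ is a one-dimensional subspace of $L^2$, spanned by the unit vector $1 \in L^2$. Given a random variable $\mathbf{X}$ in $L^2$, the projection of $\mathbf{X}$ to $\mathbb{C}$ is then the *mean*

$$\mathbb{E} \mathbf{X} = \langle \mathbf{X}, 1 \rangle_{L^2},$$

and we obtain an orthogonal splitting $\mathbf{X} = (\mathbb{E} \mathbf{X}) + (\mathbf{X} - \mathbb{E} \mathbf{X})$ of any $\mathbf{X} \in L^2$ into its mean $\mathbb{E} \mathbf{X}$ and its mean zero part $\mathbf{X} - \mathbb{E} \mathbf{X}$. By Pythagoras’ theorem, we then have

$$\mathbb{E}(|\mathbf{X}|^2) = \mathbb{E}(|\mathbf{X} - \mathbb{E} \mathbf{X}|^2) + |\mathbb{E} \mathbf{X}|^2. \qquad (5)$$

The first quantity on the right-hand side is the square of the distance from $\mathbf{X}$ to $\mathbb{C}$, and this non-negative quantity is known as the variance

$$\mathrm{Var}(\mathbf{X}) := \mathbb{E}(|\mathbf{X} - \mathbb{E} \mathbf{X}|^2) = \mathbb{E}(|\mathbf{X}|^2) - |\mathbb{E} \mathbf{X}|^2.$$

The square root $\mathrm{Var}(\mathbf{X})^{1/2}$ of the variance is known as the standard deviation. The variance controls the distribution of the random variable through Chebyshev’s inequality

$$\mathbb{P}(|\mathbf{X} - \mathbb{E} \mathbf{X}| \geq \lambda) \leq \frac{\mathrm{Var}(\mathbf{X})}{\lambda^2} \qquad (6)$$

for any $\lambda > 0$, which is immediate from observing the inequality $1_{|\mathbf{X} - \mathbb{E} \mathbf{X}| \geq \lambda} \leq \frac{|\mathbf{X} - \mathbb{E} \mathbf{X}|^2}{\lambda^2}$ and then taking expectations of both sides. Roughly speaking, this inequality asserts that $\mathbf{X}$ typically deviates from its mean $\mathbb{E} \mathbf{X}$ by no more than a bounded multiple of the standard deviation $\mathrm{Var}(\mathbf{X})^{1/2}$.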

A slight generalisation of Chebyshev’s inequality that can be convenient to use is

$$\mathbb{P}(|\mathbf{X} - X_0| \geq \lambda) \leq \frac{\mathrm{Var}(\mathbf{X}) + |\mathbb{E} \mathbf{X} - X_0|^2}{\lambda^2} \qquad (7)$$

for any $\lambda > 0$ and any complex number $X_0$ (which typically will be a simplified approximation to the mean $\mathbb{E} \mathbf{X}$), which is proven similarly to (6) but noting (from (5)) that $\mathbb{E} |\mathbf{X} - X_0|^2 = \mathrm{Var}(\mathbf{X}) + |\mathbb{E} \mathbf{X} - X_0|^2$.

Informally, (6) is an assertion that a square-integrable random variable will concentrate around its mean if its variance is not too large. See these previous notes for more discussion of the concentration of measure phenomenon. One can often obtain stronger concentration of measure than what is provided by Chebyshev’s inequality if one is able to calculate higher moments than the second moment, such as the fourth moment $\mathbb{E} |\mathbf{X}|^4$ or exponential moments $\mathbb{E} e^{c \mathbf{X}}$, but we will not pursue this direction in this set of notes.
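To make the Pythagorean identity (5) and Chebyshev’s inequality (6) concrete, here is a small exact computation for a toy random variable, namely a fair six-sided die. The die and the thresholds tested are illustrative choices of mine rather than anything from these notes; exact rational arithmetic is used so that no rounding issues arise.

```python
from fractions import Fraction

# Toy model: X is uniform on {1, ..., 6}
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, len(values))

mean = sum(prob * v for v in values)
second_moment = sum(prob * v * v for v in values)
variance = sum(prob * (v - mean) ** 2 for v in values)

# Pythagoras: E|X|^2 = Var(X) + |E X|^2
assert second_moment == variance + mean ** 2


def tail_probability(lam):
    """Exact value of P(|X - E X| >= lam)."""
    return sum(prob for v in values if abs(v - mean) >= lam)


for lam in (1, 2, Fraction(5, 2)):
    lhs = tail_probability(lam)
    rhs = variance / lam ** 2
    print(f"lambda = {lam}: P(|X - EX| >= lambda) = {lhs} <= {rhs}")
    assert lhs <= rhs  # Chebyshev's inequality
```

For such a small example the Chebyshev bound is far from sharp; its value lies in requiring nothing beyond the second moment.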

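Clearly the variance is homogeneous of order two, thus

$$\mathrm{Var}(c \mathbf{X}) = |c|^2 \mathrm{Var}(\mathbf{X})$$

for any $\mathbf{X} \in L^2$ and $c \in \mathbb{C}$. In particular, the variance is not always additive: the claim $\mathrm{Var}(\mathbf{X} + \mathbf{Y}) = \mathrm{Var}(\mathbf{X}) + \mathrm{Var}(\mathbf{Y})$ fails in particular when $\mathbf{X} = \mathbf{Y}$ is not almost surely constant. However, there is an important substitute for this formula. Given two random variables $\mathbf{X}, \mathbf{Y} \in L^2$, the inner product of the corresponding mean zero parts $\mathbf{X} - \mathbb{E} \mathbf{X}$, $\mathbf{Y} - \mathbb{E} \mathbf{Y}$ is a complex number known as the covariance:

$$\mathrm{Cov}(\mathbf{X}, \mathbf{Y}) := \langle \mathbf{X} - \mathbb{E} \mathbf{X}, \mathbf{Y} - \mathbb{E} \mathbf{Y} \rangle_{L^2} = \mathbb{E}\left( (\mathbf{X} - \mathbb{E} \mathbf{X}) \overline{(\mathbf{Y} - \mathbb{E} \mathbf{Y})} \right).$$

As $\mathbf{X} - \mathbb{E} \mathbf{X}$, $\mathbf{Y} - \mathbb{E} \mathbf{Y}$ are orthogonal to $\mathbb{C}$, it is not difficult to obtain the alternate formula

$$\mathrm{Cov}(\mathbf{X}, \mathbf{Y}) = \mathbb{E}(\mathbf{X} \overline{\mathbf{Y}}) - (\mathbb{E} \mathbf{X}) \overline{(\mathbb{E} \mathbf{Y})}. \qquad (8)$$

The covariance is then a positive semi-definite inner product on $L^2$ (it basically arises from the Hilbert space structure of the space of mean zero variables), and $\mathrm{Cov}(\mathbf{X}, \mathbf{X}) = \mathrm{Var}(\mathbf{X})$. From the Cauchy-Schwarz inequality we have

$$|\mathrm{Cov}(\mathbf{X}, \mathbf{Y})| \leq \mathrm{Var}(\mathbf{X})^{1/2} \mathrm{Var}(\mathbf{Y})^{1/2}.$$

If $\mathbf{X}, \mathbf{Y}$ have non-zero variance (that is, they are not almost surely constant), then the ratio

$$\frac{\mathrm{Cov}(\mathbf{X}, \mathbf{Y})}{\mathrm{Var}(\mathbf{X})^{1/2} \mathrm{Var}(\mathbf{Y})^{1/2}}$$

is then known as the correlation between $\mathbf{X}$ and $\mathbf{Y}$, and is a complex number of magnitude at most $1$; for real-valued $\mathbf{X}, \mathbf{Y}$ that are not almost surely constant, the correlation is instead a real number between $-1$ and $1$. At one extreme, a correlation of magnitude $1$ occurs if and only if $\mathbf{X} - \mathbb{E} \mathbf{X}$ is a scalar multiple of $\mathbf{Y} - \mathbb{E} \mathbf{Y}$. At the other extreme, a correlation of zero is an indication (though not a guarantee) of independence. Recall that two random variables $\mathbf{X}, \mathbf{Y}$ are *independent* if one has

$$\mathbb{P}(\mathbf{X} \in E, \mathbf{Y} \in F) = \mathbb{P}(\mathbf{X} \in E) \mathbb{P}(\mathbf{Y} \in F)$$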

for all (Borel) measurable . In particular, setting , for and integrating using Fubini’s theorem, we conclude that

similarly with replaced by , and similarly for . In particular we have

and thus from (8) we see that independent random variables have zero covariance (and zero correlation, when they are not almost surely constant). On the other hand, the converse fails:

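Exercise 1 Provide an example of two random variables which are not independent, but which have zero correlation or covariance with each other. (There are many ways to produce some examples. One comes from exploiting various systems of orthogonal functions, such as sines and cosines. Another comes from working with random variables taking only a small number of values, such as .)

Expanding the variance of a sum in terms of covariances, we obtain the identity

$$\mathrm{Var}(\mathbf{X} + \mathbf{Y}) = \mathrm{Var}(\mathbf{X}) + \mathrm{Var}(\mathbf{Y}) + 2 \mathrm{Re}\, \mathrm{Cov}(\mathbf{X}, \mathbf{Y})$$

and more generally

$$\mathrm{Var}(\mathbf{X}_1 + \dots + \mathbf{X}_n) = \sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i) + 2 \sum_{1 \leq i < j \leq n} \mathrm{Re}\, \mathrm{Cov}(\mathbf{X}_i, \mathbf{X}_j) \qquad (10)$$

for any finite collection of random variables $\mathbf{X}_1, \dots, \mathbf{X}_n \in L^2$. These identities combine well with Chebyshev-type inequalities such as (6), (7), and this leads to a very common instance of the second moment method in action. For instance, we can use it to understand the distribution of the number of prime factors of a random number that fall within a given set . Given any set $A$ of natural numbers, define the *logarithmic size* $\|A\|$ to be the quantity

$$\|A\| := \sum_{n \in A} \frac{1}{n}.$$

Thus for instance Euler’s theorem asserts that the primes have infinite logarithmic size.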

Lemma 2 (Turán-Kubilius inequality, special case) Let be an interval of length at least , and let be an integer drawn uniformly at random from this interval, thus for all . Let be a finite collection of primes, all of which have size at most . Then the random variable has mean

and variance

In particular,

and from (7) we have

for any .

*Proof:* For any natural number , we have

We now write . From (11) we see that each indicator random variable , has mean and variance ; similarly, for any two distinct , we see from (11), (8) the indicators , have covariance

and the claim now follows from (10).

The exponents of in the error terms here are not optimal; but in practice, we apply this inequality when is much larger than any given power of , so factors such as will be negligible. Informally speaking, the above lemma asserts that a typical number in a large interval will have roughly prime factors in a given finite set of primes, as long as the logarithmic size is large.
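The informal content of the lemma is easy to test numerically. The sketch below makes illustrative choices that are mine rather than the notes’: the interval is taken to be $[1,x]$ with $x = 10^5$, the set of primes consists of all primes up to $50$, and the helper names are hypothetical. It compares the empirical mean and variance of the number of prime factors from this set with its logarithmic size (the sum of the reciprocals of its elements), and checks the Chebyshev-type bound (7) with the logarithmic size playing the role of the approximate mean.

```python
def primes_up_to(y):
    """Sieve of Eratosthenes: list of the primes p <= y (assumes y >= 2)."""
    sieve = [True] * (y + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(y ** 0.5) + 1):
        if sieve[p]:
            for m in range(p * p, y + 1, p):
                sieve[m] = False
    return [p for p in range(2, y + 1) if sieve[p]]


def prime_factor_counts(x, primes):
    """counts[n] = number of p in primes dividing n, for 0 <= n <= x (index 0 unused)."""
    counts = [0] * (x + 1)
    for p in primes:
        for m in range(p, x + 1, p):
            counts[m] += 1
    return counts


x = 100_000               # n is drawn uniformly from {1, ..., x}
P = primes_up_to(50)      # finite set of primes, all much smaller than x
log_size = sum(1 / p for p in P)

counts = prime_factor_counts(x, P)[1:]
mean = sum(counts) / x
var = sum((c - mean) ** 2 for c in counts) / x

# Chebyshev-type bound (7), with log_size as the approximation to the mean
lam = 3
tail = sum(1 for c in counts if abs(c - log_size) >= lam) / x
bound = (var + (mean - log_size) ** 2) / lam ** 2

print(f"logarithmic size of P: {log_size:.4f}")
print(f"empirical mean:        {mean:.4f}")
print(f"empirical variance:    {var:.4f}")
print(f"tail probability {tail:.4f} <= bound {bound:.4f}")
```

In this small example the variance comes out somewhat below the logarithmic size, since each indicator of divisibility by $p$ has variance close to $\frac{1}{p}(1-\frac{1}{p})$ rather than $\frac{1}{p}$; this is consistent with the lemma being used as an upper bound.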

If we apply the above lemma to for some large , and equal to the primes up to (say) , we have , and hence

Since , we recover the main result

of Section 5 of Notes 1 (indeed this is essentially the same argument as in that section, dressed up in probabilistic language). In particular, we recover the Hardy-Ramanujan law that a proportion of the natural numbers in have prime factors.

Exercise 3 (Turán-Kubilius inequality, general case) Let be an additive function (which means that whenever are coprime). Show that where

(Hint: one may first want to work with the special case when vanishes whenever so that the second moment method can be profitably applied, and then figure out how to address the contributions of prime powers larger than .)

Exercise 4 (Turán-Kubilius inequality, logarithmic version) Let with , and let be a collection of primes of size less than with . Show that

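Exercise 5 (Paley-Zygmund inequality) Let $\mathbf{X} \in L^2$ be non-negative with positive mean. Show that

$$\mathbb{P}(\mathbf{X} > \lambda \mathbb{E} \mathbf{X}) \geq (1-\lambda)^2 \frac{(\mathbb{E} \mathbf{X})^2}{\mathbb{E}(\mathbf{X}^2)} \quad \text{for all } 0 \leq \lambda \leq 1.$$

This inequality can sometimes give slightly sharper results than the Chebyshev inequality when using the second moment method.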

Now we give a useful lemma that quantifies a heuristic mentioned in the introduction, namely that if several random variables do not correlate with each other, then it is not possible for any further random variable to correlate with many of them simultaneously. We first state an abstract Hilbert space version.

Lemma 6 (Bessel type inequality, Hilbert space version) If are elements of a Hilbert space , and are positive reals, then

*Proof:* We use the duality method. Namely, we can write the left-hand side of (13) as

for some complex numbers with (just take to be normalised by the left-hand side of (14), or zero if that left-hand side vanishes). By Cauchy-Schwarz, it then suffices to establish the dual inequality

The left-hand side can be written as

Using the arithmetic mean-geometric mean inequality and symmetry, this may be bounded by

Since , the claim follows.

Corollary 7 (Bessel type inequality, probabilistic version) If , and are positive reals, then

*Proof:* By subtracting the mean from each of we may assume that these random variables have mean zero. The claim now follows from Lemma 6.

To get a feel for this inequality, suppose for sake of discussion that and all have unit variance and , but that the are pairwise uncorrelated. Then the right-hand side is equal to , and the left-hand side is the sum of squares of the correlations between and each of the . Any individual correlation is then still permitted to be as large as , but it is not possible for multiple correlations to be this large simultaneously. This is geometrically intuitive if one views the random variables as vectors in a Hilbert space (and correlation as a rough proxy for the angle between such vectors). This lemma also shares many commonalities with the large sieve inequality, discussed in this set of notes.
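Here is a small numerical sketch of the Hilbert space inequality, under an assumption about its form: I take it to say (in squared form) that $\sum_i \nu_i |\langle v, w_i \rangle|^2 \leq \|v\|^2 \sup_i \sum_j \nu_j |\langle w_i, w_j \rangle|$, which is what the duality argument in the proof of Lemma 6 yields. The function names, the dimension, and the random test data are all hypothetical choices for illustration.

```python
import random


def inner(u, v):
    """Hermitian inner product <u, v>, linear in u and conjugate-linear in v."""
    return sum(a * b.conjugate() for a, b in zip(u, v))


def bessel_sides(v, ws, nus):
    """Return (lhs, rhs) of the assumed squared form of the Bessel type inequality."""
    lhs = sum(nu * abs(inner(v, w)) ** 2 for nu, w in zip(nus, ws))
    # sup over i of sum over j of nu_j |<w_i, w_j>|
    schur = max(
        sum(nu_j * abs(inner(w_i, w_j)) for nu_j, w_j in zip(nus, ws))
        for w_i in ws
    )
    rhs = inner(v, v).real * schur
    return lhs, rhs


def random_vector(dim, rng):
    """A random vector in C^dim with Gaussian real and imaginary parts."""
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]


rng = random.Random(0)
dim, n = 5, 8
v = random_vector(dim, rng)
ws = [random_vector(dim, rng) for _ in range(n)]
nus = [rng.uniform(0.1, 2.0) for _ in range(n)]

lhs, rhs = bessel_sides(v, ws, nus)
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}, inequality holds: {lhs <= rhs}")
```

When the $w_i$ are orthonormal and all the weights equal $1$, the quantity inside the supremum is exactly $1$ and the inequality reduces to the classical Bessel inequality.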

One basic number-theoretic application of this inequality is the following sampling inequality of Elliott, that lets one approximate a sum of an arithmetic function by its values on multiples of primes :

Exercise 8 (Elliott’s inequality) Let be an interval of length at least . Show that for any function , one has

(Hint: apply Corollary 7 with , , and , where is the uniform variable from Lemma 2.) Conclude in particular that for every , one has for all primes outside of a set of exceptional primes of logarithmic size .

Informally, the point of this inequality is that an arbitrary arithmetic function may exhibit correlation with the indicator function of the multiples of for some primes , but cannot exhibit significant correlation with all of these indicators simultaneously, because these indicators are not very correlated to each other. We note however that this inequality only gains a tiny bit over trivial bounds, because the set of primes up to only has logarithmic size by Mertens’ theorems; thus, any asymptotics that are obtained using this inequality will typically have error terms that only improve upon the trivial bound by factors such as .

Exercise 9 (Elliott’s inequality, logarithmic form) Let with . Show that for any function , one has and thus for every , one has

for all primes outside of an exceptional set of primes of logarithmic size .

Exercise 10 Use Exercise 9 and a duality argument to provide an alternate proof of Exercise 4. (Hint: express the left-hand side of (12) as a correlation between and some suitably -normalised arithmetic function .)

As a quick application of Elliott’s inequality, let us establish a weak version of the prime number theorem:

Proposition 11 (Weak prime number theorem) For any we have whenever are sufficiently large depending on .

This estimate is weaker than what one can obtain by existing methods, such as Exercise 56 of Notes 1. However in the next section we will refine this argument to recover the full prime number theorem.

*Proof:* Fix , and suppose that are sufficiently large. From Exercise 9 one has

for all primes outside of an exceptional set of logarithmic size . If we restrict attention to primes then one sees from the integral test that one can replace the sum by and only incur an additional error of . If we furthermore restrict to primes larger than , then the contribution of those that are divisible by is also . For not divisible by , one has . Putting all this together, we conclude that

for all primes outside of an exceptional set of logarithmic size . In particular, for large enough this statement is true for at least one such . The claim then follows.

As another application of Elliott’s inequality, we present a criterion for orthogonality between multiplicative functions and other sequences, first discovered by Kátai (with related results also introduced earlier by Daboussi and Delange), and rediscovered by Bourgain, Sarnak, and Ziegler:

Proposition 12 (Daboussi-Delange-Kátai-Bourgain-Sarnak-Ziegler criterion) Let be a multiplicative function with for all , and let be another bounded function. Suppose that one has as for any two distinct primes . Then one has

as .

*Proof:* Suppose the claim fails. Then there exists (which we can assume to be small) and arbitrarily large such that

By Exercise 8, this implies that

for all primes outside of an exceptional set of logarithmic size . Call such primes “good primes”. In particular, by the pigeonhole principle, and assuming large enough, there exists a dyadic range with which contains good primes.

Fix a good prime in . From (15) we have

We can replace the range by with negligible error. We also have except when is a multiple of , but this latter case only contributes which is also negligible compared to the right-hand side. We conclude that

for every good prime. On the other hand, from Lemma 6 we have

where range over the good primes in . The left-hand side is then , and by hypothesis the right-hand side is for large enough. As and is small, this gives the desired contradiction.

Exercise 13 (Daboussi-Delange theorem) Let be irrational, and let be a multiplicative function with for all . Show that as . If instead is rational, show that there exists a multiplicative function with for which the statement (16) fails. (Hint: use Dirichlet characters and Plancherel’s theorem for finite abelian groups.)

** — 2. An elementary proof of the prime number theorem — **

Define the Mertens function

$$M(x) := \sum_{n \leq x} \mu(n).$$

As shown in Theorem 58 of Notes 1, the prime number theorem is equivalent to the bound

$$M(x) = o(x) \qquad (17)$$

as $x \to \infty$. We now give a recent proof of this theorem, due to Redmond McNamara (personal communication), that relies primarily on Elliott’s inequality and the Selberg symmetry formula; it is a relative of the standard elementary proof of this theorem due to Erdős and Selberg. In order to keep the exposition simple, we will not arrange the argument in a fashion that optimises the decay rate (in any event, there are other proofs of the prime number theorem that give significantly stronger bounds).
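For orientation, the size of the Mertens function can be tabulated directly. The following Python sketch uses the standard definition of the Mertens function as the summatory function of the Möbius function $\mu$; the helper names and the cutoffs are my own illustrative choices. It prints $M(x)$ and $M(x)/x$ at a few powers of ten, and of course a finite table is no substitute for the asymptotic bound (17).

```python
def mobius_sieve(limit):
    """Return a list mu with mu[n] equal to the Moebius function of n for 1 <= n <= limit."""
    mu = [1] * (limit + 1)
    mu[0] = 0  # index 0 is unused
    is_composite = [False] * (limit + 1)
    for p in range(2, limit + 1):
        if is_composite[p]:
            continue
        # p is prime: flip the sign of mu on every multiple of p
        for m in range(p, limit + 1, p):
            if m > p:
                is_composite[m] = True
            mu[m] = -mu[m]
        # mu vanishes on every multiple of p^2
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0
    return mu


def mertens_table(limit):
    """Return a list M with M[x] = mu(1) + ... + mu(x) for 0 <= x <= limit."""
    mu = mobius_sieve(limit)
    M = [0] * (limit + 1)
    for n in range(1, limit + 1):
        M[n] = M[n - 1] + mu[n]
    return M


M = mertens_table(100_000)
for x in (10, 100, 1_000, 10_000, 100_000):
    print(f"x = {x:>6}   M(x) = {M[x]:>4}   M(x)/x = {M[x] / x:+.5f}")
```

The ratios $M(x)/x$ in such a table are already quite small, but making this rigorous for all large $x$ is exactly the content of the prime number theorem.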

Firstly we see that Elliott’s inequality gives the following weaker version of (17):

Lemma 14 (Oscillation for Mertens’ function) If and , then we have for all primes outside of an exceptional set of primes of logarithmic size .

*Proof:* We may assume as the claim is trivial otherwise. From Exercise 8 applied to and , we have

for all outside of an exceptional set of primes of logarithmic size . Since for not divisible by , the right-hand side can be written as

Since outside of an exceptional set of logarithmic size , the claim follows.

Informally, this lemma asserts that for most primes , which morally implies that for most primes . If we can then locate suitable primes with , this should then lead to , which should then yield the prime number theorem . The manipulations below are intended to make this argument rigorous.

It will be convenient to work with a logarithmically averaged version of this claim.

Corollary 15 (Logarithmically averaged oscillation) If and is sufficiently large depending on , then

*Proof:* For each , we have from the previous lemma that

for all outside of an exceptional set of logarithmic size . We then have

so it suffices by Markov’s inequality to show that

But by Fubini’s theorem, the left-hand side may be bounded by

and the claim follows.

Let be sufficiently small, and let be sufficiently large depending on . Call a prime *good* if the bound (18) holds and *bad* otherwise, thus all primes outside of an exceptional set of bad primes of logarithmic size are good. Now we observe that we can make small as long as we can make two good primes multiply to be close to a third:

*Proof:* By definition of good prime, we have the bounds

We rescale (20) by to conclude that

We can replace the integration range here from to with an error of if is large enough. Also, since , we have . Thus we have

Combining this with (19), (21) and the triangle inequality (writing as a linear combination of , , and ) we conclude that

This is an averaged version of the claim we need. To remove the averaging, we use the identity (see equation (63) of Notes 1) to conclude that

From the triangle inequality one has

and hence by Mertens’ theorem

From the Brun-Titchmarsh inequality (Corollary 61 of Notes 1) we have

and so from the previous estimate and Fubini’s theorem one has

and hence by (22) (using trivial bounds to handle the region outside of )

Since

we conclude (for large enough) that

and the claim follows.

To finish the proof of the prime number theorem, it thus suffices to locate, for sufficiently large, three good primes with . If we already had the prime number theorem, or even the weaker form that every interval of the form contained primes for large enough, then this would be quite easy: pick a large natural number (depending on , but independent of ), so that the primes up to have logarithmic size (so that only of them are bad, as measured by logarithmic size), and let be random numbers and drawn uniformly from (say) . From the prime number theorem, for each , the interval contains primes. In particular, contains primes, but the expected number of bad primes in this interval is . Thus by Markov’s inequality there would be at least a chance (say) of having at least one good prime in ; similarly there is a chance of having a good prime in , and a chance of having a good prime in . Thus (as an application of the probabilistic method), there exist (deterministic) good primes with the required properties.

Of course, using the prime number theorem here to prove the prime number theorem would be circular. However, we can still locate a good triple of primes using the Selberg symmetry formula

as , where is the second von Mangoldt function

see Proposition 60 of Notes 1. We can strip away the contribution of the primes:

Exercise 17 Show that as .

In particular, on evaluating this at and subtracting, we have

whenever is sufficiently large depending on . In particular, for any such , one either has

(or both). Informally, the Selberg symmetry formula shows that the interval contains either a lot of primes, or a lot of semiprimes. The factor of is slightly annoying, so we now remove it. Consider the contribution of those primes to (25) with . This is bounded by

which we can bound crudely using the Chebyshev bound by

which by Mertens’ theorem is . Thus the contribution of this case can be safely removed from (25). Similarly for those cases when . For the remaining cases we bound . We conclude that for any sufficiently large , either (24) or

In order to find primes with close to , it would be very convenient if we could find a for which (24) and (26) *both* hold. We can’t quite do this directly, but due to the “connected” nature of the set of scales , we can do the next best thing:

Proposition 18 Suppose is sufficiently large depending on . Then there exists with such that

*Proof:* We know that every in obeys at least one of (27), (28). Our task is to produce an adjacent pair of , one of which obeys (27) and the other obeys (28). Suppose for contradiction that no such pair exists. Then whenever fails to obey (27), any adjacent must also fail to do so, and similarly for (28). Thus either (27) will fail to hold for all , or (28) will fail to hold for all such . If (27) fails for all , then on summing we have

which contradicts Mertens’ theorem if is large enough because the left-hand side is . Similarly, if (28) fails for all , then

and again Mertens’ theorem can be used to lower bound the left-hand side by (in fact one can even gain an additional factor of if one works things through carefully) and obtain a contradiction.

The above proposition does indeed provide a triple of primes with . If is sufficiently large depending on and less than (say) , so that , this would give us what we need as long as one of the triples consisted only of good primes. The only way this can fail is if either

for some , or if

for some . In the first case, we can sum to conclude that

and in the second case we have

Since the total set of bad primes up to has logarithmic size , we conclude from the pigeonhole principle (and the divergence of the harmonic series ) that for any depending only on , and any large enough, there exists such that neither of (29) and (30) hold. Indeed the set of obeying (29) has logarithmic size , and similarly for (30). Choosing a that avoids both of these scenarios, we then find a good and good with , so that , and then by Proposition 16 we conclude that for all sufficiently large . Sending to zero, we obtain the prime number theorem.

** — 3. Entropy methods — **

In the previous section we explored the consequences of the second moment method, which applies to square-integrable random variables taking values in the real or complex numbers. Now we explore entropy methods, which now apply to random variables which take a finite number of values (equipped with the discrete sigma-algebra), but whose range need not be numerical in nature. (One could extend entropy methods to slightly larger classes of random variables, such as ones that attain a countable number of values, but for our applications finitely-valued random variables will suffice.)

The fundamental notion here is that of the Shannon entropy of a random variable. If $\mathbf{X}$ takes values in a finite set $R$, its Shannon entropy $\mathbf{H}(\mathbf{X})$ (or *entropy* for short) is defined by the formula

$$ \mathbf{H}(\mathbf{X}) := \sum_{x \in R} \mathbf{P}(\mathbf{X} = x) \log \frac{1}{\mathbf{P}(\mathbf{X} = x)} $$

where $x$ ranges over all the possible values of $\mathbf{X}$, and we adopt the convention $0 \log \frac{1}{0} = 0$, so that values that are almost surely not attained by $\mathbf{X}$ do not influence the entropy. We choose here to use the natural logarithm to normalise our entropy (in which case a unit of entropy is known as a “nat”); in the information theory literature it is also common to use the base two logarithm to measure entropy (in which case a unit of entropy is known as a “bit”, which is equal to $\log 2$ nats). However, the precise choice of normalisation will not be important in our discussion.

It is clear that if two random variables have the same probability distribution, then they have the same entropy. Also, the precise choice of range set is not terribly important: if $\mathbf{X}$ takes values in $R$, and $f: R \to R'$ is an injection, then it is clear that $\mathbf{X}$ and $f(\mathbf{X})$ have the same entropy:

$$ \mathbf{H}(f(\mathbf{X})) = \mathbf{H}(\mathbf{X}). $$

This is in sharp contrast to moment-based statistics such as the mean or variance, which can be radically changed by applying some injective transformation to the range values.
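As a small numerical illustration of the last two paragraphs, the following sketch computes the Shannon entropy (in nats) of an empirical distribution and compares it before and after applying an injection to the range. The sample `values`, the relabelling `relabel`, and the helper names are all hypothetical choices made for this example only.

```python
import math
from collections import Counter
from statistics import pvariance


def shannon_entropy(probs):
    """Shannon entropy in nats, using the convention 0 * log(1/0) = 0."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


def distribution(samples):
    """Empirical distribution of a list of samples, as a dict value -> probability."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in counts.items()}


# A toy sample, regarded as a random variable taking each listed value equally often.
values = [0, 0, 1, 2, 2, 2]

# Two injections of the range: one into a non-numerical set, one numerical.
relabel = {0: "a", 1: "b", 2: "c"}
relabelled = [relabel[v] for v in values]
stretched = [10 ** v for v in values]

h = shannon_entropy(distribution(values).values())
h_relabelled = shannon_entropy(distribution(relabelled).values())
h_stretched = shannon_entropy(distribution(stretched).values())

# The entropy is unchanged by the injections, while the variance is not.
var_original = pvariance(values)
var_stretched = pvariance(stretched)

# Converting from nats to bits only rescales by log 2.
h_bits = h / math.log(2)
```

Here `h`, `h_relabelled` and `h_stretched` agree, whereas `var_original` and `var_stretched` differ substantially; note also that the relabelled variable has no mean or variance at all, but still has a well-defined entropy.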

Informally, the entropy $\mathbf{H}(\mathbf{X})$ measures how “spread out” or “disordered” the distribution of $\mathbf{X}$ is, behaving like a logarithm of the size of the “essential support” of such a variable; from an information-theoretic viewpoint, it measures the amount of “information” one learns when one is told the value of $\mathbf{X}$. Here are some basic properties of Shannon entropy that help support this intuition:

Exercise 19 (Basic properties of Shannon entropy) Let $\mathbf{X}$ be a random variable taking values in a finite set $R$.

- (i) Show that $\mathbf{H}(\mathbf{X}) \geq 0$, with equality if and only if $\mathbf{X}$ is almost surely deterministic (that is to say, it is almost surely equal to a constant $x_0$).
- (ii) Show that
$$ \mathbf{H}(\mathbf{X}) \leq \log |R| \qquad (32) $$
with equality if and only if $\mathbf{X}$ is uniformly distributed on $R$. (Hint: use Jensen’s inequality and the convexity of the map $t \mapsto t \log t$ on $[0,1]$.)

- (iii) (Shannon-McMillan-Breiman theorem) Let be a natural number, and let be independent copies of . As , show that there is a subset of cardinality with the properties that
and

uniformly for all . (The proof of this theorem will require Stirling’s formula, which you may assume here as a black box; see also this previous blog post.) Informally, we thus see that a large tuple of independent samples of approximately behaves like a uniform distribution on values.

One can view Shannon entropy as a generalisation of the notion of cardinality of a finite set (or equivalently, cardinality of finite sets can be viewed as a special case of Shannon entropy); see this previous blog post for an elaboration of this point.

The concept of Shannon entropy becomes significantly more powerful when combined with that of conditioning. Recall that a random variable taking values in a range set can be modeled by a measurable map from a probability space to the range . If is an event in of positive probability, we can then *condition* to the event to form a new random variable on the conditioned probability space , where

is the restriction of the -algebra to ,

is the conditional probability measure on , and is the restriction of to . This random variable lives on a different probability space than itself, so it does not make sense to directly combine these variables (thus for instance one cannot form the sum even when both random variables are real or complex valued); however, one can still form the Shannon entropy of the conditioned random variable , which is given by the same formula

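Given another random variable $\mathbf{Y}$ taking values in another finite set $S$, we can then define the conditional Shannon entropy $\mathbf{H}(\mathbf{X}|\mathbf{Y})$ to be the expected entropy of the level sets $\mathbf{H}(\mathbf{X}|\mathbf{Y}=y)$, thus

$$ \mathbf{H}(\mathbf{X}|\mathbf{Y}) := \sum_{y \in S} \mathbf{P}(\mathbf{Y}=y) \mathbf{H}(\mathbf{X}|\mathbf{Y}=y) $$

with the convention that the summand here vanishes when $\mathbf{P}(\mathbf{Y}=y)=0$. From the law of total probability we have

$$ \mathbf{P}(\mathbf{X}=x) = \sum_{y \in S} \mathbf{P}(\mathbf{Y}=y) \mathbf{P}(\mathbf{X}=x|\mathbf{Y}=y) $$

for any $x \in R$, and hence by Jensen’s inequality (applied to the concave function $t \mapsto t \log \frac{1}{t}$)

$$ \mathbf{P}(\mathbf{X}=x) \log \frac{1}{\mathbf{P}(\mathbf{X}=x)} \geq \sum_{y \in S} \mathbf{P}(\mathbf{Y}=y) \mathbf{P}(\mathbf{X}=x|\mathbf{Y}=y) \log \frac{1}{\mathbf{P}(\mathbf{X}=x|\mathbf{Y}=y)} $$

for any $x \in R$; summing we obtain the Shannon entropy inequality

$$ \mathbf{H}(\mathbf{X}|\mathbf{Y}) \leq \mathbf{H}(\mathbf{X}). \qquad (33) $$

Informally, this inequality asserts that the new information content of $\mathbf{X}$ can be decreased, but not increased, if one is first told some additional information $\mathbf{Y}$.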

This inequality (33) can be rewritten in several ways:

Exercise 20 Let , be random variables taking values in finite sets respectively.

- (i) Establish the chain rule
where is the joint random variable . In particular, (33) can be expressed as a subadditivity formula

- (ii) If is a function of , in the sense that for some (deterministic) function , show that .
- (iii) Define the mutual information by the formula
Establish the inequalities

with the first inequality holding with equality if and only if are independent, and the latter inequalities holding if and only if is a function of (or vice versa).

From the above exercise we see that the mutual information is a measure of dependence between and , much as correlation or covariance was in the previous sections. There is however one key difference: whereas a zero correlation or covariance is a consequence but not a guarantee of independence, zero mutual information is *logically equivalent* to independence, and is thus a stronger property. To put it another way, zero correlation or covariance allows one to calculate the average in terms of individual averages of , but zero mutual information is stronger because it allows one to calculate the more general averages in terms of individual averages of , for arbitrary functions taking values into the complex numbers. This increased power of the mutual information statistic will allow us to estimate various averages of interest in analytic number theory in ways that do not seem amenable to second moment methods.
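The contrast between covariance and mutual information can be checked on a toy example. The sketch below assumes the standard identity $\mathbf{I}(\mathbf{X}:\mathbf{Y}) = \mathbf{H}(\mathbf{X}) + \mathbf{H}(\mathbf{Y}) - \mathbf{H}(\mathbf{X},\mathbf{Y})$ for the mutual information, and uses a hypothetical pair in which $\mathbf{X}$ is uniform on $\{-1,0,1\}$ and $\mathbf{Y} = \mathbf{X}^2$; all function names are illustrative.

```python
import math
from collections import defaultdict
from fractions import Fraction


def entropy(dist):
    """Shannon entropy in nats of a dict outcome -> probability."""
    return sum(float(p) * math.log(1.0 / float(p)) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal distribution of one coordinate of a joint distribution on pairs."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def mutual_information(joint):
    """I(X:Y) = H(X) + H(Y) - H(X,Y), in nats (standard identity, assumed here)."""
    return entropy(marginal(joint, 0)) + entropy(marginal(joint, 1)) - entropy(joint)


def covariance(joint):
    """Exact covariance E[XY] - E[X]E[Y] of a numerical pair."""
    ex = sum(p * x for (x, _), p in joint.items())
    ey = sum(p * y for (_, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey


# X uniform on {-1, 0, 1} and Y = X^2: uncorrelated, but far from independent.
dependent = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

# An independent pair with the same marginals, for comparison.
px, py = marginal(dependent, 0), marginal(dependent, 1)
independent = {(x, y): px[x] * py[y] for x in px for y in py}

cov_dependent = covariance(dependent)
mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

In this example the covariance vanishes exactly, yet the mutual information is strictly positive (it equals the entropy of $\mathbf{Y}$, since $\mathbf{Y}$ is a function of $\mathbf{X}$), while the independent pair has mutual information zero up to rounding error.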

The subadditivity formula can be conditioned to any event occurring with positive probability (replacing the random variables by their conditioned counterparts ), yielding the inequality

Applying this inequality to the level events of some auxiliary random variable taking values in another finite set , multiplying by , and summing, we conclude the inequality

In other words, the conditional mutual information

between and conditioning on is always non-negative:

One has conditional analogues of the above exercise:

Exercise 21 Let , , be random variables taking values in finite sets respectively.

- (i) Establish the conditional chain rule
In particular, (36) is equivalent to the inequality

- (ii) Show that equality holds in (36) if and only if are conditionally independent relative to , which means that
for any , , .

- (iii) Show that , with equality if and only if is almost surely a deterministic function of .
- (iv) Show the data processing inequality
for any functions , , and more generally that

- (v) If is an injective function, show that
However, if is not assumed to be injective, show by means of examples that there is no order relation between the left and right-hand side of (40) (in other words, show that either side may be greater than the other). Thus, increasing or decreasing the amount of information that is known may influence the mutual information between two remaining random variables in either direction.

- (vi) If is a function of , and also a function of (thus for some and ), and a further random variable is a function jointly of (thus for some ), establish the submodularity inequality

We now give a key motivating application of the Shannon entropy inequalities. Suppose one has a sequence of random variables, all taking values in a finite set , which are stationary in the sense that the tuples and have the same distribution for every . In particular we will have

and hence by (39)

If we write , we conclude from (34) that we have the concavity property

In particular we have for any , which on summing and telescoping series (noting that ) gives

and hence we have the entropy monotonicity

In particular, the limit exists. This quantity is known as the Kolmogorov-Sinai entropy of the stationary process ; it is an important statistic in the theory of dynamical systems, and roughly speaking measures the amount of entropy produced by this process as a function of a discrete time variable . We will not directly need the Kolmogorov-Sinai entropy in our notes, but a variant of the entropy monotonicity formula (41) will be important shortly.
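The monotonicity of the normalised block entropy can be observed numerically on a simple stationary process. The sketch below uses a hypothetical two-state Markov chain (with transition matrix `T` and stationary distribution `pi` chosen for illustration) and computes the entropy of the first $n$ variables by brute-force enumeration; it is only meant as a sanity check of the monotonicity discussed above, not as an efficient way to compute entropy rates.

```python
import math
from itertools import product


def block_entropy(n, pi, T):
    """Entropy in nats of (X_1, ..., X_n) for a Markov chain with initial law pi."""
    total = 0.0
    for word in product(range(len(pi)), repeat=n):
        # Probability of observing this exact word of length n.
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= T[a][b]
        if p > 0:
            total += p * math.log(1.0 / p)
    return total


# Hypothetical two-state chain; pi satisfies pi T = pi, so the process is stationary.
T = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.75, 0.25]

blocks = [block_entropy(n, pi, T) for n in range(1, 11)]
rates = [h / n for n, h in enumerate(blocks, start=1)]

# Entropy rate of the chain: the expected entropy of the next step given the current state.
step_entropy = sum(
    pi[a] * sum(t * math.log(1.0 / t) for t in T[a] if t > 0) for a in range(len(pi))
)
```

For this chain the list `rates` is non-increasing, and the increments `blocks[n] - blocks[n-1]` are all equal to `step_entropy`, which is also the limiting value of `rates`.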

In our application we will be dealing with processes that are only asymptotically stationary rather than stationary. To control this we recall the notion of the total variation distance between two random variables taking values in the same finite space , defined by

There is an essentially equivalent notion of this distance which is also often in use:

Exercise 22 If two random variables take values in the same finite space , establish the inequalities

Shannon entropy is continuous in total variation distance as long as we keep the range finite. More quantitatively, we have

Lemma 23 If two random variables take values in the same finite space , then with the convention that the error term vanishes when .

*Proof:* Set . The claim is trivial when (since then have the same distribution) and when (from (32)), so let us assume , and our task is to show that

If we write , , and , then

By dividing into the cases and we see that

since , it thus suffices to show that

But from Jensen’s inequality (32) one has

since , the claim follows.

In the converse direction, if a random variable has entropy close to the maximum , then one can control the total variation:

Lemma 24 (Special case of Pinsker inequality) If takes values in a finite set , and is a uniformly distributed random variable on , then

Of course, we have , so we may also write the above inequality as

The optimal value of the implied constant here is known to equal , but we will not use this sharp version of the inequality here.

*Proof:* If we write and , and , then we can rewrite the claimed inequality as

Observe that the function is concave, and in fact for all . From this and Taylor expansion with remainder we may write

for some between and . Since is independent of , and , we thus have on summing in

By Cauchy-Schwarz we then have

Since and , the claim follows.
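One can test Lemma 24 numerically on random distributions. The sketch below makes two assumptions that should be checked against the definitions in the text: it measures total variation by the $\ell^1$ distance $\sum_r |\mathbf{P}(\mathbf{X}=r) - \mathbf{P}(\mathbf{U}=r)|$ (other sources insert a factor of $\frac{1}{2}$), and it uses the sharp constant from the classical Pinsker inequality, which in this normalisation reads $\sum_r |p_r - \frac{1}{|R|}| \leq \sqrt{2 (\log |R| - \mathbf{H}(\mathbf{X}))}$. The helper names are hypothetical.

```python
import math
import random


def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return sum(q * math.log(1.0 / q) for q in p if q > 0)


def l1_distance(p, q):
    """Sum of |p_r - q_r|; this is twice the 'sup over events' form of total variation."""
    return sum(abs(a - b) for a, b in zip(p, q))


def pinsker_sides(p):
    """Return (distance to uniform, sqrt(2 * entropy deficit)) for a distribution p."""
    n = len(p)
    uniform = [1.0 / n] * n
    deficit = max(0.0, math.log(n) - entropy(p))  # clamp tiny negative rounding errors
    return l1_distance(p, uniform), math.sqrt(2.0 * deficit)


def random_distribution(n, rng):
    """A random probability vector on n points (normalised uniform weights)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
trials = [pinsker_sides(random_distribution(rng.randint(2, 20), rng)) for _ in range(1000)]
all_hold = all(lhs <= rhs + 1e-9 for lhs, rhs in trials)

# A deterministic variable on four points is far from uniform and has maximal deficit.
point_mass = pinsker_sides([1.0, 0.0, 0.0, 0.0])
```

In every trial the distance to the uniform distribution is bounded by the square root of twice the entropy deficit, consistent with the lemma.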

The above lemma does not hold when the comparison variable is not assumed to be uniform; in particular, two non-uniform random variables can have precisely the same entropy but yet have different distributions, so that their total variation distance is positive. There is a more general variant, known as the Pinsker inequality, which we will not use in these notes:

Exercise 25 (Pinsker inequality) If take values in a finite set , define the Kullback-Leibler divergence of relative to by the formula (with the convention that the summand vanishes when vanishes).

- (i) Establish the Gibbs inequality .
- (ii) Establish the Pinsker inequality
In particular, vanishes if and only if have identical distribution. Show that this implies Lemma 24 as a special case.

- (iii) Give an example to show that the Kullback-Leibler divergence need not be symmetric, thus there exist such that .
- (iv) If are random variables taking values in finite sets , and are independent random variables taking values in respectively with each having the same distribution of , show that

In our applications we will need a relative version of Lemma 24:

Corollary 26 (Relative Pinsker inequality) If takes values in a finite set , takes values in a finite set , and is a uniformly distributed random variable on that is independent of , then

*Proof:* From direct calculation we have the identity

As is independent of , is uniformly distributed on . From Lemma 24 we conclude

Inserting this bound and using the Cauchy-Schwarz inequality, we obtain the claim.

Now we are ready to apply the above machinery to give a key inequality that is analogous to Elliott’s inequality. Inequalities of this type first appeared in one of my papers, introducing what I called the “entropy decrement argument”; the following arrangement of the inequality and proof is due to Redmond McNamara (personal communication).

Theorem 27 (Entropy decrement inequality) Let be a random variable taking values in a finite set of integers, which obeys the approximate stationarity for some . Let be a collection of distinct primes less than some threshold , and let be natural numbers that are also bounded by . Let be a function taking values in a finite set . For , let denote the -valued random variable

and let denote the -valued random variable

Also, let be a random variable drawn uniformly from , independently of . Then

The factor (arising from an invocation of the Chinese remainder theorem in the proof) unfortunately restricts the usefulness of this theorem to the regime in which all the primes involved are of “sub-logarithmic size”, but once one is in that regime, the second term on the right-hand side of (45) tends to be negligible in practice. Informally, this theorem asserts that for most small primes , the random variables and behave as if they are independent of each other.

*Proof:* We can assume , as the claim is trivial for (the all have zero entropy). For , we introduce the -valued random variable

The idea is to exploit some monotonicity properties of the quantity , in analogy with (41). By telescoping series we have

where we extend (44) to the case. From (38) we have

Now we lower bound the summand on the right-hand side. From multiple applications of the conditional chain rule (37) we have

We now use the approximate stationarity of to derive an approximate monotonicity property for . If , then from (39) we have

Write and

Note that is a deterministic function of and vice versa. Thus we can replace by in the above formula, and conclude that

The tuple takes values in a set of cardinality thanks to the Chebyshev bounds. Hence by two applications of Lemma 23, (43) we have

The first term on the right-hand side is . Worsening the error term slightly, we conclude that

and hence

for any . In particular

which by (47), (48) rearranges to

From (46) we conclude that

Meanwhile, from Corollary 26, (39), (38) we have

The probability distribution of is a function on , which by the Chinese remainder theorem we can identify with a cyclic group where . From (43) we see that the value of this distribution at adjacent values of this cyclic group varies by , hence the total variation distance between this random variable and the uniform distribution on is by Chebyshev bounds. By Lemma 23 we then have

and thus

The claim follows.

We now compare this result to Elliott’s inequality. If one tries to address precisely the same question that Elliott’s inequality does – namely, to try to compare a sum with sampled subsums – then the results are quantitatively much weaker:

Corollary 28 (Weak Elliott inequality) Let be an interval of length at least . Let be a function with for all , and let . Then one has for all primes outside of an exceptional set of primes of logarithmic size .

Comparing this with Exercise 8 we see that we cover a much smaller range of primes ; also the size of the exceptional set is slightly worse. This version of Elliott’s inequality is however still strong enough to recover a proof of the prime number theorem as in the previous section.

*Proof:* We can assume that is small, as the claim is trivial for comparable to . We can also assume that

since the claim is also trivial otherwise (just make all primes up to exceptional, then use Mertens’ theorem). As a consequence of this, any quantity involving in the denominator will end up being completely negligible in practice. We can also restrict attention to primes less than (say) , since the remaining primes between and have logarithmic size .

By rounding the real and imaginary parts of to the nearest multiple of , we may assume that takes values in some finite set of complex numbers of size with cardinality . Let be drawn uniformly at random from . Then (43) holds with , and from Theorem 27 with and (which makes the second term of the right-hand side of (45) negligible) we have

where are the primes up to , arranged in increasing order. By Markov’s inequality, we thus have

for outside of a set of primes of logarithmic size .

Let be as above. Now let be the function

that is to say picks out the unique component of the tuple in which is divisible by . This function is bounded by , and then by (42) we have

The left-hand side is equal to

which on switching the summations and using the large nature of can be rewritten as

Meanwhile, the right-hand side is equal to

which again by switching the summations becomes

The claim follows.

In the above argument we applied (42) with a very specific choice of function . The power of Theorem 27 lies in the ability to select many other such functions , leading to estimates that do not seem to be obtainable purely from the second moment method. In particular we have the following generalisation of the previous estimate:

Proposition 29 (Weak Elliott inequality for multiple correlations) Let be an interval of length at least . Let be a function with for all , and let . Let be integers. Then one has for all primes outside of an exceptional set of primes of logarithmic size .

*Proof:* We allow all implied constants to depend on . As before we can assume that is sufficiently small (depending on ), that takes values in a set of bounded complex numbers of cardinality , and that is large in the sense of (49), and restrict attention to primes up to . By shifting the and using the large nature of we can assume that the are all non-negative, taking values in for some . We now apply Theorem 27 with and conclude as before that

for outside of a set of primes of logarithmic size .

Let be as above. Let be the function

This function is still bounded by , so by (42) as before we have

The left-hand side is equal to

which on switching the summations and using the large nature of can be rewritten as

Meanwhile, the right-hand side is equal to

which again by switching the summations becomes

The claim follows.

There is a logarithmically averaged version of the above proposition:

Exercise 30 (Weak Elliott inequality for logarithmically averaged multiple correlations) Let with , let be a function bounded in magnitude by , let , and let be integers. Show that for all primes outside of an exceptional set of primes of logarithmic size .

When one specialises to multiplicative functions, this lets us dilate shifts in multiple correlations by primes:

Exercise 31 Let with , let be a multiplicative function bounded in magnitude by , let , and let be nonnegative integers. Show that for all primes outside of an exceptional set of primes of logarithmic size .

For instance, setting to be the Möbius function, , , and (say), we see that

for all primes outside of an exceptional set of primes of logarithmic size . In particular, for large enough, one can obtain bounds of the form

for various moderately large sets of primes . It turns out that these double sums on the right-hand side can be estimated by methods which we will cover in later series of notes. Among other things, this allows us to establish estimates such as

as , which to date have only been established using these entropy methods (in conjunction with the methods discussed in later notes). This is progress towards an open problem in analytic number theory known as *Chowla’s conjecture*, which we will also discuss in later notes.

Conjecture 1 (Collatz conjecture) One has for all .

Establishing the conjecture for all remains out of reach of current techniques (for instance, as discussed in the previous blog post, it is basically at least as difficult as Baker’s theorem, all known proofs of which are quite difficult). However, the situation is more promising if one is willing to settle for results which only hold for “most” in some sense. For instance, it is a result of Krasikov and Lagarias that

for all sufficiently large . In another direction, it was shown by Terras that for almost all (in the sense of natural density), one has . This was then improved by Allouche to , and extended later by Korec to cover all . In this paper we obtain the following further improvement (at the cost of weakening natural density to logarithmic density):

Theorem 2 Let be any function with . Then we have for almost all (in the sense of logarithmic density).

Thus for instance one has for almost all (in the sense of logarithmic density).

The difficulty here is that one usually only expects to establish “local-in-time” results that control the evolution for times that only get as large as a small multiple of ; the aforementioned results of Terras, Allouche, and Korec, for instance, are of this type. However, to get all the way down to one needs something more like an “(almost) global-in-time” result, where the evolution remains under control for so long that the orbit has nearly reached the bounded state .

However, as observed by Bourgain in the context of nonlinear Schrödinger equations, one can iterate “almost sure local wellposedness” type results (which give local control for almost all initial data from a given distribution) into “almost sure (almost) global wellposedness” type results if one is fortunate enough to draw one’s data from an *invariant measure* for the dynamics. To illustrate the idea, let us take Korec’s aforementioned result that if one picks at random an integer from a large interval , then in most cases, the orbit of will eventually move into the interval . Similarly, if one picks an integer at random from , then in most cases, the orbit of will eventually move into . It is then tempting to concatenate the two statements and conclude that for most in , the orbit will eventually move . Unfortunately, this argument does not quite work, because by the time the orbit from a randomly drawn reaches , the distribution of the final value is unlikely to be close to being uniformly distributed on , and in particular could potentially concentrate almost entirely in the exceptional set of that do not make it into . The point here is the uniform measure on is not transported by Collatz dynamics to anything resembling the uniform measure on .

So, one now needs to locate a measure which has better invariance properties under the Collatz dynamics. It turns out to be technically convenient to work with a standard acceleration of the Collatz map known as the *Syracuse map* , defined on the odd numbers by setting , where is the largest power of that divides . (The advantage of using the Syracuse map over the Collatz map is that it performs precisely one multiplication of at each iteration step, which makes the map better behaved when performing “-adic” analysis.)

When viewed -adically, we soon see that iterations of the Syracuse map become somewhat irregular. Most obviously, is never divisible by . A little less obviously, is twice as likely to equal mod as it is to equal mod . This is because for a randomly chosen odd , the number of times that divides can be seen to have a geometric distribution of mean – it equals any given value with probability . Such a geometric random variable is twice as likely to be odd as to be even, which is what gives the above irregularity. There are similar irregularities modulo higher powers of . For instance, one can compute that for large random odd , will take the residue classes with probabilities

respectively. More generally, for any , will be distributed according to the law of a random variable on that we call a *Syracuse random variable*, and can be described explicitly as

where are iid copies of a geometric random variable of mean .
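The irregularity modulo $3$ described above is easy to observe experimentally. The sketch below implements the Syracuse map as defined in the text (divide $3N+1$ by the largest power of $2$ that divides it) and tabulates residues mod $3$ over a range of odd inputs; the cutoff `200001` and the helper names are arbitrary choices for illustration.

```python
from collections import Counter


def two_adic_valuation(m):
    """Number of times 2 divides the positive integer m."""
    a = 0
    while m % 2 == 0:
        m //= 2
        a += 1
    return a


def syracuse(n):
    """Syracuse map on odd positive integers: remove all factors of 2 from 3n + 1."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("the Syracuse map is defined on odd positive integers")
    m = 3 * n + 1
    return m >> two_adic_valuation(m)


odd_inputs = range(1, 200001, 2)
total = len(odd_inputs)

# Distribution of Syr(N) mod 3, and of the number of halvings a, over odd N.
residues = Counter(syracuse(n) % 3 for n in odd_inputs)
valuations = Counter(two_adic_valuation(3 * n + 1) for n in odd_inputs)

fraction_two_mod_three = residues[2] / total
fraction_single_halving = valuations[1] / total
```

Over this range, the residue class $0$ mod $3$ never occurs, roughly two thirds of the outputs are $2$ mod $3$, and the number of halvings equals $1$ about half of the time, in line with the geometric heuristic.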

In view of this, any proposed “invariant” (or approximately invariant) measure (or family of measures) for the Syracuse dynamics should take this -adic irregularity of distribution into account. It turns out that one can use the Syracuse random variables to construct such a measure, but only if these random variables stabilise in the limit in a certain total variation sense. More precisely, in the paper we establish the estimate

for any and any . This type of stabilisation is plausible from entropy heuristics – the tuple of geometric random variables that generates has Shannon entropy , which is significantly larger than the total entropy of the uniform distribution on , so we expect a lot of “mixing” and “collision” to occur when converting the tuple to ; these heuristics can be supported by numerics (which I was able to work out up to about before running into memory and CPU issues), but it turns out to be surprisingly delicate to make this precise.

A first hint of how to proceed comes from the elementary number theory observation (easily proven by induction) that the rational numbers

are all distinct as vary over tuples in . Unfortunately, the process of reducing mod creates a lot of collisions (as must happen from the pigeonhole principle); however, by a simple “Lefschetz principle” type argument one can at least show that the reductions

are mostly distinct for “typical” (as drawn using the geometric distribution) as long as is a bit smaller than (basically because the rational number appearing in (3) then typically takes a form like with an integer between and ). This analysis of the component (3) of (1) is already enough to get quite a bit of spreading on (roughly speaking, when the argument is optimised, it shows that this random variable cannot concentrate in any subset of of density less than for some large absolute constant ). To get from this to a stabilisation property (2) we have to exploit the mixing effects of the remaining portion of (1) that does not come from (3). After some standard Fourier-analytic manipulations, matters then boil down to obtaining non-trivial decay of the characteristic function of , and more precisely in showing that

for any and any that is not divisible by .

If the random variable (1) was the sum of independent terms, one could express this characteristic function as something like a Riesz product, which would be straightforward to estimate well. Unfortunately, the terms in (1) are loosely coupled together, and so the characteristic function does not immediately factor into a Riesz product. However, if one groups adjacent terms in (1) together, one can rewrite it (assuming is even for sake of discussion) as

where . The point here is that after conditioning on the to be fixed, the random variables remain independent (though the distribution of each depends on the value that we conditioned to), and so the above expression is a *conditional* sum of independent random variables. This lets one express the characteristic function of (1) as an *averaged* Riesz product. One can use this to establish the bound (4) as long as one can show that the expression

is not close to an integer for a moderately large number (, to be precise) of indices . (Actually, for technical reasons we have to also restrict to those for which , but let us ignore this detail here.) To put it another way, if we let denote the set of pairs for which

we have to show that (with overwhelming probability) the random walk

(which we view as a two-dimensional renewal process) contains at least a few points lying outside of .

A little bit of elementary number theory and combinatorics allows one to describe the set as the union of “triangles” with a certain non-zero separation between them. If the triangles were all fairly small, then one expects the renewal process to visit at least one point outside of after passing through any given such triangle, and it then becomes relatively easy to show that the renewal process usually has the required number of points outside of . The most difficult case is when the renewal process passes through a particularly large triangle in . However, it turns out that large triangles enjoy particularly good separation properties, and in particular after passing through a large triangle one is likely to encounter nothing but small triangles for a while. After making these heuristics more precise, one is finally able to get enough points on the renewal process outside of that one can finish the proof of (4), and thus Theorem 2.

- Elementary multiplicative number theory
- Complex-analytic multiplicative number theory
- The entropy decrement argument
- Bounds for exponential sums
- Zero density theorems
- Halász’s theorem and the Matomäki-Radziwiłł theorem
- The circle method
- (If time permits) Chowla’s conjecture and the Erdős discrepancy problem

Lecture notes for topics 3, 6, and 8 will be forthcoming.

Conjecture 1 (Cramér conjecture) If is a large number, then the largest prime gap in is of size . (Granville refines this conjecture to , where . Here we use the asymptotic notation for , for , for , and for .)

Conjecture 2 (Hardy-Littlewood conjecture) If are fixed distinct integers, then the number of numbers with all prime is as , where the singular series is defined by the formula

(One can view these conjectures as modern versions of two of the classical Landau problems, namely Legendre’s conjecture and the twin prime conjecture respectively.)

A well known connection between the Hardy-Littlewood conjecture and prime gaps was made by Gallagher. Among other things, Gallagher showed that if the Hardy-Littlewood conjecture was true, then the prime gaps with were asymptotically distributed according to an exponential distribution of mean , in the sense that

as for any fixed . Roughly speaking, the way this is established is by using the Hardy-Littlewood conjecture to control the mean values of for fixed , where ranges over the primes in . The relevance of these quantities arises from the Bonferroni inequalities (or “Brun pure sieve“), which can be formulated as the assertion that

when is even and

when is odd, for any natural number ; setting and taking means, one then gets upper and lower bounds for the probability that the interval is free of primes. The most difficult step is to control the mean values of the singular series as ranges over -tuples in a fixed interval such as .
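The Bonferroni inequalities can be checked directly for small parameters. The sketch below assumes they are being used in the standard form in which the truncated alternating sum $\sum_{0 \leq k \leq K} (-1)^k \binom{N}{k}$ is an upper bound for the indicator $1_{N=0}$ when $K$ is even and a lower bound when $K$ is odd; the function names and the ranges tested are illustrative choices.

```python
from math import comb


def truncated_alternating_sum(N, K):
    """Sum of (-1)^k * C(N, k) over 0 <= k <= K."""
    return sum((-1) ** k * comb(N, k) for k in range(K + 1))


def check_bonferroni(max_N, max_K):
    """Verify the upper/lower bounds on the indicator of N = 0 for small N and K."""
    for N in range(max_N + 1):
        indicator = 1 if N == 0 else 0
        for K in range(max_K + 1):
            s = truncated_alternating_sum(N, K)
            if K % 2 == 0 and s < indicator:
                return False  # even truncations should be upper bounds
            if K % 2 == 1 and s > indicator:
                return False  # odd truncations should be lower bounds
    return True


bonferroni_holds = check_bonferroni(max_N=40, max_K=15)

# For N >= 1 the truncated sum has the closed form (-1)^K * C(N - 1, K),
# which makes the alternating sign pattern transparent.
closed_form_holds = all(
    truncated_alternating_sum(N, K) == (-1) ** K * comb(N - 1, K)
    for N in range(1, 41)
    for K in range(0, 16)
)
```

Both checks succeed for the ranges shown; in the application, $N$ plays the role of the number of primes in a short interval, so that the indicator $1_{N=0}$ detects a prime-free interval.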

Heuristically, if one extrapolates the asymptotic (1) to the regime , one is then led to Cramér’s conjecture, since the right-hand side of (1) falls below when is significantly larger than . However, this is not a rigorous derivation of Cramér’s conjecture from the Hardy-Littlewood conjecture, since Gallagher’s computations only establish (1) for *fixed* choices of , which is only enough to establish the far weaker bound , which was already known (see this previous paper for a discussion of the best known unconditional lower bounds on ). An inspection of the argument shows that if one wished to extend (1) to parameter choices that were allowed to grow with , then one would need as input a stronger version of the Hardy-Littlewood conjecture in which the length of the tuple , as well as the magnitudes of the shifts , were also allowed to grow with . Our initial objective in this project was then to quantify exactly what strengthening of the Hardy-Littlewood conjecture would be needed to rigorously imply Cramér’s conjecture. The precise results are technical, but roughly we show results of the following form:

Theorem 3 (Large gaps from Hardy-Littlewood, rough statement)

- If the Hardy-Littlewood conjecture is uniformly true for -tuples of length , and with shifts of size , with a power savings in the error term, then .
- If the Hardy-Littlewood conjecture is “true on average” for -tuples of length and shifts of size for all , with a power savings in the error term, then .

In particular, we can recover Cramér’s conjecture given a sufficiently powerful version of the Hardy-Littlewood conjecture “on the average”.

Our proof of this theorem proceeds more or less along the same lines as Gallagher’s calculation, but now with $k$ allowed to grow slowly with $x$. Again, the main difficulty is to accurately estimate average values of the singular series $\mathfrak{S}(h_1,\dots,h_k)$. Here we found it useful to switch to a probabilistic interpretation of this series. For technical reasons it is convenient to work with a truncated, unnormalised version

$$V_{\mathcal H}(z) := \prod_{p \leq z} \left(1 - \frac{|\mathcal H \bmod p|}{p}\right)$$

of the singular series, where $\mathcal H := \{h_1,\dots,h_k\}$, for a suitable cutoff $z$; it turns out that when studying prime tuples of size $t$, the most convenient cutoff $z(t)$ is the “Pólya magic cutoff”, defined as the largest prime for which

$$\prod_{p \leq z(t)} \left(1 - \frac{1}{p}\right) \geq \frac{1}{\log t}$$

(this is well defined for $t \geq e^2$); by Mertens’ theorem, we have $z(t) \sim t^{1/e^\gamma}$. One can interpret $V_{\mathcal H}(z)$ probabilistically as

$$V_{\mathcal H}(z) = \mathbf{P}( \mathcal H \subset \mathcal{S}_z )$$

where $\mathcal{S}_z \subset \mathbf{Z}$ is the randomly sifted set of integers formed by removing one residue class $a_p \bmod p$ uniformly at random for each prime $p \leq z$. The Hardy-Littlewood conjecture can be viewed as an assertion that the primes $\mathcal P$ behave in some approximate statistical sense like the random sifted set $\mathcal{S}_z$, and one can prove the above theorem by using the Bonferroni inequalities both for the primes $\mathcal P$ and for the random sifted set, and comparing the two (using an even number $m$ of terms for the sifted set and an odd number $m$ for the primes, so that the two inequalities can be combined into a useful bound).
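
To make the probabilistic interpretation concrete, here is a minimal Python sketch (an illustration only, not code from the paper). It assumes the truncated product is $\prod_{p \leq z} (1 - |\mathcal H \bmod p|/p)$ and that the magic cutoff is the largest prime $z$ with $\prod_{p \leq z}(1 - 1/p) \geq 1/\log t$. The function names `V`, `prob_H_survives` and `polya_cutoff` are hypothetical. For a small cutoff one can enumerate every choice of removed residue classes and check that the survival probability of a tuple matches the product exactly.

```python
from fractions import Fraction
from itertools import product
from math import log

def is_prime(n):
    """Trial-division primality test (adequate for the small inputs used here)."""
    return n >= 2 and all(n % q for q in range(2, int(n ** 0.5) + 1))

def primes_up_to(z):
    """All primes p <= z."""
    return [p for p in range(2, z + 1) if is_prime(p)]

def V(H, z):
    """Truncated product prod_{p <= z} (1 - |H mod p| / p), as an exact fraction."""
    result = Fraction(1)
    for p in primes_up_to(z):
        result *= 1 - Fraction(len({h % p for h in H}), p)
    return result

def prob_H_survives(H, z):
    """Exact probability that every h in H avoids the removed class a_p mod p
    for all primes p <= z, computed by enumerating every choice of (a_p)."""
    ps = primes_up_to(z)
    good = total = 0
    for classes in product(*(range(p) for p in ps)):
        total += 1
        if all(h % p != a for p, a in zip(ps, classes) for h in H):
            good += 1
    return Fraction(good, total)

def polya_cutoff(t):
    """Largest prime z with prod_{p <= z} (1 - 1/p) >= 1/log t.
    Returns None when t < e^2, where no prime qualifies."""
    target = 1 / log(t)
    running, z, p = 1.0, None, 2
    while True:
        if is_prime(p):
            running *= 1 - 1 / p
            if running < target:
                return z
            z = p
        p += 1

print(V([0, 2], 7), prob_H_survives([0, 2], 7))  # both are 1/14
print(polya_cutoff(100))  # 7
```

The enumeration is only feasible for very small cutoffs (the number of cases is the product of the primes up to $z$), but that is enough to see why the identity holds: the choices for different primes are independent, and for each prime $p$ exactly $p - |\mathcal H \bmod p|$ of the $p$ choices spare the whole tuple.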

The proof of Theorem 3 ended up not using any properties of the set of primes $\mathcal P$ other than that this set obeyed some form of the Hardy-Littlewood conjectures; the theorem remains true (with suitable notational changes) if this set were replaced by any other set. In order to convince ourselves that our theorem was not vacuous due to our version of the Hardy-Littlewood conjecture being too strong to be true, we then started exploring the question of coming up with random models of $\mathcal P$ which obeyed various versions of the Hardy-Littlewood and Cramér conjectures.

This line of inquiry was started by Cramér, who introduced what we now call the *Cramér random model* $\mathcal C$ of the primes, in which each natural number $n \geq 3$ is selected for membership in $\mathcal C$ with an independent probability of $1/\log n$. This model matches the primes well in some respects; for instance, it almost surely obeys the “Riemann hypothesis”

$$| \mathcal C \cap [1,x] | = \int_2^x \frac{dt}{\log t} + O( x^{1/2+o(1)} )$$

and Cramér also showed that the largest gap $G_{\mathcal C}(x)$ was almost surely $\sim \log^2 x$. On the other hand, it does not obey the Hardy-Littlewood conjecture; more precisely, it obeys a simplified variant of that conjecture in which the singular series is absent.
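
The Cramér model is easy to simulate. The following Python sketch is an illustration under stated assumptions rather than anything from the paper: it assumes each $n \geq 3$ is kept independently with probability $1/\log n$, and the function names `cramer_model` and `largest_gap` are hypothetical. It compares the size of one sample with its expected size and reports the largest gap as a multiple of $\log^2 x$.

```python
import math
import random

def cramer_model(x, seed=0):
    """One sample of the Cramér model on [3, x]:
    keep each n independently with probability 1/log n."""
    rng = random.Random(seed)
    return [n for n in range(3, x + 1) if rng.random() < 1 / math.log(n)]

def largest_gap(sorted_members):
    """Largest difference between consecutive elements of an increasing list."""
    return max(b - a for a, b in zip(sorted_members, sorted_members[1:]))

x = 10 ** 5
sample = cramer_model(x)
# Expected size of the sample, a discrete stand-in for the logarithmic integral.
expected = sum(1 / math.log(n) for n in range(3, x + 1))
print(len(sample), round(expected))
# Largest gap in the sample, normalised by log^2 x.
print(largest_gap(sample) / math.log(x) ** 2)
```

At such small scales the normalised gap is only loosely constrained, so a single run should be read as a sanity check on the order of magnitude, not as evidence for the precise constant.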

Granville proposed a refinement $\mathcal G$ to Cramér’s random model $\mathcal C$ in which one first sieves out (in each dyadic interval $[x,2x]$) all residue classes $0 \bmod p$ for $p \leq A$ for a certain threshold $A = \log^{1-o(1)} x = o(\log x)$, and then places each surviving natural number $n$ in $\mathcal G$ with an independent probability $\frac{1}{\log n} \prod_{p \leq A} (1-\frac{1}{p})^{-1}$. One can verify that this model obeys the Hardy-Littlewood conjectures, and Granville showed that the largest gap $G_{\mathcal G}(x)$ in this model was almost surely $\gtrsim \xi \log^2 x$ with $\xi := 2e^{-\gamma}$, leading to his conjecture that this bound also was true for the primes. (Interestingly, this conjecture is not yet borne out by numerics; calculations of prime gaps up to $10^{18}$, for instance, have shown that $G_{\mathcal P}(x)$ never exceeds $0.9206 \log^2 x$ in this range. This is not necessarily a conflict, however; Granville’s analysis relies on inspecting gaps in an extremely sparse region of natural numbers that are more devoid of primes than average, and this region is not well explored by existing numerics. See this previous blog post for more discussion of Granville’s argument.)

However, Granville’s model does not produce a power savings in the error term of the Hardy-Littlewood conjectures, mostly due to the need to truncate the singular series at the logarithmic cutoff $A$. After some experimentation, we were able to produce a tractable random model $\mathcal R$ for the primes which obeyed the Hardy-Littlewood conjectures with power savings, and which reproduced Granville’s gap prediction of $\gtrsim \xi \log^2 x$ (we also get an upper bound of $\lesssim \xi \log^2 x \frac{\log\log x}{2 \log\log\log x}$ for both models, though we expect the lower bound to be closer to the truth); to us, this strengthens the case for Granville’s version of Cramér’s conjecture. The model can be described as follows. We select one residue class $a_p \bmod p$ uniformly at random for each prime $p$, and as before we let $\mathcal{S}_z$ be the sifted set of integers formed by deleting the residue classes $a_p \bmod p$ with $p \leq z$. We then set

$$\mathcal R := \{ n \geq e^2 : n \in \mathcal{S}_{z(n)} \}$$

with $z(t)$ Pólya’s magic cutoff (this is the cutoff that gives $\mathcal R$ a density consistent with the prime number theorem or the Riemann hypothesis). As stated above, we are able to show that almost surely one has

$$\xi \log^2 x \lesssim G_{\mathcal R}(x) \lesssim \xi \log^2 x \frac{\log\log x}{2 \log\log\log x} \qquad (3)$$

and that the Hardy-Littlewood conjectures hold with power savings for $k$ up to $\log^c x$ for any fixed $c < 1$ and for shifts $h_1,\dots,h_k$ of size $O(\log^c x)$. This is unfortunately a tiny bit weaker than what Theorem 3 requires (which more or less corresponds to the endpoint $c=1$), although there is a variant of Theorem 3 that can use this input to produce a lower bound on gaps in the model $\mathcal R$ (but it is weaker than the one in (3)). In fact we prove a more precise almost sure asymptotic formula for $G_{\mathcal R}(x)$ that involves the optimal bounds for the *linear sieve* (or *interval sieve*), in which one deletes one residue class modulo $p$ from an interval $[0,y]$ for all primes $p$ up to a given threshold. The lower bound in (3) relates to the case of deleting the $0 \bmod p$ residue classes from $[0,y]$; the upper bound comes from the delicate analysis of the linear sieve by Iwaniec. Improving on either of the two bounds looks to be quite a difficult problem.
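
The random model described above can be sampled directly at small scales. The Python sketch below is an illustration under stated assumptions and not code from the paper: it assumes the model consists of those $n \geq e^2$ that avoid the random class $a_p \bmod p$ for every prime $p$ up to the magic cutoff $z(n)$, with $z(t)$ the largest prime satisfying $\prod_{p \leq z(t)}(1 - 1/p) \geq 1/\log t$. The function name `random_model` is hypothetical.

```python
import math
import random

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes p <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]

def random_model(x, seed=0):
    """One sample of the model on [8, x]: n is kept when n mod p != a_p
    for every prime p <= z(n), with a_p uniform on {0, ..., p-1}."""
    rng = random.Random(seed)
    primes = primes_up_to(x)  # more than enough; z(x) is far smaller than x
    residues = {}             # the random classes a_p, sampled once per prime
    members = []
    running, j = 1.0, 0       # running = product of (1 - 1/p) over primes[:j]
    for n in range(8, x + 1):  # 8 is the first integer above e^2
        target = 1 / math.log(n)
        # Extend the cutoff while the product stays at or above 1/log n;
        # z(n) is nondecreasing in n, so the pointer j only moves forward.
        while j < len(primes) and running * (1 - 1 / primes[j]) >= target:
            running *= 1 - 1 / primes[j]
            residues[primes[j]] = rng.randrange(primes[j])
            j += 1
        if all(n % p != residues[p] for p in primes[:j]):
            members.append(n)
    return members

x = 20000
sample = random_model(x)
expected = sum(1 / math.log(n) for n in range(8, x + 1))
gap = max(b - a for a, b in zip(sample, sample[1:]))
print(len(sample), round(expected), gap / math.log(x) ** 2)
```

Unlike in the Cramér model, membership of nearby integers is correlated here: for instance every element of a sample has the same parity, because the class removed modulo $2$ is shared by all $n$.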

The probabilistic analysis of $\mathcal R$ is somewhat more complicated than that of $\mathcal C$ or $\mathcal G$ as there is now non-trivial coupling between the events $n \in \mathcal R$ as $n$ varies, although moment methods such as the second moment method are still viable and allow one to verify the Hardy-Littlewood conjectures by a lengthy but fairly straightforward calculation. To analyse large gaps, one has to understand the statistical behaviour of a random linear sieve in which one starts with an interval $[0,y]$ and randomly deletes a residue class $a_p \bmod p$ for each prime $p$ up to a given threshold. For very small $p$ this is handled by the deterministic theory of the linear sieve as discussed above. For medium sized $p$, it turns out that there is good concentration of measure thanks to tools such as Bennett’s inequality or Azuma’s inequality, as one can view the sieving process as a martingale or (approximately) as a sum of independent random variables. For larger primes $p$, in which only a small number of survivors are expected to be sieved out by each residue class, a direct combinatorial calculation of all possible outcomes (involving the random graph that connects interval elements $n$ to primes $p$ if $n$ falls in the random residue class $a_p \bmod p$) turns out to give the best results.
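
The random interval sieve can also be explored numerically at toy scale. The Python sketch below is illustrative only and is not the analysis in the paper: it assumes the interval is $\{0,\dots,y\}$ and that one residue class is deleted for each prime up to a threshold $z$, and the names `sieve_interval` and `random_classes` are hypothetical. It compares the deterministic choice of deleting $0 \bmod p$ with the average number of survivors over random choices.

```python
import random

def primes_up_to(z):
    """All primes p <= z, by trial division (fine for small z)."""
    return [p for p in range(2, z + 1)
            if all(p % q for q in range(2, int(p ** 0.5) + 1))]

def sieve_interval(y, classes):
    """Survivors of {0, ..., y} after deleting the class a mod p
    for every pair (p, a) in the dictionary `classes`."""
    return [n for n in range(y + 1)
            if all(n % p != a for p, a in classes.items())]

def random_classes(z, rng):
    """One uniformly random residue class a_p for each prime p <= z."""
    return {p: rng.randrange(p) for p in primes_up_to(z)}

y, z = 30, 5

# Deterministic case: delete 0 mod p, leaving the integers coprime to 2*3*5.
zero_classes = {p: 0 for p in primes_up_to(z)}
print(sieve_interval(y, zero_classes))  # [1, 7, 11, 13, 17, 19, 23, 29]

# Random case: the mean survivor count is (y + 1) * prod_{p <= z} (1 - 1/p).
rng = random.Random(0)
counts = [len(sieve_interval(y, random_classes(z, rng))) for _ in range(2000)]
print(sum(counts) / len(counts), min(counts), max(counts))
```

The mean is fixed by linearity of expectation, so the interesting quantity for large gaps is the spread between the minimum and maximum survivor counts, which is what the extremal bounds for the interval sieve control.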
