You are currently browsing the category archive for the ‘math.NT’ category.

In this post we assume the Riemann hypothesis and the simplicity of zeroes, thus the zeroes of {\zeta} in the critical strip take the form {\frac{1}{2} \pm i \gamma_j} for some real number ordinates {0 < \gamma_1 < \gamma_2 < \dots}. From the Riemann-von Mangoldt formula, one has the asymptotic

\displaystyle  \gamma_n = (1+o(1)) \frac{2\pi}{\log n} n

as {n \rightarrow \infty}; in particular, the spacing {\gamma_{n+1} - \gamma_n} should behave like {\frac{2\pi}{\log n}} on the average. However, it can happen that some gaps are unusually small compared to other nearby gaps. For the sake of concreteness, let us define a Lehmer pair to be a pair of adjacent ordinates {\gamma_n, \gamma_{n+1}} such that

\displaystyle  \frac{1}{(\gamma_{n+1} - \gamma_n)^2} \geq 1.3 \sum_{m \neq n,n+1} \frac{1}{(\gamma_m - \gamma_n)^2} + \frac{1}{(\gamma_m - \gamma_{n+1})^2}. \ \ \ \ \ (1)

The specific value of constant {1.3} is not particularly important here; anything larger than {\frac{5}{4}} would suffice. An example of such a pair would be the classical pair

\displaystyle  \gamma_{6709} = 7005.062866\dots

\displaystyle  \gamma_{6710} = 7005.100564\dots

discovered by Lehmer. It follows easily from the main results of Csordas, Smith, and Varga that if an infinite number of Lehmer pairs (in the above sense) existed, then the de Bruijn-Newman constant {\Lambda} is non-negative. This implication is now redundant in view of the unconditional results of this recent paper of Rodgers and myself; however, the question of whether an infinite number of Lehmer pairs exist remain open.

In this post, I sketch an argument that Brad and I came up with (as initially suggested by Odlyzko) the GUE hypothesis implies the existence of infinitely many Lehmer pairs. We argue probabilistically: pick a sufficiently large number {T}, pick {n} at random from {T \log T} to {2 T \log T} (so that the average gap size is close to {\frac{2\pi}{\log T}}), and prove that the Lehmer pair condition (1) occurs with positive probability.

Introduce the renormalised ordinates {x_n := \frac{\log T}{2\pi} \gamma_n} for {T \log T \leq n \leq 2 T \log T}, and let {\varepsilon > 0} be a small absolute constant (independent of {T}). It will then suffice to show that

\displaystyle  \frac{1}{(x_{n+1} - x_n)^2} \geq

\displaystyle 1.3 \sum_{m \in [T \log T, 2T \log T]: m \neq n,n+1} \frac{1}{(x_m - x_n)^2} + \frac{1}{(x_m - x_{n+1})^2}

\displaystyle  + \frac{1}{6\varepsilon^2}

(say) with probability {\gg \varepsilon^4 - o(1)}, since the contribution of those {m} outside of {[T \log T, 2T \log T]} can be absorbed by the {\frac{1}{\varepsilon^2}} factor with probability {o(1)}.

As one consequence of the GUE hypothesis, we have {x_{n+1} - x_n \leq \varepsilon^2} with probability {O(\varepsilon^6)}. Thus, if {E := \{ m \in [T \log T, 2T \log T]: x_{m+1} - x_m \leq \varepsilon^2 \}}, then {E} has density {O( \varepsilon^6 )}. Applying the Hardy-Littlewood maximal inequality, we see that with probability {O(\varepsilon^6)}, we have

\displaystyle  \sup_{h \geq 1} | \# E \cap [n+h, n-h] | \leq \frac{1}{10}

which implies in particular that

\displaystyle  |x_m - x_n|, |x_{m} - x_{n+1}| \gg \varepsilon^2 |m-n|

for all {m \in [T \log T, 2 T \log T] \backslash \{ n, n+1\}}. This implies in particular that

\displaystyle  \sum_{m \in [T \log T, 2T \log T]: |m-n| \geq \varepsilon^{-3}} \frac{1}{(x_m - x_n)^2} + \frac{1}{(x_m - x_{n+1})^2} \ll \varepsilon^{-1}

and so it will suffice to show that

\displaystyle  \frac{1}{(x_{n+1} - x_n)^2}

\displaystyle \geq 1.3 \sum_{m \in [T \log T, 2T \log T]: m \neq n,n+1; |m-n| < \varepsilon^{-3}} \frac{1}{(x_m - x_n)^2} + \frac{1}{(x_m - x_{n+1})^2} + \frac{1}{5\varepsilon^2}

(say) with probability {\gg \varepsilon^4 - o(1)}.

By the GUE hypothesis (and the fact that {\varepsilon} is independent of {T}), it suffices to show that a Dyson sine process {(x_n)_{n \in {\bf Z}}}, normalised so that {x_0} is the first positive point in the process, obeys the inequality

\displaystyle  \frac{1}{(x_{1} - x_0)^2} \geq 1.3 \sum_{|m| < \varepsilon^{-3}: m \neq 0,1} \frac{1}{(x_m - x_0)^2} + \frac{1}{(x_m - x_1)^2} \ \ \ \ \ (2)

with probability {\gg \varepsilon^4}. However, if we let {A > 0} be a moderately large constant (and assume {\varepsilon} small depending on {A}), one can show using {k}-point correlation functions for the Dyson sine process (and the fact that the Dyson kernel {K(x,y) = \sin(\pi(x-y))/\pi(x-y)} equals {1} to second order at the origin) that

\displaystyle  {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} \gg \varepsilon^4

\displaystyle  {\bf E} N_{[-\varepsilon,0]} \binom{N_{[0,\varepsilon]}}{2} \ll \varepsilon^7

\displaystyle  {\bf E} \binom{N_{[-\varepsilon,0]}}{2} N_{[0,\varepsilon]} \ll \varepsilon^7

\displaystyle  {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} N_{[\varepsilon,A^{-1}]} \ll A^{-3} \varepsilon^4

\displaystyle  {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} N_{[-A^{-1}, -\varepsilon]} \ll A^{-3} \varepsilon^4

\displaystyle  {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} N_{[-k, k]}^2 \ll k^2 \varepsilon^4 \ \ \ \ \ (3)

for any natural number {k}, where {N_{I}} denotes the number of elements of the process in {I}. For instance, the expression {{\bf E} N_{[-\varepsilon,0]} \binom{N_{[0,\varepsilon]}}{2} } can be written in terms of the three-point correlation function {\rho_3(x_1,x_2,x_3) = \mathrm{det}(K(x_i,x_j))_{1 \leq i,j \leq 3}} as

\displaystyle  \int_{-\varepsilon \leq x_1 \leq 0 \leq x_2 \leq x_3 \leq \varepsilon} \rho_3( x_1, x_2, x_3 )\ dx_1 dx_2 dx_3

which can easily be estimated to be {O(\varepsilon^7)} (since {\rho_3 = O(\varepsilon^4)} in this region), and similarly for the other estimates claimed above.

Since for natural numbers {a,b}, the quantity {ab - 2 a \binom{b}{2} - 2 b \binom{a}{2} = ab (5-2a-2b)} is only positive when {a=b=1}, we see from the first three estimates that the event {E} that {N_{[-\varepsilon,0]} = N_{[0,\varepsilon]} = 1} occurs with probability {\gg \varepsilon^4}. In particular, by Markov’s inequality we have the conditional probabilities

\displaystyle  {\bf P} ( N_{[\varepsilon,A^{-1}]} \geq 1 | E ) \ll A^{-3}

\displaystyle  {\bf P} ( N_{[-A^{-1}, -\varepsilon]} \geq 1 | E ) \ll A^{-3}

\displaystyle  {\bf P} ( N_{[-k, k]} \geq A k^{5/3} | E ) \ll A^{-4} k^{-4/3}

and thus, if {A} is large enough, and {\varepsilon} small enough, it will be true with probability {\gg \varepsilon^4} that

\displaystyle  N_{[-\varepsilon,0]}, N_{[0,\varepsilon]} = 1

and

\displaystyle  N_{[A^{-1}, \varepsilon]} = N_{[\varepsilon, A^{-1}]} = 0

and simultaneously that

\displaystyle  N_{[-k,k]} \leq A k^{5/3}

for all natural numbers {k}. This implies in particular that

\displaystyle  x_1 - x_0 \leq 2\varepsilon

and

\displaystyle  |x_m - x_0|, |x_m - x_1| \gg_A |m|^{3/5}

for all {m \neq 0,1}, which gives (2) for {\varepsilon} small enough.

Remark 1 The above argument needed the GUE hypothesis for correlations up to fourth order (in order to establish (3)). It might be possible to reduce the number of correlations needed, but I do not see how to obtain the claim just using pair correlations only.

Brad Rodgers and I have uploaded to the arXiv our paper “The De Bruijn-Newman constant is non-negative“. This paper affirms a conjecture of Newman regarding to the extent to which the Riemann hypothesis, if true, is only “barely so”. To describe the conjecture, let us begin with the Riemann xi function

\displaystyle \xi(s) := \frac{s(s-1)}{2} \pi^{-s/2} \Gamma(\frac{s}{2}) \zeta(s)

where {\Gamma(s) := \int_0^\infty e^{-t} t^{s-1}\ dt} is the Gamma function and {\zeta(s) := \sum_{n=1}^\infty \frac{1}{n^s}} is the Riemann zeta function. Initially, this function is only defined for {\mathrm{Re} s > 1}, but, as was already known to Riemann, we can manipulate it into a form that extends to the entire complex plane as follows. Firstly, in view of the standard identity {s \Gamma(s) = \Gamma(s+1)}, we can write

\displaystyle \frac{s(s-1)}{2} \Gamma(\frac{s}{2}) = 2 \Gamma(\frac{s+4}{2}) - 3 \Gamma( \frac{s+2}{2} )

and hence

\displaystyle \xi(s) = \sum_{n=1}^\infty 2 \pi^{-s/2} n^{-s} \int_0^\infty e^{-t} t^{\frac{s+4}{2}-1}\ dt - 3 \pi^{-s/2} n^{-s} \int_0^\infty e^{-t} t^{\frac{s+2}{2}-1}\ dt.

By a rescaling, one may write

\displaystyle \int_0^\infty e^{-t} t^{\frac{s+4}{2}-1}\ dt = (\pi n^2)^{\frac{s+4}{2}} \int_0^\infty e^{-\pi n^2 t} t^{\frac{s+4}{2}-1}\ dt

and similarly

\displaystyle \int_0^\infty e^{-t} t^{\frac{s+2}{2}-1}\ dt = (\pi n^2)^{\frac{s+2}{2}} \int_0^\infty e^{-\pi n^2 t} t^{\frac{s+2}{2}-1}\ dt

and thus (after applying Fubini’s theorem)

\displaystyle \xi(s) = \int_0^\infty \sum_{n=1}^\infty 2 \pi^2 n^4 e^{-\pi n^2 t} t^{\frac{s+4}{2}-1} - 3 \pi n^2 e^{-\pi n^2 t} t^{\frac{s+2}{2}-1}\ dt.

We’ll make the change of variables {t = e^{4u}} to obtain

\displaystyle \xi(s) = 4 \int_{\bf R} \sum_{n=1}^\infty (2 \pi^2 n^4 e^{8u} - 3 \pi n^2 e^{4u}) \exp( 2su - \pi n^2 e^{4u} )\ du.

If we introduce the mild renormalisation

\displaystyle H_0(z) := \frac{1}{8} \xi( \frac{1}{2} + \frac{iz}{2} )

of {\xi}, we then conclude (at least for {\mathrm{Im} z > 1}) that

\displaystyle H_0(z) = \frac{1}{2} \int_{\bf R} \Phi(u)\exp(izu)\ du \ \ \ \ \ (1)

 

where {\Phi: {\bf R} \rightarrow {\bf C}} is the function

\displaystyle \Phi(u) := \sum_{n=1}^\infty (2 \pi^2 n^4 e^{9u} - 3 \pi n^2 e^{5u}) \exp( - \pi n^2 e^{4u} ), \ \ \ \ \ (2)

 

which one can verify to be rapidly decreasing both as {u \rightarrow +\infty} and as {u \rightarrow -\infty}, with the decrease as {u \rightarrow +\infty} faster than any exponential. In particular {H_0} extends holomorphically to the upper half plane.

If we normalize the Fourier transform {{\mathcal F} f(\xi)} of a (Schwartz) function {f(x)} as {{\mathcal F} f(\xi) := \int_{\bf R} f(x) e^{-2\pi i \xi x}\ dx}, it is well known that the Gaussian {x \mapsto e^{-\pi x^2}} is its own Fourier transform. The creation operator {2\pi x - \frac{d}{dx}} interacts with the Fourier transform by the identity

\displaystyle {\mathcal F} (( 2\pi x - \frac{d}{dx} ) f) (\xi) = -i (2 \pi \xi - \frac{d}{d\xi} ) {\mathcal F} f(\xi).

Since {(-i)^4 = 1}, this implies that the function

\displaystyle x \mapsto (2\pi x - \frac{d}{dx})^4 e^{-\pi x^2} = 128 \pi^2 (2 \pi^2 x^4 - 3 \pi x^2) e^{-\pi x^2} + 48 \pi^2 e^{-\pi x^2}

is its own Fourier transform. (One can view the polynomial {128 \pi^2 (2\pi^2 x^4 - 3 \pi x^2) + 48 \pi^2} as a renormalised version of the fourth Hermite polynomial.) Taking a suitable linear combination of this with {x \mapsto e^{-\pi x^2}}, we conclude that

\displaystyle x \mapsto (2 \pi^2 x^4 - 3 \pi x^2) e^{-\pi x^2}

is also its own Fourier transform. Rescaling {x} by {e^{2u}} and then multiplying by {e^u}, we conclude that the Fourier transform of

\displaystyle x \mapsto (2 \pi^2 x^4 e^{9u} - 3 \pi x^2 e^{5u}) \exp( - \pi x^2 e^{4u} )

is

\displaystyle x \mapsto (2 \pi^2 x^4 e^{-9u} - 3 \pi x^2 e^{-5u}) \exp( - \pi x^2 e^{-4u} ),

and hence by the Poisson summation formula (using symmetry and vanishing at {n=0} to unfold the {n} summation in (2) to the integers rather than the natural numbers) we obtain the functional equation

\displaystyle \Phi(-u) = \Phi(u),

which implies that {\Phi} and {H_0} are even functions (in particular, {H_0} now extends to an entire function). From this symmetry we can also rewrite (1) as

\displaystyle H_0(z) = \int_0^\infty \Phi(u) \cos(zu)\ du,

which now gives a convergent expression for the entire function {H_0(z)} for all complex {z}. As {\Phi} is even and real-valued on {{\bf R}}, {H_0(z)} is even and also obeys the functional equation {H_0(\overline{z}) = \overline{H_0(z)}}, which is equivalent to the usual functional equation for the Riemann zeta function. The Riemann hypothesis is equivalent to the claim that all the zeroes of {H_0} are real.

De Bruijn introduced the family {H_t: {\bf C} \rightarrow {\bf C}} of deformations of {H_0: {\bf C} \rightarrow {\bf C}}, defined for all {t \in {\bf R}} and {z \in {\bf C}} by the formula

\displaystyle H_t(z) := \int_0^\infty e^{tu^2} \Phi(u) \cos(zu)\ du.

From a PDE perspective, one can view {H_t} as the evolution of {H_0} under the backwards heat equation {\partial_t H_t(z) = - \partial_{zz} H_t(z)}. As with {H_0}, the {H_t} are all even entire functions that obey the functional equation {H_t(\overline{z}) = \overline{H_t(z)}}, and one can ask an analogue of the Riemann hypothesis for each such {H_t}, namely whether all the zeroes of {H_t} are real. De Bruijn showed that these hypotheses were monotone in {t}: if {H_t} had all real zeroes for some {t}, then {H_{t'}} would also have all zeroes real for any {t' \geq t}. Newman later sharpened this claim by showing the existence of a finite number {\Lambda \leq 1/2}, now known as the de Bruijn-Newman constant, with the property that {H_t} had all zeroes real if and only if {t \geq \Lambda}. Thus, the Riemann hypothesis is equivalent to the inequality {\Lambda \leq 0}. Newman then conjectured the complementary bound {\Lambda \geq 0}; in his words, this conjecture asserted that if the Riemann hypothesis is true, then it is only “barely so”, in that the reality of all the zeroes is destroyed by applying heat flow for even an arbitrarily small amount of time. Over time, a significant amount of evidence was established in favour of this conjecture; most recently, in 2011, Saouter, Gourdon, and Demichel showed that {\Lambda \geq -1.15 \times 10^{-11}}.

In this paper we finish off the proof of Newman’s conjecture, that is we show that {\Lambda \geq 0}. The proof is by contradiction, assuming that {\Lambda < 0} (which among other things, implies the truth of the Riemann hypothesis), and using the properties of backwards heat evolution to reach a contradiction.

Very roughly, the argument proceeds as follows. As observed by Csordas, Smith, and Varga (and also discussed in this previous blog post, the backwards heat evolution of the {H_t} introduces a nice ODE dynamics on the zeroes {x_j(t)} of {H_t}, namely that they solve the ODE

\displaystyle \frac{d}{dt} x_j(t) = -2 \sum_{j \neq k} \frac{1}{x_k(t) - x_j(t)} \ \ \ \ \ (3)

 

for all {j} (one has to interpret the sum in a principal value sense as it is not absolutely convergent, but let us ignore this technicality for the current discussion). Intuitively, this ODE is asserting that the zeroes {x_j(t)} repel each other, somewhat like positively charged particles (but note that the dynamics is first-order, as opposed to the second-order laws of Newtonian mechanics). Formally, a steady state (or equilibrium) of this dynamics is reached when the {x_k(t)} are arranged in an arithmetic progression. (Note for instance that for any positive {u}, the functions {z \mapsto e^{tu^2} \cos(uz)} obey the same backwards heat equation as {H_t}, and their zeroes are on a fixed arithmetic progression {\{ \frac{2\pi (k+\tfrac{1}{2})}{u}: k \in {\bf Z} \}}.) The strategy is to then show that the dynamics from time {-\Lambda} to time {0} creates a convergence to local equilibrium, in which the zeroes {x_k(t)} locally resemble an arithmetic progression at time {t=0}. This will be in contradiction with known results on pair correlation of zeroes (or on related statistics, such as the fluctuations on gaps between zeroes), such as the results of Montgomery (actually for technical reasons it is slightly more convenient for us to use related results of Conrey, Ghosh, Goldston, Gonek, and Heath-Brown). Another way of thinking about this is that even very slight deviations from local equilibrium (such as a small number of gaps that are slightly smaller than the average spacing) will almost immediately lead to zeroes colliding with each other and leaving the real line as one evolves backwards in time (i.e., under the forward heat flow). This is a refinement of the strategy used in previous lower bounds on {\Lambda}, in which “Lehmer pairs” (pairs of zeroes of the zeta function that were unusually close to each other) were used to limit the extent to which the evolution continued backwards in time while keeping all zeroes real.

How does one obtain this convergence to local equilibrium? We proceed by broad analogy with the “local relaxation flow” method of Erdos, Schlein, and Yau in random matrix theory, in which one combines some initial control on zeroes (which, in the case of the Erdos-Schlein-Yau method, is referred to with terms such as “local semicircular law”) with convexity properties of a relevant Hamiltonian that can be used to force the zeroes towards equilibrium.

We first discuss the initial control on zeroes. For {H_0}, we have the classical Riemann-von Mangoldt formula, which asserts that the number of zeroes in the interval {[0,T]} is {\frac{T}{4\pi} \log \frac{T}{4\pi} - \frac{T}{4\pi} + O(\log T)} as {T \rightarrow \infty}. (We have a factor of {4\pi} here instead of the more familiar {2\pi} due to the way {H_0} is normalised.) This implies for instance that for a fixed {\alpha}, the number of zeroes in the interval {[T, T+\alpha]} is {\frac{\alpha}{4\pi} \log T + O(\log T)}. Actually, because we get to assume the Riemann hypothesis, we can sharpen this to {\frac{\alpha}{4\pi} \log T + o(\log T)}, a result of Littlewood (see this previous blog post for a proof). Ideally, we would like to obtain similar control for the other {H_t}, {\Lambda \leq t < 0}, as well. Unfortunately we were only able to obtain the weaker claims that the number of zeroes of {H_t} in {[0,T]} is {\frac{T}{4\pi} \log \frac{T}{4\pi} - \frac{T}{4\pi} + O(\log^2 T)}, and that the number of zeroes in {[T, T+\alpha \log T]} is {\frac{\alpha}{4 \pi} \log^2 T + o(\log^2 T)}, that is to say we only get good control on the distribution of zeroes at scales {\gg \log T} rather than at scales {\gg 1}. Ultimately this is because we were only able to get control (and in particular, lower bounds) on {|H_t(x-iy)|} with high precision when {y \gg \log x} (whereas {|H_0(x-iy)|} has good estimates as soon as {y} is larger than (say) {2}). This control is obtained by the expressing {H_t(x-iy)} in terms of some contour integrals and using the method of steepest descent (actually it is slightly simpler to rely instead on the Stirling approximation for the Gamma function, which can be proven in turn by steepest descent methods). Fortunately, it turns out that this weaker control is still (barely) enough for the rest of our argument to go through.

Once one has the initial control on zeroes, we now need to force convergence to local equilibrium by exploiting convexity of a Hamiltonian. Here, the relevant Hamiltonian is

\displaystyle H(t) := \sum_{j,k: j \neq k} \log \frac{1}{|x_j(t) - x_k(t)|},

ignoring for now the rather important technical issue that this sum is not actually absolutely convergent. (Because of this, we will need to truncate and renormalise the Hamiltonian in a number of ways which we will not detail here.) The ODE (3) is formally the gradient flow for this Hamiltonian. Furthermore, this Hamiltonian is a convex function of the {x_j} (because {t \mapsto \log \frac{1}{t}} is a convex function on {(0,+\infty)}). We therefore expect the Hamiltonian to be a decreasing function of time, and that the derivative should be an increasing function of time. As time passes, the derivative of the Hamiltonian would then be expected to converge to zero, which should imply convergence to local equilibrium.

Formally, the derivative of the above Hamiltonian is

\displaystyle \partial_t H(t) = -4 E(t), \ \ \ \ \ (4)

 

where {E(t)} is the “energy”

\displaystyle E(t) := \sum_{j,k: j \neq k} \frac{1}{|x_j(t) - x_k(t)|^2}.

Again, there is the important technical issue that this quantity is infinite; but it turns out that if we renormalise the Hamiltonian appropriately, then the energy will also become suitably renormalised, and in particular will vanish when the {x_j} are arranged in an arithmetic progression, and be positive otherwise. One can also formally calculate the derivative of {E(t)} to be a somewhat complicated but manifestly non-negative quantity (a sum of squares); see this previous blog post for analogous computations in the case of heat flow on polynomials. After flowing from time {\Lambda} to time {0}, and using some crude initial bounds on {H(t)} and {E(t)} in this region (coming from the Riemann-von Mangoldt type formulae mentioned above and some further manipulations), we can eventually show that the (renormalisation of the) energy {E(0)} at time zero is small, which forces the {x_j} to locally resemble an arithmetic progression, which gives the required convergence to local equilibrium.

There are a number of technicalities involved in making the above sketch of argument rigorous (for instance, justifying interchanges of derivatives and infinite sums turns out to be a little bit delicate). I will highlight here one particular technical point. One of the ways in which we make expressions such as the energy {E(t)} finite is to truncate the indices {j,k} to an interval {I} to create a truncated energy {E_I(t)}. In typical situations, we would then expect {E_I(t)} to be decreasing, which will greatly help in bounding {E_I(0)} (in particular it would allow one to control {E_I(0)} by time-averaged quantities such as {\int_{\Lambda/2}^0 E_I(t)\ dt}, which can in turn be controlled using variants of (4)). However, there are boundary effects at both ends of {I} that could in principle add a large amount of energy into {E_I}, which is bad news as it could conceivably make {E_I(0)} undesirably large even if integrated energies such as {\int_{\Lambda/2}^0 E_I(t)\ dt} remain adequately controlled. As it turns out, such boundary effects are negligible as long as there is a large gap between adjacent zeroes at boundary of {I} – it is only narrow gaps that can rapidly transmit energy across the boundary of {I}. Now, narrow gaps can certainly exist (indeed, the GUE hypothesis predicts these happen a positive fraction of the time); but the pigeonhole principle (together with the Riemann-von Mangoldt formula) can allow us to pick the endpoints of the interval {I} so that no narrow gaps appear at the boundary of {I} for any given time {t}. However, there was a technical problem: this argument did not allow one to find a single interval {I} that avoided gaps for all times {\Lambda/2 \leq t \leq 0} simultaneously – the pigeonhole principle could produce a different interval {I} for each time {t}! Since the number of times was uncountable, this was a serious issue. (In physical terms, the problem was that there might be very fast “longitudinal waves” in the dynamics that, at each time, cause some gaps between zeroes to be highly compressed, but the specific gap that was narrow changed very rapidly with time. Such waves could, in principle, import a huge amount of energy into {E_I} by time {0}.) To resolve this, we borrowed a PDE trick of Bourgain’s, in which the pigeonhole principle was coupled with local conservation laws. More specifically, we use the phenomenon that very narrow gaps {g_i = x_{i+1}-x_i} take a nontrivial amount of time to expand back to a reasonable size (this can be seen by comparing the evolution of this gap with solutions of the scalar ODE {\partial_t g = \frac{4}{g^2}}, which represents the fastest at which a gap such as {g_i} can expand). Thus, if a gap {g_i} is reasonably large at some time {t_0}, it will also stay reasonably large at slightly earlier times {t \in [t_0-\delta, t_0]} for some moderately small {\delta>0}. This lets one locate an interval {I} that has manageable boundary effects during the times in {[t_0-\delta, t_0]}, so in particular {E_I} is basically non-increasing in this time interval. Unfortunately, this interval is a little bit too short to cover all of {[\Lambda/2,0]}; however it turns out that one can iterate the above construction and find a nested sequence of intervals {I_k}, with each {E_{I_k}} non-increasing in a different time interval {[t_k - \delta, t_k]}, and with all of the time intervals covering {[\Lambda/2,0]}. This turns out to be enough (together with the obvious fact that {E_I} is monotone in {I}) to still control {E_I(0)} for some reasonably sized interval {I}, as required for the rest of the arguments.

ADDED LATER: the following analogy (involving functions with just two zeroes, rather than an infinite number of zeroes) may help clarify the relation between this result and the Riemann hypothesis (and in particular why this result does not make the Riemann hypothesis any easier to prove, in fact it confirms the delicate nature of that hypothesis). Suppose one had a quadratic polynomial {P} of the form {P(z) = z^2 + \Lambda}, where {\Lambda} was an unknown real constant. Suppose that one was for some reason interested in the analogue of the “Riemann hypothesis” for {P}, namely that all the zeroes of {P} are real. A priori, there are three scenarios:

  • (Riemann hypothesis false) {\Lambda > 0}, and {P} has zeroes {\pm i |\Lambda|^{1/2}} off the real axis.
  • (Riemann hypothesis true, but barely so) {\Lambda = 0}, and both zeroes of {P} are on the real axis; however, any slight perturbation of {\Lambda} in the positive direction would move zeroes off the real axis.
  • (Riemann hypothesis true, with room to spare) {\Lambda < 0}, and both zeroes of {P} are on the real axis. Furthermore, any slight perturbation of {P} will also have both zeroes on the real axis.

The analogue of our result in this case is that {\Lambda \geq 0}, thus ruling out the third of the three scenarios here. In this simple example in which only two zeroes are involved, one can think of the inequality {\Lambda \geq 0} as asserting that if the zeroes of {P} are real, then they must be repeated. In our result (in which there are an infinity of zeroes, that become increasingly dense near infinity), and in view of the convergence to local equilibrium properties of (3), the analogous assertion is that if the zeroes of {H_0} are real, then they do not behave locally as if they were in arithmetic progression.

Kaisa Matomaki, Maksym Radziwill, and I have uploaded to the arXiv our paper “Correlations of the von Mangoldt and higher divisor functions II. Divisor correlations in short ranges“. This is a sequel of sorts to our previous paper on divisor correlations, though the proof techniques in this paper are rather different. As with the previous paper, our interest is in correlations such as

\displaystyle  \sum_{n \leq X} d_k(n) d_l(n+h) \ \ \ \ \ (1)

for medium-sized {h} and large {X}, where {k \geq l \geq 1} are natural numbers and {d_k(n) = \sum_{n = m_1 \dots m_k} 1} is the {k^{th}} divisor function (actually our methods can also treat a generalisation in which {k} is non-integer, but for simplicity let us stick with the integer case for this discussion). Our methods also allow for one of the divisor function factors to be replaced with a von Mangoldt function, but (in contrast to the previous paper) we cannot treat the case when both factors are von Mangoldt.

As discussed in this previous post, one heuristically expects an asymptotic of the form

\displaystyle  \sum_{n \leq X} d_k(n) d_l(n+h) = P_{k,l,h}( \log X ) X + O( X^{1/2+\varepsilon})

for any fixed {\varepsilon>0}, where {P_{k,l,h}} is a certain explicit (but rather complicated) polynomial of degree {k+l-1}. Such asymptotics are known when {l \leq 2}, but remain open for {k \geq l \geq 3}. In the previous paper, we were able to obtain a weaker bound of the form

\displaystyle  \sum_{n \leq X} d_k(n) d_l(n+h) = P_{k,l,h}( \log X ) X + O_A( X \log^{-A} X)

for {1-O_A(\log^{-A} X)} of the shifts {-H \leq h \leq H}, whenever the shift range {H} lies between {X^{8/33+\varepsilon}} and {X^{1-\varepsilon}}. But the methods become increasingly hard to use as {H} gets smaller. In this paper, we use a rather different method to obtain the even weaker bound

\displaystyle  \sum_{n \leq X} d_k(n) d_l(n+h) = (1+o(1)) P_{k,l,h}( \log X ) X

for {1-o(1)} of the shifts {-H \leq h \leq H}, where {H} can now be as short as {H = \log^{10^4 k \log k} X}. The constant {10^4} can be improved, but there are serious obstacles to using our method to go below {\log^{k \log k} X} (as the exceptionally large values of {d_k} then begin to dominate). This can be viewed as an analogue to our previous paper on correlations of bounded multiplicative functions on average, in which the functions {d_k,d_l} are now unbounded, and indeed our proof strategy is based in large part on that paper (but with many significant new technical complications).

We now discuss some of the ingredients of the proof. Unsurprisingly, the first step is the circle method, expressing (1) in terms of exponential sums such as

\displaystyle  S(\alpha) := \sum_{n \leq X} d_k(n) e(\alpha).

Actually, it is convenient to first prune {d_k} slightly by zeroing out this function on “atypical” numbers {n} that have an unusually small or large number of factors in a certain sense, but let us ignore this technicality for this discussion. The contribution of {S(\alpha)} for “major arc” {\alpha} can be treated by standard techniques (and is the source of the main term {P_{k,l,h}(\log X) X}; the main difficulty comes from treating the contribution of “minor arc” {\alpha}.

In our previous paper on bounded multiplicative functions, we used Plancherel’s theorem to estimate the global {L^2} norm {\int_{{\bf R}/{\bf Z}} |S(\alpha)|^2\ d\alpha}, and then also used the Katai-Bourgain-Sarnak-Ziegler orthogonality criterion to control local {L^2} norms {\int_I |S(\alpha)|^2\ d\alpha}, where {I} was a minor arc interval of length about {1/H}, and these two estimates together were sufficient to get a good bound on correlations by an application of Hölder’s inequality. For {d_k}, it is more convenient to use Dirichlet series methods (and Ramaré-type factorisations of such Dirichlet series) to control local {L^2} norms on minor arcs, in the spirit of the proof of the Matomaki-Radziwill theorem; a key point is to develop “log-free” mean value theorems for Dirichlet series associated to functions such as {d_k}, so as not to wipe out the (rather small) savings one will get over the trivial bound from this method. On the other hand, the global {L^2} bound will definitely be unusable, because the {\ell^2} sum {\sum_{n \leq X} d_k(n)^2} has too many unwanted factors of {\log X}. Fortunately, we can substitute this global {L^2} bound with a “large values” bound that controls expressions such as

\displaystyle  \sum_{i=1}^J \int_{I_i} |S(\alpha)|^2\ d\alpha

for a moderate number of disjoint intervals {I_1,\dots,I_J}, with a bound that is slightly better (for {J} a medium-sized power of {\log X}) than what one would have obtained by bounding each integral {\int_{I_i} |S(\alpha)|^2\ d\alpha} separately. (One needs to save more than {J^{1/2}} for the argument to work; we end up saving a factor of about {J^{3/4}}.) This large values estimate is probably the most novel contribution of the paper. After taking the Fourier transform, matters basically reduce to getting a good estimate for

\displaystyle  \sum_{i=1}^J (\int_X^{2X} |\sum_{x \leq n \leq x+H} d_k(n) e(\alpha_i n)|^2\ dx)^{1/2},

where {\alpha_i} is the midpoint of {I_i}; thus we need some upper bound on the large local Fourier coefficients of {d_k}. These coefficients are difficult to calculate directly, but, in the spirit of a paper of Ben Green and myself, we can try to replace {d_k} by a more tractable and “pseudorandom” majorant {\tilde d_k} for which the local Fourier coefficients are computable (on average). After a standard duality argument, one ends up having to control expressions such as

\displaystyle  |\sum_{x \leq n \leq x+H} \tilde d_k(n) e((\alpha_i -\alpha_{i'}) n)|

after various averaging in the {x, i,i'} parameters. These local Fourier coefficients of {\tilde d_k} turn out to be small on average unless {\alpha_i -\alpha_{i'}} is “major arc”. One then is left with a mostly combinatorial problem of trying to bound how often this major arc scenario occurs. This is very close to a computation in the previously mentioned paper of Ben and myself; there is a technical wrinkle in that the {\alpha_i} are not as well separated as they were in my paper with Ben, but it turns out that one can modify the arguments in that paper to still obtain a satisfactory estimate in this case (after first grouping nearby frequencies {\alpha_i} together, and modifying the duality argument accordingly).

A basic object of study in multiplicative number theory are the arithmetic functions: functions {f: {\bf N} \rightarrow {\bf C}} from the natural numbers to the complex numbers. Some fundamental examples of such functions include

Given an arithmetic function {f}, we are often interested in statistics such as the summatory function

\displaystyle \sum_{n \leq x} f(n), \ \ \ \ \ (1)

 

the logarithmically (or harmonically) weighted summatory function

\displaystyle \sum_{n \leq x} \frac{f(n)}{n}, \ \ \ \ \ (2)

 

or the Dirichlet series

\displaystyle {\mathcal D}[f](s) := \sum_n \frac{f(n)}{n^s}.

In the latter case, one typically has to first restrict {s} to those complex numbers whose real part is large enough in order to ensure the series on the right converges; but in many important cases, one can then extend the Dirichlet series to almost all of the complex plane by analytic continuation. One is also interested in correlations involving additive shifts, such as {\sum_{n \leq x} f(n) f(n+h)}, but these are significantly more difficult to study and cannot be easily estimated by the methods of classical multiplicative number theory.

A key operation on arithmetic functions is that of Dirichlet convolution, which when given two arithmetic functions {f,g: {\bf N} \rightarrow {\bf C}}, forms a new arithmetic function {f*g: {\bf N} \rightarrow {\bf C}}, defined by the formula

\displaystyle f*g(n) := \sum_{d|n} f(d) g(\frac{n}{d}).

Thus for instance {1*1 = d_2}, {1 * \Lambda = L}, {1 * \mu = \delta}, and {\delta * f = f} for any arithmetic function {f}. Dirichlet convolution and Dirichlet series are related by the fundamental formula

\displaystyle {\mathcal D}[f * g](s) = {\mathcal D}[f](s) {\mathcal D}[g](s), \ \ \ \ \ (3)

 

at least when the real part of {s} is large enough that all sums involved become absolutely convergent (but in practice one can use analytic continuation to extend this identity to most of the complex plane). There is also the identity

\displaystyle {\mathcal D}[Lf](s) = - \frac{d}{ds} {\mathcal D}[f](s), \ \ \ \ \ (4)

 

at least when the real part of {s} is large enough to justify interchange of differentiation and summation. As a consequence, many Dirichlet series can be expressed in terms of the Riemann zeta function {\zeta = {\mathcal D}[1]}, thus for instance

\displaystyle {\mathcal D}[d_2](s) = \zeta^2(s)

\displaystyle {\mathcal D}[L](s) = - \zeta'(s)

\displaystyle {\mathcal D}[\delta](s) = 1

\displaystyle {\mathcal D}[\mu](s) = \frac{1}{\zeta(s)}

\displaystyle {\mathcal D}[\Lambda](s) = -\frac{\zeta'(s)}{\zeta(s)}.

Much of the difficulty of multiplicative number theory can be traced back to the discrete nature of the natural numbers {{\bf N}}, which form a rather complicated abelian semigroup with respect to multiplication (in particular the set of generators is the set of prime numbers). One can obtain a simpler analogue of the subject by working instead with the half-infinite interval {{\bf N}_\infty := [1,+\infty)}, which is a much simpler abelian semigroup under multiplication (being a one-dimensional Lie semigroup). (I will think of this as a sort of “completion” of {{\bf N}} at the infinite place {\infty}, hence the terminology.) Accordingly, let us define a continuous arithmetic function to be a locally integrable function {f: {\bf N}_\infty \rightarrow {\bf C}}. The analogue of the summatory function (1) is then an integral

\displaystyle \int_1^x f(t)\ dt,

and similarly the analogue of (2) is

\displaystyle \int_1^x \frac{f(t)}{t}\ dt.

The analogue of the Dirichlet series is the Mellin-type transform

\displaystyle {\mathcal D}_\infty[f](s) := \int_1^\infty \frac{f(t)}{t^s}\ dt,

which will be well-defined at least if the real part of {s} is large enough and if the continuous arithmetic function {f: {\bf N}_\infty \rightarrow {\bf C}} does not grow too quickly, and hopefully will also be defined elsewhere in the complex plane by analytic continuation.

For instance, the continuous analogue of the discrete constant function {1: {\bf N} \rightarrow {\bf C}} would be the constant function {1_\infty: {\bf N}_\infty \rightarrow {\bf C}}, which maps any {t \in [1,+\infty)} to {1}, and which we will denote by {1_\infty} in order to keep it distinct from {1}. The two functions {1_\infty} and {1} have approximately similar statistics; for instance one has

\displaystyle \sum_{n \leq x} 1 = \lfloor x \rfloor \approx x-1 = \int_1^x 1\ dt

and

\displaystyle \sum_{n \leq x} \frac{1}{n} = H_{\lfloor x \rfloor} \approx \log x = \int_1^x \frac{1}{t}\ dt

where {H_n} is the {n^{th}} harmonic number, and we are deliberately vague as to what the symbol {\approx} means. Continuing this analogy, we would expect

\displaystyle {\mathcal D}[1](s) = \zeta(s) \approx \frac{1}{s-1} = {\mathcal D}_\infty[1_\infty](s)

which reflects the fact that {\zeta} has a simple pole at {s=1} with residue {1}, and no other poles. Note that the identity {{\mathcal D}_\infty[1_\infty](s) = \frac{1}{s-1}} is initially only valid in the region {\mathrm{Re} s > 1}, but clearly the right-hand side can be continued analytically to the entire complex plane except for the pole at {1}, and so one can define {{\mathcal D}_\infty[1_\infty]} in this region also.

In a similar vein, the logarithm function {L: {\bf N} \rightarrow {\bf C}} is approximately similar to the logarithm function {L_\infty: {\bf N}_\infty \rightarrow {\bf C}}, giving for instance the crude form

\displaystyle \sum_{n \leq x} L(n) = \log \lfloor x \rfloor! \approx x \log x - x = \int_1^\infty L_\infty(t)\ dt

of Stirling’s formula, or the Dirichlet series approximation

\displaystyle {\mathcal D}[L](s) = -\zeta'(s) \approx \frac{1}{(s-1)^2} = {\mathcal D}_\infty[L_\infty](s).

The continuous analogue of Dirichlet convolution is multiplicative convolution using the multiplicative Haar measure {\frac{dt}{t}}: given two continuous arithmetic functions {f_\infty, g_\infty: {\bf N}_\infty \rightarrow {\bf C}}, one can define their convolution {f_\infty *_\infty g_\infty: {\bf N}_\infty \rightarrow {\bf C}} by the formula

\displaystyle f_\infty *_\infty g_\infty(t) := \int_1^t f_\infty(s) g_\infty(\frac{t}{s}) \frac{ds}{s}.

Thus for instance {1_\infty * 1_\infty = L_\infty}. A short computation using Fubini’s theorem shows the analogue

\displaystyle D_\infty[f_\infty *_\infty g_\infty](s) = D_\infty[f_\infty](s) D_\infty[g_\infty](s)

of (3) whenever the real part of {s} is large enough that Fubini’s theorem can be justified; similarly, differentiation under the integral sign shows that

\displaystyle D_\infty[L_\infty f_\infty](s) = -\frac{d}{ds} D_\infty[f_\infty](s) \ \ \ \ \ (5)

 

again assuming that the real part of {s} is large enough that differentiation under the integral sign (or some other tool like this, such as the Cauchy integral formula for derivatives) can be justified.

Direct calculation shows that for any complex number {\rho}, one has

\displaystyle \frac{1}{s-\rho} = D_\infty[ t \mapsto t^{\rho-1} ](s)

(at least for the real part of {s} large enough), and hence by several applications of (5)

\displaystyle \frac{1}{(s-\rho)^k} = D_\infty[ t \mapsto \frac{1}{(k-1)!} t^{\rho-1} \log^{k-1} t ](s)

for any natural number {k}. This can lead to the following heuristic: if a Dirichlet series {D[f](s)} behaves like a linear combination of poles {\frac{1}{(s-\rho)^k}}, in that

\displaystyle D[f](s) \approx \sum_\rho \frac{c_\rho}{(s-\rho)^{k_\rho}}

for some set {\rho} of poles and some coefficients {c_\rho} and natural numbers {k_\rho} (where we again are vague as to what {\approx} means, and how to interpret the sum {\sum_\rho} if the set of poles is infinite), then one should expect the arithmetic function {f} to behave like the continuous arithmetic function

\displaystyle t \mapsto \sum_\rho \frac{c_\rho}{(k_\rho-1)!} t^{\rho-1} \log^{k_\rho-1} t.

In particular, if we only have simple poles,

\displaystyle D[f](s) \approx \sum_\rho \frac{c_\rho}{s-\rho}

then we expect to have {f} behave like continuous arithmetic function

\displaystyle t \mapsto \sum_\rho c_\rho t^{\rho-1}.

Integrating this from {1} to {x}, this heuristically suggests an approximation

\displaystyle \sum_{n \leq x} f(n) \approx \sum_\rho c_\rho \frac{x^\rho-1}{\rho}

for the summatory function, and similarly

\displaystyle \sum_{n \leq x} \frac{f(n)}{n} \approx \sum_\rho c_\rho \frac{x^{\rho-1}-1}{\rho-1},

with the convention that {\frac{x^\rho-1}{\rho}} is {\log x} when {\rho=0}, and similarly {\frac{x^{\rho-1}-1}{\rho-1}} is {\log x} when {\rho=1}. One can make these sorts of approximations more rigorous by means of Perron’s formula (or one of its variants) combined with the residue theorem, provided that one has good enough control on the relevant Dirichlet series, but we will not pursue these rigorous calculations here. (But see for instance this previous blog post for some examples.)

For instance, using the more refined approximation

\displaystyle \zeta(s) \approx \frac{1}{s-1} + \gamma

to the zeta function near {s=1}, we have

\displaystyle {\mathcal D}[d_2](s) = \zeta^2(s) \approx \frac{1}{(s-1)^2} + \frac{2 \gamma}{s-1}

we would expect that

\displaystyle d_2 \approx L_\infty + 2 \gamma

and thus for instance

\displaystyle \sum_{n \leq x} d_2(n) \approx x \log x - x + 2 \gamma x

which matches what one actually gets from the Dirichlet hyperbola method (see e.g. equation (44) of this previous post).

Or, noting that {\zeta(s)} has a simple pole at {s=1} and assuming simple zeroes elsewhere, the log derivative {-\zeta'(s)/\zeta(s)} will have simple poles of residue {+1} at {s=1} and {-1} at all the zeroes, leading to the heuristic

\displaystyle {\mathcal D}[\Lambda](s) = -\frac{\zeta'(s)}{\zeta(s)} \approx \frac{1}{s-1} - \sum_\rho \frac{1}{s-\rho}

suggesting that {\Lambda} should behave like the continuous arithmetic function

\displaystyle t \mapsto 1 - \sum_\rho t^{\rho-1}

leading for instance to the summatory approximation

\displaystyle \sum_{n \leq x} \Lambda(n) \approx x - \sum_\rho \frac{x^\rho-1}{\rho}

which is a heuristic form of the Riemann-von Mangoldt explicit formula (see Exercise 45 of these notes for a rigorous version of this formula).

Exercise 1 Go through some of the other explicit formulae listed at this Wikipedia page and give heuristic justifications for them (up to some lower order terms) by similar calculations to those given above.

Given the “adelic” perspective on number theory, I wonder if there are also {p}-adic analogues of arithmetic functions to which a similar set of heuristics can be applied, perhaps to study sums such as {\sum_{n \leq x: n = a \hbox{ mod } p^j} f(n)}. A key problem here is that there does not seem to be any good interpretation of the expression {\frac{1}{t^s}} when {s} is complex and {t} is a {p}-adic number, so it is not clear that one can analyse a Dirichlet series {p}-adically. For similar reasons, we don’t have a canonical way to define {\chi(t)} for a Dirichlet character {\chi} (unless its conductor happens to be a power of {p}), so there doesn’t seem to be much to say in the {q}-aspect either.

Let {\lambda: {\bf N} \rightarrow \{-1,1\}} be the Liouville function, thus {\lambda(n)} is defined to equal {+1} when {n} is the product of an even number of primes, and {-1} when {n} is the product of an odd number of primes. The Chowla conjecture asserts that {\lambda} has the statistics of a random sign pattern, in the sense that

\displaystyle  \lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) = 0 \ \ \ \ \ (1)

for all {k \geq 1} and all distinct natural numbers {h_1,\dots,h_k}, where we use the averaging notation

\displaystyle  \mathbb{E}_{n \leq N} f(n) := \frac{1}{N} \sum_{n \leq N} f(n).

For {k=1}, this conjecture is equivalent to the prime number theorem (as discussed in this previous blog post), but the conjecture remains open for any {k \geq 2}.

In recent years, it has been realised that one can make more progress on this conjecture if one works instead with the logarithmically averaged version

\displaystyle  \lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N}^{\log} \lambda(n+h_1) \dots \lambda(n+h_k) = 0 \ \ \ \ \ (2)

of the conjecture, where we use the logarithmic averaging notation

\displaystyle  \mathbb{E}_{n \leq N}^{\log} f(n) := \frac{\sum_{n \leq N} \frac{f(n)}{n}}{\sum_{n \leq N} \frac{1}{n}}.

Using the summation by parts (or telescoping series) identity

\displaystyle  \sum_{n \leq N} \frac{f(n)}{n} = \sum_{M < N} \frac{1}{M(M+1)} (\sum_{n \leq M} f(n)) + \frac{1}{N} \sum_{n \leq N} f(n) \ \ \ \ \ (3)

it is not difficult to show that the Chowla conjecture (1) for a given {k,h_1,\dots,h_k} implies the logarithmically averaged conjecture (2). However, the converse implication is not at all clear. For instance, for {k=1}, we have already mentioned that the Chowla conjecture

\displaystyle  \lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N} \lambda(n) = 0

is equivalent to the prime number theorem; but the logarithmically averaged analogue

\displaystyle  \lim_{N \rightarrow \infty} \mathbb{E}^{\log}_{n \leq N} \lambda(n) = 0

is significantly easier to show (a proof with the Liouville function {\lambda} replaced by the closely related Möbius function {\mu} is given in this previous blog post). And indeed, significantly more is now known for the logarithmically averaged Chowla conjecture; in this paper of mine I had proven (2) for {k=2}, and in this recent paper with Joni Teravainen, we proved the conjecture for all odd {k} (with a different proof also given here).

In view of this emerging consensus that the logarithmically averaged Chowla conjecture was easier than the ordinary Chowla conjecture, it was thus somewhat of a surprise for me to read a recent paper of Gomilko, Kwietniak, and Lemanczyk who (among other things) established the following statement:

Theorem 1 Assume that the logarithmically averaged Chowla conjecture (2) is true for all {k}. Then there exists a sequence {N_i} going to infinity such that the Chowla conjecture (1) is true for all {k} along that sequence, that is to say

\displaystyle  \lim_{N_i \rightarrow \infty} \mathbb{E}_{n \leq N_i} \lambda(n+h_1) \dots \lambda(n+h_k) = 0

for all {k} and all distinct {h_1,\dots,h_k}.

This implication does not use any special properties of the Liouville function (other than that they are bounded), and in fact proceeds by ergodic theoretic methods, focusing in particular on the ergodic decomposition of invariant measures of a shift into ergodic measures. Ergodic methods have proven remarkably fruitful in understanding these sorts of number theoretic and combinatorial problems, as could already be seen by the ergodic theoretic proof of Szemerédi’s theorem by Furstenberg, and more recently by the work of Frantzikinakis and Host on Sarnak’s conjecture. (My first paper with Teravainen also uses ergodic theory tools.) Indeed, many other results in the subject were first discovered using ergodic theory methods.

On the other hand, many results in this subject that were first proven ergodic theoretically have since been reproven by more combinatorial means; my second paper with Teravainen is an instance of this. As it turns out, one can also prove Theorem 1 by a standard combinatorial (or probabilistic) technique known as the second moment method. In fact, one can prove slightly more:

Theorem 2 Let {k} be a natural number. Assume that the logarithmically averaged Chowla conjecture (2) is true for {2k}. Then there exists a set {{\mathcal N}} of natural numbers of logarithmic density {1} (that is, {\lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N}^{\log} 1_{n \in {\mathcal N}} = 1}) such that

\displaystyle  \lim_{N \rightarrow \infty: N \in {\mathcal N}} \mathbb{E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) = 0

for any distinct {h_1,\dots,h_k}.

It is not difficult to deduce Theorem 1 from Theorem 2 using a diagonalisation argument. Unfortunately, the known cases of the logarithmically averaged Chowla conjecture ({k=2} and odd {k}) are currently insufficient to use Theorem 2 for any purpose other than to reprove what is already known to be true from the prime number theorem. (Indeed, the even cases of Chowla, in either logarithmically averaged or non-logarithmically averaged forms, seem to be far more powerful than the odd cases; see Remark 1.7 of this paper of myself and Teravainen for a related observation in this direction.)

We now sketch the proof of Theorem 2. For any distinct {h_1,\dots,h_k}, we take a large number {H} and consider the limiting the second moment

\displaystyle  \limsup_{N \rightarrow \infty} \mathop{\bf E}_{n \leq N}^{\log} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2.

We can expand this as

\displaystyle  \limsup_{N \rightarrow \infty} \mathop{\bf E}_{m,m' \leq H} \mathop{\bf E}_{n \leq N}^{\log} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)

\displaystyle \lambda(n+m'+h_1) \dots \lambda(n+m'+h_k).

If all the {m+h_1,\dots,m+h_k,m'+h_1,\dots,m'+h_k} are distinct, the hypothesis (2) tells us that the inner averages goes to zero as {N \rightarrow \infty}. The remaining averages are {O(1)}, and there are {O( k^2 )} of these averages. We conclude that

\displaystyle  \limsup_{N \rightarrow \infty} \mathop{\bf E}_{n \leq N}^{\log} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2 \ll k^2 / H.

By Markov’s inequality (and (3)), we conclude that for any fixed {h_1,\dots,h_k, H}, there exists a set {{\mathcal N}_{h_1,\dots,h_k,H}} of upper logarithmic density at least {1-k/H^{1/2}}, thus

\displaystyle  \limsup_{N \rightarrow \infty} \mathbb{E}_{n \leq N}^{\log} 1_{n \in {\mathcal N}_{h_1,\dots,h_k,H}} \geq 1 - k/H^{1/2}

such that

\displaystyle  \mathop{\bf E}_{n \leq N} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2 \ll k / H^{1/2}.

By deleting at most finitely many elements, we may assume that {{\mathcal N}_{h_1,\dots,h_k,H}} consists only of elements of size at least {H^2} (say).

For any {H_0}, if we let {{\mathcal N}_{h_1,\dots,h_k, \geq H_0}} be the union of {{\mathcal N}_{h_1,\dots,h_k, H}} for {H \geq H_0}, then {{\mathcal N}_{h_1,\dots,h_k, \geq H_0}} has logarithmic density {1}. By a diagonalisation argument (using the fact that the set of tuples {(h_1,\dots,h_k)} is countable), we can then find a set {{\mathcal N}} of natural numbers of logarithmic density {1}, such that for every {h_1,\dots,h_k,H_0}, every sufficiently large element of {{\mathcal N}} lies in {{\mathcal N}_{h_1,\dots,h_k,\geq H_0}}. Thus for every sufficiently large {N} in {{\mathcal N}}, one has

\displaystyle  \mathop{\bf E}_{n \leq N} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2 \ll k / H^{1/2}.

for some {H \geq H_0} with {N \geq H^2}. By Cauchy-Schwarz, this implies that

\displaystyle  \mathop{\bf E}_{n \leq N} \mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k) \ll k^{1/2} / H^{1/4};

interchanging the sums and using {N \geq H^2} and {H \geq H_0}, this implies that

\displaystyle  \mathop{\bf E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) \ll k^{1/2} / H^{1/4} \leq k^{1/2} / H_0^{1/4}.

We conclude on taking {H_0} to infinity that

\displaystyle  \lim_{N \rightarrow \infty; N \in {\mathcal N}} \mathop{\bf E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) = 0

as required.

Joni Teräväinen and I have just uploaded to the arXiv our paper “Odd order cases of the logarithmically averaged Chowla conjecture“, submitted to J. Numb. Thy. Bordeaux. This paper gives an alternate route to one of the main results of our previous paper, and more specifically reproves the asymptotic

\displaystyle \sum_{n \leq x} \frac{\lambda(n+h_1) \dots \lambda(n+h_k)}{n} = o(\log x) \ \ \ \ \ (1)

 

for all odd {k} and all integers {h_1,\dots,h_k} (that is to say, all the odd order cases of the logarithmically averaged Chowla conjecture). Our previous argument relies heavily on some deep ergodic theory results of Bergelson-Host-Kra, Leibman, and Le (and was applicable to more general multiplicative functions than the Liouville function {\lambda}); here we give a shorter proof that avoids ergodic theory (but instead requires the Gowers uniformity of the (W-tricked) von Mangoldt function, established in several papers of Ben Green, Tamar Ziegler, and myself). The proof follows the lines sketched in the previous blog post. In principle, due to the avoidance of ergodic theory, the arguments here have a greater chance to be made quantitative; however, at present the known bounds on the Gowers uniformity of the von Mangoldt function are qualitative, except at the {U^2} level, which is unfortunate since the first non-trivial odd case {k=3} requires quantitative control on the {U^3} level. (But it may be possible to make the Gowers uniformity bounds for {U^3} quantitative if one assumes GRH, although when one puts everything together, the actual decay rate obtained in (1) is likely to be poor.)

Joni Teräväinen and I have just uploaded to the arXiv our paper “The structure of logarithmically averaged correlations of multiplicative functions, with applications to the Chowla and Elliott conjectures“, submitted to Duke Mathematical Journal. This paper builds upon my previous paper in which I introduced an “entropy decrement method” to prove the two-point (logarithmically averaged) cases of the Chowla and Elliott conjectures. A bit more specifically, I showed that

\displaystyle  \lim_{m \rightarrow \infty} \frac{1}{\log \omega_m} \sum_{x_m/\omega_m \leq n \leq x_m} \frac{g_0(n+h_0) g_1(n+h_1)}{n} = 0

whenever {1 \leq \omega_m \leq x_m} were sequences going to infinity, {h_0,h_1} were distinct integers, and {g_0,g_1: {\bf N} \rightarrow {\bf C}} were {1}-bounded multiplicative functions which were non-pretentious in the sense that

\displaystyle  \liminf_{X \rightarrow \infty} \inf_{|t_j| \leq X} \sum_{p \leq X} \frac{1-\mathrm{Re}( g_j(p) \overline{\chi_j}(p) p^{it_j})}{p} = \infty \ \ \ \ \ (1)

for all Dirichlet characters {\chi_j} and for {j=0,1}. Thus, for instance, one had the logarithmically averaged two-point Chowla conjecture

\displaystyle  \sum_{n \leq x} \frac{\lambda(n) \lambda(n+h)}{n} = o(\log x)

for fixed any non-zero {h}, where {\lambda} was the Liouville function.

One would certainly like to extend these results to higher order correlations than the two-point correlations. This looks to be difficult (though perhaps not completely impossible if one allows for logarithmic averaging): in a previous paper I showed that achieving this in the context of the Liouville function would be equivalent to resolving the logarithmically averaged Sarnak conjecture, as well as establishing logarithmically averaged local Gowers uniformity of the Liouville function. However, in this paper we are able to avoid having to resolve these difficult conjectures to obtain partial results towards the (logarithmically averaged) Chowla and Elliott conjecture. For the Chowla conjecture, we can obtain all odd order correlations, in that

\displaystyle  \sum_{n \leq x} \frac{\lambda(n+h_1) \dots \lambda(n+h_k)}{n} = o(\log x) \ \ \ \ \ (2)

for all odd {k} and all integers {h_1,\dots,h_k} (which, in the odd order case, are no longer required to be distinct). (Superficially, this looks like we have resolved “half of the cases” of the logarithmically averaged Chowla conjecture; but it seems the odd order correlations are significantly easier than the even order ones. For instance, because of the Katai-Bourgain-Sarnak-Ziegler criterion, one can basically deduce the odd order cases of (2) from the even order cases (after allowing for some dilations in the argument {n}).

For the more general Elliott conjecture, we can show that

\displaystyle  \lim_{m \rightarrow \infty} \frac{1}{\log \omega_m} \sum_{x_m/\omega_m \leq n \leq x_m} \frac{g_1(n+h_1) \dots g_k(n+h_k)}{n} = 0

for any {k}, any integers {h_1,\dots,h_k} and any bounded multiplicative functions {g_1,\dots,g_k}, unless the product {g_1 \dots g_k} weakly pretends to be a Dirichlet character {\chi} in the sense that

\displaystyle  \sum_{p \leq X} \frac{1 - \hbox{Re}( g_1 \dots g_k(p) \overline{\chi}(p)}{p} = o(\log\log X).

This can be seen to imply (2) as a special case. Even when {g_1,\dots,g_k} does pretend to be a Dirichlet character {\chi}, we can still say something: if the limits

\displaystyle  f(a) := \lim_{m \rightarrow \infty} \frac{1}{\log \omega_m} \sum_{x_m/\omega_m \leq n \leq x_m} \frac{g_1(n+ah_1) \dots g_k(n+ah_k)}{n}

exist for each {a \in {\bf Z}} (which can be guaranteed if we pass to a suitable subsequence), then {f} is the uniform limit of periodic functions {f_i}, each of which is {\chi}isotypic in the sense that {f_i(ab) = f_i(a) \chi(b)} whenever {a,b} are integers with {b} coprime to the periods of {\chi} and {f_i}. This does not pin down the value of any single correlation {f(a)}, but does put significant constraints on how these correlations may vary with {a}.

Among other things, this allows us to show that all {16} possible length four sign patterns {(\lambda(n+1),\dots,\lambda(n+4)) \in \{-1,+1\}^4} of the Liouville function occur with positive density, and all {65} possible length four sign patterns {(\mu(n+1),\dots,\mu(n+4)) \in \{-1,0,+1\}^4 \backslash \{-1,+1\}^4} occur with the conjectured logarithmic density. (In a previous paper with Matomaki and Radziwill, we obtained comparable results for length three patterns of Liouville and length two patterns of Möbius.)

To describe the argument, let us focus for simplicity on the case of the Liouville correlations

\displaystyle  f(a) := \lim_{X \rightarrow \infty} \frac{1}{\log X} \sum_{n \leq X} \frac{\lambda(n) \lambda(n+a) \dots \lambda(n+(k-1)a)}{n}, \ \ \ \ \ (3)

assuming for sake of discussion that all limits exist. (In the paper, we instead use the device of generalised limits, as discussed in this previous post.) The idea is to combine together two rather different ways to control this function {f}. The first proceeds by the entropy decrement method mentioned earlier, which roughly speaking works as follows. Firstly, we pick a prime {p} and observe that {\lambda(pn)=-\lambda(n)} for any {n}, which allows us to rewrite (3) as

\displaystyle  (-1)^k f(a) = \lim_{X \rightarrow \infty} \frac{1}{\log X}

\displaystyle  \sum_{n \leq X} \frac{\lambda(pn) \lambda(pn+ap) \dots \lambda(pn+(k-1)ap)}{n}.

Making the change of variables {n' = pn}, we obtain

\displaystyle  (-1)^k f(a) = \lim_{X \rightarrow \infty} \frac{1}{\log X}

\displaystyle \sum_{n' \leq pX} \frac{\lambda(n') \lambda(n'+ap) \dots \lambda(n'+(k-1)ap)}{n'} p 1_{p|n'}.

The difference between {n' \leq pX} and {n' \leq X} is negligible in the limit (here is where we crucially rely on the log-averaging), hence

\displaystyle  (-1)^k f(a) = \lim_{X \rightarrow \infty} \frac{1}{\log X} \sum_{n \leq X} \frac{\lambda(n) \lambda(n+ap) \dots \lambda(n+(k-1)ap)}{n} p 1_{p|n}

and thus by (3) we have

\displaystyle  (-1)^k f(a) = f(ap) + \lim_{X \rightarrow \infty} \frac{1}{\log X}

\displaystyle \sum_{n \leq X} \frac{\lambda(n) \lambda(n+ap) \dots \lambda(n+(k-1)ap)}{n} (p 1_{p|n}-1).

The entropy decrement argument can be used to show that the latter limit is small for most {p} (roughly speaking, this is because the factors {p 1_{p|n}-1} behave like independent random variables as {p} varies, so that concentration of measure results such as Hoeffding’s inequality can apply, after using entropy inequalities to decouple somewhat these random variables from the {\lambda} factors). We thus obtain the approximate isotopy property

\displaystyle  (-1)^k f(a) \approx f(ap) \ \ \ \ \ (4)

for most {a} and {p}.

On the other hand, by the Furstenberg correspondence principle (as discussed in these previous posts), it is possible to express {f(a)} as a multiple correlation

\displaystyle  f(a) = \int_X g(x) g(T^a x) \dots g(T^{(k-1)a} x)\ d\mu(x)

for some probability space {(X,\mu)} equipped with a measure-preserving invertible map {T: X \rightarrow X}. Using results of Bergelson-Host-Kra, Leibman, and Le, this allows us to obtain a decomposition of the form

\displaystyle  f(a) = f_1(a) + f_2(a) \ \ \ \ \ (5)

where {f_1} is a nilsequence, and {f_2} goes to zero in density (even along the primes, or constant multiples of the primes). The original work of Bergelson-Host-Kra required ergodicity on {X}, which is very definitely a hypothesis that is not available here; however, the later work of Leibman removed this hypothesis, and the work of Le refined the control on {f_1} so that one still has good control when restricting to primes, or constant multiples of primes.

Ignoring the small error {f_2(a)}, we can now combine (5) to conclude that

\displaystyle  f(a) \approx (-1)^k f_1(ap).

Using the equidistribution theory of nilsequences (as developed in this previous paper of Ben Green and myself), one can break up {f_1} further into a periodic piece {f_0} and an “irrational” or “minor arc” piece {f_3}. The contribution of the minor arc piece {f_3} can be shown to mostly cancel itself out after dilating by primes {p} and averaging, thanks to Vinogradov-type bilinear sum estimates (transferred to the primes). So we end up with

\displaystyle  f(a) \approx (-1)^k f_0(ap),

which already shows (heuristically, at least) the claim that {f} can be approximated by periodic functions {f_0} which are isotopic in the sense that

\displaystyle  f_0(a) \approx (-1)^k f_0(ap).

But if {k} is odd, one can use Dirichlet’s theorem on primes in arithmetic progressions to restrict to primes {p} that are {1} modulo the period of {f_0}, and conclude now that {f_0} vanishes identically, which (heuristically, at least) gives (2).

The same sort of argument works to give the more general bounds on correlations of bounded multiplicative functions. But for the specific task of proving (2), we initially used a slightly different argument that avoids using the ergodic theory machinery of Bergelson-Host-Kra, Leibman, and Le, but replaces it instead with the Gowers uniformity norm theory used to count linear equations in primes. Basically, by averaging (4) in {p} using the “{W}-trick”, as well as known facts about the Gowers uniformity of the von Mangoldt function, one can obtain an approximation of the form

\displaystyle  (-1)^k f(a) \approx {\bf E}_{b: (b,W)=1} f(ab)

where {b} ranges over a large range of integers coprime to some primorial {W = \prod_{p \leq w} p}. On the other hand, by iterating (4) we have

\displaystyle  f(a) \approx f(apq)

for most semiprimes {pq}, and by again averaging over semiprimes one can obtain an approximation of the form

\displaystyle  f(a) \approx {\bf E}_{b: (b,W)=1} f(ab).

For {k} odd, one can combine the two approximations to conclude that {f(a)=0}. (This argument is not given in the current paper, but we plan to detail it in a subsequent one.)

Kaisa Matomaki, Maksym Radziwill, and I have uploaded to the arXiv our paper “Correlations of the von Mangoldt and higher divisor functions I. Long shift ranges“, submitted to Proceedings of the London Mathematical Society. This paper is concerned with the estimation of correlations such as

\displaystyle \sum_{n \leq X} \Lambda(n) \Lambda(n+h) \ \ \ \ \ (1)

for medium-sized {h} and large {X}, where {\Lambda} is the von Mangoldt function; we also consider variants of this sum in which one of the von Mangoldt functions is replaced with a (higher order) divisor function, but for sake of discussion let us focus just on the sum (1). Understanding this sum is very closely related to the problem of finding pairs of primes that differ by {h}; for instance, if one could establish a lower bound

\displaystyle \sum_{n \leq X} \Lambda(n) \Lambda(n+2) \gg X

then this would easily imply the twin prime conjecture.

The (first) Hardy-Littlewood conjecture asserts an asymptotic

\displaystyle \sum_{n \leq X} \Lambda(n) \Lambda(n+h) = {\mathfrak S}(h) X + o(X) \ \ \ \ \ (2)

as {X \rightarrow \infty} for any fixed positive {h}, where the singular series {{\mathfrak S}(h)} is an arithmetic factor arising from the irregularity of distribution of {\Lambda} at small moduli, defined explicitly by

\displaystyle {\mathfrak S}(h) := 2 \Pi_2 \prod_{p|h; p>2} \frac{p-2}{p-1}

when {h} is even, and {{\mathfrak S}(h)=0} when {h} is odd, where

\displaystyle \Pi_2 := \prod_{p>2} (1-\frac{1}{(p-1)^2}) = 0.66016\dots

is (half of) the twin prime constant. See for instance this previous blog post for a a heuristic explanation of this conjecture. From the previous discussion we see that (2) for {h=2} would imply the twin prime conjecture. Sieve theoretic methods are only able to provide an upper bound of the form { \sum_{n \leq X} \Lambda(n) \Lambda(n+h) \ll {\mathfrak S}(h) X}.

Needless to say, apart from the trivial case of odd {h}, there are no values of {h} for which the Hardy-Littlewood conjecture is known. However there are some results that say that this conjecture holds “on the average”: in particular, if {H} is a quantity depending on {X} that is somewhat large, there are results that show that (2) holds for most (i.e. for {1-o(1)}) of the {h} betwen {0} and {H}. Ideally one would like to get {H} as small as possible, in particular one can view the full Hardy-Littlewood conjecture as the endpoint case when {H} is bounded.

The first results in this direction were by van der Corput and by Lavrik, who established such a result with {H = X} (with a subsequent refinement by Balog); Wolke lowered {H} to {X^{5/8+\varepsilon}}, and Mikawa lowered {H} further to {X^{1/3+\varepsilon}}. The main result of this paper is a further lowering of {H} to {X^{8/33+\varepsilon}}. In fact (as in the preceding works) we get a better error term than {o(X)}, namely an error of the shape {O_A( X \log^{-A} X)} for any {A}.

Our arguments initially proceed along standard lines. One can use the Hardy-Littlewood circle method to express the correlation in (2) as an integral involving exponential sums {S(\alpha) := \sum_{n \leq X} \Lambda(n) e(\alpha n)}. The contribution of “major arc” {\alpha} is known by a standard computation to recover the main term {{\mathfrak S}(h) X} plus acceptable errors, so it is a matter of controlling the “minor arcs”. After averaging in {h} and using the Plancherel identity, one is basically faced with establishing a bound of the form

\displaystyle \int_{\beta-1/H}^{\beta+1/H} |S(\alpha)|^2\ d\alpha \ll_A X \log^{-A} X

for any “minor arc” {\beta}. If {\beta} is somewhat close to a low height rational {a/q} (specifically, if it is within {X^{-1/6-\varepsilon}} of such a rational with {q = O(\log^{O(1)} X)}), then this type of estimate is roughly of comparable strength (by another application of Plancherel) to the best available prime number theorem in short intervals on the average, namely that the prime number theorem holds for most intervals of the form {[x, x + x^{1/6+\varepsilon}]}, and we can handle this case using standard mean value theorems for Dirichlet series. So we can restrict attention to the “strongly minor arc” case where {\beta} is far from such rationals.

The next step (following some ideas we found in a paper of Zhan) is to rewrite this estimate not in terms of the exponential sums {S(\alpha) := \sum_{n \leq X} \Lambda(n) e(\alpha n)}, but rather in terms of the Dirichlet polynomial {F(s) := \sum_{n \sim X} \frac{\Lambda(n)}{n^s}}. After a certain amount of computation (including some oscillatory integral estimates arising from stationary phase), one is eventually reduced to the task of establishing an estimate of the form

\displaystyle \int_{t \sim \lambda X} (\sum_{t-\lambda H}^{t+\lambda H} |F(\frac{1}{2}+it')|\ dt')^2\ dt \ll_A \lambda^2 H^2 X \log^{-A} X

for any {X^{-1/6-\varepsilon} \ll \lambda \ll \log^{-B} X} (with {B} sufficiently large depending on {A}).

The next step, which is again standard, is the use of the Heath-Brown identity (as discussed for instance in this previous blog post) to split up {\Lambda} into a number of components that have a Dirichlet convolution structure. Because the exponent {8/33} we are shooting for is less than {1/4}, we end up with five types of components that arise, which we call “Type {d_1}“, “Type {d_2}“, “Type {d_3}“, “Type {d_4}“, and “Type II”. The “Type II” sums are Dirichlet convolutions involving a factor supported on a range {[X^\varepsilon, X^{-\varepsilon} H]} and is quite easy to deal with; the “Type {d_j}” terms are Dirichlet convolutions that resemble (non-degenerate portions of) the {j^{th}} divisor function, formed from convolving together {j} portions of {1}. The “Type {d_1}” and “Type {d_2}” terms can be estimated satisfactorily by standard moment estimates for Dirichlet polynomials; this already recovers the result of Mikawa (and our argument is in fact slightly more elementary in that no Kloosterman sum estimates are required). It is the treatment of the “Type {d_3}” and “Type {d_4}” sums that require some new analysis, with the Type {d_3} terms turning to be the most delicate. After using an existing moment estimate of Jutila for Dirichlet L-functions, matters reduce to obtaining a family of estimates, a typical one of which (relating to the more difficult Type {d_3} sums) is of the form

\displaystyle \int_{t - H}^{t+H} |M( \frac{1}{2} + it')|^2\ dt' \ll X^{\varepsilon^2} H \ \ \ \ \ (3)

for “typical” ordinates {t} of size {X}, where {M} is the Dirichlet polynomial {M(s) := \sum_{n \sim X^{1/3}} \frac{1}{n^s}} (a fragment of the Riemann zeta function). The precise definition of “typical” is a little technical (because of the complicated nature of Jutila’s estimate) and will not be detailed here. Such a claim would follow easily from the Lindelof hypothesis (which would imply that {M(1/2 + it) \ll X^{o(1)}}) but of course we would like to have an unconditional result.

At this point, having exhausted all the Dirichlet polynomial estimates that are usefully available, we return to “physical space”. Using some further Fourier-analytic and oscillatory integral computations, we can estimate the left-hand side of (3) by an expression that is roughly of the shape

\displaystyle \frac{H}{X^{1/3}} \sum_{\ell \sim X^{1/3}/H} |\sum_{m \sim X^{1/3}} e( \frac{t}{2\pi} \log \frac{m+\ell}{m-\ell} )|.

The phase {\frac{t}{2\pi} \log \frac{m+\ell}{m-\ell}} can be Taylor expanded as the sum of {\frac{t_j \ell}{\pi m}} and a lower order term {\frac{t_j \ell^3}{3\pi m^3}}, plus negligible errors. If we could discard the lower order term then we would get quite a good bound using the exponential sum estimates of Robert and Sargos, which control averages of exponential sums with purely monomial phases, with the averaging allowing us to exploit the hypothesis that {t} is “typical”. Figuring out how to get rid of this lower order term caused some inefficiency in our arguments; the best we could do (after much experimentation) was to use Fourier analysis to shorten the sums, estimate a one-parameter average exponential sum with a binomial phase by a two-parameter average with a monomial phase, and then use the van der Corput {B} process followed by the estimates of Robert and Sargos. This rather complicated procedure works up to {H = X^{8/33+\varepsilon}} it may be possible that some alternate way to proceed here could improve the exponent somewhat.

In a sequel to this paper, we will use a somewhat different method to reduce {H} to a much smaller value of {\log^{O(1)} X}, but only if we replace the correlations {\sum_{n \leq X} \Lambda(n) \Lambda(n+h)} by either {\sum_{n \leq X} \Lambda(n) d_k(n+h)} or {\sum_{n \leq X} d_k(n) d_l(n+h)}, and also we now only save a {o(1)} in the error term rather than {O_A(\log^{-A} X)}.

Given a function {f: {\bf N} \rightarrow \{-1,+1\}} on the natural numbers taking values in {+1, -1}, one can invoke the Furstenberg correspondence principle to locate a measure preserving system {T \circlearrowright (X, \mu)} – a probability space {(X,\mu)} together with a measure-preserving shift {T: X \rightarrow X} (or equivalently, a measure-preserving {{\bf Z}}-action on {(X,\mu)}) – together with a measurable function (or “observable”) {F: X \rightarrow \{-1,+1\}} that has essentially the same statistics as {f} in the sense that

\displaystyle \lim \inf_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k)

\displaystyle \leq \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x)

\displaystyle \leq \lim \sup_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k)

for any integers {h_1,\dots,h_k}. In particular, one has

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x) = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k) \ \ \ \ \ (1)

 

whenever the limit on the right-hand side exists. We will refer to the system {T \circlearrowright (X,\mu)} together with the designated function {F} as a Furstenberg limit ot the sequence {f}. These Furstenberg limits capture some, but not all, of the asymptotic behaviour of {f}; roughly speaking, they control the typical “local” behaviour of {f}, involving correlations such as {\frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k)} in the regime where {h_1,\dots,h_k} are much smaller than {N}. However, the control on error terms here is usually only qualitative at best, and one usually does not obtain non-trivial control on correlations in which the {h_1,\dots,h_k} are allowed to grow at some significant rate with {N} (e.g. like some power {N^\theta} of {N}).

The correspondence principle is discussed in these previous blog posts. One way to establish the principle is by introducing a Banach limit {p\!-\!\lim: \ell^\infty({\bf N}) \rightarrow {\bf R}} that extends the usual limit functional on the subspace of {\ell^\infty({\bf N})} consisting of convergent sequences while still having operator norm one. Such functionals cannot be constructed explicitly, but can be proven to exist (non-constructively and non-uniquely) using the Hahn-Banach theorem; one can also use a non-principal ultrafilter here if desired. One can then seek to construct a system {T \circlearrowright (X,\mu)} and a measurable function {F: X \rightarrow \{-1,+1\}} for which one has the statistics

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x) = p\!-\!\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k) \ \ \ \ \ (2)

 

for all {h_1,\dots,h_k}. One can explicitly construct such a system as follows. One can take {X} to be the Cantor space {\{-1,+1\}^{\bf Z}} with the product {\sigma}-algebra and the shift

\displaystyle T ( (x_n)_{n \in {\bf Z}} ) := (x_{n+1})_{n \in {\bf Z}}

with the function {F: X \rightarrow \{-1,+1\}} being the coordinate function at zero:

\displaystyle F( (x_n)_{n \in {\bf Z}} ) := x_0

(so in particular {F( T^h (x_n)_{n \in {\bf Z}} ) = x_h} for any {h \in {\bf Z}}). The only thing remaining is to construct the invariant measure {\mu}. In order to be consistent with (2), one must have

\displaystyle \mu( \{ (x_n)_{n \in {\bf Z}}: x_{h_j} = \epsilon_j \forall 1 \leq j \leq k \} )

\displaystyle = p\!-\!\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N 1_{f(n+h_1)=\epsilon_1} \dots 1_{f(n+h_k)=\epsilon_k}

for any distinct integers {h_1,\dots,h_k} and signs {\epsilon_1,\dots,\epsilon_k}. One can check that this defines a premeasure on the Boolean algebra of {\{-1,+1\}^{\bf Z}} defined by cylinder sets, and the existence of {\mu} then follows from the Hahn-Kolmogorov extension theorem (or the closely related Kolmogorov extension theorem). One can then check that the correspondence (2) holds, and that {\mu} is translation-invariant; the latter comes from the translation invariance of the (Banach-)Césaro averaging operation {f \mapsto p\!-\!\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n)}. A variant of this construction shows that the Furstenberg limit is unique up to equivalence if and only if all the limits appearing in (1) actually exist.

One can obtain a slightly tighter correspondence by using a smoother average than the Césaro average. For instance, one can use the logarithmic Césaro averages {\lim_{N \rightarrow \infty} \frac{1}{\log N}\sum_{n=1}^N \frac{f(n)}{n}} in place of the Césaro average {\sum_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n)}, thus one replaces (2) by

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x)

\displaystyle = p\!-\!\lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{f(n+h_1) \dots f(n+h_k)}{n}.

Whenever the Césaro average of a bounded sequence {f: {\bf N} \rightarrow {\bf R}} exists, then the logarithmic Césaro average exists and is equal to the Césaro average. Thus, a Furstenberg limit constructed using logarithmic Banach-Césaro averaging still obeys (1) for all {h_1,\dots,h_k} when the right-hand side limit exists, but also obeys the more general assertion

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x)

\displaystyle = \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{f(n+h_1) \dots f(n+h_k)}{n}

whenever the limit of the right-hand side exists.

In a recent paper of Frantizinakis, the Furstenberg limits of the Liouville function {\lambda} (with logarithmic averaging) were studied. Some (but not all) of the known facts and conjectures about the Liouville function can be interpreted in the Furstenberg limit. For instance, in a recent breakthrough result of Matomaki and Radziwill (discussed previously here), it was shown that the Liouville function exhibited cancellation on short intervals in the sense that

\displaystyle \lim_{H \rightarrow \infty} \limsup_{X \rightarrow \infty} \frac{1}{X} \int_X^{2X} \frac{1}{H} |\sum_{x \leq n \leq x+H} \lambda(n)|\ dx = 0.

In terms of Furstenberg limits of the Liouville function, this assertion is equivalent to the assertion that

\displaystyle \lim_{H \rightarrow \infty} \int_X |\frac{1}{H} \sum_{h=1}^H F(T^h x)|\ d\mu(x) = 0

for all Furstenberg limits {T \circlearrowright (X,\mu), F} of Liouville (including those without logarithmic averaging). Invoking the mean ergodic theorem (discussed in this previous post), this assertion is in turn equivalent to the observable {F} that corresponds to the Liouville function being orthogonal to the invariant factor {L^\infty(X,\mu)^{\bf Z} = \{ g \in L^\infty(X,\mu): g \circ T = g \}} of {X}; equivalently, the first Gowers-Host-Kra seminorm {\|F\|_{U^1(X)}} of {F} (as defined for instance in this previous post) vanishes. The Chowla conjecture, which asserts that

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N \lambda(n+h_1) \dots \lambda(n+h_k) = 0

for all distinct integers {h_1,\dots,h_k}, is equivalent to the assertion that all the Furstenberg limits of Liouville are equivalent to the Bernoulli system ({\{-1,+1\}^{\bf Z}} with the product measure arising from the uniform distribution on {\{-1,+1\}}, with the shift {T} and observable {F} as before). Similarly, the logarithmically averaged Chowla conjecture

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(n+h_1) \dots \lambda(n+h_k)}{n} = 0

is equivalent to the assertion that all the Furstenberg limits of Liouville with logarithmic averaging are equivalent to the Bernoulli system. Recently, I was able to prove the two-point version

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(n) \lambda(n+h)}{n} = 0 \ \ \ \ \ (3)

 

of the logarithmically averaged Chowla conjecture, for any non-zero integer {h}; this is equivalent to the perfect strong mixing property

\displaystyle \int_X F(x) F(T^h x)\ d\mu(x) = 0

for any Furstenberg limit of Liouville with logarithmic averaging, and any {h \neq 0}.

The situation is more delicate with regards to the Sarnak conjecture, which is equivalent to the assertion that

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N \lambda(n) f(n) = 0

for any zero-entropy sequence {f: {\bf N} \rightarrow {\bf R}} (see this previous blog post for more discussion). Morally speaking, this conjecture should be equivalent to the assertion that any Furstenberg limit of Liouville is disjoint from any zero entropy system, but I was not able to formally establish an implication in either direction due to some technical issues regarding the fact that the Furstenberg limit does not directly control long-range correlations, only short-range ones. (There are however ergodic theoretic interpretations of the Sarnak conjecture that involve the notion of generic points; see this paper of El Abdalaoui, Lemancyk, and de la Rue.) But the situation is currently better with the logarithmically averaged Sarnak conjecture

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(n) f(n)}{n} = 0,

as I was able to show that this conjecture was equivalent to the logarithmically averaged Chowla conjecture, and hence to all Furstenberg limits of Liouville with logarithmic averaging being Bernoulli; I also showed the conjecture was equivalent to local Gowers uniformity of the Liouville function, which is in turn equivalent to the function {F} having all Gowers-Host-Kra seminorms vanishing in every Furstenberg limit with logarithmic averaging. In this recent paper of Frantzikinakis, this analysis was taken further, showing that the logarithmically averaged Chowla and Sarnak conjectures were in fact equivalent to the much milder seeming assertion that all Furstenberg limits with logarithmic averaging were ergodic.

Actually, the logarithmically averaged Furstenberg limits have more structure than just a {{\bf Z}}-action on a measure preserving system {(X,\mu)} with a single observable {F}. Let {Aff_+({\bf Z})} denote the semigroup of affine maps {n \mapsto an+b} on the integers with {a,b \in {\bf Z}} and {a} positive. Also, let {\hat {\bf Z}} denote the profinite integers (the inverse limit of the cyclic groups {{\bf Z}/q{\bf Z}}). Observe that {Aff_+({\bf Z})} acts on {\hat {\bf Z}} by taking the inverse limit of the obvious actions of {Aff_+({\bf Z})} on {{\bf Z}/q{\bf Z}}.

Proposition 1 (Enriched logarithmically averaged Furstenberg limit of Liouville) Let {p\!-\!\lim} be a Banach limit. Then there exists a probability space {(X,\mu)} with an action {\phi \mapsto T^\phi} of the affine semigroup {Aff_+({\bf Z})}, as well as measurable functions {F: X \rightarrow \{-1,+1\}} and {M: X \rightarrow \hat {\bf Z}}, with the following properties:

  • (i) (Affine Furstenberg limit) For any {\phi_1,\dots,\phi_k \in Aff_+({\bf Z})}, and any congruence class {a\ (q)}, one has

    \displaystyle p\!-\!\lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(\phi_1(n)) \dots \lambda(\phi_k(n)) 1_{n = a\ (q)}}{n}

    \displaystyle = \int_X F( T^{\phi_1}(x) ) \dots F( T^{\phi_k}(x) ) 1_{M(x) = a\ (q)}\ d\mu(x).

  • (ii) (Equivariance of {M}) For any {\phi \in Aff_+({\bf Z})}, one has

    \displaystyle M( T^\phi(x) ) = \phi( M(x) )

    for {\mu}-almost every {x \in X}.

  • (iii) (Multiplicativity at fixed primes) For any prime {p}, one has

    \displaystyle F( T^{p\cdot} x ) = - F(x)

    for {\mu}-almost every {x \in X}, where {p \cdot \in Aff_+({\bf Z})} is the dilation map {n \mapsto pn}.

  • (iv) (Measure pushforward) If {\phi \in Aff_+({\bf Z})} is of the form {\phi(n) = an+b} and {S_\phi \subset X} is the set {S_\phi = \{ x \in X: M(x) \in \phi(\hat {\bf Z}) \}}, then the pushforward {T^\phi_* \mu} of {\mu} by {\phi} is equal to {a \mu\downharpoonright_{S_\phi}}, that is to say one has

    \displaystyle \mu( (T^\phi)^{-1}(E) ) = a \mu( E \cap S_\phi )

    for every measurable {E \subset X}.

Note that {{\bf Z}} can be viewed as the subgroup of {Aff_+({\bf Z})} consisting of the translations {n \mapsto n + b}. If one only keeps the {{\bf Z}}-portion of the {Aff_+({\bf Z})} action and forgets the rest (as well as the function {M}) then the action becomes measure-preserving, and we recover an ordinary Furstenberg limit with logarithmic averaging. However, the additional structure here can be quite useful; for instance, one can transfer the proof of (3) to this setting, which we sketch below the fold, after proving the proposition.

The observable {M}, roughly speaking, means that points {x} in the Furstenberg limit {X} constructed by this proposition are still “virtual integers” in the sense that one can meaningfully compute the residue class of {x} modulo any natural number modulus {q}, by first applying {M} and then reducing mod {q}. The action of {Aff_+({\bf Z})} means that one can also meaningfully multiply {x} by any natural number, and translate it by any integer. As with other applications of the correspondence principle, the main advantage of moving to this more “virtual” setting is that one now acquires a probability measure {\mu}, so that the tools of ergodic theory can be readily applied.

Read the rest of this entry »

Given a random variable {X} that takes on only finitely many values, we can define its Shannon entropy by the formula

\displaystyle  H(X) := \sum_x \mathbf{P}(X=x) \log \frac{1}{\mathbf{P}(X=x)}

with the convention that {0 \log \frac{1}{0} = 0}. (In some texts, one uses the logarithm to base {2} rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables {X,Y} taking on finitely many values, the joint variable {(X,Y)} is also a random variable taking on finitely many values, and also has an entropy {H(X,Y)}. It obeys the Shannon inequalities

\displaystyle  H(X), H(Y) \leq H(X,Y) \leq H(X) + H(Y)

so we can define some further nonnegative quantities, the mutual information

\displaystyle  I(X:Y) := H(X) + H(Y) - H(X,Y)

and the conditional entropies

\displaystyle  H(X|Y) := H(X,Y) - H(Y); \quad H(Y|X) := H(X,Y) - H(X).

More generally, given three random variables {X,Y,Z}, one can define the conditional mutual information

\displaystyle  I(X:Y|Z) := H(X|Z) + H(Y|Z) - H(X,Y|Z)

and the final of the Shannon entropy inequalities asserts that this quantity is also non-negative.

The mutual information {I(X:Y)} is a measure of the extent to which {X} and {Y} fail to be independent; indeed, it is not difficult to show that {I(X:Y)} vanishes if and only if {X} and {Y} are independent. Similarly, {I(X:Y|Z)} vanishes if and only if {X} and {Y} are conditionally independent relative to {Z}. At the other extreme, {H(X|Y)} is a measure of the extent to which {X} fails to depend on {Y}; indeed, it is not difficult to show that {H(X|Y)=0} if and only if {X} is determined by {Y} in the sense that there is a deterministic function {f} such that {X = f(Y)}. In a related vein, if {X} and {X'} are equivalent in the sense that there are deterministic functional relationships {X = f(X')}, {X' = g(X)} between the two variables, then {X} is interchangeable with {X'} for the purposes of computing the above quantities, thus for instance {H(X) = H(X')}, {H(X,Y) = H(X',Y)}, {I(X:Y) = I(X':Y)}, {I(X:Y|Z) = I(X':Y|Z)}, etc..

One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables {X} being considered come from restricting a single random (and uniformly distributed) boolean function {F: \Omega \rightarrow \{0,1\}} on a given finite domain {\Omega} to some subset {A} of {\Omega}:

\displaystyle  X = F \downharpoonright_A.

In this case, {X} has the law of a random uniformly distributed boolean function from {A} to {\{0,1\}}, and the entropy here can be easily computed to be {|A| \log 2}, where {|A|} denotes the cardinality of {A}. If {X} is the restriction of {F} to {A}, and {Y} is the restriction of {F} to {B}, then the joint variable {(X,Y)} is equivalent to the restriction of {F} to {A \cup B}. If one discards the normalisation factor {\log 2}, one then obtains the following dictionary between entropy and the combinatorics of finite sets:

Random variables {X,Y,Z} Finite sets {A,B,C}
Entropy {H(X)} Cardinality {|A|}
Joint variable {(X,Y)} Union {A \cup B}
Mutual information {I(X:Y)} Intersection cardinality {|A \cap B|}
Conditional entropy {H(X|Y)} Set difference cardinality {|A \backslash B|}
Conditional mutual information {I(X:Y|Z)} {|(A \cap B) \backslash C|}
{X, Y} independent {A, B} disjoint
{X} determined by {Y} {A} a subset of {B}
{X,Y} conditionally independent relative to {Z} {A \cap B \subset C}

Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality {H(X,Y) \leq H(X)+H(Y)} becomes the union bound {|A \cup B| \leq |A| + |B|}, and the definition of mutual information becomes the inclusion-exclusion formula

\displaystyle  |A \cap B| = |A| + |B| - |A \cup B|.

For a more advanced example, consider the data processing inequality that asserts that if {X, Z} are conditionally independent relative to {Y}, then {I(X:Z) \leq I(X:Y)}. Specialising to sets, this now says that if {A, C} are disjoint outside of {B}, then {|A \cap C| \leq |A \cap B|}; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if {A} and {C} are not necessarily disjoint outside of {B}, then a consideration of Venn diagrams gives the more general inequality

\displaystyle  |A \cap C| \leq |A \cap B| + |(A \cap C) \backslash B|

and a further inspection of the diagram then reveals the more precise identity

\displaystyle  |A \cap C| + |(A \cap B) \backslash C| = |A \cap B| + |(A \cap C) \backslash B|.

Using the dictionary in the reverse direction, one is then led to conjecture the identity

\displaystyle  I( X : Z ) + I( X : Y | Z ) = I( X : Y ) + I( X : Z | Y )

which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.

On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set {C} is preserved by unions:

\displaystyle  A \cap C = B \cap C = \emptyset \implies (A \cup B) \cap C = \emptyset.

Indeed, one has the union bound

\displaystyle  |(A \cup B) \cap C| \leq |A \cap C| + |B \cap C|. \ \ \ \ \ (1)

Applying the dictionary in the reverse direction, one might now conjecture that if {X} was independent of {Z} and {Y} was independent of {Z}, then {(X,Y)} should also be independent of {Z}, and furthermore that

\displaystyle  I(X,Y:Z) \leq I(X:Z) + I(Y:Z)

but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take {X, Y \in {\bf F}_2} to be independent, uniformly distributed random elements of the finite field {{\bf F}_2} of two elements, and take {Z := X+Y} to be the sum of these two field elements. One can easily check that each of {X} and {Y} is separately independent of {Z}, but the joint variable {(X,Y)} determines {Z} and thus is not independent of {Z}.

From the inclusion-exclusion identities

\displaystyle  |A \cap C| = |A| + |C| - |A \cup C|

\displaystyle  |B \cap C| = |B| + |C| - |B \cup C|

\displaystyle  |(A \cup B) \cap C| = |A \cup B| + |C| - |A \cup B \cup C|

\displaystyle  |A \cap B \cap C| = |A| + |B| + |C| - |A \cup B| - |B \cup C| - |A \cup C|

\displaystyle + |A \cup B \cup C|

one can check that (1) is equivalent to the trivial lower bound {|A \cap B \cap C| \geq 0}. The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection {A \cap B \cap C}. (Even the double intersection {A \cap B} only exists information theoretically in a “virtual” sense; the mutual information {I(X:Y)} allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)

However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities {H(X), H(Y), H(X,Y), I(X:Y), H(X|Y), H(Y|X)} associated to just two variables {X,Y} are those that are also necessarily obeyed by their combinatorial analogues {|A|, |B|, |A \cup B|, |A \cap B|, |A \backslash B|, |B \backslash A|}. (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)

One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let {S} be some finite-dimensional vector space over a finite field {{\mathbf F}}, and let {f: S \rightarrow {\mathbf F}} be a random linear functional on {S}, selected uniformly among all such functions. Every subspace {U} of {S} then gives rise to a random variable {X = X_U: U \rightarrow {\mathbf F}} formed by restricting {f} to {U}. This random variable is also distributed uniformly amongst all linear functions on {U}, and its entropy can be easily computed to be {\mathrm{dim}(U) \log |\mathbf{F}|}. Given two random variables {X, Y} formed by restricting {f} to {U, V} respectively, the joint random variable {(X,Y)} determines the random linear function {f} on the union {U \cup V} on the two spaces, and thus by linearity on the Minkowski sum {U+V} as well; thus {(X,Y)} is equivalent to the restriction of {f} to {U+V}. In particular, {H(X,Y) = \mathrm{dim}(U+V) \log |\mathbf{F}|}. This implies that {I(X:Y) = \mathrm{dim}(U \cap V) \log |\mathbf{F}|} and also {H(X|Y) = \mathrm{dim}(\pi_V(U)) \log |\mathbf{F}|}, where {\pi_V: S \rightarrow S/V} is the quotient map. After discarding the normalising constant {\log |\mathbf{F}|}, this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:

Random variables {X,Y,Z} Subspaces {U,V,W}
Entropy {H(X)} Dimension {\mathrm{dim}(U)}
Joint variable {(X,Y)} Sum {U+V}
Mutual information {I(X:Y)} Dimension of intersection {\mathrm{dim}(U \cap V)}
Conditional entropy {H(X|Y)} Dimension of projection {\mathrm{dim}(\pi_V(U))}
Conditional mutual information {I(X:Y|Z)} {\mathrm{dim}(\pi_W(U) \cap \pi_W(V))}
{X, Y} independent {U, V} transverse ({U \cap V = \{0\}})
{X} determined by {Y} {U} a subspace of {V}
{X,Y} conditionally independent relative to {Z} {\pi_W(U)}, {\pi_W(V)} transverse.

The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking {S} to be the vector space {\mathbf{F}_2^\Omega} over the finite field {\mathbf{F}_2} of two elements, and only considering those subspaces {U} that are coordinate subspaces {U = {\bf F}_2^A} associated to various subsets {A} of {\Omega}.

As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of {\log |\mathbf{F}|}). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).

The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces {U,V,W} such that {U} and {V} are separately transverse to {W}, but their sum {U+V} is not; for instance, in a two-dimensional vector space {{\bf F}^2}, one can take {U,V,W} to be the one-dimensional subspaces spanned by {(0,1)}, {(1,0)}, and {(1,1)} respectively. Note that this is essentially the same counterexample from before (which took {{\bf F}} to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces {U,V,W} (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables {X,Y,Z} (as well as the corresponding quantities on the above table).

However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality

\displaystyle  \mathrm{dim}(U \cap V) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V)) + \mathrm{dim}(\pi_X(U) \cap \pi_X(V)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (2)

for any subspaces {U,V,W,X}. This is easiest to see when the three terms on the right-hand side vanish; then {\pi_W(U), \pi_W(V)} are transverse, which implies that {U\cap V \subset W}; similarly {U \cap V \subset X}. But {W} and {X} are transverse, and this clearly implies that {U} and {V} are themselves transverse. To prove the general case of Ingleton’s inequality, one can define {Y := U \cap V} and use {\mathrm{dim}(\pi_W(Y)) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V))} (and similarly for {X} instead of {W}) to reduce to establishing the inequality

\displaystyle  \mathrm{dim}(Y) \leq \mathrm{dim}(\pi_W(Y)) + \mathrm{dim}(\pi_X(Y)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (3)

which can be rearranged using {\mathrm{dim}(\pi_W(Y)) = \mathrm{dim}(Y) - \mathrm{dim}(W) + \mathrm{dim}(\pi_Y(W))} (and similarly for {X} instead of {W}) and {\mathrm{dim}(W \cap X) = \mathrm{dim}(W) + \mathrm{dim}(X) - \mathrm{dim}(W + X)} as

\displaystyle  \mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W)) + \mathrm{dim}(\pi_Y(X)) + \mathrm{dim}(Y)

but this is clear since {\mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W) + \pi_Y(X)) + \mathrm{dim}(Y)}.

Returning to the entropy setting, the analogue

\displaystyle  H( V ) \leq H( V | Z ) + H(V | W ) + I(Z:W)

of (3) is true (exercise!), but the analogue

\displaystyle  I(X:Y) \leq I(X:Y|Z) + I(X:Y|W) + I(Z:W) \ \ \ \ \ (4)

of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then {X,Y} are conditionally independent relative to {Z}, and relative to {W}, and {Z} and {W} are independent, and the claim (4) would then be asserting that {X} and {Y} are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take {Z,W} to be independent uniform variables from {\mathbf{F}_2}, and take {X} and {Y} to be (say) {ZW} and {(1-Z)(1-W)} respectively (thus {X, Y} are the indicators of the events {Z=W=1} and {Z=W=0} respectively). Once one conditions on either {Z} or {W}, one of {X,Y} has positive conditional entropy and the other has zero entropy, and so {X, Y} are conditionally independent relative to either {Z} or {W}; also, {Z} or {W} are independent of each other. But {X} and {Y} are not independent of each other (they cannot be simultaneously equal to {1}). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces {U, V} has a well-defined intersection {U \cap V} that is also a subspace, whereas for arbitrary random variables {X, Y}, there does not necessarily exist the analogue of an intersection, namely a “common information” random variable {V} that has the entropy of {I(X:Y)} and is determined either by {X} or by {Y}.

I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality

\displaystyle  I(X:Y) \leq 2 I(X:Y|Z) + I(X:Z|Y) + I(Y:Z|X)

\displaystyle + I(X:Y|W) + I(Z:W).

One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that

\displaystyle  |\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n}| \leq \varepsilon x \ \ \ \ \ (5)

whenever {x} was sufficiently large depending on {\varepsilon>0}, where {\lambda} is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale {H} between {1} and {x}, one can form certain random variables {X_H, Y_H}. The random variable {X_H} is a sign pattern of the form {(\lambda(n+1),\dots,\lambda(n+H))} where {n} is a random number chosen from {1} to {x} (with logarithmic weighting). The random variable {Y_H} was tuple {(n \hbox{ mod } p)_{p \sim \varepsilon^2 H}} of reductions of {n} to primes {p} comparable to {\varepsilon^2 H}. Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of {\lambda}, the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound

\displaystyle  I( X_H : Y_H ) \gg \varepsilon^7 \frac{H}{\log H}

on the mutual information between {X_H} and {Y_H}. From translation invariance, this also gives the more general lower bound

\displaystyle  I( X_{H_0,H} : Y_H ) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (6)

for any {H_0}, where {X_{H_0,H}} denotes the shifted sign pattern {(\lambda(n+H_0+1),\dots,\lambda(n+H_0+H))}. On the other hand, one had the entropy bounds

\displaystyle  H( X_{H_0,H} ), H(Y_H) \ll H

and from concatenating sign patterns one could see that {X_{H_0,H+H'}} is equivalent to the joint random variable {(X_{H_0,H}, X_{H_0+H,H'})} for any {H_0,H,H'}. Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once {H} was allowed to become sufficiently large compared to {\varepsilon}, but the bound was quite weak (coming ultimately from the unboundedness of {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}} as the interval {[H_-,H_+]} of values of {H} under consideration becomes large), something of the order of {H \sim \exp\exp\exp(\varepsilon^{-7})}; the quantity {H} needs at various junctures to be less than a small power of {\log x}, so the relationship between {x} and {\varepsilon} becomes essentially quadruple exponential in nature, {x \sim \exp\exp\exp\exp(\varepsilon^{-7})}. The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate {H(X_{kH})/kH} of the mean entropy, in that this quantity decreased by {\gg \frac{\varepsilon^7}{\log H}} as {k} increased from {1} to {\log H}, basically by dividing {X_{kH}} into {k} components {X_{jH, H}}, {j=0,\dots,k-1} and observing from (6) each of these shares a bit of common information with the same variable {Y_H}. This is relatively clear when one works in a set model, in which {Y_H} is modeled by a set {B_H} of size {O(H)}, and {X_{H_0,H}} is modeled by a set of the form

\displaystyle  X_{H_0,H} = \bigcup_{H_0 < h \leq H_0+H} A_h

for various sets {A_h} of size {O(1)} (also there is some translation symmetry that maps {A_h} to a shift {A_{h+1}} while preserving all of the {B_H}).

However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables {Y_H} are basically jointly independent as {H} ranges over dyadic values that are much smaller than {\log x}, which in the set model corresponds to the {B_H} all being disjoint. One can then establish a variant

\displaystyle  I( X_{H_0,H} : Y_H | (Y_{H'})_{H' < H}) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (7)

of (6), which in the set model roughly speaking asserts that each {B_H} claims a portion of the {\bigcup_{H_0 < h \leq H_0+H} A_h} of cardinality {\gg \varepsilon^7 \frac{H}{\log H}} that is not claimed by previous choices of {B_H}. This leads to a more efficient contradiction (relying on the unboundedness of {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j}} rather than {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}) that looks like it removes one order of exponential growth, thus the relationship between {x} and {\varepsilon} is now {x \sim \exp\exp\exp(\varepsilon^{-7})}. Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form

\displaystyle  \frac{1}{2H} H(X_{2H} | (Y_{H'})_{H' \leq 2H}) \leq \frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H}) - \frac{c \varepsilon^7}{\log H}

for a small constant {c>0}, which on iterating and using the boundedness of {\frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H})} gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges {H} greater than {\exp( (\log\log x)^{\varepsilon^{7}} )} or so, although at this range the theorem is not significantly simpler than the general case).

Archives