The Riemann zeta function {\zeta(s)}, defined for {\hbox{Re}(s)>1} by

\displaystyle  \zeta(s) := \sum_{n=1}^\infty \frac{1}{n^s} \ \ \ \ \ (1)

and then continued meromorphically to other values of {s} by analytic continuation, is a fundamentally important function in analytic number theory, as it is connected to the primes {p=2,3,5,\ldots} via the Euler product formula

\displaystyle  \zeta(s) = \prod_p (1 - \frac{1}{p^s})^{-1} \ \ \ \ \ (2)

(for {\hbox{Re}(s) > 1}, at least), where {p} ranges over primes. (The equivalence between (1) and (2) is essentially the generating function version of the fundamental theorem of arithmetic.) The function {\zeta} has a simple pole at {s=1} and a number of zeroes {\rho}. A formal application of the factor theorem gives

\displaystyle  \zeta(s) = \frac{1}{s-1} \prod_\rho (s-\rho) \times \ldots \ \ \ \ \ (3)

where {\rho} ranges over zeroes of {\zeta}, and we will be vague about what the {\ldots} factor is, how to make sense of the infinite product, and exactly which zeroes of {\zeta} are involved in the product. Equating (2) and (3) and taking logarithms gives the formal identity

\displaystyle  - \log \zeta(s) = \sum_p \log(1 - \frac{1}{p^s}) = \log(s-1) - \sum_\rho \log(s-\rho) + \ldots; \ \ \ \ \ (4)

using the Taylor expansion

\displaystyle  \log(1 - \frac{1}{p^s}) = - \frac{1}{p^s} - \frac{1}{2 p^{2s}} - \frac{1}{3p^{3s}} - \ldots \ \ \ \ \ (5)

and differentiating the above identity in {s} yields the formal identity

\displaystyle  - \frac{\zeta'(s)}{\zeta(s)} = \sum_n \frac{\Lambda(n)}{n^s} = \frac{1}{s-1} - \sum_\rho \frac{1}{s-\rho} + \ldots \ \ \ \ \ (6)

where {\Lambda(n)} is the von Mangoldt function, defined to be {\log p} when {n} is a power of a prime {p}, and zero otherwise. Thus we see that the behaviour of the primes (as encoded by the von Mangoldt function) is intimately tied to the distribution of the zeroes {\rho}. For instance, if we knew that the zeroes were far away from the line {\hbox{Re}(s)=1}, then we would heuristically have

\displaystyle  \sum_n \frac{\Lambda(n)}{n^{1+it}} \approx \frac{1}{it}

for real {t}. On the other hand, the integral test suggests that

\displaystyle  \sum_n \frac{1}{n^{1+it}} \approx \frac{1}{it}

and thus we see that {\frac{\Lambda(n)}{n}} and {\frac{1}{n}} have essentially the same (multiplicative) Fourier transform:

\displaystyle  \sum_n \frac{\Lambda(n)}{n^{1+it}} \approx \sum_n \frac{1}{n^{1+it}}.

Inverting the Fourier transform (or performing a contour integral closely related to the inverse Fourier transform), one is led to the prime number theorem

\displaystyle  \sum_{n \leq x} \Lambda(n) \approx \sum_{n \leq x} 1.

In fact, the standard proof of the prime number theorem basically proceeds by making all of the above formal arguments precise and rigorous.
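
As a purely illustrative aside (my own addition, not part of the argument), one can test this heuristic numerically; the following minimal Python sketch does so, with the cutoff {x = 10^6} an arbitrary small choice.

```python
# Sanity check of the heuristic  sum_{n <= x} Lambda(n) ~ sum_{n <= x} 1.
# Purely illustrative; the cutoff x is an arbitrary small choice.
import math

def von_mangoldt_up_to(N):
    """Return a list L with L[n] = Lambda(n) for 0 <= n <= N, via a simple sieve."""
    L = [0.0] * (N + 1)
    is_prime = [False, False] + [True] * (N - 1)
    for p in range(2, N + 1):
        if is_prime[p]:
            for multiple in range(2 * p, N + 1, p):
                is_prime[multiple] = False
            q = p
            while q <= N:   # prime powers p, p^2, p^3, ... each receive weight log p
                L[q] = math.log(p)
                q *= p
    return L

x = 10**6
psi_x = sum(von_mangoldt_up_to(x))   # sum_{n <= x} Lambda(n)
print(psi_x / x)                     # should be close to 1 (roughly 0.998 at this height)
```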

Unfortunately, we don’t know as much about the zeroes {\rho} of the zeta function (and hence, about the {\zeta} function itself) as we would like. The Riemann hypothesis (RH) asserts that all the zeroes (except for the “trivial” zeroes at the negative even integers) lie on the critical line {\hbox{Re}(s)=1/2}; this hypothesis would make the error terms in the above proof of the prime number theorem significantly smaller. Furthermore, the stronger GUE hypothesis asserts in addition to RH that the local distribution of these zeroes on the critical line should behave like the local distribution of the eigenvalues of a random matrix drawn from the gaussian unitary ensemble (GUE). I will not give a precise formulation of this hypothesis here, except to say that the adjective “local” in the context of the distribution of zeroes {\rho} means something like “at scale {O(1/\log T)} when {\hbox{Im}(\rho) = O(T)}”.

Nevertheless, we do know some reasonably non-trivial facts about the zeroes {\rho} and the zeta function {\zeta}, either unconditionally, or assuming RH (or GUE). Firstly, there are no zeroes for {\hbox{Re}(s)>1} (as one can already see from the convergence of the Euler product (2) in this case) or for {\hbox{Re}(s)=1} (this is trickier, relying on (6) and the elementary observation that

\displaystyle  \hbox{Re}( 3\frac{\Lambda(n)}{n^{\sigma}} + 4\frac{\Lambda(n)}{n^{\sigma+it}} + \frac{\Lambda(n)}{n^{\sigma+2it}} ) = 2\frac{\Lambda(n)}{n^\sigma} (1+\cos(t \log n))^2

is non-negative for {\sigma > 1} and {t \in {\mathbb R}}); from the functional equation

\displaystyle  \pi^{-s/2} \Gamma(s/2) \zeta(s) = \pi^{-(1-s)/2} \Gamma((1-s)/2) \zeta(1-s)

(which can be viewed as a consequence of the Poisson summation formula, see e.g. my blog post on this topic) we know that there are no zeroes for {\hbox{Re}(s) \leq 0} either (except for the trivial zeroes at negative even integers, corresponding to the poles of the Gamma function). Thus all the non-trivial zeroes lie in the critical strip {0 < \hbox{Re}(s) < 1}.
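
(To spell out the elementary observation above: since {\hbox{Re}\ n^{-it} = \cos(t \log n)}, the expression in question equals {\frac{\Lambda(n)}{n^\sigma} ( 3 + 4\cos(t\log n) + \cos(2t \log n) )}, and the point is the classical trigonometric identity

\displaystyle  3 + 4\cos \theta + \cos 2\theta = 2 + 4 \cos\theta + 2\cos^2 \theta = 2 (1+\cos\theta)^2 \geq 0

applied with {\theta = t \log n}.)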

We also know that there are infinitely many non-trivial zeroes, and can approximately count how many zeroes there are in any large bounded region of the critical strip. For instance, for large {T}, the number of zeroes {\rho} in this strip with {\hbox{Im}(\rho) = T+O(1)} is {O(\log T)}. This can be seen by applying (6) to {s = 2+iT} (say); the trivial zeroes at the negative even integers end up giving a contribution of {O(\log T)} to this sum (this is a heavily disguised variant of Stirling’s formula, as one can view the trivial zeroes as essentially being poles of the Gamma function), the {\frac{1}{s-1}} and {\ldots} terms end up being negligible (of size {O(1)}), and each non-trivial zero {\rho} contributes a term which has a non-negative real part, and furthermore has size comparable to {1} if {\hbox{Im}(\rho) = T+O(1)}. (Here I am glossing over a technical renormalisation needed to make the infinite series in (6) converge properly.) Meanwhile, the left-hand side of (6) is absolutely convergent for {s=2+iT} and of size {O(1)}, and the claim follows. A more refined version of this argument shows that the number of non-trivial zeroes with {0 \leq \hbox{Im}(\rho) \leq T} is {\frac{T}{2\pi} \log \frac{T}{2\pi} - \frac{T}{2\pi} + O(\log T)}, but we will not need this more precise formula here. (A fair fraction – at least 40%, in fact – of these zeroes are known to lie on the critical line; see this earlier blog post of mine for more discussion.)
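
(To make the key step slightly more explicit: for {s = 2+iT} and a non-trivial zero {\rho = \beta + i\gamma} with {0 < \beta < 1}, one has

\displaystyle  \hbox{Re} \frac{1}{s-\rho} = \frac{2-\beta}{(2-\beta)^2 + (T-\gamma)^2},

which is always non-negative, and is comparable to {1} precisely when {\gamma = T+O(1)}; summing over zeroes and comparing against the {O(\log T)} bound coming from the other terms then gives the claimed count.)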

Another thing that we happen to know is how the magnitude {|\zeta(1/2+it)|} of the zeta function is distributed as {t \rightarrow \infty}; it turns out to be log-normally distributed with log-variance about {\frac{1}{2} \log \log t}. More precisely, we have the following result of Selberg:

Theorem 1 Let {T} be a large number, and let {t} be chosen uniformly at random between {T} and {2T} (say). Then, as {T \rightarrow \infty}, the random variable {\frac{1}{\sqrt{\frac{1}{2} \log \log T}} \log |\zeta(1/2+it)|} converges in distribution to the normal distribution {N(0,1)}.

To put it more informally, {\log |\zeta(1/2+it)|} behaves like {\sqrt{\frac{1}{2} \log \log t} \times N(0,1)} plus lower order terms for “typical” large values of {t}. (Zeroes {\rho} of {\zeta} are, of course, certainly not typical, but one can show that one can usually stay away from these zeroes.) In fact, Selberg showed a slightly more precise result, namely that for any fixed {k \geq 1}, the {k^{th}} moment of {\frac{1}{\sqrt{\frac{1}{2} \log \log T}} \log |\zeta(1/2+it)|} converges to the {k^{th}} moment of {N(0,1)}.

Remarkably, Selberg’s result does not need RH or GUE, though it is certainly consistent with such hypotheses. (For instance, the determinant of a GUE matrix asymptotically obeys a remarkably similar log-normal law to that given by Selberg’s theorem.) Indeed, these hypotheses only affect some error terms in {\log |\zeta(1/2+it)|} of magnitude {O(1)}, which are asymptotically negligible compared to the main term, which has magnitude about {\sqrt{\log \log T}}. So Selberg’s result, while very pretty, manages to finesse the question of what the zeroes {\rho} of {\zeta} are actually doing – he makes the primes do most of the work, rather than the zeroes.

Selberg never actually published the above result, but it is reproduced in a number of places (e.g. in this book by Joyner, or this book by Laurincikas). As with many other results in analytic number theory, the actual details of the proof can get somewhat technical; but I would like to record here (partly for my own benefit) an informal sketch of some of the main ideas in the argument.
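
Before turning to the sketch, here is one more purely illustrative aside (my own addition, not part of the argument): one can sample {\log|\zeta(1/2+it)|} at random heights and compare with Theorem 1. The sketch below uses Python’s mpmath library; the height {T} and the sample size are arbitrary small choices, so only rough agreement with {N(0,1)} should be expected.

```python
# Rough numerical illustration of Selberg's theorem: sample log|zeta(1/2+it)|
# at random t in [T, 2T] and normalise by sqrt((1/2) log log T).
# Illustrative only; T and the sample size are arbitrary small choices.
import math, random
import mpmath

T = 1.0e6
num_samples = 200
normalised = []
for _ in range(num_samples):
    t = random.uniform(T, 2 * T)
    z = mpmath.zeta(mpmath.mpc(0.5, t))
    normalised.append(float(mpmath.log(abs(z))) / math.sqrt(0.5 * math.log(math.log(T))))

mean = sum(normalised) / num_samples
variance = sum((v - mean) ** 2 for v in normalised) / num_samples
print("sample mean:", mean, "sample variance:", variance)   # roughly 0 and 1
```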

— 1. Informal overview of argument —

The first step is to get a usable (approximate) formula for {\log|\zeta(s)|}. On the one hand, taking real parts in the second part of (4), one has

\displaystyle  -\log|\zeta(s)| = \log|s-1| - \sum_\rho \log|s-\rho| + \ldots. \ \ \ \ \ (7)

This formula turns out not to be directly useful because it requires one to know much more about the distribution of the zeroes {\rho} than we currently possess. On the other hand, taking real parts in the first part of (4) and using the expansion (5), one also has the formula

\displaystyle  \log|\zeta(s)| = \sum_p \hbox{Re} \frac{1}{p^s} + \ldots. \ \ \ \ \ (8)

This formula also turns out not to be directly useful, because it requires one to know much more about the distribution of the primes {p} than we currently possess.

However, it turns out that we can “split the difference” between (7) and (8), and obtain a formula for {\log |\zeta(s)|} which involves some zeroes {\rho} and some primes {p}, in such a way that both contributions can be controlled. Roughly speaking, the formula looks like this:

\displaystyle  \log|\zeta(s)| = \sum_{p \leq T^\epsilon} \hbox{Re} \frac{1}{p^s} + O( \sum_{\rho = s + O(1/\log T)} ( 1 + |\log\frac{|s-\rho|}{1/\log T}| ) ) + \ldots \ \ \ \ \ (9)

for {s = 1/2 + it} and {t = O(T)}, where {\epsilon} is a small parameter that we can choose (e.g. {\epsilon = 0.01}); thus we have localised the prime sum to the primes {p} of size at most {T^\epsilon}, and the zero sum to those zeroes at a distance {O(1/\log T)} from {s}. (This is an oversimplification; there is a “tail” coming from those zeroes that are more distant from {s} than {O(1/\log T)}, and also one has to smooth out the sum in {p} a little bit, and allow the implied constants in the {O()} notation to depend on {\epsilon}, but let us ignore these technical issues here, as well as the issue of what exactly is hiding in the {\ldots} error.)

It turns out that all of these expressions can be controlled. The error term coming from the zeroes (as well as the {\ldots} error term) turns out to be of size {O(1)} for most values of {t}, and so is of lower order. (As mentioned before, it is this error term that would be better controlled if one had RH or GUE, but this is not necessary to establish Selberg’s result.) The main term is the one coming from the primes.

We can heuristically argue as follows. The expression {X_p := \hbox{Re} \frac{1}{p^s} = \frac{1}{\sqrt{p}} \cos( t \log p )}, for {t} ranging between {T} and {2T}, is a random variable of mean zero and variance approximately {\frac{1}{2p}} (if {p \leq T^\epsilon} and {\epsilon} is small). Making the heuristic assumption that the {X_p} behave as if they were independent, the central limit theorem then suggests that the sum {\sum_{p \leq T^\epsilon} X_p} should behave like a normal distribution of mean zero and variance {\sum_{p \leq T^\epsilon} \frac{1}{2p}}. The claim now follows from the classical estimate of Mertens

\displaystyle  \sum_{p \leq x}\frac{1}{p} = \log \log x + O(1)

(which follows from the prime number theorem, but can also be deduced from the formula (8) for {s = 1 + O(1/\log x)}, using the fact that {\zeta} has a simple pole at {1}).
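
(Putting these two facts together, the heuristically predicted variance is

\displaystyle  \sum_{p \leq T^\epsilon} \frac{1}{2p} = \frac{1}{2} \log\log T^\epsilon + O(1) = \frac{1}{2} \log\log T + O_\epsilon(1),

which matches the normalising factor {\sqrt{\frac{1}{2} \log \log T}} appearing in Theorem 1.)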

To summarise, there are three main tasks to establish Selberg’s theorem:

  1. Establish a formula along the lines of (9);
  2. Show that the error terms arising from zeroes are {O(1)} on the average;
  3. Justify the central limit calculation for {\sum_p X_p}.

I’ll now briefly talk (informally) about each of the three steps in turn.

— 2. The explicit formula —

To get a formula such as (9), the basic strategy is to take a suitable average of the formula (7) and the formula (8). Traditionally, this is done by contour integration; however, I prefer (perhaps idiosyncratically) to take a more Fourier-analytic perspective, using convolutions rather than contour integrals. (The two approaches are largely equivalent, though.) The basic point is that the imaginary part {\hbox{Im}(\rho)} of the zeroes inhabits the same space as the imaginary part {t = \hbox{Im}(s)} of the {s} variable, which in turn is the Fourier-analytic dual of the variable in which the logarithms {\log p} of the primes {p} live; this can be seen by writing (7) and (8) in a more Fourier-like manner as

\displaystyle  \sum_\rho \log|1/2+it-\rho| + \ldots = \hbox{Re} \sum_p \frac{1}{\sqrt{p}} e^{-i t \log p} + \ldots.

(These sorts of Fourier-analytic connections are often summarised by the slogan “the zeroes of the zeta function are the music of the primes”.) The uncertainty principle then predicts that localising {\log p} to the scale {O(\log T^\epsilon)} should result in blurring out the zeroes {\rho} at scale {O(1/\log T^\epsilon)}, which is where (9) is going to come from.

Let’s see how this idea works in practice. We consider a convolution of the form

\displaystyle  \int_{\mathbb R} \log|\zeta(s + \frac{iy}{\log T^\epsilon})| \psi(y)\ dy \ \ \ \ \ (10)

where {\psi} is some bump function with total mass {1}; informally, this is {\log|\zeta(s)|} averaged out in the vertical direction at scale {O(1/\log T^\epsilon) = O(1/\log T)} (we allow implied constants to depend on {\epsilon}).

We can express (10) in two different ways, one using (7), and one using (8). Let’s look at (7) first. If one modifies {s} by {O(1/\log T)}, then the quantity {\log|s-\rho|} doesn’t fluctuate very much, unless {\rho} is within {O(1/\log T)} of {s}, in which case it can move by about {O( 1 + |\log \frac{|s-\rho|}{1/\log T}| )}. As a consequence, we see that

\displaystyle  \int_{\mathbb R} \log|s + \frac{iy}{\log T^\epsilon}-\rho| \psi(y)\ dy \approx \log |s-\rho|

when {|\rho-s| \gg 1/\log T}, and

\displaystyle  \int_{\mathbb R} \log|s + \frac{iy}{\log T^\epsilon}-\rho| \psi(y)\ dy = \log |s-\rho| + O( 1 + |\log \frac{|s-\rho|}{1/\log T}| )

when {\rho = s + O(1/\log T)}. The quantity {\log|s-1|} also doesn’t move very much by this shift (we are assuming the imaginary part of {s} to be large). Inserting these facts into (7), we thus see that (10) is (heuristically) equal to

\displaystyle  \log|\zeta(s)| + \sum_{\rho = s + O(1/\log T)} O( 1 + |\log \frac{|s-\rho|}{1/\log T}| ) + \ldots. \ \ \ \ \ (11)

Now let’s compute (10) using (8) instead. Writing {s = 1/2+it}, we express (10) as

\displaystyle  \sum_p \hbox{Re} \frac{1}{p^s} \int_{\mathbb R} e^{-i y \log p / \log T^\epsilon} \psi(y)\ dy + \ldots.

Introducing the Fourier transform {\hat \psi(\xi) := \int_{\mathbb R} e^{-iy\xi} \psi(y)\ dy} of {\psi}, one can write this as

\displaystyle  \sum_p \hbox{Re} \frac{1}{p^s} \hat \psi(\log p / \log T^\epsilon) + \ldots.

Now we took {\psi} to be a bump function, so its Fourier transform should also be like a bump function (or perhaps a Schwartz function). As a first approximation, one can thus think of {\hat \psi} as a smoothed truncation to the region {\{ \xi: \xi = O(1)\}}, thus the {\hat \psi(\log p / \log T^\epsilon)} weight is morally restricting {p} to the region {p \leq T^\epsilon}. Thus we (morally) can express (10) as

\displaystyle  \sum_{p \leq T^\epsilon} \hbox{Re} \frac{1}{p^s} + \ldots.

Comparing this with the other formula (11) we have for (10), we obtain (9) as required (formally, at least).
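
(For a concrete instance of this heuristic, one can take {\psi(y) := \pi^{-1/2} e^{-y^2}}; this is not literally a bump function, but it is a Schwartz function of total mass {1}, which is good enough for this informal discussion. Then {\hat \psi(\xi) = e^{-\xi^2/4}}, so the weight {\hat \psi( \log p / \log T^\epsilon )} is close to {1} for {p \leq T^\epsilon} and decays rapidly once {\log p} becomes large compared to {\log T^\epsilon}.)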

— 3. Controlling the zeroes —

Next, we want to show that the quantity

\displaystyle  \sum_{\rho = s + O(1/\log T)} ( 1 + |\log\frac{|s-\rho|}{1/\log T}| )

is {O(1)} on the average, when {s=1/2+it} and {t} is chosen uniformly at random from {T} to {2T}.

For this, we can use the first moment method. For each zero {\rho}, let {I_\rho} be the random variable which equals {1 + |\log\frac{|s-\rho|}{1/\log T}|} when {\rho = s + O(1/\log T)} and zero otherwise; thus we are trying to control the expectation of {\sum_\rho I_\rho}. The only zeroes which are relevant are those whose imaginary part is of size {O(T)}, and we know that there are {O(T \log T)} of these (indeed, we have an even more precise formula, as remarked earlier). On the other hand, a randomly chosen {s} has a probability of {O( 1 / (T \log T) )} of falling within {O(1/\log T)} of {\rho}, and so we expect each {I_\rho} to have an expected value of {O(1/(T \log T))}. (The logarithmic factor in the definition of {I_\rho} turns out not to be an issue, basically because {\log x} is locally integrable.) By linearity of expectation, we conclude that {\sum_\rho I_\rho} has expectation {O(T \log T) \times O( 1/(T \log T)) = O(1)}, and the claim follows.
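
(To see why the logarithmic factor in {I_\rho} is harmless: writing {u} for the difference between {t} and {\hbox{Im}(\rho)}, the expectation of each {I_\rho} is at most of the order of

\displaystyle  \frac{1}{T} \int_{|u| \leq C/\log T} ( 1 + |\log \frac{|u|}{1/\log T}| )\ du = O( \frac{1}{T \log T} ),

where {C} denotes the implied constant in the {O(1/\log T)} neighbourhood; the point is simply that {\int_0^1 |\log v|\ dv} is finite.)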

Remark 1 One can actually do a bit better than this, showing that higher order moments of {\sum_\rho I_\rho} are also {O(1)}, by using a variant of (9) together with the moment bounds in the next section; but we will not need that refinement here.

— 4. The central limit theorem —

Finally, we have to show that {\sum_{p \leq T^\epsilon} X_p} behaves like a normal distribution, as predicted by the central limit theorem heuristic. The key is to show that the {X_p} behave “as if” they were jointly independent. In particular, as the {X_p} all have mean zero, one would like to show that products such as

\displaystyle  X_{p_1} \ldots X_{p_k} \ \ \ \ \ (12)

have a negligible expectation as long as at least one of the primes in {p_1,\ldots,p_k} appears only once. Once one has this (as well as a similar formula for the case when all primes appear at least twice), one can then perform a standard computation of the {k^{th}} moment of {\sum_{p \leq T^\epsilon} X_p} and verify that this moment matches the answer predicted by the central limit theorem, which by standard arguments (involving the Weierstrass approximation theorem) is enough to establish the distributional law. Note that to get within a fixed accuracy of the normal distribution, it suffices to control a bounded number of moments, which ultimately means that we can treat {k} as being bounded, {k=O(1)}.
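
(For the record, writing {{\bf E}} for expectation over the random choice of {t} and {\sigma^2 := \sum_{p \leq T^\epsilon} \frac{1}{2p}}, the moments one expects to verify here are the Gaussian ones, namely

\displaystyle  {\bf E} \big( \sum_{p \leq T^\epsilon} X_p \big)^k \approx \frac{k!}{2^{k/2} (k/2)!} \sigma^k

for {k} even, and {\approx 0} for {k} odd; these are exactly the moments of the normal distribution {N(0,\sigma^2)}.)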

If we expand out the product (12), we get

\displaystyle  \frac{1}{\sqrt{p_1} \ldots \sqrt{p_k}} \cos( t \log p_1 ) \ldots \cos( t \log p_k ).

Using the product formula for cosines (or Euler’s formula), the product of cosines here can be expressed as a linear combination of cosines {\cos(t\xi)}, where the frequency {\xi} takes the form

\displaystyle  \xi = \pm \log p_1 \pm \log p_2 \pm \ldots \pm \log p_k.

Thus, {\xi} is the logarithm of a rational number, whose numerator and denominator are the product of some of the {p_1,\ldots,p_k}. Since all the {p_j} are at most {T^\epsilon}, we see that the numerator and denominator here are at most {T^{k\epsilon}}.
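
(A quick example with {k=2} may help here: if {p_1 = p_2 = p}, then

\displaystyle  \cos^2( t \log p ) = \frac{1}{2} + \frac{1}{2} \cos( 2 t \log p ),

so the frequencies that arise are {0} and {\pm 2 \log p}, and it is the zero frequency that produces the variance {\frac{1}{2p}}; whereas if {p_1 \neq p_2}, the frequencies are {\pm(\log p_1 - \log p_2)} and {\pm(\log p_1 + \log p_2)}, none of which vanish.)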

Now for the punchline. If there is a prime in {p_1,\ldots,p_k} that appears only once, then the numerator and denominator cannot fully cancel, by the fundamental theorem of arithmetic. Thus {\xi} cannot be {0}. Furthermore, since the denominator is at most {T^{k\epsilon}}, we see that {\xi} must stay away from {0} by a distance of about {1/T^{k\epsilon}} or more, and so {\cos(t\xi)} has a wavelength of at most {O(T^{k\epsilon})}. On the other hand, {t} ranges between {T} and {2T}. If {k} is fixed and {\epsilon} is small enough (much smaller than {1/k}), we thus see that the average value of {\cos(t\xi)} between {T} and {2T} is close to zero, and so (12) does indeed have negligible expectation as claimed. (A similar argument lets one compute the expectation of (12) when all primes appear at least twice.)
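
(Quantitatively: for a non-zero frequency {\xi} as above, one has

\displaystyle  | \frac{1}{T} \int_T^{2T} \cos(t \xi)\ dt | \leq \frac{2}{T |\xi|} = O( T^{k\epsilon - 1} ),

which is indeed negligible when {\epsilon} is much smaller than {1/k}.)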

Remark 2 A famous theorem of Erdös and Kac gives a normal distribution for the number of prime factors of a large number {n}, with mean {\log \log n} and variance {\log \log n}. One can view Selberg’s theorem as a sort of Fourier-analytic variant of the Erdös-Kac theorem.

Remark 3 The Fourier-like correspondence between zeroes of the zeta function and primes can be used to convert statements about zeroes, such as the Riemann hypothesis and the GUE hypothesis, into equivalent statements about primes. For instance, the Riemann hypothesis is equivalent to having the square root error term

\displaystyle  \sum_{x \leq n \leq x+y} \Lambda(n) = y + O_\varepsilon( y^{1/2+\varepsilon} )

in the prime number theorem holding asymptotically as {x \rightarrow \infty} for all {\varepsilon > 0} and all intervals {[x,x+y]} which are large in the sense that {y} is comparable to {x}. Meanwhile, the pair correlation conjecture (the simplest component of the GUE hypothesis) is equivalent (on RH) to the square root error term holding (with the expected variance) for all {\varepsilon > 0} and almost all intervals {[x,x+y]} which are short in the sense that {y=x^\theta} for some small (fixed) {\theta > 0}. (This is a rough statement; a more precise formulation can be found in this paper of Goldston and Montgomery.) It seems to me that a reformulation of the full GUE hypothesis in terms of primes should be similar, but would assert that the error term in the prime number theorem (as well as variants of this theorem for almost primes) in short intervals enjoys the expected normal distribution; I don’t know of a precise formulation of this assertion, but calculations in this direction can be found in this paper of Bogomolny and Keating.