A fundamental problem in analytic number theory is to understand the distribution of the prime numbers {\{2,3,5,\ldots\}}. For technical reasons, it is convenient not to study the primes directly, but a proxy for the primes known as the von Mangoldt function {\Lambda: {\mathbb N} \rightarrow {\mathbb R}}, defined by setting {\Lambda(n)} to equal {\log p} when {n} is a prime {p} (or a power of that prime) and zero otherwise. The basic reason why the von Mangoldt function is useful is that it encodes the fundamental theorem of arithmetic (which in turn can be viewed as the defining property of the primes) very neatly via the identity

\displaystyle  \log n = \sum_{d|n} \Lambda(d) \ \ \ \ \ (1)

for every natural number {n}.

The most important result in this subject is the prime number theorem, which asserts that the number of prime numbers less than a large number {x} is equal to {(1+o(1)) \frac{x}{\log x}}:

\displaystyle  \sum_{p \leq x} 1 = (1+o(1)) \frac{x}{\log x}.

Here, of course, {o(1)} denotes a quantity that goes to zero as {x \rightarrow \infty}.

It is not hard to see (e.g. by summation by parts) that this is equivalent to the asymptotic

\displaystyle  \sum_{n \leq x} \Lambda(n) = (1+o(1)) x \ \ \ \ \ (2)

for the von Mangoldt function (the key point being that the squares, cubes, etc. of primes give a negligible contribution, so {\sum_{n \leq x} \Lambda(n)} is essentially the same quantity as {\sum_{p \leq x} \log p}). Understanding the nature of the {o(1)} term is a very important problem, with the conjectured optimal decay rate of {O(\sqrt{x} \log x)} being equivalent to the Riemann hypothesis, but this will not be our concern here.

The prime number theorem has several important generalisations (for instance, there are analogues for other number fields such as the Chebotarev density theorem). One of the more elementary such generalisations is the prime number theorem in arithmetic progressions, which asserts that for fixed {a} and {q} with {a} coprime to {q} (thus {(a,q)=1}), the number of primes less than {x} equal to {a} mod {q} is equal to {(1+o_q(1)) \frac{1}{\phi(q)} \frac{x}{\log x}}, where {\phi(q) := \# \{ 1 \leq a \leq q: (a,q)=1 \}} is the Euler totient function:

\displaystyle  \sum_{p \leq x: p = a \hbox{ mod } q} 1 = (1+o_q(1)) \frac{1}{\phi(q)} \frac{x}{\log x}.

(Of course, if {a} is not coprime to {q}, the number of primes less than {x} equal to {a} mod {q} is {O(1)}. The subscript {q} in the {o()} and {O()} notation denotes that the implied constants in that notation is allowed to depend on {q}.) This is a more quantitative version of Dirichlet’s theorem, which asserts the weaker statement that the number of primes equal to {a} mod {q} is infinite. This theorem is important in many applications in analytic number theory, for instance in Vinogradov’s theorem that every sufficiently large odd number is the sum of three odd primes. (Imagine for instance if almost all of the primes were clustered in the residue class {2} mod {3}, rather than {1} mod {3}. Then almost all sums of three odd primes would be divisible by {3}, leaving dangerously few sums left to cover the remaining two residue classes. Similarly for other moduli than {3}. This does not fully rule out the possibility that Vinogradov’s theorem could still be true, but it does indicate why the prime number theorem in arithmetic progressions is a relevant tool in the proof of that theorem.)

As before, one can rewrite the prime number theorem in arithmetic progressions in terms of the von Mangoldt function as the equivalent form

\displaystyle  \sum_{n \leq x: n = a \hbox{ mod } q} \Lambda(n) = (1+o_q(1)) \frac{1}{\phi(q)} x.

Philosophically, one of the main reasons why it is so hard to control the distribution of the primes is that we do not currently have too many tools with which one can rule out “conspiracies” between the primes, in which the primes (or the von Mangoldt function) decide to correlate with some structured object (and in particular, with a totally multiplicative function) which then visibly distorts the distribution of the primes. For instance, one could imagine a scenario in which the probability that a randomly chosen large integer {n} is prime is not asymptotic to {\frac{1}{\log n}} (as is given by the prime number theorem), but instead to fluctuate depending on the phase of the complex number {n^{it}} for some fixed real number {t}, thus for instance the probability might be significantly less than {1/\log n} when {t \log n} is close to an integer, and significantly more than {1/\log n} when {t \log n} is close to a half-integer. This would contradict the prime number theorem, and so this scenario would have to be somehow eradicated in the course of proving that theorem. In the language of Dirichlet series, this conspiracy is more commonly known as a zero of the Riemann zeta function at {1+it}.

In the above scenario, the primality of a large integer {n} was somehow sensitive to asymptotic or “Archimedean” information about {n}, namely the approximate value of its logarithm. In modern terminology, this information reflects the local behaviour of {n} at the infinite place {\infty}. There are also potential consipracies in which the primality of {n} is sensitive to the local behaviour of {n} at finite places, and in particular to the residue class of {n} mod {q} for some fixed modulus {q}. For instance, given a Dirichlet character {\chi: {\mathbb Z} \rightarrow {\mathbb C}} of modulus {q}, i.e. a completely multiplicative function on the integers which is periodic of period {q} (and vanishes on those integers not coprime to {q}), one could imagine a scenario in which the probability that a randomly chosen large integer {n} is prime is large when {\chi(n)} is close to {+1}, and small when {\chi(n)} is close to {-1}, which would contradict the prime number theorem in arithmetic progressions. (Note the similarity between this scenario at {q} and the previous scenario at {\infty}; in particular, observe that the functions {n \rightarrow \chi(n)} and {n \rightarrow n^{it}} are both totally multiplicative.) In the language of Dirichlet series, this conspiracy is more commonly known as a zero of the {L}-function of {\chi} at {1}.

An especially difficult scenario to eliminate is that of real characters, such as the Kronecker symbol {\chi(n) = \left( \frac{n}{q} \right)}, in which numbers {n} which are quadratic nonresidues mod {q} are very likely to be prime, and quadratic residues mod {q} are unlikely to be prime. Indeed, there is a scenario of this form – the Siegel zero scenario – which we are still not able to eradicate (without assuming powerful conjectures such as GRH), though fortunately Siegel zeroes are not quite strong enough to destroy the prime number theorem in arithmetic progressions.

It is difficult to prove that no conspiracy between the primes exist. However, it is not entirely impossible, because we have been able to exploit two important phenomena. The first is that there is often a “all or nothing dichotomy” (somewhat resembling the zero-one laws in probability) regarding conspiracies: in the asymptotic limit, the primes can either conspire totally (or more precisely, anti-conspire totally) with a multiplicative function, or fail to conspire at all, but there is no middle ground. (In the language of Dirichlet series, this is reflected in the fact that zeroes of a meromorphic function can have order {1}, or order {0} (i.e. are not zeroes after all), but cannot have an intermediate order between {0} and {1}.) As a corollary of this fact, the prime numbers cannot conspire with two distinct multiplicative functions at once (by having a partial correlation with one and another partial correlation with another); thus one can use the existence of one conspiracy to exclude all the others. In other words, there is at most one conspiracy that can significantly distort the distribution of the primes. Unfortunately, this argument is ineffective, because it doesn’t give any control at all on what that conspiracy is, or even if it exists in the first place!

But now one can use the second important phenomenon, which is that because of symmetries, one type of conspiracy can lead to another. For instance, because the von Mangoldt function is real-valued rather than complex-valued, we have conjugation symmetry; if the primes correlate with, say, {n^{it}}, then they must also correlate with {n^{-it}}. (In the language of Dirichlet series, this reflects the fact that the zeta function and {L}-functions enjoy symmetries with respect to reflection across the real axis (i.e. complex conjugation).) Combining this observation with the all-or-nothing dichotomy, we conclude that the primes cannot correlate with {n^{it}} for any non-zero {t}, which in fact leads directly to the prime number theorem (2), as we shall discuss below. Similarly, if the primes correlated with a Dirichlet character {\chi(n)}, then they would also correlate with the conjugate {\overline{\chi}(n)}, which also is inconsistent with the all-or-nothing dichotomy, except in the exceptional case when {\chi} is real – which essentially means that {\chi} is a quadratic character. In this one case (which is the only scenario which comes close to threatening the truth of the prime number theorem in arithmetic progressions), the above tricks fail and one has to instead exploit the algebraic number theory properties of these characters instead, which has so far led to weaker results than in the non-real case.

As mentioned previously in passing, these phenomena are usually presented using the language of Dirichlet series and complex analysis. This is a very slick and powerful way to do things, but I would like here to present the elementary approach to the same topics, which is slightly weaker but which I find to also be very instructive. (However, I will not be too dogmatic about keeping things elementary, if this comes at the expense of obscuring the key ideas; in particular, I will rely on multiplicative Fourier analysis (both at {\infty} and at finite places) as a substitute for complex analysis in order to expedite various parts of the argument. Also, the emphasis here will be more on heuristics and intuition than on rigour.)

The material here is closely related to the theory of pretentious characters developed by Granville and Soundararajan, as well as an earlier paper of Granville on elementary proofs of the prime number theorem in arithmetic progressions.

— 1. A heuristic elementary proof of the prime number theorem —

To motivate some of the later discussion, let us first give a highly non-rigorous heuristic elementary “proof” of the prime number theorem (2). Since we clearly have

\displaystyle  \sum_{n \leq x} 1 = x + O(1)

one can view the prime number theorem as an assertion that the von Mangoldt function {\Lambda} “behaves like {1} on the average”,

\displaystyle  \Lambda(n) \approx 1, \ \ \ \ \ (3)

where we will be deliberately vague as to what the “{\approx}” symbol means. (One can think of this symbol as denoting some sort of proximity in the weak topology or vague topology, after suitable normalisation.)

To see why one would expect (3) to be true, we take divisor sums of (3) to heuristically obtain

\displaystyle  \sum_{d|n} \Lambda(d) \approx \sum_{d|n} 1. \ \ \ \ \ (4)

By (1), the left-hand side is {\log n}; meanwhile, the right-hand side is the divisor function {\tau(n)} of {n}, by definition. So we have a heuristic relationship between (3) and the informal approximation

\displaystyle  \tau(n) \approx \log n. \ \ \ \ \ (5)

In particular, we expect

\displaystyle  \sum_{n \leq x} \tau(n) \approx \sum_{n \leq x} \log n. \ \ \ \ \ (6)

The right-hand side of (6) can be approximated using the integral test as

\displaystyle  \sum_{n \leq x} \log n = \int_1^x \log t\ dt + O(\log x ) = x \log x - x + O(\log x) \ \ \ \ \ (7)

(one can also use Stirling’s formula to obtain a similar asymptotic). As for the left-hand side, we write {\tau(n) = \sum_{d|n} 1} and then make the substitution {n=dm} to obtain

\displaystyle \sum_{n \leq x} \tau(n) = \sum_{d,m: dm \leq x} 1.

The right-hand side is the number of lattice points underneath the hyperbola {dm=x}, and can be counted using the Dirichlet hyperbola method:

\displaystyle  \sum_{d,m: dm \leq x} 1 = \sum_{d \leq \sqrt{x}} \sum_{m \leq x/d} 1 + \sum_{m \leq \sqrt{x}} \sum_{d \leq x/m} 1 - \sum_{d \leq \sqrt{x}} \sum_{m \leq \sqrt{x}} 1.

The third sum is equal to {(\sqrt{x}+O(1))^2 = x + O(\sqrt{x})}. The second sum is equal to the first. The first sum can be computed as

\displaystyle  \sum_{d \leq \sqrt{x}} \sum_{m \leq x/d} 1 = \sum_{d \leq \sqrt{x}} (\frac{x}{d} + O(1)) = x \sum_{d \leq \sqrt{x}} \frac{1}{d} + O(\sqrt{x});

meanwhile, from the integral test and the definition of Euler’s constant {\gamma = 0.577\ldots} one has

\displaystyle  \sum_{d \leq y} \frac{1}{d} = \log y + \gamma + O(1/y) \ \ \ \ \ (8)

for any {y \geq 1}; combining all these estimates one obtains

\displaystyle  \sum_{n \leq x} \tau(n) = x \log x + (2\gamma-1) x + O(\sqrt{x}). \ \ \ \ \ (9)

Comparing this with (7) we do see that {\tau(n)} and {\log n} are roughly equal “to top order” on average, thus giving some form of (5) and hence (4); if one could somehow invert the divisor sum operation, one could hope to get (3) and thus the prime number theorem.

(Looking at the next highest order terms in (7), (9), we see that we expect {\tau(n)} to in fact be slightly larger than {\log n} on the average, and so {\Lambda(n)} should be slightly less than {1} on the average. There is indeed a slight effect of this form; for instance, it is possible (using the prime number theorem) to prove

\displaystyle  \sum_{d \leq y} \frac{\Lambda(d)}{d} = \log y - \gamma + o(1),

which should be compared with (8).)

One can partially translate the above discussion into the language of Dirichlet series, by transforming various arithmetical functions {f(n)} to their associated Dirichlet series

\displaystyle  F(s) := \sum_{n=1}^\infty \frac{f(n)}{n^s},

ignoring for now the issue of convergence of this series. By definition, the constant function {1} transforms to the Riemann zeta function {\zeta(s)}. Taking derivatives in {s}, we see (formally, at least) that if {f(n)} has Dirichlet series {F(s)}, then {f(n) \log n} has Dirichlet series {-F'(s)}; thus, for instance, {\log n} has Dirichlet series {-\zeta'(s)}.

Most importantly, though, if {f(n), g(n)} have Dirichlet series {F(s), G(s)} respectively, then their Dirichlet convolution {f * g(n) := \sum_{d|n} f(d) g(\frac{n}{d})} has Dirichlet series {F(s) G(s)}; this is closely related to the well-known ability of the Fourier transform to convert convolutions to pointwise multiplication. Thus, for instance, {\tau(n)} has Dirichlet series {\zeta(s)^2}. Also, from (1) and the preceding discussion, we see that {\Lambda(n)} has Dirichlet series {-\zeta'(s)/\zeta(s)} (formally, at least). This already suggests that the von Mangoldt function will be sensitive to the zeroes of the zeta function.

An integral test computation closely related to (8) gives the asymptotic

\displaystyle  \zeta(s) = \frac{1}{s-1} + \gamma + O(s-1)

for {s} close to one (and {\hbox{Re}(s) > 1}, to avoid issues of convergence). This implies that the Dirichlet series {-\zeta'(s)/\zeta(s)} for {\Lambda(n)} has asymptotic

\displaystyle  \frac{-\zeta'(s)}{\zeta(s)} = \frac{1}{s-1} - \gamma + O(s-1)

thus giving support to (3); similarly, the Dirichlet series for {\log n} and {\tau(n)} have asymptotic

\displaystyle  -\zeta'(s) = \frac{1}{(s-1)^2} + O(1)

and

\displaystyle  \zeta(s)^2 = \frac{1}{(s-1)^2} + \frac{2\gamma}{s-1} + O(1)

which gives support to (5) (and is also consistent with (7), (9)).

Remark 1 One can connect the properties of Dirichlet series {F(s)} more rigorously to asymptotics of partial sums {\sum_{n \leq x} f(n)} by means of various transforms in Fourier analysis and complex analysis, in particular contour integration or the Hilbert transform, but this becomes somewhat technical and we will not do so here. I will remark, though, that asymptotics of {F(s)} for {s} close to {1} are not enough, by themselves, to get really precise asymptotics for the sharply truncated partial sums {\sum_{n \leq x} f(n)}, for reasons related to the uncertainty principle; in order to control such sums one also needs to understand the behaviour of {F} far away from {s=1}, and in particular for {s=1+it} for large real {t}. On the other hand, the asymptotics for {F(s)} for {s} near {1} are just about all one needs to control smoothly truncated partial sums such as {\sum_n f(n) \eta(n/x)} for suitable cutoff functions {\eta}. Also, while Dirichlet series are very powerful tools, particularly with regards to understanding Dirichlet convolution identities, and controlling everything in terms of the zeroes and poles of such series, they do have the drawback that they do not easily encode such fundamental “physical space” facts as the pointwise inequalities {|\mu(n)| \leq 1} and {\Lambda(n) \geq 0}, which are also an important aspect of the theory.

— 2. Almost primes —

One can hope to make the above heuristics precise by applying the Möbius inversion formula

\displaystyle  1_{n=1} = \sum_{d|n} \mu(d)

where {\mu(d)} is the Möbius function, defined as {(-1)^k} when {d} is the product of {k} distinct primes for some {k \geq 0}, and zero otherwise. In terms of Dirichlet series, we thus see that {\mu} has the Dirichlet series of {1/\zeta(s)}, and so can invert the divisor sum operation {f(n) \mapsto \sum_{d|n} f(d)} (which corresponds to multiplication by {\zeta(s)}):

\displaystyle  f(n) = \sum_{m|n} \mu(m) ( \sum_{d|n/m} f(d) ).

From (1) we then conclude

\displaystyle  \Lambda(n) = \sum_{d|n} \mu(d) \log \frac{n}{d} \ \ \ \ \ (10)

while from {\tau(n) = \sum_{d|n} 1} we have

\displaystyle  1 = \sum_{d|n} \mu(d) \tau(\frac{n}{d}). \ \ \ \ \ (11)

One can now hope to derive the prime number theorem (2) from the formulae (7), (9). Unfortunately, this doesn’t quite work: the prime number theorem is equivalent to the assertion

\displaystyle  \sum_{n \leq x} (\Lambda(n)-1) = o(x), \ \ \ \ \ (12)

but if one inserts (10), (11) into the left-hand side of (12), one obtains

\displaystyle  \sum_{d \leq x} \mu(d) \sum_{m \leq x/d} (\log m - \tau(m)),

which if one then inserts (7), (9) and the trivial bound {\mu(d)=O(1)}, leads to

\displaystyle  2C x \sum_{d \leq x} \frac{\mu(d)}{d} + O(x).

As noted in this previous post, we have

\displaystyle  |\sum_{d \leq x} \frac{\mu(d)}{d}| \leq 1, \ \ \ \ \ (13)

so we only obtain a bound of {O(x)} for (12) instead of {o(x)}. (A refinement of this argument, though, shows that the prime number theorem would follow if one had the asymptotic {\sum_{n \leq x} \mu(n) = o(x)}, which is in fact equivalent to the prime number theorem; see the previous post.)

We remark that if one computed {\sum_{n \leq x} \tau(n)} or {\sum_{n \leq x} \Lambda(n)} by the above methods, one would eventually be led to a variant of (13), namely

\displaystyle  \sum_{d \leq x} \frac{\mu(d)}{d} \log \frac{x}{d} = O(1), \ \ \ \ \ (14)

which is an estimate which will be useful later.

So we see that when trying to sum the von Mangoldt function {\Lambda} by elementary means, the error term {O(x)} overwhelms the main term {x}. But there is a slight tweaking of the von Mangoldt function, the second von Mangoldt function {\Lambda_2}, that increases the size of the main term to {2x \log x} while keeping the error term at {O(x)}, thus leading to a useful estimate; the price one pays for this is that this function is now a proxy for the almost primes rather than the primes. This function is defined by a variant of (10), namely

\displaystyle  \Lambda_2(n) = \sum_{d|n} \mu(d) \log^2 \frac{n}{d}. \ \ \ \ \ (15)

It is not hard to see that {\Lambda_2(n)} vanishes once {n} has at least three distinct prime factors (basically because the quadratic function {x \mapsto x^2} vanishes after being differentiated three or more times). Indeed, one can easily verify the identity

\displaystyle  \Lambda_2(n) = \Lambda(n) \log n + \Lambda*\Lambda(n) \ \ \ \ \ (16)

(which corresponds to the Dirichlet series identity {\zeta''(s)/\zeta(s) = -(-\zeta'(s)/\zeta(s))' + (-\zeta'(s)/\zeta(s))^2}); the first term {\Lambda(n) \log n} is mostly concentrated on primes, while the second term {\Lambda*\Lambda(n)} is mostly concentrated on semiprimes (products of two distinct primes).

Now let us sum {\Lambda_2(n)}. In analogy with the previous discussion, we will do so by comparing the function {\log^2 n} with something involving the divisor function. In view of (5), it is reasonable to try the approximation

\displaystyle  \log^2 n \approx \tau(n) \log n;

from the identity

\displaystyle  2 \log n = \sum_{d|n} \mu(d) \tau(\frac{n}{d}) \log \frac{n}{d} \ \ \ \ \ (17)

(which corresponds to the Dirichlet series identity {- 2 \zeta'(s) = \frac{1}{\zeta(s)} -(\zeta^2(s))'}) we thus expect

\displaystyle  \Lambda_2(n) \approx 2\log n. \ \ \ \ \ (18)

Now we make these heuristics more precise. From the integral test we have

\displaystyle  \sum_{n \leq x} \log^2 n = x \log^2 x + C_1 x \log x + C_2 x + O(\log^2 x)

while from (9) and summation by parts one has

\displaystyle  \sum_{n \leq x} \tau(n) \log n = x \log^2 x + C_3 x \log x + C_4 x + O(\sqrt{x} \log x)

where {C_1,C_2,C_3,C_4} are explicit absolute constants whose exact value is not important here. Thus

\displaystyle  \sum_{n \leq x} (\log^2 n - \tau(n) \log n) = C_5 x \log x + C_6 x + O(\sqrt{x} \log x) \ \ \ \ \ (19)

for some other constants {C_5, C_6}.

Meanwhile, from (15), (17) one has

\displaystyle  \sum_{n \leq x} (\Lambda_2(n) - 2\log(n)) = \sum_{d \leq x} \mu(d) \sum_{m \leq x/d} \log^2 n - \tau(n) \log n;

applying (19), (13), (14) we see that the right-hand side is {O(x)}. Computing {\sum_{n \leq x} \log n} by the integral test, we deduce the Selberg symmetry formula

\displaystyle  \sum_{n \leq x} \Lambda_2(n) = 2 x \log x + O(x). \ \ \ \ \ (20)

One can view (20) as the “almost prime number theorem” – the analogue of the prime number theorem for almost primes.

The fact that the almost primes have a relatively easy asymptotic, while the genuine primes do not, is a reflection of the parity problem in sieve theory; see this post for further discussion. The symmetry formula is however enough to get “within a factor of two” of the prime number theorem: if we discard the semiprimes {\Lambda*\Lambda} from (16), we see that {\Lambda(n) \log n \leq \Lambda_2(n)}, and thus

\displaystyle  \sum_{n \leq x} \Lambda(n) \log n \leq 2 x \log x + O(x)

which by a summation by parts argument leads to

\displaystyle  0 \leq \sum_{n \leq x} \Lambda(n) \leq 2 x + O(\frac{x}{\log x}),

which is within a factor of {2} of (2) in some sense.

One can “twist” all of the above arguments by a Dirichlet character {\chi}. For instance, (15) twists to

\displaystyle  \Lambda_2(n) \chi(n) = \sum_{d|n} \mu(d)\chi(d) \log^2 \frac{n}{d} \chi(\frac{n}{d}).

On the other hand, if {\chi} is a non-principal character of modulus {q}, then it has mean zero on any interval with length {q}, and it is then not hard to establish the asymptotic

\displaystyle  \sum_{n \leq y} \log^2 n \chi(n) = O_q(\log^2 y).

This soon leads to the twisted version of (20):

\displaystyle  \sum_{n \leq x} \Lambda_2(n) \chi(n) = O_q(x), \ \ \ \ \ (21)

thus almost primes are asymptotically unbiased with respect to non-principal characters.

From the multiplicative Fourier analysis of Dirichlet characters modulo {q} (and the observation that {\Lambda_2} is quite small on residue classes not coprime to {q}) one then has an “almost prime number theorem in arithmetic progressions”:

\displaystyle  \sum_{n \leq x: n = a \mod q} \Lambda_2(n) = \frac{2}{\phi(q)} x \log x + O_q(x).

As before, this lets us come within a factor of two of the actual prime number theorem in arithmetic progressions:

\displaystyle  \sum_{n \leq x: n = a \mod q} \Lambda(n) \leq \frac{2}{\phi(q)} x + O_q(\frac{x}{\log x}).

One can also twist things by the completely multiplicative function {n \mapsto n^{it}}, but with the caveat that the approximation {2 \log n} to {\Lambda_2(n)} can locally correlate with {n^{it}}. Thus for instance one has

\displaystyle  \sum_{n \leq x} (\Lambda_2(n)-2 \log n) \chi(n) n^{it} = O_q(x)

for any fixed {t} and {\chi}; in particular, if {\chi} is non-principal, one has

\displaystyle  \sum_{n \leq x} \Lambda_2(n) \chi(n) n^{it} = O_q(x).

— 3. The all-or-nothing dichotomy —

To summarise so far, the almost primes (as represented by {\Lambda_2}) are quite uniformly distributed. These almost primes can be split up into the primes (as represented by {\Lambda(n) \log n}) and the semiprimes (as represented by {\Lambda*\Lambda(n)}), thanks to (16).

One can rewrite (16) as a recursive formula for {\Lambda}:

\displaystyle  \Lambda(n) = \frac{1}{\log n} \Lambda_2(n) - \frac{1}{\log n} \Lambda * \Lambda(n). \ \ \ \ \ (22)

One can also twist this formula by a character {\chi} and/or a completely multiplicative function {n \mapsto n^{it}}, thus for instance

\displaystyle  \Lambda \chi(n) = \frac{1}{\log n} \Lambda_2 \chi(n) - \frac{1}{\log n} \Lambda \chi * \Lambda \chi(n). \ \ \ \ \ (23)

This recursion, combined with the uniform distribution properties on {\Lambda_2}, lead to various all-or-nothing dichotomies for {\Lambda}. Suppose, for instance, that {\Lambda \chi} behaves like a constant {c} on the average for some non-principal character {\chi}:

\displaystyle  \Lambda \chi(n) \approx c.

Then (from (5)) we expect {\Lambda \chi * \Lambda \chi} to behave like {c^2 \log n}, thus

\displaystyle  \frac{1}{\log n} \Lambda \chi * \Lambda \chi(n) \approx c^2.

On the other hand, from (21), {\frac{1}{\log n} \Lambda_2(n)} is asymptotically uncorrelated with {\chi}:

\displaystyle  \frac{1}{\log n} \Lambda_2 \chi \approx 0.

Putting all this together, one obtains

\displaystyle  c \approx -c^2

which suggests that {c} must be either close to {0}, or close to {-1}.

Basically, the point is that there are only two equilibria for the recursion (23). One equilibrium occurs when {\Lambda} is asymptotically uncorrelated with {\chi}; the other is when it is completely anti-correlated with {\chi}, so that {\Lambda(n)} is supported primarily on those {n} for which {\chi(n)} is close to {-1}. Note in the latter case {\chi(n) \approx -1} for most primes {n}, and thus {\chi(n) \approx +1} for most semiprimes {n}, thus leading to an equidistribution of {\chi(n)} for almost primes (weighted by {\Lambda_2}). Any intermediate distribution of {\Lambda \chi} would be inconsistent with the distribution of {\Lambda_2 \chi}. (In terms of Dirichlet series, this assertion corresponds to the fact that the {L}-function of {\chi} either has a zero of order {1}, or a zero of order {0} (i.e. not a zero at all) at {s=1}.)

A similar phenomenon occurs when twisting {\Lambda} by {n^{it}}; basically, the average value of {(\Lambda(n)-1) n^{it}} must asymptotically either be close to {0}, or close to {-1}; no other asymptotic ends up being compatible with the distribution of {(\Lambda_2(n)-2\log n) n^{it}}. (Again, this corresponds to the fact that the Riemann zeta function has a zero of order {1} or {0} at {1+it}.) More generally, the average value of {(\Lambda(n)-1) \chi(n) n^{it}} must asymptotically approach either {0} or {-1}.

Remark 2 One can make the above heuristics precise either by using Dirichlet series (and analytic continuation, and the theory of zeroes of meromorphic functions), or by smoothing out arithmetic functions such as {\Lambda \chi} by a suitable multiplicative convolution with a mollifier (as is basically done in elementary proofs of the prime number theorem); see also the Granville-Soundararajan theory of pretentious characters for a closely related theory. We will not pursue these details here, however.

— 4. Dueling conspiracies —

In the previous section we have seen (heuristically, at least), that the von Mangoldt function {\Lambda(n)} (or more precisely, {\Lambda(n)-1}) will either have no correlation, or a maximal amount of anti-correlation, with a completely multiplicative function such as {\chi(n)}, {n^{it}}, or {\chi(n) n^{it}}. On the other hand, it is not possible for this function to maximally anti-correlate (or to conspire) with two such functions; thus the presence of one conspiracy excludes the presence of all others.

Suppose for instance that we had two distinct non-principal characters {\chi, \chi'} for which one had maximal anti-correlation:

\displaystyle  \Lambda(n) \chi(n), \Lambda(n) \chi'(n) \approx -1.

One could then combine the two statements to obtain

\displaystyle  \Lambda(n) (\chi(n) + \chi'(n)) \approx -2.

Meanwhile, {\frac{1}{\log n} \Lambda_2(n)} doesn’t correlate with either {\chi} or {\chi'}. It will be convenient to exploit this to normalise {\Lambda}, obtaining

\displaystyle  (\Lambda(n) - \frac{1}{2\log n} \Lambda_2(n)) (\chi(n) + \chi'(n)) \approx -2.

(Note from (3), (18) that we expect {\Lambda(n) - \frac{1}{2\log n} \Lambda_2(n)} to have mean zero.)

On the other hand, since {0 \leq \Lambda(n) \log n \leq \Lambda_2(n)}, one has

\displaystyle  |\Lambda(n) - \frac{1}{2\log n} \Lambda_2(n)| \leq \frac{1}{2\log n} \Lambda_2(n)

and hence by the triangle inequality

\displaystyle  \Lambda_2(n) |\chi(n) + \chi'(n)| \gtrapprox 4 \log n

in the sense that averages of the left-hand side should be at least as large as averages of the right-hand side. From this, (18), and Cauchy-Schwarz, one thus expects

\displaystyle  \Lambda_2(n) |\chi(n) + \chi'(n)|^2 \gtrapprox 8 \log n.

But if one expands out the left-hand side using (18), (21), one only ends up with {4 \log n + O_q(1)} on the average, a contradiction for {n} sufficiently large.

Remark 3 The above argument belongs to a family of {L^2}-based arguments which go by various names (almost orthogonality, {TT^*}, large sieve, etc.). The {L^2} argument can more generally be used to establish square-summability estimates on averages such as {\frac{1}{x} \sum_{n \leq x} \Lambda(n) \chi(n)} as {\chi} varies, but we will not make this precise here.

As one consequence of the above arguments, one can show that {\Lambda(n)} cannot maximally anti-correlate with any non-real character {\chi}, since (by the reality of {\Lambda}) it would then also maximally anti-correlate with the complex conjugate {\overline{\chi}}, which is distinct from {\chi}. A similar argument shows that {\Lambda(n)} cannot maximally anti-correlate with {n^{it}} for any non-zero {t}, a fact which can soon lead to the prime number theorem, either by Dirichlet series methods, by Fourier-analytic means, or by elementary means. (Sketch of Fourier-analytic proof: {L^2} methods provide {L^2}-type bounds on the averages of {\Lambda(n) n^{it}} in {t}, while the above arguments show that these averages are also small in {L^\infty}. Applying (22) a few times to take advantage of the smoothing effects of convolution, one eventually concludes that these averages can be made arbitrarily small in {L^1} asymptotically, at which point the prime number theorem follows from Fourier inversion.)

Remark 4 There is a slightly different argument of an {L^1} nature rather than an {L^2} nature (i.e. using tools such as the triangle inequality, union bound, etc.) that can also achieve similar results. For instance, suppose that {\Lambda(n)} maximally anti-correlates with {\chi} and {\chi'}. Then {\chi(n), \chi'(n) \approx -1} for most primes {n}, which implies that {\chi \chi'(n) \approx +1} for most primes {n}, which is incompatible with the all-or-nothing dichotomy unless {\chi \chi'} is principal. This is an alternate way to exclude correlation with non-real characters. Similarly, if {\Lambda(n) n^{it} \approx -1}, then {\Lambda(n) n^{2it} \approx +1}, which is also incompatible with the zero-one law; this is essentially the method underlying the standard proof of the prime number theorem (which relates {\zeta(1+it)} with {\zeta(1+2it)}).

— 5. Quadratic characters —

The one difficult scenario to eliminate is that of maximal anti-correlation with a real non-principal (i.e. quadratic) character {\chi}, thus

\displaystyle  \Lambda(n) \chi(n) \approx -1.

This scenario implies that the quantity

\displaystyle  L(1,\chi) := \sum_{n=1}^\infty \frac{\chi(n)}{n}

vanishes. Indeed, if one starts with the identity

\displaystyle  \log n \chi(n) = \sum_{d|n} \Lambda \chi(d) \chi(\frac{n}{d})

and sums in {n}, one sees that

\displaystyle  \sum_{n \leq x} \log n \chi(n) = \sum_{d,m: dm \leq x} \Lambda \chi(d) \chi(m).

The left-hand side is {O_q( \log x)} by the mean zero and periodicity properties of {\chi}. To estimate the right-hand side, we use the hyperbola method and rewrite it as

\displaystyle  \sum_{m \leq M} \chi(m) \sum_{d \leq x/m} \Lambda \chi(d) + \sum_{d \leq x/M} \Lambda \chi(d) \sum_{M < m \leq x/d} \chi(m)

for some parameter {M} (sufficiently slowly growing in {x}) to be optimised later. Writing {\sum_{d \leq x/m} \Lambda \chi(d) = (-1 + o_q(1)) x/m} and {\sum_{M < m \leq x/d} \chi(m) = O_q(1)}, we can express this as

\displaystyle  x( \sum_{m \leq M} \frac{\chi(m)}{m} + o_q(1) ) + O_q( x / M );

sending {x \rightarrow \infty} (and {M \rightarrow \infty} at a slower rate) we conclude {L(1,\chi)=0} as required.

It is remarkably difficult to show that {L(1,\chi)} does not, in fact, vanish. One way to do this is to use the class number formula, that relates this quantity to the class number of the quadratic number field {{\Bbb Q}(\sqrt{-d})} associated to the conductor {d} of {\chi}, together with some related number-theoretic quantities. A more elementary (but significantly weaker) method proceeds by using the easily verified fact that the convolution {1*\chi} is non-negative, and is at least {1} on the squares; this should be interpreted as a fact from algebraic number theory, and basically corresponds to the fact that the number of representations of an integer {n} as the norm {x^2 + d y^2} of an integer in {{\Bbb Z}(\sqrt{d})} (or more generally, as the norm of an ideal in that ring) is non-negative, and is at least {1} on the squares. In particular we have

\displaystyle  \sum_{n \leq x} \frac{1*\chi(n)}{\sqrt{n}} \geq \frac{1}{2} \log x + O(1).

On the other hand, from the hyperbola method we can express the left-hand side as

\displaystyle  \sum_{d \leq \sqrt{x}} \frac{\chi(d)}{\sqrt{d}} \sum_{m \leq x/d} \frac{1}{\sqrt{m}} + \sum_{m < \sqrt{x}} \frac{1}{\sqrt{m}} \sum_{\sqrt{x} < d \leq x/m} \frac{\chi(d)}{\sqrt{d}}. \ \ \ \ \ (24)

From the mean zero and periodicity properties of {\chi} we have {\sum_{\sqrt{x} < d \leq x/m} \frac{\chi(d)}{\sqrt{d}} = O_q( x^{-1/4} )}, so the second term in (24) is {O_q(1)}. Meanwhile, from the midpoint rule, {\sum_{m \leq y} \frac{1}{\sqrt{m}} = 2 \sqrt{y} - \frac{3}{2} + O( 1 / \sqrt{y} )}, and so the first term in (24) is

\displaystyle  2 \sqrt{x} \sum_{d \leq \sqrt{x}} \frac{\chi(d)}{d} + O( |\sum_{d \leq \sqrt{x}} \frac{\chi(d)}{\sqrt{d}}| ) + O( 1 ) = 2 \sqrt{x} L(1,\chi) + O(1).

Putting all this together we have

\displaystyle  \frac{1}{2} \log x + O(1) \leq 2 \sqrt{x} L(1,\chi) + O_q(1),

which leads to a contradiction as {x \rightarrow \infty} if {L(1,\chi)} vanishes.

Note in fact that the above argument shows that {L(1,\chi)} is positive. If one carefully computes the dependence of the above argument on the modulus {q}, one obtains a lower bound of the form {L(1,\chi) \geq \exp( -q^{1/2+o(1)} )}, which is quite poor. Using a non-trivial improvement on the error term in counting lattice points under the hyperbola (or better still, by smoothing the sum {\sum_{n \leq x}}), one can improve this a bit, to {L(1,\chi) \geq q^{-O(1)}}. In contrast, the class number method gives a bound {L(1,\chi) \geq q^{-1/2+o(1)}}.

We can improve this even further for all but at most one real primitive character {\chi}:

Theorem 1 (Siegel’s theorem) For every {\epsilon > 0}, one has {L(1,\chi) \gg_\epsilon q^{-\epsilon}} for all but at most one real primitive character {\chi}, where the implied constant is effective, and {q} is the modulus of {\chi}.

Throwing in this (hypothetical) one exceptional character, we conclude that {L(1,\chi) \gg_\epsilon q^{-\epsilon}} for all real primitive characters {\chi}, but now the implied constant is ineffective, which is the usual way in which Siegel’s theorem is formulated (but the above nearly effective refinement can be obtained by the same methods). It is a major open problem in the subject to eliminate this exceptional character and recover an effective estimate for some {\epsilon < 1/2}.

Proof: Let {\epsilon > 0} (which we can assume to be small), and let {c > 0} be a small number depending (effectively) on {\epsilon} to be chosen later. Our task is to show that {L(1,\chi) \geq c q^{-\epsilon}} for all but at most one primitive real character {\chi}. Note we may assume {q} is large (effectively) depending on {\epsilon}, as the claim follows from the previous bounds on {L(1,\chi)} otherwise.

Suppose then for contradiction that {L(1,\chi) < c q^{-\epsilon}} and {L(1,\chi') < c (q')^{-\epsilon}} for two distinct primitive real characters {\chi, \chi'} of (large) modulus {q,q'} respectively.

We begin by modifying the proof that {L(1,\chi)} was positive, which relied (among other things) on the observation that {1 * \chi}, and equals {1} at {1}. In particular, one has

\displaystyle  \sum_{n \leq x} \frac{1*\chi(n)}{n^s} \geq 1 \ \ \ \ \ (25)

for any {x \geq 1} and any real {s}. (One can get slightly better bounds by exploiting that {1*\chi} is also at least {1} on square numbers, as before, but this is really only useful for {s \leq 1/2}, and we are now going to take {s} much closer to {1}.)

On the other hand, one has the asymptotics

\displaystyle  \sum_{n \leq x} \frac{1}{n^s} = \zeta(s) + \frac{x^{1-s}}{1-s} + O( x^{-s} )

for any real {s} close (but not equal) to {1}, and similarly

\displaystyle  \sum_{n \leq x} \frac{\chi(n)}{n^s} = L(s,\chi) + O( q^{O(1)} x^{-s} )

for any real {s} close to {1}; similarly for {\chi', \chi \chi'}. From the hyperbola method, we can then conclude

\displaystyle  \sum_{n \leq x} \frac{1*\chi(n)}{n^s} = \zeta(s) L(s,\chi) + \frac{x^{1-s}}{1-s} L(1,\chi) + O( q^{O(1)} x^{0.5-s} ) \ \ \ \ \ (26)

for all real {s} sufficiently close to {1}. Indeed, one can expand the left-hand side of (26) as

\displaystyle  \sum_{d \leq \sqrt{x}} \frac{\chi(d)}{d^s} \sum_{m \leq x/d} \frac{1}{m^s} + \sum_{m < \sqrt{x}} \frac{1}{m^s} \sum_{\sqrt{x} < d \leq x/m} \frac{\chi(d)}{d^s}

and the claim then follows from the previous asymptotics. (One can improve the error term by smoothing the summation, but we will not need to do so here.)

Now set {x = C q^{C}} for a large absolute constant {C}. If {0.99 \leq s < 1}, then the error term in {O( q^{O(1)} x^{0.5-s} )} is then at most {1/2} (say) if {C} is large enough. We conclude from (25) that

\displaystyle  \zeta(s) L(s,\chi) \geq \frac{1}{2} - O( \frac{q^{O(1-s)}}{1-s} L(1,\chi) )

for {0.99 \leq s < 1}. Since {L(1,\chi) \leq c q^{-\epsilon}} and {c} is assumed small (depending on {\epsilon}), this implies that {\zeta(s) L(s,\chi)} is positive in the range

\displaystyle  L(1,\chi) \ll 1-s \ll \epsilon

(this can be seen by checking the cases {1-s \leq 1/\log q} and {1-s > 1/\log q} separately). On the other hand, {\zeta(s) L(s,\chi)} has a simple pole at {s=1} with positive residue, and is thus negative for {s<1} extremely close to {1}. By the intermediate value theorem, we conclude that {L(s,\chi)} has a zero for some {s = 1 - O(L(1,\chi))}. Conversely, it is not difficult (using summation by parts) to show that {L'(s,\chi) = O(\log^2 q)} for {s = 1 - O( 1/\log q )}, and so by the mean value theorem we see that the zero of {L(s,\chi)} must also obey {1-s \gg L(1,\chi) / \log^2 q}. Thus {L(s,\chi)} has a zero for some {s < 1} with

\displaystyle  L(1,\chi)/\log^2 q \ll 1-s \ll L(1,\chi). \ \ \ \ \ (27)

Similarly, {L(s',\chi')} has a zero for some {s' < 1} with

\displaystyle  L(1,\chi')/\log^2 q' \ll 1-s' \ll L(1,\chi'). \ \ \ \ \ (28)

Now, we consider the function

\displaystyle  f := 1 * \chi * \chi' * \chi\chi'.

One can also show that {f} is non-negative and equals {1} at {1}, thus

\displaystyle  \sum_{n \leq x} \frac{f(n)}{n^s} \geq 1.

(The algebraic number theory interpretation of this positivity is that {f(n)} is the number of representations of {n} as the norm of an ideal in the biquadratic field generated by {\sqrt{q}} and {\sqrt{q'}}.)

Also, by (a more complicated version of) the derivation of (26), one has

\displaystyle  \sum_{n \leq x} \frac{f(n)}{n^s} = \zeta(s) L(s,\chi) L(s,\chi') L(s,\chi\chi') + \frac{x^{1-s}}{1-s} L(1,\chi) L(1,\chi') L(1,\chi\chi') + O( (qq')^{O(1)} x^{0.9-s} )

(say). Arguing as before, we conclude that

\displaystyle  \zeta(s) L(s,\chi) L(s,\chi') L(s,\chi\chi') \geq \frac{1}{2} - O( \frac{(qq')^{O(1-s)}}{1-s} L(1,\chi) L(1,\chi')L(1,\chi\chi') )

for {0.99 \leq s < 1}. Using the bound {L(1,\chi\chi') \ll \log(qq')} (which can be established by summation by parts), we conclude that {\zeta(s) L(s,\chi) L(s,\chi') L(s,\chi\chi')} is positive in the range

\displaystyle  L(1,\chi) L(1,\chi') \log(qq') \ll 1-s \ll \epsilon.

Since we already know {L(s,\chi)} and {L(s',\chi')} have zeroes for some {s, s'} obeying (27), (28)

\displaystyle  \frac{L(1,\chi)}{\log^2 q}, \frac{L(1,\chi')}{\log^2 q'} \ll L(1,\chi) L(1,\chi') \log(qq');

taking geometric means and rearranging we obtain

\displaystyle  L(1,\chi) L(1,\chi') \gg \log(qq')^{-O(1)}.

But this contradicts the hypotheses {L(1,\chi) \leq c q^{-\epsilon}}, {L(1,\chi') \leq c (q')^{-\epsilon}} if {c} is small enough. \Box

Remark 5 Siegel’s theorem leads to a version of the prime number theorem in arithmetic progressions known as the Siegel-Walfisz theorem. As with Siegel’s theorem, the bounds are ineffective unless one is allowed to exclude a single exceptional modulus {q} (and its multiples), in which case one has a modified prime number theorem which favours the quadratic nonresidues mod {q}; see this paper of Granville.

Remark 6 One can improve the effective bounds in Siegel’s theorem if one is allowed to exclude a larger set of bad moduli. For instance, the arguments in Section 4 allow one to establish a bound of the form {L(1,\chi) \gg \log^{-O(1)} q} after excluding at most one {q} in each hyper-dyadic range {2^{100^k} \leq q \leq 2^{100^{k+1}}} for each {k}; one can of course replace {100} by other exponents here, but at the cost of worsening the {O(1)} term. (This is essentially an observation of Landau.)