In these notes we presume familiarity with the basic concepts of probability theory, such as random variables (which could take values in the reals, vectors, or other measurable spaces), probability, and expectation. Much of this theory is in turn based on measure theory, which we will also presume familiarity with. See for instance this previous set of lecture notes for a brief review.

The basic objects of study in analytic number theory are deterministic; there is nothing inherently random about the set of prime numbers, for instance. Despite this, one can still interpret many of the averages encountered in analytic number theory in probabilistic terms, by introducing random variables into the subject. Consider for instance the form

$\displaystyle \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)$

of the prime number theorem (where we take the limit ${x \rightarrow \infty}$). One can interpret this estimate probabilistically as

$\displaystyle {\mathbb E} \mu(\mathbf{n}) = o(1) \ \ \ \ \ (2)$

where ${\mathbf{n} = \mathbf{n}_{\leq x}}$ is a random variable drawn uniformly from the natural numbers up to ${x}$, and ${{\mathbb E}}$ denotes the expectation. (In this set of notes we will use boldface symbols to denote random variables, and non-boldface symbols for deterministic objects.) By itself, such an interpretation is little more than a change of notation. However, the power of this interpretation becomes more apparent when one then imports concepts from probability theory (together with all their attendant intuitions and tools), such as independence, conditioning, stationarity, total variation distance, and entropy. For instance, suppose we want to use the prime number theorem (1) to make a prediction for the sum

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+1).$

After dividing by ${x}$, this is essentially

$\displaystyle {\mathbb E} \mu(\mathbf{n}) \mu(\mathbf{n}+1).$

With probabilistic intuition, one may expect the random variables ${\mu(\mathbf{n}), \mu(\mathbf{n}+1)}$ to be approximately independent (there is no obvious relationship between the number of prime factors of ${\mathbf{n}}$, and of ${\mathbf{n}+1}$), and so the above average would be expected to be approximately equal to

$\displaystyle ({\mathbb E} \mu(\mathbf{n})) ({\mathbb E} \mu(\mathbf{n}+1))$

which by (2) is equal to ${o(1)}$. Thus we are led to the prediction

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+1) = o(x). \ \ \ \ \ (3)$

The asymptotic (3) is widely believed (it is a special case of the Chowla conjecture, which we will discuss in later notes); while there has been recent progress towards establishing it rigorously, it remains open for now.
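As a numerical aside, the averages in (1) and (3) are easy to inspect empirically. The sketch below (the cutoff ${x = 10^6}$ and the linear-sieve implementation are arbitrary choices) computes both running averages; both come out small, consistent with the prime number theorem and the Chowla conjecture, though of course such experiments prove nothing.

```python
def mobius_sieve(limit):
    """Mobius function mu[0..limit] via a linear sieve (each n is crossed
    out exactly once, by its smallest prime factor)."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_comp = [False] * (limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if not is_comp[i]:
            primes.append(i)
            mu[i] = -1
        for p in primes:
            if i * p > limit:
                break
            is_comp[i * p] = True
            if i % p == 0:
                mu[i * p] = 0  # p^2 divides i*p, so mu vanishes
                break
            mu[i * p] = -mu[i]
    return mu

x = 10**6
mu = mobius_sieve(x)
mean_mu = sum(mu[1:]) / x                                    # cf. (1), (2)
mean_pair = sum(mu[n] * mu[n + 1] for n in range(1, x)) / x  # cf. (3)
print(mean_mu, mean_pair)
```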
How would one try to make these probabilistic intuitions more rigorous? The first thing one needs to do is find a more quantitative measurement of what it means for two random variables to be “approximately” independent. There are several candidates for such measurements, but we will focus in these notes on two particularly convenient measures of approximate independence: the “${L^2}$” measure of independence known as covariance, and the “${L \log L}$” measure of independence known as mutual information (actually we will usually need the more general notion of conditional mutual information that measures conditional independence). The use of ${L^2}$ type methods in analytic number theory is well established, though it is usually not described in probabilistic terms, being referred to instead by such names as the “second moment method”, the “large sieve” or the “method of bilinear sums”. The use of ${L \log L}$ methods (or “entropy methods”) is much more recent, and has been able to control certain types of averages in analytic number theory that were out of reach of previous methods such as ${L^2}$ methods. For instance, in later notes we will use entropy methods to establish the logarithmically averaged version

$\displaystyle \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = o(\log x) \ \ \ \ \ (4)$

of (3), which is implied by (3) but strictly weaker (much as the prime number theorem (1) implies the bound ${\sum_{n \leq x} \frac{\mu(n)}{n} = o(\log x)}$, but the latter bound is much easier to establish than the former).
As with many other situations in analytic number theory, we can exploit the fact that certain assertions (such as approximate independence) can become significantly easier to prove if one only seeks to establish them on average, rather than uniformly. For instance, given two random variables ${\mathbf{X}}$ and ${\mathbf{Y}}$ of number-theoretic origin (such as the random variables ${\mu(\mathbf{n})}$ and ${\mu(\mathbf{n}+1)}$ mentioned previously), it can often be extremely difficult to determine the extent to which ${\mathbf{X},\mathbf{Y}}$ behave “independently” (or “conditionally independently”). However, thanks to second moment tools or entropy based tools, it is often possible to assert results of the following flavour: if ${\mathbf{Y}_1,\dots,\mathbf{Y}_k}$ are a large collection of “independent” random variables, and ${\mathbf{X}}$ is a further random variable that is “not too large” in some sense, then ${\mathbf{X}}$ must necessarily be nearly independent (or conditionally independent) of many of the ${\mathbf{Y}_i}$, even if one cannot pinpoint precisely which of the ${\mathbf{Y}_i}$ the variable ${\mathbf{X}}$ is nearly independent of. In the case of the second moment method, this allows us to compute correlations such as ${{\mathbb E} {\mathbf X} \mathbf{Y}_i}$ for “most” ${i}$. The entropy method gives bounds that are significantly weaker quantitatively than the second moment method (and in particular, in its current incarnation at least it is only able to say non-trivial assertions involving interactions with residue classes at small primes), but can control significantly more general quantities ${{\mathbb E} F( {\mathbf X}, \mathbf{Y}_i )}$ for “most” ${i}$ thanks to tools such as the Pinsker inequality.

— 1. Second moment methods —

In this section we discuss probabilistic techniques of an “${L^2}$” nature. We fix a probability space ${(\Omega, {\mathcal F}, \mathop{\bf P})}$ to model all of our random variables; thus for instance we shall model a complex random variable ${\mathbf{X}}$ in these notes by a measurable function ${\mathbf{X}: \Omega \rightarrow {\bf C}}$. (Strictly speaking, there is a subtle distinction one can maintain between a random variable and its various measure-theoretic models, which becomes relevant if one later decides to modify the probability space ${\Omega}$, but this distinction will not be so important in these notes and so we shall ignore it. See this previous set of notes for more discussion.)

We will focus here on the space ${L^2 = L^2(\Omega, {\mathcal F}, \mathop{\bf P})}$ of complex random variables ${\mathbf{X}}$ (that is to say, measurable maps ${\mathbf{X}: \Omega \rightarrow {\bf C}}$) whose second moment

$\displaystyle {\mathbb E}(|\mathbf{X}|^2)$

is finite. In many number-theoretic applications the finiteness of the second moment will be automatic because ${\mathbf{X}}$ will only take finitely many values. As is well known, the space ${L^2}$ has the structure of a complex Hilbert space, with inner product

$\displaystyle \langle \mathbf{X}, \mathbf{Y} \rangle_{L^2} := {\mathbb E}(\mathbf{X} \overline{\mathbf{Y}})$

and norm

$\displaystyle \| \mathbf{X} \|_{L^2} := ({\mathbb E} (|\mathbf{X}|^2))^{1/2}$

for ${\mathbf{X},\mathbf{Y} \in L^2}$. By slight abuse of notation, the complex numbers ${{\bf C}}$ can be viewed as a subset of ${L^2}$, by viewing any given complex number ${z}$ as a constant (deterministic) random variable. Then ${{\bf C}}$ is a one-dimensional subspace of ${L^2}$, spanned by the unit vector ${1 \in L^2}$. Given a random variable ${\mathbf{X}}$ in ${L^2}$, the projection of ${\mathbf{X}}$ to ${{\bf C}}$ is then the mean

$\displaystyle {\mathbb E} \mathbf{X} =\langle \mathbf{X}, 1 \rangle,$

and we obtain an orthogonal splitting ${\mathbf{X} = ({\mathbb E} \mathbf{X}) + (\mathbf{X} - {\mathbb E} \mathbf{X})}$ of any ${\mathbf{X} \in L^2}$ into its mean ${{\mathbb E} \mathbf{X}}$ and its mean zero part ${\mathbf{X} - {\mathbb E} \mathbf{X}}$. By Pythagoras’ theorem, we then have

$\displaystyle {\mathbb E}(|\mathbf{X}|^2) = {\mathbb E}(|\mathbf{X} - {\mathbb E} \mathbf{X}|^2) + {\mathbb E} |{\mathbb E} \mathbf{X}|^2.$

The first quantity on the right-hand side is the square of the distance from ${\mathbf{X}}$ to ${{\bf C}}$, and this non-negative quantity is known as the variance

$\displaystyle \mathrm{Var} \mathbf{X} := {\mathbb E}(|\mathbf{X} - {\mathbb E} \mathbf{X}|^2) = {\mathbb E}(|\mathbf{X}|^2) - |{\mathbb E} \mathbf{X}|^2. \ \ \ \ \ (5)$

The square root ${(\mathrm{Var} \mathbf{X})^{1/2} = \| \mathbf{X} - {\mathbb E} \mathbf{X} \|_{L^2}}$ of the variance is known as the standard deviation. The variance controls the distribution of the random variable ${\mathbf{X}}$ through Chebyshev’s inequality

$\displaystyle {\mathbb P}( |\mathbf{X} - {\mathbb E} \mathbf{X}| \geq \lambda ) \leq \frac{\mathrm{Var}(\mathbf{X})}{\lambda^2} \ \ \ \ \ (6)$

for any ${\lambda > 0}$, which is immediate from observing the inequality ${1_{|\mathbf{X} - {\mathbb E} \mathbf{X}| \geq \lambda} \leq \frac{|\mathbf{X} - {\mathbb E} \mathbf{X}|^2}{\lambda^2}}$ and then taking expectations of both sides. Roughly speaking, this inequality asserts that ${\mathbf{X}}$ typically deviates from its mean ${{\mathbb E} \mathbf{X}}$ by no more than a bounded multiple of the standard deviation ${(\mathrm{Var} \mathbf{X})^{1/2}}$.
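To see Chebyshev's inequality (6) in action, here is a small Monte Carlo sketch; the distribution (exponential with mean ${1}$), the sample size, and the threshold ${\lambda = 3}$ are all arbitrary choices, and the empirical tail probability is compared against the bound ${\mathrm{Var}(\mathbf{X})/\lambda^2}$.

```python
import random

random.seed(0)  # fixed seed for reproducibility
N = 100_000
# Exponential distribution with mean 1 (and variance 1); an arbitrary choice.
samples = [random.expovariate(1.0) for _ in range(N)]
mean = sum(samples) / N
var = sum((s - mean) ** 2 for s in samples) / N
lam = 3.0
# Empirical tail probability P(|X - E X| >= lambda), versus the bound (6).
tail = sum(1 for s in samples if abs(s - mean) >= lam) / N
print(tail, var / lam ** 2)
```

For this distribution the true tail probability is about ${e^{-4} \approx 0.018}$, well below the Chebyshev bound of about ${1/9}$: the bound is crude, but completely general.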
A slight generalisation of Chebyshev’s inequality that can be convenient to use is

$\displaystyle {\mathbb P}( |\mathbf{X} - \mu| \geq \lambda ) \leq \frac{\mathrm{Var}(\mathbf{X}) + |{\mathbb E} \mathbf{X} - \mu|^2}{\lambda^2} \ \ \ \ \ (7)$

for any ${\lambda > 0}$ and any complex number ${\mu}$ (which typically will be a simplified approximation to the mean ${{\mathbb E} \mathbf{X}}$), which is proven similarly to (6) but noting (from (5)) that ${{\mathbb E} |\mathbf{X}-\mu|^2 = \mathrm{Var}(\mathbf{X}) + |{\mathbb E} \mathbf{X} - \mu|^2}$.
Informally, (6) is an assertion that a square-integrable random variable ${\mathbf{X}}$ will concentrate around its mean ${{\mathbb E} \mathbf{X}}$ if its variance is not too large. See these previous notes for more discussion of the concentration of measure phenomenon. One can often obtain stronger concentration of measure than what is provided by Chebyshev’s inequality if one is able to calculate higher moments than the second moment, such as the fourth moment ${{\mathbb E} |\mathbf{X}|^4}$ or exponential moments ${{\mathbb E} e^{c\mathbf{X}}}$, but we will not pursue this direction in this set of notes.

Clearly the variance is homogeneous of order two, thus

$\displaystyle \mathrm{Var}(c\mathbf{X}) = |c|^2 \mathrm{Var}(\mathbf{X})$

for any ${\mathbf{X} \in L^2}$ and ${c \in {\bf C}}$. In particular, the variance is not always additive: the claim ${\mathrm{Var}(\mathbf{X}+\mathbf{Y}) = \mathrm{Var}(\mathbf{X}) + \mathrm{Var}(\mathbf{Y})}$ fails in particular when ${\mathbf{X}=\mathbf{Y}}$ is not almost surely constant. However, there is an important substitute for this formula. Given two random variables ${\mathbf{X},\mathbf{Y} \in L^2}$, the inner product of the corresponding mean zero parts ${\mathbf{X} - {\mathbb E} \mathbf{X}, \mathbf{Y} - {\mathbb E} \mathbf{Y}}$ is a complex number known as the covariance:

$\displaystyle \mathrm{Cov}(\mathbf{X},\mathbf{Y}) := \langle \mathbf{X}-{\mathbb E} \mathbf{X}, \mathbf{Y} - {\mathbb E} \mathbf{Y} \rangle_{L^2} = {\mathbb E}( (\mathbf{X}-{\mathbb E} \mathbf{X}) \overline{(\mathbf{Y} - {\mathbb E} \mathbf{Y})} ).$

As ${{\mathbb E} \mathbf{X}, {\mathbb E} \mathbf{Y}}$ are orthogonal to ${\mathbf{X}-{\mathbb E} \mathbf{X}, \mathbf{Y} - {\mathbb E} \mathbf{Y}}$, it is not difficult to obtain the alternate formula

$\displaystyle \mathrm{Cov}(\mathbf{X},\mathbf{Y}) = \langle \mathbf{X}, \mathbf{Y} \rangle_{L^2} - \langle {\mathbb E} \mathbf{X}, {\mathbb E} \mathbf{Y} \rangle = {\mathbb E}(\mathbf{X} \overline{\mathbf{Y}}) - ({\mathbb E} \mathbf{X}) \overline{{\mathbb E} \mathbf{Y}} \ \ \ \ \ (8)$

for the covariance.
The covariance is then a positive semi-definite inner product on ${L^2}$ (it basically arises from the Hilbert space structure of the space ${{\bf C}^\perp}$ of mean zero variables), and ${\mathrm{Cov}(\mathbf{X},\mathbf{X}) = \mathrm{Var}(\mathbf{X})}$. From the Cauchy-Schwarz inequality we have

$\displaystyle |\mathrm{Cov}(\mathbf{X},\mathbf{Y})| \leq \mathrm{Var}(\mathbf{X})^{1/2} \mathrm{Var}(\mathbf{Y})^{1/2}.$

If ${\mathbf{X}, \mathbf{Y}}$ have non-zero variance (that is, they are not almost surely constant), then the ratio

$\displaystyle \frac{\mathrm{Cov}(\mathbf{X},\mathbf{Y})}{\mathrm{Var}(\mathbf{X})^{1/2} \mathrm{Var}(\mathbf{Y})^{1/2}}$

is then known as the correlation between ${\mathbf{X}}$ and ${\mathbf{Y}}$, and is a complex number of magnitude at most ${1}$; for real-valued ${\mathbf{X}, \mathbf{Y}}$ that are not almost surely constant, the correlation is instead a real number between ${-1}$ and ${1}$. At one extreme, a correlation of magnitude ${1}$ occurs if and only if ${\mathbf{X} - {\mathbb E} \mathbf{X}}$ is a scalar multiple of ${\mathbf{Y} - {\mathbb E} \mathbf{Y}}$. At the other extreme, a correlation of zero is an indication (though not a guarantee) of independence. Recall that two random variables ${\mathbf{X}, \mathbf{Y}}$ are independent if one has

$\displaystyle {\mathbb P}( \mathbf{X} \in E, \mathbf{Y} \in F ) = {\mathbb P}( \mathbf{X} \in E ) {\mathbb P}( \mathbf{Y} \in F )$

for all (Borel) measurable ${E,F}$. In particular, for real-valued ${\mathbf{X}, \mathbf{Y}}$, setting ${E = (x,\infty)}$, ${F = (y,\infty)}$ for ${x,y \geq 0}$ and integrating using Fubini's theorem (noting that ${\max(\mathbf{X},0) = \int_0^\infty 1_{\mathbf{X} > x}\ dx}$), we conclude that

$\displaystyle {\mathbb E} \max(\mathbf{X},0) \max(\mathbf{Y},0) = ({\mathbb E} \max(\mathbf{X},0)) ({\mathbb E} \max(\mathbf{Y},0));$

similarly with ${\max(\mathbf{X},0)}$ replaced by ${-\min(\mathbf{X},0)}$, and similarly for ${\mathbf{Y}}$; the complex-valued case then follows by splitting into real and imaginary parts. In particular we have

$\displaystyle {\mathbb E} \mathbf{X} \mathbf{Y} = ({\mathbb E} \mathbf{X}) ({\mathbb E} \mathbf{Y})$

and thus from (8) we see that independent random variables have zero covariance (and zero correlation, when they are not almost surely constant). On the other hand, the converse fails:

Exercise 1 Provide an example of two random variables ${\mathbf{X}, \mathbf{Y} \in L^2}$ which are not independent, but which have zero correlation or covariance with each other. (There are many ways to produce some examples. One comes from exploiting various systems of orthogonal functions, such as sines and cosines. Another comes from working with random variables taking only a small number of values, such as ${\{-1,0,1\}}$.)
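For the forward direction (independence implies zero covariance), a tiny exact computation may be reassuring. Here ${\mathbf{X}, \mathbf{Y}}$ are two independent fair dice, an arbitrary toy example, and the covariance formula (8) is evaluated exactly by enumerating the (uniform) product sample space.

```python
from fractions import Fraction
from itertools import product

# Uniform distribution on the product sample space {1..6} x {1..6},
# i.e. two independent fair dice.
outcomes = list(product(range(1, 7), repeat=2))
EX = Fraction(sum(x for x, y in outcomes), len(outcomes))
EY = Fraction(sum(y for x, y in outcomes), len(outcomes))
EXY = Fraction(sum(x * y for x, y in outcomes), len(outcomes))
cov = EXY - EX * EY  # formula (8) for real-valued variables
print(cov)  # prints 0
```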

From the cosine rule we have

$\displaystyle \mathrm{Var}(\mathbf{X}+\mathbf{Y}) = \mathrm{Var}(\mathbf{X}) + \mathrm{Var}(\mathbf{Y}) + 2 \mathrm{Re} \mathrm{Cov}( \mathbf{X}, \mathbf{Y} ) \ \ \ \ \ (9)$

and more generally

$\displaystyle \mathrm{Var}(\mathbf{X}_1+ \dots + \mathbf{X}_n) = \sum_{i=1}^n \mathrm{Var}(\mathbf{X}_i) + 2 \sum_{1 \leq i < j \leq n} \mathrm{Re} \mathrm{Cov}( \mathbf{X}_i, \mathbf{X}_j ) \ \ \ \ \ (10)$

for any finite collection of random variables ${\mathbf{X}_1, \dots, \mathbf{X}_n \in L^2}$. These identities combine well with Chebyshev-type inequalities such as (6), (7), and this leads to a very common instance of the second moment method in action. For instance, we can use it to understand the distribution of the number of prime factors of a random number ${\mathbf{n}}$ that fall within a given set ${{\mathcal P}}$. Given any set ${A}$ of natural numbers, define the logarithmic size ${l(A)}$ to be the quantity

$\displaystyle l(A) := \sum_{n \in A} \frac{1}{n}.$

Thus for instance Euler’s theorem asserts that the primes have infinite logarithmic size.
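For concreteness, one can compute the logarithmic size of the primes up to a cutoff and compare with ${\log\log}$; by Mertens' second theorem the difference tends to Mertens' constant ${\approx 0.2615}$. The cutoff ${N = 10^6}$ below is an arbitrary choice.

```python
import math

def primes_up_to(limit):
    """Simple sieve of Eratosthenes returning the list of primes <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, ok in enumerate(sieve) if ok]

N = 10**6
ps = primes_up_to(N)
lP = sum(1.0 / p for p in ps)  # logarithmic size of the primes up to N
print(lP, math.log(math.log(N)))  # difference is close to Mertens' constant
```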

Lemma 2 (Turan-Kubilius inequality, special case) Let ${I \subset {\bf R}}$ be an interval of length at least ${1}$, and let ${\mathbf{n}}$ be an integer drawn uniformly at random from this interval, thus

$\displaystyle {\mathbb P}( \mathbf{n} \in E ) = \frac{\#(E \cap I)}{\#({\bf Z} \cap I)}$

for all ${E \subset {\bf Z}}$. Let ${{\mathcal P}}$ be a finite collection of primes, all of which have size at most ${P}$. Then the random variable ${\mathbf{\omega} := \# \{ p \in {\mathcal P}: p | \mathbf{n} \}}$ has mean

$\displaystyle {\mathbb E} \mathbf{\omega} = l(\mathcal{P}) + O( \frac{P}{|I|} )$

and variance

$\displaystyle \mathrm{Var}( \mathbf{\omega} ) \ll l(\mathcal{P}) + \frac{P^2}{|I|}.$

In particular,

$\displaystyle {\mathbb E} |\mathbf{\omega} - l(\mathcal{P})|^2 \ll l(\mathcal{P}) + \frac{P^2}{|I|}$

and from (7) we have

$\displaystyle {\mathbb P}( |\mathbf{\omega} - l({\mathcal P}) | \geq \lambda ) \ll \frac{l({\mathcal P}) + \frac{P^2}{|I|} }{\lambda^2}$

for any ${\lambda > 0}$.

Proof: For any natural number ${d}$, we have

$\displaystyle \sum_{n \in I} 1_{d|n} = \#\{ m \in {\bf Z}: dm \in I \} = \frac{1}{d} |I| + O(1)$

and hence

$\displaystyle {\mathbb P}( d | \mathbf{n}) = \frac{\frac{1}{d} |I| + O(1)}{|I| + O(1)} = \frac{1}{d} + O( \frac{1}{|I|} ). \ \ \ \ \ (11)$

We now write ${\mathbf{\omega} = \sum_{p \in {\mathcal P}} 1_{p|{\mathbf n}}}$. From (11) we see that each indicator random variable ${1_{p|{\mathbf n}}}$, ${p \in {\mathcal P}}$ has mean ${\frac{1}{p} + O( \frac{1}{|I|})}$ and variance ${O( \frac{1}{p} + \frac{1}{|I|})}$; similarly, for any two distinct ${p_1, p_2 \in {\mathcal P}}$, we see from (11) and (8) that the indicators ${1_{p_1|{\mathbf n}}}$, ${1_{p_2|{\mathbf n}}}$ have covariance

$\displaystyle \mathrm{Cov}(1_{p_1|{\mathbf n}}, 1_{p_2|{\mathbf n}}) = \frac{1}{p_1 p_2} + O( \frac{1}{|I|} ) - ( \frac{1}{p_1} + O( \frac{1}{|I|} ) ) ( \frac{1}{p_2} + O( \frac{1}{|I|} ) ) = O( \frac{1}{|I|} )$

and the claim now follows from (10). $\Box$
The exponents of ${P}$ in the error terms here are not optimal; but in practice, we apply this inequality when ${|I|}$ is much larger than any given power of ${P}$, so factors such as ${P^2/|I|}$ will be negligible. Informally speaking, the above lemma asserts that a typical number in a large interval ${I}$ will have roughly ${l({\mathcal P})}$ prime factors in a given finite set ${{\mathcal P}}$ of primes, as long as the logarithmic size ${l({\mathcal P})}$ is large.
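The lemma is also easy to test numerically. The sketch below takes ${I = [1, 10^5]}$ and ${{\mathcal P}}$ the primes up to ${100}$ (both arbitrary choices) and evaluates the mean of ${\mathbf{\omega}}$ and the second moment ${{\mathbb E} |\mathbf{\omega} - l({\mathcal P})|^2}$ exactly by enumerating the interval.

```python
def primes_up_to(limit):
    """Simple sieve of Eratosthenes returning the list of primes <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, ok in enumerate(sieve) if ok]

x = 10**5                      # the interval I = [1, x]
P = primes_up_to(100)          # the finite set of primes, all of size <= 100
lP = sum(1.0 / p for p in P)   # logarithmic size l(P)
# omega(n) = number of primes in P dividing n, for each n in I
omega = [sum(1 for p in P if n % p == 0) for n in range(1, x + 1)]
mean = sum(omega) / x
second_moment = sum((w - lP) ** 2 for w in omega) / x
print(mean, lP, second_moment)
```

The mean matches ${l({\mathcal P})}$ to high accuracy, and the second moment is of size comparable to ${l({\mathcal P})}$, as the lemma predicts.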

If we apply the above lemma to ${I = [1,x]}$ for some large ${x}$, and ${{\mathcal P}}$ equal to the primes up to (say) ${P = x^{0.1}}$, we have ${l({\mathcal P}) = \log\log x + O(1)}$, and hence

$\displaystyle {\mathbb E} |\mathbf{\omega} - \log\log x |^2 \ll \log \log x.$

Since ${\omega(\mathbf{n}) = \mathbf{\omega} + O(1)}$ (an integer ${n \leq x}$ has at most ten prime factors exceeding ${x^{0.1}}$), we recover the main result

$\displaystyle \frac{1}{x} \sum_{n \leq x} |\omega(n) - \log\log x|^2 \ll \log\log x$

of Section 5 of Notes 1 (indeed this is essentially the same argument as in that section, dressed up in probabilistic language). In particular, we recover the Hardy-Ramanujan law that a proportion ${1-o(1)}$ of the natural numbers in ${[1,x]}$ have ${(1+o(1)) \log\log x}$ prime factors.

Exercise 3 (Turan-Kubilius inequality, general case) Let ${f: {\bf N} \rightarrow {\bf C}}$ be an additive function (which means that ${f(nm) = f(n) + f(m)}$ whenever ${n,m}$ are coprime). Show that

$\displaystyle \frac{1}{x} \sum_{n \leq x} |f(n) - A|^2 \ll \sum_{p \leq x} \sum_{j \leq \log x/\log p} |f(p^j)|^2 / p^j$

where

$\displaystyle A := \sum_{p \leq x} \sum_{j \leq \log x/\log p} f(p^j) p^{-j} (1-\frac{1}{p}).$

(Hint: one may first want to work with the special case when ${f(p^j)}$ vanishes whenever ${p^j \geq x^{0.1}}$ so that the second moment method can be profitably applied, and then figure out how to address the contributions of prime powers larger than ${x^{0.1}}$.)

Exercise 4 (Turan-Kubilius inequality, logarithmic version) Let ${2 \leq y \leq x}$ with ${x/y \geq 2}$, and let ${{\mathcal P}}$ be a collection of primes of size less than ${P}$ with ${P \leq y^{0.1}}$. Show that

$\displaystyle \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{|\sum_{p \in {\mathcal P}} 1_{p|n} - l({\mathcal P})|^2}{n} \ll l({\mathcal P}) + 1 \ \ \ \ \ (12)$

Exercise 5 (Paley-Zygmund inequality) Let ${\mathbf{X} \in L^2}$ be non-negative with positive mean. Show that

$\displaystyle {\mathbb P}( \mathbf{X} > \lambda {\mathbb E} \mathbf{X}) \geq (1-\lambda)^2 \frac{ ({\mathbb E} \mathbf{X})^2 }{ {\mathbb E} (\mathbf{X}^2) }.$

This inequality can sometimes give slightly sharper results than the Chebyshev inequality when using the second moment method.
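Since the Paley-Zygmund inequality is a statement about the first two moments, it can be checked exactly on a toy discrete distribution; below, ${\mathbf{X}}$ is uniform on ${\{1,2,3\}}$ and ${\lambda = 1/2}$, both arbitrary choices, with all arithmetic done in exact rationals.

```python
from fractions import Fraction

values = [1, 2, 3]  # X uniform on {1, 2, 3}
EX = Fraction(sum(values), len(values))                   # E X = 2
EX2 = Fraction(sum(v * v for v in values), len(values))   # E X^2 = 14/3
lam = Fraction(1, 2)
# Left-hand side: P(X > lambda * E X) = P(X > 1) = 2/3
lhs = Fraction(sum(1 for v in values if v > lam * EX), len(values))
# Right-hand side: (1 - lambda)^2 (E X)^2 / E(X^2) = 3/14
rhs = (1 - lam) ** 2 * EX ** 2 / EX2
print(lhs, rhs)  # prints 2/3 3/14
```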

Now we give a useful lemma that quantifies a heuristic mentioned in the introduction, namely that if several random variables ${\mathbf{Y}_1,\dots,\mathbf{Y}_n}$ do not correlate with each other, then it is not possible for any further random variable ${\mathbf{X}}$ to correlate with many of them simultaneously. We first state an abstract Hilbert space version.

Lemma 6 (Bessel type inequality, Hilbert space version) If ${u, v_1,\dots,v_n}$ are elements of a Hilbert space ${H}$, and ${w_1,\dots,w_n}$ are positive reals, then

$\displaystyle \left( \sum_{i=1}^n w_i |\langle u, v_i \rangle_{H}|^2\right)^{1/2} \leq \|u \|_{H} \left( \sup_{1 \leq i \leq n} \sum_{j=1}^n w_j |\langle v_i, v_j\rangle_{H}| \right)^{1/2}. \ \ \ \ \ (13)$

Proof: We use the duality method. Namely, we can write the left-hand side of (13) as

$\displaystyle \sum_{i=1}^n w_i \langle u, v_i \rangle_{H} \overline{c_i} = \langle u, \sum_{i=1}^n w_i c_i v_i \rangle_{H}$

for some complex numbers ${c_i}$ with ${\sum_{i=1}^n w_i |c_i|^2=1}$ (just take ${c_i}$ to be ${\langle u, v_i \rangle_{H}}$ normalised by the left-hand side of (13), or zero if that left-hand side vanishes). By Cauchy-Schwarz, it then suffices to establish the dual inequality

$\displaystyle \|\sum_{i=1}^n w_i c_i v_i \|_{H}^2 \leq \sup_{1 \leq i \leq n} \sum_{j=1}^n w_j |\langle v_i, v_j\rangle_{H}|.$

The left-hand side can be written as

$\displaystyle \sum_{i=1}^n \sum_{j=1}^n w_i w_j c_i \overline{c_j} \langle v_i, v_j \rangle.$

Using the arithmetic mean-geometric mean inequality ${|c_i \overline{c_j}| \leq \frac{1}{2} |c_i|^2 + \frac{1}{2} |c_j|^2}$ and symmetry, this may be bounded by

$\displaystyle \sum_{i=1}^n w_i |c_i|^2 \sum_{j=1}^n w_j |\langle v_i, v_j \rangle|.$

Since ${\sum_{i=1}^n w_i |c_i|^2 =1}$, the claim follows. $\Box$

Corollary 7 (Bessel type inequality, probabilistic version) If ${\mathbf{X}, \mathbf{Y}_1,\dots,\mathbf{Y}_n \in L^2}$, and ${w_1,\dots,w_n}$ are positive reals, then

$\displaystyle \sum_{i=1}^n w_i |\mathrm{Cov}( \mathbf{X}, \mathbf{Y}_i )|^2 \ \ \ \ \ (14)$

$\displaystyle \leq \mathrm{Var}(\mathbf{X}) \left( \sup_{1 \leq i \leq n} \sum_{j=1}^n w_j |\mathrm{Cov}(\mathbf{Y}_i, \mathbf{Y}_j)| \right).$

Proof: By subtracting the mean from each of ${\mathbf{X}, \mathbf{Y}_i}$ we may assume that these random variables have mean zero. The claim now follows from Lemma 6. $\Box$

To get a feel for this inequality, suppose for sake of discussion that ${\mathbf{X}}$ and ${\mathbf{Y}_i}$ all have unit variance and ${w_i = 1}$, but that the ${\mathbf{Y}_i}$ are pairwise uncorrelated. Then the right-hand side is equal to ${1}$, and the left-hand side is the sum of squares of the correlations between ${\mathbf{X}}$ and each of the ${\mathbf{Y}_i}$. Any individual correlation is then still permitted to be as large as ${1}$, but it is not possible for multiple correlations to be this large simultaneously. This is geometrically intuitive if one views the random variables ${\mathbf{X}, \mathbf{Y}_i}$ as vectors in a Hilbert space (and correlation as a rough proxy for the angle between such vectors). This lemma also shares many commonalities with the large sieve inequality, discussed in this set of notes.
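This geometric picture can be illustrated with a quick Monte Carlo sketch of Corollary 7: here the ${\mathbf{Y}_i}$ are a common Gaussian signal ${\mathbf{Z}}$ plus independent noise, and ${\mathbf{X}}$ is taken to be ${\mathbf{Z}}$ itself (all choices, including the unit weights ${w_i}$, are arbitrary), so that ${\mathbf{X}}$ correlates with every ${\mathbf{Y}_i}$; the left-hand side of (14) then sits just below the right-hand side.

```python
import random

random.seed(1)
N = 50_000
n_vars = 5
Z = [random.gauss(0, 1) for _ in range(N)]  # common Gaussian signal
# Y_i = Z + independent standard Gaussian noise, so Cov(Y_i, Y_j) ~ 1 (i != j)
Y = [[z + random.gauss(0, 1) for z in Z] for _ in range(n_vars)]
X = Z  # X is the common signal, so Cov(X, Y_i) ~ 1 for every i

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

w = [1.0] * n_vars  # unit weights
lhs = sum(w[i] * cov(X, Y[i]) ** 2 for i in range(n_vars))         # ~ 5
var_X = cov(X, X)                                                  # ~ 1
rhs = var_X * max(sum(w[j] * abs(cov(Y[i], Y[j])) for j in range(n_vars))
                  for i in range(n_vars))                          # ~ 6
print(lhs, rhs)
```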

One basic number-theoretic application of this inequality is the following sampling inequality of Elliott, that lets one approximate a sum ${\sum_{n \in I} f(n)}$ of an arithmetic function by its values ${\sum_{n \in I} f(n) p 1_{p|n}}$ on multiples of primes ${p}$:

Exercise 8 (Elliott’s inequality) Let ${I \subset {\bf R}}$ be an interval of length at least ${1}$. Show that for any function ${f: {\bf Z} \rightarrow {\bf C}}$, one has

$\displaystyle \sum_{p \leq |I|^{0.1}} \frac{1}{p} \left| \frac{1}{|I|} \sum_{n \in I} f(n) - \frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n} \right|^2 \ll \frac{1}{|I|} \sum_{n \in I} |f(n)|^2.$

(Hint: Apply Corollary 7 with ${\mathbf{X} := f({\mathbf n})}$, ${\mathbf{Y}_p := p 1_{p|\mathbf{n}}}$, and ${w_p := 1/p}$, where ${\mathbf{n}}$ is the uniform variable from Lemma 2.) Conclude in particular that for every ${\varepsilon > 0}$, one has

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n) = \frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n} + O\left( \varepsilon \left(\frac{1}{|I|} \sum_{n \in I} |f(n)|^2\right)^{1/2} \right)$

for all primes ${p \leq |I|^{0.1}}$ outside of a set of exceptional primes of logarithmic size ${O(\varepsilon^{-2})}$.

Informally, the point of this inequality is that an arbitrary arithmetic function ${f}$ may exhibit correlation with the indicator function ${1_{p|n}}$ of the multiples of ${p}$ for some primes ${p}$, but cannot exhibit significant correlation with all of these indicators simultaneously, because these indicators are not very correlated to each other. We note however that this inequality only gains a tiny bit over trivial bounds, because the set of primes up to ${x}$ only has logarithmic size ${O(\log\log x)}$ by Mertens’ theorems; thus, any asymptotics that are obtained using this inequality will typically have error terms that only improve upon the trivial bound by factors such as ${\log\log x}$.
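As an illustration, one can evaluate the left-hand side of Elliott's inequality for ${f = \mu}$ on ${I = [1, 10^5]}$. For a more interesting test the sketch below sums over primes up to ${100}$ rather than ${|I|^{0.1}}$ (this cutoff, like the length of ${I}$, is an arbitrary choice); the left-hand side comes out far below the right-hand side, reflecting the weak correlation between ${\mu}$ and the indicators ${1_{p|n}}$.

```python
def mobius_sieve(limit):
    """Mobius function mu[0..limit] and the primes up to limit,
    via a linear sieve."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_comp = [False] * (limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if not is_comp[i]:
            primes.append(i)
            mu[i] = -1
        for p in primes:
            if i * p > limit:
                break
            is_comp[i * p] = True
            if i % p == 0:
                mu[i * p] = 0
                break
            mu[i * p] = -mu[i]
    return mu, primes

x = 10**5
mu, primes = mobius_sieve(x)
avg = sum(mu[1:]) / x  # (1/|I|) sum_{n in I} mu(n)
lhs = 0.0
for p in [q for q in primes if q <= 100]:
    # (1/|I|) sum_{n in I} mu(n) p 1_{p|n}
    avg_p = p * sum(mu[n] for n in range(p, x + 1, p)) / x
    lhs += (1.0 / p) * (avg - avg_p) ** 2
rhs = sum(1 for n in range(1, x + 1) if mu[n] != 0) / x  # (1/|I|) sum |mu|^2
print(lhs, rhs)
```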

Exercise 9 (Elliott’s inequality, logarithmic form) Let ${2 \leq y \leq x}$ with ${x/y \geq 2}$. Show that for any function ${f: {\bf Z} \rightarrow {\bf C}}$, one has

$\displaystyle \sum_{p \leq y^{0.1}} \frac{1}{p} \left| \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{f(n)}{n} - \frac{1}{\log(x/y)} \sum_{y/p \leq n \leq x/p} \frac{f(pn)}{n} \right|^2$

$\displaystyle \ll \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{|f(n)|^2}{n}$

and thus for every ${\varepsilon>0}$, one has

$\displaystyle \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{f(n)}{n} = \frac{1}{\log(x/y)} \sum_{y/p \leq n \leq x/p} \frac{f(pn)}{n}$

$\displaystyle + O\left( \varepsilon \left( \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{|f(n)|^2}{n} \right)^{1/2} \right)$

for all primes ${p \leq y^{0.1}}$ outside of an exceptional set of primes of logarithmic size ${O(\varepsilon^{-2})}$.

Exercise 10 Use Exercise 9 and a duality argument to provide an alternate proof of Exercise 4. (Hint: express the left-hand side of (12) as a correlation between ${\sum_{p \in {\mathcal P}} 1_{p|n} - l({\mathcal P})}$ and some suitably ${L^2}$-normalised arithmetic function ${f(n)}$.)

As a quick application of Elliott’s inequality, let us establish a weak version of the prime number theorem:

Proposition 11 (Weak prime number theorem) For any ${\varepsilon>0}$ we have

$\displaystyle \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} = O(\varepsilon)$

whenever ${y, x/y}$ are sufficiently large depending on ${\varepsilon}$.

This estimate is weaker than what one can obtain by existing methods, such as Exercise 56 of Notes 1. However in the next section we will refine this argument to recover the full prime number theorem.

Proof: Fix ${\varepsilon}$, and suppose that ${y, x/y}$ are sufficiently large. From Exercise 9 one has

$\displaystyle \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} = \frac{1}{\log(x/y)} \sum_{y/p \leq n \leq x/p} \frac{\mu(pn)}{n} + O(\varepsilon)$

for all primes ${p \leq y^{0.1}}$ outside of an exceptional set of logarithmic size ${O(\varepsilon^{-2})}$. If we restrict attention to primes ${p \leq (x/y)^\varepsilon}$ then one sees from the integral test that one can replace the sum ${\sum_{y/p \leq n \leq x/p}}$ by ${\sum_{y \leq n \leq x}}$ and only incur an additional error of ${O(\varepsilon)}$. If we furthermore restrict to primes ${p}$ larger than ${\varepsilon^{-1}}$, then the contribution of those ${n}$ that are divisible by ${p}$ is also ${O(\varepsilon)}$. For ${n}$ not divisible by ${p}$, one has ${\mu(pn) = -\mu(n)}$. Putting all this together, we conclude that

$\displaystyle \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} = - \frac{1}{\log(x/y)} \sum_{y \leq n \leq x} \frac{\mu(n)}{n} + O(\varepsilon)$

for all primes ${\varepsilon^{-1} \leq p \leq \min((x/y)^\varepsilon, y^{0.1})}$ outside of an exceptional set of logarithmic size ${O(\varepsilon^{-2})}$. In particular, since by Mertens' theorems the primes in this range have logarithmic size tending to infinity as ${y, x/y \rightarrow \infty}$, for ${y, x/y}$ large enough this statement is true for at least one such ${p}$. The claim then follows. $\Box$
As another application of Elliott’s inequality, we present a criterion for orthogonality between multiplicative functions and other sequences, first discovered by Katai (with related results also introduced earlier by Daboussi and Delange), and rediscovered by Bourgain, Sarnak, and Ziegler:

Proposition 12 (Daboussi-Delange-Katai-Bourgain-Sarnak-Ziegler criterion) Let ${f: {\bf N} \rightarrow {\bf C}}$ be a multiplicative function with ${|f(n)| \leq 1}$ for all ${n}$, and let ${g: {\bf N} \rightarrow {\bf C}}$ be another bounded function. Suppose that one has

$\displaystyle \frac{1}{x} \sum_{n \leq x} g(pn) \overline{g(qn)} = o(1)$

as ${x \rightarrow \infty}$ for any two distinct primes ${p,q}$. Then one has

$\displaystyle \frac{1}{x} \sum_{n \leq x} f(n) \overline{g(n)} = o(1)$

as ${x \rightarrow \infty}$.

Proof: Suppose the claim fails; then there exists ${\varepsilon>0}$ (which we can assume to be small) and arbitrarily large ${x}$ such that

$\displaystyle |\frac{1}{x} \sum_{n \leq x} f(n) \overline{g(n)}| \geq \varepsilon.$

By Exercise 8, this implies that

$\displaystyle |\frac{1}{x} \sum_{n \leq x} f(n) \overline{g(n)} p 1_{p|n}| \gg \varepsilon \ \ \ \ \ (15)$

for all primes ${p \leq x}$ outside of an exceptional set of logarithmic size ${O(\varepsilon^{-2})}$. Call such primes “good primes”. In particular, by the pigeonhole principle, and assuming ${x}$ large enough, there exists a range ${[P, (1+\varepsilon^2) P]}$ with ${\varepsilon^{-10} \ll P \ll_\varepsilon 1}$ which contains ${\gg \varepsilon^2 \frac{P}{\log P}}$ good primes.
Fix a good prime ${p}$ in ${[P, (1+\varepsilon^2)P]}$. From (15) we have

$\displaystyle |\frac{1}{x/p} \sum_{n \leq x/p} f(pn) \overline{g(pn)}| \gg \varepsilon.$

We can replace the range ${n \leq x/p}$ by ${n \leq x/P}$ with negligible error. We also have ${f(pn) = f(p) f(n)}$ except when ${n}$ is a multiple of ${p}$, but this latter case only contributes ${O(1/p)}$ which is also negligible compared to the right-hand side. We conclude that

$\displaystyle |\frac{1}{x/P} \sum_{n \leq x/P} f(n) \overline{g(pn)}| \gg \varepsilon$

for every good prime. On the other hand, from Lemma 6 we have

$\displaystyle \sum_p |\frac{1}{x/P} \sum_{n \leq x/P} f(n) \overline{g(pn)}|^2 \ll \sup_p \sum_q |\frac{1}{x/P} \sum_{n \leq x/P} g(pn) \overline{g(qn)}|$

where ${p,q}$ range over the good primes in ${[P,(1+\varepsilon^2) P]}$. The left-hand side is then ${\gg \varepsilon^4 \frac{P}{\log P}}$, and by hypothesis the right-hand side is ${O(1)}$ for ${x}$ large enough. As ${P \geq \varepsilon^{-10}}$ and ${\varepsilon}$ is small, this gives the desired contradiction. $\Box$

Exercise 13 (Daboussi-Delange theorem) Let ${\alpha}$ be irrational, and let ${f: {\bf N} \rightarrow {\bf C}}$ be a multiplicative function with ${|f(n)| \leq 1}$ for all ${n}$. Show that

$\displaystyle \frac{1}{x} \sum_{n \leq x} f(n) e(-\alpha n) = o(1) \ \ \ \ \ (16)$

as ${x \rightarrow \infty}$. If instead ${\alpha}$ is rational, show that there exists a multiplicative function ${f: {\bf N} \rightarrow {\bf C}}$ with ${|f(n)| \leq 1}$ for which the statement (16) fails. (Hint: use Dirichlet characters and Plancherel's theorem for finite abelian groups.)
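For ${f = \mu}$ and ${\alpha = \sqrt{2}}$ (arbitrary choices), both the hypothesis of Proposition 12 for ${g(n) := e(\alpha n)}$ (which reduces to a geometric series) and the conclusion (16) can be inspected numerically; again this is only a sanity check at one finite height, not a proof.

```python
import cmath
import math

def mobius_sieve(limit):
    """Mobius function mu[0..limit] via a linear sieve."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_comp = [False] * (limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if not is_comp[i]:
            primes.append(i)
            mu[i] = -1
        for p in primes:
            if i * p > limit:
                break
            is_comp[i * p] = True
            if i % p == 0:
                mu[i * p] = 0
                break
            mu[i * p] = -mu[i]
    return mu

x = 10**5
alpha = math.sqrt(2)
e = lambda t: cmath.exp(2j * math.pi * t)  # e(t) = e^{2 pi i t}

# Hypothesis of Proposition 12 for g(n) = e(alpha n), with p = 2, q = 3:
# the average is a geometric series and is tiny for irrational alpha.
hyp = abs(sum(e(alpha * (2 - 3) * n) for n in range(1, x + 1))) / x

# Conclusion (16) for f = mu: the twisted average also appears small.
mu = mobius_sieve(x)
concl = abs(sum(mu[n] * e(-alpha * n) for n in range(1, x + 1))) / x
print(hyp, concl)
```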

— 2. An elementary proof of the prime number theorem —

Define the Mertens function

$\displaystyle M(x) := \sum_{n \leq x} \mu(n).$

As shown in Theorem 58 of Notes 1, the prime number theorem is equivalent to the bound

$\displaystyle \frac{M(x)}{x}=o(1) \ \ \ \ \ (17)$

as ${x \rightarrow \infty}$. We now give a recent proof of this theorem, due to Redmond McNamara (personal communication), that relies primarily on Elliott’s inequality and the Selberg symmetry formula; it is a relative of the standard elementary proof of this theorem due to Erdős and Selberg. In order to keep the exposition simple, we will not arrange the argument in a fashion that optimises the decay rate ${o(1)}$ (in any event, there are other proofs of the prime number theorem that give significantly stronger bounds).
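
Before proceeding, here is a quick numerical illustration (no part of the proof depends on it): a short Python sketch, with helper names of my own choosing, that sieves ${\mu}$ and tabulates the slow decay of ${M(x)/x}$.

```python
def mertens(limit):
    # sieve the Möbius function mu(1..limit), then take partial sums
    mu = [1] * (limit + 1)
    is_comp = [False] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_comp[p]:                      # p is prime
            for m in range(p, limit + 1, p):
                if m > p:
                    is_comp[m] = True
                mu[m] *= -1                     # one sign flip per prime factor
            for m in range(p * p, limit + 1, p * p):
                mu[m] = 0                       # not squarefree
    M = [0] * (limit + 1)
    for n in range(1, limit + 1):
        M[n] = M[n - 1] + mu[n]
    return M

M = mertens(10 ** 5)
for x in (10, 100, 1000, 10 ** 4, 10 ** 5):
    print(x, M[x], M[x] / x)   # the ratio M(x)/x slowly tends to zero
```
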
Firstly we see that Elliott’s inequality gives the following weaker version of (17):

Lemma 14 (Oscillation for Mertens’ function) If ${x \geq 2}$ and ${\varepsilon > 0}$, then we have

$\displaystyle \frac{M(x)}{x} = -\frac{M(x/p)}{x/p} + O(\varepsilon)$

for all primes ${p \leq x}$ outside of an exceptional set of primes of logarithmic size ${O(\varepsilon^{-2})}$.

Proof: We may assume ${\varepsilon \leq 1}$ as the claim is trivial otherwise. From Exercise 8 applied to ${I=[1,x]}$ and ${f=\mu}$, we have

$\displaystyle \frac{M(x)}{x} = \frac{1}{x/p} \sum_{n \leq x/p} \mu(pn) + O(\varepsilon)$

for all ${p \leq x}$ outside of an exceptional set of primes of logarithmic size ${O(\varepsilon^{-2})}$. Since ${\mu(pn)=-\mu(n)}$ for ${n}$ not divisible by ${p}$, the right-hand side can be written as

$\displaystyle -\frac{M(x/p)}{x/p} + O(\varepsilon) + O(1/p).$

Since ${1/p \leq \varepsilon}$ outside of an exceptional set of logarithmic size ${O(\varepsilon^{-2})}$, the claim follows. $\Box$
Informally, this lemma asserts that ${\frac{M(x)}{x} \approx -\frac{M(x/p)}{x/p}}$ for most primes ${p}$, which morally implies that ${\frac{M(x)}{x} \approx \frac{M(x/p_1p_2)}{x/p_1 p_2}}$ for most pairs of primes ${p_1,p_2}$. If we can then locate suitable primes ${p,p_1,p_2}$ with ${p \approx p_1 p_2}$, this should lead to ${\frac{M(x)}{x} \approx -\frac{M(x)}{x}}$, which should then yield the prime number theorem ${\frac{M(x)}{x} = o(1)}$. The manipulations below are intended to make this argument rigorous.
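
This heuristic can be observed numerically; in the following Python sketch (illustrative only, with hypothetical helper names), the quantity ${\frac{M(x)}{x} + \frac{M(x/p)}{x/p}}$ is computed for a few small primes ${p}$ and is indeed small.

```python
def mobius_sieve(limit):
    # Möbius function by sieving: flip sign per prime factor, zero on squares
    mu = [1] * (limit + 1)
    is_comp = [False] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_comp[p]:
            for m in range(p, limit + 1, p):
                if m > p:
                    is_comp[m] = True
                mu[m] *= -1
            for m in range(p * p, limit + 1, p * p):
                mu[m] = 0
    return mu

x = 10 ** 5
mu = mobius_sieve(x)
M = [0] * (x + 1)
for n in range(1, x + 1):
    M[n] = M[n - 1] + mu[n]

# Lemma 14 predicts M(x)/x ~ -M(x/p)/(x/p) for most primes p
for p in (2, 3, 5, 7, 11):
    print(p, M[x] / x + M[x // p] / (x // p))
```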

It will be convenient to work with a logarithmically averaged version of this claim.

Corollary 15 (Logarithmically averaged oscillation) If ${\varepsilon>0}$ and ${x}$ is sufficiently large depending on ${\varepsilon}$, then

$\displaystyle \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x \ \ \ \ \ (18)$

for all ${p \leq \log x}$ outside of an exceptional set of logarithmic size ${O(\varepsilon^{-3})}$.

Proof: For each ${\log x \leq t \leq x}$, we have from the previous lemma that

$\displaystyle |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \leq \varepsilon$

for all ${p \leq \log x}$ outside of an exceptional set ${{\mathcal P}_t}$ of logarithmic size ${O(\varepsilon^{-2})}$. We then have

$\displaystyle \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x + \int_{\log x}^x 1_{p \in {\mathcal P}_t} \frac{dt}{t}$

so it suffices by Markov’s inequality to show that

$\displaystyle \sum_{p \leq \log x} \frac{1}{p} \int_{\log x}^x 1_{p \in {\mathcal P}_t} \frac{dt}{t} \ll \varepsilon^{-2} \log x.$

But by Fubini’s theorem, the left-hand side may be bounded by

$\displaystyle \int_{\log x}^x \varepsilon^{-2} \frac{dt}{t}$

and the claim follows. $\Box$
Let ${\varepsilon>0}$ be sufficiently small, and let ${x}$ be sufficiently large depending on ${\varepsilon}$. Call a prime ${p \leq \log x}$ good if the bound (18) holds and bad otherwise, thus all primes ${p \leq \log x}$ outside of an exceptional set of bad primes of logarithmic size ${O(\varepsilon^{-3})}$ are good. Now we observe that we can make ${M(x)/x}$ small as long as we can make two good primes multiply to be close to a third:

Proposition 16 Suppose there are good primes ${p_1, p_2, p \leq (\log x)^{0.1}}$ with ${p = (1+O(\varepsilon)) p_1 p_2}$. Then ${\frac{M(x)}{x} = O(\varepsilon)}$.

Proof: By definition of good prime, we have the bounds

$\displaystyle \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p_1)}{t/p_1}| \frac{dt}{t} \ll \varepsilon \log x \ \ \ \ \ (19)$

$\displaystyle \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p_2)}{t/p_2}| \frac{dt}{t} \ll \varepsilon \log x \ \ \ \ \ (20)$

$\displaystyle \int_{\log x}^x |\frac{M(t)}{t} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x. \ \ \ \ \ (21)$

We rescale (20) by ${p_1}$ to conclude that

$\displaystyle \int_{p_1\log x}^{p_1x} |\frac{M(t/p_1)}{t/p_1} + \frac{M(t/p_1p_2)}{t/p_1p_2}| \frac{dt}{t} \ll \varepsilon \log x.$

We can replace the integration range here from ${[p_1\log x, p_1 x]}$ to ${[\log x, x]}$ with an error of ${O(\varepsilon \log x)}$ if ${x}$ is large enough. Also, since ${p = (1+O(\varepsilon)) p_1 p_2}$, we have ${\frac{M(t/p_1p_2)}{t/p_1p_2} = \frac{M(t/p)}{t/p} + O(\varepsilon)}$. Thus we have

$\displaystyle \int_{\log x}^{x} |\frac{M(t/p_1)}{t/p_1} + \frac{M(t/p)}{t/p}| \frac{dt}{t} \ll \varepsilon \log x.$

Combining this with (19), (21) and the triangle inequality (writing ${\frac{M(t)}{t}}$ as a linear combination of ${\frac{M(t)}{t} + \frac{M(t/p_1)}{t/p_1}}$, ${\frac{M(t)}{t} + \frac{M(t/p)}{t/p}}$, and ${\frac{M(t/p_1)}{t/p_1} + \frac{M(t/p)}{t/p}}$) we conclude that

$\displaystyle \int_{\log x}^x |\frac{M(t)}{t}| \frac{dt}{t} \ll \varepsilon \log x. \ \ \ \ \ (22)$

This is an averaged version of the claim we need. To remove the averaging, we use the identity ${\mu L = - \mu * \Lambda}$ (see equation (63) of Notes 1) to conclude that

$\displaystyle \sum_{n \leq x} \mu(n) \log n = - \sum_{m \leq x} \Lambda(m) M(x/m).$

From the triangle inequality one has

$\displaystyle |M(x/m)| \ll \varepsilon^{-1} \int_{x/m}^{(1+\varepsilon)x/m} \frac{|M(t)|}{t}\ dt + \varepsilon \frac{x}{m}$

and hence by Mertens’ theorem

$\displaystyle \sum_{n \leq x} \mu(n) \log n \ll \varepsilon^{-1} \sum_{m \leq x} \Lambda(m) \int_{x/m}^{(1+\varepsilon)x/m} \frac{|M(t)|}{t}\ dt + \varepsilon x \log x.$

From the Brun-Titchmarsh inequality (Corollary 61 of Notes 1) we have

$\displaystyle \sum_m \Lambda(m) 1_{x/m \leq t \leq (1+\varepsilon)x/m} \ll \varepsilon \frac{x}{t}$

and so from the previous estimate and Fubini’s theorem one has

$\displaystyle \sum_{n \leq x} \mu(n) \log n \ll \int_{1}^{(1+\varepsilon)x} \frac{|M(t)|}{t} \frac{x}{t}\ dt + \varepsilon x \log x$

and hence by (22) (using trivial bounds to handle the region outside of ${[\log x, x]}$)

$\displaystyle \sum_{n \leq x} \mu(n) \log n \ll \varepsilon x \log x.$

Since

$\displaystyle \sum_{n \leq x} \mu(n) \log(x/n) \ll \sum_{n \leq x} \log(x/n) \ll x,$

we conclude (for ${x}$ large enough) that

$\displaystyle \sum_{n \leq x} \mu(n) \log x \ll \varepsilon x \log x$

and the claim follows. $\Box$
To finish the proof of the prime number theorem, it thus suffices to locate, for ${x}$ sufficiently large, three good primes ${p_1,p_2,p}$ with ${p = (1+O(\varepsilon)) p_1 p_2}$. If we already had the prime number theorem, or even the weaker form that every interval of the form ${[x,e^\varepsilon x]}$ contained ${\gg \frac{\varepsilon x}{\log x}}$ primes for ${x}$ large enough, then this would be quite easy: pick a large natural number ${K}$ (depending on ${\varepsilon}$, but independent of ${x}$), so that the primes up to ${e^{\varepsilon K}}$ have logarithmic size ${\asymp \log K}$ (so that only ${O(\frac{\varepsilon^{-3}}{\log K})}$ of them are bad, as measured by logarithmic size), and let ${\mathbf{k}_1, \mathbf{k}_2}$ be drawn independently and uniformly from (say) ${\{K+1,\dots,2K\}}$. From the prime number theorem, for each ${K \leq k \leq 4K}$, the interval ${I_k := [e^{\varepsilon k},e^{\varepsilon(k+1)}]}$ contains ${\gg \frac{e^{\varepsilon k}}{k} }$ primes. In particular, ${I_{\mathbf{k}_1}}$ contains ${\gg \frac{e^{\varepsilon \mathbf{k}_1}}{\mathbf{k}_1} }$ primes, but the expected number of bad primes in this interval is ${O( \frac{\varepsilon^{-3}}{\log K} \frac{e^{\varepsilon \mathbf{k}_1}}{\mathbf{k}_1} )}$. Thus by Markov’s inequality there would be at least a ${90\%}$ chance (say) of having at least one good prime ${\mathbf{p}_1}$ in ${I_{\mathbf{k}_1}}$; similarly there is a ${90\%}$ chance of having a good prime ${\mathbf{p}_2}$ in ${I_{\mathbf{k}_2}}$, and a ${90\%}$ chance of having a good prime ${\mathbf{p}}$ in ${I_{\mathbf{k}_1+\mathbf{k}_2}}$. Thus (as an application of the probabilistic method), there exist (deterministic) good primes ${p_1,p_2,p}$ with the required properties.

Of course, using the prime number theorem here to prove the prime number theorem would be circular. However, we can still locate a good triple ${p_1,p_2,p}$ of primes using the Selberg symmetry formula

$\displaystyle \sum_{n \leq x} \Lambda_2(n) = (2 + o(1)) x \log x \ \ \ \ \ (23)$

as ${x \rightarrow \infty}$, where ${\Lambda_2}$ is the second von Mangoldt function

$\displaystyle \Lambda_2(n) = \Lambda(n) \log n + \sum_{d|n} \Lambda(d) \Lambda(\frac{n}{d});$

see Proposition 60 of Notes 1. We can strip away the contribution of the primes:

Exercise 17 Show that

$\displaystyle \sum_{p \leq x} \log^2 p + \sum_{p_1,p_2: p_1 p_2 \leq x} \log p_1 \log p_2 = (2+o(1)) x \log x$

as ${x \rightarrow \infty}$.
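
One can check this asymptotic numerically. The Python sketch below (helper names are mine, for illustration) computes the left-hand side at ${x = 10^5}$ using a prime sieve and prefix sums of ${\log p}$; the ratio to ${x \log x}$ is indeed close to ${2}$.

```python
import bisect
import math

def primes_up_to(n):
    # simple sieve of Eratosthenes
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(2, n + 1) if sieve[p]]

x = 10 ** 5
primes = primes_up_to(x)
logs = [math.log(p) for p in primes]
theta = [0.0]                      # theta[i] = sum of log p over first i primes
for lp in logs:
    theta.append(theta[-1] + lp)

prime_part = sum(lp * lp for lp in logs)
# ordered pairs (p_1, p_2) with p_1 p_2 <= x, summed via theta(x / p_1)
semiprime_part = sum(
    lp * theta[bisect.bisect_right(primes, x // p)]
    for p, lp in zip(primes, logs)
)
ratio = (prime_part + semiprime_part) / (x * math.log(x))
print(ratio)   # close to 2, as the exercise predicts
```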

In particular, on evaluating this at ${x = e^{\varepsilon k}, e^{\varepsilon(k+1)}}$ and subtracting, we have

$\displaystyle \sum_{p \in I_k} \varepsilon^2 k^2 + \sum_{p_1,p_2: p_1 p_2 \in I_k} \log p_1 \log p_2 \asymp \varepsilon^2 k e^{\varepsilon k}$

whenever ${k}$ is sufficiently large depending on ${\varepsilon}$. In particular, for any such ${k}$, one either has

$\displaystyle \sum_{p \in I_k} \frac{1}{p} \gg \frac{1}{k} \ \ \ \ \ (24)$

or

$\displaystyle \sum_{p_1 p_2 \in I_k} \frac{\log p_1 \log p_2}{p_1 p_2} \gg \varepsilon^2 k \ \ \ \ \ (25)$

(or both). Informally, the Selberg symmetry formula shows that the interval ${I_k}$ contains either a lot of primes, or a lot of semiprimes. The factor of ${\log p_1 \log p_2}$ is slightly annoying, so we now remove it. Consider the contribution of those primes ${p_1}$ to (25) with ${p_1 \leq e^{\varepsilon^3 k}}$. This is bounded by

$\displaystyle \sum_{p_1 \leq e^{\varepsilon^3 k}} \frac{\log p_1}{p_1} \sum_{p_2 = (1+O(\varepsilon)) e^{\varepsilon k} / p_1} \frac{\varepsilon k}{p_2}$

which we can bound crudely using the Chebyshev bound by

$\displaystyle \sum_{p_1 \leq e^{\varepsilon^3 k}} \frac{\log p_1}{p_1}$

which by Mertens’ theorem is ${O(\varepsilon^3 k)}$. Thus the contribution of this case can be safely removed from (25). Similarly for those cases when ${p_2 \leq e^{\varepsilon^3 k}}$. For the remaining cases we bound ${\log p_1 \log p_2 \leq \varepsilon^2 k^2}$. We conclude that for any sufficiently large ${k}$, either (24) or

$\displaystyle \sum_{p_1 p_2 \in I_k: p_1,p_2 \geq e^{\varepsilon^3 k}} \frac{1}{p_1 p_2} \gg \frac{1}{k}. \ \ \ \ \ (26)$

holds (or both).
In order to find primes ${p_1,p_2,p}$ with ${p}$ close to ${p_1 p_2}$, it would be very convenient if we could find a ${k}$ for which (24) and (26) both hold. We can’t quite do this directly, but due to the “connected” nature of the set of scales ${k}$, we can do the next best thing:

Proposition 18 Suppose ${k_0}$ is sufficiently large depending on ${\varepsilon}$. Then there exist ${k,k' \in [k_0,k_0+\varepsilon^{-2}]}$ with ${|k-k'| \leq 1}$ such that

$\displaystyle \sum_{p \in I_k} \frac{1}{p} \geq \frac{\varepsilon}{k} \ \ \ \ \ (27)$

and

$\displaystyle \sum_{p_1 p_2 \in I_{k'}: p_1,p_2 \geq e^{\varepsilon^3 k'}} \frac{1}{p_1 p_2} \geq \varepsilon \frac{1}{k'}. \ \ \ \ \ (28)$

Proof: We know that every ${k}$ in ${[k_0,k_0+\varepsilon^{-2}]}$ obeys at least one of (27), (28). Our task is to produce an adjacent pair ${k,k'}$, one of which obeys (27) and the other of which obeys (28). Suppose for contradiction that no such pair exists. Then whenever ${k}$ fails to obey (27), any adjacent ${k}$ must also fail to do so, and similarly for (28). Thus either (27) will fail to hold for all ${k \in [k_0,k_0+\varepsilon^{-2}]}$, or (28) will fail to hold for all such ${k}$. If (27) fails for all ${k \in [k_0,k_0+\varepsilon^{-2}]}$, then on summing we have

$\displaystyle \sum_{e^{\varepsilon k_0} \leq p \leq e^{\varepsilon^{-1}} e^{\varepsilon k_0}} \frac{1}{p} \ll \frac{1}{\varepsilon k_0}$

which contradicts Mertens’ theorem if ${\varepsilon}$ is small enough because the left-hand side is ${\gg \frac{1}{\varepsilon^2 k_0}}$. Similarly, if (28) fails for all ${k \in [k_0,k_0+\varepsilon^{-2}]}$, then

$\displaystyle \sum_{e^{\varepsilon k_0} \leq p_1 p_2 \leq e^{\varepsilon^{-1}} e^{\varepsilon k_0}: p_1,p_2 \geq e^{2\varepsilon^3 k_0}} \frac{1}{p_1 p_2} \leq \frac{1}{\varepsilon k_0}$

and again Mertens’ theorem can be used to lower bound the left-hand side by ${\gg \frac{1}{\varepsilon^2 k_0}}$ (in fact one can even gain an additional factor of ${\log \frac{1}{\varepsilon}}$ if one works things through carefully) and obtain a contradiction. $\Box$
The above proposition does indeed provide a triple of primes ${p_1,p_2,p}$ with ${p = (1+O(\varepsilon)) p_1 p_2}$. If ${k_0}$ is sufficiently large depending on ${\varepsilon}$ and less than (say) ${\log\log x}$, so that ${p_1,p_2,p \leq \log^{0.1} x}$, this would give us what we need as long as one of the triples consisted only of good primes. The only way this can fail is if either

$\displaystyle \sum_{p \in I_k: p \hbox{ bad}} \frac{1}{p} \gg \varepsilon \frac{1}{k}$

for some ${k \in [k_0,k_0+\varepsilon^{-2}]}$, or if

$\displaystyle \sum_{p_1 p_2 \in I_{k'}: p_1,p_2 \geq e^{\varepsilon^3 k'}; p_1 \hbox{ bad}} \frac{1}{p_1 p_2} \gg \frac{\varepsilon}{k'}$

for some ${k' \in [k_0,k_0+\varepsilon^{-2}]}$. In the first case, we can sum to conclude that

$\displaystyle \sum_{e^{\varepsilon k_0} \leq p \leq e^{\varepsilon k_0 + 2\varepsilon^{-2}}: p \hbox{ bad}} \frac{1}{p} \gg \frac{\varepsilon}{k_0}, \ \ \ \ \ (29)$

and in the second case we have

$\displaystyle \sum_{e^{\varepsilon^3 k_0} \leq p_1 \leq e^{\varepsilon k_0 + 2\varepsilon^{-2}}; p_1 \hbox{ bad}} \sum_{e^{\varepsilon k_0}/p_1 \leq p_2 \leq e^{\varepsilon(k_0+1)}/p_1} \frac{1}{p_1 p_2} \gg \frac{\varepsilon}{k_0}$

and hence by Chebyshev bounds

$\displaystyle \sum_{e^{\varepsilon^3 k_0} \leq p_1 \leq e^{\varepsilon k_0 + 2\varepsilon^{-2}}; p_1 \hbox{ bad}} \frac{1}{p_1} \gg \varepsilon^2. \ \ \ \ \ (30)$

Since the total set of bad primes up to ${\log x}$ has logarithmic size ${O(\varepsilon^{-3})}$, we conclude from the pigeonhole principle (and the divergence of the harmonic series ${\sum_{k_0=1}^\infty \frac{1}{k_0}}$) that for any ${C_\varepsilon}$ depending only on ${\varepsilon}$, and any ${x}$ large enough, there exists ${C_\varepsilon \leq k_0 \leq \log\log x}$ such that neither of (29) nor (30) holds. Indeed the set of ${C_\varepsilon \leq k_0 \leq \log\log x}$ obeying (29) has logarithmic size ${O_\varepsilon(1)}$, and similarly for (30). Choosing a ${k_0}$ that avoids both of these scenarios, we then find a good ${p \in I_k}$ and good ${p_1,p_2}$ with ${p_1 p_2 \in I_{k'}}$, so that ${p_1 p_2 = (1+O(\varepsilon)) p}$, and then by Proposition 16 we conclude that ${M(x)/x = O(\varepsilon)}$ for all sufficiently large ${x}$. Sending ${\varepsilon}$ to zero, we obtain the prime number theorem.

— 3. Entropy methods —

In the previous section we explored the consequences of the second moment method, which applies to square-integrable random variables ${\mathbf{X} \in L^2}$ taking values in the real or complex numbers. Now we explore entropy methods, which instead apply to random variables taking only finitely many values (equipped with the discrete sigma-algebra), whose range need not be numerical in nature. (One could extend entropy methods to slightly larger classes of random variables, such as ones that attain a countable number of values, but for our applications finitely-valued random variables will suffice.)

The fundamental notion here is that of the Shannon entropy of a random variable. If ${\mathbf{X}}$ takes values in a finite set ${R}$, its Shannon entropy ${\mathbb{H}(\mathbf{X})}$ (or entropy for short) is defined by the formula

$\displaystyle \mathbb{H}(\mathbf{X}) := \sum_{x \in R} {\mathbb P}(\mathbf{X}=x) \log \frac{1}{{\mathbb P}(\mathbf{X}=x)}$

where ${x}$ ranges over all the possible values of ${\mathbf{X}}$, and we adopt the convention ${0 \log \frac{1}{0} = 0}$, so that values that are almost surely not attained by ${\mathbf{X}}$ do not influence the entropy. We choose here to use the natural logarithm ${\log}$ to normalise our entropy (in which case a unit of entropy is known as a “nat”); in the information theory literature it is also common to use the base two logarithm ${\log_2}$ to measure entropy (in which case a unit of entropy is known as a “bit”, which is equal to ${\log 2}$ nats). However, the precise choice of normalisation will not be important in our discussion.
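
For concreteness, the definition is straightforward to compute from a probability distribution; here is a minimal Python sketch (the function name is mine), using the natural logarithm and the convention ${0 \log \frac{1}{0} = 0}$.

```python
import math

def shannon_entropy(dist):
    # dist: dict mapping values to probabilities (summing to 1);
    # entropy in nats, with the convention 0 log(1/0) = 0
    return sum(p * math.log(1.0 / p) for p in dist.values() if p > 0)

uniform = {r: 0.25 for r in "abcd"}
print(shannon_entropy(uniform))          # log 4, the maximum for 4 values
print(shannon_entropy({"a": 1.0}))       # 0 for a deterministic variable
```
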
It is clear that if two random variables ${\mathbf{X}, \mathbf{Y}}$ have the same probability distribution, then they have the same entropy. Also, the precise choice of range set ${R}$ is not terribly important: if ${\mathbf{X}}$ takes values in ${R}$, and ${f: R \rightarrow R'}$ is an injection, then it is clear that ${\mathbf{X}}$ and ${f(\mathbf{X})}$ have the same entropy:

$\displaystyle \mathbb{H}(f(\mathbf{X})) = \mathbb{H}(\mathbf{X}). \ \ \ \ \ (31)$

This is in sharp contrast to moment-based statistics such as the mean or variance, which can be radically changed by applying some injective transformation to the range values.
Informally, the entropy ${\mathbb{H}(\mathbf{X})}$ measures how “spread out” or “disordered” the distribution of ${\mathbf{X}}$ is, behaving like a logarithm of the size of the “essential support” of such a variable; from an information-theoretic viewpoint, it measures the amount of “information” one learns when one is told the value of ${\mathbf{X}}$. Here are some basic properties of Shannon entropy that help support this intuition:

Exercise 19 (Basic properties of Shannon entropy) Let ${\mathbf{X}}$ be a random variable taking values in a finite set ${R}$.

• (i) Show that ${\mathbb{H}(\mathbf{X}) \geq 0}$, with equality if and only if ${\mathbf{X}}$ is almost surely deterministic (that is to say, it is almost surely equal to a constant ${x}$).
• (ii) Show that

$\displaystyle \mathbb{H}(\mathbf{X}) \leq \log \# R, \ \ \ \ \ (32)$

with equality if and only if ${\mathbf{X}}$ is uniformly distributed on ${R}$. (Hint: use Jensen’s inequality and the concavity of the map ${t \mapsto t \log \frac{1}{t}}$ on ${[0,1]}$.)

• (iii) (Shannon-McMillan-Breiman theorem) Let ${n}$ be a natural number, and let ${\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n}$ be independent copies of ${\mathbf{X}}$. As ${n \rightarrow \infty}$, show that there is a subset ${\Omega_n \subset R^n}$ of cardinality ${e^{(\mathbb{H}(\mathbf{X}) + o(1)) n}}$ with the properties that

$\displaystyle {\mathbb P}( (\mathbf{X}_1,\dots,\mathbf{X}_n) \in \Omega_n ) = 1-o(1)$

and

$\displaystyle {\mathbb P}( (\mathbf{X}_1,\dots,\mathbf{X}_n) = (x_1,\dots,x_n) ) = e^{(-\mathbb{H}(\mathbf{X}) + o(1)) n}$

uniformly for all ${(x_1,\dots,x_n) \in \Omega_n}$. (The proof of this theorem will require Stirling’s formula, which you may assume here as a black box; see also this previous blog post.) Informally, we thus see that a large tuple ${(\mathbf{X}_1,\dots,\mathbf{X}_n)}$ of independent samples of ${\mathbf{X}}$ behaves approximately like a uniform distribution on ${e^{(\mathbb{H}(\mathbf{X}) + o(1)) n}}$ values.

One can view Shannon entropy as a generalisation of the notion of cardinality of a finite set (or equivalently, cardinality of finite sets can be viewed as a special case of Shannon entropy); see this previous blog post for an elaboration of this point.
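As a quick Monte Carlo illustration of the Shannon-McMillan-Breiman theorem from Exercise 19(iii) (a sketch only, with hypothetical helper names): for i.i.d. Bernoulli samples, the per-symbol “surprisal” ${-\frac{1}{n}\log {\mathbb P}( (\mathbf{X}_1,\dots,\mathbf{X}_n) )}$ concentrates near the entropy.

```python
import math
import random

def aep_demo(p=0.2, n=2000, seed=0):
    # draw n i.i.d. Bernoulli(p) bits and compare the per-symbol
    # surprisal -log P(sample)/n against the entropy H (in nats)
    rng = random.Random(seed)
    k = sum(1 for _ in range(n) if rng.random() < p)   # number of ones
    surprisal = -(k * math.log(p) + (n - k) * math.log(1 - p)) / n
    entropy = -(p * math.log(p) + (1 - p) * math.log(1 - p))
    return surprisal, entropy

s, h = aep_demo()
print(s, h)   # the two numbers are close, as (iii) predicts
```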

The concept of Shannon entropy becomes significantly more powerful when combined with that of conditioning. Recall that a random variable ${\mathbf{X}}$ taking values in a range set ${R}$ can be modeled by a measurable map ${\mathbf{X}: \Omega \rightarrow R}$ from a probability space ${(\Omega,{\mathcal F}, \mathop{\bf P})}$ to the range ${R}$. If ${E \in {\mathcal F}}$ is an event in ${\Omega}$ of positive probability, we can then condition ${\mathbf{X}: \Omega \rightarrow R}$ to the event ${E}$ to form a new random variable ${(\mathbf{X}|E): E \rightarrow R}$ on the conditioned probability space ${(E, {\mathcal F}|_E, \mathop{\bf P}|E)}$, where

$\displaystyle {\mathcal F}|_E := \{ E \cap F: F \in {\mathcal F} \}$

is the restriction of the ${\sigma}$-algebra ${{\mathcal F}}$ to ${E}$,

$\displaystyle {\mathbb P}(F|E) := \frac{{\mathbb P}( E \cap F )}{{\mathbb P}(E)}$

is the conditional probability measure on ${E}$, and ${\mathbf{X}|E:E \rightarrow R}$ is the restriction of ${\mathbf{X}: \Omega \rightarrow R}$ to ${E}$. This random variable ${(\mathbf{X}|E)}$ lives on a different probability space than ${\mathbf{X}}$ itself, so it does not make sense to directly combine these variables (thus for instance one cannot form the sum ${\mathbf{X} + (\mathbf{X}|E)}$ even when both random variables are real or complex valued); however, one can still form the Shannon entropy ${\mathbb{H}( \mathbf{X}|E )}$ of the conditioned random variable ${(\mathbf{X}|E)}$, which is given by the same formula

$\displaystyle \mathbb{H}(\mathbf{X}|E) := \sum_{x \in R} {\mathbb P}(\mathbf{X}=x|E) \log \frac{1}{{\mathbb P}(\mathbf{X}=x|E)}.$

Given another random variable ${\mathbf{Y}: \Omega \rightarrow S}$ taking values in another finite set ${S}$, we can then define the conditional Shannon entropy ${\mathbb{H}(\mathbf{X}|\mathbf{Y})}$ to be the expected entropy of the level sets ${\mathbb{H}(\mathbf{X}|\mathbf{Y}=y)}$, thus

$\displaystyle \mathbb{H}(\mathbf{X}|\mathbf{Y}) := \sum_{y \in S} {\mathbb P}( \mathbf{Y} = y ) \mathbb{H}(\mathbf{X}|\mathbf{Y}=y)$

with the convention that the summand here vanishes when ${{\mathbb P}( \mathbf{Y} = y ) = 0}$. From the law of total probability we have

$\displaystyle {\mathbb P}( \mathbf{X}=x ) = \sum_{y \in S} {\mathbb P}( \mathbf{Y} = y ) {\mathbb P}(\mathbf{X}=x|\mathbf{Y}=y)$

for any ${x \in R}$, and hence by Jensen’s inequality

$\displaystyle {\mathbb P}( \mathbf{X}=x ) \log \frac{1}{{\mathbb P}( \mathbf{X}=x )} \geq \sum_{y \in S} {\mathbb P}( \mathbf{Y} = y ) {\mathbb P}(\mathbf{X}=x|\mathbf{Y}=y) \log \frac{1}{{\mathbb P}(\mathbf{X}=x|\mathbf{Y}=y)}$

for any ${x \in R}$; summing we obtain the Shannon entropy inequality

$\displaystyle \mathbb{H}(\mathbf{X}|\mathbf{Y}) \leq \mathbb{H}(\mathbf{X}). \ \ \ \ \ (33)$

Informally, this inequality asserts that the new information content of ${\mathbf{X}}$ can be decreased, but not increased, if one is first told some additional information ${\mathbf{Y}}$.
This inequality (33) can be rewritten in several ways:

Exercise 20 Let ${\mathbf{X}: \Omega \rightarrow R}$, ${\mathbf{Y}: \Omega \rightarrow S}$ be random variables taking values in finite sets ${R,S}$ respectively.

• (i) Establish the chain rule

$\displaystyle \mathbb{H}(\mathbf{X}, \mathbf{Y}) = \mathbb{H}(\mathbf{Y}) + \mathbb{H}(\mathbf{X}|\mathbf{Y}) \ \ \ \ \ (34)$

where ${(\mathbf{X}, \mathbf{Y}): \Omega \rightarrow R \times S}$ is the joint random variable ${(\mathbf{X}, \mathbf{Y})(\omega) := (\mathbf{X}(\omega), \mathbf{Y}(\omega))}$. In particular, (33) can be expressed as a subadditivity formula

$\displaystyle \mathbb{H}(\mathbf{X}, \mathbf{Y}) \leq \mathbb{H}(\mathbf{X}) + \mathbb{H}(\mathbf{Y}). \ \ \ \ \ (35)$

Show that equality occurs if and only if ${\mathbf{X}, \mathbf{Y}}$ are independent.

• (ii) If ${\mathbf{Y}}$ is a function of ${\mathbf{X}}$, in the sense that ${\mathbf{Y} = f(\mathbf{X})}$ for some (deterministic) function ${f: R \rightarrow S}$, show that ${\mathbb{H}( \mathbf{Y} ) \leq \mathbb{H}( \mathbf{X}, \mathbf{Y} ) = \mathbb{H}( \mathbf{X} )}$.
• (iii) Define the mutual information ${\mathbb{I}( \mathbf{X}: \mathbf{Y} )}$ by the formula

$\displaystyle \mathbb{I}( \mathbf{X}:\mathbf{Y} ) := \mathbb{H}(\mathbf{X}) + \mathbb{H}(\mathbf{Y}) - \mathbb{H}( \mathbf{X}, \mathbf{Y} )$

$\displaystyle = \mathbb{H}(\mathbf{X}) - \mathbb{H}(\mathbf{X}|\mathbf{Y})$

$\displaystyle = \mathbb{H}(\mathbf{Y}) - \mathbb{H}(\mathbf{Y}|\mathbf{X}).$

Establish the inequalities

$\displaystyle 0 \leq \mathbb{I}( \mathbf{X}:\mathbf{Y} ) \leq \mathbb{H}(\mathbf{X}), \mathbb{H}(\mathbf{Y}),$

with the first inequality holding with equality if and only if ${\mathbf{X}, \mathbf{Y}}$ are independent, and the latter inequalities holding if and only if ${\mathbf{X}}$ is a function of ${\mathbf{Y}}$ (or vice versa).

From the above exercise we see that the mutual information ${\mathbb{I}( \mathbf{X}: \mathbf{Y} )}$ is a measure of dependence between ${\mathbf{X}}$ and ${\mathbf{Y}}$, much as correlation or covariance were in the previous sections. There is however one key difference: whereas a zero correlation or covariance is a consequence but not a guarantee of independence, zero mutual information is logically equivalent to independence, and is thus a stronger property. To put it another way, zero correlation or covariance allows one to calculate the average ${\mathbb{E}( \mathbf{X} \mathbf{Y} )}$ in terms of individual averages of ${\mathbf{X}, \mathbf{Y}}$, but zero mutual information is stronger because it allows one to calculate the more general averages ${\mathbb{E}( f(\mathbf{X}) g(\mathbf{Y}) )}$ in terms of individual averages of ${\mathbf{X}, \mathbf{Y}}$, for arbitrary functions ${f,g}$ taking values in the complex numbers. This increased power of the mutual information statistic will allow us to estimate various averages of interest in analytic number theory in ways that do not seem amenable to second moment methods.
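
To make the definition concrete, here is a small Python sketch (illustrative only, helper names mine) that computes the mutual information from a joint distribution and exhibits the two extreme cases: an independent pair, where the mutual information vanishes, and a perfectly correlated pair, where it equals the full entropy ${\log 2}$.

```python
import math

def entropy(probs):
    # Shannon entropy in nats of an iterable of probabilities
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

def mutual_information(joint):
    # joint: dict {(x, y): probability} for the pair (X, Y);
    # I(X:Y) = H(X) + H(Y) - H(X, Y)
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
correlated = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(independent))   # 0: zero mutual information
print(mutual_information(correlated))    # log 2: Y determines X
```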

The subadditivity formula (35) can be conditioned to any event ${E}$ occurring with positive probability (replacing the random variables ${\mathbf{X}, \mathbf{Y}}$ by their conditioned counterparts ${({\mathbf X}|E), (\mathbf{Y}|E)}$), yielding the inequality

$\displaystyle \mathbb{H}(\mathbf{X}, \mathbf{Y}|E) \leq \mathbb{H}(\mathbf{X}|E) + \mathbb{H}(\mathbf{Y}|E).$

Applying this inequality to the level events ${\mathbf{Z}=z}$ of some auxiliary random variable ${\mathbf{Z}: \Omega \rightarrow T}$ taking values in another finite set ${T}$, multiplying by ${{\mathbb P}( \mathbf{Z}=z)}$, and summing, we conclude the inequality

$\displaystyle \mathbb{H}(\mathbf{X}, \mathbf{Y}|{\mathbf Z}) \leq \mathbb{H}(\mathbf{X}|\mathbf{Z}) + \mathbb{H}(\mathbf{Y}|\mathbf{Z}).$

In other words, the conditional mutual information

$\displaystyle \mathbb{I}( \mathbf{X}: \mathbf{Y}|{\mathbf Z} ) := \mathbb{H}(\mathbf{X}|\mathbf{Z}) + \mathbb{H}(\mathbf{Y}|\mathbf{Z})-\mathbb{H}(\mathbf{X}, \mathbf{Y}|{\mathbf Z})$

between ${\mathbf{X}}$ and ${\mathbf{Y}}$ conditioned on ${{\mathbf Z}}$ is always non-negative:

$\displaystyle \mathbb{I}( \mathbf{X}: \mathbf{Y}|{\mathbf Z} ) \geq 0. \ \ \ \ \ (36)$

One has conditional analogues of the above exercise:

Exercise 21 Let ${\mathbf{X}: \Omega \rightarrow R}$, ${\mathbf{Y}: \Omega \rightarrow S}$, ${\mathbf{Z}: \Omega \rightarrow T}$ be random variables taking values in finite sets ${R,S,T}$ respectively.

• (i) Establish the conditional chain rule

$\displaystyle \mathbb{H}(\mathbf{X}, \mathbf{Y}|{\mathbf Z}) = \mathbb{H}(\mathbf{X}|{\mathbf Z}) + \mathbb{H}(\mathbf{Y}|\mathbf{X}, {\mathbf Z}) \ \ \ \ \ (37)$

and show that

$\displaystyle \mathbb{I}( \mathbf{X}:\mathbf{Y} | \mathbf{Z}) = \mathbb{H}(\mathbf{X}|\mathbf{Z}) - \mathbb{H}(\mathbf{X}|\mathbf{Y}, \mathbf{Z}) \ \ \ \ \ (38)$

$\displaystyle = \mathbb{H}(\mathbf{Y}|\mathbf{Z}) - \mathbb{H}(\mathbf{Y}|\mathbf{X}, \mathbf{Z}).$

In particular, (36) is equivalent to the inequality

$\displaystyle \mathbb{H}(\mathbf{X}|\mathbf{Y}, \mathbf{Z}) \leq \mathbb{H}(\mathbf{X}|\mathbf{Z}). \ \ \ \ \ (39)$

• (ii) Show that equality holds in (36) if and only if ${\mathbf{X}, \mathbf{Y}}$ are conditionally independent relative to ${\mathbf{Z}}$, which means that

$\displaystyle {\mathbb P}( \mathbf{X} \in E, \mathbf{Y} \in F | \mathbf{Z} = z ) = {\mathbb P}( \mathbf{X} \in E|\mathbf{Z}=z) {\mathbb P}(\mathbf{Y} \in F | \mathbf{Z} = z )$

for any ${E \subset R}$, ${F \subset S}$, ${z \in T}$.

• (iii) Show that ${\mathbb{I}( \mathbf{X}:\mathbf{Y} | \mathbf{Z}) \leq \mathbb{H}(\mathbf{X}|\mathbf{Z})}$, with equality if and only if ${\mathbf{X}}$ is almost surely a deterministic function of ${\mathbf{Y}, \mathbf{Z}}$.
• (iv) Show the data processing inequality

$\displaystyle \mathbb{I}( f(\mathbf{X}) : g(\mathbf{Y}) ) \leq \mathbb{I}( \mathbf{X}: \mathbf{Y} )$

for any functions ${f: R \rightarrow R'}$, ${g:S \rightarrow S'}$, and more generally that

$\displaystyle \mathbb{I}( f(\mathbf{X}) : g(\mathbf{Y}) | \mathbf{Z} ) \leq \mathbb{I}( \mathbf{X} : \mathbf{Y} | \mathbf{Z} ).$

• (v) If ${h: T \rightarrow T'}$ is an injective function, show that

$\displaystyle \mathbb{I}( \mathbf{X} : \mathbf{Y} | h(\mathbf{Z}) ) = \mathbb{I}( \mathbf{X} : \mathbf{Y} | \mathbf{Z} ). \ \ \ \ \ (40)$

However, if ${h: T \rightarrow T'}$ is not assumed to be injective, show by means of examples that there is no order relation between the left and right-hand side of (40) (in other words, show that either side may be greater than the other). Thus, increasing or decreasing the amount of information that is known may influence the mutual information between two remaining random variables in either direction.

• (vi) If ${\mathbf{Z}}$ is a function of ${\mathbf{X}}$, and also a function of ${\mathbf{Y}}$ (thus ${\mathbf{Z} = f(\mathbf{X}) = g(\mathbf{Y})}$ for some ${f: R \rightarrow T}$ and ${g: S \rightarrow T}$), and a further random variable ${\mathbf{W}: \Omega \rightarrow U}$ is a function jointly of ${\mathbf{X}, \mathbf{Y}}$ (thus ${\mathbf{W} = h( \mathbf{X}, \mathbf{Y})}$ for some ${h: R \times S \rightarrow U}$), establish the submodularity inequality

$\displaystyle \mathbb{H}(\mathbf{W}) + \mathbb{H}(\mathbf{Z}) \leq \mathbb{H}(\mathbf{X}) + \mathbb{H}(\mathbf{Y}).$

We now give a key motivating application of the Shannon entropy inequalities. Suppose one has a sequence ${\mathbf{X}_1, \mathbf{X}_2, \dots}$ of random variables, all taking values in a finite set ${R}$, which are stationary in the sense that the tuples ${(\mathbf{X}_1,\dots,\mathbf{X}_n)}$ and ${(\mathbf{X}_2,\dots,\mathbf{X}_{n+1})}$ have the same distribution for every ${n}$. In particular we will have

$\displaystyle \mathbb{H}( \mathbf{X}_{n+1} | \mathbf{X}_2,\dots,\mathbf{X}_n ) = \mathbb{H}( \mathbf{X}_{n} | \mathbf{X}_1,\dots,\mathbf{X}_{n-1} )$

and hence by (39)

$\displaystyle \mathbb{H}( \mathbf{X}_{n+1} | \mathbf{X}_1,\dots,\mathbf{X}_n ) \leq \mathbb{H}( \mathbf{X}_{n} | \mathbf{X}_1,\dots,\mathbf{X}_{n-1} ).$

If we write ${h_n := \mathbb{H}(\mathbf{X}_1,\dots,\mathbf{X}_n)}$, we conclude from (34) that we have the concavity property

$\displaystyle h_{n+1} - h_n \leq h_n - h_{n-1}.$

In particular we have ${h_{n+1} - h_n \leq h_i - h_{i-1}}$ for any ${1 \leq i \leq n}$, which on summing the telescoping series (noting that ${h_0=0}$) gives

$\displaystyle n (h_{n+1} - h_n) \leq h_n$

and hence we have the entropy monotonicity

$\displaystyle \frac{h_{n+1}}{n+1} \leq \frac{h_n}{n}. \ \ \ \ \ (41)$

In particular, the limit ${\lim_{n \rightarrow\infty} \frac{h_n}{n}}$ exists. This quantity is known as the Kolmogorov-Sinai entropy of the stationary process ${\mathbf{X}_1, \mathbf{X}_2, \dots}$; it is an important statistic in the theory of dynamical systems, and roughly speaking measures the amount of entropy produced by this process as a function of a discrete time variable ${n}$. We will not directly need the Kolmogorov-Sinai entropy in our notes, but a variant of the entropy monotonicity formula (41) will be important shortly.
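
The monotonicity (41) can be observed numerically for a concrete stationary process, such as a two-state Markov chain started at its stationary distribution; the Python sketch below (illustrative, with my own helper names) computes the joint entropies ${h_n}$ by direct enumeration of paths.

```python
import itertools
import math

def entropy(probs):
    # Shannon entropy in nats of an iterable of probabilities
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

def joint_entropies(P, pi, N):
    # h_n = H(X_1,...,X_n) for a stationary Markov chain with
    # transition matrix P and stationary initial distribution pi
    states = range(len(pi))
    hs = []
    for n in range(1, N + 1):
        probs = []
        for path in itertools.product(states, repeat=n):
            p = pi[path[0]]
            for a, b in zip(path, path[1:]):
                p *= P[a][b]
            probs.append(p)
        hs.append(entropy(probs))
    return hs

# two-state chain; pi = (0.75, 0.25) is stationary for this P
P = [[0.9, 0.1], [0.3, 0.7]]
hs = joint_entropies(P, [0.75, 0.25], 6)
print([hs[n] / (n + 1) for n in range(6)])   # non-increasing, as (41) asserts
```
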
In our application we will be dealing with processes that are only asymptotically stationary rather than stationary. To control this we recall the notion of the total variation distance ${d_{TV}(\mathbf{X}, \mathbf{Y})}$ between two random variables ${\mathbf{X}, \mathbf{Y}}$ taking values in the same finite space ${R}$, defined by

$\displaystyle d_{TV}(\mathbf{X}, \mathbf{Y}) := \sum_{r \in R} | {\mathbb P}( \mathbf{X} = r ) - {\mathbb P}( \mathbf{Y} = r ) |.$

There is an essentially equivalent notion of this distance which is also often in use:

Exercise 22 If two random variables ${\mathbf{X}, \mathbf{Y}}$ take values in the same finite space ${R}$, establish the inequalities

$\displaystyle \frac{1}{2} d_{TV}(\mathbf{X}, \mathbf{Y}) \leq \sup_{E \subset R} |{\mathbb P}( \mathbf{X} \in E ) - {\mathbb P}( \mathbf{Y} \in E )| \leq d_{TV}(\mathbf{X}, \mathbf{Y})$

and for any ${f: R \rightarrow {\bf C}}$, establish the inequality

$\displaystyle |{\mathbb E} f(\mathbf{X}) - {\mathbb E} f(\mathbf{Y})| \leq d_{TV}(\mathbf{X}, \mathbf{Y}) \sup_{r \in R} |f(r)|. \ \ \ \ \ (42)$

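Both inequalities of Exercise 22, as well as (42), are easy to confirm numerically by brute force over all events ${E \subset R}$ (a sketch only; the five-element alphabet and random distributions are arbitrary choices):

```python
import itertools
import math
import random

random.seed(1)  # arbitrary seed, for reproducibility
R = list(range(5))  # an arbitrary five-element alphabet

def rand_dist():
    w = [random.random() for _ in R]
    s = sum(w)
    return [x / s for x in w]

p, q = rand_dist(), rand_dist()
# d_TV as defined in the notes: the full l^1 distance between the two laws
dtv = sum(abs(pr - qr) for pr, qr in zip(p, q))

# sup over events E of |P(X in E) - P(Y in E)|, enumerating all subsets of R
sup_event = max(
    abs(sum(p[r] for r in E) - sum(q[r] for r in E))
    for k in range(len(R) + 1)
    for E in itertools.combinations(R, k)
)
assert dtv / 2 <= sup_event + 1e-12 and sup_event <= dtv + 1e-12
# the supremum is attained at E = {r : p_r > q_r}, giving exactly dtv / 2
assert abs(sup_event - dtv / 2) < 1e-12

# inequality (42): |E f(X) - E f(Y)| <= d_TV(X, Y) * sup |f|
f = [random.uniform(-1, 1) for _ in R]
lhs = abs(sum(fr * (pr - qr) for fr, pr, qr in zip(f, p, q)))
assert lhs <= dtv * max(abs(fr) for fr in f) + 1e-12
print(dtv, sup_event)
```

The computation also illustrates why the left inequality of the exercise is sharp: the supremum over events is always exactly half of ${d_{TV}}$, attained at ${E = \{r: {\mathbb P}(\mathbf{X}=r) > {\mathbb P}(\mathbf{Y}=r)\}}$.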
Shannon entropy is continuous in total variation distance as long as we keep the range finite. More quantitatively, we have

Lemma 23 If two random variables ${\mathbf{X}, \mathbf{Y}}$ take values in the same finite space ${R}$, then

$\displaystyle \mathbb{H}(\mathbf{X}) = \mathbb{H}(\mathbf{Y}) + O( d_{TV}(\mathbf{X}, \mathbf{Y}) \log(\# R (1 +\frac{1}{d_{TV}(\mathbf{X}, \mathbf{Y})})) )$

with the convention that the error term vanishes when ${d_{TV}(\mathbf{X}, \mathbf{Y}) = 0}$.

Proof: Set ${\varepsilon := d_{TV}(\mathbf{X}, \mathbf{Y})}$. The claim is trivial when ${\varepsilon=0}$ (since then ${\mathbf{X}, \mathbf{Y}}$ have the same distribution) and when ${\varepsilon \geq 1/2}$ (from (32)), so let us assume ${0 < \varepsilon < \frac{1}{2}}$, and our task is to show that

$\displaystyle \mathbb{H}(\mathbf{X}) = \mathbb{H}(\mathbf{Y}) + O( \varepsilon \log \# R + \varepsilon \log \frac{1}{\varepsilon} ).$

If we write ${h(p) := p \log \frac{1}{p}}$, ${p_r := {\mathbb P}( \mathbf{X} = r )}$, and ${q_r := {\mathbb P}( \mathbf{Y} = r )}$, then

$\displaystyle \mathbb{H}(\mathbf{X}) = \mathbb{H}(\mathbf{Y}) + \sum_{r \in R} h(q_r) - h(p_r).$

By dividing into the cases ${|p_r-q_r| \leq \frac{1}{2} q_r}$ and ${|p_r-q_r| > \frac{1}{2} q_r}$ we see that

$\displaystyle h(q_r) - h(p_r) \ll |p_r - q_r| + h( |p_r - q_r| )$

Since ${\varepsilon = \sum_{r \in R} |p_r - q_r|}$, it thus suffices to show that

$\displaystyle \sum_{r \in R} h( |p_r - q_r| ) \ll \varepsilon \log \# R + \varepsilon \log \frac{1}{\varepsilon}.$

But from Jensen’s inequality (32) one has

$\displaystyle \sum_{r \in R} h( \frac{|p_r - q_r|}{\varepsilon} ) \leq \log \# R;$

since ${h( |p_r - q_r| ) = \varepsilon h( \frac{|p_r - q_r|}{\varepsilon} ) + |p_r-q_r| \log \frac{1}{\varepsilon}}$, the claim follows. $\Box$
In the converse direction, if a random variable has entropy close to the maximum ${\log \# R}$, then one can control the total variation:

Lemma 24 (Special case of Pinsker inequality) If ${\mathbf{X}}$ takes values in a finite set ${R}$, and ${\mathbf{U}}$ is a uniformly distributed random variable on ${R}$, then

$\displaystyle d_{TV}( \mathbf{X}, \mathbf{U} ) \ll \left( \mathbb{H}( \mathbf{U} ) - \mathbb{H}( \mathbf{X} ) \right)^{1/2}.$

Of course, we have ${\mathbb{H}(\mathbf{U}) = \log \# R}$, so we may also write the above inequality as

$\displaystyle d_{TV}( \mathbf{X}, \mathbf{U} ) \ll \left( \log \# R - \mathbb{H}( \mathbf{X} ) \right)^{1/2}.$

The optimal value of the implied constant here is known to equal ${\sqrt{2}}$, but we will not use this sharp version of the inequality here.
Proof: If we write ${p_r := {\mathbb P}( \mathbf{X} = r )}$ and ${q_r := {\mathbb P}( \mathbf{U} = r ) = 1/\# R}$, and ${h(p) := p \log \frac{1}{p}}$, then we can rewrite the claimed inequality as

$\displaystyle \sum_{r \in R} |p_r - q_r| \ll \left( \sum_{r \in R} h(q_r) - h(p_r) \right)^{1/2}.$

Observe that the function ${h}$ is concave, and in fact ${h''(p) = - \frac{1}{p}}$ for all ${0 < p \leq 1}$. From this and Taylor expansion with remainder we may write

$\displaystyle h(q_r) - h(p_r) = h'(q_r) (q_r - p_r) + \frac{1}{2} \frac{|p_r - q_r|^2}{s_r}$

for some ${s_r}$ between ${p_r}$ and ${q_r}$. Since ${h'(q_r)}$ is independent of ${r}$, and ${\sum_{r \in R} q_r = \sum_{r \in R} p_r = 1}$, we thus have on summing in ${r}$

$\displaystyle \sum_{r \in R} h(q_r) - h(p_r) = \frac{1}{2} \sum_{r \in R} \frac{|p_r - q_r|^2}{s_r}.$

By Cauchy-Schwarz we then have

$\displaystyle \sum_{r \in R} |p_r - q_r| \ll \left( \sum_{r \in R} h(q_r) - h(p_r) \right)^{1/2} \left( \sum_{r \in R} s_r \right)^{1/2}.$

Since ${s_r = O( p_r + q_r)}$ and ${\sum_{r \in R} q_r = \sum_{r \in R} p_r = 1}$, the claim follows. $\Box$
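Lemma 24 can also be tested numerically, using the sharp constant ${\sqrt{2}}$ mentioned above (a sketch; the alphabet size and number of trials are arbitrary choices, and entropies are in nats):

```python
import math
import random

random.seed(2)  # arbitrary seed, for reproducibility
n = 8  # arbitrary alphabet size
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    p = [x / s for x in w]
    # d_TV(X, U) with U uniform on the n-element set
    dtv = sum(abs(pr - 1.0 / n) for pr in p)
    # entropy gap log #R - H(X); this is always nonnegative
    gap = math.log(n) + sum(pr * math.log(pr) for pr in p if pr > 0)
    assert gap >= -1e-12
    # Lemma 24 with the sharp constant sqrt(2)
    assert dtv <= math.sqrt(2 * max(gap, 0.0)) + 1e-9
print("Pinsker-type inequality verified on 1000 random laws")
```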
The above lemma does not hold when the comparison variable ${\mathbf{U}}$ is not assumed to be uniform; in particular, two non-uniform random variables can have precisely the same entropy yet have different distributions, so that their total variation distance is positive. There is a more general variant, known as the Pinsker inequality, which we will not use in these notes:

Exercise 25 (Pinsker inequality) If ${\mathbf{X}, \mathbf{Y}}$ take values in a finite set ${R}$, define the Kullback-Leibler divergence of ${\mathbf{X}}$ relative to ${\mathbf{Y}}$ by the formula

$\displaystyle d_{KL}( \mathbf{X} || \mathbf{Y} ) := \sum_{r \in R} {\mathbb P}( \mathbf{X} = r ) \log \frac{{\mathbb P}( \mathbf{X} = r )}{{\mathbb P}( \mathbf{Y} = r )}$

(with the convention that the summand vanishes when ${{\mathbb P}( \mathbf{X} = r )}$ vanishes).

• (i) Establish the Gibbs inequality ${0 \leq d_{KL}(\mathbf{X} || \mathbf{Y}) \leq +\infty}$.
• (ii) Establish the Pinsker inequality

$\displaystyle d_{TV}( \mathbf{X}, \mathbf{Y} ) \ll d_{KL}( \mathbf{X} || \mathbf{Y} )^{1/2}.$

In particular, ${d_{KL}( \mathbf{X} || \mathbf{Y} )}$ vanishes if and only if ${\mathbf{X}, \mathbf{Y}}$ have identical distribution. Show that this implies Lemma 24 as a special case.

• (iii) Give an example to show that the Kullback-Leibler divergence need not be symmetric, thus there exist ${\mathbf{X},\mathbf{Y}}$ such that ${d_{KL}( \mathbf{X} || \mathbf{Y} ) \neq d_{KL}( \mathbf{Y} || \mathbf{X} )}$.
• (iv) If ${\mathbf{X}_1, \mathbf{X}_2}$ are random variables taking values in finite sets ${R_1,R_2}$, and ${\mathbf{X}'_1, \mathbf{X}'_2}$ are independent random variables taking values in ${R_1,R_2}$ respectively with each ${\mathbf{X}'_i}$ having the same distribution as ${\mathbf{X}_i}$, show that

$\displaystyle \mathbb{I}( \mathbf{X}_1 : \mathbf{X}_2 ) = d_{KL}( (\mathbf{X}_1,\mathbf{X}_2) || (\mathbf{X}'_1,\mathbf{X}'_2) ).$

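The identity in (iv) is straightforward to confirm numerically (an illustrative sketch; the ${3 \times 2}$ alphabet and random joint law are arbitrary choices, and entropies are in nats):

```python
import itertools
import math
import random

random.seed(3)  # arbitrary seed, for reproducibility
R1, R2 = [0, 1, 2], [0, 1]  # arbitrary finite alphabets
outcomes = list(itertools.product(R1, R2))
w = [random.random() for _ in outcomes]
s = sum(w)
joint = {o: x / s for o, x in zip(outcomes, w)}  # law of (X_1, X_2)
p1 = {a: sum(joint[(a, b)] for b in R2) for a in R1}  # marginal of X_1
p2 = {b: sum(joint[(a, b)] for a in R1) for b in R2}  # marginal of X_2

def H(dist):
    # Shannon entropy in nats
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# mutual information I(X_1 : X_2) = H(X_1) + H(X_2) - H(X_1, X_2)
mi = H(p1) + H(p2) - H(joint)
# KL divergence of the joint law relative to the product of the marginals,
# i.e. relative to the law of the independent copy (X'_1, X'_2)
kl = sum(p * math.log(p / (p1[a] * p2[b])) for (a, b), p in joint.items() if p > 0)
assert abs(mi - kl) < 1e-12
print(mi)
```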
In our applications we will need a relative version of Lemma 24:

Corollary 26 (Relative Pinsker inequality) If ${\mathbf{X}}$ takes values in a finite set ${R}$, ${\mathbf{Z}}$ takes values in a finite set ${T}$, and ${\mathbf{U}}$ is a uniformly distributed random variable on ${R}$ that is independent of ${\mathbf{Z}}$, then

$\displaystyle d_{TV}( (\mathbf{X},\mathbf{Z}), (\mathbf{U},\mathbf{Z}) ) \ll \left( \mathbb{H}( \mathbf{U} ) - \mathbb{H}( \mathbf{X} | \mathbf{Z} ) \right)^{1/2}.$

Proof: From direct calculation we have the identity

$\displaystyle d_{TV}( (\mathbf{X},\mathbf{Z}), (\mathbf{U},\mathbf{Z}) ) = \sum_{t \in T} d_{TV}( (\mathbf{X}|\mathbf{Z}=t), (\mathbf{U}|\mathbf{Z}=t) ) {\mathbb P}(\mathbf{Z} = t ).$

As ${\mathbf{U}}$ is independent of ${\mathbf{Z}}$, ${(\mathbf{U}|\mathbf{Z}=t)}$ is uniformly distributed on ${R}$. From Lemma 24 we conclude

$\displaystyle d_{TV}( (\mathbf{X}|\mathbf{Z}=t), (\mathbf{U}|\mathbf{Z}=t) ) \ll \left( \mathbb{H}( \mathbf{U} ) - \mathbb{H}( \mathbf{X} | \mathbf{Z}=t ) \right)^{1/2}.$

Inserting this bound and using the Cauchy-Schwarz inequality, we obtain the claim. $\Box$
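As with Lemma 24, Corollary 26 can be sanity-checked numerically. Note that the Cauchy-Schwarz step in the proof preserves the sharp constant ${\sqrt{2}}$ inherited from Lemma 24, which is what the sketch below tests (alphabet sizes and the random joint law are arbitrary choices):

```python
import itertools
import math
import random

random.seed(4)  # arbitrary seed, for reproducibility
R, T = [0, 1, 2, 3], [0, 1, 2]  # arbitrary finite alphabets
outcomes = list(itertools.product(R, T))
w = [random.random() for _ in outcomes]
s = sum(w)
joint = {o: x / s for o, x in zip(outcomes, w)}  # law of (X, Z)
pZ = {t: sum(joint[(r, t)] for r in R) for t in T}
# law of (U, Z), with U uniform on R and independent of Z
prod = {(r, t): pZ[t] / len(R) for r in R for t in T}
dtv = sum(abs(joint[o] - prod[o]) for o in outcomes)

# conditional entropy H(X | Z) in nats
HXgZ = 0.0
for t in T:
    cond = [joint[(r, t)] / pZ[t] for r in R]
    HXgZ -= pZ[t] * sum(c * math.log(c) for c in cond if c > 0)
# relative Pinsker inequality, with the constant sqrt(2) from Lemma 24
assert dtv <= math.sqrt(2 * (math.log(len(R)) - HXgZ)) + 1e-9
print(dtv)
```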
Now we are ready to apply the above machinery to give a key inequality that is analogous to Elliott’s inequality. Inequalities of this type first appeared in one of my papers, introducing what I called the “entropy decrement argument”; the following arrangement of the inequality and proof is due to Redmond McNamara (personal communication).

Theorem 27 (Entropy decrement inequality) Let ${\mathbf{n}}$ be a random variable taking values in a finite set of integers, which obeys the approximate stationarity

$\displaystyle d_{TV}(\mathbf{n}, \mathbf{n}+1) \leq \delta \ \ \ \ \ (43)$

for some ${0 < \delta \leq \frac{1}{2}}$. Let ${p_1,\dots,p_m}$ be a collection of distinct primes less than some threshold ${P}$, and let ${1 = \ell_0 \leq \ell_1 \leq \dots \leq \ell_m \leq P}$ be natural numbers (in particular, all bounded by ${P}$). Let ${f: {\bf Z} \rightarrow R}$ be a function taking values in a finite set ${R}$. For ${i=1,\dots,m}$, let ${\mathbf{X}_i}$ denote the ${R^{\ell_i}}$-valued random variable

$\displaystyle \mathbf{X}_i := (f(\mathbf{n}+1), \dots, f(\mathbf{n}+\ell_i)) \ \ \ \ \ (44)$

and let ${\mathbf{Y}_i}$ denote the ${{\bf Z}/p_i{\bf Z}}$-valued random variable

$\displaystyle \mathbf{Y}_i := \mathbf{n} \hbox{ mod } p_i.$

Also, let ${\mathbf{U}_i}$ be a random variable drawn uniformly from ${{\bf Z}/p_i{\bf Z}}$, independently of ${\mathbf{n}}$. Then

$\displaystyle \sum_{i=1}^m \frac{d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 }{\ell_i} \ll \log \# R$

$\displaystyle + \delta \exp(O(P)) \log \frac{\# R}{\delta}. \ \ \ \ \ (45)$

The ${\exp(O(P))}$ factor (arising from an invocation of the Chinese remainder theorem in the proof) unfortunately restricts the usefulness of this theorem to the regime in which all the primes involved are of “sub-logarithmic size”, but once one is in that regime, the second term on the right-hand side of (45) tends to be negligible in practice. Informally, this theorem asserts that for most small primes ${p_i \leq P}$, the random variables ${\mathbf{X}_i}$ and ${\mathbf{Y}_i}$ behave as if they are independent of each other.

Proof: We can assume ${\# R \geq 2}$, as the claim is trivial for ${\# R = 1}$ (the ${\mathbf{X}_i}$ all have zero entropy). For ${i=0,\dots,m}$, we introduce the ${\prod_{j \leq i} {\bf Z}/p_j{\bf Z}}$-valued random variable

$\displaystyle \mathbf{Y}_{\leq i} := (\mathbf{Y}_j)_{j \leq i}.$

The idea is to exploit some monotonicity properties of the quantity ${\frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i} )}{\ell_i}}$, in analogy with (41). By telescoping series we have

$\displaystyle \sum_{i=1}^m \frac{\mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} )}{\ell_{i-1}} - \frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i} )}{\ell_i} \leq \mathbb{H}( \mathbf{X}_0 )$

$\displaystyle \leq \log \# R$

where we extend (44) to the ${i=0}$ case. From (38) we have

$\displaystyle \mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i} ) = \mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} ) - \mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )$

and thus

$\displaystyle \sum_{i=1}^m \frac{\mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i} \ \ \ \ \ (46)$

$\displaystyle \leq \log \# R - \sum_{i=1}^m \frac{\mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} )}{\ell_{i-1}} - \frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i}.$

Now we lower bound the summand on the right-hand side. From multiple applications of the conditional chain rule (37) we have

$\displaystyle \mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} ) = \sum_{k=1}^{\ell_i} c_{k,i} \ \ \ \ \ (47)$

and

$\displaystyle \mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} ) = \sum_{k=1}^{\ell_{i-1}} c_{k,i} \ \ \ \ \ (48)$

where

$\displaystyle c_{k,i} := \mathbb{H}( f(\mathbf{n}+k) | \mathbf{Y}_{\leq i-1}, f(\mathbf{n}+1),\dots,f(\mathbf{n}+k-1) ).$

We now use the approximate stationarity of ${\mathbf{n}}$ to derive an approximate monotonicity property for ${c_{k,i}}$. If ${k>1}$, then from (39) we have

$\displaystyle c_{k,i} \leq \mathbb{H}( f(\mathbf{n}+k) | \mathbf{Y}_{\leq i-1}, f(\mathbf{n}+2),\dots,f(\mathbf{n}+k-1) ).$

Write ${\mathbf{n}' := \mathbf{n} + 1}$ and

$\displaystyle \mathbf{Y}'_{\leq i-1} := (\mathbf{n}' \hbox{ mod } p_j)_{j \leq i-1}.$

Note that ${\mathbf{Y}'_{\leq i-1}}$ is a deterministic function of ${\mathbf{Y}_{\leq i-1}}$ and vice versa. Thus we can replace ${\mathbf{Y}_{\leq i-1}}$ by ${\mathbf{Y}'_{\leq i-1}}$ in the above formula, and conclude that

$\displaystyle c_{k,i} \leq \mathbb{H}( f(\mathbf{n}'+k-1) | \mathbf{Y}'_{\leq i-1}, f(\mathbf{n}'+1),\dots,f(\mathbf{n}'+k-2) ).$

The tuple ${(f(\mathbf{n}'+1),\dots,f(\mathbf{n}'+k-1), \mathbf{Y}'_{\leq i-1})}$ takes values in a set of cardinality ${(O(\# R))^P}$ thanks to the Chebyshev bounds. Hence by two applications of Lemma 23 and (43), we have

$\displaystyle c_{k,i} \leq \mathbb{H}( f(\mathbf{n}+k-1) | \mathbf{Y}_{\leq i-1}, f(\mathbf{n}+1),\dots,f(\mathbf{n}+k-2) ) + O( \delta \log \frac{1}{\delta} + \delta P \log \# R ).$

The first term on the right-hand side is ${c_{k-1,i}}$. Worsening the error term slightly, we conclude that

$\displaystyle c_{k,i} \leq c_{k-1,i} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} )$

and hence

$\displaystyle c_{k,i} \leq c_{j,i} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} )$

for any ${1 \leq j \leq k}$. In particular

$\displaystyle \sum_{k=\ell_{i-1}+1}^{\ell_i} c_{k,i} \leq \frac{\ell_i - \ell_{i-1}}{\ell_{i-1}} \sum_{k=1}^{\ell_{i-1}} c_{k,i} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} )$

which by (47), (48) rearranges to

$\displaystyle \frac{\mathbb{H}( \mathbf{X}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i} \leq \frac{\mathbb{H}( \mathbf{X}_{i-1} | \mathbf{Y}_{\leq i-1} )}{\ell_{i-1}} + O( \delta P^{O(1)} \log \frac{\# R}{\delta} ).$

From (46) we conclude that

$\displaystyle \sum_{i=1}^m \frac{\mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )}{\ell_i} \ll \log \# R + \delta P^{O(1)} \log \frac{\# R}{\delta}.$

Meanwhile, from Corollary 26, (39), (38) we have

$\displaystyle d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 \ll \mathbb{H}( \mathbf{U}_i ) - \mathbb{H}( \mathbf{Y}_i | \mathbf{X}_i )$

$\displaystyle \ll \mathbb{H}( \mathbf{U}_i ) - \mathbb{H}( \mathbf{Y}_i | \mathbf{X}_i, \mathbf{Y}_{\leq i-1} )$

$\displaystyle = \mathbb{H}( \mathbf{U}_i ) - \mathbb{H}( \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} ) + \mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} )$

$\displaystyle = \mathbb{H}( \mathbf{U}_i ) + \mathbb{H}(\mathbf{Y}_{\leq i-1}) - \mathbb{H}( \mathbf{Y}_{\leq i} ) + \mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} ).$

The probability distribution of ${\mathbf{Y}_{\leq i}}$ is a function on ${\prod_{j \leq i} {\bf Z}/p_j{\bf Z}}$, which by the Chinese remainder theorem we can identify with a cyclic group ${{\bf Z}/Q_i{\bf Z}}$ where ${Q_i := \prod_{j \leq i} p_j}$. From (43) we see that the value of this distribution at adjacent values ${n,n+1}$ of this cyclic group varies by ${O(\delta)}$, hence the total variation distance between this random variable and the uniform distribution on ${\prod_{j \leq i} {\bf Z}/p_j{\bf Z}}$ is ${O( \delta Q_i^{O(1)} ) = O( \delta \exp(O(P)) )}$ by Chebyshev bounds. By Lemma 23 we then have

$\displaystyle \mathbb{H}( \mathbf{Y}_{\leq i} ) = \log Q_i + O( \delta \exp(O(P)) \log \frac{1}{\delta} )$

and thus

$\displaystyle d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 \leq \mathbb{I}( \mathbf{X}_{i} : \mathbf{Y}_i | \mathbf{Y}_{\leq i-1} ) + O(\delta \exp(O(P)) \log \frac{1}{\delta} ).$

The claim follows. $\Box$
We now compare this result to Elliott’s inequality. If one tries to address precisely the same question that Elliott’s inequality does – namely, to try to compare a sum ${\frac{1}{|I|} \sum_{n \in I} f(n)}$ with sampled subsums ${\frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n}}$ – then the results are quantitatively much weaker:

Corollary 28 (Weak Elliott inequality) Let ${I \subset {\bf R}}$ be an interval of length at least ${1}$. Let ${f:{\bf Z} \rightarrow {\bf C}}$ be a function with ${|f(n)| \leq 1}$ for all ${n}$, and let ${0 < \varepsilon < \frac{1}{2}}$. Then one has

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n) = \frac{1}{|I|} \sum_{n \in I} f(n) p 1_{p|n} + O(\varepsilon)$

for all primes ${p \leq \log |I|}$ outside of an exceptional set of primes of logarithmic size ${O( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}$.

Comparing this with Exercise 8 we see that we cover a much smaller range of primes ${p}$; also the size of the exceptional set is slightly worse. This version of Elliott’s inequality is however still strong enough to recover a proof of the prime number theorem as in the previous section.

Proof: We can assume that ${\varepsilon}$ is small, as the claim is trivial for ${\varepsilon}$ comparable to ${1}$. We can also assume that

$\displaystyle \log\log\log |I| \geq \varepsilon^{-2} \log \frac{1}{\varepsilon} \ \ \ \ \ (49)$

since the claim is also trivial otherwise (just make all primes up to ${\log |I|}$ exceptional, then use Mertens’ theorem). As a consequence of this, any quantity involving ${|I|}$ in the denominator will end up being completely negligible in practice. We can also restrict attention to primes less than (say) ${\log^{1/2} |I|}$, since the remaining primes between ${\log^{1/2} |I|}$ and ${\log |I|}$ have logarithmic size ${O(1)}$.
By rounding the real and imaginary parts of ${f}$ to the nearest multiple of ${\varepsilon}$, we may assume that ${f}$ takes values in some finite set ${R}$ of complex numbers of size ${O(1)}$ with cardinality ${O(\varepsilon^{-2})}$. Let ${\mathbf{n}}$ be drawn uniformly at random from ${I}$. Then (43) holds with ${\delta = O(1/|I|)}$, and from Theorem 27 with ${\ell_i = p_i}$ and ${P = \log^{1/2} |I|}$ (which makes the second term of the right-hand side of (45) negligible) we have

$\displaystyle \sum_{i=1}^m \frac{d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) )^2 }{p_i} \ll \log \frac{1}{\varepsilon}$

where ${p_1,\dots,p_m}$ are the primes up to ${\log^{1/2} |I|}$, arranged in increasing order. By Markov’s inequality, we thus have

$\displaystyle d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) ) \ll \varepsilon$

for ${p_i}$ outside of a set of primes of logarithmic size ${O( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}$.
Let ${i}$ be as above. Now let ${F_i: R^{p_i} \times {\bf Z}/p_i{\bf Z} \rightarrow {\bf C}}$ be the function

$\displaystyle F_i( (x_1,\dots,x_{p_i}), a ) := \sum_{j=1}^{p_i} x_j 1_{j = -a \hbox{ mod } p_i},$

that is to say ${F_i( (x_1,\dots,x_{p_i}), a )}$ picks out the unique component ${x_j}$ of the tuple ${(x_1,\dots,x_{p_i})}$ in which ${j+a}$ is divisible by ${p_i}$. This function is bounded by ${O(1)}$, and then by (42) we have

$\displaystyle {\mathbb E} F_i( \mathbf{X}_i, \mathbf{Y}_i ) = {\mathbb E} F_i( \mathbf{X}_i, \mathbf{U}_i ) + O(\varepsilon).$

The left-hand side is equal to

$\displaystyle \frac{1}{|I|} \sum_{n \in I} \sum_{j=1}^{p_i} f(n+j) 1_{j = -n \hbox{ mod } p_i} + O(\varepsilon)$

which on switching the summations and using the large nature of ${|I|}$ can be rewritten as

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n) p_i 1_{p_i|n} + O(\varepsilon).$

Meanwhile, the right-hand side is equal to

$\displaystyle \frac{1}{|I|} \sum_{n \in I} \frac{1}{p_i} \sum_{j=1}^{p_i} f(n+j) + O(\varepsilon)$

which again by switching the summations becomes

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n) + O(\varepsilon).$

The claim follows. $\Box$
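The conclusion of Corollary 28 can be observed numerically for the Möbius function (an illustrative computation, not a proof; the cutoff ${N = 10^5}$ and the list of primes are arbitrary small choices). The script below compares the full average ${\frac{1}{N} \sum_{n \leq N} \mu(n)}$ with the sampled subsums ${\frac{1}{N} \sum_{n \leq N} \mu(n) p 1_{p|n}}$:

```python
# Linear sieve for the Mobius function up to N; the cutoff N and the list of
# primes below are arbitrary illustrative choices.
N = 10**5
mu = [0] * (N + 1)
mu[1] = 1
is_comp = [False] * (N + 1)
primes = []
for i in range(2, N + 1):
    if not is_comp[i]:
        primes.append(i)
        mu[i] = -1
    for p in primes:
        if i * p > N:
            break
        is_comp[i * p] = True
        if i % p == 0:
            mu[i * p] = 0  # i*p is divisible by p^2, hence not squarefree
            break
        mu[i * p] = -mu[i]

full = sum(mu[1:]) / N  # (1/N) sum_{n <= N} mu(n)
sampled = {p: p * sum(mu[n] for n in range(p, N + 1, p)) / N
           for p in [2, 3, 5, 7, 11, 13]}
print(full)
print(sampled)
```

Both the full average and each sampled subsum come out close to zero, consistent with the corollary (which of course only guarantees agreement outside an exceptional set of primes).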
In the above argument we applied (42) with a very specific choice of function ${F_i}$. The power of Theorem 27 lies in the ability to select many other such functions ${F_i}$, leading to estimates that do not seem to be obtainable purely from the second moment method. In particular we have the following generalisation of the previous estimate:

Proposition 29 (Weak Elliott inequality for multiple correlations) Let ${I \subset {\bf R}}$ be an interval of length at least ${1}$. Let ${f:{\bf Z} \rightarrow {\bf C}}$ be a function with ${|f(n)| \leq 1}$ for all ${n}$, and let ${0 < \varepsilon < \frac{1}{2}}$. Let ${h_1,\dots,h_k}$ be integers. Then one has

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n+h_1 p) \dots f(n+h_k p)$

$\displaystyle = \frac{1}{|I|} \sum_{n \in I} f(n + h_1 p) \dots f(n+h_k p) p 1_{p|n} + O_{k,h_1,\dots,h_k}(\varepsilon)$

for all primes ${p \leq \log |I|}$ outside of an exceptional set of primes of logarithmic size ${O_{k,h_1,\dots,h_k}( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}$.

Proof: We allow all implied constants to depend on ${k,h_1,\dots,h_k}$. As before we can assume that ${\varepsilon}$ is sufficiently small (depending on ${k,h_1,\dots,h_k}$), that ${f}$ takes values in a set ${R}$ of bounded complex numbers of cardinality ${O(\varepsilon^{-2})}$, and that ${|I|}$ is large in the sense of (49), and restrict attention to primes up to ${P := \log^{1/2} |I|}$. By shifting the ${h_j}$ and using the large nature of ${|I|}$ we can assume that the ${h_j}$ are all non-negative, taking values in ${\{0,\dots,H-1\}}$ for some ${H=O(1)}$. We now apply Theorem 27 with ${\ell_i = H p_i}$ and conclude as before that

$\displaystyle d_{TV}( (\mathbf{X}_i, \mathbf{Y}_i), (\mathbf{X}_i, \mathbf{U}_i) ) \ll \varepsilon$

for ${p_i}$ outside of a set of primes of logarithmic size ${O( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}$.
Let ${i}$ be as above. Let ${F_i: R^{Hp_i} \times {\bf Z}/p_i{\bf Z} \rightarrow {\bf C}}$ be the function

$\displaystyle F_i( (x_1,\dots,x_{Hp_i}), a ) := \sum_{j=1}^{p_i} x_{j+h_1 p_i} \dots x_{j+h_k p_i} 1_{j = -a \hbox{ mod } p_i}.$

This function is still bounded by ${O(1)}$, so by (42) as before we have

$\displaystyle {\mathbb E} F_i( \mathbf{X}_i, \mathbf{Y}_i ) = {\mathbb E} F_i( \mathbf{X}_i, \mathbf{U}_i ) + O(\varepsilon).$

The left-hand side is equal to

$\displaystyle \frac{1}{|I|} \sum_{n \in I} \sum_{j=1}^{p_i} f(n+j+h_1 p_i) \dots f(n+j+h_k p_i) 1_{j = -n \hbox{ mod } p_i} + O(\varepsilon)$

which on switching the summations and using the large nature of ${|I|}$ can be rewritten as

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n+h_1 p_i) \dots f(n+h_k p_i) p_i 1_{p_i|n} + O(\varepsilon).$

Meanwhile, the right-hand side is equal to

$\displaystyle \frac{1}{|I|} \sum_{n \in I} \frac{1}{p_i} \sum_{j=1}^{p_i} f(n+j+h_1 p_i) \dots f(n+j+h_k p_i) + O(\varepsilon)$

which again by switching the summations becomes

$\displaystyle \frac{1}{|I|} \sum_{n \in I} f(n+h_1 p_i) \dots f(n+h_k p_i) + O(\varepsilon).$

The claim follows. $\Box$
There is a logarithmically averaged version of the above proposition:

Exercise 30 (Weak Elliott inequality for logarithmically averaged multiple correlations) Let ${2 \leq y \leq x}$ with ${x/y \geq 2}$, let ${f: {\bf Z} \rightarrow {\bf C}}$ be a function bounded in magnitude by ${1}$, let ${0 < \varepsilon < \frac{1}{2}}$, and let ${h_1,\dots,h_k}$ be integers. Show that

$\displaystyle \frac{1}{\log x/y} \sum_{y \leq n \leq x} \frac{f(n+h_1 p) \dots f(n+h_k p)}{n}$

$\displaystyle = \frac{1}{\log x/y} \sum_{y \leq n \leq x} \frac{f(n + h_1 p) \dots f(n+h_k p) p 1_{p|n}}{n} + O_{k,h_1,\dots,h_k}(\varepsilon)$

for all primes ${p \leq \log y}$ outside of an exceptional set of primes of logarithmic size ${O_{k,h_1,\dots,h_k}( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}$.

When one specialises to multiplicative functions, this lets us dilate shifts in multiple correlations by primes:

Exercise 31 Let ${2 \leq N \leq y \leq x}$ with ${x/y \geq N}$, let ${f: {\bf N} \rightarrow {\bf C}}$ be a multiplicative function bounded in magnitude by ${1}$, let ${0 < \varepsilon < \frac{1}{2}}$, and let ${h_1,\dots,h_k}$ be nonnegative integers. Show that

$\displaystyle f(p)^k \frac{1}{\log x/y} \sum_{y \leq n \leq x} \frac{f(n+h_1) \dots f(n+h_k)}{n}$

$\displaystyle = \frac{1}{\log x/y} \sum_{y \leq n \leq x} \frac{f(n + h_1 p) \dots f(n+h_k p)}{n} + O_{k,h_1,\dots,h_k}(\varepsilon)$

for all primes ${p \leq \log N}$ outside of an exceptional set of primes of logarithmic size ${O_{k,h_1,\dots,h_k}( \varepsilon^{-2} \log \frac{1}{\varepsilon} )}$.

For instance, setting ${f}$ to be the Möbius function, ${h_1=0}$, ${h_2=1}$, and ${y = \log x}$ (say), we see that

$\displaystyle \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+p)}{n} + O(\varepsilon)$

for all primes ${p \leq \log\log x}$ outside of an exceptional set of primes of logarithmic size ${O(\varepsilon^{-2} \log \frac{1}{\varepsilon})}$. In particular, for ${x}$ large enough, one can obtain bounds of the form

$\displaystyle \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} =\frac{1}{(\# {\mathcal P}) \log x} \sum_{n \leq x} \sum_{p \in {\mathcal P}} \frac{\mu(n) \mu(n+p)}{n} + O(\varepsilon)$

for various moderately large sets of primes ${p}$. It turns out that these double sums on the right-hand side can be estimated by methods which we will cover in later series of notes. Among other things, this allows us to establish estimates such as

$\displaystyle \frac{1}{\log x} \sum_{n \leq x} \frac{\mu(n) \mu(n+1)}{n} = o(1)$

as ${x \rightarrow \infty}$, which to date have only been established using these entropy methods (in conjunction with the methods discussed in later notes). This is progress towards an open problem in analytic number theory known as Chowla’s conjecture, which we will also discuss in later notes.
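As a closing illustration, one can compute the logarithmically averaged correlations appearing above for modest ranges (a numerical sketch only, with an arbitrary cutoff ${x = 10^5}$; such computations prove nothing about the ${x \rightarrow \infty}$ limit, but they are consistent with these correlations being small):

```python
import math

# Sieve the Mobius function up to x + 13 (we need mu(n + p) for p <= 13);
# the cutoff x = 10**5 is an arbitrary illustrative choice.
x = 10**5
M = x + 13
mu = [0] * (M + 1)
mu[1] = 1
is_comp = [False] * (M + 1)
primes = []
for i in range(2, M + 1):
    if not is_comp[i]:
        primes.append(i)
        mu[i] = -1
    for p in primes:
        if i * p > M:
            break
        is_comp[i * p] = True
        if i % p == 0:
            mu[i * p] = 0  # i*p is divisible by p^2, hence not squarefree
            break
        mu[i * p] = -mu[i]

def log_corr(h):
    # (1 / log x) * sum_{n <= x} mu(n) mu(n+h) / n
    return sum(mu[n] * mu[n + h] / n for n in range(1, x + 1)) / math.log(x)

base = log_corr(1)  # the correlation with shift 1
shifted = {p: log_corr(p) for p in [3, 5, 7, 11, 13]}  # shifts by small primes
print(base)
print(shifted)
```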