You are currently browsing the tag archive for the ‘Mobius function’ tag.

The (classical) Möbius function ${\mu: {\bf N} \rightarrow {\bf Z}}$ is the unique function that obeys the classical Möbius inversion formula:

Proposition 1 (Classical Möbius inversion) Let ${f,g: {\bf N} \rightarrow A}$ be functions from the natural numbers to an additive group ${A}$. Then the following two claims are equivalent:
• (i) ${f(n) = \sum_{d|n} g(d)}$ for all ${n \in {\bf N}}$.
• (ii) ${g(n) = \sum_{d|n} \mu(n/d) f(d)}$ for all ${n \in {\bf N}}$.

There is a generalisation of this formula to (finite) posets, due to Hall, in which one sums over chains ${n_0 > \dots > n_k}$ in the poset:

Proposition 2 (Poset Möbius inversion) Let ${{\mathcal N}}$ be a finite poset, and let ${f,g: {\mathcal N} \rightarrow A}$ be functions from that poset to an additive group ${A}$. Then the following two claims are equivalent:
• (i) ${f(n) = \sum_{d \leq n} g(d)}$ for all ${n \in {\mathcal N}}$, where ${d}$ is understood to range in ${{\mathcal N}}$.
• (ii) ${g(n) = \sum_{k=0}^\infty (-1)^k \sum_{n = n_0 > n_1 > \dots > n_k} f(n_k)}$ for all ${n \in {\mathcal N}}$, where in the inner sum ${n_0,\dots,n_k}$ are understood to range in ${{\mathcal N}}$ with the indicated ordering.
(Note from the finite nature of ${{\mathcal N}}$ that the inner sum in (ii) is vacuous for all but finitely many ${k}$.)

Comparing Proposition 2 with Proposition 1, it is natural to refer to the function ${\mu(d,n) := \sum_{k=0}^\infty (-1)^k \sum_{n = n_0 > n_1 > \dots > n_k = d} 1}$ as the Möbius function of the poset; the condition (ii) can then be written as

$\displaystyle g(n) = \sum_{d \leq n} \mu(d,n) f(d).$

Proof: If (i) holds, then we have

$\displaystyle g(n) = f(n) - \sum_{d

for any ${n \in {\mathcal N}}$. Iterating this we obtain (ii). Conversely, from (ii) and separating out the ${k=0}$ term, and grouping all the other terms based on the value of ${d:=n_1}$, we obtain (1), and hence (i). $\Box$

In fact it is not completely necessary that the poset ${{\mathcal N}}$ be finite; an inspection of the proof shows that it suffices that every element ${n}$ of the poset has only finitely many predecessors ${\{ d \in {\mathcal N}: d < n \}}$.

It is not difficult to see that Proposition 2 includes Proposition 1 as a special case, after verifying the combinatorial fact that the quantity

$\displaystyle \sum_{k=0}^\infty (-1)^k \sum_{d=n_k | n_{k-1} | \dots | n_1 | n_0 = n} 1$

is equal to ${\mu(n/d)}$ when ${d}$ divides ${n}$, and vanishes otherwise.

I recently discovered that Proposition 2 can also lead to a useful variant of the inclusion-exclusion principle. The classical version of this principle can be phrased in terms of indicator functions: if ${A_1,\dots,A_\ell}$ are subsets of some set ${X}$, then

$\displaystyle \prod_{j=1}^\ell (1-1_{A_j}) = \sum_{k=0}^\ell (-1)^k \sum_{1 \leq j_1 < \dots < j_k \leq \ell} 1_{A_{j_1} \cap \dots \cap A_{j_k}}.$

In particular, if there is a finite measure ${\nu}$ on ${X}$ for which ${A_1,\dots,A_\ell}$ are all measurable, we have

$\displaystyle \nu(X \backslash \bigcup_{j=1}^\ell A_j) = \sum_{k=0}^\ell (-1)^k \sum_{1 \leq j_1 < \dots < j_k \leq \ell} \nu( A_{j_1} \cap \dots \cap A_{j_k} ).$

One drawback of this formula is that there are exponentially many terms on the right-hand side: ${2^\ell}$ of them, in fact. However, in many cases of interest there are “collisions” between the intersections ${A_{j_1} \cap \dots \cap A_{j_k}}$ (for instance, perhaps many of the pairwise intersections ${A_i \cap A_j}$ agree), in which case there is an opportunity to collect terms and hopefully achieve some cancellation. It turns out that it is possible to use Proposition 2 to do this, in which one only needs to sum over chains in the resulting poset of intersections:

Proposition 3 (Hall-type inclusion-exclusion principle) Let ${A_1,\dots,A_\ell}$ be subsets of some set ${X}$, and let ${{\mathcal N}}$ be the finite poset formed by intersections of some of the ${A_i}$ (with the convention that ${X}$ is the empty intersection), ordered by set inclusion. Then for any ${E \in {\mathcal N}}$, one has

$\displaystyle 1_E \prod_{F \subsetneq E} (1 - 1_F) = \sum_{k=0}^\ell (-1)^k \sum_{E = E_0 \supsetneq E_1 \supsetneq \dots \supsetneq E_k} 1_{E_k} \ \ \ \ \ (2)$

where ${F, E_0,\dots,E_k}$ are understood to range in ${{\mathcal N}}$. In particular (setting ${E}$ to be the empty intersection) if the ${A_j}$ are all proper subsets of ${X}$ then we have

$\displaystyle \prod_{j=1}^\ell (1-1_{A_j}) = \sum_{k=0}^\ell (-1)^k \sum_{X = E_0 \supsetneq E_1 \supsetneq \dots \supsetneq E_k} 1_{E_k}. \ \ \ \ \ (3)$

In particular, if there is a finite measure ${\nu}$ on ${X}$ for which ${A_1,\dots,A_\ell}$ are all measurable, we have

$\displaystyle \mu(X \backslash \bigcup_{j=1}^\ell A_j) = \sum_{k=0}^\ell (-1)^k \sum_{X = E_0 \supsetneq E_1 \supsetneq \dots \supsetneq E_k} \mu(E_k).$

Using the Möbius function ${\mu}$ on the poset ${{\mathcal N}}$, one can write these formulae as

$\displaystyle 1_E \prod_{F \subsetneq E} (1 - 1_F) = \sum_{F \subseteq E} \mu(F,E) 1_F,$

$\displaystyle \prod_{j=1}^\ell (1-1_{A_j}) = \sum_F \mu(F,X) 1_F$

and

$\displaystyle \nu(X \backslash \bigcup_{j=1}^\ell A_j) = \sum_F \mu(F,X) \nu(F).$

Proof: It suffices to establish (2) (to derive (3) from (2) observe that all the ${F \subsetneq X}$ are contained in one of the ${A_j}$, so the effect of ${1-1_F}$ may be absorbed into ${1 - 1_{A_j}}$). Applying Proposition 2, this is equivalent to the assertion that

$\displaystyle 1_E = \sum_{F \subseteq E} 1_F \prod_{G \subsetneq F} (1 - 1_G)$

for all ${E \in {\mathcal N}}$. But this amounts to the assertion that for each ${x \in E}$, there is precisely one ${F \subseteq E}$ in ${{\mathcal n}}$ with the property that ${x \in F}$ and ${x \not \in G}$ for any ${G \subsetneq F}$ in ${{\mathcal N}}$, namely one can take ${F}$ to be the intersection of all ${G \subseteq E}$ in ${{\mathcal N}}$ such that ${G}$ contains ${x}$. $\Box$

Example 4 If ${A_1,A_2,A_3 \subsetneq X}$ with ${A_1 \cap A_2 = A_1 \cap A_3 = A_2 \cap A_3 = A_*}$, and ${A_1,A_2,A_3,A_*}$ are all distinct, then we have for any finite measure ${\nu}$ on ${X}$ that makes ${A_1,A_2,A_3}$ measurable that

$\displaystyle \nu(X \backslash (A_1 \cup A_2 \cup A_3)) = \nu(X) - \nu(A_1) - \nu(A_2) \ \ \ \ \ (4)$

$\displaystyle - \nu(A_3) - \nu(A_*) + 3 \nu(A_*)$

due to the four chains ${X \supsetneq A_1}$, ${X \supsetneq A_2}$, ${X \supsetneq A_3}$, ${X \supsetneq A_*}$ of length one, and the three chains ${X \supsetneq A_1 \supsetneq A_*}$, ${X \supsetneq A_2 \supsetneq A_*}$, ${X \supsetneq A_3 \supsetneq A_*}$ of length two. Note that this expansion just has six terms in it, as opposed to the ${2^3=8}$ given by the usual inclusion-exclusion formula, though of course one can reduce the number of terms by combining the ${\nu(A_*)}$ factors. This may not seem particularly impressive, especially if one views the term ${3 \mu(A_*)}$ as really being three terms instead of one, but if we add a fourth set ${A_4 \subsetneq X}$ with ${A_i \cap A_j = A_*}$ for all ${1 \leq i < j \leq 4}$, the formula now becomes

$\displaystyle \nu(X \backslash (A_1 \cup A_2 \cup A_3 \cap A_4)) = \nu(X) - \nu(A_1) - \nu(A_2) \ \ \ \ \ (5)$

$\displaystyle - \nu(A_3) - \nu(A_4) - \nu(A_*) + 4 \nu(A_*)$

and we begin to see more cancellation as we now have just seven terms (or ten if we count ${4 \nu(A_*)}$ as four terms) instead of ${2^4 = 16}$ terms.

Example 5 (Variant of Legendre sieve) If ${q_1,\dots,q_\ell > 1}$ are natural numbers, and ${a_1,a_2,\dots}$ is some sequence of complex numbers with only finitely many terms non-zero, then by applying the above proposition to the sets ${A_j := q_j {\bf N}}$ and with ${\nu}$ equal to counting measure weighted by the ${a_n}$ we obtain a variant of the Legendre sieve

$\displaystyle \sum_{n: (n,q_1 \dots q_\ell) = 1} a_n = \sum_{k=0}^\ell (-1)^k \sum_{1 |' d_1 |' \dots |' d_k} \sum_{n: d_k |n} a_n$

where ${d_1,\dots,d_k}$ range over the set ${{\mathcal N}}$ formed by taking least common multiples of the ${q_j}$ (with the understanding that the empty least common multiple is ${1}$), and ${d |' n}$ denotes the assertion that ${d}$ divides ${n}$ but is strictly less than ${n}$. I am curious to know of this version of the Legendre sieve already appears in the literature (and similarly for the other applications of Proposition 2 given here).

If the poset ${{\mathcal N}}$ has bounded depth then the number of terms in Proposition 3 can end up being just polynomially large in ${\ell}$ rather than exponentially large. Indeed, if all chains ${X \supsetneq E_1 \supsetneq \dots \supsetneq E_k}$ in ${{\mathcal N}}$ have length ${k}$ at most ${k_0}$ then the number of terms here is at most ${1 + \ell + \dots + \ell^{k_0}}$. (The examples (4), (5) are ones in which the depth is equal to two.) I hope to report in a later post on how this version of inclusion-exclusion with polynomially many terms can be useful in an application.

Actually in our application we need an abstraction of the above formula, in which the indicator functions are replaced by more abstract idempotents:

Proposition 6 (Hall-type inclusion-exclusion principle for idempotents) Let ${A_1,\dots,A_\ell}$ be pairwise commuting elements of some ring ${R}$ with identity, which are all idempotent (thus ${A_j A_j = A_j}$ for ${j=1,\dots,\ell}$). Let ${{\mathcal N}}$ be the finite poset formed by products of the ${A_i}$ (with the convention that ${1}$ is the empty product), ordered by declaring ${E \leq F}$ when ${EF = E}$ (note that all the elements of ${{\mathcal N}}$ are idempotent so this is a partial ordering). Then for any ${E \in {\mathcal N}}$, one has

$\displaystyle E \prod_{F < E} (1-F) = \sum_{k=0}^\ell (-1)^k \sum_{E = E_0 > E_1 > \dots > E_k} E_k. \ \ \ \ \ (6)$

where ${F, E_0,\dots,E_k}$ are understood to range in ${{\mathcal N}}$. In particular (setting ${E=1}$) if all the ${A_j}$ are not equal to ${1}$ then we have

$\displaystyle \prod_{j=1}^\ell (1-A_j) = \sum_{k=0}^\ell (-1)^k \sum_{1 = E_0 > E_1 > \dots > E_k} E_k.$

Morally speaking this proposition is equivalent to the previous one after applying a “spectral theorem” to simultaneously diagonalise all of the ${A_j}$, but it is quicker to just adapt the previous proof to establish this proposition directly. Using the Möbius function ${\mu}$ for ${{\mathcal N}}$, we can rewrite these formulae as

$\displaystyle E \prod_{F < E} (1-F) = \sum_{F \leq E} \mu(F,E) 1_F$

and

$\displaystyle \prod_{j=1}^\ell (1-A_j) = \sum_F \mu(F,1) 1_F.$

Proof: Again it suffices to verify (6). Using Proposition 2 as before, it suffices to show that

$\displaystyle E = \sum_{F \leq E} F \prod_{G < F} (1 - G) \ \ \ \ \ (7)$

for all ${E \in {\mathcal N}}$ (all sums and products are understood to range in ${{\mathcal N}}$). We can expand

$\displaystyle E = E \prod_{G < E} (G + (1-G)) = \sum_{{\mathcal A}} (\prod_{G \in {\mathcal A}} G) (\prod_{G < E: G \not \in {\mathcal A}} (1-G)) \ \ \ \ \ (8)$

where ${{\mathcal A}}$ ranges over all subsets of ${\{ G \in {\mathcal N}: G \leq E \}}$ that contain ${E}$. For such an ${{\mathcal A}}$, if we write ${F := \prod_{G \in {\mathcal A}} G}$, then ${F}$ is the greatest lower bound of ${{\mathcal A}}$, and we observe that ${F (\prod_{G < E: G \not \in {\mathcal A}} (1-G))}$ vanishes whenever ${{\mathcal A}}$ fails to contain some ${G \in {\mathcal N}}$ with ${F \leq G \leq E}$. Thus the only ${{\mathcal A}}$ that give non-zero contributions to (8) are the intervals of the form ${\{ G \in {\mathcal N}: F \leq G \leq E\}}$ for some ${F \leq E}$ (which then forms the greatest lower bound for that interval), and the claim (7) follows (after noting that ${F (1-G) = F (1-FG)}$ for any ${F,G \in {\mathcal N}}$). $\Box$

Kaisa Matomäki, Maksym Radziwiłł, and I have just uploaded to the arXiv our paper “Sign patterns of the Liouville and Möbius functions“. This paper is somewhat similar to our previous paper in that it is using the recent breakthrough of Matomäki and Radziwiłł on mean values of multiplicative functions to obtain partial results towards the Chowla conjecture. This conjecture can be phrased, roughly speaking, as follows: if ${k}$ is a fixed natural number and ${n}$ is selected at random from a large interval ${[1,x]}$, then the sign pattern ${(\lambda(n), \lambda(n+1),\dots,\lambda(n+k-1)) \in \{-1,+1\}^k}$ becomes asymptotically equidistributed in ${\{-1,+1\}^k}$ in the limit ${x \rightarrow \infty}$. This remains open for ${k \geq 2}$. In fact even the significantly weaker statement that each of the sign patterns in ${\{-1,+1\}^k}$ is attained infinitely often is open for ${k \geq 4}$. However, in 1986, Hildebrand showed that for ${k \leq 3}$ all sign patterns are indeed attained infinitely often. Our first result is a strengthening of Hildebrand’s, moving a little bit closer to Chowla’s conjecture:

Theorem 1 Let ${k \leq 3}$. Then each of the sign patterns in ${\{-1,+1\}^k}$ is attained by the Liouville function for a set of natural numbers ${n}$ of positive lower density.

Thus for instance one has ${\lambda(n)=\lambda(n+1)=\lambda(n+2)}$ for a set of ${n}$ of positive lower density. The ${k \leq 2}$ case of this theorem already appears in the original paper of Matomäki and Radziwiłł (and the significantly simpler case of the sign patterns ${++}$ and ${--}$ was treated previously by Harman, Pintz, and Wolke).

The basic strategy in all of these arguments is to assume for sake of contradiction that a certain sign pattern occurs extremely rarely, and then exploit the complete multiplicativity of ${\lambda}$ (which implies in particular that ${\lambda(2n) = -\lambda(n)}$, ${\lambda(3n) = -\lambda(n)}$, and ${\lambda(5n) = -\lambda(n)}$ for all ${n}$) together with some combinatorial arguments (vaguely analogous to solving a Sudoku puzzle!) to establish more complex sign patterns for the Liouville function, that are either inconsistent with each other, or with results such as the Matomäki-Radziwiłł result. To illustrate this, let us give some ${k=2}$ examples, arguing a little informally to emphasise the combinatorial aspects of the argument. First suppose that the sign pattern ${(\lambda(n),\lambda(n+1)) = (+1,+1)}$ almost never occurs. The prime number theorem tells us that ${\lambda(n)}$ and ${\lambda(n+1)}$ are each equal to ${+1}$ about half of the time, which by inclusion-exclusion implies that the sign pattern ${(\lambda(n),\lambda(n+1))=(-1,-1)}$ almost never occurs. In other words, we have ${\lambda(n+1) = -\lambda(n)}$ for almost all ${n}$. But from the multiplicativity property ${\lambda(2n)=-\lambda(n)}$ this implies that one should have

$\displaystyle \lambda(2n+2) = -\lambda(2n)$

$\displaystyle \lambda(2n+1) = -\lambda(2n)$

and

$\displaystyle \lambda(2n+2) = -\lambda(2n+1)$

for almost all ${n}$. But the above three statements are contradictory, and the claim follows.

Similarly, if we assume that the sign pattern ${(\lambda(n),\lambda(n+1)) = (+1,-1)}$ almost never occurs, then a similar argument to the above shows that for any fixed ${h}$, one has ${\lambda(n)=\lambda(n+1)=\dots=\lambda(n+h)}$ for almost all ${n}$. But this means that the mean ${\frac{1}{h} \sum_{j=1}^h \lambda(n+j)}$ is abnormally large for most ${n}$, which (for ${h}$ large enough) contradicts the results of Matomäki and Radziwiłł. Here we see that the “enemy” to defeat is the scenario in which ${\lambda}$ only changes sign very rarely, in which case one rarely sees the pattern ${(+1,-1)}$.

It turns out that similar (but more combinatorially intricate) arguments work for sign patterns of length three (but are unlikely to work for most sign patterns of length four or greater). We give here one fragment of such an argument (due to Hildebrand) which hopefully conveys the Sudoku-type flavour of the combinatorics. Suppose for instance that the sign pattern ${(\lambda(n),\lambda(n+1),\lambda(n+2)) = (+1,+1,+1)}$ almost never occurs. Now suppose ${n}$ is a typical number with ${\lambda(15n-1)=\lambda(15n+1)=+1}$. Since we almost never have the sign pattern ${(+1,+1,+1)}$, we must (almost always) then have ${\lambda(15n) = -1}$. By multiplicativity this implies that

$\displaystyle (\lambda(60n-4), \lambda(60n), \lambda(60n+4)) = (+1,-1,+1).$

We claim that this (almost always) forces ${\lambda(60n+5)=-1}$. For if ${\lambda(60n+5)=+1}$, then by the lack of the sign pattern ${(+1,+1,+1)}$, this (almost always) forces ${\lambda(60n+3)=\lambda(60n+6)=-1}$, which by multiplicativity forces ${\lambda(20n+1)=\lambda(20n+2)=+1}$, which by lack of ${(+1,+1,+1)}$ (almost always) forces ${\lambda(20n)=-1}$, which by multiplicativity contradicts ${\lambda(60n)=-1}$. Thus we have ${\lambda(60n+5)=-1}$; a similar argument gives ${\lambda(60n-5)=-1}$ almost always, which by multiplicativity gives ${\lambda(12n-1)=\lambda(12n)=\lambda(12n+1)=+1}$, a contradiction. Thus we almost never have ${\lambda(15n-1)=\lambda(15n+1)=+1}$, which by the inclusion-exclusion argument mentioned previously shows that ${\lambda(15n+1) = - \lambda(15n-1)}$ for almost all ${n}$.

One can continue these Sudoku-type arguments and conclude eventually that ${\lambda(3n-1)=-\lambda(3n+1)=\lambda(3n+2)}$ for almost all ${n}$. To put it another way, if ${\chi_3}$ denotes the non-principal Dirichlet character of modulus ${3}$, then ${\lambda \chi_3}$ is almost always constant away from the multiples of ${3}$. (Conversely, if ${\lambda \chi_3}$ changed sign very rarely outside of the multiples of three, then the sign pattern ${(+1,+1,+1)}$ would never occur.) Fortunately, the main result of Matomäki and Radziwiłł shows that this scenario cannot occur, which establishes that the sign pattern ${(+1,+1,+1)}$ must occur rather frequently. The other sign patterns are handled by variants of these arguments.

Excluding a sign pattern of length three leads to useful implications like “if ${\lambda(n-1)=\lambda(n)=+1}$, then ${\lambda(n+1)=-1}$” which turn out are just barely strong enough to quite rigidly constrain the Liouville function using Sudoku-like arguments. In contrast, excluding a sign pattern of length four only gives rise to implications like “`if ${\lambda(n-2)=\lambda(n-1)=\lambda(n)=+1}$, then ${\lambda(n+1)=-1}$“, and these seem to be much weaker for this purpose (the hypothesis in these implications just isn’t satisfied nearly often enough). So a different idea seems to be needed if one wishes to extend the above theorem to larger values of ${k}$.

Our second theorem gives an analogous result for the Möbius function ${\mu}$ (which takes values in ${\{-1,0,+1\}}$ rather than ${\{-1,1\}}$), but the analysis turns out to be remarkably difficult and we are only able to get up to ${k=2}$:

Theorem 2 Let ${k \leq 2}$. Then each of the sign patterns in ${\{-1,0,+1\}^k}$ is attained by the Möbius function for a set ${n}$ of positive lower density.

It turns out that the prime number theorem and elementary sieve theory can be used to handle the ${k=1}$ case and all the ${k=2}$ cases that involve at least one ${0}$, leaving only the four sign patterns ${(\pm 1, \pm 1)}$ to handle. It is here that the zeroes of the Möbius function cause a significant new obstacle. Suppose for instance that the sign pattern ${(+1, -1)}$ almost never occurs for the Möbius function. The same arguments that were used in the Liouville case then show that ${\mu(n)}$ will be almost always equal to ${\mu(n+1)}$, provided that ${n,n+1}$ are both square-free. One can try to chain this together as before to create a long string ${\mu(n)=\dots=\mu(n+h) \in \{-1,+1\}}$ where the Möbius function is constant, but this cannot work for any ${h}$ larger than three, because the Möbius function vanishes at every multiple of four.

The constraints we assume on the Möbius function can be depicted using a graph on the squarefree natural numbers, in which any two adjacent squarefree natural numbers are connected by an edge. The main difficulty is then that this graph is highly disconnected due to the multiples of four not being squarefree.

To get around this, we need to enlarge the graph. Note from multiplicativity that if ${\mu(n)}$ is almost always equal to ${\mu(n+1)}$ when ${n,n+1}$ are squarefree, then ${\mu(n)}$ is almost always equal to ${\mu(n+p)}$ when ${n,n+p}$ are squarefree and ${n}$ is divisible by ${p}$. We can then form a graph on the squarefree natural numbers by connecting ${n}$ to ${n+p}$ whenever ${n,n+p}$ are squarefree and ${n}$ is divisible by ${p}$. If this graph is “locally connected” in some sense, then ${\mu}$ will be constant on almost all of the squarefree numbers in a large interval, which turns out to be incompatible with the results of Matomäki and Radziwiłł. Because of this, matters are reduced to establishing the connectedness of a certain graph. More precisely, it turns out to be sufficient to establish the following claim:

Theorem 3 For each prime ${p}$, let ${a_p \hbox{ mod } p^2}$ be a residue class chosen uniformly at random. Let ${G}$ be the random graph whose vertices ${V}$ consist of those integers ${n}$ not equal to ${a_p \hbox{ mod } p^2}$ for any ${p}$, and whose edges consist of pairs ${n,n+p}$ in ${V}$ with ${n = a_p \hbox{ mod } p}$. Then with probability ${1}$, the graph ${G}$ is connected.

We were able to show the connectedness of this graph, though it turned out to be remarkably tricky to do so. Roughly speaking (and suppressing a number of technicalities), the main steps in the argument were as follows.

• (Early stage) Pick a large number ${X}$ (in our paper we take ${X}$ to be odd, but I’ll ignore this technicality here). Using a moment method to explore neighbourhoods of a single point in ${V}$, one can show that a vertex ${v}$ in ${V}$ is almost always connected to at least ${\log^{10} X}$ numbers in ${[v,v+X^{1/100}]}$, using relatively short paths of short diameter. (This is the most computationally intensive portion of the argument.)
• (Middle stage) Let ${X'}$ be a typical number in ${[X/40,X/20]}$, and let ${R}$ be a scale somewhere between ${X^{1/40}}$ and ${X'}$. By using paths ${n, n+p_1, n+p_1-p_2, n+p_1-p_2+p_3}$ involving three primes, and using a variant of Vinogradov’s theorem and some routine second moment computations, one can show that with quite high probability, any “good” vertex in ${[v+X'-R, v+X'-0.99R]}$ is connected to a “good” vertex in ${[v+X'-0.01R, v+X-0.0099 R]}$ by paths of length three, where the definition of “good” is somewhat technical but encompasses almost all of the vertices in ${V}$.
• (Late stage) Combining the two previous results together, we can show that most vertices ${v}$ will be connected to a vertex in ${[v+X'-X^{1/40}, v+X']}$ for any ${X'}$ in ${[X/40,X/20]}$. In particular, ${v}$ will be connected to a set of ${\gg X^{9/10}}$ vertices in ${[v,v+X/20]}$. By tracking everything carefully, one can control the length and diameter of the paths used to connect ${v}$ to this set, and one can also control the parity of the elements in this set.
• (Final stage) Now if we have two vertices ${v, w}$ at a distance ${X}$ apart. By the previous item, one can connect ${v}$ to a large set ${A}$ of vertices in ${[v,v+X/20]}$, and one can similarly connect ${w}$ to a large set ${B}$ of vertices in ${[w,w+X/20]}$. Now, by using a Vinogradov-type theorem and second moment calculations again (and ensuring that the elements of ${A}$ and ${B}$ have opposite parity), one can connect many of the vertices in ${A}$ to many of the vertices ${B}$ by paths of length three, which then connects ${v}$ to ${w}$, and gives the claim.

It seems of interest to understand random graphs like ${G}$ further. In particular, the graph ${G'}$ on the integers formed by connecting ${n}$ to ${n+p}$ for all ${n}$ in a randomly selected residue class mod ${p}$ for each prime ${p}$ is particularly interesting (it is to the Liouville function as ${G}$ is to the Möbius function); if one could show some “local expander” properties of this graph ${G'}$, then one would have a chance of modifying the above methods to attack the first unsolved case of the Chowla conjecture, namely that ${\lambda(n)\lambda(n+1)}$ has asymptotic density zero (perhaps working with logarithmic density instead of natural density to avoids some technicalities).

We now move away from the world of multiplicative prime number theory covered in Notes 1 and Notes 2, and enter the wider, and complementary, world of non-multiplicative prime number theory, in which one studies statistics related to non-multiplicative patterns, such as twins ${n,n+2}$. This creates a major jump in difficulty; for instance, even the most basic multiplicative result about the primes, namely Euclid’s theorem that there are infinitely many of them, remains unproven for twin primes. Of course, the situation is even worse for stronger results, such as Euler’s theorem, Dirichlet’s theorem, or the prime number theorem. Finally, even many multiplicative questions about the primes remain open. The most famous of these is the Riemann hypothesis, which gives the asymptotic ${\sum_{n \leq x} \Lambda(n) = x + O( \sqrt{x} \log^2 x )}$ (see Proposition 24 from Notes 2). But even if one assumes the Riemann hypothesis, the precise distribution of the error term ${O( \sqrt{x} \log^2 x )}$ in the above asymptotic (or in related asymptotics, such as for the sum ${\sum_{x \leq n < x+y} \Lambda(n)}$ that measures the distribution of primes in short intervals) is not entirely clear.

Despite this, we do have a number of extremely convincing and well supported models for the primes (and related objects) that let us predict what the answer to many prime number theory questions (both multiplicative and non-multiplicative) should be, particularly in asymptotic regimes where one can work with aggregate statistics about the primes, rather than with a small number of individual primes. These models are based on taking some statistical distribution related to the primes (e.g. the primality properties of a randomly selected ${k}$-tuple), and replacing that distribution by a model distribution that is easy to compute with (e.g. a distribution with strong joint independence properties). One can then predict the asymptotic value of various (normalised) statistics about the primes by replacing the relevant statistical distributions of the primes with their simplified models. In this non-rigorous setting, many difficult conjectures on the primes reduce to relatively simple calculations; for instance, all four of the (still unsolved) Landau problems may now be justified in the affirmative by one or more of these models. Indeed, the models are so effective at this task that analytic number theory is in the curious position of being able to confidently predict the answer to a large proportion of the open problems in the subject, whilst not possessing a clear way forward to rigorously confirm these answers!

As it turns out, the models for primes that have turned out to be the most accurate in practice are random models, which involve (either explicitly or implicitly) one or more random variables. This is despite the prime numbers being obviously deterministic in nature; no coins are flipped or dice rolled to create the set of primes. The point is that while the primes have a lot of obvious multiplicative structure (for instance, the product of two primes is never another prime), they do not appear to exhibit much discernible non-multiplicative structure asymptotically, in the sense that they rarely exhibit statistical anomalies in the asymptotic limit that cannot be easily explained in terms of the multiplicative properties of the primes. As such, when considering non-multiplicative statistics of the primes, the primes appear to behave pseudorandomly, and can thus be modeled with reasonable accuracy by a random model. And even for multiplicative problems, which are in principle controlled by the zeroes of the Riemann zeta function, one can obtain good predictions by positing various pseudorandomness properties of these zeroes, so that the distribution of these zeroes can be modeled by a random model.

Of course, one cannot expect perfect accuracy when replicating a deterministic set such as the primes by a probabilistic model of that set, and each of the heuristic models we discuss below have some limitations to the range of statistics about the primes that they can expect to track with reasonable accuracy. For instance, many of the models about the primes do not fully take into account the multiplicative structure of primes, such as the connection with a zeta function with a meromorphic continuation to the entire complex plane; at the opposite extreme, we have the GUE hypothesis which appears to accurately model the zeta function, but does not capture such basic properties of the primes as the fact that the primes are all natural numbers. Nevertheless, each of the models described below, when deployed within their sphere of reasonable application, has (possibly after some fine-tuning) given predictions that are in remarkable agreement with numerical computation and with known rigorous theoretical results, as well as with other models in overlapping spheres of application; they are also broadly compatible with the general heuristic (discussed in this previous post) that in the absence of any exploitable structure, asymptotic statistics should default to the most “uniform”, “pseudorandom”, or “independent” distribution allowable.

As hinted at above, we do not have a single unified model for the prime numbers (other than the primes themselves, of course), but instead have an overlapping family of useful models that each appear to accurately describe some, but not all, aspects of the prime numbers. In this set of notes, we will discuss four such models:

1. The Cramér random model and its refinements, which model the set ${{\mathcal P}}$ of prime numbers by a random set.
2. The Möbius pseudorandomness principle, which predicts that the Möbius function ${\mu}$ does not correlate with any genuinely different arithmetic sequence of reasonable “complexity”.
3. The equidistribution of residues principle, which predicts that the residue classes of a large number ${n}$ modulo a small or medium-sized prime ${p}$ behave as if they are independently and uniformly distributed as ${p}$ varies.
4. The GUE hypothesis, which asserts that the zeroes of the Riemann zeta function are distributed (at microscopic and mesoscopic scales) like the zeroes of a GUE random matrix, and which generalises the pair correlation conjecture regarding pairs of such zeroes.

This is not an exhaustive list of models for the primes and related objects; for instance, there is also the model in which the major arc contribution in the Hardy-Littlewood circle method is predicted to always dominate, and with regards to various finite groups of number-theoretic importance, such as the class groups discussed in Supplement 1, there are also heuristics of Cohen-Lenstra type. Historically, the first heuristic discussion of the primes along these lines was by Sylvester, who worked informally with a model somewhat related to the equidistribution of residues principle. However, we will not discuss any of these models here.

A word of warning: the discussion of the above four models will inevitably be largely informal, and “fuzzy” in nature. While one can certainly make precise formalisations of at least some aspects of these models, one should not be inflexibly wedded to a specific such formalisation as being “the” correct way to pin down the model rigorously. (To quote the statistician George Box: “all models are wrong, but some are useful”.) Indeed, we will see some examples below the fold in which some finer structure in the prime numbers leads to a correction term being added to a “naive” implementation of one of the above models to make it more accurate, and it is perfectly conceivable that some further such fine-tuning will be applied to one or more of these models in the future. These sorts of mathematical models are in some ways closer in nature to the scientific theories used to model the physical world, than they are to the axiomatic theories one is accustomed to in rigorous mathematics, and one should approach the discussion below accordingly. In particular, and in contrast to the other notes in this course, the material here is not directly used for proving further theorems, which is why we have marked it as “optional” material. Nevertheless, the heuristics and models here are still used indirectly for such purposes, for instance by

• giving a clearer indication of what results one expects to be true, thus guiding one to fruitful conjectures;
• providing a quick way to scan for possible errors in a mathematical claim (e.g. by finding that the main term is off from what a model predicts, or an error term is too small);
• gauging the relative strength of various assertions (e.g. classifying some results as “unsurprising”, others as “potential breakthroughs” or “powerful new estimates”, others as “unexpected new phenomena”, and yet others as “way too good to be true”); or
• setting up heuristic barriers (such as the parity barrier) that one has to resolve before resolving certain key problems (e.g. the twin prime conjecture).

See also my previous essay on the distinction between “rigorous” and “post-rigorous” mathematics, or Thurston’s essay discussing, among other things, the “definition-theorem-proof” model of mathematics and its limitations.

Remark 1 The material in this set of notes presumes some prior exposure to probability theory. See for instance this previous post for a quick review of the relevant concepts.

One of the basic general problems in analytic number theory is to understand as much as possible the fluctuations of the Möbius function ${\mu(n)}$, defined as ${(-1)^k}$ when ${n}$ is the product of ${k}$ distinct primes, and zero otherwise. For instance, as ${\mu}$ takes values in ${\{-1,0,1\}}$, we have the trivial bound

$\displaystyle |\sum_{n \leq x} \mu(n)| \leq x$

and the seemingly slight improvement

$\displaystyle \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)$

is already equivalent to the prime number theorem, as observed by Landau (see e.g. this previous blog post for a proof), while the much stronger (and still open) improvement

$\displaystyle \sum_{n \leq x} \mu(n) = O(x^{1/2+o(1)})$

is equivalent to the notorious Riemann hypothesis.

There is a general Möbius pseudorandomness heuristic that suggests that the sign pattern ${\mu}$ behaves so randomly (or pseudorandomly) that one should expect a substantial amount of cancellation in sums that involve the sign fluctuation of the Möbius function in a nontrivial fashion, with the amount of cancellation present comparable to the amount that an analogous random sum would provide; cf. the probabilistic heuristic discussed in this recent blog post. There are a number of ways to make this heuristic precise. As already mentioned, the Riemann hypothesis can be considered one such manifestation of the heuristic. Another manifestation is the following old conjecture of Chowla:

Conjecture 1 (Chowla’s conjecture) For any fixed integer ${m}$ and exponents ${a_1,a_2,\ldots,a_m \geq 0}$, with at least one of the ${a_i}$ odd (so as not to completely destroy the sign cancellation), we have

$\displaystyle \sum_{n \leq x} \mu(n+1)^{a_1} \ldots \mu(n+m)^{a_m} = o_{x \rightarrow \infty;m}(x).$

Note that as ${\mu^a = \mu^{a+2}}$ for any ${a \geq 1}$, we can reduce to the case when the ${a_i}$ take values in ${0,1,2}$ here. When only one of the ${a_i}$ are odd, this is essentially the prime number theorem in arithmetic progressions (after some elementary sieving), but with two or more of the ${a_i}$ are odd, the problem becomes completely open. For instance, the estimate

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+2) = o(x)$

is morally very close to the conjectured asymptotic

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) = 2\Pi_2 x + o(x)$

for the von Mangoldt function ${\Lambda}$, where ${\Pi_2 := \prod_{p > 2} (1 - \frac{1}{(p-1)^2}) = 0.66016\ldots}$ is the twin prime constant; this asymptotic in turn implies the twin prime conjecture. (To formally deduce estimates for von Mangoldt from estimates for Möbius, though, typically requires some better control on the error terms than ${o()}$, in particular gains of some power of ${\log x}$ are usually needed. See this previous blog post for more discussion.)

Remark 2 The Chowla conjecture resembles an assertion that, for ${n}$ chosen randomly and uniformly from ${1}$ to ${x}$, the random variables ${\mu(n+1),\ldots,\mu(n+k)}$ become asymptotically independent of each other (in the probabilistic sense) as ${x \rightarrow \infty}$. However, this is not quite accurate, because some moments (namely those with all exponents ${a_i}$ even) have the “wrong” asymptotic value, leading to some unwanted correlation between the two variables. For instance, the events ${\mu(n)=0}$ and ${\mu(n+4)=0}$ have a strong correlation with each other, basically because they are both strongly correlated with the event of ${n}$ being divisible by ${4}$. A more accurate interpretation of the Chowla conjecture is that the random variables ${\mu(n+1),\ldots,\mu(n+k)}$ are asymptotically conditionally independent of each other, after conditioning on the zero pattern ${\mu(n+1)^2,\ldots,\mu(n+k)^2}$; thus, it is the sign of the Möbius function that fluctuates like random noise, rather than the zero pattern. (The situation is a bit cleaner if one works instead with the Liouville function ${\lambda}$ instead of the Möbius function ${\mu}$, as this function never vanishes, but we will stick to the traditional Möbius function formalism here.)

A more recent formulation of the Möbius randomness heuristic is the following conjecture of Sarnak. Given a bounded sequence ${f: {\bf N} \rightarrow {\bf C}}$, define the topological entropy of the sequence to be the least exponent ${\sigma}$ with the property that for any fixed ${\varepsilon > 0}$, and for ${m}$ going to infinity the set ${\{ (f(n+1),\ldots,f(n+m)): n \in {\bf N} \} \subset {\bf C}^m}$ of ${f}$ can be covered by ${O( \exp( \sigma m + o(m) ) )}$ balls of radius ${\varepsilon}$ (in the ${\ell^\infty}$ metric). (If ${f}$ arises from a minimal topological dynamical system ${(X,T)}$ by ${f(n) := f(T^n x)}$, the above notion is equivalent to the usual notion of the topological entropy of a dynamical system.) For instance, if the sequence is a bit sequence (i.e. it takes values in ${\{0,1\}}$), then there are only ${\exp(\sigma m + o(m))}$ ${m}$-bit patterns that can appear as blocks of ${m}$ consecutive bits in this sequence. As a special case, a Turing machine with bounded memory that had access to a random number generator at the rate of one random bit produced every ${T}$ units of time, but otherwise evolved deterministically, would have an output sequence that had a topological entropy of at most ${\frac{1}{T} \log 2}$. A bounded sequence is said to be deterministic if its topological entropy is zero. A typical example is a polynomial sequence such as ${f(n) := e^{2\pi i \alpha n^2}}$ for some fixed ${\sigma}$; the ${m}$-blocks of such polynomials sequence have covering numbers that only grow polynomially in ${m}$, rather than exponentially, thus yielding the zero entropy. Unipotent flows, such as the horocycle flow on a compact hyperbolic surface, are another good source of deterministic sequences.

Conjecture 3 (Sarnak’s conjecture) Let ${f: {\bf N} \rightarrow {\bf C}}$ be a deterministic bounded sequence. Then

$\displaystyle \sum_{n \leq x} \mu(n) f(n) = o_{x \rightarrow \infty;f}(x).$

This conjecture in general is still quite far from being solved. However, special cases are known:

• For constant sequences, this is essentially the prime number theorem (1).
• For periodic sequences, this is essentially the prime number theorem in arithmetic progressions.
• For quasiperiodic sequences such as ${f(n) = F(\alpha n \hbox{ mod } 1)}$ for some continuous ${F}$, this follows from the work of Davenport.
• For nilsequences, this is a result of Ben Green and myself.
• For horocycle flows, this is a result of Bourgain, Sarnak, and Ziegler.
• For the Thue-Morse sequence, this is a result of Dartyge-Tenenbaum (with a stronger error term obtained by Maduit-Rivat). A subsequent result of Bourgain handles all bounded rank one sequences (though the Thue-Morse sequence is actually of rank two), and a related result of Green establishes asymptotic orthogonality of the Möbius function to bounded depth circuits, although such functions are not necessarily deterministic in nature.
• For the Rudin-Shapiro sequence, I sketched out an argument at this MathOverflow post.
• The Möbius function is known to itself be non-deterministic, because its square ${\mu^2(n)}$ (i.e. the indicator of the square-free functions) is known to be non-deterministic (indeed, its topological entropy is ${\frac{6}{\pi^2}\log 2}$). (The corresponding question for the Liouville function ${\lambda(n)}$, however, remains open, as the square ${\lambda^2(n)=1}$ has zero entropy.)
• In the converse direction, it is easy to construct sequences of arbitrarily small positive entropy that correlate with the Möbius function (a rather silly example is ${\mu(n) 1_{k|n}}$ for some fixed large (squarefree) ${k}$, which has topological entropy at most ${\log 2/k}$ but clearly correlates with ${\mu}$).

See this survey of Sarnak for further discussion of this and related topics.

In this post I wanted to give a very nice argument of Sarnak that links the above two conjectures:

Proposition 4 The Chowla conjecture implies the Sarnak conjecture.

The argument does not use any number-theoretic properties of the Möbius function; one could replace ${\mu}$ in both conjectures by any other function from the natural numbers to ${\{-1,0,+1\}}$ and obtain the same implication. The argument consists of the following ingredients:

1. To show that ${\sum_{n, it suffices to show that the expectation of the random variable ${\frac{1}{m} (\mu(n+1)f(n+1)+\ldots+\mu(n+m)f(n+m))}$, where ${n}$ is drawn uniformly at random from ${1}$ to ${x}$, can be made arbitrary small by making ${m}$ large (and ${n}$ even larger).
2. By the union bound and the zero topological entropy of ${f}$, it suffices to show that for any bounded deterministic coefficients ${c_1,\ldots,c_m}$, the random variable ${\frac{1}{m}(c_1 \mu(n+1) + \ldots + c_m \mu(n+m))}$ concentrates with exponentially high probability.
3. Finally, this exponentially high concentration can be achieved by the moment method, using a slight variant of the moment method proof of the large deviation estimates such as the Chernoff inequality or Hoeffding inequality (as discussed in this blog post).

As is often the case, though, while the “top-down” order of steps presented above is perhaps the clearest way to think conceptually about the argument, in order to present the argument formally it is more convenient to present the arguments in the reverse (or “bottom-up”) order. This is the approach taken below the fold.

One of the basic problems in analytic number theory is to estimate sums of the form

$\displaystyle \sum_{p

as ${x \rightarrow \infty}$, where ${p}$ ranges over primes and ${f}$ is some explicit function of interest (e.g. a linear phase function ${f(p) = e^{2\pi i \alpha p}}$ for some real number ${\alpha}$). This is essentially the same task as obtaining estimates on the sum

$\displaystyle \sum_{n

where ${\Lambda}$ is the von Mangoldt function. If ${f}$ is bounded, ${f(n)=O(1)}$, then from the prime number theorem one has the trivial bound

$\displaystyle \sum_{n

but often (when ${f}$ is somehow “oscillatory” in nature) one is seeking the refinement

$\displaystyle \sum_{n

or equivalently

$\displaystyle \sum_{p

Thanks to identities such as

$\displaystyle \Lambda(n) = \sum_{d|n} \mu(d) \log(\frac{n}{d}), \ \ \ \ \ (3)$

where ${\mu}$ is the Möbius function, refinements such as (1) are similar in spirit to estimates of the form

$\displaystyle \sum_{n

Unfortunately, the connection between (1) and (4) is not particularly tight; roughly speaking, one needs to improve the bounds in (4) (and variants thereof) by about two factors of ${\log x}$ before one can use identities such as (3) to recover (1). Still, one generally thinks of (1) and (4) as being “morally” equivalent, even if they are not formally equivalent.

When ${f}$ is oscillating in a sufficiently “irrational” way, then one standard way to proceed is the method of Type I and Type II sums, which uses truncated versions of divisor identities such as (3) to expand out either (1) or (4) into linear (Type I) or bilinear sums (Type II) with which one can exploit the oscillation of ${f}$. For instance, Vaughan’s identity lets one rewrite the sum in (1) as the sum of the Type I sum

$\displaystyle \sum_{d \leq U} \mu(d) (\sum_{V/d \leq r \leq x/d} (\log r) f(rd)),$

the Type I sum

$\displaystyle -\sum_{d \leq UV} a(d) \sum_{V/d \leq r \leq x/d} f(rd),$

the Type II sum

$\displaystyle -\sum_{V \leq d \leq x/U} \sum_{U < m \leq x/V} \Lambda(d) b(m) f(dm),$

and the error term ${\sum_{d \leq V} \Lambda(n) f(n)}$, whenever ${1 \leq U, V \leq x}$ are parameters, and ${a, b}$ are the sequences

$\displaystyle a(d) := \sum_{e \leq U, f \leq V: ef = d} \Lambda(d) \mu(e)$

and

$\displaystyle b(m) := \sum_{d|m: d \leq U} \mu(d).$

Similarly one can express (4) as the Type I sum

$\displaystyle -\sum_{d \leq UV} c(d) \sum_{UV/d \leq r \leq x/d} f(rd),$

the Type II sum

$\displaystyle - \sum_{V < d \leq x/U} \sum_{U < m \leq x/d} \mu(m) b(d) f(dm)$

and the error term ${\sum_{d \leq UV} \mu(n) f(N)}$, whenever ${1 \leq U,V \leq x}$ with ${UV \leq x}$, and ${c}$ is the sequence

$\displaystyle c(d) := \sum_{e \leq U, f \leq V: ef = d} \mu(d) \mu(e).$

After eliminating troublesome sequences such as ${a(), b(), c()}$ via Cauchy-Schwarz or the triangle inequality, one is then faced with the task of estimating Type I sums such as

$\displaystyle \sum_{r \leq y} f(rd)$

or Type II sums such as

$\displaystyle \sum_{r \leq y} f(rd) \overline{f(rd')}$

for various ${y, d, d' \geq 1}$. Here, the trivial bound is ${O(y)}$, but due to a number of logarithmic inefficiencies in the above method, one has to obtain bounds that are more like ${O( \frac{y}{\log^C y})}$ for some constant ${C}$ (e.g. ${C=5}$) in order to end up with an asymptotic such as (1) or (4).

However, in a recent paper of Bourgain, Sarnak, and Ziegler, it was observed that as long as one is only seeking the Mobius orthogonality (4) rather than the von Mangoldt orthogonality (1), one can avoid losing any logarithmic factors, and rely purely on qualitative equidistribution properties of ${f}$. A special case of their orthogonality criterion (which actually dates back to an earlier paper of Katai, as was pointed out to me by Nikos Frantzikinakis) is as follows:

Proposition 1 (Orthogonality criterion) Let ${f: {\bf N} \rightarrow {\bf C}}$ be a bounded function such that

$\displaystyle \sum_{n \leq x} f(pn) \overline{f(qn)} = o(x) \ \ \ \ \ (5)$

for any distinct primes ${p, q}$ (where the decay rate of the error term ${o(x)}$ may depend on ${p}$ and ${q}$). Then

$\displaystyle \sum_{n \leq x} \mu(n) f(n) =o(x). \ \ \ \ \ (6)$

Actually, the Bourgain-Sarnak-Ziegler paper establishes a more quantitative version of this proposition, in which ${\mu}$ can be replaced by an arbitrary bounded multiplicative function, but we will content ourselves with the above weaker special case. (See also these notes of Harper, which uses the Katai argument to give a slightly weaker quantitative bound in the same spirit.) This criterion can be viewed as a multiplicative variant of the classical van der Corput lemma, which in our notation asserts that ${\sum_{n \leq x} f(n) = o(x)}$ if one has ${\sum_{n \leq x} f(n+h) \overline{f(n)} = o(x)}$ for each fixed non-zero ${h}$.

As a sample application, Proposition 1 easily gives a proof of the asymptotic

$\displaystyle \sum_{n \leq x} \mu(n) e^{2\pi i \alpha n} = o(x)$

for any irrational ${\alpha}$. (For rational ${\alpha}$, this is a little trickier, as it is basically equivalent to the prime number theorem in arithmetic progressions.) The paper of Bourgain, Sarnak, and Ziegler also apply this criterion to nilsequences (obtaining a quick proof of a qualitative version of a result of Ben Green and myself, see these notes of Ziegler for details) and to horocycle flows (for which no Möbius orthogonality result was previously known).

Informally, the connection between (5) and (6) comes from the multiplicative nature of the Möbius function. If (6) failed, then ${\mu(n)}$ exhibits strong correlation with ${f(n)}$; by change of variables, we then expect ${\mu(pn)}$ to correlate with ${f(pn)}$ and ${\mu(pm)}$ to correlate with ${f(qn)}$, for “typical” ${p,q}$ at least. On the other hand, since ${\mu}$ is multiplicative, ${\mu(pn)}$ exhibits strong correlation with ${\mu(qn)}$. Putting all this together (and pretending correlation is transitive), this would give the claim (in the contrapositive). Of course, correlation is not quite transitive, but it turns out that one can use the Cauchy-Schwarz inequality as a substitute for transitivity of correlation in this case.

I will give a proof of Proposition 1 below the fold (which is not quite based on the argument in the above mentioned paper, but on a variant of that argument communicated to me by Tamar Ziegler, and also independently discovered by Adam Harper). The main idea is to exploit the following observation: if ${P}$ is a “large” but finite set of primes (in the sense that the sum ${A := \sum_{p \in P} \frac{1}{p}}$ is large), then for a typical large number ${n}$ (much larger than the elements of ${P}$), the number of primes in ${P}$ that divide ${n}$ is pretty close to ${A = \sum_{p \in P} \frac{1}{p}}$:

$\displaystyle \sum_{p \in P: p|n} 1 \approx A. \ \ \ \ \ (7)$

A more precise formalisation of this heuristic is provided by the Turan-Kubilius inequality, which is proven by a simple application of the second moment method.

In particular, one can sum (7) against ${\mu(n) f(n)}$ and obtain an approximation

$\displaystyle \sum_{n \leq x} \mu(n) f(n) \approx \frac{1}{A} \sum_{p \in P} \sum_{n \leq x: p|n} \mu(n) f(n)$

that approximates a sum of ${\mu(n) f(n)}$ by a bunch of sparser sums of ${\mu(n) f(n)}$. Since

$\displaystyle x = \frac{1}{A} \sum_{p \in P} \frac{x}{p},$

we see (heuristically, at least) that in order to establish (4), it would suffice to establish the sparser estimates

$\displaystyle \sum_{n \leq x: p|n} \mu(n) f(n) = o(\frac{x}{p})$

for all ${p \in P}$ (or at least for “most” ${p \in P}$).

Now we make the change of variables ${n = pm}$. As the Möbius function is multiplicative, we usually have ${\mu(n) = \mu(p) \mu(m) = - \mu(m)}$. (There is an exception when ${n}$ is divisible by ${p^2}$, but this will be a rare event and we will be able to ignore it.) So it should suffice to show that

$\displaystyle \sum_{m \leq x/p} \mu(m) f(pm) = o( x/p )$

for most ${p \in P}$. However, by the hypothesis (5), the sequences ${m \mapsto f(pm)}$ are asymptotically orthogonal as ${p}$ varies, and this claim will then follow from a Cauchy-Schwarz argument.

I’ve just uploaded to the arXiv the paper A remark on partial sums involving the Möbius function, submitted to Bull. Aust. Math. Soc..

The Möbius function ${\mu(n)}$ is defined to equal ${(-1)^k}$ when ${n}$ is the product of ${k}$ distinct primes, and equal to zero otherwise; it is closely connected to the distribution of the primes. In 1906, Landau observed that one could show using purely elementary means that the prime number theorem

$\displaystyle \sum_{p \leq x} 1 = (1+o(1)) \frac{x}{\log x} \ \ \ \ \ (1)$

(where ${o(1)}$ denotes a quantity that goes to zero as ${x \rightarrow \infty}$) was logically equivalent to the partial sum estimates

$\displaystyle \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (2)$

and

$\displaystyle \sum_{n \leq x} \frac{\mu(n)}{n} = o(1); \ \ \ \ \ (3)$

we give a sketch of the proof of these equivalences below the fold.

On the other hand, these three inequalities are all easy to prove if the ${o()}$ terms are replaced by their ${O()}$ counterparts. For instance, by observing that the binomial coefficient ${\binom{2n}{n}}$ is bounded by ${4^n}$ on the one hand (by Pascal’s triangle or the binomial theorem), and is divisible by every prime between ${n}$ and ${2n}$ on the other hand, we conclude that

$\displaystyle \sum_{n < p \leq 2n} \log p \leq n \log 4$

from which it is not difficult to show that

$\displaystyle \sum_{p \leq x} 1 = O( \frac{x}{\log x} ). \ \ \ \ \ (4)$

Also, since ${|\mu(n)| \leq 1}$, we clearly have

$\displaystyle |\sum_{n \leq x} \mu(n)| \leq x.$

Finally, one can also show that

$\displaystyle |\sum_{n \leq x} \frac{\mu(n)}{n}| \leq 1. \ \ \ \ \ (5)$

Indeed, assuming without loss of generality that ${x}$ is a positive integer, and summing the inversion formula ${1_{n=1} = \sum_{d|n} \mu(d)}$ over all ${n \leq x}$ one sees that

$\displaystyle 1 = \sum_{d \leq x} \mu(d) \left\lfloor \frac{x}{d}\right \rfloor = \sum_{d \leq x} \mu(d) \frac{x}{d} - \sum_{d

and the claim follows by bounding ${|\mu(d) \left \{ \frac{x}{d} \right\}|}$ by ${1}$.

In this paper I extend these observations to more general multiplicative subsemigroups of the natural numbers. More precisely, if ${P}$ is any set of primes (finite or infinite), I show that

$\displaystyle |\sum_{n \in \langle P \rangle: n \leq x} \frac{\mu(n)}{n}| \leq 1 \ \ \ \ \ (7)$

and that

$\displaystyle \sum_{n \in \langle P \rangle: n \leq x} \frac{\mu(n)}{n} = \prod_{p \in P} (1-\frac{1}{p}) + o(1), \ \ \ \ \ (8)$

where ${\langle P \rangle}$ is the multiplicative semigroup generated by ${P}$, i.e. the set of natural numbers whose prime factors lie in ${P}$.

Actually the methods are completely elementary (the paper is just six pages long), and I can give the proof of (7) in full here. Again we may take ${x}$ to be a positive integer. Clearly we may assume that

$\displaystyle \sum_{n \in \langle P \rangle: n \leq x} \frac{1}{n} > 1, \ \ \ \ \ (9)$

as the claim is trivial otherwise.

If ${P'}$ denotes the primes that are not in ${P}$, then Möbius inversion gives us

$\displaystyle 1_{n \in \langle P' \rangle} = \sum_{d|n; d \in \langle P \rangle} \mu(d).$

Summing this for ${1 \leq n \leq x}$ gives

$\displaystyle \sum_{n \in \langle P' \rangle: n \leq x} 1 = \sum_{d \in \langle P \rangle: d \leq x} \mu(d) \frac{x}{d} - \sum_{d \in \langle P \rangle: d \leq x} \mu(d) \left \{ \frac{x}{d} \right \}.$

We can bound ${|\mu(d) \left \{ \frac{x}{d} \right \}| \leq 1 - \frac{1}{d}}$ and so

$\displaystyle |\sum_{d \in \langle P \rangle: d \leq x} \mu(d) \frac{x}{d}| \leq \sum_{n \in \langle P' \rangle: n \leq x} 1 + \sum_{n \in \langle P \rangle: n \leq x} 1 - \sum_{n \in \langle P \rangle: n \leq x} \frac{1}{n}.$

The claim now follows from (9), since ${\langle P \rangle}$ and ${\langle P' \rangle}$ overlap only at ${1}$.

As special cases of (7) we see that

$\displaystyle |\sum_{d \leq x: d|m} \frac{\mu(d)}{d}| \leq 1$

and

$\displaystyle |\sum_{n \leq x: (m,n)=1} \frac{\mu(n)}{n}| \leq 1$

for all ${m,x}$. Since ${\mu(mn) = \mu(m) \mu(n) 1_{(m,n)=1}}$, we also have

$\displaystyle |\sum_{n \leq x} \frac{\mu(mn)}{n}| \leq 1.$

One might hope that these inequalities (which gain a factor of ${\log x}$ over the trivial bound) might be useful when performing effective sieve theory, or effective estimates on various sums involving the primes or arithmetic functions.

This inequality (7) is so simple to state and prove that I must think that it was known to, say, Landau or Chebyshev, but I can’t find any reference to it in the literature. [Update, Sep 4: I have learned that similar results have been obtained in a paper by Granville and Soundararajan, and have updated the paper appropriately.] The proof of (8) is a simple variant of that used to prove (7) but I will not detail it here.

Curiously, this is one place in number theory where the elementary methods seem superior to the analytic ones; there is a zeta function ${\zeta_P(s) = \sum_{n \in \langle P \rangle} \frac{1}{n^s} = \prod_{p \in P} (1-\frac{1}{p^s})^{-1}}$ associated to this problem, but it need not have a meromorphic continuation beyond the region ${\{ \hbox{Re}(s) > 1 \}}$, and it turns out to be remarkably difficult to use this function to establish the above results. (I do have a proof of this form, which I in fact found before I stumbled on the elementary proof, but it is far longer and messier.)

Ben Green and I have just uploaded to the arXiv our paper, “The Möbius function is asymptotically orthogonal to nilsequences“, which is a sequel to our earlier paper “The quantitative behaviour of polynomial orbits on nilmanifolds“, which I talked about in this post.  In this paper, we apply our previous results on quantitative equidistribution polynomial orbits in nilmanifolds to settle the Möbius and nilsequences conjecture from our earlier paper, as part of our program to detect and count solutions to linear equations in primes.  (The other major plank of that program, namely the inverse conjecture for the Gowers norm, remains partially unresolved at present.)  Roughly speaking, this conjecture asserts the asymptotic orthogonality

$\displaystyle |\frac{1}{N} \sum_{n=1}^N \mu(n) f(n)| \ll_A \log^{-A} N$ (1)

between the Möbius function $\mu(n)$ and any Lipschitz nilsequence f(n), by which we mean a sequence of the form $f(n) = F(g^n x)$ for some orbit $g^n x$ in a nilmanifold $G/\Gamma$, and some Lipschitz function $F: G/\Gamma \to {\Bbb C}$ on that nilmanifold.  (The implied constant can depend on the nilmanifold and on the Lipschitz constant of F, but it is important that it be independent of the generator g of the orbit or the base point x.)  The case when f is constant is essentially the prime number theorem; the case when f is periodic is essentially the prime number theorem in arithmetic progressions.  The case when f is almost periodic (e.g. $f(n) = e^{2\pi i \alpha n}$ for some irrational $\alpha$) was established by Davenport, using the method of Vinogradov.  The case when f was a 2-step nilsequence (such as the quadratic phase $f(n) = e^{2\pi i \alpha n^2}$; bracket quadratic phases such as $f(n) = e^{2\pi \lfloor \alpha n \rfloor \beta n}$ can also be covered by an approximation argument, though the logarithmic decay in (1) is weakened as a consequence) was done by Ben and myself a few years ago, by a rather ad hoc adaptation of Vinogradov’s method.  By using the equidistribution theory of nilmanifolds, we were able to apply Vinogradov’s method more systematically, and in fact the proof is relatively short (20 pages), although it relies on the 64-page predecessor paper on equidistribution.  I’ll talk a little bit more about the proof after the fold.

There is an amusing way to interpret the conjecture (using the close relationship between nilsequences and bracket polynomials) as an assertion of the pseudorandomness of the Liouville function from a computational complexity perspective.    Suppose you possess a calculator with the wonderful property of being infinite precision: it can accept arbitrarily large real numbers as input, manipulate them precisely, and also store them in memory.  However, this calculator has two limitations.  Firstly, the only operations available are addition, subtraction, multiplication, integer part $x \mapsto [x]$, fractional part $x \mapsto \{x\}$, memory store (into one of O(1) registers), and memory recall (from one of these O(1) registers).  In particular, there is no ability to perform division.  Secondly, the calculator only has a finite display screen, and when it shows a real number, it only shows O(1) digits before and after the decimal point.  (Thus, for instance, the real number 1234.56789 might be displayed only as $\ldots 34.56\ldots$.)

Now suppose you play the following game with an opponent.

1. The opponent specifies a large integer d.
2. You get to enter in O(1) real constants of your choice into your calculator.  These can be absolute constants such as $\sqrt{2}$ and $\pi$, or they can depend on d (e.g. you can enter in $10^{-d}$).
3. The opponent randomly selects an d-digit integer n, and enters n into one of the registers of your calculator.
4. You are allowed to perform O(1) operations on your calculator and record what is displayed on the calculator’s viewscreen.
5. After this, you have to guess whether the opponent’s number n had an odd or even number of prime factors (i.e. you guess $\lambda(n)$.)
6. If you guess correctly, you win $1; otherwise, you lose$1.

For instance, using your calculator you can work out the first few digits of $\{ \sqrt{2}n \lfloor \sqrt{3} n \rfloor \}$, provided of course that you entered the constants $\sqrt{2}$ and $\sqrt{3}$ in advance.  You can also work out the leading digits of n by storing $10^{-d}$ in advance, and computing the first few digits of $10^{-d} n$.

Our theorem is equivalent to the assertion that as d goes to infinity (keeping the O(1) constants fixed), your probability of winning this game converges to 1/2; in other words, your calculator becomes asymptotically useless to you for the purposes of guessing whether n has an odd or even number of prime factors, and you may as well just guess randomly.

[I should mention a recent result in a similar spirit by Mauduit and Rivat; in this language, their result asserts that knowing the last few digits of the digit-sum of n does not increase your odds of guessing $\lambda(n)$ correctly.]