Let {S} be a non-empty finite set. If {X} is a random variable taking values in {S}, the Shannon entropy {H[X]} of {X} is defined as

\displaystyle H[X] = -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let {f: S \rightarrow {\bf R}} be a function. Then

\displaystyle  \log \sum_{s \in S} \exp(f(s)) = \sup_X {\bf E} f(X) + {\bf H}[X]. \ \ \ \ \ (1)

Proof: Note that shifting {f} by a constant affects both sides of (1) the same way, so we may normalize {\sum_{s \in S} \exp(f(s)) = 1}. Then {\exp(f(s))} is now the probability distribution of some random variable {Y}, and the claim can be rewritten as

\displaystyle  0 = \sup_X \sum_{s \in S} {\bf P}[X = s] \log {\bf P}[Y = s] -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].

But this is precisely the Gibbs inequality. (The expression inside the supremum can also be written as {-D_{KL}(X||Y)}, where {D_{KL}} denotes Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions {x \mapsto e^x} and {y \mapsto y \log y - y}.) \Box
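
In particular, the supremum in (1) is attained when {X} is distributed according to the Gibbs distribution {{\bf P}[X = s] := \exp(f(s)) / \sum_{s' \in S} \exp(f(s'))}; indeed, for this choice of {X} one has

\displaystyle  {\bf E} f(X) + {\bf H}[X] = \sum_{s \in S} {\bf P}[X = s] \left( f(s) - \log \frac{\exp(f(s))}{\sum_{s' \in S} \exp(f(s'))} \right) = \log \sum_{s \in S} \exp(f(s)).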

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.

Theorem 2 (Generalized Cauchy-Schwarz inequality) Let {n \geq 0}, let {S, T_1,\dots,T_n} be finite non-empty sets, and let {\pi_i: S \rightarrow T_i} be functions for each {i=1,\dots,n}. Let {K: S \rightarrow {\bf R}^+} and {f_i: T_i \rightarrow {\bf R}^+} be positive functions for each {i=1,\dots,n}. Then

\displaystyle  \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n (\sum_{t_i \in T_i} f_i(t_i)^{n+1})^{1/(n+1)}

where {Q} is the quantity

\displaystyle  Q := (\sum_{(s_0,\dots,s_n) \in \Omega_n} K(s_0) \dots K(s_n))^{1/(n+1)}

where {\Omega_n} is the set of all tuples {(s_0,\dots,s_n) \in S^{n+1}} such that {\pi_i(s_{i-1}) = \pi_i(s_i)} for {i=1,\dots,n}.

Thus for instance, the inequality is trivially an identity for {n=0}. When {n=1}, the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) \leq (\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}

\displaystyle  ( \sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2},

which is easily proven by Cauchy-Schwarz, while for {n=2} the inequality reads

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s))

\displaystyle  \leq (\sum_{s_0,s_1, s_2 \in S: \pi_1(s_0)=\pi_1(s_1); \pi_2(s_1)=\pi_2(s_2)} K(s_0) K(s_1) K(s_2))^{1/3}

\displaystyle (\sum_{t_1 \in T_1} f_1(t_1)^3)^{1/3} (\sum_{t_2 \in T_2} f_2(t_2)^3)^{1/3}

which can also be proven by elementary means. However, even for {n=3}, the existing proofs require the “tensor power trick” in order to reduce to the case when the {f_i} are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).
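
(For the record, the {n=1} case can be verified by writing the left-hand side as a sum over {t_1 \in T_1} and applying Cauchy-Schwarz:

\displaystyle  \sum_{s \in S} K(s) f_1(\pi_1(s)) = \sum_{t_1 \in T_1} f_1(t_1) \sum_{s \in S: \pi_1(s) = t_1} K(s) \leq (\sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2} (\sum_{t_1 \in T_1} (\sum_{s \in S: \pi_1(s) = t_1} K(s))^2)^{1/2},

and expanding out the square shows that the final factor is precisely {(\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}}.)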

We now prove this inequality. We write {K(s) = \exp(k(s))} and {f_i(t_i) = \exp(g_i(t_i))} for some functions {k: S \rightarrow {\bf R}} and {g_i: T_i \rightarrow {\bf R}}. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

\displaystyle  \sup_X {\bf E} k(X) + \sum_{i=1}^n g_i(\pi_i(X)) + {\bf H}[X]

\displaystyle  \leq \frac{1}{n+1} \sup_{(X_0,\dots,X_n)} {\bf E} k(X_0)+\dots+k(X_n) + {\bf H}[X_0,\dots,X_n]

\displaystyle  + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} (n+1) {\bf E} g_i(Y_i) + {\bf H}[Y_i]

where {X} ranges over random variables taking values in {S}, {(X_0,\dots,X_n)} ranges over tuples of random variables taking values in {\Omega_n}, and each {Y_i} ranges over random variables taking values in {T_i}. Comparing the suprema, the claim now reduces to

Lemma 3 (Conditional expectation computation) Let {X} be an {S}-valued random variable. Then there exists an {\Omega_n}-valued random variable {(X_0,\dots,X_n)} such that each {X_i} has the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_n] = (n+1) {\bf H}[X]

\displaystyle - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_n(X)].
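
(Assuming this lemma for the moment, the required comparison of suprema follows: given {X}, take the tuple {(X_0,\dots,X_n)} provided by the lemma and set {Y_i := \pi_i(X)}; since each {X_j} has the same distribution as {X}, the right-hand side of the previous inequality is at least

\displaystyle  {\bf E} k(X) + {\bf H}[X] - \frac{1}{n+1} \sum_{i=1}^n {\bf H}[\pi_i(X)] + \sum_{i=1}^n {\bf E} g_i(\pi_i(X)) + \frac{1}{n+1} \sum_{i=1}^n {\bf H}[\pi_i(X)],

which is exactly the expression inside the supremum on the left-hand side.)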

Proof: We induct on {n}. When {n=0} we just take {X_0 = X}. Now suppose that {n \geq 1}, and the claim has already been proven for {n-1}, thus one has already obtained a tuple {(X_0,\dots,X_{n-1}) \in \Omega_{n-1}} with each {X_0,\dots,X_{n-1}} having the same distribution as {X}, and

\displaystyle  {\bf H}[X_0,\dots,X_{n-1}] = n {\bf H}[X] - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_{n-1}(X)].

By hypothesis, {\pi_n(X_{n-1})} has the same distribution as {\pi_n(X)}. For each value {t_n} attained by {\pi_n(X)}, we can take conditionally independent copies of {(X_0,\dots,X_{n-1})} and {X} conditioned to the events {\pi_n(X_{n-1}) = t_n} and {\pi_n(X) = t_n} respectively, and then concatenate them to form a tuple {(X_0,\dots,X_n)} in {\Omega_n}, with {X_n} a further copy of {X} that is conditionally independent of {(X_0,\dots,X_{n-1})} relative to {\pi_n(X_n) = \pi_n(X_{n-1})}. One can then use the entropy chain rule to compute

\displaystyle  {\bf H}[X_0,\dots,X_n] = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_n)] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_{n-1})] + {\bf H}[X_n| \pi_n(X_n)]

\displaystyle  = {\bf H}[\pi_n(X)] + ({\bf H}[X_0,\dots,X_{n-1}] - {\bf H}[\pi_n(X_{n-1})])

\displaystyle + ({\bf H}[X_n] - {\bf H}[\pi_n(X_n)])

\displaystyle  ={\bf H}[X_0,\dots,X_{n-1}] + {\bf H}[X_n] - {\bf H}[\pi_n(X_n)]

and the claim now follows from the induction hypothesis. \Box

With a little more effort, one can replace {S} by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.

In my previous post, I walked through the task of formally deducing one lemma from another in Lean 4. The deduction was deliberately chosen to be short and only showcased a small number of Lean tactics. Here I would like to walk through the process I used for a slightly longer proof I worked out recently, after seeing the following challenge from Damek Davis: to formalize (in a civilized fashion) the proof of the following lemma:

Lemma. Let \{a_k\} and \{D_k\} be sequences of real numbers indexed by natural numbers k=0,1,\dots, with a_k non-increasing and D_k non-negative. Suppose also that a_k \leq D_k - D_{k+1} for all k \geq 0. Then a_k \leq \frac{D_0}{k+1} for all k.

Here I tried to draw upon the lessons I had learned from the PFR formalization project, and to first set up a human readable proof of the lemma before starting the Lean formalization – a lower-case “blueprint” rather than the fancier Blueprint used in the PFR project. The main idea of the proof here is to use the telescoping series identity

\displaystyle \sum_{i=0}^k D_i - D_{i+1} = D_0 - D_{k+1}.

Since D_{k+1} is non-negative, and a_i \leq D_i - D_{i+1} by hypothesis, we have

\displaystyle \sum_{i=0}^k a_i \leq D_0

but by the monotone hypothesis on a_i the left-hand side is at least (k+1) a_k, giving the claim.

This is already a human-readable proof, but in order to formalize it more easily in Lean, I decided to rewrite it as a chain of inequalities, starting at a_k and ending at D_0 / (k+1). With a little bit of pen and paper effort, I obtained

a_k = (k+1) a_k / (k+1)

(by field identities)

= (\sum_{i=0}^k a_k) / (k+1)

(by the formula for summing a constant)

\leq (\sum_{i=0}^k a_i) / (k+1)

(by the monotone hypothesis)

\leq (\sum_{i=0}^k D_i - D_{i+1}) / (k+1)

(by the hypothesis a_i \leq D_i - D_{i+1})

= (D_0 - D_{k+1}) / (k+1)

(by telescoping series)

\leq D_0 / (k+1)

(by the non-negativity of D_{k+1}).

I decided that this was a good enough blueprint for me to work with. The next step is to formalize the statement of the lemma in Lean. For this quick project, it was convenient to use the online Lean playground, rather than my local IDE, so the screenshots will look a little different from those in the previous post. (If you like, you can follow this tour in that playground, by clicking on the screenshots of the Lean code.) I start by importing Lean’s math library, and starting an example of a statement to state and prove:

Now we have to declare the hypotheses and variables. The main variables here are the sequences a_k and D_k, which in Lean are best modeled by functions a, D from the natural numbers ℕ to the reals ℝ. (One can choose to “hardwire” the non-negativity hypothesis into the D_k by making D take values in the nonnegative reals {\bf R}^+ (denoted NNReal in Lean), but this turns out to be inconvenient, because the laws of algebra and summation that we will need are clunkier on the non-negative reals (which are not even a group) than on the reals (which are a field).) So we add in the variables:

Now we add in the hypotheses, which in Lean convention are usually given names starting with h. This is fairly straightforward; the one thing to note is that the property of being monotone decreasing already has a name in Lean’s Mathlib, namely Antitone, and it is generally a good idea to use the Mathlib-provided terminology (because that library contains a lot of useful lemmas about such terms).

One thing to note here is that Lean is quite good at filling in implied ranges of variables. Because a and D have the natural numbers ℕ as their domain, the dummy variable k in these hypotheses is automatically being quantified over ℕ. We could have made this quantification explicit if we so chose, for instance using ∀ k : ℕ, 0 ≤ D k instead of ∀ k, 0 ≤ D k, but it is not necessary to do so. Also note that Lean does not require parentheses when applying functions: we write D k here rather than D(k) (which in fact does not compile in Lean unless one puts a space between the D and the parentheses). This is slightly different from standard mathematical notation, but is not too difficult to get used to.

This looks like the end of the hypotheses, so we could now add a colon to move to the conclusion, and then add that conclusion:

This is a perfectly fine Lean statement. But it turns out that when proving a universally quantified statement such as ∀ k, a k ≤ D 0 / (k + 1), the first step is almost always to open up the quantifier to introduce the variable k (using the Lean command intro k). Because of this, it is slightly more efficient to hide the universal quantifier by placing the variable k in the hypotheses, rather than in the quantifier (in which case we have to now specify that it is a natural number, as Lean can no longer deduce this from context):

At this point Lean is complaining of an unexpected end of input: the example has been stated, but not proved. We will temporarily mollify Lean by adding a sorry as the purported proof:
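
(In text form, the example at this stage looks roughly like the following; the hypothesis names ha, hpos and hD are the ones referred to later in this tour, though the exact formatting in the screenshot may differ slightly.)

    import Mathlib

    example (a D : ℕ → ℝ) (ha : Antitone a) (hpos : ∀ k, 0 ≤ D k)
        (hD : ∀ k, a k ≤ D k - D (k + 1)) (k : ℕ) :
        a k ≤ D 0 / (k + 1) := by
      sorry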

Now Lean is content, other than giving a warning (as indicated by the yellow squiggle under the example) that the proof contains a sorry.

It is now time to follow the blueprint. The Lean tactic for proving an inequality via chains of other inequalities is known as calc. We use the blueprint to fill in the calc that we want, leaving the justifications of each step as “sorry”s for now:
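
(In rough text form, and using the Mathlib conventions of the time (in particular the ∑ i in range (k+1), a i syntax and the BigOperators namespace), the file now looks something like the following, with one sorry per step of the blueprint.)

    import Mathlib

    open Finset BigOperators

    example (a D : ℕ → ℝ) (ha : Antitone a) (hpos : ∀ k, 0 ≤ D k)
        (hD : ∀ k, a k ≤ D k - D (k + 1)) (k : ℕ) :
        a k ≤ D 0 / (k + 1) := by
      calc a k = (k + 1) * a k / (k + 1) := by sorry
        _ = (∑ i in range (k + 1), a k) / (k + 1) := by sorry
        _ ≤ (∑ i in range (k + 1), a i) / (k + 1) := by sorry
        _ ≤ (∑ i in range (k + 1), (D i - D (i + 1))) / (k + 1) := by sorry
        _ = (D 0 - D (k + 1)) / (k + 1) := by sorry
        _ ≤ D 0 / (k + 1) := by sorry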

Here, we “open”ed the Finset namespace in order to easily access Finset’s range function, with range k basically being the finite set of natural numbers \{0,\dots,k-1\}, and also “open”ed the BigOperators namespace to access the familiar ∑ notation for (finite) summation, in order to make the steps in the Lean code resemble the blueprint as much as possible. One could avoid opening these namespaces, but then expressions such as ∑ i in range (k+1), a i would instead have to be written as something like Finset.sum (Finset.range (k+1)) (fun i ↦ a i), which looks a lot less like standard mathematical writing. The proof structure here may remind some readers of the “two column proofs” that are somewhat popular in American high school geometry classes.

Now we have six sorries to fill. Navigating to the first sorry, Lean tells us the ambient hypotheses, and the goal that we need to prove to fill that sorry:

The ⊢ symbol here is Lean’s marker for the goal. The uparrows ↑ are coercion symbols, indicating that the natural number k has to be converted to a real number in order to interact via arithmetic operations with other real numbers such as a k, but we can ignore these coercions for this tour (for this proof, it turns out Lean will basically manage them automatically without need for any explicit intervention by a human).

The goal here is a self-evident algebraic identity; it involves division, so one has to check that the denominator is non-zero, but this is self-evident. In Lean, a convenient way to establish algebraic identities is to use the tactic field_simp to clear denominators, and then ring to verify any identity that is valid for commutative rings. This works, and clears the first sorry:

field_simp, by the way, is smart enough to deduce on its own that the denominator k+1 here is manifestly non-zero (and in fact positive); no human intervention is required to point this out. Similarly for other “clearing denominator” steps that we will encounter in the other parts of the proof.

Now we navigate to the next `sorry`. Lean tells us the hypotheses and goals:

We can reduce the goal by canceling out the common denominator ↑k+1. Here we can use the handy Lean tactic congr, which tries to match two sides of an equality goal as much as possible, and leave any remaining discrepancies between the two sides as further goals to be proven. Applying congr, the goal reduces to

Here one might imagine that this is something that one can prove by induction. But this particular sort of identity – summing a constant over a finite set – is already covered by Mathlib. Indeed, searching for Finset, sum, and const soon leads us to the Finset.sum_const lemma here. But there is an even more convenient path to take here, which is to apply the powerful tactic simp, which tries to simplify the goal as much as possible using all the “simp lemmas” Mathlib has to offer (of which Finset.sum_const is an example, but there are thousands of others). As it turns out, simp completely kills off this identity, without any further human intervention:

Now we move on to the next sorry, and look at our goal:

congr doesn’t work here because we have an inequality instead of an equality, but there is a powerful relative gcongr of congr that is perfectly suited for inequalities. It can also open up sums, products, and integrals, reducing global inequalities between such quantities into pointwise inequalities. If we invoke gcongr with i hi (where we tell gcongr to use i for the variable opened up, and hi for the constraint this variable will satisfy), we arrive at a greatly simplified goal (and a new ambient variable and hypothesis):

Now we need to use the monotonicity hypothesis on a, which we have named ha here. Looking at the documentation for Antitone, one finds a lemma that looks applicable here:

One can apply this lemma in this case by writing apply Antitone.imp ha, but because ha is already of type Antitone, we can abbreviate this to apply ha.imp. (Actually, as indicated in the documentation, due to the way Antitone is defined, we can even just use apply ha here.) This reduces the goal nicely:

The goal is now very close to the hypothesis hi. One could now look up the documentation for Finset.range to see how to unpack hi, but as before simp can do this for us. Invoking simp at hi, we obtain

Now the goal and hypothesis are very close indeed. Here we can just close the goal using the linarith tactic used in the previous tour:
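
(Assembled as a standalone example, with the same import Mathlib and open Finset BigOperators preamble as above, the justification of this step of the calc reads roughly as follows.)

    example (a : ℕ → ℝ) (ha : Antitone a) (k : ℕ) :
        (∑ i in range (k + 1), a k) / (k + 1) ≤ (∑ i in range (k + 1), a i) / (k + 1) := by
      gcongr with i hi
      apply ha
      simp at hi
      linarith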

The next sorry can be resolved by similar methods, using the hypothesis hD applied at the variable i:
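
(As a standalone example under the same preamble, this step looks roughly like the following.)

    example (a D : ℕ → ℝ) (hD : ∀ k, a k ≤ D k - D (k + 1)) (k : ℕ) :
        (∑ i in range (k + 1), a i) / (k + 1) ≤ (∑ i in range (k + 1), (D i - D (i + 1))) / (k + 1) := by
      gcongr with i hi
      exact hD i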

Now for the penultimate sorry. As in a previous step, we can use congr to remove the denominator, leaving us in this state:

This is a telescoping series identity. One could try to prove it by induction, or one could try to see if this identity is already in Mathlib. Searching for Finset, sum, and sub will locate the right tool (as the fifth hit), but a simpler way to proceed here is to use the exact? tactic we saw in the previous tour:

A brief check of the documentation for sum_range_sub' confirms that this is what we want. Actually we can just use apply sum_range_sub' here, as the apply tactic is smart enough to fill in the missing arguments:

One last sorry to go. As before, we use gcongr to cancel denominators, leaving us with

This looks easy, because the hypothesis hpos will tell us that D (k+1) is nonnegative; specifically, the instance hpos (k+1) of that hypothesis will state exactly this. The linarith tactic will then resolve this goal once it is told about this particular instance:
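
(In standalone form, this final step can be discharged roughly as follows.)

    example (D : ℕ → ℝ) (hpos : ∀ k, 0 ≤ D k) (k : ℕ) :
        (D 0 - D (k + 1)) / (k + 1) ≤ D 0 / (k + 1) := by
      gcongr
      linarith [hpos (k + 1)]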

We now have a complete proof – no more yellow squiggly line in the example. There are two warnings though – there are two variables i and hi introduced in the proof that Lean’s “linter” has noticed are not actually used in the proof. So we can rename them with underscores to tell Lean that we are okay with them not being used:

This is a perfectly fine proof, but upon noticing that many of the steps are similar to each other, one can do a bit of “code golf” as in the previous tour to compactify the proof a bit:

With enough familiarity with the Lean language, this proof actually tracks quite closely with (an optimized version of) the human blueprint.

This concludes the tour of a lengthier Lean proving exercise. I am finding the pre-planning step of the proof (using an informal “blueprint” to break the proof down into extremely granular pieces) to make the formalization process significantly easier than in the past (when I often adopted a sequential process of writing one line of code at a time without first sketching out a skeleton of the argument). (The proof here took only about 15 minutes to create initially, although for this blog post I had to recreate it with screenshots and supporting links, which took significantly more time.) I believe that a realistic near-term goal for AI is to be able to fill in automatically a significant fraction of the sorts of atomic “sorry“s of the size one saw in this proof, allowing one to convert a blueprint to a formal Lean proof even more rapidly.

One final remark: in this tour I filled in the “sorry”s in the order in which they appeared, but there is actually no requirement that one does this, and once one has used a blueprint to atomize a proof into self-contained smaller pieces, one can fill them in in any order. Importantly for a group project, these micro-tasks can be parallelized, with different contributors claiming whichever “sorry” they feel they are qualified to solve, and working independently of each other. (And, because Lean can automatically verify if their proof is correct, there is no need to have a pre-existing bond of trust with these contributors in order to accept their contributions.) Furthermore, because the specification of a “sorry” is localized and self-contained, someone can make a meaningful contribution to the proof by working on an extremely localized component of it without needing the mathematical expertise to understand the global argument. This is not particularly important in this simple case, where the entire lemma is not too hard for a trained mathematician to understand, but can become quite relevant for complex formalization projects.

Since the release of my preprint with Tim, Ben, and Freddie proving the Polynomial Freiman-Ruzsa (PFR) conjecture over {\mathbb F}_2, I (together with Yael Dillies and Bhavik Mehta) have started a collaborative project to formalize this argument in the proof assistant language Lean4. It has been less than a week since the project was launched, but it is proceeding quite well, with a significant fraction of the paper already either fully or partially formalized. The project has been greatly assisted by the Blueprint tool of Patrick Massot, which allows one to write a human-readable “blueprint” of the proof that is linked to the Lean formalization; similar blueprints have been used for other projects, such as Scholze’s liquid tensor experiment. For the PFR project, the blueprint can be found here. One feature of the blueprint that I find particularly appealing is the dependency graph that is automatically generated from the blueprint, and can provide a rough snapshot of how far along the formalization has advanced. For PFR, the latest state of the dependency graph can be found here. At the current time of writing, the graph looks like this:

The color coding of the various bubbles (for lemmas) and rectangles (for definitions) is explained in the legend to the dependency graph, but roughly speaking the green bubbles/rectangles represent lemmas or definitions that have been fully formalized, and the blue ones represent lemmas or definitions which are ready to be formalized (their statements, but not proofs, have already been formalized, as well as the statements of all prerequisite lemmas and definitions). The goal is to get all the bubbles leading up to and including the “pfr” bubble at the bottom colored in green.

In this post I would like to give a quick “tour” of the project, to give a sense of how it operates. If one clicks on the “pfr” bubble at the bottom of the dependency graph, we get the following:

Here, Blueprint is displaying a human-readable form of the PFR statement. This is coming from the corresponding portion of the blueprint, which also comes with a human-readable proof of this statement that relies on other statements in the project:

(I have cropped out the second half of the proof here, as it is not relevant to the discussion.)

Observe that the “pfr” bubble is white, but has a green border. This means that the statement of PFR has been formalized in Lean, but not the proof; and the proof itself is not ready to be formalized, because some of the prerequisites (in particular, “entropy-pfr” (Theorem 6.16)) do not even have their statements formalized yet. If we click on the “Lean” link below the description of PFR in the dependency graph, we are led to the (auto-generated) Lean documentation for this assertion:

This is what a typical theorem in Lean looks like (after a procedure known as “pretty printing”). There are a number of hypotheses stated before the colon, for instance that G is a finite elementary abelian group of exponent 2 (this is how we have chosen to formalize the finite field vector spaces {\bf F}_2^n), and that A is a non-empty subset of G (the hypothesis that A is non-empty was not stated in the LaTeX version of the conjecture, but we realized it was necessary in the formalization, and will update the LaTeX blueprint shortly to reflect this) with the cardinality of A+A at most K times the cardinality of A. The statement after the colon is the conclusion: that A can be contained in the sum c+H of a subgroup H of G and a set c of cardinality at most 2K^{12}.

The astute reader may notice that the above theorem seems to be missing one or two details, for instance it does not explicitly assert that H is a subgroup. This is because the “pretty printing” suppresses some of the information in the actual statement of the theorem, which can be seen by clicking on the “Source” link:

Here we see that H is required to have the “type” of an additive subgroup of G. (Lean’s language revolves very strongly around types, but for this tour we will not go into detail into what a type is exactly.) The prominent “sorry” at the bottom of this theorem asserts that a proof is not yet provided for this theorem, but the intention of course is to replace this “sorry” with an actual proof eventually.

Filling in this “sorry” is too hard to do right now, so let’s look for a simpler task to accomplish instead. Here is a simple intermediate lemma “ruzsa-nonneg” that shows up in the proof:

The expression d[X; Y] refers to something called the entropic Ruzsa distance between X and Y, which is something that is defined elsewhere in the project, but for the current discussion it is not important to know its precise definition, other than that it is a real number. The bubble is blue with a green border, which means that the statement has been formalized, and the proof is ready to be formalized also. The blueprint dependency graph indicates that this lemma can be deduced from just one preceding lemma, called “ruzsa-diff“:

“ruzsa-diff” is also blue and bordered in green, so it has the same current status as “ruzsa-nonneg”: the statement is formalized, and the proof is ready to be formalized also, but the proof has not been written in Lean yet. The quantity H[X], by the way, refers to the Shannon entropy of X, defined elsewhere in the project, but for this discussion we do not need to know its definition, other than to know that it is a real number.

Looking at Lemma 3.11 and Lemma 3.13 it is clear how the former will imply the latter: the quantity |H[X] - H[Y]| is clearly non-negative! (There is a factor of 2 present in Lemma 3.11, but it can be easily canceled out.) So it should be an easy task to fill in the proof of Lemma 3.13 assuming Lemma 3.11, even if we still don’t know how to prove Lemma 3.11 yet. Let’s first look at the Lean code for each lemma. Lemma 3.11 is formalized as follows:

Again we have a “sorry” to indicate that this lemma does not currently have a proof. The Lean notation (as well as the name of the lemma) differs a little from the LaTeX version for technical reasons that we will not go into here. (Also, the variables X, \mu, Y, \mu' are introduced at an earlier stage in the Lean file; again, we will ignore this point for the ensuing discussion.) Meanwhile, Lemma 3.13 is currently formalized as

OK, I’m now going to try to fill in the latter “sorry”. In my local copy of the PFR github repository, I open up the relevant Lean file in my editor (Visual Studio Code, with the lean4 extension) and navigate to the “sorry” of “rdist_nonneg”. The accompanying “Lean infoview” then shows the current state of the Lean proof:

Here we see a number of ambient hypotheses (e.g., that G is an additive commutative group, that X is a map from \Omega to G, and so forth; many of these hypotheses are not actually relevant for this particular lemma), and at the bottom we see the goal we wish to prove.

OK, so now I’ll try to prove the claim. This is accomplished by applying a series of “tactics” to transform the goal and/or hypotheses. The first step I’ll do is to put in the factor of 2 that is needed to apply Lemma 3.11. This I will do with the “suffices” tactic, writing in the proof

I now have two goals (and two “sorries”): one to show that 0 \leq 2 d[X;Y] implies 0 \leq d[X;Y], and the other to show that 0 \leq 2 d[X;Y]. (The yellow squiggly underline indicates that this lemma has not been fully proven yet due to the presence of “sorry”s. The dot “.” is a syntactic marker that is useful to separate the two goals from each other, but you can ignore it for this tour.) The Lean tactic “suffices” corresponds, roughly speaking, to the phrase “It suffices to show that …” (or more precisely, “It suffices to show that … . To see this, … . It remains to verify the claim …”) in Mathematical English. For my own education, I wrote a “Lean phrasebook” of further correspondences between lines of Lean code and sentences or phrases in Mathematical English, which can be found here.

Let’s fill in the first “sorry”. The tactic state now looks like this (cropping out some irrelevant hypotheses):

Here I can use a handy tactic “linarith“, which solves any goal that can be derived by linear arithmetic from existing hypotheses:

This works, and now the tactic state reports no goals left to prove on this branch, so we move on to the remaining sorry, in which the goal is now to prove 0 \leq 2 d[X;Y]:

Here we will try to invoke Lemma 3.11. I add the following lines of code:

The Lean tactic “have” roughly corresponds to the Mathematical English phrase “We have the statement…” or “We claim the statement…”; like “suffices”, it splits a goal into two subgoals, though in the reversed order to “suffices”.

I again have two subgoals, one to prove the bound |H[X]-H[Y]| \leq 2 d[X;Y] (which I will call “h”), and then to deduce the previous goal 0 \leq 2 d[X;Y] from h. For the first, I know I should invoke the lemma “diff_ent_le_rdist” that is encoding Lemma 3.11. One way to do this is to try the tactic “exact?”, which will automatically search to see if the goal can already be deduced immediately from an existing lemma. It reports:

So I try this (by clicking on the suggested code, which automatically pastes it into the right location), and it works, leaving me with the final “sorry”:

The Lean tactic “exact” corresponds, roughly speaking, to the Mathematical English phrase “But this is exactly …”.

At this point I should mention that I also have the Github Copilot extension to Visual Studio Code installed. This is an AI which acts as an advanced autocomplete that can suggest possible lines of code as one types. In this case, it offered a suggestion which was almost correct (the second line is what we need, whereas the first is not necessary, and in fact does not even compile in Lean):

In any event, “exact?” worked in this case, so I can ignore the suggestion of Copilot this time (it has been very useful in other cases though). I apply the “exact?” tactic a second time and follow its suggestion to establish the matching bound 0 \leq |H[X] - H[Y]|:

(One can find documentation for the “abs_nonneg” method here. Copilot, by the way, was also able to resolve this step, albeit with a slightly different syntax; there are also several other search engines available to locate this method, such as Moogle. One of the main purposes of the Lean naming conventions for lemmas, by the way, is to facilitate the location of methods such as “abs_nonneg”, which is easier to figure out how to search for than a method named (say) “Lemma 1.2.1”.) To fill in the final “sorry”, I try “exact?” one last time, to figure out how to combine h and h' to give the desired goal, and it works!

Note that all the squiggly underlines have disappeared, indicating that Lean has accepted this as a valid proof. The documentation for “ge_trans” may be found here. The reader may observe that this method uses the \geq relation rather than the \leq relation, but in Lean the assertions X \geq Y and Y \leq X are “definitionally equal“, allowing tactics such as “exact” to use them interchangeably. “exact le_trans h’ h” would also have worked in this instance.

It is possible to compactify this proof quite a bit by cutting out several intermediate steps (a procedure sometimes known as “code golf“):

And now the proof is done! In the end, it was literally a “one-line proof”, which makes sense given how close Lemma 3.11 and Lemma 3.13 were to each other.
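
(The actual code relies on the project’s notation for entropic Ruzsa distance, so rather than reproduce it here, the following self-contained toy analogue, with placeholder names h₁, h₂, d, hle rather than the project’s actual declarations, illustrates the same suffices/have/exact skeleton, together with a golfed version in the spirit of the one-line proof.)

    import Mathlib

    -- Toy stand-in: `d` plays the role of d[X;Y], and `hle` plays the role of
    -- Lemma 3.11 (the bound |H[X] - H[Y]| ≤ 2 d[X;Y]); we deduce the analogue
    -- of Lemma 3.13, namely 0 ≤ d.
    example (h₁ h₂ d : ℝ) (hle : |h₁ - h₂| ≤ 2 * d) : 0 ≤ d := by
      suffices h2 : 0 ≤ 2 * d by linarith
      have h : |h₁ - h₂| ≤ 2 * d := hle
      have h' : 0 ≤ |h₁ - h₂| := abs_nonneg _
      exact ge_trans h h'

    -- The "code golf" version: literally a one-line proof.
    example (h₁ h₂ d : ℝ) (hle : |h₁ - h₂| ≤ 2 * d) : 0 ≤ d := by
      linarith [abs_nonneg (h₁ - h₂)]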

The current version of Blueprint does not automatically verify the proof (even though it does compile in Lean), so we have to manually update the blueprint as well. The LaTeX for Lemma 3.13 currently looks like this:

I add the “\leanok” macro to the proof, to flag that the proof has now been formalized:

I then push everything back up to the master Github repository. The blueprint will take quite some time (about half an hour) to rebuild, but eventually it does, and the dependency graph (which Blueprint has for some reason decided to rearrange a bit) now shows “ruzsa-nonneg” in green:

And so the formalization of PFR moves a little bit closer to completion. (Of course, this was a particularly easy lemma to formalize, that I chose to illustrate the process; one can imagine that most other lemmas will take a bit more work.) Note that while “ruzsa-nonneg” is now colored in green, we don’t yet have a full proof of this result, because the lemma “ruzsa-diff” that it relies on is not green. Nevertheless, the proof is locally complete at this point; hopefully at some point in the future, the predecessor results will also be locally proven, at which point this result will be completely proven. Note how this blueprint structure allows one to work on different parts of the proof asynchronously; it is not necessary to wait for earlier stages of the argument to be fully formalized to start working on later stages, although I anticipate a small amount of interaction between different components as we iron out any bugs or slight inaccuracies in the blueprint. (For instance, I am suspecting that we may need to add some measurability hypotheses on the random variables X, Y in the above two lemmas to make them completely true, but this is something that should emerge organically as the formalization process continues.)

That concludes the brief tour! If you are interested in learning more about the project, you can follow the Zulip chat stream; you can also download Lean and work on the PFR project yourself, using a local copy of the Github repository and sending pull requests to the master copy if you have managed to fill in one or more of the “sorry”s in the current version (but if you plan to work on anything more large scale than filling in a small lemma, it is good to announce your intention on the Zulip chat to avoid duplication of effort). (One key advantage of working with a project based around a proof assistant language such as Lean is that it makes large-scale mathematical collaboration possible without necessarily having a pre-established level of trust amongst the collaborators; my fellow repository maintainers and I have already approved several pull requests from contributors that we had not previously met, as the code was verified to be correct and we could see that it advanced the project. Conversely, as the above example should hopefully demonstrate, it is possible for a contributor to work on one small corner of the project without necessarily needing to understand all the mathematics that goes into the project as a whole.)

If you just want to experiment with Lean without going to the effort of downloading it, you can try playing the “Natural Number Game” for a gentle introduction to the language, or the Lean4 playground for an online Lean server. Further resources to learn Lean4 may be found here.

Tim Gowers, Ben Green, Freddie Manners, and I have just uploaded to the arXiv our paper “On a conjecture of Marton“. This paper establishes a version of the notorious Polynomial Freiman–Ruzsa conjecture (first proposed by Katalin Marton):

Theorem 1 (Polynomial Freiman–Ruzsa conjecture) Let {A \subset {\bf F}_2^n} be such that {|A+A| \leq K|A|}. Then {A} can be covered by at most {2K^{12}} translates of a subspace {H} of {{\bf F}_2^n} of cardinality at most {A}.

The previous best known result towards this conjecture was by Konyagin (as communicated in this paper of Sanders), who obtained a similar result but with {2K^{12}} replaced by {\exp(O_\varepsilon(\log^{3+\varepsilon} K))} for any {\varepsilon>0} (assuming that say {K \geq 3/2} to avoid some degeneracies as {K} approaches {1}, which is not the difficult case of the conjecture). The conjecture (with {12} replaced by an unspecified constant {C}) has a number of equivalent forms; see this survey of Green, and these papers of Lovett and of Green and myself for some examples; in particular, as discussed in the latter two references, the constants in the inverse {U^3({\bf F}_2^n)} theorem are now polynomial in nature (although we did not try to optimize the constant).

The exponent {12} here was the product of a large number of optimizations to the argument (our original exponent here was closer to {1000}), but can be improved even further with additional effort (our current argument, for instance, allows one to replace it with {7+\sqrt{17} = 11.123\dots}, but we decided to state our result using integer exponents instead).

In this paper we will focus exclusively on the characteristic {2} case (so we will be cavalier in identifying addition and subtraction), but in a followup paper we will establish similar results in other finite characteristics.

Much of the prior progress on this sort of result has proceeded via Fourier analysis. Perhaps surprisingly, our approach uses no Fourier analysis whatsoever, being conducted instead entirely in “physical space”. Broadly speaking, it follows a natural strategy, which is to induct on the doubling constant {|A+A|/|A|}. Indeed, suppose for instance that one could show that every set {A} of doubling constant {K} was “commensurate” in some sense to a set {A'} of doubling constant at most {K^{0.99}}. One measure of commensurability, for instance, might be the Ruzsa distance {\log \frac{|A+A'|}{|A|^{1/2} |A'|^{1/2}}}, which one might hope to control by {O(\log K)}. Then one could iterate this procedure until the doubling constant dropped below say {3/2}, at which point the conjecture is known to hold (there is an elementary argument that if {A} has doubling constant less than {3/2}, then {A+A} is in fact a subspace of {{\bf F}_2^n}). One can then use several applications of the Ruzsa triangle inequality

\displaystyle  \log \frac{|A+C|}{|A|^{1/2} |C|^{1/2}} \leq \log \frac{|A+B|}{|A|^{1/2} |B|^{1/2}} + \log \frac{|B+C|}{|B|^{1/2} |C|^{1/2}}

to conclude (the fact that we reduce {K} to {K^{0.99}} means that the various Ruzsa distances that need to be summed are controlled by a convergent geometric series).

There are a number of possible ways to try to “improve” a set {A} of not too large doubling by replacing it with a commensurate set of better doubling. We note two particular potential improvements:

  • (i) Replacing {A} with {A+A}. For instance, if {A} was a random subset (of density {1/K}) of a large subspace {H} of {{\bf F}_2^n}, then replacing {A} with {A+A} usually drops the doubling constant from {K} down to nearly {1} (under reasonable choices of parameters).
  • (ii) Replacing {A} with {A \cap (A+h)} for a “typical” {h \in A+A}. For instance, if {A} was the union of {K} random cosets of a subspace {H} of large codimension, then replacing {A} with {A \cap (A+h)} again usually drops the doubling constant from {K} down to nearly {1}.

Unfortunately, there are sets {A} where neither of the above two operations (i), (ii) significantly improves the doubling constant. For instance, if {A} is a random density {1/\sqrt{K}} subset of {\sqrt{K}} random translates of a medium-sized subspace {H}, one can check that the doubling constant stays close to {K} if one applies either operation (i) or operation (ii). But in this case these operations don’t actually worsen the doubling constant much either, and by applying some combination of (i) and (ii) (either intersecting {A+A} with a translate, or taking a sumset of {A \cap (A+h)} with itself) one can start lowering the doubling constant again.

This begins to suggest a potential strategy: show that at least one of the operations (i) or (ii) will improve the doubling constant, or at least not worsen it too much; and in the latter case, perform some more complicated operation to locate the desired doubling constant improvement.

A sign that this strategy might have a chance of working is provided by the following heuristic argument. If {A} has doubling constant {K}, then the Cartesian product {A \times A} has doubling constant {K^2}. On the other hand, by using the projection map {\pi: {\bf F}_2^n \times {\bf F}_2^n \rightarrow {\bf F}_2^n} defined by {\pi(x,y) := x+y}, we see that {A \times A} projects to {\pi(A \times A) = A+A}, with fibres {\pi^{-1}(\{h\})} being essentially a copy of {A \cap (A+h)}. So, morally, {A \times A} also behaves like a “skew product” of {A+A} and the fibres {A \cap (A+h)}, which suggests (non-rigorously) that the doubling constant {K^2} of {A \times A} is also something like the doubling constant of {A + A}, times the doubling constant of a typical fibre {A \cap (A+h)}. This would imply that at least one of {A +A} and {A \cap (A+h)} would have doubling constant at most {K}, and thus that at least one of operations (i), (ii) would not worsen the doubling constant.

Unfortunately, this argument does not seem to be easily made rigorous using the traditional doubling constant; even the significantly weaker statement that {A+A} has doubling constant at most {K^2} is false (see comments for more discussion). However, it turns out (as discussed in this recent paper of myself with Green and Manners) that things are much better. Here, the analogue of a subset {A} in {{\bf F}_2^n} is a random variable {X} taking values in {{\bf F}_2^n}, and the analogue of the (logarithmic) doubling constant {\log \frac{|A+A|}{|A|}} is the entropic doubling constant {d(X;X) := {\bf H}(X_1+X_2)-{\bf H}(X)}, where {X_1,X_2} are independent copies of {X}. If {X} is a random variable in some additive group {G} and {\pi: G \rightarrow H} is a homomorphism, one then has what we call the fibring inequality

\displaystyle  d(X;X) \geq d(\pi(X);\pi(X)) + d(X|\pi(X); X|\pi(X)),

where the conditional doubling constant {d(X|\pi(X); X|\pi(X))} is defined as

\displaystyle  d(X|\pi(X); X|\pi(X)) = {\bf H}(X_1 + X_2 | \pi(X_1), \pi(X_2)) - {\bf H}( X | \pi(X) ).

Indeed, from the chain rule for Shannon entropy one has

\displaystyle  {\bf H}(X) = {\bf H}(\pi(X)) + {\bf H}(X|\pi(X))

and

\displaystyle  {\bf H}(X_1+X_2) = {\bf H}(\pi(X_1) + \pi(X_2)) + {\bf H}(X_1 + X_2|\pi(X_1) + \pi(X_2))

while from the non-negativity of conditional mutual information one has

\displaystyle  {\bf H}(X_1 + X_2|\pi(X_1) + \pi(X_2)) \geq {\bf H}(X_1 + X_2|\pi(X_1), \pi(X_2))

and it is an easy matter to combine all these identities and inequalities to obtain the claim.
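
Explicitly, combining the two chain rule identities with the conditional mutual information bound, one has

\displaystyle  d(X;X) = {\bf H}(X_1+X_2) - {\bf H}(X)

\displaystyle  = ({\bf H}(\pi(X_1) + \pi(X_2)) - {\bf H}(\pi(X))) + ({\bf H}(X_1 + X_2|\pi(X_1) + \pi(X_2)) - {\bf H}(X|\pi(X)))

\displaystyle  \geq d(\pi(X);\pi(X)) + {\bf H}(X_1 + X_2|\pi(X_1), \pi(X_2)) - {\bf H}(X|\pi(X)) = d(\pi(X);\pi(X)) + d(X|\pi(X); X|\pi(X)).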

Applying this inequality with {X} replaced by two independent copies {(X_1,X_2)} of itself, and using the addition map {(x,y) \mapsto x+y} for {\pi}, we obtain in particular that

\displaystyle  2 d(X;X) \geq d(X_1+X_2; X_1+X_2) + d(X_1,X_2|X_1+X_2; X_1,X_2|X_1+X_2)

or (since {X_2} is determined by {X_1} once one fixes {X_1+X_2})

\displaystyle  2 d(X;X) \geq d(X_1+X_2; X_1+X_2) + d(X_1|X_1+X_2; X_1|X_1+X_2).

So if {d(X;X) = \log K}, then at least one of {d(X_1+X_2; X_1+X_2)} or {d(X_1|X_1+X_2; X_1|X_1+X_2)} will be less than or equal to {\log K}. This is the entropy analogue of at least one of (i) or (ii) improving, or at least not degrading the doubling constant, although there are some minor technicalities involving how one deals with the conditioning to {X_1+X_2} in the second term {d(X_1|X_1+X_2; X_1|X_1+X_2)} that we will gloss over here (one can pigeonhole the instances of {X_1} to different events {X_1+X_2=x}, {X_1+X_2=x'}, and “depolarise” the induction hypothesis to deal with distances {d(X;Y)} between pairs of random variables {X,Y} that do not necessarily have the same distribution). Furthermore, we can even calculate the defect in the above inequality: a careful inspection of the above argument eventually reveals that

\displaystyle  2 d(X;X) = d(X_1+X_2; X_1+X_2) + d(X_1|X_1+X_2; X_1|X_1+X_2)

\displaystyle  + {\bf I}( X_1 + X_2 : X_1 + X_3 | X_1 + X_2 + X_3 + X_4)

where we now take four independent copies {X_1,X_2,X_3,X_4}. This leads (modulo some technicalities) to the following interesting conclusion: if neither (i) nor (ii) leads to an improvement in the entropic doubling constant, then {X_1+X_2} and {X_2+X_3} are conditionally independent relative to {X_1+X_2+X_3+X_4}. This situation (or an approximation to this situation) is what we refer to in the paper as the “endgame”.

A version of this endgame conclusion is in fact valid in any characteristic. But in characteristic {2}, we can take advantage of the identity

\displaystyle  (X_1+X_2) + (X_2+X_3) = X_1 + X_3.

Conditioning on {X_1+X_2+X_3+X_4}, and using symmetry, we now conclude that if we are in the endgame exactly (so that the mutual information is zero), then the independent sum of two copies of {(X_1+X_2|X_1+X_2+X_3+X_4)} has exactly the same distribution as {(X_1+X_2|X_1+X_2+X_3+X_4)} itself; in particular, the entropic doubling constant here is zero, which is certainly a reduction in the doubling constant.

To deal with the situation where the conditional mutual information is small but not completely zero, we have to use an entropic version of the Balog-Szemeredi-Gowers lemma, but fortunately this was already worked out in an old paper of mine (although in order to optimise the final constant, we ended up using a slight variant of that lemma).

I am planning to formalize this paper in the Lean4 language. Further discussion of this project will take place on this Zulip stream, and the project itself will be held at this Github repository.

I have just uploaded to the arXiv my paper “A Maclaurin type inequality“. This paper concerns a variant of the Maclaurin inequality for the elementary symmetric means

\displaystyle  s_k(y) := \frac{1}{\binom{n}{k}} \sum_{1 \leq i_1 < \dots < i_k \leq n} y_{i_1} \dots y_{i_k}

of {n} real numbers {y_1,\dots,y_n}. This inequality asserts that

\displaystyle  s_\ell(y)^{1/\ell} \leq s_k(y)^{1/k}

whenever {1 \leq k \leq \ell \leq n} and {y_1,\dots,y_n} are non-negative. It can be proven as a consequence of the Newton inequality

\displaystyle  s_{k-1}(y) s_{k+1}(y) \leq s_k(y)^2

valid for all {1 \leq k < n} and arbitrary real {y_1,\dots,y_n} (in particular, here the {y_i} are allowed to be negative). Note that the {k=1, n=2} case of this inequality is just the arithmetic mean-geometric mean inequality

\displaystyle  y_1 y_2 \leq (\frac{y_1+y_2}{2})^2;

the general case of this inequality can be deduced from this special case by a number of standard manipulations (the most non-obvious of which is the operation of differentiating the real-rooted polynomial {\prod_{i=1}^n (z-y_i)} to obtain another real-rooted polynomial, thanks to Rolle’s theorem; the key point is that this operation preserves all the elementary symmetric means up to {s_{n-1}}). One can think of Maclaurin’s inequality as providing a refined version of the arithmetic mean-geometric mean inequality on {n} variables (which corresponds to the case {k=1}, {\ell=n}).
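
Indeed, {s_1(y)} is the arithmetic mean {\frac{y_1+\dots+y_n}{n}} and {s_n(y)} is the product {y_1 \dots y_n}, so the {k=1}, {\ell=n} case of Maclaurin’s inequality reads

\displaystyle  (y_1 \dots y_n)^{1/n} \leq \frac{y_1+\dots+y_n}{n}.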

Whereas Newton’s inequality works for arbitrary real {y_i}, the Maclaurin inequality breaks down once one or more of the {y_i} are permitted to be negative. A key example occurs when {n} is even, half of the {y_i} are equal to {+1}, and half are equal to {-1}. Here, one can verify that the elementary symmetric means {s_k(y)} vanish for odd {k} and are equal to { (-1)^{k/2} \frac{\binom{n/2}{k/2}}{\binom{n}{k}}} for even {k}. In particular, some routine estimation then gives the order of magnitude bound

\displaystyle  |s_k(y)|^{\frac{1}{k}} \asymp \frac{k^{1/2}}{n^{1/2}} \ \ \ \ \ (1)

for {0 < k \leq n} even, thus giving a significant violation of the Maclaurin inequality even after putting absolute values around the {s_k(y)}. In particular, vanishing of one {s_k(y)} does not imply vanishing of all subsequent {s_\ell(y)}.

On the other hand, it was observed by Gopalan and Yehudayoff that if two consecutive values {s_k(y), s_{k+1}(y)} are small, then this makes all subsequent values {s_\ell(y)} small as well. More precise versions of this statement were subsequently observed by Meka-Reingold-Tal and Doron-Hatami-Hoza, who obtained estimates of the shape

\displaystyle  |s_\ell(y)|^{\frac{1}{\ell}} \ll \ell^{1/2} \max (|s_k(y)|^{\frac{1}{k}}, |s_{k+1}(y)|^{\frac{1}{k+1}}) \ \ \ \ \ (2)

whenever {1 \leq k \leq \ell \leq n} and {y_1,\dots,y_n} are real (but possibly negative). For instance, setting {k=1, \ell=n} we obtain the inequality

\displaystyle  |y_1 \dots y_n|^{1/n} \ll n^{1/2} \max( |s_1(y)|, |s_2(y)|^{1/2})

which can be established by combining the arithmetic mean-geometric mean inequality

\displaystyle  (y_1 \dots y_n)^{2/n} \leq \frac{y_1^2 + \dots + y_n^2}{n}

with the Newton identity

\displaystyle  y_1^2 + \dots + y_n^2 = n^2 s_1(y)^2 - n(n-1) s_2(y).

As with the proof of the Newton inequalities, the general case of (2) can be obtained from this special case after some standard manipulations (including the differentiation operation mentioned previously).

However, if one inspects the bound (2) against the bounds (1) given by the key example, we see a mismatch – the right-hand side of (2) is larger than the left-hand side by a factor of about {k^{1/2}}. The main result of the paper rectifies this by establishing the optimal (up to constants) improvement

\displaystyle  |s_\ell(y)|^{\frac{1}{\ell}} \ll \frac{\ell^{1/2}}{k^{1/2}} \max (|s_k(y)|^{\frac{1}{k}}, |s_{k+1}(y)|^{\frac{1}{k+1}}) \ \ \ \ \ (3)

of (2). This answers a question posed on MathOverflow.

Unlike the previous arguments, we do not rely primarily on the arithmetic mean-geometric mean inequality. Instead, our primary tool is a new inequality

\displaystyle  \sum_{m=0}^\ell \binom{\ell}{m} |s_m(y)| r^m \geq (1+ |s_\ell(y)|^{2/\ell} r^2)^{\ell/2}, \ \ \ \ \ (4)

valid for all {1 \leq \ell \leq n} and {r>0}. Roughly speaking, the bound (3) would follow from (4) by setting {r \asymp (k/\ell)^{1/2} |s_\ell(y)|^{-1/\ell}}, provided that we can show that the {m=k,k+1} terms of the left-hand side dominate the sum in this regime. This can be done, after a technical step of passing to tuples {y} which nearly optimize the required inequality (3).

We sketch the proof of the inequality (4) as follows. One can use some standard manipulations to reduce to the case where {\ell=n} and {|s_n(y)|=1}, and after replacing {r} with {1/r} one is now left with establishing the inequality

\displaystyle  \sum_{m=0}^n \binom{n}{m} |s_m(y)| r^{n-m} \geq (1+r^2)^{n/2}.

Note that equality is attained in the previously discussed example with half of the {y_i} equal to {+1} and the other half equal to {-1}, thanks to the binomial theorem.

To prove this inequality, we consider the polynomial

\displaystyle  \prod_{j=1}^n (z - y_j) = \sum_{m=0}^n (-1)^m \binom{n}{m} s_m(y) z^{n-m}.

Evaluating this polynomial at {ir}, taking absolute values, using the triangle inequality, and then taking logarithms, we conclude that

\displaystyle  \frac{1}{2} \sum_{j=1}^n \log(y_j^2 + r^2) \leq \log(\sum_{m=0}^n \binom{n}{m} |s_m(y)| r^{n-m}).

A convexity argument (using the convexity of the function {s \mapsto \log(e^s + r^2)} and comparing it with its tangent line at {s = 0}, applied with {s = \log y_j^2}) gives the lower bound

\displaystyle  \log(y_j^2 + r^2) \geq \log(1+r^2) + \frac{2}{1+r^2} \log |y_j|

while the normalization {|s_n(y)|=1} gives

\displaystyle  \sum_{j=1}^n \log |y_j| = 0

and the claim follows.

A common task in analysis is to obtain bounds on sums

\displaystyle  \sum_{n \in A} f(n)

or integrals

\displaystyle  \int_A f(x)\ dx

where {A} is some simple region (such as an interval) in one or more dimensions, and {f} is an explicit (and elementary) non-negative expression involving one or more variables (such as {n} or {x}), and possibly also some additional parameters. Often, one would be content with an order of magnitude upper bound such as

\displaystyle  \sum_{n \in A} f(n) \ll X

or

\displaystyle  \int_A f(x)\ dx \ll X

where we use {X \ll Y} (or {Y \gg X} or {X = O(Y)}) to denote the bound {|X| \leq CY} for some constant {C}; sometimes one wishes to also obtain the matching lower bound, thus obtaining

\displaystyle  \sum_{n \in A} f(n) \asymp X

or

\displaystyle  \int_A f(x)\ dx \asymp X

where {X \asymp Y} is synonymous with {X \ll Y \ll X}. Finally, one may wish to obtain a more precise bound, such as

\displaystyle  \sum_{n \in A} f(n) = (1+o(1)) X

where {o(1)} is a quantity that goes to zero as the parameters of the problem go to infinity (or some other limit). (For a deeper dive into asymptotic notation in general, see this previous blog post.)

Here are some typical examples of such estimation problems, drawn from recent questions on MathOverflow:

  • (i) (From this question) If {d,p \geq 1} and {a>d/p}, is the expression

    \displaystyle  \sum_{j \in {\bf Z}} 2^{(\frac{d}{p}+1-a)j} \int_0^\infty e^{-2^j s} \frac{s^a}{1+s^{2a}}\ ds

    finite?
  • (ii) (From this question) If {h,m \geq 1}, how can one show that

    \displaystyle  \sum_{d=0}^\infty \frac{2d+1}{2h^2 (1 + \frac{d(d+1)}{h^2}) (1 + \frac{d(d+1)}{h^2m^2})^2} \ll 1 + \log(m^2)?

  • (iii) (From this question) Can one show that

    \displaystyle  \sum_{k=1}^{n-1} \frac{k^{2n-4k-3}(n^2-2nk+2k^2)}{(n-k)^{2n-4k-1}} = (c+o(1)) \sqrt{n}

    as {n \rightarrow \infty} for an explicit constant {c}, and what is this constant?

Compared to other estimation tasks, such as that of controlling oscillatory integrals, exponential sums, singular integrals, or expressions involving one or more unknown functions (that are only known to lie in some function spaces, such as an {L^p} space), high-dimensional geometry (or alternatively, large numbers of random variables), or number-theoretic structures (such as the primes), estimation of sums or integrals of non-negative elementary expressions is a relatively straightforward task, and can be accomplished by a variety of methods. The art of obtaining such estimates is typically not explicitly taught in textbooks, other than through some examples and exercises; it is typically picked up by analysts (or those working in adjacent areas, such as PDE, combinatorics, or theoretical computer science) as graduate students, while they work through their thesis or their first few papers in the subject.

Somewhat in the spirit of this previous post on analysis problem solving strategies, I am going to try here to collect some general principles and techniques that I have found useful for these sorts of problems. As with the previous post, I hope this will be something of a living document, and encourage others to add their own tips or suggestions in the comments.


Rachel Greenfeld and I have just uploaded to the arXiv our paper “Undecidability of translational monotilings”. This is a sequel to our previous paper in which we constructed a translational monotiling {A \oplus F = {\bf Z}^d} of a high-dimensional lattice {{\bf Z}^d} (thus the monotile {F} is a finite set and the translates {a+F}, {a \in A} of {F} partition {{\bf Z}^d}) which was aperiodic (there is no way to “repair” this tiling into a periodic tiling {A' \oplus F = {\bf Z}^d}, in which {A'} is now periodic with respect to a finite index subgroup of {{\bf Z}^d}). This disproved the periodic tiling conjecture of Stein, Grünbaum-Shephard and Lagarias-Wang, which asserted that such aperiodic translational monotilings do not exist. (Compare with the “hat monotile”, which is a recently discovered aperiodic isometric monotile of {{\bf R}^2}, where one is now allowed to use rotations and reflections as well as translations, or the even more recent “spectre monotile”, which is similar except that no reflections are needed.)

One of the motivations of this conjecture was the observation of Hao Wang that if the periodic tiling conjecture were true, then the translational monotiling problem is (algorithmically) decidable: there is a Turing machine which, when given a dimension {d} and a finite subset {F} of {{\bf Z}^d}, can determine in finite time whether {F} can tile {{\bf Z}^d}. This is because if a periodic tiling exists, it can be found by computer search; and if no tiling exists at all, then (by the compactness theorem) there exists some finite subset of {{\bf Z}^d} that cannot be covered by disjoint translates of {F}, and this can also be discovered by computer search. The periodic tiling conjecture asserts that these are the only two possible scenarios, thus giving the decidability.

On the other hand, Wang’s argument is not known to be reversible: the failure of the periodic tiling conjecture does not automatically imply the undecidability of the translational monotiling problem, as it does not rule out the existence of some other algorithm to determine tiling that does not rely on the existence of a periodic tiling. (For instance, even with the newly discovered hat and spectre tiles, it remains an open question whether the isometric monotiling problem for (say) polygons with rational coefficients in {{\bf R}^2} is decidable, with or without reflections.)

The main result of this paper settles this question (with one caveat):

Theorem 1 There does not exist any algorithm which, given a dimension {d}, a periodic subset {E} of {{\bf Z}^d}, and a finite subset {F} of {{\bf Z}^d}, determines in finite time whether there is a translational tiling {A \oplus F = E} of {E} by {F}.

The caveat is that we have to work with periodic subsets {E} of {{\bf Z}^d}, rather than all of {{\bf Z}^d}; we believe this is largely a technical restriction of our method, and it is likely that it can be removed with additional effort and creativity. We also remark that when {d=2}, the periodic tiling conjecture was established by Bhattacharya, and so the problem is decidable in the {d=2} case. It remains open whether the tiling problem is decidable for any fixed value of {d>2} (note in the above result that the dimension {d} is not fixed, but is part of the input).

Because of a well known link between algorithmic undecidability and logical undecidability (also known as logical independence), the main theorem also implies the existence of an (in principle explicitly describable) dimension {d}, periodic subset {E} of {{\bf Z}^d}, and a finite subset {F} of {{\bf Z}^d}, such that the assertion that {F} tiles {E} by translation cannot be proven or disproven in ZFC set theory (assuming of course that this theory is consistent).

As a consequence of our method, we can also replace {{\bf Z}^d} here by “virtually two-dimensional” groups {{\bf Z}^2 \times G_0}, with {G_0} a finite abelian group (which now becomes part of the input, in place of the dimension {d}).

We now describe some of the main ideas of the proof. It is a common technique to show that a given problem is undecidable by demonstrating that some other problem that was already known to be undecidable can be “encoded” within the original problem, so that any algorithm for deciding the original problem would also decide the embedded problem. Accordingly, we will encode the Wang tiling problem as a monotiling problem in {{\bf Z}^d}:

Problem 2 (Wang tiling problem) Given a finite collection {{\mathcal W}} of Wang tiles (unit squares with each side assigned some color from a finite palette), is it possible to tile the plane with translates of these tiles along the standard lattice {{\bf Z}^2}, such that adjacent tiles have matching colors along their common edge?

It is a famous result of Berger that this problem is undecidable. The embedding of this problem into the higher-dimensional translational monotiling problem proceeds through some intermediate problems. Firstly, it is an easy matter to embed the Wang tiling problem into a similar problem which we call the domino problem:

Problem 3 (Domino problem) Given a finite collection {{\mathcal R}_1} (resp. {{\mathcal R}_2}) of horizontal (resp. vertical) dominoes – pairs of adjacent unit squares, each of which is decorated with an element of a finite set {{\mathcal W}} of “pips” – is it possible to assign a pip to each unit square in the standard lattice tiling of {{\bf Z}^2}, such that every horizontal (resp. vertical) pair of squares in this tiling is decorated using a domino from {{\mathcal R}_1} (resp. {{\mathcal R}_2})?

Indeed, one just has to interpret each Wang tile as a separate “pip”, and define the domino sets {{\mathcal R}_1}, {{\mathcal R}_2} to be the pairs of horizontally or vertically adjacent Wang tiles with matching colors along their edge.
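As a quick illustration of this reduction, here is a small Python sketch (not from the paper) using the hypothetical convention that a Wang tile is a 4-tuple of colours listed in the order north, east, south, west:

def domino_sets(wang_tiles):
    # Horizontal dominoes: the east colour of the left tile matches the west colour
    # of the right tile; vertical dominoes: the south colour of the upper tile matches
    # the north colour of the lower tile.  (These orientation conventions are my own
    # choice, purely for illustration.)
    R1 = {(a, b) for a in wang_tiles for b in wang_tiles if a[1] == b[3]}
    R2 = {(a, b) for a in wang_tiles for b in wang_tiles if a[2] == b[0]}
    return R1, R2

tiles = [(0, 1, 0, 1), (1, 0, 1, 0)]   # two tiles coloured from {0, 1}
R1, R2 = domino_sets(tiles)
print(len(R1), len(R2))                # 2 2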

Next, we embed the domino problem into a Sudoku problem:

Problem 4 (Sudoku problem) Given a column width {N}, a digit set {\Sigma}, a collection {{\mathcal S}} of functions {g: \{0,\dots,N-1\} \rightarrow \Sigma}, and an “initial condition” {{\mathcal C}} (which we will not detail here, as it is a little technical), is it possible to assign a digit {F(n,m)} to each cell {(n,m)} in the “Sudoku board” {\{0,1,\dots,N-1\} \times {\bf Z}} such that for any slope {j \in {\bf Z}} and intercept {i \in {\bf Z}}, the digits {n \mapsto F(n,jn+i)} along the line {\{(n,jn+i): 0 \leq n \leq N-1\}} lie in {{\mathcal S}} (and also that {F} obeys the initial condition {{\mathcal C}})?

The most novel part of the paper is the demonstration that the domino problem can indeed be embedded into the Sudoku problem. The embedding of the Sudoku problem into the monotiling problem follows from a modification of the methods in our previous papers, which had also introduced versions of the Sudoku problem, and created a “tiling language” which could be used to “program” various problems, including the Sudoku problem, as monotiling problems.

To encode the domino problem into the Sudoku problem, we need to take a domino function {{\mathcal T}: {\bf Z}^2 \rightarrow {\mathcal W}} (obeying the domino constraints associated to some domino sets {{\mathcal R}_1, {\mathcal R}_2}) and use it to build a Sudoku function {F: \{0,\dots,N-1\} \times {\bf Z} \rightarrow \Sigma} (obeying some Sudoku constraints relating to the domino sets); conversely, every Sudoku function obeying the rules of our Sudoku puzzle has to arise somehow from a domino function. The route to doing so was not immediately obvious, but after a helpful tip from Emmanuel Jeandel, we were able to adapt some ideas of Aanderaa and Lewis, in which certain hierarchical structures were used to encode one problem in another. Here, we interpret this hierarchical structure {p}-adically (using two different primes due to the two-dimensionality of the domino problem). The Sudoku function {F} that will exemplify our embedding is then built from {{\mathcal T}} by the formula

\displaystyle  F(n,m) := ( f_{p_1}(m), f_{p_2}(m), {\mathcal T}(\nu_{p_1}(m), \nu_{p_2}(m)) ) \ \ \ \ \ (1)

where {p_1,p_2} are two large distinct primes (for instance one can take {p_1=53}, {p_2=59} for concreteness), {\nu_p(m)} denotes the number of times {p} divides {m}, and {f_p(m) \in {\bf Z}/p{\bf Z} \backslash \{0\}} is the last non-zero digit in the base {p} expansion of {m}:

\displaystyle  f_p(m) := \frac{m}{p^{\nu_p(m)}} \hbox{ mod } p

(with the conventions {\nu_p(0)=+\infty} and {f_p(0)=1}). In the case {p_1=3, p_2=5}, the first component of (1) looks like this:

and a typical instance of the final component {{\mathcal T}(\nu_{p_1}(m), \nu_{p_2}(m))} looks like this:

Amusingly, the decoration here is essentially following the rules of the children’s game “Fizz buzz“.
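For readers who like to experiment, the {p}-adic ingredients of (1) are easy to code up; the following Python sketch (not from the paper) uses the small primes {p_1=3}, {p_2=5} of the pictures above:

import math

def nu_p(m, p):
    # number of times p divides m (with nu_p(0) = +infinity by convention)
    if m == 0:
        return math.inf
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return k

def f_p(m, p):
    # last non-zero digit of m in base p (with f_p(0) = 1 by convention)
    if m == 0:
        return 1
    return (m // p ** nu_p(m, p)) % p

# the "Fizz buzz"-like decoration comes from the pair of valuations:
for m in range(1, 16):
    print(m, f_p(m, 3), f_p(m, 5), (nu_p(m, 3), nu_p(m, 5)))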

To demonstrate the embedding, we thus need to produce a specific Sudoku rule {{\mathcal S}} (as well as a more technical initial condition {{\mathcal C}}, which is basically required to exclude degenerate Sudoku solutions such as a constant solution) that can “capture” the target function (1), in the sense that the only solutions to this specific Sudoku puzzle are given by variants of {F} (e.g., {F} composed with various linear transformations). In our previous paper we were able to build a Sudoku puzzle that could similarly capture either of the first two components {f_{p_1}(m)}, {f_{p_2}(m)} of our target function (1) (up to linear transformations), by a procedure very akin to solving an actual Sudoku puzzle (combined with iterative use of a “Tetris” move in which we eliminate rows of the puzzle that we have fully solved, to focus on the remaining unsolved rows). Our previous paper treated the case when {p} was replaced with a power of {2}, as this was the only case that we know how to embed in a monotiling problem of the entirety of {{\bf Z}^d} (as opposed to a periodic subset {E} of {{\bf Z}^d}), but the analysis is in fact easier when {p} is a large odd prime, instead of a power of {2}. Once the first two components {f_{p_1}(m), f_{p_2}(m)} have been solved for, it is a relatively routine matter to design an additional constraint in the Sudoku rule that then constrains the third component to be of the desired form {{\mathcal T}(\nu_{p_1}(m), \nu_{p_2}(m))}, with {{\mathcal T}} obeying the domino constraints.

I have just uploaded to the arXiv my paper “Monotone non-decreasing sequences of the Euler totient function“. This paper concerns the quantity {M(x)}, defined as the length of the longest subsequence of the numbers from {1} to {x} for which the Euler totient function {\varphi} is non-decreasing. The first few values of {M} are

\displaystyle  1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 12, \dots

(OEIS A365339). For instance, {M(6)=5} because the totient function is non-decreasing on the set {\{1,2,3,4,5\}} or {\{1,2,3,4,6\}}, but not on the set {\{1,2,3,4,5,6\}}.
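These values are easy to reproduce: {M(x)} is the length of the longest non-decreasing subsequence of {\varphi(1),\dots,\varphi(x)}, which can be computed by the standard patience sorting method. Here is a short Python sketch (not from the paper), using sympy for the totient:

from bisect import bisect_right
from sympy import totient

def M(x):
    # length of the longest non-decreasing subsequence of phi(1), ..., phi(x);
    # tails[k] holds the least possible final value of such a subsequence of
    # length k+1, and stays sorted throughout
    tails = []
    for n in range(1, x + 1):
        v = int(totient(n))
        i = bisect_right(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

print([M(x) for x in range(1, 19)])
# [1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 12]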

Since {\varphi(p)=p-1} for any prime {p}, we have {M(x) \geq \pi(x)}, where {\pi(x)} is the prime counting function. Empirically, the primes come quite close to achieving the maximum length {M(x)}; indeed it was conjectured by Pollack, Pomerance, and Treviño, based on numerical evidence, that one had

\displaystyle  M(x) = \pi(x)+64 \ \ \ \ \ (1)

for all {x \geq 31957}; this conjecture is verified up to {x=10^7}. The previous best known upper bound was basically of the form

\displaystyle  M(x) \leq \exp( (C+o(1)) (\log\log\log x)^2 ) \frac{x}{\log x} \ \ \ \ \ (2)

as {x \rightarrow \infty} for an explicit constant {C = 0.81781\dots}, from combining results from the above paper with that of Ford or of Maier-Pomerance. In this paper we obtain the asymptotic

\displaystyle  M(x) = \left( 1 + O \left(\frac{(\log\log x)^5}{\log x}\right) \right) \frac{x}{\log x}

so in particular {M(x) = (1+o(1))\pi(x)}. This answers a question of Erdős, as well as a closely related question of Pollack, Pomerance, and Treviño.

The methods of proof turn out to be mostly elementary (the most advanced result from analytic number theory we need is the prime number theorem with classical error term). The basic idea is to isolate one key prime factor {p} of a given number {1 \leq n \leq x} which has a sizeable influence on the totient function {\varphi(n)}. For instance, for “typical” numbers {n}, one has a factorization

\displaystyle  n = d p_2 p_1

where {p_2} is a medium sized prime, {p_1} is a significantly larger prime, and {d} is a number with all prime factors less than {p_2}. This leads to an approximation

\displaystyle  \varphi(n) \approx \frac{\varphi(d)}{d} (1-\frac{1}{p_2}) n.
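A quick numerical sanity check of this approximation, with the arbitrarily chosen parameters {d=6}, {p_2=11}, {p_1=101} (purely for illustration):

from sympy import totient

d, p2, p1 = 6, 11, 101
n = d * p2 * p1
exact = int(totient(n))
approx = float(totient(d)) / d * (1 - 1 / p2) * n
print(exact, round(approx, 1), exact / approx)   # the ratio is 1 - 1/p1, close to 1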

As a consequence, if we temporarily hold {d} fixed, and also localize {n} to a relatively short interval, then {\varphi} can only be non-decreasing in {n} if {p_2} is also non-decreasing at the same time. This turns out to significantly cut down on the possible length of a non-decreasing sequence in this regime, particularly if {p_2} is large; this can be formalized by partitioning the range of {p_2} into various subintervals and inspecting how this (and the monotonicity hypothesis on {\varphi}) constrains the values of {n} associated to each subinterval. When {p_2} is small, we instead use a factorization

\displaystyle  n = d p \ \ \ \ \ (3)

where {d} is very smooth (i.e., has no large prime factors), and {p} is a large prime. Now we have the approximation

\displaystyle  \varphi(n) \approx \frac{\varphi(d)}{d} n \ \ \ \ \ (4)

and we can conclude that {\frac{\varphi(d)}{d}} will have to basically be piecewise constant in order for {\varphi} to be non-decreasing. Pursuing this analysis more carefully (in particular controlling the size of various exceptional sets in which the above analysis breaks down), we end up achieving the main theorem so long as we can prove the preliminary inequality

\displaystyle  \sum_{\frac{\varphi(d)}{d}=q} \frac{1}{d} \leq 1 \ \ \ \ \ (5)

for all positive rational numbers {q}. This is in fact also a necessary condition; any failure of this inequality could easily be converted into a counterexample to the main asymptotic {M(x) = (1+o(1)) \frac{x}{\log x}}, by considering numbers of the form (3) with {\frac{\varphi(d)}{d}} equal to a fixed constant {q} (and omitting a few rare values of {n} where the approximation (4) is bad enough that {\varphi} is temporarily decreasing). Fortunately, there is a minor miracle, relating to the fact that the largest prime factor of the denominator of {\frac{\varphi(d)}{d}} in lowest terms necessarily equals the largest prime factor of {d}, which allows one to evaluate the left-hand side of (5) almost exactly (this expression either vanishes, or is the product of {\frac{1}{p-1}} over certain primes {p} ranging up to the largest prime factor of {q}), and hence to establish (5) with little difficulty. If one were to try to prove an analogue of our main result for the sum-of-divisors function {\sigma(n)}, one would need the analogue

\displaystyle  \sum_{\frac{\sigma(d)}{d}=q} \frac{1}{d} \leq 1 \ \ \ \ \ (6)

of (5), which looks within reach of current methods (and was even claimed without proof by Erdős), but does not have a full proof in the literature at present.
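One can at least test truncations of (5) and (6) numerically, by grouping {d} up to a modest bound according to the exact rational value of {\varphi(d)/d} or {\sigma(d)/d}; the following Python sketch (not from the paper) does this with exact fractions:

from collections import defaultdict
from fractions import Fraction
from sympy import totient, divisor_sigma

D = 5000
phi_classes, sigma_classes = defaultdict(Fraction), defaultdict(Fraction)
for d in range(1, D + 1):
    phi_classes[Fraction(int(totient(d)), d)] += Fraction(1, d)
    sigma_classes[Fraction(int(divisor_sigma(d)), d)] += Fraction(1, d)

# largest partial sums appearing in (5) and (6) respectively; both stay at most 1
# (the maximum value 1 comes from the class of d = 1)
print(max(phi_classes.values()), max(sigma_classes.values()))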

In the final section of the paper we discuss some near counterexamples to the strong conjecture (1) that indicate that it is likely going to be difficult to get close to proving this conjecture without assuming some rather strong hypotheses. Firstly, we show that failure of Legendre’s conjecture on the existence of a prime between any two consecutive squares can lead to a counterexample to (1). Secondly, we show that failure of the Dickson-Hardy-Littlewood conjecture can lead to a separate (and more dramatic) failure of (1), in which the primes are no longer the dominant sequence on which the totient function is non-decreasing, but rather the numbers which are a power of two times a prime become the dominant sequence. This suggests that any significant improvement to (2) would require assuming something comparable to the prime tuples conjecture, and perhaps also some unproven hypotheses on prime gaps.

As someone who had a relatively light graduate education in algebra, I have always found the import of Yoneda’s lemma in category theory somewhat elusive; the statement and proof are simple enough, but they definitely have the “abstract nonsense” flavor that one often ascribes to this part of mathematics, and I struggled to connect the lemma to the more grounded forms of intuition, such as those based on concrete examples, that I was more comfortable with. There is a popular MathOverflow post devoted to this question, with many answers that were helpful to me, but I still felt vaguely dissatisfied. However, recently, when pondering the very concrete concept of a polynomial, I accidentally stumbled upon a special case of Yoneda’s lemma in action, which clarified the lemma conceptually for me. In the end it was a very simple observation (and one that would be extremely pedestrian to anyone who works in an algebraic field of mathematics), but as I found it helpful as a non-algebraist myself, I thought I would share it here in case others similarly find it helpful.

In algebra we see a distinction between a polynomial form (also known as a formal polynomial), and a polynomial function, although this distinction is often elided in more concrete applications. A polynomial form in, say, one variable with integer coefficients, is a formal expression {P} of the form

\displaystyle  P = a_d {\mathrm n}^d + \dots + a_1 {\mathrm n} + a_0 \ \ \ \ \ (1)

where {a_0,\dots,a_d} are coefficients in the integers, and {{\mathrm n}} is an indeterminate: a symbol that is often intended to be interpreted as an integer, real number, complex number, or element of some more general ring {R}, but is for now a purely formal object. The collection of such polynomial forms is denoted {{\bf Z}[{\mathrm n}]}, and is a commutative ring.

A polynomial form {P} can be interpreted in any ring {R} (even non-commutative ones) to create a polynomial function {P_R : R \rightarrow R}, defined by the formula

\displaystyle  P_R(n) := a_d n^d + \dots + a_1 n + a_0 \ \ \ \ \ (2)

for any {n \in R}. This definition (2) looks so similar to the definition (1) that we usually abuse notation and conflate {P} with {P_R}. This conflation is supported by the identity theorem for polynomials, that asserts that if two polynomial forms {P, Q} agree at an infinite number of (say) complex numbers, thus {P_{\bf C}(z) = Q_{\bf C}(z)} for infinitely many {z}, then they agree {P=Q} as polynomial forms (i.e., their coefficients match). But this conflation is sometimes dangerous, particularly when working in finite characteristic. For instance:

  • (i) The linear forms {{\mathrm n}} and {-{\mathrm n}} are distinct as polynomial forms, but agree when interpreted in the ring {{\bf Z}/2{\bf Z}}, since {n = -n} for all {n \in {\bf Z}/2{\bf Z}}.
  • (ii) Similarly, if {p} is a prime, then the degree one form {{\mathrm n}} and the degree {p} form {{\mathrm n}^p} are distinct as polynomial forms (and in particular have distinct degrees), but agree when interpreted in the ring {{\bf Z}/p{\bf Z}}, thanks to Fermat’s little theorem.
  • (iii) The polynomial form {{\mathrm n}^2+1} has no roots when interpreted in the reals {{\bf R}}, but has roots when interpreted in the complex numbers {{\bf C}}. Similarly, the linear form {2{\mathrm n}-1} has no roots when interpreted in the integers {{\bf Z}}, but has roots when interpreted in the rationals {{\bf Q}}.
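Examples (i) and (ii) are easy to verify directly; here is a tiny Python sketch, representing a polynomial form by its list of coefficients {a_0, a_1, \dots}:

p = 5
form_n  = [0, 1]           # the form n
form_np = [0] * p + [1]    # the form n^p

def interpret_mod(coeffs, p):
    # the polynomial function on Z/pZ defined by a coefficient list
    return tuple(sum(a * n ** k for k, a in enumerate(coeffs)) % p for n in range(p))

print(form_n == form_np)                                       # False: distinct forms
print(interpret_mod(form_n, p) == interpret_mod(form_np, p))   # True: same function on Z/5Z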

The above examples show that if one only interprets polynomial forms in a specific ring {R}, then some information about the polynomial could be lost (and some features of the polynomial, such as roots, may be “invisible” to that interpretation). But this turns out not to be the case if one considers interpretations in all rings simultaneously, as we shall now discuss.

If {R, S} are two different rings, then the polynomial functions {P_R: R \rightarrow R} and {P_S: S \rightarrow S} arising from interpreting a polynomial form {P} in these two rings are, strictly speaking, different functions. However, they are often closely related to each other. For instance, if {R} is a subring of {S}, then {P_R} agrees with the restriction of {P_S} to {R}. More generally, if there is a ring homomorphism {\phi: R \rightarrow S} from {R} to {S}, then {P_R} and {P_S} are intertwined by the relation

\displaystyle  \phi \circ P_R = P_S \circ \phi, \ \ \ \ \ (3)

which basically asserts that ring homomorphisms respect polynomial operations. Note that the previous observation corresponded to the case when {\phi} was an inclusion homomorphism. Another example comes from the complex conjugation automorphism {z \mapsto \overline{z}} on the complex numbers, in which case (3) asserts the identity

\displaystyle  \overline{P_{\bf C}(z)} = P_{\bf C}(\overline{z})

for any polynomial function {P_{\bf C}} on the complex numbers, and any complex number {z}.

What was surprising to me (as someone who had not internalized the Yoneda lemma) was that the converse statement was true: if one had a function {F_R: R \rightarrow R} associated to every ring {R} that obeyed the intertwining relation

\displaystyle  \phi \circ F_R = F_S \circ \phi \ \ \ \ \ (4)

for every ring homomorphism {\phi: R \rightarrow S}, then there was a unique polynomial form {P \in {\bf Z}[\mathrm{n}]} such that {F_R = P_R} for all rings {R}. This seemed surprising to me because the functions {F} were a priori arbitrary functions, and as an analyst I would not expect them to have polynomial structure. But the fact that (4) holds for all rings {R,S} and all homomorphisms {\phi} is in fact rather powerful. As an analyst, I am tempted to proceed by first working with the ring {{\bf C}} of complex numbers and taking advantage of the aforementioned identity theorem, but this turns out to be tricky because {{\bf C}} does not “talk” to all the other rings {R} enough, in the sense that there are not always as many ring homomorphisms from {{\bf C}} to {R} as one would like. But there is in fact a more elementary argument that takes advantage of a particularly relevant (and “talkative”) ring to the theory of polynomials, namely the ring {{\bf Z}[\mathrm{n}]} of polynomials themselves. Given any other ring {R}, and any element {n} of that ring, there is a unique ring homomorphism {\phi_{R,n}: {\bf Z}[\mathrm{n}] \rightarrow R} from {{\bf Z}[\mathrm{n}]} to {R} that maps {\mathrm{n}} to {n}, namely the evaluation map

\displaystyle  \phi_{R,n} \colon a_d {\mathrm n}^d + \dots + a_1 {\mathrm n} + a_0 \mapsto a_d n^d + \dots + a_1 n + a_0

that sends a polynomial form to its evaluation at {n}. Applying (4) to this ring homomorphism, and specializing to the element {\mathrm{n}} of {{\bf Z}[\mathrm{n}]}, we conclude that

\displaystyle  \phi_{R,n}( F_{{\bf Z}[\mathrm{n}]}(\mathrm{n}) ) = F_R( n )

for any ring {R} and any {n \in R}. If we then define {P \in {\bf Z}[\mathrm{n}]} to be the formal polynomial

\displaystyle  P := F_{{\bf Z}[\mathrm{n}]}(\mathrm{n}),

then this identity can be rewritten as

\displaystyle  F_R = P_R

and so we have indeed shown that the family {F_R} arises from a polynomial form {P}. Conversely, from the identity

\displaystyle  P = P_{{\bf Z}[\mathrm{n}]}(\mathrm{n})

valid for any polynomial form {P}, we see that two polynomial forms {P,Q} can only generate the same polynomial functions {P_R, Q_R} for all rings {R} if they are identical as polynomial forms. So the polynomial form {P} associated to the family {F_R} is unique.
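Unwinding the definitions, the existence half of this argument can be summarized in the single chain of identities

\displaystyle  F_R(n) = F_R( \phi_{R,n}(\mathrm{n}) ) = \phi_{R,n}( F_{{\bf Z}[\mathrm{n}]}(\mathrm{n}) ) = \phi_{R,n}( P ) = P_R(n)

for each {n \in R}, where the second equality is the intertwining relation (4) applied to the evaluation homomorphism {\phi_{R,n}}, and the final equality is just the definition of the evaluation map.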

We have thus created an identification of form and function: polynomial forms {P} are in one-to-one correspondence with families of functions {F_R} obeying the intertwining relation (4). But this identification can be interpreted as a special case of the Yoneda lemma, as follows. There are two categories in play here: the category {\mathbf{Ring}} of rings (where the morphisms are ring homomorphisms), and the category {\mathbf{Set}} of sets (where the morphisms are arbitrary functions). There is an obvious forgetful functor {\mathrm{Forget}: \mathbf{Ring} \rightarrow \mathbf{Set}} between these two categories that takes a ring and removes all of the algebraic structure, leaving behind just the underlying set. A collection {F_R: R \rightarrow R} of functions (i.e., {\mathbf{Set}}-morphisms) for each {R} in {\mathbf{Ring}} that obeys the intertwining relation (4) is precisely the same thing as a natural transformation from the forgetful functor {\mathrm{Forget}} to itself. So we have identified formal polynomials in {{\bf Z}[\mathrm{n}]} as a set with natural endomorphisms of the forgetful functor:

\displaystyle  \mathrm{Forget}({\bf Z}[\mathrm{n}]) \equiv \mathrm{Hom}( \mathrm{Forget}, \mathrm{Forget} ). \ \ \ \ \ (5)

Informally: polynomial forms are precisely those operations on rings that are respected by ring homomorphisms.

What does this have to do with Yoneda’s lemma? Well, remember that every element {n} of a ring {R} came with an evaluation homomorphism {\phi_{R,n}: {\bf Z}[\mathrm{n}] \rightarrow R}. Conversely, every homomorphism from {{\bf Z}[\mathrm{n}]} to {R} will be of the form {\phi_{R,n}} for a unique {n} – indeed, {n} will just be the image of {\mathrm{n}} under this homomorphism. So the evaluation homomorphism provides a one-to-one correspondence between elements of {R}, and ring homomorphisms in {\mathrm{Hom}({\bf Z}[\mathrm{n}], R)}. This correspondence is at the level of sets, so this gives the identification

\displaystyle  \mathrm{Forget} \equiv \mathrm{Hom}({\bf Z}[\mathrm{n}], -).

Thus our identification can be written as

\displaystyle  \mathrm{Forget}({\bf Z}[\mathrm{n}]) \equiv \mathrm{Hom}( \mathrm{Hom}({\bf Z}[\mathrm{n}], -), \mathrm{Forget} )

which is now clearly a special case of the Yoneda lemma

\displaystyle  F(A) \equiv \mathrm{Hom}( \mathrm{Hom}(A, -), F )

that applies to any functor {F: {\mathcal C} \rightarrow \mathbf{Set}} from a (locally small) category {{\mathcal C}} and any object {A} in {{\mathcal C}}. And indeed if one inspects the standard proof of this lemma, it is essentially the same argument as the argument we used above to establish the identification (5). More generally, it seems to me that the Yoneda lemma is often used to identify “formal” objects with their “functional” interpretations, as long as one simultaneously considers interpretations across an entire category (such as the category of rings), as opposed to just a single interpretation in a single object of the category in which there may be some loss of information due to the peculiarities of that specific object. Grothendieck’s “functor of points” interpretation of a scheme, discussed in this previous blog post, is one typical example of this.

Kevin Ford, Dimitris Koukoulopoulos and I have just uploaded to the arXiv our paper “A lower bound on the mean value of the Erdős-Hooley delta function“. This paper complements the recent paper of Dimitris and myself obtaining the upper bound

\displaystyle  \frac{1}{x} \sum_{n \leq x} \Delta(n) \ll (\log\log x)^{11/4}

on the mean value of the Erdős-Hooley delta function

\displaystyle  \Delta(n) := \sup_u \# \{ d|n: e^u < d \leq e^{u+1} \}.
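For small {n}, {\Delta(n)} can be computed directly: the supremum is attained with {e^u} just below a divisor of {n}, so it suffices to count, for each divisor {d}, the divisors lying in {[d, ed)}. A short Python sketch (not from the paper):

from math import e
from sympy import divisors

def delta(n):
    divs = divisors(n)
    # for each divisor d, count the divisors in the window [d, e*d)
    return max(sum(1 for dd in divs if d <= dd < e * d) for d in divs)

print([delta(n) for n in (1, 12, 60)])   # [1, 3, 4]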

In this paper we obtain a lower bound

\displaystyle  \frac{1}{x} \sum_{n \leq x} \Delta(n) \gg (\log\log x)^{1+\eta-o(1)}

where {\eta = 0.3533227\dots} is an exponent that arose in previous work of Ford, Green, and Koukoulopoulos, who showed that

\displaystyle  \Delta(n) \gg (\log\log n)^{\eta-o(1)} \ \ \ \ \ (1)

for all {n} outside of a set of density zero. The previous best known lower bound for the mean value was

\displaystyle  \frac{1}{x} \sum_{n \leq x} \Delta(n) \gg \log\log x,

due to Hall and Tenenbaum.

The point is that the main contributions to the mean value of {\Delta(n)} are driven not by “typical” numbers {n} of size about {x}, but rather by numbers that have a splitting

\displaystyle  n = n' n''

where {n''} is the product of primes between some intermediate threshold {1 \leq y \leq x} and {x} and behaves “typically” (so in particular, it has about {\log\log x - \log\log y + O(\sqrt{\log\log x})} prime factors, as per the Hardy-Ramanujan law and the Erdős-Kac law), but {n'} is the product of primes up to {y} and has double the number of typical prime factors – {2 \log\log y + O(\sqrt{\log\log x})}, rather than {\log\log y + O(\sqrt{\log\log x})} – thus {n'} is the type of number that would make a significant contribution to the mean value of the divisor function {\tau(n')}. Here {y} is such that {\log\log y} is an integer in the range

\displaystyle  \varepsilon\log\log x \leq \log \log y \leq (1-\varepsilon) \log\log x

for some small constant {\varepsilon>0}; there are then basically {\log\log x} different choices of {y}, which give essentially disjoint contributions. From the easy inequalities

\displaystyle  \Delta(n) \gg \Delta(n') \Delta(n'') \geq \frac{\tau(n')}{\log n'} \Delta(n'') \ \ \ \ \ (2)

(the latter coming from the pigeonhole principle) and the fact that {\frac{\tau(n')}{\log n'}} has mean about one, one would expect to get the above result provided that one could get a lower bound of the form

\displaystyle  \Delta(n'') \gg (\log \log n'')^{\eta-o(1)} \ \ \ \ \ (3)

for most typical {n''} with prime factors between {y} and {x}. Unfortunately, due to the lack of small prime factors in {n''}, the arguments of Ford, Green, Koukoulopoulos that give (1) for typical {n} do not quite work for the rougher numbers {n''}. However, it turns out that one can get around this problem by replacing (2) by the more efficient inequality

\displaystyle  \Delta(n) \gg \frac{\tau(n')}{\log n'} \Delta^{(\log n')}(n'')

where

\displaystyle  \Delta^{(v)}(n) := \sup_u \# \{ d|n: e^u < d \leq e^{u+v} \}

is an enlarged version of {\Delta(n)} when {v \geq 1} (note that {\Delta^{(1)} = \Delta}). This inequality is easily proven by applying the pigeonhole principle to the factors of {n} of the form {d' d''}, where {d'} is one of the {\tau(n')} factors of {n'}, and {d''} is one of the {\Delta^{(\log n')}(n'')} factors of {n''} in the optimal interval {[e^u, e^{u+\log n'}]}. The extra room provided by the enlargement of the range {[e^u, e^{u+1}]} to {[e^u, e^{u+\log n'}]} turns out to be sufficient to adapt the Ford-Green-Koukoulopoulos argument to the rough setting. In fact we are able to use the main technical estimate from that paper as a “black box”, namely that if one considers a random subset {A} of {[D^c, D]} for some small {c>0} and sufficiently large {D}, with each {n \in [D^c, D]} lying in {A} with an independent probability {1/n}, then with high probability there should be {\gg c^{-1/\eta+o(1)}} subset sums of {A} that attain the same value. (Initially, what “high probability” means is just “close to {1}“, but one can reduce the failure probability significantly as {c \rightarrow 0} by a “tensor power trick” taking advantage of Bennett’s inequality.)
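The enlarged quantity {\Delta^{(v)}(n)} can be computed for small {n} in the same way as {\Delta(n)}, by sliding windows of multiplicative length {e^v} anchored at divisors; a minimal Python sketch, again not from the paper:

from math import exp
from sympy import divisors

def delta_v(n, v):
    divs = divisors(n)
    # for each divisor d, count the divisors in the window [d, exp(v)*d)
    return max(sum(1 for dd in divs if d <= dd < exp(v) * d) for d in divs)

print(delta_v(60, 1), delta_v(60, 3))   # 4 10; delta_v(n, 1) recovers Delta(n)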
