Given a function {f: {\bf N} \rightarrow \{-1,+1\}} on the natural numbers taking values in {+1, -1}, one can invoke the Furstenberg correspondence principle to locate a measure preserving system {T \circlearrowright (X, \mu)} – a probability space {(X,\mu)} together with a measure-preserving shift {T: X \rightarrow X} (or equivalently, a measure-preserving {{\bf Z}}-action on {(X,\mu)}) – together with a measurable function (or “observable”) {F: X \rightarrow \{-1,+1\}} that has essentially the same statistics as {f} in the sense that

\displaystyle \lim \inf_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k)

\displaystyle \leq \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x)

\displaystyle \leq \lim \sup_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k)

for any integers {h_1,\dots,h_k}. In particular, one has

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x) = \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k) \ \ \ \ \ (1)


whenever the limit on the right-hand side exists. We will refer to the system {T \circlearrowright (X,\mu)} together with the designated function {F} as a Furstenberg limit of the sequence {f}. These Furstenberg limits capture some, but not all, of the asymptotic behaviour of {f}; roughly speaking, they control the typical “local” behaviour of {f}, involving correlations such as {\frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k)} in the regime where {h_1,\dots,h_k} are much smaller than {N}. However, the control on error terms here is usually only qualitative at best, and one usually does not obtain non-trivial control on correlations in which the {h_1,\dots,h_k} are allowed to grow at some significant rate with {N} (e.g. like some power {N^\theta} of {N}).
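As a purely numerical illustration of the finite averages appearing on the right-hand side of (1), here is a minimal Python sketch. The helper name `correlation` and the test sequence {f(n) = (-1)^n} are illustrative assumptions that are not part of the discussion above; the sequence is chosen only because its correlations can be read off by hand.

```python
from math import prod

def correlation(f, shifts, N):
    """Finite average (1/N) * sum_{n=1}^N f(n+h_1) ... f(n+h_k)."""
    return sum(prod(f(n + h) for h in shifts) for n in range(1, N + 1)) / N

# Illustrative sequence (an assumption of this sketch): f(n) = (-1)^n.
def alternating(n):
    return 1 if n % 2 == 0 else -1

c01 = correlation(alternating, (0, 1), 1000)   # f(n) f(n+1) is always -1
c02 = correlation(alternating, (0, 2), 1000)   # f(n) f(n+2) is always +1
```

For this periodic sequence the averages are already constant in {N}, so the limit in (1) exists trivially; for sequences such as the Liouville function the existence of the limit is exactly what is in question.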

The correspondence principle is discussed in these previous blog posts. One way to establish the principle is by introducing a Banach limit {p\!-\!\lim: \ell^\infty({\bf N}) \rightarrow {\bf R}} that extends the usual limit functional on the subspace of {\ell^\infty({\bf N})} consisting of convergent sequences while still having operator norm one. Such functionals cannot be constructed explicitly, but can be proven to exist (non-constructively and non-uniquely) using the Hahn-Banach theorem; one can also use a non-principal ultrafilter here if desired. One can then seek to construct a system {T \circlearrowright (X,\mu)} and a measurable function {F: X \rightarrow \{-1,+1\}} for which one has the statistics

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x) = p\!-\!\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n+h_1) \dots f(n+h_k) \ \ \ \ \ (2)


for all {h_1,\dots,h_k}. One can explicitly construct such a system as follows. One can take {X} to be the Cantor space {\{-1,+1\}^{\bf Z}} with the product {\sigma}-algebra and the shift

\displaystyle T ( (x_n)_{n \in {\bf Z}} ) := (x_{n+1})_{n \in {\bf Z}}

with the function {F: X \rightarrow \{-1,+1\}} being the coordinate function at zero:

\displaystyle F( (x_n)_{n \in {\bf Z}} ) := x_0

(so in particular {F( T^h (x_n)_{n \in {\bf Z}} ) = x_h} for any {h \in {\bf Z}}). The only thing remaining is to construct the invariant measure {\mu}. In order to be consistent with (2), one must have

\displaystyle \mu( \{ (x_n)_{n \in {\bf Z}}: x_{h_j} = \epsilon_j \forall 1 \leq j \leq k \} )

\displaystyle = p\!-\!\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N 1_{f(n+h_1)=\epsilon_1} \dots 1_{f(n+h_k)=\epsilon_k}

for any distinct integers {h_1,\dots,h_k} and signs {\epsilon_1,\dots,\epsilon_k}. One can check that this defines a premeasure on the Boolean algebra of {\{-1,+1\}^{\bf Z}} defined by cylinder sets, and the existence of {\mu} then follows from the Hahn-Kolmogorov extension theorem (or the closely related Kolmogorov extension theorem). One can then check that the correspondence (2) holds, and that {\mu} is translation-invariant; the latter comes from the translation invariance of the (Banach-)Cesàro averaging operation {f \mapsto p\!-\!\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n)}. A variant of this construction shows that the Furstenberg limit is unique up to equivalence if and only if all the limits appearing in (1) actually exist.
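The cylinder-set formula above can be imitated at finite {N} by simply counting sign patterns. The following Python sketch does this, with the Banach limit replaced by a plain finite average (an assumption made only so that the code is computable); the function name `cylinder_frequencies` and the alternating test sequence are likewise hypothetical choices.

```python
from collections import Counter

def cylinder_frequencies(f, shifts, N):
    """Empirical frequency of each sign pattern (f(n+h_1), ..., f(n+h_k)), 1 <= n <= N."""
    counts = Counter(tuple(f(n + h) for h in shifts) for n in range(1, N + 1))
    return {pattern: c / N for pattern, c in counts.items()}

# Illustrative sequence (an assumption of this sketch): f(n) = (-1)^n.
def alternating(n):
    return 1 if n % 2 == 0 else -1

mu_01 = cylinder_frequencies(alternating, (0, 1), 1000)
mu_12 = cylinder_frequencies(alternating, (1, 2), 1000)   # same cylinder, shifted by one
```

Here `mu_01` and `mu_12` agree exactly, which is the finite-{N} shadow of translation invariance; for a general sequence the two would only agree up to an error of size {O(1/N)}.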

One can obtain a slightly tighter correspondence by using a smoother average than the Cesàro average. For instance, one can use the logarithmic Cesàro averages {\lim_{N \rightarrow \infty} \frac{1}{\log N}\sum_{n=1}^N \frac{f(n)}{n}} in place of the Cesàro average {\lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N f(n)}, thus one replaces (2) by

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x)

\displaystyle = p\!-\!\lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{f(n+h_1) \dots f(n+h_k)}{n}.

Whenever the Cesàro average of a bounded sequence {f: {\bf N} \rightarrow {\bf R}} exists, the logarithmic Cesàro average also exists and is equal to the Cesàro average. Thus, a Furstenberg limit constructed using logarithmic Banach-Cesàro averaging still obeys (1) for all {h_1,\dots,h_k} when the right-hand side limit exists, but also obeys the more general assertion

\displaystyle \int_X F(T^{h_1} x) \dots F(T^{h_k} x)\ d\mu(x)

\displaystyle = \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{f(n+h_1) \dots f(n+h_k)}{n}

whenever the limit of the right-hand side exists.
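To see concretely that logarithmic averaging can converge when ordinary averaging does not, one can experiment with a sequence that is constant on dyadic blocks. The sketch below is an illustration only: the sequence `dyadic_sign` (equal to {+1} on {[2^j, 2^{j+1})} for even {j} and {-1} for odd {j}) and the helper names are assumptions introduced here, not objects from the discussion above.

```python
from math import log

def dyadic_sign(n):
    """+1 on [2^j, 2^(j+1)) for even j, -1 for odd j (illustrative sequence)."""
    return 1 if (n.bit_length() - 1) % 2 == 0 else -1

def cesaro_average(f, N):
    return sum(f(n) for n in range(1, N + 1)) / N

def log_average(f, N):
    return sum(f(n) / n for n in range(1, N + 1)) / log(N)

# Sample at the ends of two consecutive dyadic blocks.
N_even, N_odd = 2**16 - 1, 2**17 - 1
cesaro_values = (cesaro_average(dyadic_sign, N_even), cesaro_average(dyadic_sign, N_odd))
log_values = (log_average(dyadic_sign, N_even), log_average(dyadic_sign, N_odd))
```

The ordinary averages sit near {-1/3} and {+1/3} at the two sample points, whereas the logarithmic averages are small at both, since each dyadic block contributes roughly {\pm \log 2} to a sum that is then divided by {\log N}.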

In a recent paper of Frantzikinakis, the Furstenberg limits of the Liouville function {\lambda} (with logarithmic averaging) were studied. Some (but not all) of the known facts and conjectures about the Liouville function can be interpreted in the Furstenberg limit. For instance, in a recent breakthrough result of Matomaki and Radziwill (discussed previously here), it was shown that the Liouville function exhibits cancellation on short intervals in the sense that

\displaystyle \lim_{H \rightarrow \infty} \limsup_{X \rightarrow \infty} \frac{1}{X} \int_X^{2X} \frac{1}{H} |\sum_{x \leq n \leq x+H} \lambda(n)|\ dx = 0.

In terms of Furstenberg limits of the Liouville function, this assertion is equivalent to the assertion that

\displaystyle \lim_{H \rightarrow \infty} \int_X |\frac{1}{H} \sum_{h=1}^H F(T^h x)|\ d\mu(x) = 0

for all Furstenberg limits {T \circlearrowright (X,\mu), F} of Liouville (including those without logarithmic averaging). Invoking the mean ergodic theorem (discussed in this previous post), this assertion is in turn equivalent to the observable {F} that corresponds to the Liouville function being orthogonal to the invariant factor {L^\infty(X,\mu)^{\bf Z} = \{ g \in L^\infty(X,\mu): g \circ T = g \}} of {X}; equivalently, the first Gowers-Host-Kra seminorm {\|F\|_{U^1(X)}} of {F} (as defined for instance in this previous post) vanishes. The Chowla conjecture, which asserts that

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N \lambda(n+h_1) \dots \lambda(n+h_k) = 0

for all distinct integers {h_1,\dots,h_k}, is equivalent to the assertion that all the Furstenberg limits of Liouville are equivalent to the Bernoulli system ({\{-1,+1\}^{\bf Z}} with the product measure arising from the uniform distribution on {\{-1,+1\}}, with the shift {T} and observable {F} as before). Similarly, the logarithmically averaged Chowla conjecture

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(n+h_1) \dots \lambda(n+h_k)}{n} = 0

is equivalent to the assertion that all the Furstenberg limits of Liouville with logarithmic averaging are equivalent to the Bernoulli system. Recently, I was able to prove the two-point version

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(n) \lambda(n+h)}{n} = 0 \ \ \ \ \ (3)


of the logarithmically averaged Chowla conjecture, for any non-zero integer {h}; this is equivalent to the perfect strong mixing property

\displaystyle \int_X F(x) F(T^h x)\ d\mu(x) = 0

for any Furstenberg limit of Liouville with logarithmic averaging, and any {h \neq 0}.
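For readers who wish to experiment with the logarithmically averaged two-point correlation in (3) at finite {N}, here is a small Python sketch. The sieve-based helper `liouville_table` and the cutoff {N = 10^5} are arbitrary illustrative choices, and no particular numerical value is being claimed; the finite average merely gives a feel for the quantity whose limit is asserted to vanish.

```python
from math import log

def liouville_table(limit):
    """Return a list lam with lam[n] = (-1)^Omega(n) for 1 <= n <= limit (lam[0] unused)."""
    spf = list(range(limit + 1))              # spf[n] will hold the smallest prime factor of n
    for p in range(2, int(limit**0.5) + 1):
        if spf[p] == p:                       # p is prime
            for m in range(p * p, limit + 1, p):
                if spf[m] == m:
                    spf[m] = p
    lam = [0] * (limit + 1)
    lam[1] = 1
    for n in range(2, limit + 1):
        lam[n] = -lam[n // spf[n]]            # complete multiplicativity: lambda(p m) = -lambda(m)
    return lam

def log_two_point(lam, h, N):
    """(1/log N) * sum_{n <= N} lam(n) lam(n+h) / n; requires len(lam) > N + h."""
    return sum(lam[n] * lam[n + h] / n for n in range(1, N + 1)) / log(N)

N = 10**5                                     # arbitrary cutoff for this sketch
lam = liouville_table(N + 1)
value = log_two_point(lam, 1, N)
```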

The situation is more delicate with regard to the Sarnak conjecture, which is equivalent to the assertion that

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^N \lambda(n) f(n) = 0

for any zero-entropy sequence {f: {\bf N} \rightarrow {\bf R}} (see this previous blog post for more discussion). Morally speaking, this conjecture should be equivalent to the assertion that any Furstenberg limit of Liouville is disjoint from any zero entropy system, but I was not able to formally establish an implication in either direction due to some technical issues regarding the fact that the Furstenberg limit does not directly control long-range correlations, only short-range ones. (There are however ergodic theoretic interpretations of the Sarnak conjecture that involve the notion of generic points; see this paper of El Abdalaoui, Lemanczyk, and de la Rue.) But the situation is currently better with the logarithmically averaged Sarnak conjecture

\displaystyle \lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(n) f(n)}{n} = 0,

as I was able to show that this conjecture was equivalent to the logarithmically averaged Chowla conjecture, and hence to all Furstenberg limits of Liouville with logarithmic averaging being Bernoulli; I also showed the conjecture was equivalent to local Gowers uniformity of the Liouville function, which is in turn equivalent to the function {F} having all Gowers-Host-Kra seminorms vanishing in every Furstenberg limit with logarithmic averaging. In this recent paper of Frantzikinakis, this analysis was taken further, showing that the logarithmically averaged Chowla and Sarnak conjectures were in fact equivalent to the much milder seeming assertion that all Furstenberg limits with logarithmic averaging were ergodic.

Actually, the logarithmically averaged Furstenberg limits have more structure than just a {{\bf Z}}-action on a measure preserving system {(X,\mu)} with a single observable {F}. Let {Aff_+({\bf Z})} denote the semigroup of affine maps {n \mapsto an+b} on the integers with {a,b \in {\bf Z}} and {a} positive. Also, let {\hat {\bf Z}} denote the profinite integers (the inverse limit of the cyclic groups {{\bf Z}/q{\bf Z}}). Observe that {Aff_+({\bf Z})} acts on {\hat {\bf Z}} by taking the inverse limit of the obvious actions of {Aff_+({\bf Z})} on {{\bf Z}/q{\bf Z}}.

Proposition 1 (Enriched logarithmically averaged Furstenberg limit of Liouville) Let {p\!-\!\lim} be a Banach limit. Then there exists a probability space {(X,\mu)} with an action {\phi \mapsto T^\phi} of the affine semigroup {Aff_+({\bf Z})}, as well as measurable functions {F: X \rightarrow \{-1,+1\}} and {M: X \rightarrow \hat {\bf Z}}, with the following properties:

  • (i) (Affine Furstenberg limit) For any {\phi_1,\dots,\phi_k \in Aff_+({\bf Z})}, and any congruence class {a\ (q)}, one has

    \displaystyle p\!-\!\lim_{N \rightarrow \infty} \frac{1}{\log N} \sum_{n=1}^N \frac{\lambda(\phi_1(n)) \dots \lambda(\phi_k(n)) 1_{n = a\ (q)}}{n}

    \displaystyle = \int_X F( T^{\phi_1}(x) ) \dots F( T^{\phi_k}(x) ) 1_{M(x) = a\ (q)}\ d\mu(x).

  • (ii) (Equivariance of {M}) For any {\phi \in Aff_+({\bf Z})}, one has

    \displaystyle M( T^\phi(x) ) = \phi( M(x) )

    for {\mu}-almost every {x \in X}.

  • (iii) (Multiplicativity at fixed primes) For any prime {p}, one has

    \displaystyle F( T^{p\cdot} x ) = - F(x)

    for {\mu}-almost every {x \in X}, where {p \cdot \in Aff_+({\bf Z})} is the dilation map {n \mapsto pn}.

  • (iv) (Measure pushforward) If {\phi \in Aff_+({\bf Z})} is of the form {\phi(n) = an+b} and {S_\phi \subset X} is the set {S_\phi = \{ x \in X: M(x) \in \phi(\hat {\bf Z}) \}}, then the pushforward {T^\phi_* \mu} of {\mu} by {T^\phi} is equal to {a \mu\downharpoonright_{S_\phi}}, that is to say one has

    \displaystyle \mu( (T^\phi)^{-1}(E) ) = a \mu( E \cap S_\phi )

    for every measurable {E \subset X}.

Note that {{\bf Z}} can be viewed as the subgroup of {Aff_+({\bf Z})} consisting of the translations {n \mapsto n + b}. If one only keeps the {{\bf Z}}-portion of the {Aff_+({\bf Z})} action and forgets the rest (as well as the function {M}) then the action becomes measure-preserving, and we recover an ordinary Furstenberg limit with logarithmic averaging. However, the additional structure here can be quite useful; for instance, one can transfer the proof of (3) to this setting, which we sketch below the fold, after proving the proposition.

The observable {M}, roughly speaking, means that points {x} in the Furstenberg limit {X} constructed by this proposition are still “virtual integers” in the sense that one can meaningfully compute the residue class of {x} modulo any natural number modulus {q}, by first applying {M} and then reducing mod {q}. The action of {Aff_+({\bf Z})} means that one can also meaningfully multiply {x} by any natural number, and translate it by any integer. As with other applications of the correspondence principle, the main advantage of moving to this more “virtual” setting is that one now acquires a probability measure {\mu}, so that the tools of ergodic theory can be readily applied.
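Properties (ii) and (iii) of the proposition are abstractions of elementary facts about honest integers, which can be checked directly. The following Python sketch verifies the integer-level analogues only (it says nothing about the limit object itself); the helper names `big_omega`, `liouville` and `affine`, and the particular map {n \mapsto 3n+2} and modulus {10}, are illustrative assumptions.

```python
def big_omega(n):
    """Number of prime factors of n, counted with multiplicity (trial division)."""
    count, p = 0, 2
    while p * p <= n:
        while n % p == 0:
            n //= p
            count += 1
        p += 1
    return count + (1 if n > 1 else 0)

def liouville(n):
    return -1 if big_omega(n) % 2 else 1

def affine(a, b):
    """The map n -> a*n + b with a positive, as in Aff_+(Z)."""
    return lambda n: a * n + b

# Integer analogue of (iii): lambda(p n) = -lambda(n) for a prime p.
sign_flip_ok = all(liouville(p * n) == -liouville(n)
                   for p in (2, 3, 5, 7) for n in range(1, 500))

# Integer analogue of (ii): reduction mod q commutes with an affine map.
phi, q = affine(3, 2), 10
equivariance_ok = all(phi(n) % q == phi(n % q) % q for n in range(-200, 200))
```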


Given a random variable {X} that takes on only finitely many values, we can define its Shannon entropy by the formula

\displaystyle  H(X) := \sum_x \mathbf{P}(X=x) \log \frac{1}{\mathbf{P}(X=x)}

with the convention that {0 \log \frac{1}{0} = 0}. (In some texts, one uses the logarithm to base {2} rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables {X,Y} taking on finitely many values, the joint variable {(X,Y)} is also a random variable taking on finitely many values, and also has an entropy {H(X,Y)}. It obeys the Shannon inequalities

\displaystyle  H(X), H(Y) \leq H(X,Y) \leq H(X) + H(Y)

so we can define some further nonnegative quantities, the mutual information

\displaystyle  I(X:Y) := H(X) + H(Y) - H(X,Y)

and the conditional entropies

\displaystyle  H(X|Y) := H(X,Y) - H(Y); \quad H(Y|X) := H(X,Y) - H(X).

More generally, given three random variables {X,Y,Z}, one can define the conditional mutual information

\displaystyle  I(X:Y|Z) := H(X|Z) + H(Y|Z) - H(X,Y|Z)

and the last of the Shannon entropy inequalities asserts that this quantity is also non-negative.

The mutual information {I(X:Y)} is a measure of the extent to which {X} and {Y} fail to be independent; indeed, it is not difficult to show that {I(X:Y)} vanishes if and only if {X} and {Y} are independent. Similarly, {I(X:Y|Z)} vanishes if and only if {X} and {Y} are conditionally independent relative to {Z}. At the other extreme, {H(X|Y)} is a measure of the extent to which {X} fails to depend on {Y}; indeed, it is not difficult to show that {H(X|Y)=0} if and only if {X} is determined by {Y} in the sense that there is a deterministic function {f} such that {X = f(Y)}. In a related vein, if {X} and {X'} are equivalent in the sense that there are deterministic functional relationships {X = f(X')}, {X' = g(X)} between the two variables, then {X} is interchangeable with {X'} for the purposes of computing the above quantities, thus for instance {H(X) = H(X')}, {H(X,Y) = H(X',Y)}, {I(X:Y) = I(X':Y)}, {I(X:Y|Z) = I(X':Y|Z)}, etc.
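The definitions above translate directly into code. Here is a minimal Python sketch, under the assumption (made only for this illustration) that a joint law is stored as a dictionary from outcome tuples to probabilities and that variables are referred to by tuples of coordinate indices; the function names are hypothetical.

```python
from collections import defaultdict
from math import log

def entropy(dist):
    """Shannon entropy (natural log) of a dict {outcome: probability}."""
    return sum(p * log(1 / p) for p in dist.values() if p > 0)

def marginal(joint, coords):
    """Marginal law of the selected coordinates of a joint law on tuples."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in coords)] += p
    return dict(out)

def H(joint, coords):
    return entropy(marginal(joint, coords))

def mutual_information(joint, xs, ys):
    return H(joint, xs) + H(joint, ys) - H(joint, xs + ys)

def conditional_entropy(joint, xs, ys):
    return H(joint, xs + ys) - H(joint, ys)

# Two independent fair bits (X, Y), and a pair (X, X) in which Y is a copy of X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copied = {(x, x): 0.5 for x in (0, 1)}
LOG2 = log(2)
```

With these two laws one sees both extremes: `mutual_information(independent, (0,), (1,))` vanishes, while for `copied` the conditional entropy vanishes and the mutual information equals the full entropy {\log 2}.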

One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables {X} being considered come from restricting a single random (and uniformly distributed) boolean function {F: \Omega \rightarrow \{0,1\}} on a given finite domain {\Omega} to some subset {A} of {\Omega}:

\displaystyle  X = F \downharpoonright_A.

In this case, {X} has the law of a random uniformly distributed boolean function from {A} to {\{0,1\}}, and the entropy here can be easily computed to be {|A| \log 2}, where {|A|} denotes the cardinality of {A}. If {X} is the restriction of {F} to {A}, and {Y} is the restriction of {F} to {B}, then the joint variable {(X,Y)} is equivalent to the restriction of {F} to {A \cup B}. If one discards the normalisation factor {\log 2}, one then obtains the following dictionary between entropy and the combinatorics of finite sets:

Random variables {X,Y,Z}                           Finite sets {A,B,C}
Entropy {H(X)}                                     Cardinality {|A|}
Joint variable {(X,Y)}                             Union {A \cup B}
Mutual information {I(X:Y)}                        Intersection cardinality {|A \cap B|}
Conditional entropy {H(X|Y)}                       Set difference cardinality {|A \backslash B|}
Conditional mutual information {I(X:Y|Z)}          {|(A \cap B) \backslash C|}
{X, Y} independent                                 {A, B} disjoint
{X} determined by {Y}                              {A} a subset of {B}
{X,Y} conditionally independent relative to {Z}    {A \cap B \subset C}

Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality {H(X,Y) \leq H(X)+H(Y)} becomes the union bound {|A \cup B| \leq |A| + |B|}, and the definition of mutual information becomes the inclusion-exclusion formula

\displaystyle  |A \cap B| = |A| + |B| - |A \cup B|.
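The dictionary between entropy and cardinality can be checked by brute force on a small domain. The sketch below enumerates all boolean functions on a four-element set and computes entropies of restrictions exactly, in units of {\log 2}; the domain `OMEGA`, the subsets `A`, `B` and the helper name `restriction_entropy` are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import log

OMEGA = (0, 1, 2, 3)   # illustrative finite domain

def restriction_entropy(*subsets):
    """Entropy, in units of log 2, of the joint restriction of a uniformly random
    boolean function F on OMEGA to the given subsets."""
    functions = list(product((0, 1), repeat=len(OMEGA)))   # all 2^|OMEGA| functions
    counts = Counter(
        tuple(tuple(F[i] for i in sorted(S)) for S in subsets) for F in functions
    )
    total = len(functions)
    return sum((c / total) * log(total / c) for c in counts.values()) / log(2)

A, B = {0, 1, 2}, {2, 3}
h_A, h_B, h_AB = restriction_entropy(A), restriction_entropy(B), restriction_entropy(A, B)
mutual_info = h_A + h_B - h_AB      # should match the cardinality of A ∩ B
cond_entropy = h_AB - h_B           # should match the cardinality of A \ B
```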

For a more advanced example, consider the data processing inequality that asserts that if {X, Z} are conditionally independent relative to {Y}, then {I(X:Z) \leq I(X:Y)}. Specialising to sets, this now says that if {A, C} are disjoint outside of {B}, then {|A \cap C| \leq |A \cap B|}; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if {A} and {C} are not necessarily disjoint outside of {B}, then a consideration of Venn diagrams gives the more general inequality

\displaystyle  |A \cap C| \leq |A \cap B| + |(A \cap C) \backslash B|

and a further inspection of the diagram then reveals the more precise identity

\displaystyle  |A \cap C| + |(A \cap B) \backslash C| = |A \cap B| + |(A \cap C) \backslash B|.

Using the dictionary in the reverse direction, one is then led to conjecture the identity

\displaystyle  I( X : Z ) + I( X : Y | Z ) = I( X : Y ) + I( X : Z | Y )

which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
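The identity {I(X:Z) + I(X:Y|Z) = I(X:Y) + I(X:Z|Y)} holds for arbitrary random variables, and it is easy to sanity-check numerically. The Python sketch below does so on one randomly generated joint law of three bits; the representation of joint laws as dictionaries, the helper names, and the random seed are all assumptions made for this illustration.

```python
import random
from collections import defaultdict
from math import log

def H(joint, coords):
    """Entropy (natural log) of the chosen coordinates of a joint law on tuples."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in coords)] += p
    return sum(p * log(1 / p) for p in marg.values() if p > 0)

def I(joint, xs, ys, zs=()):
    """Conditional mutual information I(xs : ys | zs); zs=() gives I(xs : ys)."""
    return H(joint, xs + zs) + H(joint, ys + zs) - H(joint, xs + ys + zs) - H(joint, zs)

rng = random.Random(1)   # fixed seed, arbitrary choice
weights = {(x, y, z): rng.random() + 0.01
           for x in (0, 1) for y in (0, 1) for z in (0, 1)}
total = sum(weights.values())
joint = {outcome: w / total for outcome, w in weights.items()}

X, Y, Z = (0,), (1,), (2,)
left = I(joint, X, Z) + I(joint, X, Y, Z)
right = I(joint, X, Y) + I(joint, X, Z, Y)
```

Both sides are two ways of expanding {I(X:(Y,Z))}, which is why they agree up to rounding error.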

On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set {C} is preserved by unions:

\displaystyle  A \cap C = B \cap C = \emptyset \implies (A \cup B) \cap C = \emptyset.

Indeed, one has the union bound

\displaystyle  |(A \cup B) \cap C| \leq |A \cap C| + |B \cap C|. \ \ \ \ \ (1)

Applying the dictionary in the reverse direction, one might now conjecture that if {X} was independent of {Z} and {Y} was independent of {Z}, then {(X,Y)} should also be independent of {Z}, and furthermore that

\displaystyle  I(X,Y:Z) \leq I(X:Z) + I(Y:Z)

but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take {X, Y \in {\bf F}_2} to be independent, uniformly distributed random elements of the finite field {{\bf F}_2} of two elements, and take {Z := X+Y} to be the sum of these two field elements. One can easily check that each of {X} and {Y} is separately independent of {Z}, but the joint variable {(X,Y)} determines {Z} and thus is not independent of {Z}.
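The counterexample with {Z = X+Y} over {{\bf F}_2} can be confirmed by a direct computation. In the sketch below, outcomes are triples {(X,Y,Z)}; the dictionary representation and the helper names `H` and `I` are assumptions of this illustration.

```python
from collections import defaultdict
from math import log

def H(joint, coords):
    """Entropy (natural log) of the chosen coordinates of a joint law on tuples."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in coords)] += p
    return sum(p * log(1 / p) for p in marg.values() if p > 0)

def I(joint, xs, ys):
    return H(joint, xs) + H(joint, ys) - H(joint, xs + ys)

# X, Y are independent fair bits and Z = X + Y in F_2.
joint = {(x, y, (x + y) % 2): 0.25 for x in (0, 1) for y in (0, 1)}

i_xz = I(joint, (0,), (2,))        # X against Z
i_yz = I(joint, (1,), (2,))        # Y against Z
i_xy_z = I(joint, (0, 1), (2,))    # the pair (X, Y) against Z
LOG2 = log(2)
```

The first two quantities vanish while the third equals {\log 2}, so the conjectured inequality {I(X,Y:Z) \leq I(X:Z) + I(Y:Z)} fails for this triple.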

From the inclusion-exclusion identities

\displaystyle  |A \cap C| = |A| + |C| - |A \cup C|

\displaystyle  |B \cap C| = |B| + |C| - |B \cup C|

\displaystyle  |(A \cup B) \cap C| = |A \cup B| + |C| - |A \cup B \cup C|

\displaystyle  |A \cap B \cap C| = |A| + |B| + |C| - |A \cup B| - |B \cup C| - |A \cup C|

\displaystyle + |A \cup B \cup C|

one can check that (1) is equivalent to the trivial lower bound {|A \cap B \cap C| \geq 0}. The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection {A \cap B \cap C}. (Even the double intersection {A \cap B} only exists information theoretically in a “virtual” sense; the mutual information {I(X:Y)} allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)

However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities {H(X), H(Y), H(X,Y), I(X:Y), H(X|Y), H(Y|X)} associated to just two variables {X,Y} are those that are also necessarily obeyed by their combinatorial analogues {|A|, |B|, |A \cup B|, |A \cap B|, |A \backslash B|, |B \backslash A|}. (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summary of this statement.)

One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let {S} be some finite-dimensional vector space over a finite field {{\mathbf F}}, and let {f: S \rightarrow {\mathbf F}} be a random linear functional on {S}, selected uniformly among all such functions. Every subspace {U} of {S} then gives rise to a random variable {X = X_U: U \rightarrow {\mathbf F}} formed by restricting {f} to {U}. This random variable is also distributed uniformly amongst all linear functions on {U}, and its entropy can be easily computed to be {\mathrm{dim}(U) \log |\mathbf{F}|}. Given two random variables {X, Y} formed by restricting {f} to {U, V} respectively, the joint random variable {(X,Y)} determines the random linear function {f} on the union {U \cup V} of the two spaces, and thus by linearity on the Minkowski sum {U+V} as well; thus {(X,Y)} is equivalent to the restriction of {f} to {U+V}. In particular, {H(X,Y) = \mathrm{dim}(U+V) \log |\mathbf{F}|}. This implies that {I(X:Y) = \mathrm{dim}(U \cap V) \log |\mathbf{F}|} and also {H(X|Y) = \mathrm{dim}(\pi_V(U)) \log |\mathbf{F}|}, where {\pi_V: S \rightarrow S/V} is the quotient map. After discarding the normalising constant {\log |\mathbf{F}|}, this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:

Random variables {X,Y,Z}                           Subspaces {U,V,W}
Entropy {H(X)}                                     Dimension {\mathrm{dim}(U)}
Joint variable {(X,Y)}                             Sum {U+V}
Mutual information {I(X:Y)}                        Dimension of intersection {\mathrm{dim}(U \cap V)}
Conditional entropy {H(X|Y)}                       Dimension of projection {\mathrm{dim}(\pi_V(U))}
Conditional mutual information {I(X:Y|Z)}          {\mathrm{dim}(\pi_W(U) \cap \pi_W(V))}
{X, Y} independent                                 {U, V} transverse ({U \cap V = \{0\}})
{X} determined by {Y}                              {U} a subspace of {V}
{X,Y} conditionally independent relative to {Z}    {\pi_W(U)}, {\pi_W(V)} transverse.

The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking {S} to be the vector space {\mathbf{F}_2^\Omega} over the finite field {\mathbf{F}_2} of two elements, and only considering those subspaces {U} that are coordinate subspaces {U = {\bf F}_2^A} associated to various subsets {A} of {\Omega}.

As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of {\log |\mathbf{F}|}). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).

The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces {U,V,W} such that {U} and {V} are separately transverse to {W}, but their sum {U+V} is not; for instance, in a two-dimensional vector space {{\bf F}^2}, one can take {U,V,W} to be the one-dimensional subspaces spanned by {(0,1)}, {(1,0)}, and {(1,1)} respectively. Note that this is essentially the same counterexample from before (which took {{\bf F}} to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces {U,V,W} (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables {X,Y,Z} (as well as the corresponding quantities on the above table).
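The transversality example above can be verified mechanically over {{\bf F}_2}. In the Python sketch below, vectors are encoded as integer bitmasks and subspaces as lists of spanning vectors; this encoding and the helper names `rank_gf2`, `dim_sum`, `dim_intersection` are assumptions of the illustration, not notation from the text.

```python
def rank_gf2(vectors):
    """Rank over F_2 of vectors encoded as integer bitmasks (Gaussian elimination)."""
    basis = []                       # reduced vectors with distinct leading bits
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)        # clears the leading bit of b from v if it is set
        if v:
            basis.append(v)
    return len(basis)

def dim_sum(*spanning_sets):
    """Dimension of the sum of the subspaces spanned by the given lists of vectors."""
    return rank_gf2([v for vectors in spanning_sets for v in vectors])

def dim_intersection(U, V):
    """dim(U ∩ V) = dim(U) + dim(V) - dim(U + V)."""
    return dim_sum(U) + dim_sum(V) - dim_sum(U, V)

# Vectors in F_2^2 as bitmasks: (0,1) -> 0b01, (1,0) -> 0b10, (1,1) -> 0b11.
U, V, W = [0b01], [0b10], [0b11]
dim_U_cap_W = dim_intersection(U, W)         # U is transverse to W
dim_V_cap_W = dim_intersection(V, W)         # V is transverse to W
dim_UV_cap_W = dim_intersection(U + V, W)    # the list U + V spans the sum of U and V
```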

However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality

\displaystyle  \mathrm{dim}(U \cap V) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V)) + \mathrm{dim}(\pi_X(U) \cap \pi_X(V)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (2)

for any subspaces {U,V,W,X}. This is easiest to see when the three terms on the right-hand side vanish; then {\pi_W(U), \pi_W(V)} are transverse, which implies that {U\cap V \subset W}; similarly {U \cap V \subset X}. But {W} and {X} are transverse, and this clearly implies that {U} and {V} are themselves transverse. To prove the general case of Ingleton’s inequality, one can define {Y := U \cap V} and use {\mathrm{dim}(\pi_W(Y)) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V))} (and similarly for {X} instead of {W}) to reduce to establishing the inequality

\displaystyle  \mathrm{dim}(Y) \leq \mathrm{dim}(\pi_W(Y)) + \mathrm{dim}(\pi_X(Y)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (3)

which can be rearranged using {\mathrm{dim}(\pi_W(Y)) = \mathrm{dim}(Y) - \mathrm{dim}(W) + \mathrm{dim}(\pi_Y(W))} (and similarly for {X} instead of {W}) and {\mathrm{dim}(W \cap X) = \mathrm{dim}(W) + \mathrm{dim}(X) - \mathrm{dim}(W + X)} as

\displaystyle  \mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W)) + \mathrm{dim}(\pi_Y(X)) + \mathrm{dim}(Y)

but this is clear since {\mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W) + \pi_Y(X)) + \mathrm{dim}(Y)}.
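Since (2) is a theorem, a randomised check cannot fail, but it is a useful way to confirm that one has transcribed the quantities correctly. The sketch below tests Ingleton's inequality on random subspaces of {{\bf F}_2^5}, using the identity {\mathrm{dim}(\pi_W(U) \cap \pi_W(V)) = \mathrm{dim}(U+W) + \mathrm{dim}(V+W) - \mathrm{dim}(U+V+W) - \mathrm{dim}(W)}, which follows from {\mathrm{dim}(\pi_W(S)) = \mathrm{dim}(S+W) - \mathrm{dim}(W)}. The bitmask encoding, the helper names, the ambient dimension and the seed are all illustrative assumptions.

```python
import random

def rank_gf2(vectors):
    """Rank over F_2 of vectors encoded as integer bitmasks."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)        # clears the leading bit of b from v if it is set
        if v:
            basis.append(v)
    return len(basis)

def dim(*spans):
    """Dimension of the sum of the subspaces spanned by the given lists."""
    return rank_gf2([v for s in spans for v in s])

def dim_cap(U, V):
    return dim(U) + dim(V) - dim(U, V)

def dim_proj_cap(U, V, W):
    """dim(pi_W(U) ∩ pi_W(V)), computed from dimensions of sums."""
    return dim(U, W) + dim(V, W) - dim(U, V, W) - dim(W)

def ingleton_holds(U, V, W, X):
    return dim_cap(U, V) <= dim_proj_cap(U, V, W) + dim_proj_cap(U, V, X) + dim_cap(W, X)

rng = random.Random(0)               # fixed seed, arbitrary choice

def random_span(n_bits=5, max_generators=3):
    return [rng.randrange(1 << n_bits) for _ in range(rng.randint(0, max_generators))]

all_hold = all(
    ingleton_holds(*(random_span() for _ in range(4))) for _ in range(2000)
)
```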

Returning to the entropy setting, the analogue

\displaystyle  H( V ) \leq H( V | Z ) + H(V | W ) + I(Z:W)

of (3) is true (exercise!), but the analogue

\displaystyle  I(X:Y) \leq I(X:Y|Z) + I(X:Y|W) + I(Z:W) \ \ \ \ \ (4)

of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then {X,Y} are conditionally independent relative to {Z}, and relative to {W}, and {Z} and {W} are independent, and the claim (4) would then be asserting that {X} and {Y} are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take {Z,W} to be independent uniform variables from {\mathbf{F}_2}, and take {X} and {Y} to be (say) {ZW} and {(1-Z)(1-W)} respectively (thus {X, Y} are the indicators of the events {Z=W=1} and {Z=W=0} respectively). Once one conditions on either {Z} or {W}, one of {X,Y} has positive conditional entropy and the other has zero entropy, and so {X, Y} are conditionally independent relative to either {Z} or {W}; also, {Z} and {W} are independent of each other. But {X} and {Y} are not independent of each other (they cannot be simultaneously equal to {1}). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces {U, V} has a well-defined intersection {U \cap V} that is also a subspace, whereas for arbitrary random variables {X, Y}, there does not necessarily exist the analogue of an intersection, namely a “common information” random variable {V} that has the entropy of {I(X:Y)} and is determined by each of {X} and {Y} separately.
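The non-linear counterexample to (4) is small enough to compute exactly. In the Python sketch below, outcomes are quadruples {(X,Y,Z,W)} with {X = ZW} and {Y = (1-Z)(1-W)}; as before, the dictionary representation and the helper names `H` and `I` are assumptions made for this illustration.

```python
from collections import defaultdict
from math import log

def H(joint, coords):
    """Entropy (natural log) of the chosen coordinates of a joint law on tuples."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in coords)] += p
    return sum(p * log(1 / p) for p in marg.values() if p > 0)

def I(joint, xs, ys, zs=()):
    """Conditional mutual information I(xs : ys | zs); zs=() gives I(xs : ys)."""
    return H(joint, xs + zs) + H(joint, ys + zs) - H(joint, xs + ys + zs) - H(joint, zs)

# Z, W are independent fair bits; X = ZW and Y = (1-Z)(1-W).
joint = {(z * w, (1 - z) * (1 - w), z, w): 0.25 for z in (0, 1) for w in (0, 1)}
X, Y, Z, W = (0,), (1,), (2,), (3,)

lhs = I(joint, X, Y)                                                # left-hand side of (4)
rhs = I(joint, X, Y, Z) + I(joint, X, Y, W) + I(joint, Z, W)        # right-hand side of (4)
```

Here `rhs` vanishes (up to rounding) while `lhs` is strictly positive, approximately {0.085} in natural units, so (4) fails for these variables.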

I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality

\displaystyle  I(X:Y) \leq 2 I(X:Y|Z) + I(X:Z|Y) + I(Y:Z|X)

\displaystyle + I(X:Y|W) + I(Z:W).

One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that

\displaystyle  |\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n}| \leq \varepsilon \log x \ \ \ \ \ (5)

whenever {x} was sufficiently large depending on {\varepsilon>0}, where {\lambda} is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale {H} between {1} and {x}, one can form certain random variables {X_H, Y_H}. The random variable {X_H} is a sign pattern of the form {(\lambda(n+1),\dots,\lambda(n+H))} where {n} is a random number chosen from {1} to {x} (with logarithmic weighting). The random variable {Y_H} is the tuple {(n \hbox{ mod } p)_{p \sim \varepsilon^2 H}} of reductions of {n} modulo primes {p} comparable to {\varepsilon^2 H}. Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of {\lambda}, the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there is a lower bound

\displaystyle  I( X_H : Y_H ) \gg \varepsilon^7 \frac{H}{\log H}

on the mutual information between {X_H} and {Y_H}. From translation invariance, this also gives the more general lower bound

\displaystyle  I( X_{H_0,H} : Y_H ) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (6)

for any {H_0}, where {X_{H_0,H}} denotes the shifted sign pattern {(\lambda(n+H_0+1),\dots,\lambda(n+H_0+H))}. On the other hand, one had the entropy bounds

\displaystyle  H( X_{H_0,H} ), H(Y_H) \ll H

and from concatenating sign patterns one could see that {X_{H_0,H+H'}} is equivalent to the joint random variable {(X_{H_0,H}, X_{H_0+H,H'})} for any {H_0,H,H'}. Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once {H} was allowed to become sufficiently large compared to {\varepsilon}, but the bound was quite weak (coming ultimately from the unboundedness of {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}} as the interval {[H_-,H_+]} of values of {H} under consideration becomes large), something of the order of {H \sim \exp\exp\exp(\varepsilon^{-7})}; the quantity {H} needs at various junctures to be less than a small power of {\log x}, so the relationship between {x} and {\varepsilon} becomes essentially quadruple exponential in nature, {x \sim \exp\exp\exp\exp(\varepsilon^{-7})}. The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate {H(X_{kH})/kH} of the mean entropy, in that this quantity decreased by {\gg \frac{\varepsilon^7}{\log H}} as {k} increased from {1} to {\log H}, basically by dividing {X_{kH}} into {k} components {X_{jH, H}}, {j=0,\dots,k-1} and observing from (6) each of these shares a bit of common information with the same variable {Y_H}. This is relatively clear when one works in a set model, in which {Y_H} is modeled by a set {B_H} of size {O(H)}, and {X_{H_0,H}} is modeled by a set of the form

\displaystyle  X_{H_0,H} = \bigcup_{H_0 < h \leq H_0+H} A_h

for various sets {A_h} of size {O(1)} (also there is some translation symmetry that maps {A_h} to a shift {A_{h+1}} while preserving all of the {B_H}).

However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables {Y_H} are basically jointly independent as {H} ranges over dyadic values that are much smaller than {\log x}, which in the set model corresponds to the {B_H} all being disjoint. One can then establish a variant

\displaystyle  I( X_{H_0,H} : Y_H | (Y_{H'})_{H' < H}) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (7)

of (6), which in the set model roughly speaking asserts that each {B_H} claims a portion of the {\bigcup_{H_0 < h \leq H_0+H} A_h} of cardinality {\gg \varepsilon^7 \frac{H}{\log H}} that is not claimed by previous choices of {B_H}. This leads to a more efficient contradiction (relying on the unboundedness of {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j}} rather than {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}) that looks like it removes one order of exponential growth, thus the relationship between {x} and {\varepsilon} is now {x \sim \exp\exp\exp(\varepsilon^{-7})}. Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form

\displaystyle  \frac{1}{2H} H(X_{2H} | (Y_{H'})_{H' \leq 2H}) \leq \frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H}) - \frac{c \varepsilon^7}{\log H}

for a small constant {c>0}, which on iterating and using the boundedness of {\frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H})} gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges {H} greater than {\exp( (\log\log x)^{\varepsilon^{7}} )} or so, although at this range the theorem is not significantly simpler than the general case).
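To see where the exponential dependence comes from, here is a minimal numerical sketch of the iteration, under a hypothetical normalisation: the mean conditional entropy is taken to start at {1}, and the parameter `kappa` stands in for {c\varepsilon^7} (absorbing the factor of {\log 2} that relates {\log H} to the dyadic index {j}). The function name and the starting index `j0` are illustrative choices, not taken from the argument itself.

```python
import math

def dyadic_steps_until_contradiction(kappa: float, j0: int) -> int:
    """Count doublings H -> 2H before the entropy decrement exhausts the budget.

    Hypothetical normalisation: the mean conditional entropy starts at 1,
    must stay non-negative, and drops by kappa / j at the scale H = 2**j.
    """
    budget = 1.0
    j = j0
    steps = 0
    while budget >= 0.0:
        budget -= kappa / j  # decrement ~ c * eps^7 / log H, with log H ~ j
        j += 1
        steps += 1
    return steps

if __name__ == "__main__":
    for kappa in (0.5, 0.3, 0.2):
        steps = dyadic_steps_until_contradiction(kappa, j0=10)
        # Compare with the heuristic count j0 * exp(1 / kappa).
        print(kappa, steps, 10 * math.exp(1 / kappa))
```

Since the harmonic sum only grows logarithmically, the number of dyadic scales needed is roughly exponential in {1/\kappa}, which is consistent with {H} being doubly exponential in {\varepsilon^{-7}} in this model.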

Daniel Kane and I have just uploaded to the arXiv our paper “A bound on partitioning clusters”, submitted to the Electronic Journal of Combinatorics. In this short and elementary paper, we consider a question that arose from biomathematical applications: given a finite family {X} of sets (or “clusters”), how many ways can there be of partitioning a set {A \in X} in this family as the disjoint union {A = A_1 \uplus A_2} of two other sets {A_1, A_2} in this family? That is to say, what is the best upper bound one can place on the quantity

\displaystyle | \{ (A,A_1,A_2) \in X^3: A = A_1 \uplus A_2 \}|

in terms of the cardinality {|X|} of {X}? A trivial upper bound would be {|X|^2}, since this is the number of possible pairs {(A_1,A_2)}, and {A_1,A_2} clearly determine {A}. In our paper, we establish the improved bound

\displaystyle | \{ (A,A_1,A_2) \in X^3: A = A_1 \uplus A_2 \}| \leq |X|^{3/p}

where {p} is the somewhat strange exponent

\displaystyle p := \log_3 \frac{27}{4} = 1.73814\dots, \ \ \ \ \ (1)


so that {3/p = 1.72598\dots}. Furthermore, this exponent is best possible!

Actually, the latter claim is quite easy to show: one takes {X} to be all the subsets of {\{1,\dots,n\}} of cardinality either {n/3} or {2n/3}, for {n} a multiple of {3}, and the claim follows readily from Stirling’s formula. So it is perhaps the former claim that is more interesting (since many combinatorial proof techniques, such as those based on inequalities such as the Cauchy-Schwarz inequality, tend to produce exponents that are rational or at least algebraic). We follow the common, though unintuitive, trick of generalising a problem to make it simpler. Firstly, one generalises the bound to the “trilinear” bound

\displaystyle | \{ (A_1,A_2,A_3) \in X_1 \times X_2 \times X_3: A_3 = A_1 \uplus A_2 \}|

\displaystyle \leq |X_1|^{1/p} |X_2|^{1/p} |X_3|^{1/p}

for arbitrary finite collections {X_1,X_2,X_3} of sets. One can place all the sets in {X_1,X_2,X_3} inside a single finite set such as {\{1,\dots,n\}}, and then by replacing every set {A_3} in {X_3} by its complement in {\{1,\dots,n\}}, one can phrase the inequality in the equivalent form

\displaystyle | \{ (A_1,A_2,A_3) \in X_1 \times X_2 \times X_3: \{1,\dots,n\} =A_1 \uplus A_2 \uplus A_3 \}|

\displaystyle \leq |X_1|^{1/p} |X_2|^{1/p} |X_3|^{1/p}

for arbitrary collections {X_1,X_2,X_3} of subsets of {\{1,\dots,n\}}. We generalise further by turning sets into functions, replacing the estimate with the slightly stronger convolution estimate

\displaystyle f_1 * f_2 * f_3 (1,\dots,1) \leq \|f_1\|_{\ell^p(\{0,1\}^n)} \|f_2\|_{\ell^p(\{0,1\}^n)} \|f_3\|_{\ell^p(\{0,1\}^n)}

for arbitrary functions {f_1,f_2,f_3} on the Hamming cube {\{0,1\}^n}, where the convolution is on the integer lattice {{\bf Z}^n} rather than on the finite field vector space {{\bf F}_2^n}. The advantage of working in this general setting is that it becomes very easy to apply induction on the dimension {n}; indeed, to prove this estimate for arbitrary {n} it suffices to do so for {n=1}. This reduces matters to establishing the elementary inequality

\displaystyle (ab(1-c))^{1/p} + (bc(1-a))^{1/p} + (ca(1-b))^{1/p} \leq 1

for all {0 \leq a,b,c \leq 1}, which can be done by a combination of undergraduate multivariable calculus and a little bit of numerical computation. (The left-hand side turns out to have local maxima at {(1,1,0), (1,0,1), (0,1,1), (2/3,2/3,2/3)}, with the latter being the cause of the numerology (1).)
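The numerology can be spot-checked numerically. The following is only a sketch (a grid search is of course not a proof, and the function names are made up for illustration): it evaluates the left-hand side of the elementary inequality on a grid, and also computes the ratio {\log(\#\hbox{partitions}) / \log |X|} for the extremal family of subsets of {\{1,\dots,n\}} of cardinality {n/3} or {2n/3}, which should approach {3/p} from below as {n} grows.

```python
import math
from itertools import product

P = math.log(27 / 4, 3)  # the exponent p = log_3(27/4)

def lhs(a: float, b: float, c: float) -> float:
    """Left-hand side of the elementary inequality for 0 <= a, b, c <= 1."""
    terms = (a * b * (1 - c), b * c * (1 - a), c * a * (1 - b))
    return sum(t ** (1 / P) for t in terms)

def grid_max(steps: int = 60) -> float:
    """Largest value of lhs on a uniform grid (a spot check, not a proof)."""
    pts = [i / steps for i in range(steps + 1)]
    return max(lhs(a, b, c) for a, b, c in product(pts, repeat=3))

def extremal_exponent(n: int) -> float:
    """log(#partitions) / log|X| for X = subsets of size n/3 or 2n/3 (n divisible by 3)."""
    k = n // 3
    size_x = 2 * math.comb(n, k)  # C(n, n/3) = C(n, 2n/3)
    # Choose A of size 2n/3, then A_1 of size n/3 inside A; A_2 is the rest.
    partitions = math.comb(n, 2 * k) * math.comb(2 * k, k)
    return math.log(partitions) / math.log(size_x)

if __name__ == "__main__":
    print("p =", P, " 3/p =", 3 / P)
    print("grid maximum of the left-hand side:", grid_max())
    for n in (30, 300, 3000):
        print(n, extremal_exponent(n))
```

The convergence of the extremal ratio to {3/p} is slow, because of the polynomial factors in Stirling's formula.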

The same sort of argument also gives an energy bound

\displaystyle E(A,A) \leq |A|^{\log_2 6}

for any subset {A \subset \{0,1\}^n} of the Hamming cube, where

\displaystyle E(A,A) := |\{(a_1,a_2,a_3,a_4) \in A^4: a_1+a_2 = a_3 + a_4 \}|

is the additive energy of {A}. The example {A = \{0,1\}^n} shows that the exponent {\log_2 6} cannot be improved.
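For small {n} the sharpness example can be confirmed by brute force. The sketch below (helper names are illustrative) counts additive quadruples by tallying how often each sum {a_1+a_2} occurs, with the addition taking place in {{\bf Z}^n} rather than {{\bf F}_2^n}.

```python
from collections import Counter
from itertools import product

def additive_energy(points) -> int:
    """Number of (a1, a2, a3, a4) in A^4 with a1 + a2 = a3 + a4, sums taken in Z^n."""
    pts = list(points)
    sums = Counter(
        tuple(x + y for x, y in zip(a, b)) for a in pts for b in pts
    )
    # Each sum attained m times contributes m^2 quadruples.
    return sum(m * m for m in sums.values())

def hamming_cube(n: int):
    """All points of {0,1}^n as tuples."""
    return list(product((0, 1), repeat=n))

if __name__ == "__main__":
    for n in range(1, 6):
        # For the full cube the energy is 6^n = |A|^(log_2 6).
        print(n, additive_energy(hamming_cube(n)), 6 ** n)
```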

The self-chosen remit of my blog is “Updates on my research and expository papers, discussion of open problems, and other maths-related topics”.  Of the 774 posts on this blog, I estimate that about 99% of the posts indeed relate to mathematics, mathematicians, or the administration of this mathematical blog, and only about 1% are not related to mathematics or the community of mathematicians in any significant fashion.

This is not one of the 1%.

Mathematical research is clearly an international activity.  But actually a stronger claim is true: mathematical research is a transnational activity, in that the specific nationality of individual members of a research team or research community is (or should be) of no appreciable significance for the purpose of advancing mathematics.  For instance, even during the height of the Cold War, there was no movement in (say) the United States to boycott Soviet mathematicians or theorems, or to only use results from Western literature (though the latter did sometimes happen by default, due to the limited avenues of information exchange between East and West, and the former did occasionally occur for political reasons, most notably with the Soviet Union preventing Gregory Margulis from traveling to receive his Fields Medal in 1978 EDIT: and also Sergei Novikov in 1970).  The national origin of even the most fundamental components of mathematics, whether it be the geometry (γεωμετρία) of the ancient Greeks, the algebra (الجبر) of the Islamic world, or the Hindu-Arabic numerals {0,1,\dots,9}, is primarily of historical interest, and has only a negligible impact on the worldwide adoption of these mathematical tools. While it is true that individual mathematicians or research teams sometimes compete with each other to be the first to solve some desired problem, and that a citizen could take pride in the mathematical achievements of researchers from their country, one did not see any significant state-sponsored “space races” in which it was deemed in the national interest that a particular result ought to be proven by “our” mathematicians and not “theirs”.

Mathematical research ability is highly non-fungible, and the value added by foreign students and faculty to a mathematics department cannot be completely replaced by an equivalent amount of domestic students and faculty, no matter how large and well educated the country (though a state can certainly work at the margins to encourage and support more domestic mathematicians).  It is no coincidence that all of the top mathematics departments worldwide actively recruit the best mathematicians regardless of national origin, and often retain immigration counsel to assist with situations in which these mathematicians come from a country that is currently politically disfavoured by their own.

Of course, mathematicians cannot ignore the political realities of the modern international order altogether.  Anyone who has organised an international conference or program knows that there will inevitably be visa issues to resolve because the host country makes it particularly difficult for certain nationals to attend the event.  I myself, like many other academics working long-term in the United States, have certainly experienced my own share of immigration bureaucracy, starting with various glitches in the renewal or application of my J-1 and O-1 visas, then to the lengthy vetting process for acquiring permanent residency (or “green card”) status, and finally to becoming naturalised as a US citizen (retaining dual citizenship with Australia).  Nevertheless, while the process could be slow and frustrating, there was at least an order to it.  The rules of the game were complicated, but were known in advance, and did not abruptly change in the middle of playing it (save in truly exceptional situations, such as the days after the September 11 terrorist attacks).  One just had to study the relevant visa regulations (or hire an immigration lawyer to do so), fill out the paperwork and submit to the relevant background checks, and remain in good standing until the application was approved in order to study, work, or participate in a mathematical activity held in another country.  On rare occasion, some senior university administrator may have had to contact a high-ranking government official to approve some particularly complicated application, but for the most part one could work through normal channels in order to ensure for instance that the majority of participants of a conference could actually be physically present at that conference, or that an excellent mathematician hired by unanimous consent by a mathematics department could in fact legally work in that department.

With the recent and highly publicised executive order on immigration, many of these fundamental assumptions have been seriously damaged, if not destroyed altogether.  Even if the order was withdrawn immediately, there is no longer an assurance, even for nationals not initially impacted by that order, that some similar abrupt and major change in the rules for entry to the United States could not occur, for instance for a visitor who has already gone through the lengthy visa application process and background checks, secured the appropriate visa, and is already in flight to the country.  This is already affecting upcoming or ongoing mathematical conferences or programs in the US, with many international speakers (including those from countries not directly affected by the order) now cancelling their visit, either in protest or in concern about their ability to freely enter and leave the country.  Even some conferences outside the US are affected, as some mathematicians currently in the US with a valid visa or even permanent residency are uncertain if they could ever return to their place of work if they left the country to attend a meeting.  In the slightly longer term, it is likely that the ability of elite US institutions to attract the best students and faculty will be seriously impacted.  Again, the losses would be strongest regarding candidates that were nationals of the countries affected by the current executive order, but I fear that many other mathematicians from other countries would now be much more concerned about entering and living in the US than they would have previously.

It is still possible for this sort of long-term damage to the mathematical community (both within the US and abroad) to be reversed or at least contained, but at present there is a real risk of the damage becoming permanent.  To prevent this, it seems insufficient to me for the current order to be rescinded, as desirable as that would be; some further legislative or judicial action would be needed to begin restoring enough trust in the stability of the US immigration and visa system that the international travel that is so necessary to modern mathematical research becomes “just” a bureaucratic headache again.

Of course, the impact of this executive order is far, far broader than just its effect on mathematicians and mathematical research.  But there are countless other venues on the internet and elsewhere to discuss these other aspects (or politics in general).  (For instance, discussion of the qualifications, or lack thereof, of the current US president can be carried out at this previous post.) I would therefore like to open this post to readers to discuss the effects or potential effects of this order on the mathematical community; I particularly encourage mathematicians who have been personally affected by this order to share their experiences.  As per the rules of the blog, I request that “the discussions are kept constructive, polite, and at least tangentially relevant to the topic at hand”.


I’ve just uploaded to the arXiv my paper “Some remarks on the lonely runner conjecture”, submitted to Contributions to discrete mathematics. I had blogged about the lonely runner conjecture in this previous blog post, and I returned to the problem recently to see if I could obtain anything further. The results obtained were more modest than I had hoped, but they did at least seem to indicate a potential strategy to make further progress on the problem, and also highlight some of the difficulties of the problem.

One can rephrase the lonely runner conjecture as the following covering problem. Given any integer “velocity” {v} and radius {0 < \delta < 1/2}, define the Bohr set {B(v,\delta)} to be the subset of the unit circle {{\bf R}/{\bf Z}} given by the formula

\displaystyle B(v,\delta) := \{ t \in {\bf R}/{\bf Z}: \|vt\| \leq \delta \},

where {\|x\|} denotes the distance of {x} to the nearest integer. Thus, for {v} positive, {B(v,\delta)} is simply the union of the {v} intervals {[\frac{a-\delta}{v}, \frac{a+\delta}{v}]} for {a=0,\dots,v-1}, projected onto the unit circle {{\bf R}/{\bf Z}}; in the language of the usual formulation of the lonely runner conjecture, {B(v,\delta)} represents those times in which a runner moving at speed {v} returns to within {\delta} of his or her starting position. For any non-zero integers {v_1,\dots,v_n}, let {\delta(v_1,\dots,v_n)} be the smallest radius {\delta} such that the {n} Bohr sets {B(v_1,\delta),\dots,B(v_n,\delta)} cover the unit circle:

\displaystyle {\bf R}/{\bf Z} = \bigcup_{i=1}^n B(v_i,\delta). \ \ \ \ \ (1)


Then define {\delta_n} to be the smallest value of {\delta(v_1,\dots,v_n)}, as {v_1,\dots,v_n} ranges over tuples of distinct non-zero integers. The Dirichlet approximation theorem quickly gives that

\displaystyle \delta(1,\dots,n) = \frac{1}{n+1}

and hence

\displaystyle \delta_n \leq \frac{1}{n+1}

for any {n \geq 1}. The lonely runner conjecture is equivalent to the assertion that this bound is in fact optimal:

Conjecture 1 (Lonely runner conjecture) For any {n \geq 1}, one has {\delta_n = \frac{1}{n+1}}.

This conjecture is currently known for {n \leq 6} (see this paper of Barajas and Serra), but remains open for higher {n}.
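For small tuples of velocities the quantity {\delta(v_1,\dots,v_n)} can be computed exactly, which is handy for experimenting with the conjecture. The sketch below is one possible approach and not taken from the paper: it uses the reformulation {\delta(v_1,\dots,v_n) = \max_t \min_i \|v_i t\|}, and the observation that this piecewise linear function of {t} attains its maximum either at a peak {t = a/(2v_i)} of a single {\|v_i t\|}, or at a crossing point where {(v_i \pm v_j) t} is an integer. The function names are illustrative.

```python
from fractions import Fraction
from math import floor

def circle_norm(x: Fraction) -> Fraction:
    """Distance from x to the nearest integer."""
    frac = x - floor(x)
    return min(frac, 1 - frac)

def delta(velocities) -> Fraction:
    """Smallest radius at which the Bohr sets B(v_i, delta) cover R/Z.

    Computed as max over t of min_i ||v_i t||, testing only the rational
    points t = a/q where the maximum of this piecewise linear function can sit.
    """
    vs = [abs(v) for v in velocities]  # ||vt|| = ||-vt||, so signs are irrelevant
    denominators = {2 * v for v in vs}
    for i, v in enumerate(vs):
        for w in vs[i + 1:]:
            denominators.add(v + w)
            if v != w:
                denominators.add(abs(v - w))
    best = Fraction(0)
    for q in denominators:
        for a in range(q):
            t = Fraction(a, q)
            best = max(best, min(circle_norm(v * t) for v in vs))
    return best

if __name__ == "__main__":
    for n in range(1, 8):
        # The Dirichlet approximation theorem predicts 1/(n+1) here.
        print(n, delta(range(1, n + 1)))
```

Exact rational arithmetic is used so that equality with {\frac{1}{n+1}} can be tested without worrying about rounding.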

It is natural to try to attack the problem by establishing lower bounds on the quantity {\delta_n}. We have the following “trivial” bound, that gets within a factor of two of the conjecture:

Proposition 2 (Trivial bound) For any {n \geq 1}, one has {\delta_n \geq \frac{1}{2n}}.

Proof: It is not difficult to see that for any non-zero velocity {v} and any {0 < \delta < 1/2}, the Bohr set {B(v,\delta)} has Lebesgue measure {m(B(v,\delta)) = 2\delta}. In particular, by the union bound

\displaystyle m(\bigcup_{i=1}^n B(v_i,\delta)) \leq \sum_{i=1}^n m(B(v_i,\delta)) \ \ \ \ \ (2)


we see that the covering (1) is only possible if {1 \leq 2 n \delta}, giving the claim. \Box

So, in some sense, all the difficulty is coming from the need to improve upon the trivial union bound (2) by a factor of two.

Despite the crudeness of the union bound (2), it has proven surprisingly hard to make substantial improvements on the trivial bound {\delta_n \geq \frac{1}{2n}}. In 1994, Chen obtained the slight improvement

\displaystyle \delta_n \geq \frac{1}{2n - 1 + \frac{1}{2n-3}}

which was improved a little by Chen and Cusick in 1999 to

\displaystyle \delta_n \geq \frac{1}{2n-3}

when {2n-3} was prime. In a recent paper of Perarnau and Serra, the bound

\displaystyle \delta_n \geq \frac{1}{2n-2+o(1)}

was obtained for arbitrary {n}. These bounds only improve upon the trivial bound by a multiplicative factor of {1+O(1/n)}. Heuristically, one reason for this is as follows. The union bound (2) would of course be sharp if the Bohr sets {B(v_i,\delta)} were all disjoint. Strictly speaking, such disjointness is not possible, because all the Bohr sets {B(v_i,\delta)} have to contain the origin as an interior point. However, it is possible to come up with a large number of Bohr sets {B(v_i,\delta)} which are almost disjoint. For instance, suppose that we had velocities {v_1,\dots,v_s} that were all prime numbers between {n/4} and {n/2}, and that {\delta} was equal to {\delta_n} (and in particular was between {1/(2n)} and {1/(n+1)}). Then each set {B(v_i,\delta)} can be split into a “kernel” interval {[-\frac{\delta}{v_i}, \frac{\delta}{v_i}]}, together with the “petal” intervals {\bigcup_{a=1}^{v_i-1} [\frac{a-\delta}{v_i}, \frac{a+\delta}{v_i}]}. Roughly speaking, as the prime {v_i} varies, the kernel interval stays more or less fixed, but the petal intervals range over disjoint sets, and from this it is not difficult to show that

\displaystyle m(\bigcup_{i=1}^s B(v_i,\delta)) = (1-O(\frac{1}{n})) \sum_{i=1}^s m(B(v_i,\delta)),

so that the union bound is within a multiplicative factor of {1+O(\frac{1}{n})} of the truth in this case.

This does not imply that {\delta_n} is within a multiplicative factor of {1+O(1/n)} of {\frac{1}{2n}}, though, because there are not enough primes between {n/4} and {n/2} to assign to {n} distinct velocities; indeed, by the prime number theorem, there are only about {\frac{n}{4\log n}} such velocities that could be assigned to a prime. So, while the union bound could be close to tight for up to {\asymp n/\log n} Bohr sets, the above counterexamples don’t exclude improvements to the union bound for larger collections of Bohr sets. Following this train of thought, I was able to obtain a logarithmic improvement to previous lower bounds:
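The prime count used here is easy to sanity-check numerically. The following sketch (with an illustrative helper name) sieves the primes in {(n/4, n/2]} and compares their number with the prime number theorem prediction {\frac{n}{4\log n}}; for moderate {n} one should only expect agreement up to a factor close to {1}.

```python
import math

def primes_between(lo: float, hi: float) -> int:
    """Count primes p with lo < p <= hi using a sieve of Eratosthenes."""
    top = int(hi)
    if top < 2:
        return 0
    sieve = bytearray([1]) * (top + 1)
    sieve[0] = sieve[1] = 0
    for p in range(2, math.isqrt(top) + 1):
        if sieve[p]:
            # Cross out the multiples of p starting from p^2.
            sieve[p * p :: p] = bytearray(len(range(p * p, top + 1, p)))
    return sum(sieve[k] for k in range(int(lo) + 1, top + 1))

if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
        count = primes_between(n / 4, n / 2)
        predicted = n / (4 * math.log(n))
        print(n, count, round(predicted), round(count / predicted, 3))
```

In particular the count is far smaller than {n}, which is the point of the paragraph above: there are not enough such primes to supply {n} distinct velocities.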

Theorem 3 For sufficiently large {n}, one has {\delta_n \geq \frac{1}{2n} + \frac{c \log n}{n^2 (\log\log n)^2}} for some absolute constant {c>0}.

The factors of {\log\log n} in the denominator are for technical reasons and might perhaps be removable by a more careful argument. However it seems difficult to adapt the methods to improve the {\log n} in the numerator, basically because of the obstruction provided by the near-counterexample discussed above.

Roughly speaking, the idea of the proof of this theorem is as follows. If we have the covering (1) for {\delta} very close to {1/(2n)}, then the multiplicity function {\sum_{i=1}^n 1_{B(v_i,\delta)}} will then be mostly equal to {1}, but occasionally be larger than {1}. On the other hand, one can compute that the {L^2} norm of this multiplicity function is significantly larger than {1} (in fact it is at least {(3/2-o(1))^{1/2}}). Because of this, the {L^3} norm must be very large, which means that the triple intersections {B(v_i,\delta) \cap B(v_j,\delta) \cap B(v_k,\delta)} must be quite large for many triples {(i,j,k)}. Using some basic Fourier analysis and additive combinatorics, one can deduce from this that the velocities {v_1,\dots,v_n} must have a large structured component, in the sense that there exists an arithmetic progression of length {\asymp n} that contains {\asymp n} of these velocities. For simplicity let us take the arithmetic progression to be {\{1,\dots,n\}}, thus {\asymp n} of the velocities {v_1,\dots,v_n} lie in {\{1,\dots,n\}}. In particular, from the prime number theorem, most of these velocities will not be prime, and will in fact likely have a “medium-sized” prime factor (in the precise form of the argument, “medium-sized” is defined to be “between {\log^{10} n} and {n^{1/10}}”). Using these medium-sized prime factors, one can show that many of the {B(v_i,\delta)} will have quite a large overlap with many of the other {B(v_j,\delta)}, and this can be used after some elementary arguments to obtain a more noticeable improvement on the union bound (2) than was obtained previously.

A modification of the above argument also allows for the improved estimate

\displaystyle \delta(v_1,\dots,v_n) \geq \frac{1+c-o(1)}{2n} \ \ \ \ \ (3)


if one knows that all of the velocities {v_1,\dots,v_n} are of size {O(n)}.

In my previous blog post, I showed that in order to prove the lonely runner conjecture, it suffices to do so under the additional assumption that all of the velocities {v_1,\dots,v_n} are of size {O(n^{O(n^2)})}; I reproduce this argument (slightly cleaned up for publication) in the current preprint. There is unfortunately a huge gap between {O(n)} and {O(n^{O(n^2)})}, so the above bound (3) does not immediately give any new bounds for {\delta_n}. However, one could perhaps try to start attacking the lonely runner conjecture by increasing the range {O(n)} for which one has good results, and by decreasing the range {O(n^{O(n^2)})} that one can reduce to. For instance, in the current preprint I give an elementary argument (using a certain amount of case-checking) that shows that the lonely runner bound

\displaystyle \delta(v_1,\dots,v_n) \geq \frac{1}{n+1} \ \ \ \ \ (4)


holds if all the velocities {v_1,\dots,v_n} are assumed to lie between {1} and {1.2 n}. This upper threshold of {1.2 n} is only a tiny improvement over the trivial threshold of {n}, but it seems to be an interesting sub-problem of the lonely runner conjecture to increase this threshold further. One key target would be to get up to {2n}, as there are actually a number of {n}-tuples {(v_1,\dots,v_n)} in this range for which (4) holds with equality. The Dirichlet approximation theorem of course gives the tuple {(1,2,\dots,n)}, but there is also the double {(2,4,\dots,2n)} of this tuple, and furthermore there is an additional construction of Goddyn and Wong that gives some further examples such as {(1,2,3,4,5,7,12)}, or more generally one can start with the standard tuple {(1,\dots,n)} and accelerate one of the velocities {v} to {2v}; this turns out to work as long as {v} shares a common factor with every integer between {n-v+1} and {2n-2v+1}. There are a few more examples of this type in the paper of Goddyn and Wong, but all of them can be placed in an arithmetic progression of length {O(n \log n)} at most, so if one were very optimistic, one could perhaps envision a strategy in which the upper bound of {O(n^{O(n^2)})} mentioned earlier was reduced all the way to something like {O( n \log n )}, and then a separate argument deployed to treat this remaining case, perhaps isolating the constructions of Goddyn and Wong (and possible variants thereof) as the only extreme cases.
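The divisibility criterion for accelerating a velocity {v} to {2v} is simple to enumerate. The sketch below (function names are illustrative) lists, for a given {n}, the velocities {v \in \{1,\dots,n\}} that share a common factor with every integer between {n-v+1} and {2n-2v+1}, and builds the corresponding accelerated tuple. It only tests the stated criterion; it does not itself verify that (4) holds with equality.

```python
from math import gcd

def accelerable(n: int) -> list[int]:
    """Velocities v in {1..n} sharing a common factor with every integer in [n-v+1, 2n-2v+1]."""
    return [
        v
        for v in range(1, n + 1)
        if all(gcd(v, m) > 1 for m in range(n - v + 1, 2 * n - 2 * v + 2))
    ]

def accelerated_tuple(n: int, v: int) -> tuple[int, ...]:
    """The standard tuple (1, ..., n) with the velocity v replaced by 2v."""
    return tuple(sorted(2 * w if w == v else w for w in range(1, n + 1)))

if __name__ == "__main__":
    for n in range(2, 40):
        for v in accelerable(n):
            print(n, v, accelerated_tuple(n, v))
```

For {n=7} the only admissible velocity is {v=6}, which recovers the tuple {(1,2,3,4,5,7,12)} mentioned above.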

I just learned (from Emmanuel Kowalski’s blog) that the AMS has just started a repository of open-access mathematics lecture notes.  There are only a few such sets of notes there at present, but hopefully it will grow in the future; I just submitted some old lecture notes of mine from an undergraduate linear algebra course I taught in 2002 (with some updating of format and fixing of various typos).


[Update, Dec 22: my own notes are now on the repository.]

I’ve just uploaded to the arXiv my paper Finite time blowup for a supercritical defocusing nonlinear Schrödinger system, submitted to Analysis and PDE. This paper is an analogue of a recent paper of mine in which I constructed a supercritical defocusing nonlinear wave (NLW) system {-\partial_{tt} u + \Delta u = (\nabla F)(u)} which exhibited smooth solutions that developed singularities in finite time. Here, we achieve essentially the same conclusion for the (inhomogeneous) supercritical defocusing nonlinear Schrödinger (NLS) equation

\displaystyle  i \partial_t u + \Delta u = (\nabla F)(u) + G \ \ \ \ \ (1)

where {u: {\bf R} \times {\bf R}^d \rightarrow {\bf C}^m} is now a system of scalar fields, {F: {\bf C}^m \rightarrow {\bf R}} is a potential which is strictly positive and homogeneous of degree {p+1} (and invariant under phase rotations {u \mapsto e^{i\theta} u}), and {G: {\bf R} \times {\bf R}^d \rightarrow {\bf C}^m} is a smooth compactly supported forcing term, needed for technical reasons.

To oversimplify somewhat, the equation (1) is known to be globally regular in the energy-subcritical case when {d \leq 2}, or when {d \geq 3} and {p < 1+\frac{4}{d-2}}; global regularity is also known (but is significantly more difficult to establish) in the energy-critical case when {d \geq 3} and {p = 1 +\frac{4}{d-2}}. (This is an oversimplification for a number of reasons, in particular in higher dimensions one only knows global well-posedness instead of global regularity. See this previous post for some exploration of this issue in the context of nonlinear wave equations.) The main result of this paper is to show that global regularity can break down in the remaining energy-supercritical case when {d \geq 3} and {p > 1 + \frac{4}{d-2}}, at least when the target dimension {m} is allowed to be sufficiently large depending on the spatial dimension {d} (I did not try to achieve the optimal value of {m} here, but the argument gives a value of {m} that grows quadratically in {d}). Unfortunately, this result does not directly impact the most interesting case of the defocusing scalar NLS equation

\displaystyle  i \partial_t u + \Delta u = |u|^{p-1} u \ \ \ \ \ (2)

in which {m=1}; however it does establish a rigorous barrier to any attempt to prove global regularity for the scalar NLS equation, in that such an attempt needs to crucially use some property of the scalar NLS that is not shared by the more general systems in (1). For instance, any approach that is primarily based on the conservation laws of mass, momentum, and energy (which are common to both (1) and (2)) will not be sufficient to establish global regularity of supercritical defocusing scalar NLS.

The method of proof in this paper is broadly similar to that in the previous paper for NLW, but with a number of additional technical complications. Both proofs begin by reducing matters to constructing a discretely self-similar solution. In the case of NLW, this solution lived on a forward light cone {\{ (t,x): |x| \leq t \}} and obeyed a self-similarity

\displaystyle  u(2t, 2x) = 2^{-\frac{2}{p-1}} u(t,x).

The ability to restrict to a light cone arose from the finite speed of propagation properties of NLW. For NLS, the solution will instead live on the domain

\displaystyle  H_d := ([0,+\infty) \times {\bf R}^d) \backslash \{(0,0)\}

and obey a parabolic self-similarity

\displaystyle  u(4t, 2x) = 2^{-\frac{2}{p-1}} u(t,x)

and solve the homogeneous version {G=0} of (1). (The inhomogeneity {G} emerges when one truncates the self-similar solution so that the initial data is compactly supported in space.) A key technical point is that {u} has to be smooth everywhere in {H_d}, including the boundary component {\{ (0,x): x \in {\bf R}^d \backslash \{0\}\}}. This unfortunately rules out many of the existing constructions of self-similar solutions, which typically will have some sort of singularity at the spatial origin.
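As a quick consistency check on the exponent in this parabolic self-similarity (a sketch only, with the rescaled field {u_\lambda} introduced here purely for illustration), set

\displaystyle  u_\lambda(t,x) := \lambda^{-\frac{2}{p-1}} u( \frac{t}{\lambda^2}, \frac{x}{\lambda} )

for {\lambda > 0}. Then

\displaystyle  (i \partial_t + \Delta) u_\lambda(t,x) = \lambda^{-\frac{2}{p-1}-2} ((i \partial_t + \Delta) u)( \frac{t}{\lambda^2}, \frac{x}{\lambda} ),

while the homogeneity of {F} of degree {p+1} makes {\nabla F} homogeneous of degree {p}, so that

\displaystyle  (\nabla F)(u_\lambda(t,x)) = \lambda^{-\frac{2p}{p-1}} (\nabla F)( u( \frac{t}{\lambda^2}, \frac{x}{\lambda} ) ).

Since {\frac{2}{p-1} + 2 = \frac{2p}{p-1}}, the two prefactors agree, so {u_\lambda} solves the homogeneous equation whenever {u} does; the discrete self-similarity above is then the statement that {u_{1/2} = u}.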

The remaining steps of the argument can broadly be described as quantifier elimination: one systematically eliminates each of the degrees of freedom of the problem in turn by locating the necessary and sufficient conditions required of the remaining degrees of freedom in order for the constraints of a particular degree of freedom to be satisfiable. The first such degree of freedom to eliminate is the potential function {F}. The task here is to determine what constraints must exist on a putative solution {u} in order for there to exist a (positive, homogeneous, smooth away from origin) potential {F} obeying the homogeneous NLS equation

\displaystyle  i \partial_t u + \Delta u = (\nabla F)(u).

Firstly, the requirement that {F} be homogeneous implies the Euler identity

\displaystyle  \langle (\nabla F)(u), u \rangle = (p+1) F(u)

(where {\langle,\rangle} denotes the standard real inner product on {{\bf C}^m}), while the requirement that {F} be phase invariant similarly yields the variant identity

\displaystyle  \langle (\nabla F)(u), iu \rangle = 0,

so if one defines the potential energy field to be {V = F(u)}, we obtain from the chain rule the equations

\displaystyle  \langle i \partial_t u + \Delta u, u \rangle = (p+1) V

\displaystyle  \langle i \partial_t u + \Delta u, iu \rangle = 0

\displaystyle  \langle i \partial_t u + \Delta u, \partial_t u \rangle = \partial_t V

\displaystyle  \langle i \partial_t u + \Delta u, \partial_{x_j} u \rangle = \partial_{x_j} V.

Conversely, it turns out (roughly speaking) that if one can locate fields {u} and {V} obeying the above equations (as well as some other technical regularity and non-degeneracy conditions), then one can find an {F} with all the required properties. The first of these equations can be thought of as a definition of the potential energy field {V}, and the other three equations are basically disguised versions of the conservation laws of mass, energy, and momentum respectively. The construction of {F} relies on a classical extension theorem of Seeley that is a relative of the Whitney extension theorem.

Now that the potential {F} is eliminated, the next degree of freedom to eliminate is the solution field {u}. One can observe that the above equations involving {u} and {V} can be expressed instead in terms of {V} and the Gram-type matrix {G[u,u]} of {u}, which is a {(2d+4) \times (2d+4)} matrix consisting of the inner products {\langle D_1 u, D_2 u \rangle} where {D_1,D_2} range amongst the {2d+4} differential operators

\displaystyle  D_1,D_2 \in \{ 1, i, \partial_t, i\partial_t, \partial_{x_1},\dots,\partial_{x_d}, i\partial_{x_1}, \dots, i\partial_{x_d}\}.

To eliminate {u}, one thus needs to answer the question of what properties are required of a {(2d+4) \times (2d+4)} matrix {G} for it to be the Gram-type matrix {G = G[u,u]} of a field {u}. Amongst some obvious necessary conditions are that {G} needs to be symmetric and positive semi-definite; there are also additional constraints coming from identities such as

\displaystyle  \partial_t \langle u, u \rangle = 2 \langle u, \partial_t u \rangle

\displaystyle  \langle i u, \partial_t u \rangle = - \langle u, i \partial_t u \rangle


\displaystyle  \partial_{x_j} \langle iu, \partial_{x_k} u \rangle - \partial_{x_k} \langle iu, \partial_{x_j} u \rangle = 2 \langle i \partial_{x_j} u, \partial_{x_k} u \rangle.
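The pointwise algebraic constraints can be illustrated numerically. In the sketch below (which is only a toy check at a single point, with made-up function names), random vectors in {{\bf C}^m} stand in for the values of {u}, {\partial_t u} and {\partial_{x_1} u,\dots,\partial_{x_d} u}; the resulting {(2d+4) \times (2d+4)} matrix is then symmetric and positive semi-definite, and obeys identities such as {\langle u, iu \rangle = 0} and {\langle iu, \partial_t u \rangle = - \langle u, i \partial_t u \rangle}. The differential constraints, such as the curl-type identity above, are not captured by a single-point check like this.

```python
import random

def inner(z, w) -> float:
    """Standard real inner product on C^m: Re sum_k z_k conj(w_k)."""
    return sum((a * b.conjugate()).real for a, b in zip(z, w))

def times_i(z):
    """Multiply a vector in C^m by i."""
    return [1j * a for a in z]

def gram_matrix(u, ut, ux):
    """Gram-type matrix at one point.

    u, ut are the values of u and d_t u; ux is the list [d_{x_1} u, ..., d_{x_d} u].
    Rows and columns are ordered as 1, i, d_t, i d_t, d_{x_1..x_d}, i d_{x_1..x_d}.
    """
    vectors = [u, times_i(u), ut, times_i(ut)] + list(ux) + [times_i(v) for v in ux]
    return [[inner(a, b) for b in vectors] for a in vectors]

def random_vector(m: int, rng: random.Random):
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)]

if __name__ == "__main__":
    rng = random.Random(0)
    d, m = 3, 5
    G = gram_matrix(
        random_vector(m, rng),
        random_vector(m, rng),
        [random_vector(m, rng) for _ in range(d)],
    )
    print("size:", len(G), "x", len(G[0]))
    print("<u, iu> =", G[0][1])
    print("<iu, d_t u> + <u, i d_t u> =", G[1][2] + G[0][3])
```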

Ideally one would like a theorem that asserts (for {m} large enough) that as long as {G} obeys all of the “obvious” constraints, then there exists a suitably non-degenerate map {u} such that {G = G[u,u]}. In the case of NLW, the analogous claim was basically a consequence of the Nash embedding theorem (which can be viewed as a theorem about the solvability of the system of equations {\langle \partial_{x_j} u, \partial_{x_k} u \rangle = g_{jk}} for a given positive definite symmetric set of fields {g_{jk}}). However, the presence of the complex structure in the NLS case poses some significant technical challenges (note for instance that the naive complex version of the Nash embedding theorem is false, due to obstructions such as Liouville’s theorem that prevent a compact complex manifold from being embeddable holomorphically in {{\bf C}^m}). Nevertheless, by adapting the proof of the Nash embedding theorem (in particular, the simplified proof of Gunther that avoids the need to use the Nash-Moser iteration scheme) we were able to obtain a partial complex analogue of the Nash embedding theorem that sufficed for our application; it required an artificial additional “curl-free” hypothesis on the Gram-type matrix {G[u,u]}, but fortunately this hypothesis ends up being automatic in our construction. Also, this version of the Nash embedding theorem is unable to prescribe the component {\langle \partial_t u, \partial_t u \rangle} of the Gram-type matrix {G[u,u]}, but fortunately this component is not used in any of the conservation laws and so the loss of this component does not cause any difficulty.

After applying the above-mentioned Nash-embedding theorem, the task is now to locate a matrix {G} obeying all the hypotheses of that theorem, as well as the conservation laws for mass, momentum, and energy (after defining the potential energy field {V} in terms of {G}). This is quite a lot of fields and constraints, but one can cut down significantly on the degrees of freedom by requiring that {G} is spherically symmetric (in a tensorial sense) and also continuously self-similar (not just discretely self-similar). Note that this hypothesis is weaker than the assertion that the original field {u} is spherically symmetric and continuously self-similar; indeed we do not know if non-trivial solutions of this type actually exist. These symmetry hypotheses reduce the number of independent components of the {(2d+4) \times (2d+4)} matrix {G} to just six: {g_{1,1}, g_{1,i\partial_t}, g_{1,i\partial_r}, g_{\partial_r, \partial_r}, g_{\partial_\omega, \partial_\omega}, g_{\partial_r, \partial_t}}, which now take as their domain the {1+1}-dimensional space

\displaystyle  H_1 := ([0,+\infty) \times {\bf R}) \backslash \{(0,0)\}.

One now has to construct these six fields, together with a potential energy field {v}, that obey a number of constraints, notably some positive definiteness constraints as well as the aforementioned conservation laws for mass, momentum, and energy.

The field {g_{1,i\partial_t}} only arises in the equation for the potential {v} (coming from Euler’s identity) and can easily be eliminated. Similarly, the field {g_{\partial_r,\partial_t}} only makes an appearance in the current of the energy conservation law, and so can also be easily eliminated so long as the total energy is conserved; in the energy-supercritical case, the total energy is infinite, and so it is relatively easy to eliminate the field {g_{\partial_r, \partial_t}} from the problem as well. This leaves us with the task of constructing just five fields {g_{1,1}, g_{1,i\partial_r}, g_{\partial_r,\partial_r}, g_{\partial_\omega,\partial_\omega}, v} obeying a number of positivity conditions, symmetry conditions, regularity conditions, and conservation laws for mass and momentum.

The potential field {v} can effectively be absorbed into the angular stress field {g_{\partial_\omega,\partial_\omega}} (after placing an appropriate counterbalancing term in the radial stress field {g_{\partial_r, \partial_r}} so as not to disrupt the conservation laws), so we can also eliminate this field. The angular stress field {g_{\partial_\omega, \partial_\omega}} is then only constrained through the momentum conservation law and a requirement of positivity; one can then eliminate this field by converting the momentum conservation law from an equality to an inequality. Finally, the radial stress field {g_{\partial_r, \partial_r}} is also only constrained through a positive definiteness constraint and the momentum conservation inequality, so it can also be eliminated from the problem after some further modification of the momentum conservation inequality.

The task then reduces to locating just two fields {g_{1,1}, g_{1,i\partial_r}} that obey a mass conservation law

\displaystyle  \partial_t g_{1,1} = 2 \left(\partial_r + \frac{d-1}{r} \right) g_{1,i\partial_r}

together with an additional inequality that is the remnant of the momentum conservation law. One can solve for the mass conservation law in terms of a single scalar field {W} using the ansatz

\displaystyle g_{1,1} = 2 r^{1-d} \partial_r (r^d W)

\displaystyle g_{1,i\partial_r} = r^{1-d} \partial_t (r^d W)

so the problem has finally been simplified to the task of locating a single scalar field {W} with some scaling and homogeneity properties that obeys a certain differential inequality relating to momentum conservation. This turns out to be possible by explicitly writing down a specific scalar field {W} using some asymptotic parameters and cutoff functions.
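
As a quick consistency check (a worked verification using only the formulas above, not a step quoted from the paper), one can confirm that the ansatz does solve the mass conservation law. Using the identity {\left(\partial_r + \frac{d-1}{r}\right) h = r^{1-d} \partial_r (r^{d-1} h)}, valid for any smooth {h}, together with the fact that {r} does not depend on {t}, one has

\displaystyle 2 \left(\partial_r + \frac{d-1}{r} \right) g_{1,i\partial_r} = 2 r^{1-d} \partial_r \left( r^{d-1} \cdot r^{1-d} \partial_t (r^d W) \right) = 2 r^{1-d} \partial_r \partial_t (r^d W)

\displaystyle \partial_t g_{1,1} = 2 r^{1-d} \partial_t \partial_r (r^d W),

and the two right-hand sides agree because the partial derivatives commute.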

[This guest post is authored by Caroline Series.]

The Chern Medal is a relatively new prize, awarded once every four years jointly by the IMU and the Chern Medal Foundation (CMF) to an individual whose accomplishments warrant the highest level of recognition for outstanding achievements in the field of mathematics. Funded by the CMF, the Medalist receives a cash prize of US$ 250,000. In addition, each Medalist may nominate one or more organizations to receive funding totalling US$ 250,000, for the support of research, education, or other outreach programs in the field of mathematics.

Professor Chern devoted his life to mathematics, both in active research and education, and in nurturing the field whenever the opportunity arose. He obtained fundamental results in all the major aspects of modern geometry and founded the area of global differential geometry. Chern exhibited keen aesthetic tastes in his selection of problems, and the breadth of his work deepened the connections of geometry with different areas of mathematics. He was also generous during his lifetime in his personal support of the field.

Nominations should be sent to the Prize Committee Chair:  Caroline Series, email: by 31st December 2016. Further details and nomination guidelines for this and the other IMU prizes can be found at


I’ve just uploaded to the arXiv my paper “An integration approach to the Toeplitz square peg problem“, submitted to Forum of Mathematics, Sigma. This paper resulted from my attempts recently to solve the Toeplitz square peg problem (also known as the inscribed square problem):

Conjecture 1 (Toeplitz square peg problem) Let {\gamma} be a simple closed curve in the plane. Then {\gamma} contains the four vertices of a square.

See this recent survey of Matschke in the Notices of the AMS for the latest results on this problem.

The route I took to the results in this paper was somewhat convoluted. I was motivated to look at this problem after lecturing recently on the Jordan curve theorem in my class. The problem is superficially similar to the Jordan curve theorem in that the result is known (and rather easy to prove) if {\gamma} is sufficiently regular (e.g. if it is a polygonal path), but seems to be significantly more difficult when the curve is merely assumed to be continuous. Roughly speaking, all the known positive results on the problem have proceeded using (in some form or another) tools from homology: note for instance that one can view the conjecture as asking whether the four-dimensional subset {\gamma^4} of the eight-dimensional space {({\bf R}^2)^4} necessarily intersects the four-dimensional space {\mathtt{Squares} \subset ({\bf R}^2)^4} consisting of the quadruples {(v_1,v_2,v_3,v_4)} traversing a square in (say) anti-clockwise order; this space is a four-dimensional linear subspace of {({\bf R}^2)^4}, with a two-dimensional subspace of “degenerate” squares {(v,v,v,v)} removed. If one ignores this degenerate subspace, one can use intersection theory to conclude (under reasonable “transversality” hypotheses) that {\gamma^4} intersects {\mathtt{Squares}} an odd number of times (up to the cyclic symmetries of the square), which is basically how Conjecture 1 is proven in the regular case. Unfortunately, if one then takes a limit and considers what happens when {\gamma} is just a continuous curve, the odd number of squares created by these homological arguments could conceivably all degenerate to points, thus blocking one from proving the conjecture in the general case.

Inspired by my previous work on finite time blowup for various PDEs, I first tried looking for a counterexample in the category of (locally) self-similar curves that are smooth (or piecewise linear) away from a single origin where it can oscillate infinitely often; this is basically the smoothest type of curve that was not already covered by previous results. By a rescaling and compactness argument, it is not difficult to see that such a counterexample would exist if there was a counterexample to the following periodic version of the conjecture:

Conjecture 2 (Periodic square peg problem) Let {\gamma_1, \gamma_2} be two disjoint simple closed piecewise linear curves in the cylinder {({\bf R}/{\bf Z}) \times {\bf R}} which have a winding number of one, that is to say they are homologous to the loop {x \mapsto (x,0)} from {{\bf R}/{\bf Z}} to {({\bf R}/{\bf Z}) \times {\bf R}}. Then the union of {\gamma_1} and {\gamma_2} contains the four vertices of a square.

In contrast to Conjecture 1, which is known for polygonal paths, Conjecture 2 is still open even under the hypothesis of polygonal paths; the homological arguments alluded to previously now show that the number of inscribed squares in the periodic setting is even rather than odd, which is not enough to conclude the conjecture. (This flipping of parity from odd to even due to an infinite amount of oscillation is reminiscent of the “Eilenberg-Mazur swindle“, discussed in this previous post.)

I therefore tried to construct counterexamples to Conjecture 2. I began perturbatively, looking at curves {\gamma_1, \gamma_2} that were small perturbations of constant functions. After some initial Taylor expansion, I was blocked from forming such a counterexample because an inspection of the leading Taylor coefficients required one to construct a continuous periodic function of mean zero that never vanished, which of course was impossible by the intermediate value theorem. I kept expanding to higher and higher order to try to evade this obstruction (this, incidentally, was when I discovered this cute application of Lagrange reversion) but no matter how high an accuracy I went (I think I ended up expanding to sixth order in a perturbative parameter {\varepsilon} before figuring out what was going on!), this obstruction kept resurfacing again and again. I eventually figured out that this obstruction was being caused by a “conserved integral of motion” for both Conjecture 2 and Conjecture 1, which can in fact be used to largely rule out perturbative constructions. This yielded a new positive result for both conjectures:

Theorem 3

  • (i) Conjecture 1 holds when {\gamma} is the union {\{ (t,f(t)): t \in [t_0,t_1]\} \cup \{ (t,g(t)): t \in [t_0,t_1]\}} of the graphs of two Lipschitz functions {f,g: [t_0,t_1] \rightarrow {\bf R}} of Lipschitz constant less than one that agree at the endpoints.
  • (ii) Conjecture 2 holds when {\gamma_1, \gamma_2} are graphs of Lipschitz functions {f: {\bf R}/{\bf Z} \rightarrow {\bf R}, g: {\bf R}/{\bf Z} \rightarrow {\bf R}} of Lipschitz constant less than one.

We sketch the proof of Theorem 3(i) as follows (the proof of Theorem 3(ii) is very similar). Let {\gamma_1: [t_0, t_1] \rightarrow {\bf R}^2} be the curve {\gamma_1(t) := (t,f(t))}, thus {\gamma_1} traverses one of the two graphs that comprise {\gamma}. For each time {t \in [t_0,t_1]}, there is a unique square with first vertex {\gamma_1(t)} (and the other three vertices, traversed in anticlockwise order, denoted {\gamma_2(t), \gamma_3(t), \gamma_4(t)}) such that {\gamma_2(t)} also lies in the graph of {f} and {\gamma_4(t)} also lies in the graph of {g} (actually for technical reasons we have to extend {f,g} by constants to all of {{\bf R}} in order for this claim to be true). To see this, we simply rotate the graph of {g} clockwise by {\frac{\pi}{2}} around {\gamma_1(t)}, where (by the Lipschitz hypotheses) it must hit the graph of {f} in a unique point, which is {\gamma_2(t)}, and which then determines the other two vertices {\gamma_3(t), \gamma_4(t)} of the square. The curve {\gamma_3(t)} has the same starting and ending point as the graph of {f} or {g}; using the Lipschitz hypothesis one can show that this curve is simple. If the curve ever hits the graph of {g} other than at the endpoints, we have created an inscribed square, so we may assume for contradiction that {\gamma_3(t)} avoids the graph of {g}, and hence by the Jordan curve theorem the two curves enclose some non-empty bounded open region {\Omega}.

Now for the conserved integral of motion. If we integrate the {1}-form {y\ dx} on each of the four curves {\gamma_1, \gamma_2, \gamma_3, \gamma_4}, we obtain the identity

\displaystyle  \int_{\gamma_1} y\ dx - \int_{\gamma_2} y\ dx + \int_{\gamma_3} y\ dx - \int_{\gamma_4} y\ dx = 0.

This identity can be established by the following calculation: one can parameterise

\displaystyle  \gamma_1(t) = (x(t), y(t))

\displaystyle  \gamma_2(t) = (x(t)+a(t), y(t)+b(t))

\displaystyle  \gamma_3(t) = (x(t)+a(t)-b(t), y(t)+a(t)+b(t))

\displaystyle  \gamma_4(t) = (x(t)-b(t), y(t)+a(t))

for some Lipschitz functions {x,y,a,b: [t_0,t_1] \rightarrow {\bf R}}; thus for instance {\int_{\gamma_1} y\ dx = \int_{t_0}^{t_1} y(t)\ dx(t)}. Inserting these parameterisations and doing some canceling, one can write the above integral as

\displaystyle  \int_{t_0}^{t_1} d \frac{a(t)^2-b(t)^2}{2}

which vanishes because {a(t), b(t)} (which are the components of the side vector {\gamma_2(t) - \gamma_1(t)} of the square determined by {\gamma_1(t), \gamma_2(t), \gamma_3(t), \gamma_4(t)}) vanish at the endpoints {t=t_0,t_1}.
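
For the reader who wants to see the cancellation explicitly (a worked expansion of the parameterisations above), the integrands {y\ dx} on {\gamma_1, \gamma_2, \gamma_3, \gamma_4} are respectively

\displaystyle  y\ dx, \quad (y+b)(dx+da), \quad (y+a+b)(dx+da-db), \quad (y+a)(dx-db).

Taking the alternating sum, every term involving {x} or {y} cancels, as do the cross terms {a\ db} and {b\ da}, and one is left with

\displaystyle  y\ dx - (y+b)(dx+da) + (y+a+b)(dx+da-db) - (y+a)(dx-db) = a\ da - b\ db = d \frac{a^2-b^2}{2}.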

Using this conserved integral of motion, one can show that

\displaystyle  \int_{\gamma_3} y\ dx = \int_{t_0}^{t_1} g(t)\ dt

which by Stokes’ theorem then implies that the bounded open region {\Omega} mentioned previously has zero area, which is absurd.
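
The construction in this sketch can be tried out numerically. The following is a minimal illustration under stated assumptions, not code from the paper: the functions `f` and `g` are hypothetical examples on {[0,1]} with Lipschitz constants {0.8} and {0.5}, extended by constants, and the square at time {t} is found by iterating the two vertex conditions {f(t+a) = f(t)+b} and {g(t-b) = f(t)+a}, which form a contraction precisely because the Lipschitz constants are less than one. The line integrals are approximated by the trapezoid rule.

```python
def clamp(t):
    """Restrict t to [0, 1]; this extends f and g by constants outside [0, 1]."""
    return min(1.0, max(0.0, t))

def f(t):
    """Hypothetical lower graph, Lipschitz constant 0.8, with f(0) = f(1) = 0."""
    s = clamp(t)
    return -0.8 * s * (1.0 - s)

def g(t):
    """Hypothetical upper graph, Lipschitz constant 0.5, with g(0) = g(1) = 0."""
    s = clamp(t)
    return 0.5 * s * (1.0 - s)

def square_at(t, iterations=200):
    """Vertices gamma_1(t), ..., gamma_4(t) of the square, in anticlockwise order.

    Solves f(t + a) = f(t) + b and g(t - b) = f(t) + a by fixed-point iteration.
    """
    a = b = 0.0
    for _ in range(iterations):
        a, b = g(t - b) - f(t), f(t + a) - f(t)
    x, y = t, f(t)
    return [(x, y), (x + a, y + b), (x + a - b, y + a + b), (x - b, y + a)]

def integral_y_dx(points):
    """Trapezoid rule for the line integral of y dx along a polygonal path."""
    return sum(0.5 * (y0 + y1) * (x1 - x0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

N = 2000
squares = [square_at(k / N) for k in range(N + 1)]
I1, I2, I3, I4 = (integral_y_dx([sq[i] for sq in squares]) for i in range(4))

conserved = I1 - I2 + I3 - I4   # the conserved integral of motion
area_under_g = 0.5 / 6          # exact integral of g over [0, 1]
```

Here `conserved` vanishes up to rounding error, and `I3` agrees with `area_under_g` up to discretisation error, in line with the identity {\int_{\gamma_3} y\ dx = \int_{t_0}^{t_1} g(t)\ dt}. The alternating sum is in fact exactly conserved for the sampled polygonal paths, since linearly interpolating between two squares again gives squares.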

This argument hinged on the curve {\gamma_3} being simple, so that the Jordan curve theorem could apply. Once one left the perturbative regime of curves of small Lipschitz constant, it became possible for {\gamma_3} to be self-crossing, but nevertheless there still seemed to be some sort of integral obstruction. I eventually isolated the problem in the form of a strengthened version of Conjecture 2:

Conjecture 4 (Area formulation of square peg problem) Let {\gamma_1, \gamma_2, \gamma_3, \gamma_4: {\bf R}/{\bf Z} \rightarrow ({\bf R}/{\bf Z}) \times {\bf R}} be simple closed piecewise linear curves of winding number {1} obeying the area identity

\displaystyle  \int_{\gamma_1} y\ dx - \int_{\gamma_2} y\ dx + \int_{\gamma_3} y\ dx - \int_{\gamma_4} y\ dx = 0

(note the {1}-form {y\ dx} is still well defined on the cylinder {({\bf R}/{\bf Z}) \times {\bf R}}; note also that the curves {\gamma_1,\gamma_2,\gamma_3,\gamma_4} are allowed to cross each other.) Then there exists a (possibly degenerate) square with vertices (traversed in anticlockwise order) lying on {\gamma_1, \gamma_2, \gamma_3, \gamma_4} respectively.

It is not difficult to see that Conjecture 4 implies Conjecture 2. Actually I believe that the converse implication is at least morally true, in that any counterexample to Conjecture 4 can be eventually transformed to a counterexample to Conjecture 2 and Conjecture 1. The conserved integral of motion argument can establish Conjecture 4 in many cases, for instance if {\gamma_2,\gamma_4} are graphs of functions of Lipschitz constant less than one.

Conjecture 4 has a model special case, when one of the {\gamma_i} is assumed to just be a horizontal loop. In this case, the problem collapses to that of producing an intersection between two three-dimensional subsets of a six-dimensional space, rather than two four-dimensional subsets of an eight-dimensional space. More precisely, some elementary transformations reveal that this special case of Conjecture 4 can be formulated in the following fashion in which the geometric notion of a square is replaced by the additive notion of a triple of real numbers summing to zero:

Conjecture 5 (Special case of area formulation) Let {\gamma_1, \gamma_2, \gamma_3: {\bf R}/{\bf Z} \rightarrow ({\bf R}/{\bf Z}) \times {\bf R}} be simple closed piecewise linear curves of winding number {1} obeying the area identity

\displaystyle  \int_{\gamma_1} y\ dx + \int_{\gamma_2} y\ dx + \int_{\gamma_3} y\ dx = 0.

Then there exist {x \in {\bf R}/{\bf Z}} and {y_1,y_2,y_3 \in {\bf R}} with {y_1+y_2+y_3=0} such that {(x,y_i) \in \gamma_i} for {i=1,2,3}.

This conjecture is easy to establish if one of the curves, say {\gamma_3}, is the graph {\{ (t,f(t)): t \in {\bf R}/{\bf Z}\}} of some piecewise linear function {f: {\bf R}/{\bf Z} \rightarrow {\bf R}}, since in that case the curve {\gamma_1} and the curve {\tilde \gamma_2 := \{ (x, -y-f(x)): (x,y) \in \gamma_2 \}} enclose the same area in the sense that {\int_{\gamma_1} y\ dx = \int_{\tilde \gamma_2} y\ dx}, and hence must intersect by the Jordan curve theorem (otherwise they would enclose a non-zero amount of area between them), giving the claim. But when none of the {\gamma_1,\gamma_2,\gamma_3} are graphs, the situation becomes combinatorially more complicated.
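
To spell out the area computation in the graph case (a short verification from the definitions above): since {\gamma_3} is the graph of {f}, one has {\int_{\gamma_3} y\ dx = \int_{{\bf R}/{\bf Z}} f(x)\ dx}, and since {\gamma_2} has winding number one and the {1}-form {f(x)\ dx} depends only on {x}, one also has {\int_{\gamma_2} f(x)\ dx = \int_{{\bf R}/{\bf Z}} f(x)\ dx}. Hence

\displaystyle  \int_{\tilde \gamma_2} y\ dx = \int_{\gamma_2} (-y - f(x))\ dx = - \int_{\gamma_2} y\ dx - \int_{\gamma_3} y\ dx = \int_{\gamma_1} y\ dx,

where the final equality is the area identity of Conjecture 5.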

Using some elementary homological arguments (e.g. breaking up closed {1}-cycles into closed paths) and working with a generic horizontal slice of the curves, I was able to show that Conjecture 5 was equivalent to a one-dimensional problem that was largely combinatorial in nature, revolving around the sign patterns of various triple sums {y_{1,a} + y_{2,b} + y_{3,c}} with {y_{1,a}, y_{2,b}, y_{3,c}} drawn from various finite sets of reals.

Conjecture 6 (Combinatorial form) Let {k_1,k_2,k_3} be odd natural numbers, and for each {i=1,2,3}, let {y_{i,1},\dots,y_{i,k_i}} be distinct real numbers; we adopt the convention that {y_{i,0}=y_{i,k_i+1}=-\infty}. Assume the following axioms:

  • (i) For any {1 \leq p \leq k_1, 1 \leq q \leq k_2, 1 \leq r \leq k_3}, the sums {y_{1,p} + y_{2,q} + y_{3,r}} are non-zero.
  • (ii) (Non-crossing) For any {i=1,2,3} and {0 \leq p < q \leq k_i} with the same parity, the pairs {\{ y_{i,p}, y_{i,p+1}\}} and {\{y_{i,q}, y_{i,q+1}\}} are non-crossing in the sense that

    \displaystyle  \sum_{a \in \{p,p+1\}} \sum_{b \in \{q,q+1\}} (-1)^{a+b} \mathrm{sgn}( y_{i,a} - y_{i,b} ) = 0.

  • (iii) (Non-crossing sums) For any {0 \leq p \leq k_1}, {0 \leq q \leq k_2}, {0 \leq r \leq k_3} of the same parity, one has

    \displaystyle  \sum_{a \in \{p,p+1\}} \sum_{b \in \{q,q+1\}} \sum_{c \in \{r,r+1\}} (-1)^{a+b+c} \mathrm{sgn}( y_{1,a} + y_{2,b} + y_{3,c} ) = 0.

Then one has

\displaystyle  \sum_{i=1}^3 \sum_{p=1}^{k_i} (-1)^{p-1} y_{i,p} < 0.

Roughly speaking, Conjecture 6 and Conjecture 5 are connected by constructing curves {\gamma_i} to connect {(0, y_{i,p})} to {(0,y_{i,p+1})} for {0 \leq p \leq k_i} by various paths, which either lie to the right of the {y} axis (when {p} is odd) or to the left of the {y} axis (when {p} is even). The axiom (ii) is asserting that the numbers {-\infty, y_{i,1},\dots,y_{i,k_i}} are ordered according to the permutation of a meander (formed by gluing together two non-crossing perfect matchings).
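
Since the hypotheses of Conjecture 6 are purely combinatorial, they are easy to test by machine on small examples. The following is a minimal sketch of such a checker; the function names are hypothetical and the code is a direct transcription of axioms (i)-(iii) as stated above, not something taken from the paper. The input `Y` is a list of three lists of reals, and the convention {y_{i,0} = y_{i,k_i+1} = -\infty} is implemented with floating point infinities.

```python
import math
from itertools import product

def sgn(v):
    """Sign of v as -1, 0 or +1 (also correct for -inf and +inf)."""
    return (v > 0) - (v < 0)

def padded(ys):
    """Apply the convention y_{i,0} = y_{i,k_i+1} = -infinity."""
    assert len(ys) % 2 == 1, "each k_i must be odd"
    return [-math.inf] + list(ys) + [-math.inf]

def axiom_i(Y):
    """(i): no triple sum y_{1,p} + y_{2,q} + y_{3,r} vanishes."""
    return all(a + b + c != 0 for a, b, c in product(*Y))

def non_crossing(ys):
    """(ii) for a single i: pairs whose indices have equal parity do not cross."""
    y, k = padded(ys), len(ys)
    for p in range(k + 1):
        for q in range(p + 2, k + 1, 2):  # q > p with the same parity as p
            total = sum((-1) ** (a + b) * sgn(y[a] - y[b])
                        for a in (p, p + 1) for b in (q, q + 1))
            if total != 0:
                return False
    return True

def axiom_iii(Y):
    """(iii): the analogous non-crossing condition for triple sums."""
    y1, y2, y3 = (padded(ys) for ys in Y)
    ranges = [range(len(ys) + 1) for ys in Y]
    for p, q, r in product(*ranges):
        if p % 2 == q % 2 == r % 2:
            total = sum((-1) ** (a + b + c) * sgn(y1[a] + y2[b] + y3[c])
                        for a in (p, p + 1)
                        for b in (q, q + 1)
                        for c in (r, r + 1))
            if total != 0:
                return False
    return True

def satisfies_axioms(Y):
    """True if Y obeys axioms (i), (ii) and (iii)."""
    return axiom_i(Y) and all(non_crossing(ys) for ys in Y) and axiom_iii(Y)

def alternating_sum(Y):
    """The sum over i and p of (-1)^(p-1) y_{i,p}, conjectured to be negative."""
    return sum((-1) ** j * v for ys in Y for j, v in enumerate(ys))
```

For instance, in the simplest case {k_1=k_2=k_3=1}, axiom (iii) on its own already forces {y_{1,1}+y_{2,1}+y_{3,1} < 0}, so `satisfies_axioms([[1.0], [-0.5], [-2.0]])` holds while `satisfies_axioms([[1.0], [1.0], [1.0]])` fails. A search for counterexamples would enumerate configurations `Y` passing `satisfies_axioms` and look for one with `alternating_sum(Y) >= 0`.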

Using various ad hoc arguments involving “winding numbers”, it is possible to prove this conjecture in many cases (e.g. if one of the {k_i} is at most {3}), to the extent that I have now become confident that this conjecture is true (and have now come full circle from trying to disprove Conjecture 1 to now believing that this conjecture holds also). But it seems that there is some non-trivial combinatorial argument to be made if one is to prove this conjecture; purely homological arguments seem to partially resolve the problem, but are not sufficient by themselves.

While I was not able to resolve the square peg problem, I think these results do provide a roadmap to attacking it, first by focusing on the combinatorial conjecture in Conjecture 6 (or its equivalent form in Conjecture 5), then after that is resolved moving on to Conjecture 4, and then finally to Conjecture 1.

By an odd coincidence, I stumbled upon a second question in as many weeks about power series, and once again the only way I know how to prove the result is by complex methods; once again, I am leaving it here as a challenge to any interested readers, and I would be particularly interested in knowing of a proof that was not based on complex analysis (or thinly disguised versions thereof), or for a reference to previous literature where something like this identity has occurred. (I suspect for instance that something like this may have shown up before in free probability, based on the answer to part (ii) of the problem.)

Here is a purely algebraic form of the problem:

Problem 1 Let {F = F(z)} be a formal function of one variable {z}. Suppose that {G = G(z)} is the formal function defined by

\displaystyle G := \sum_{n=1}^\infty \left( \frac{F^n}{n!} \right)^{(n-1)}

\displaystyle = F + \left(\frac{F^2}{2}\right)' + \left(\frac{F^3}{6}\right)'' + \dots

\displaystyle = F + FF' + (F (F')^2 + \frac{1}{2} F^2 F'') + \dots,

where we use {f^{(k)}} to denote the {k}-fold derivative of {f} with respect to the variable {z}.

  • (i) Show that {F} can be formally recovered from {G} by the formula

    \displaystyle F = \sum_{n=1}^\infty (-1)^{n-1} \left( \frac{G^n}{n!} \right)^{(n-1)}

    \displaystyle = G - \left(\frac{G^2}{2}\right)' + \left(\frac{G^3}{6}\right)'' - \dots

    \displaystyle = G - GG' + (G (G')^2 + \frac{1}{2} G^2 G'') - \dots.

  • (ii) There is a remarkable further formal identity relating {F(z)} with {G(z)} that does not explicitly involve any infinite summation. What is this identity?

To rigorously formulate part (i) of this problem, one could work in the commutative differential ring of formal infinite series generated by polynomial combinations of {F} and its derivatives (with no constant term). Part (ii) is a bit trickier to formulate in this abstract ring; the identity in question is easier to state if {F, G} are formal power series, or (even better) convergent power series, as it involves operations such as composition or inversion that can be more easily defined in those latter settings.

To illustrate Problem 1(i), let us compute up to third order in {F}, using {{\mathcal O}(F^4)} to denote any quantity involving four or more factors of {F} and its derivatives, and similarly for other exponents than {4}. Then we have

\displaystyle G = F + FF' + (F (F')^2 + \frac{1}{2} F^2 F'') + {\mathcal O}(F^4)

and hence

\displaystyle G' = F' + (F')^2 + FF'' + {\mathcal O}(F^3)

\displaystyle G'' = F'' + {\mathcal O}(F^2);

multiplying, we have

\displaystyle GG' = FF' + F (F')^2 + F^2 F'' + F (F')^2 + {\mathcal O}(F^4)


\displaystyle G (G')^2 + \frac{1}{2} G^2 G'' = F (F')^2 + \frac{1}{2} F^2 F'' + {\mathcal O}(F^4)

and hence after a lot of canceling

\displaystyle G - GG' + (G (G')^2 + \frac{1}{2} G^2 G'') = F + {\mathcal O}(F^4).

Thus Problem 1(i) holds up to errors of {{\mathcal O}(F^4)} at least. In principle one can continue verifying Problem 1(i) to increasingly high order in {F}, but the computations rapidly become quite lengthy, and I do not know of a direct way to ensure that one always obtains the required cancellation at the end of the computation.
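
The higher-order verifications can be delegated to a computer. The sketch below is one possible way to do this and is not the method of the post: it assumes {F = \varepsilon f(z)} for a hypothetical polynomial {f} and a formal bookkeeping parameter {\varepsilon}, so that "order {n} in {F}" becomes "order {\varepsilon^n}", and it represents truncated series exactly as dictionaries mapping `(i, j)` to the rational coefficient of {\varepsilon^i z^j}. This checks Problem 1(i) for one particular {F} up to the chosen order; it is of course not a proof.

```python
from fractions import Fraction
from math import factorial

ORDER = 5  # discard all terms of order higher than eps^ORDER

def mul(A, B):
    """Product of two truncated series in (eps, z)."""
    C = {}
    for (i1, j1), c1 in A.items():
        for (i2, j2), c2 in B.items():
            if i1 + i2 <= ORDER:
                key = (i1 + i2, j1 + j2)
                C[key] = C.get(key, 0) + c1 * c2
    return {k: v for k, v in C.items() if v != 0}

def ddz(A):
    """Derivative with respect to z."""
    return {(i, j - 1): c * j for (i, j), c in A.items() if j > 0}

def add(A, B, scale):
    """A + scale * B."""
    C = dict(A)
    for k, v in B.items():
        C[k] = C.get(k, 0) + scale * v
    return {k: v for k, v in C.items() if v != 0}

def transform(F, sign):
    """Sum over 1 <= n <= ORDER of sign^(n-1) (F^n / n!)^{(n-1)}, truncated."""
    total = {}
    power = {(0, 0): Fraction(1)}
    for n in range(1, ORDER + 1):
        power = mul(power, F)            # F^n
        term = power
        for _ in range(n - 1):           # (n-1)-fold derivative in z
            term = ddz(term)
        total = add(total, term, Fraction(sign ** (n - 1), factorial(n)))
    return total

# Hypothetical test input: F = eps * (1 + 2 z - z^3).
F = {(1, 0): Fraction(1), (1, 1): Fraction(2), (1, 3): Fraction(-1)}
G = transform(F, +1)   # the transform defining G in Problem 1
H = transform(G, -1)   # the claimed inversion formula of Problem 1(i)
```

With these definitions `H == F` holds, confirming the required cancellation through fifth order for this choice of {F}. Truncating the sums at {n = \hbox{ORDER}} loses nothing at the orders kept, because {F} and {G} have no term of order {\varepsilon^0} and so the {n^{th}} summand only contributes at order {\varepsilon^n} and above.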

Problem 1(i) can also be posed in formal power series: if

\displaystyle F(z) = a_1 z + a_2 z^2 + a_3 z^3 + \dots

is a formal power series with no constant term with complex coefficients {a_1, a_2, \dots} with {|a_1|<1}, then one can verify that the series

\displaystyle G := \sum_{n=1}^\infty \left( \frac{F^n}{n!} \right)^{(n-1)}

makes sense as a formal power series with no constant term, thus

\displaystyle G(z) = b_1 z + b_2 z^2 + b_3 z^3 + \dots.

For instance it is not difficult to show that {b_1 = \frac{a_1}{1-a_1}}. If one further has {|b_1| < 1}, then it turns out that

\displaystyle F = \sum_{n=1}^\infty (-1)^{n-1} \left( \frac{G^n}{n!} \right)^{(n-1)}

as formal power series. Currently the only way I know how to show this is by first proving the claim for power series with a positive radius of convergence using the Cauchy integral formula, but even this is a bit tricky unless one has managed to guess the identity in (ii) first. (In fact, the way I discovered this problem was by first trying to solve (a variant of) the identity in (ii) by Taylor expansion in the course of attacking another problem, and obtaining the transform in Problem 1 as a consequence.)
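
The value of {b_1} quoted above can be checked directly (a short verification, not a step taken from the original discussion). Since {F^n} equals {a_1^n z^n} plus terms of higher order in {z}, and the {(n-1)}-fold derivative of {z^n} is {n!\, z}, the {n^{th}} summand contributes

\displaystyle \left( \frac{F^n}{n!} \right)^{(n-1)} = a_1^n z + \dots

to the linear term, where the omitted terms are of order {z^2} and higher. Summing the geometric series over {n \geq 1} gives

\displaystyle b_1 = \sum_{n=1}^\infty a_1^n = \frac{a_1}{1-a_1},

which is where the hypothesis {|a_1| < 1} is used.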

The transform that takes {F} to {G} resembles both the exponential function

\displaystyle \exp(F) = \sum_{n=0}^\infty \frac{F^n}{n!}

and Taylor’s formula

\displaystyle F(z) = \sum_{n=0}^\infty \frac{F^{(n)}(0)}{n!} z^n

but does not seem to be directly connected to either (this is more apparent once one knows the identity in (ii)).