You are currently browsing the category archive for the ‘expository’ category.

In this post we assume the Riemann hypothesis and the simplicity of zeroes, thus the zeroes of ${\zeta}$ in the critical strip take the form ${\frac{1}{2} \pm i \gamma_j}$ for some real number ordinates ${0 < \gamma_1 < \gamma_2 < \dots}$. From the Riemann-von Mangoldt formula, one has the asymptotic

$\displaystyle \gamma_n = (1+o(1)) \frac{2\pi}{\log n} n$

as ${n \rightarrow \infty}$; in particular, the spacing ${\gamma_{n+1} - \gamma_n}$ should behave like ${\frac{2\pi}{\log n}}$ on the average. However, it can happen that some gaps are unusually small compared to other nearby gaps. For the sake of concreteness, let us define a Lehmer pair to be a pair of adjacent ordinates ${\gamma_n, \gamma_{n+1}}$ such that

$\displaystyle \frac{1}{(\gamma_{n+1} - \gamma_n)^2} \geq 1.3 \sum_{m \neq n,n+1} \frac{1}{(\gamma_m - \gamma_n)^2} + \frac{1}{(\gamma_m - \gamma_{n+1})^2}. \ \ \ \ \ (1)$

The specific value of constant ${1.3}$ is not particularly important here; anything larger than ${\frac{5}{4}}$ would suffice. An example of such a pair would be the classical pair

$\displaystyle \gamma_{6709} = 7005.062866\dots$

$\displaystyle \gamma_{6710} = 7005.100564\dots$

discovered by Lehmer. It follows easily from the main results of Csordas, Smith, and Varga that if an infinite number of Lehmer pairs (in the above sense) existed, then the de Bruijn-Newman constant ${\Lambda}$ is non-negative. This implication is now redundant in view of the unconditional results of this recent paper of Rodgers and myself; however, the question of whether an infinite number of Lehmer pairs exist remain open.

In this post, I sketch an argument that Brad and I came up with (as initially suggested by Odlyzko) the GUE hypothesis implies the existence of infinitely many Lehmer pairs. We argue probabilistically: pick a sufficiently large number ${T}$, pick ${n}$ at random from ${T \log T}$ to ${2 T \log T}$ (so that the average gap size is close to ${\frac{2\pi}{\log T}}$), and prove that the Lehmer pair condition (1) occurs with positive probability.

Introduce the renormalised ordinates ${x_n := \frac{\log T}{2\pi} \gamma_n}$ for ${T \log T \leq n \leq 2 T \log T}$, and let ${\varepsilon > 0}$ be a small absolute constant (independent of ${T}$). It will then suffice to show that

$\displaystyle \frac{1}{(x_{n+1} - x_n)^2} \geq$

$\displaystyle 1.3 \sum_{m \in [T \log T, 2T \log T]: m \neq n,n+1} \frac{1}{(x_m - x_n)^2} + \frac{1}{(x_m - x_{n+1})^2}$

$\displaystyle + \frac{1}{6\varepsilon^2}$

(say) with probability ${\gg \varepsilon^4 - o(1)}$, since the contribution of those ${m}$ outside of ${[T \log T, 2T \log T]}$ can be absorbed by the ${\frac{1}{\varepsilon^2}}$ factor with probability ${o(1)}$.

As one consequence of the GUE hypothesis, we have ${x_{n+1} - x_n \leq \varepsilon^2}$ with probability ${O(\varepsilon^6)}$. Thus, if ${E := \{ m \in [T \log T, 2T \log T]: x_{m+1} - x_m \leq \varepsilon^2 \}}$, then ${E}$ has density ${O( \varepsilon^6 )}$. Applying the Hardy-Littlewood maximal inequality, we see that with probability ${O(\varepsilon^6)}$, we have

$\displaystyle \sup_{h \geq 1} | \# E \cap [n+h, n-h] | \leq \frac{1}{10}$

which implies in particular that

$\displaystyle |x_m - x_n|, |x_{m} - x_{n+1}| \gg \varepsilon^2 |m-n|$

for all ${m \in [T \log T, 2 T \log T] \backslash \{ n, n+1\}}$. This implies in particular that

$\displaystyle \sum_{m \in [T \log T, 2T \log T]: |m-n| \geq \varepsilon^{-3}} \frac{1}{(x_m - x_n)^2} + \frac{1}{(x_m - x_{n+1})^2} \ll \varepsilon^{-1}$

and so it will suffice to show that

$\displaystyle \frac{1}{(x_{n+1} - x_n)^2}$

$\displaystyle \geq 1.3 \sum_{m \in [T \log T, 2T \log T]: m \neq n,n+1; |m-n| < \varepsilon^{-3}} \frac{1}{(x_m - x_n)^2} + \frac{1}{(x_m - x_{n+1})^2} + \frac{1}{5\varepsilon^2}$

(say) with probability ${\gg \varepsilon^4 - o(1)}$.

By the GUE hypothesis (and the fact that ${\varepsilon}$ is independent of ${T}$), it suffices to show that a Dyson sine process ${(x_n)_{n \in {\bf Z}}}$, normalised so that ${x_0}$ is the first positive point in the process, obeys the inequality

$\displaystyle \frac{1}{(x_{1} - x_0)^2} \geq 1.3 \sum_{|m| < \varepsilon^{-3}: m \neq 0,1} \frac{1}{(x_m - x_0)^2} + \frac{1}{(x_m - x_1)^2} \ \ \ \ \ (2)$

with probability ${\gg \varepsilon^4}$. However, if we let ${A > 0}$ be a moderately large constant (and assume ${\varepsilon}$ small depending on ${A}$), one can show using ${k}$-point correlation functions for the Dyson sine process (and the fact that the Dyson kernel ${K(x,y) = \sin(\pi(x-y))/\pi(x-y)}$ equals ${1}$ to second order at the origin) that

$\displaystyle {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} \gg \varepsilon^4$

$\displaystyle {\bf E} N_{[-\varepsilon,0]} \binom{N_{[0,\varepsilon]}}{2} \ll \varepsilon^7$

$\displaystyle {\bf E} \binom{N_{[-\varepsilon,0]}}{2} N_{[0,\varepsilon]} \ll \varepsilon^7$

$\displaystyle {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} N_{[\varepsilon,A^{-1}]} \ll A^{-3} \varepsilon^4$

$\displaystyle {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} N_{[-A^{-1}, -\varepsilon]} \ll A^{-3} \varepsilon^4$

$\displaystyle {\bf E} N_{[-\varepsilon,0]} N_{[0,\varepsilon]} N_{[-k, k]}^2 \ll k^2 \varepsilon^4 \ \ \ \ \ (3)$

for any natural number ${k}$, where ${N_{I}}$ denotes the number of elements of the process in ${I}$. For instance, the expression ${{\bf E} N_{[-\varepsilon,0]} \binom{N_{[0,\varepsilon]}}{2} }$ can be written in terms of the three-point correlation function ${\rho_3(x_1,x_2,x_3) = \mathrm{det}(K(x_i,x_j))_{1 \leq i,j \leq 3}}$ as

$\displaystyle \int_{-\varepsilon \leq x_1 \leq 0 \leq x_2 \leq x_3 \leq \varepsilon} \rho_3( x_1, x_2, x_3 )\ dx_1 dx_2 dx_3$

which can easily be estimated to be ${O(\varepsilon^7)}$ (since ${\rho_3 = O(\varepsilon^4)}$ in this region), and similarly for the other estimates claimed above.

Since for natural numbers ${a,b}$, the quantity ${ab - 2 a \binom{b}{2} - 2 b \binom{a}{2} = ab (5-2a-2b)}$ is only positive when ${a=b=1}$, we see from the first three estimates that the event ${E}$ that ${N_{[-\varepsilon,0]} = N_{[0,\varepsilon]} = 1}$ occurs with probability ${\gg \varepsilon^4}$. In particular, by Markov’s inequality we have the conditional probabilities

$\displaystyle {\bf P} ( N_{[\varepsilon,A^{-1}]} \geq 1 | E ) \ll A^{-3}$

$\displaystyle {\bf P} ( N_{[-A^{-1}, -\varepsilon]} \geq 1 | E ) \ll A^{-3}$

$\displaystyle {\bf P} ( N_{[-k, k]} \geq A k^{5/3} | E ) \ll A^{-4} k^{-4/3}$

and thus, if ${A}$ is large enough, and ${\varepsilon}$ small enough, it will be true with probability ${\gg \varepsilon^4}$ that

$\displaystyle N_{[-\varepsilon,0]}, N_{[0,\varepsilon]} = 1$

and

$\displaystyle N_{[A^{-1}, \varepsilon]} = N_{[\varepsilon, A^{-1}]} = 0$

and simultaneously that

$\displaystyle N_{[-k,k]} \leq A k^{5/3}$

for all natural numbers ${k}$. This implies in particular that

$\displaystyle x_1 - x_0 \leq 2\varepsilon$

and

$\displaystyle |x_m - x_0|, |x_m - x_1| \gg_A |m|^{3/5}$

for all ${m \neq 0,1}$, which gives (2) for ${\varepsilon}$ small enough.

Remark 1 The above argument needed the GUE hypothesis for correlations up to fourth order (in order to establish (3)). It might be possible to reduce the number of correlations needed, but I do not see how to obtain the claim just using pair correlations only.

The Boussinesq equations for inviscid, incompressible two-dimensional fluid flow in the presence of gravity are given by

$\displaystyle (\partial_t + u_x \partial_x+ u_y \partial_y) u_x = -\partial_x p \ \ \ \ \ (1)$

$\displaystyle (\partial_t + u_x \partial_x+ u_y \partial_y) u_y = \rho - \partial_y p \ \ \ \ \ (2)$

$\displaystyle (\partial_t + u_x \partial_x+ u_y \partial_y) \rho = 0 \ \ \ \ \ (3)$

$\displaystyle \partial_x u_x + \partial_y u_y = 0 \ \ \ \ \ (4)$

where ${u: {\bf R} \times {\bf R}^2 \rightarrow {\bf R}^2}$ is the velocity field, ${p: {\bf R} \times {\bf R}^2 \rightarrow {\bf R}}$ is the pressure field, and ${\rho: {\bf R} \times {\bf R}^2 \rightarrow {\bf R}}$ is the density field (or, in some physical interpretations, the temperature field). In this post we shall restrict ourselves to formal manipulations, assuming implicitly that all fields are regular enough (or sufficiently decaying at spatial infinity) that the manipulations are justified. Using the material derivative ${D_t := \partial_t + u_x \partial_x + u_y \partial_y}$, one can abbreviate these equations as

$\displaystyle D_t u_x = -\partial_x p$

$\displaystyle D_t u_y = \rho - \partial_y p$

$\displaystyle D_t \rho = 0$

$\displaystyle \partial_x u_x + \partial_y u_y = 0.$

One can eliminate the role of the pressure ${p}$ by working with the vorticity ${\omega := \partial_x u_y - \partial_y u_x}$. A standard calculation then leads us to the equivalent “vorticity-stream” formulation

$\displaystyle D_t \omega = \partial_x \rho$

$\displaystyle D_t \rho = 0$

$\displaystyle \omega = \partial_x u_y - \partial_y u_x$

$\displaystyle \partial_x u_y + \partial_y u_y = 0$

of the Boussinesq equations. The latter two equations can be used to recover the velocity field ${u}$ from the vorticity ${\omega}$ by the Biot-Savart law

$\displaystyle u_x := -\partial_y \Delta^{-1} \omega; \quad u_y = \partial_x \Delta^{-1} \omega.$

It has long been observed (see e.g. Section 5.4.1 of Bertozzi-Majda) that the Boussinesq equations are very similar, though not quite identical, to the three-dimensional inviscid incompressible Euler equations under the hypothesis of axial symmetry (with swirl). The Euler equations are

$\displaystyle \partial_t u + (u \cdot \nabla) u = - \nabla p$

$\displaystyle \nabla \cdot u = 0$

where now the velocity field ${u: {\bf R} \times {\bf R}^3 \rightarrow {\bf R}^3}$ and pressure field ${p: {\bf R} \times {\bf R}^3 \rightarrow {\bf R}}$ are over the three-dimensional domain ${{\bf R}^3}$. If one expresses ${{\bf R}^3}$ in polar coordinates ${(z,r,\theta)}$ then one can write the velocity vector field ${u}$ in these coordinates as

$\displaystyle u = u^z \frac{d}{dz} + u^r \frac{d}{dr} + u^\theta \frac{d}{d\theta}.$

If we make the axial symmetry assumption that these components, as well as ${p}$, do not depend on the ${\theta}$ variable, thus

$\displaystyle \partial_\theta u^z, \partial_\theta u^r, \partial_\theta u^\theta, \partial_\theta p = 0,$

then after some calculation (which we give below the fold) one can eventually reduce the Euler equations to the system

$\displaystyle \tilde D_t \omega = \frac{1}{r^4} \partial_z \rho \ \ \ \ \ (5)$

$\displaystyle \tilde D_t \rho = 0 \ \ \ \ \ (6)$

$\displaystyle \omega = \frac{1}{r} (\partial_z u^r - \partial_r u^z) \ \ \ \ \ (7)$

$\displaystyle \partial_z(ru^z) + \partial_r(ru^r) = 0 \ \ \ \ \ (8)$

where ${\tilde D_t := \partial_t + u^z \partial_z + u^r \partial_r}$ is the modified material derivative, and ${\rho}$ is the field ${\rho := (r u^\theta)^2}$. This is almost identical with the Boussinesq equations except for some additional powers of ${r}$; thus, the intuition is that the Boussinesq equations are a simplified model for axially symmetric Euler flows when one stays away from the axis ${r=0}$ and also does not wander off to ${r=\infty}$.

However, this heuristic is not rigorous; the above calculations do not actually give an embedding of the Boussinesq equations into Euler. (The equations do match on the cylinder ${r=1}$, but this is a measure zero subset of the domain, and so is not enough to give an embedding on any non-trivial region of space.) Recently, while playing around with trying to embed other equations into the Euler equations, I discovered that it is possible to make such an embedding into a four-dimensional Euler equation, albeit on a slightly curved manifold rather than in Euclidean space. More precisely, we use the Ebin-Marsden generalisation

$\displaystyle \partial_t u + \nabla_u u = - \mathrm{grad}_g p$

$\displaystyle \mathrm{div}_g u = 0$

of the Euler equations to an arbitrary Riemannian manifold ${(M,g)}$ (ignoring any issues of boundary conditions for this discussion), where ${u: {\bf R} \rightarrow \Gamma(TM)}$ is a time-dependent vector field, ${p: {\bf R} \rightarrow C^\infty(M)}$ is a time-dependent scalar field, and ${\nabla_u}$ is the covariant derivative along ${u}$ using the Levi-Civita connection ${\nabla}$. In Penrose abstract index notation (using the Levi-Civita connection ${\nabla}$, and raising and lowering indices using the metric ${g = g_{ij}}$), the equations of motion become

$\displaystyle \partial_t u^i + u^j \nabla_j u^i = - \nabla^i p \ \ \ \ \ (9)$

$\displaystyle \nabla_i u^i = 0;$

in coordinates, this becomes

$\displaystyle \partial_t u^i + u^j (\partial_j u^i + \Gamma^i_{jk} u^k) = - g^{ij} \partial_j p$

$\displaystyle \partial_i u^i + \Gamma^i_{ik} u^k = 0 \ \ \ \ \ (10)$

where the Christoffel symbols ${\Gamma^i_{jk}}$ are given by the formula

$\displaystyle \Gamma^i_{jk} := \frac{1}{2} g^{il} (\partial_j g_{lk} + \partial_k g_{lj} - \partial_l g_{jk}),$

where ${g^{il}}$ is the inverse to the metric tensor ${g_{il}}$. If the coordinates are chosen so that the volume form ${dg}$ is the Euclidean volume form ${dx}$, thus ${\mathrm{det}(g)=1}$, then on differentiating we have ${g^{ij} \partial_k g_{ij} = 0}$, and hence ${\Gamma^i_{ik} = 0}$, and so the divergence-free equation (10) simplifies in this case to ${\partial_i u^i = 0}$. The Ebin-Marsden Euler equations are the natural generalisation of the Euler equations to arbitrary manifolds; for instance, they (formally) conserve the kinetic energy

$\displaystyle \frac{1}{2} \int_M |u|_g^2\ dg = \frac{1}{2} \int_M g_{ij} u^i u^j\ dg$

and can be viewed as the formal geodesic flow equation on the infinite-dimensional manifold of volume-preserving diffeomorphisms on ${M}$ (see this previous post for a discussion of this in the flat space case).

The specific four-dimensional manifold in question is the space ${{\bf R} \times {\bf R}^+ \times {\bf R}/{\bf Z} \times {\bf R}/{\bf Z}}$ with metric

$\displaystyle dx^2 + dy^2 + y^{-1} dz^2 + y dw^2$

and solutions to the Boussinesq equation on ${{\bf R} \times {\bf R}^+}$ can be transformed into solutions to the Euler equations on this manifold. This is part of a more general family of embeddings into the Euler equations in which passive scalar fields (such as the field ${\rho}$ appearing in the Boussinesq equations) can be incorporated into the dynamics via fluctuations in the Riemannian metric ${g}$). I am writing the details below the fold (partly for my own benefit).

A basic object of study in multiplicative number theory are the arithmetic functions: functions ${f: {\bf N} \rightarrow {\bf C}}$ from the natural numbers to the complex numbers. Some fundamental examples of such functions include

• The constant function ${1: n \mapsto 1}$;
• The Kronecker delta function ${\delta: n \mapsto 1_{n=1}}$;
• The natural logarithm function ${L: n \mapsto \log n}$;
• The divisor function ${d_2: n \mapsto \sum_{d|n} 1}$;
• The von Mangoldt function ${\Lambda}$, with ${\Lambda(n)}$ defined to equal ${\log p}$ when ${n}$ is a power ${p^j}$ of a prime ${p}$ for some ${j \geq 1}$, and defined to equal zero otherwise; and
• The Möbius function ${\mu}$, with ${\mu(n)}$ defined to equal ${(-1)^k}$ when ${n}$ is the product of ${k}$ distinct primes, and defined to equal zero otherwise.

Given an arithmetic function ${f}$, we are often interested in statistics such as the summatory function

$\displaystyle \sum_{n \leq x} f(n), \ \ \ \ \ (1)$

the logarithmically (or harmonically) weighted summatory function

$\displaystyle \sum_{n \leq x} \frac{f(n)}{n}, \ \ \ \ \ (2)$

or the Dirichlet series

$\displaystyle {\mathcal D}[f](s) := \sum_n \frac{f(n)}{n^s}.$

In the latter case, one typically has to first restrict ${s}$ to those complex numbers whose real part is large enough in order to ensure the series on the right converges; but in many important cases, one can then extend the Dirichlet series to almost all of the complex plane by analytic continuation. One is also interested in correlations involving additive shifts, such as ${\sum_{n \leq x} f(n) f(n+h)}$, but these are significantly more difficult to study and cannot be easily estimated by the methods of classical multiplicative number theory.

A key operation on arithmetic functions is that of Dirichlet convolution, which when given two arithmetic functions ${f,g: {\bf N} \rightarrow {\bf C}}$, forms a new arithmetic function ${f*g: {\bf N} \rightarrow {\bf C}}$, defined by the formula

$\displaystyle f*g(n) := \sum_{d|n} f(d) g(\frac{n}{d}).$

Thus for instance ${1*1 = d_2}$, ${1 * \Lambda = L}$, ${1 * \mu = \delta}$, and ${\delta * f = f}$ for any arithmetic function ${f}$. Dirichlet convolution and Dirichlet series are related by the fundamental formula

$\displaystyle {\mathcal D}[f * g](s) = {\mathcal D}[f](s) {\mathcal D}[g](s), \ \ \ \ \ (3)$

at least when the real part of ${s}$ is large enough that all sums involved become absolutely convergent (but in practice one can use analytic continuation to extend this identity to most of the complex plane). There is also the identity

$\displaystyle {\mathcal D}[Lf](s) = - \frac{d}{ds} {\mathcal D}[f](s), \ \ \ \ \ (4)$

at least when the real part of ${s}$ is large enough to justify interchange of differentiation and summation. As a consequence, many Dirichlet series can be expressed in terms of the Riemann zeta function ${\zeta = {\mathcal D}[1]}$, thus for instance

$\displaystyle {\mathcal D}[d_2](s) = \zeta^2(s)$

$\displaystyle {\mathcal D}[L](s) = - \zeta'(s)$

$\displaystyle {\mathcal D}[\delta](s) = 1$

$\displaystyle {\mathcal D}[\mu](s) = \frac{1}{\zeta(s)}$

$\displaystyle {\mathcal D}[\Lambda](s) = -\frac{\zeta'(s)}{\zeta(s)}.$

Much of the difficulty of multiplicative number theory can be traced back to the discrete nature of the natural numbers ${{\bf N}}$, which form a rather complicated abelian semigroup with respect to multiplication (in particular the set of generators is the set of prime numbers). One can obtain a simpler analogue of the subject by working instead with the half-infinite interval ${{\bf N}_\infty := [1,+\infty)}$, which is a much simpler abelian semigroup under multiplication (being a one-dimensional Lie semigroup). (I will think of this as a sort of “completion” of ${{\bf N}}$ at the infinite place ${\infty}$, hence the terminology.) Accordingly, let us define a continuous arithmetic function to be a locally integrable function ${f: {\bf N}_\infty \rightarrow {\bf C}}$. The analogue of the summatory function (1) is then an integral

$\displaystyle \int_1^x f(t)\ dt,$

and similarly the analogue of (2) is

$\displaystyle \int_1^x \frac{f(t)}{t}\ dt.$

The analogue of the Dirichlet series is the Mellin-type transform

$\displaystyle {\mathcal D}_\infty[f](s) := \int_1^\infty \frac{f(t)}{t^s}\ dt,$

which will be well-defined at least if the real part of ${s}$ is large enough and if the continuous arithmetic function ${f: {\bf N}_\infty \rightarrow {\bf C}}$ does not grow too quickly, and hopefully will also be defined elsewhere in the complex plane by analytic continuation.

For instance, the continuous analogue of the discrete constant function ${1: {\bf N} \rightarrow {\bf C}}$ would be the constant function ${1_\infty: {\bf N}_\infty \rightarrow {\bf C}}$, which maps any ${t \in [1,+\infty)}$ to ${1}$, and which we will denote by ${1_\infty}$ in order to keep it distinct from ${1}$. The two functions ${1_\infty}$ and ${1}$ have approximately similar statistics; for instance one has

$\displaystyle \sum_{n \leq x} 1 = \lfloor x \rfloor \approx x-1 = \int_1^x 1\ dt$

and

$\displaystyle \sum_{n \leq x} \frac{1}{n} = H_{\lfloor x \rfloor} \approx \log x = \int_1^x \frac{1}{t}\ dt$

where ${H_n}$ is the ${n^{th}}$ harmonic number, and we are deliberately vague as to what the symbol ${\approx}$ means. Continuing this analogy, we would expect

$\displaystyle {\mathcal D}[1](s) = \zeta(s) \approx \frac{1}{s-1} = {\mathcal D}_\infty[1_\infty](s)$

which reflects the fact that ${\zeta}$ has a simple pole at ${s=1}$ with residue ${1}$, and no other poles. Note that the identity ${{\mathcal D}_\infty[1_\infty](s) = \frac{1}{s-1}}$ is initially only valid in the region ${\mathrm{Re} s > 1}$, but clearly the right-hand side can be continued analytically to the entire complex plane except for the pole at ${1}$, and so one can define ${{\mathcal D}_\infty[1_\infty]}$ in this region also.

In a similar vein, the logarithm function ${L: {\bf N} \rightarrow {\bf C}}$ is approximately similar to the logarithm function ${L_\infty: {\bf N}_\infty \rightarrow {\bf C}}$, giving for instance the crude form

$\displaystyle \sum_{n \leq x} L(n) = \log \lfloor x \rfloor! \approx x \log x - x = \int_1^\infty L_\infty(t)\ dt$

of Stirling’s formula, or the Dirichlet series approximation

$\displaystyle {\mathcal D}[L](s) = -\zeta'(s) \approx \frac{1}{(s-1)^2} = {\mathcal D}_\infty[L_\infty](s).$

The continuous analogue of Dirichlet convolution is multiplicative convolution using the multiplicative Haar measure ${\frac{dt}{t}}$: given two continuous arithmetic functions ${f_\infty, g_\infty: {\bf N}_\infty \rightarrow {\bf C}}$, one can define their convolution ${f_\infty *_\infty g_\infty: {\bf N}_\infty \rightarrow {\bf C}}$ by the formula

$\displaystyle f_\infty *_\infty g_\infty(t) := \int_1^t f_\infty(s) g_\infty(\frac{t}{s}) \frac{ds}{s}.$

Thus for instance ${1_\infty * 1_\infty = L_\infty}$. A short computation using Fubini’s theorem shows the analogue

$\displaystyle D_\infty[f_\infty *_\infty g_\infty](s) = D_\infty[f_\infty](s) D_\infty[g_\infty](s)$

of (3) whenever the real part of ${s}$ is large enough that Fubini’s theorem can be justified; similarly, differentiation under the integral sign shows that

$\displaystyle D_\infty[L_\infty f_\infty](s) = -\frac{d}{ds} D_\infty[f_\infty](s) \ \ \ \ \ (5)$

again assuming that the real part of ${s}$ is large enough that differentiation under the integral sign (or some other tool like this, such as the Cauchy integral formula for derivatives) can be justified.

Direct calculation shows that for any complex number ${\rho}$, one has

$\displaystyle \frac{1}{s-\rho} = D_\infty[ t \mapsto t^{\rho-1} ](s)$

(at least for the real part of ${s}$ large enough), and hence by several applications of (5)

$\displaystyle \frac{1}{(s-\rho)^k} = D_\infty[ t \mapsto \frac{1}{(k-1)!} t^{\rho-1} \log^{k-1} t ](s)$

for any natural number ${k}$. This can lead to the following heuristic: if a Dirichlet series ${D[f](s)}$ behaves like a linear combination of poles ${\frac{1}{(s-\rho)^k}}$, in that

$\displaystyle D[f](s) \approx \sum_\rho \frac{c_\rho}{(s-\rho)^{k_\rho}}$

for some set ${\rho}$ of poles and some coefficients ${c_\rho}$ and natural numbers ${k_\rho}$ (where we again are vague as to what ${\approx}$ means, and how to interpret the sum ${\sum_\rho}$ if the set of poles is infinite), then one should expect the arithmetic function ${f}$ to behave like the continuous arithmetic function

$\displaystyle t \mapsto \sum_\rho \frac{c_\rho}{(k_\rho-1)!} t^{\rho-1} \log^{k_\rho-1} t.$

In particular, if we only have simple poles,

$\displaystyle D[f](s) \approx \sum_\rho \frac{c_\rho}{s-\rho}$

then we expect to have ${f}$ behave like continuous arithmetic function

$\displaystyle t \mapsto \sum_\rho c_\rho t^{\rho-1}.$

Integrating this from ${1}$ to ${x}$, this heuristically suggests an approximation

$\displaystyle \sum_{n \leq x} f(n) \approx \sum_\rho c_\rho \frac{x^\rho-1}{\rho}$

for the summatory function, and similarly

$\displaystyle \sum_{n \leq x} \frac{f(n)}{n} \approx \sum_\rho c_\rho \frac{x^{\rho-1}-1}{\rho-1},$

with the convention that ${\frac{x^\rho-1}{\rho}}$ is ${\log x}$ when ${\rho=0}$, and similarly ${\frac{x^{\rho-1}-1}{\rho-1}}$ is ${\log x}$ when ${\rho=1}$. One can make these sorts of approximations more rigorous by means of Perron’s formula (or one of its variants) combined with the residue theorem, provided that one has good enough control on the relevant Dirichlet series, but we will not pursue these rigorous calculations here. (But see for instance this previous blog post for some examples.)

For instance, using the more refined approximation

$\displaystyle \zeta(s) \approx \frac{1}{s-1} + \gamma$

to the zeta function near ${s=1}$, we have

$\displaystyle {\mathcal D}[d_2](s) = \zeta^2(s) \approx \frac{1}{(s-1)^2} + \frac{2 \gamma}{s-1}$

we would expect that

$\displaystyle d_2 \approx L_\infty + 2 \gamma$

and thus for instance

$\displaystyle \sum_{n \leq x} d_2(n) \approx x \log x - x + 2 \gamma x$

which matches what one actually gets from the Dirichlet hyperbola method (see e.g. equation (44) of this previous post).

Or, noting that ${\zeta(s)}$ has a simple pole at ${s=1}$ and assuming simple zeroes elsewhere, the log derivative ${-\zeta'(s)/\zeta(s)}$ will have simple poles of residue ${+1}$ at ${s=1}$ and ${-1}$ at all the zeroes, leading to the heuristic

$\displaystyle {\mathcal D}[\Lambda](s) = -\frac{\zeta'(s)}{\zeta(s)} \approx \frac{1}{s-1} - \sum_\rho \frac{1}{s-\rho}$

suggesting that ${\Lambda}$ should behave like the continuous arithmetic function

$\displaystyle t \mapsto 1 - \sum_\rho t^{\rho-1}$

leading for instance to the summatory approximation

$\displaystyle \sum_{n \leq x} \Lambda(n) \approx x - \sum_\rho \frac{x^\rho-1}{\rho}$

which is a heuristic form of the Riemann-von Mangoldt explicit formula (see Exercise 45 of these notes for a rigorous version of this formula).

Exercise 1 Go through some of the other explicit formulae listed at this Wikipedia page and give heuristic justifications for them (up to some lower order terms) by similar calculations to those given above.

Given the “adelic” perspective on number theory, I wonder if there are also ${p}$-adic analogues of arithmetic functions to which a similar set of heuristics can be applied, perhaps to study sums such as ${\sum_{n \leq x: n = a \hbox{ mod } p^j} f(n)}$. A key problem here is that there does not seem to be any good interpretation of the expression ${\frac{1}{t^s}}$ when ${s}$ is complex and ${t}$ is a ${p}$-adic number, so it is not clear that one can analyse a Dirichlet series ${p}$-adically. For similar reasons, we don’t have a canonical way to define ${\chi(t)}$ for a Dirichlet character ${\chi}$ (unless its conductor happens to be a power of ${p}$), so there doesn’t seem to be much to say in the ${q}$-aspect either.

Let ${\lambda: {\bf N} \rightarrow \{-1,1\}}$ be the Liouville function, thus ${\lambda(n)}$ is defined to equal ${+1}$ when ${n}$ is the product of an even number of primes, and ${-1}$ when ${n}$ is the product of an odd number of primes. The Chowla conjecture asserts that ${\lambda}$ has the statistics of a random sign pattern, in the sense that

$\displaystyle \lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) = 0 \ \ \ \ \ (1)$

for all ${k \geq 1}$ and all distinct natural numbers ${h_1,\dots,h_k}$, where we use the averaging notation

$\displaystyle \mathbb{E}_{n \leq N} f(n) := \frac{1}{N} \sum_{n \leq N} f(n).$

For ${k=1}$, this conjecture is equivalent to the prime number theorem (as discussed in this previous blog post), but the conjecture remains open for any ${k \geq 2}$.

In recent years, it has been realised that one can make more progress on this conjecture if one works instead with the logarithmically averaged version

$\displaystyle \lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N}^{\log} \lambda(n+h_1) \dots \lambda(n+h_k) = 0 \ \ \ \ \ (2)$

of the conjecture, where we use the logarithmic averaging notation

$\displaystyle \mathbb{E}_{n \leq N}^{\log} f(n) := \frac{\sum_{n \leq N} \frac{f(n)}{n}}{\sum_{n \leq N} \frac{1}{n}}.$

Using the summation by parts (or telescoping series) identity

$\displaystyle \sum_{n \leq N} \frac{f(n)}{n} = \sum_{M < N} \frac{1}{M(M+1)} (\sum_{n \leq M} f(n)) + \frac{1}{N} \sum_{n \leq N} f(n) \ \ \ \ \ (3)$

it is not difficult to show that the Chowla conjecture (1) for a given ${k,h_1,\dots,h_k}$ implies the logarithmically averaged conjecture (2). However, the converse implication is not at all clear. For instance, for ${k=1}$, we have already mentioned that the Chowla conjecture

$\displaystyle \lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N} \lambda(n) = 0$

is equivalent to the prime number theorem; but the logarithmically averaged analogue

$\displaystyle \lim_{N \rightarrow \infty} \mathbb{E}^{\log}_{n \leq N} \lambda(n) = 0$

is significantly easier to show (a proof with the Liouville function ${\lambda}$ replaced by the closely related Möbius function ${\mu}$ is given in this previous blog post). And indeed, significantly more is now known for the logarithmically averaged Chowla conjecture; in this paper of mine I had proven (2) for ${k=2}$, and in this recent paper with Joni Teravainen, we proved the conjecture for all odd ${k}$ (with a different proof also given here).

In view of this emerging consensus that the logarithmically averaged Chowla conjecture was easier than the ordinary Chowla conjecture, it was thus somewhat of a surprise for me to read a recent paper of Gomilko, Kwietniak, and Lemanczyk who (among other things) established the following statement:

Theorem 1 Assume that the logarithmically averaged Chowla conjecture (2) is true for all ${k}$. Then there exists a sequence ${N_i}$ going to infinity such that the Chowla conjecture (1) is true for all ${k}$ along that sequence, that is to say

$\displaystyle \lim_{N_i \rightarrow \infty} \mathbb{E}_{n \leq N_i} \lambda(n+h_1) \dots \lambda(n+h_k) = 0$

for all ${k}$ and all distinct ${h_1,\dots,h_k}$.

This implication does not use any special properties of the Liouville function (other than that they are bounded), and in fact proceeds by ergodic theoretic methods, focusing in particular on the ergodic decomposition of invariant measures of a shift into ergodic measures. Ergodic methods have proven remarkably fruitful in understanding these sorts of number theoretic and combinatorial problems, as could already be seen by the ergodic theoretic proof of Szemerédi’s theorem by Furstenberg, and more recently by the work of Frantzikinakis and Host on Sarnak’s conjecture. (My first paper with Teravainen also uses ergodic theory tools.) Indeed, many other results in the subject were first discovered using ergodic theory methods.

On the other hand, many results in this subject that were first proven ergodic theoretically have since been reproven by more combinatorial means; my second paper with Teravainen is an instance of this. As it turns out, one can also prove Theorem 1 by a standard combinatorial (or probabilistic) technique known as the second moment method. In fact, one can prove slightly more:

Theorem 2 Let ${k}$ be a natural number. Assume that the logarithmically averaged Chowla conjecture (2) is true for ${2k}$. Then there exists a set ${{\mathcal N}}$ of natural numbers of logarithmic density ${1}$ (that is, ${\lim_{N \rightarrow \infty} \mathbb{E}_{n \leq N}^{\log} 1_{n \in {\mathcal N}} = 1}$) such that

$\displaystyle \lim_{N \rightarrow \infty: N \in {\mathcal N}} \mathbb{E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) = 0$

for any distinct ${h_1,\dots,h_k}$.

It is not difficult to deduce Theorem 1 from Theorem 2 using a diagonalisation argument. Unfortunately, the known cases of the logarithmically averaged Chowla conjecture (${k=2}$ and odd ${k}$) are currently insufficient to use Theorem 2 for any purpose other than to reprove what is already known to be true from the prime number theorem. (Indeed, the even cases of Chowla, in either logarithmically averaged or non-logarithmically averaged forms, seem to be far more powerful than the odd cases; see Remark 1.7 of this paper of myself and Teravainen for a related observation in this direction.)

We now sketch the proof of Theorem 2. For any distinct ${h_1,\dots,h_k}$, we take a large number ${H}$ and consider the limiting the second moment

$\displaystyle \limsup_{N \rightarrow \infty} \mathop{\bf E}_{n \leq N}^{\log} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2.$

We can expand this as

$\displaystyle \limsup_{N \rightarrow \infty} \mathop{\bf E}_{m,m' \leq H} \mathop{\bf E}_{n \leq N}^{\log} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)$

$\displaystyle \lambda(n+m'+h_1) \dots \lambda(n+m'+h_k).$

If all the ${m+h_1,\dots,m+h_k,m'+h_1,\dots,m'+h_k}$ are distinct, the hypothesis (2) tells us that the inner averages goes to zero as ${N \rightarrow \infty}$. The remaining averages are ${O(1)}$, and there are ${O( k^2 )}$ of these averages. We conclude that

$\displaystyle \limsup_{N \rightarrow \infty} \mathop{\bf E}_{n \leq N}^{\log} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2 \ll k^2 / H.$

By Markov’s inequality (and (3)), we conclude that for any fixed ${h_1,\dots,h_k, H}$, there exists a set ${{\mathcal N}_{h_1,\dots,h_k,H}}$ of upper logarithmic density at least ${1-k/H^{1/2}}$, thus

$\displaystyle \limsup_{N \rightarrow \infty} \mathbb{E}_{n \leq N}^{\log} 1_{n \in {\mathcal N}_{h_1,\dots,h_k,H}} \geq 1 - k/H^{1/2}$

such that

$\displaystyle \mathop{\bf E}_{n \leq N} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2 \ll k / H^{1/2}.$

By deleting at most finitely many elements, we may assume that ${{\mathcal N}_{h_1,\dots,h_k,H}}$ consists only of elements of size at least ${H^2}$ (say).

For any ${H_0}$, if we let ${{\mathcal N}_{h_1,\dots,h_k, \geq H_0}}$ be the union of ${{\mathcal N}_{h_1,\dots,h_k, H}}$ for ${H \geq H_0}$, then ${{\mathcal N}_{h_1,\dots,h_k, \geq H_0}}$ has logarithmic density ${1}$. By a diagonalisation argument (using the fact that the set of tuples ${(h_1,\dots,h_k)}$ is countable), we can then find a set ${{\mathcal N}}$ of natural numbers of logarithmic density ${1}$, such that for every ${h_1,\dots,h_k,H_0}$, every sufficiently large element of ${{\mathcal N}}$ lies in ${{\mathcal N}_{h_1,\dots,h_k,\geq H_0}}$. Thus for every sufficiently large ${N}$ in ${{\mathcal N}}$, one has

$\displaystyle \mathop{\bf E}_{n \leq N} |\mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k)|^2 \ll k / H^{1/2}.$

for some ${H \geq H_0}$ with ${N \geq H^2}$. By Cauchy-Schwarz, this implies that

$\displaystyle \mathop{\bf E}_{n \leq N} \mathop{\bf E}_{m \leq H} \lambda(n+m+h_1) \dots \lambda(n+m+h_k) \ll k^{1/2} / H^{1/4};$

interchanging the sums and using ${N \geq H^2}$ and ${H \geq H_0}$, this implies that

$\displaystyle \mathop{\bf E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) \ll k^{1/2} / H^{1/4} \leq k^{1/2} / H_0^{1/4}.$

We conclude on taking ${H_0}$ to infinity that

$\displaystyle \lim_{N \rightarrow \infty; N \in {\mathcal N}} \mathop{\bf E}_{n \leq N} \lambda(n+h_1) \dots \lambda(n+h_k) = 0$

as required.

Let ${P(z) = z^n + a_{n-1} z^{n-1} + \dots + a_0}$ be a monic polynomial of degree ${n}$ with complex coefficients. Then by the fundamental theorem of algebra, we can factor ${P}$ as

$\displaystyle P(z) = (z-z_1) \dots (z-z_n) \ \ \ \ \ (1)$

for some complex zeroes ${z_1,\dots,z_n}$ (possibly with repetition).

Now suppose we evolve ${P}$ with respect to time by heat flow, creating a function ${P(t,z)}$ of two variables with given initial data ${P(0,z) = P(z)}$ for which

$\displaystyle \partial_t P(t,z) = \partial_{zz} P(t,z). \ \ \ \ \ (2)$

On the space of polynomials of degree at most ${n}$, the operator ${\partial_{zz}}$ is nilpotent, and one can solve this equation explicitly both forwards and backwards in time by the Taylor series

$\displaystyle P(t,z) = \sum_{j=0}^\infty \frac{t^j}{j!} \partial_z^{2j} P(0,z).$

For instance, if one starts with a quadratic ${P(0,z) = z^2 + bz + c}$, then the polynomial evolves by the formula

$\displaystyle P(t,z) = z^2 + bz + (c+2t).$

As the polynomial ${P(t)}$ evolves in time, the zeroes ${z_1(t),\dots,z_n(t)}$ evolve also. Assuming for sake of discussion that the zeroes are simple, the inverse function theorem tells us that the zeroes will (locally, at least) evolve smoothly in time. What are the dynamics of this evolution?

For instance, in the quadratic case, the quadratic formula tells us that the zeroes are

$\displaystyle z_1(t) = \frac{-b + \sqrt{b^2 - 4(c+2t)}}{2}$

and

$\displaystyle z_2(t) = \frac{-b - \sqrt{b^2 - 4(c+2t)}}{2}$

after arbitrarily choosing a branch of the square root. If ${b,c}$ are real and the discriminant ${b^2 - 4c}$ is initially positive, we see that we start with two real zeroes centred around ${-b/2}$, which then approach each other until time ${t = \frac{b^2-4c}{8}}$, at which point the roots collide and then move off from each other in an imaginary direction.

In the general case, we can obtain the equations of motion by implicitly differentiating the defining equation

$\displaystyle P( t, z_i(t) ) = 0$

in time using (2) to obtain

$\displaystyle \partial_{zz} P( t, z_i(t) ) + \partial_t z_i(t) \partial_z P(t,z_i(t)) = 0.$

To simplify notation we drop the explicit dependence on time, thus

$\displaystyle \partial_{zz} P(z_i) + (\partial_t z_i) \partial_z P(z_i)= 0.$

From (1) and the product rule, we see that

$\displaystyle \partial_z P( z_i ) = \prod_{j:j \neq i} (z_i - z_j)$

and

$\displaystyle \partial_{zz} P( z_i ) = 2 \sum_{k:k \neq i} \prod_{j:j \neq i,k} (z_i - z_j)$

(where all indices are understood to range over ${1,\dots,n}$) leading to the equations of motion

$\displaystyle \partial_t z_i = \sum_{k:k \neq i} \frac{2}{z_k - z_i}, \ \ \ \ \ (3)$

at least when one avoids those times in which there is a repeated zero. In the case when the zeroes ${z_i}$ are real, each term ${\frac{2}{z_k-z_i}}$ represents a (first-order) attraction in the dynamics between ${z_i}$ and ${z_k}$, but the dynamics are more complicated for complex zeroes (e.g. purely imaginary zeroes will experience repulsion rather than attraction, as one already sees in the quadratic example). Curiously, this system resembles that of Dyson brownian motion (except with the brownian motion part removed, and time reversed). I learned of the connection between the ODE (3) and the heat equation from this paper of Csordas, Smith, and Varga, but perhaps it has been mentioned in earlier literature as well.

One interesting consequence of these equations is that if the zeroes are real at some time, then they will stay real as long as the zeroes do not collide. Let us now restrict attention to the case of real simple zeroes, in which case we will rename the zeroes as ${x_i}$ instead of ${z_i}$, and order them as ${x_1 < \dots < x_n}$. The evolution

$\displaystyle \partial_t x_i = \sum_{k:k \neq i} \frac{2}{x_k - x_i}$

can now be thought of as reverse gradient flow for the “entropy”

$\displaystyle H := -\sum_{i,j: i \neq j} \log |x_i - x_j|,$

(which is also essentially the logarithm of the discriminant of the polynomial) since we have

$\displaystyle \partial_t x_i = \frac{\partial H}{\partial x_i}.$

In particular, we have the monotonicity formula

$\displaystyle \partial_t H = 4E$

where ${E}$ is the “energy”

$\displaystyle E := \frac{1}{4} \sum_i (\frac{\partial H}{\partial x_i})^2$

$\displaystyle = \sum_i (\sum_{k:k \neq i} \frac{1}{x_k-x_i})^2$

$\displaystyle = \sum_{i,k: i \neq k} \frac{1}{(x_k-x_i)^2} + 2 \sum_{i,j,k: i,j,k \hbox{ distinct}} \frac{1}{(x_k-x_i)(x_j-x_i)}$

$\displaystyle = \sum_{i,k: i \neq k} \frac{1}{(x_k-x_i)^2}$

where in the last line we use the antisymmetrisation identity

$\displaystyle \frac{1}{(x_k-x_i)(x_j-x_i)} + \frac{1}{(x_i-x_j)(x_k-x_j)} + \frac{1}{(x_j-x_k)(x_i-x_k)} = 0.$

Among other things, this shows that as one goes backwards in time, the entropy decreases, and so no collisions can occur to the past, only in the future, which is of course consistent with the attractive nature of the dynamics. As ${H}$ is a convex function of the positions ${x_1,\dots,x_n}$, one expects ${H}$ to also evolve in a convex manner in time, that is to say the energy ${E}$ should be increasing. This is indeed the case:

Exercise 1 Show that

$\displaystyle \partial_t E = 2 \sum_{i,j: i \neq j} (\frac{2}{(x_i-x_j)^2} - \sum_{k: i,j,k \hbox{ distinct}} \frac{1}{(x_k-x_i)(x_k-x_j)})^2.$

Symmetric polynomials of the zeroes are polynomial functions of the coefficients and should thus evolve in a polynomial fashion. One can compute this explicitly in simple cases. For instance, the center of mass is an invariant:

$\displaystyle \partial_t \frac{1}{n} \sum_i x_i = 0.$

The variance decreases linearly:

Exercise 2 Establish the virial identity

$\displaystyle \partial_t \sum_{i,j} (x_i-x_j)^2 = - 4n^2(n-1).$

As the variance (which is proportional to ${\sum_{i,j} (x_i-x_j)^2}$) cannot become negative, this identity shows that “finite time blowup” must occur – that the zeroes must collide at or before the time ${\frac{1}{4n^2(n-1)} \sum_{i,j} (x_i-x_j)^2}$.

Exercise 3 Show that the Stieltjes transform

$\displaystyle s(t,z) = \sum_i \frac{1}{x_i - z}$

solves the viscous Burgers equation

$\displaystyle \partial_t s = \partial_{zz} s - 2 s \partial_z s,$

either by using the original heat equation (2) and the identity ${s = - \partial_z P / P}$, or else by using the equations of motion (3). This relation between the Burgers equation and the heat equation is known as the Cole-Hopf transformation.

The paper of Csordas, Smith, and Varga mentioned previously gives some other bounds on the lifespan of the dynamics; roughly speaking, they show that if there is one pair of zeroes that are much closer to each other than to the other zeroes then they must collide in a short amount of time (unless there is a collision occuring even earlier at some other location). Their argument extends also to situations where there are an infinite number of zeroes, which they apply to get new results on Newman’s conjecture in analytic number theory. I would be curious to know of further places in the literature where this dynamics has been studied.

Suppose we have an ${n \times n}$ matrix ${M}$ that is expressed in block-matrix form as

$\displaystyle M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$

where ${A}$ is an ${(n-k) \times (n-k)}$ matrix, ${B}$ is an ${(n-k) \times k}$ matrix, ${C}$ is an ${k \times (n-k)}$ matrix, and ${D}$ is a ${k \times k}$ matrix for some ${1 < k < n}$. If ${A}$ is invertible, we can use the technique of Schur complementation to express the inverse of ${M}$ (if it exists) in terms of the inverse of ${A}$, and the other components ${B,C,D}$ of course. Indeed, to solve the equation

$\displaystyle M \begin{pmatrix} x & y \end{pmatrix} = \begin{pmatrix} a & b \end{pmatrix},$

where ${x, a}$ are ${(n-k) \times 1}$ column vectors and ${y,b}$ are ${k \times 1}$ column vectors, we can expand this out as a system

$\displaystyle Ax + By = a$

$\displaystyle Cx + Dy = b.$

Using the invertibility of ${A}$, we can write the first equation as

$\displaystyle x = A^{-1} a - A^{-1} B y \ \ \ \ \ (1)$

and substituting this into the second equation yields

$\displaystyle (D - C A^{-1} B) y = b - C A^{-1} a$

and thus (assuming that ${D - CA^{-1} B}$ is invertible)

$\displaystyle y = - (D - CA^{-1} B)^{-1} CA^{-1} a + (D - CA^{-1} B)^{-1} b$

and then inserting this back into (1) gives

$\displaystyle x = (A^{-1} + A^{-1} B (D - CA^{-1} B)^{-1} C A^{-1}) a - A^{-1} B (D - CA^{-1} B)^{-1} b.$

Comparing this with

$\displaystyle \begin{pmatrix} x & y \end{pmatrix} = M^{-1} \begin{pmatrix} a & b \end{pmatrix},$

we have managed to express the inverse of ${M}$ as

$\displaystyle M^{-1} =$

$\displaystyle \begin{pmatrix} A^{-1} + A^{-1} B (D - CA^{-1} B)^{-1} C A^{-1} & - A^{-1} B (D - CA^{-1} B)^{-1} \\ - (D - CA^{-1} B)^{-1} CA^{-1} & (D - CA^{-1} B)^{-1} \end{pmatrix}. \ \ \ \ \ (2)$

One can consider the inverse problem: given the inverse ${M^{-1}}$ of ${M}$, does one have a nice formula for the inverse ${A^{-1}}$ of the minor ${A}$? Trying to recover this directly from (2) looks somewhat messy. However, one can proceed as follows. Let ${U}$ denote the ${n \times k}$ matrix

$\displaystyle U := \begin{pmatrix} 0 \\ I_k \end{pmatrix}$

(with ${I_k}$ the ${k \times k}$ identity matrix), and let ${V}$ be its transpose:

$\displaystyle V := \begin{pmatrix} 0 & I_k \end{pmatrix}.$

Then for any scalar ${t}$ (which we identify with ${t}$ times the identity matrix), one has

$\displaystyle M + UtV = \begin{pmatrix} A & B \\ C & D+t \end{pmatrix},$

and hence by (2)

$\displaystyle (M+UtV)^{-1} =$

$\displaystyle \begin{pmatrix} A^{-1} + A^{-1} B (D + t - CA^{-1} B)^{-1} C A^{-1} & - A^{-1} B (D + t- CA^{-1} B)^{-1} \\ - (D + t - CA^{-1} B)^{-1} CA^{-1} & (D + t - CA^{-1} B)^{-1} \end{pmatrix}.$

noting that the inverses here will exist for ${t}$ large enough. Taking limits as ${t \rightarrow \infty}$, we conclude that

$\displaystyle \lim_{t \rightarrow \infty} (M+UtV)^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & 0 \end{pmatrix}.$

On the other hand, by the Woodbury matrix identity (discussed in this previous blog post), we have

$\displaystyle (M+UtV)^{-1} = M^{-1} - M^{-1} U (t^{-1} + V M^{-1} U)^{-1} V M^{-1}$

and hence on taking limits and comparing with the preceding identity, one has

$\displaystyle \begin{pmatrix} A^{-1} & 0 \\ 0 & 0 \end{pmatrix} = M^{-1} - M^{-1} U (V M^{-1} U)^{-1} V M^{-1}.$

This achieves the aim of expressing the inverse ${A^{-1}}$ of the minor in terms of the inverse of the full matrix. Taking traces and rearranging, we conclude in particular that

$\displaystyle \mathrm{tr} A^{-1} = \mathrm{tr} M^{-1} - \mathrm{tr} (V M^{-2} U) (V M^{-1} U)^{-1}. \ \ \ \ \ (3)$

In the ${k=1}$ case, this can be simplified to

$\displaystyle \mathrm{tr} A^{-1} = \mathrm{tr} M^{-1} - \frac{e_n^T M^{-2} e_n}{e_n^T M^{-1} e_n} \ \ \ \ \ (4)$

where ${e_n}$ is the ${n^{th}}$ basis column vector.

We can apply this identity to understand how the spectrum of an ${n \times n}$ random matrix ${M}$ relates to that of its top left ${n-1 \times n-1}$ minor ${A}$. Subtracting any complex multiple ${z}$ of the identity from ${M}$ (and hence from ${A}$), we can relate the Stieltjes transform ${s_M(z) := \frac{1}{n} \mathrm{tr}(M-z)^{-1}}$ of ${M}$ with the Stieltjes transform ${s_A(z) := \frac{1}{n-1} \mathrm{tr}(A-z)^{-1}}$ of ${A}$:

$\displaystyle s_A(z) = \frac{n}{n-1} s_M(z) - \frac{1}{n-1} \frac{e_n^T (M-z)^{-2} e_n}{e_n^T (M-z)^{-1} e_n} \ \ \ \ \ (5)$

At this point we begin to proceed informally. Assume for sake of argument that the random matrix ${M}$ is Hermitian, with distribution that is invariant under conjugation by the unitary group ${U(n)}$; for instance, ${M}$ could be drawn from the Gaussian Unitary Ensemble (GUE), or alternatively ${M}$ could be of the form ${M = U D U^*}$ for some real diagonal matrix ${D}$ and ${U}$ a unitary matrix drawn randomly from ${U(n)}$ using Haar measure. To fix normalisations we will assume that the eigenvalues of ${M}$ are typically of size ${O(1)}$. Then ${A}$ is also Hermitian and ${U(n)}$-invariant. Furthermore, the law of ${e_n^T (M-z)^{-1} e_n}$ will be the same as the law of ${u^* (M-z)^{-1} u}$, where ${u}$ is now drawn uniformly from the unit sphere (independently of ${M}$). Diagonalising ${M}$ into eigenvalues ${\lambda_j}$ and eigenvectors ${v_j}$, we have

$\displaystyle u^* (M-z)^{-1} u = \sum_{j=1}^n \frac{|u^* v_j|^2}{\lambda_j - z}.$

One can think of ${u}$ as a random (complex) Gaussian vector, divided by the magnitude of that vector (which, by the Chernoff inequality, will concentrate to ${\sqrt{n}}$). Thus the coefficients ${u^* v_j}$ with respect to the orthonormal basis ${v_1,\dots,v_j}$ can be thought of as independent (complex) Gaussian vectors, divided by that magnitude. Using this and the Chernoff inequality again, we see (for ${z}$ distance ${\sim 1}$ away from the real axis at least) that one has the concentration of measure

$\displaystyle u^* (M-z)^{-1} u \approx \frac{1}{n} \sum_{j=1}^n \frac{1}{\lambda_j - z}$

and thus

$\displaystyle e_n^T (M-z)^{-1} e_n \approx \frac{1}{n} \mathrm{tr} (M-z)^{-1} = s_M(z)$

(that is to say, the diagonal entries of ${(M-z)^{-1}}$ are roughly constant). Similarly we have

$\displaystyle e_n^T (M-z)^{-2} e_n \approx \frac{1}{n} \mathrm{tr} (M-z)^{-2} = \frac{d}{dz} s_M(z).$

Inserting this into (5) and discarding terms of size ${O(1/n^2)}$, we thus conclude the approximate relationship

$\displaystyle s_A(z) \approx s_M(z) + \frac{1}{n} ( s_M(z) - s_M(z)^{-1} \frac{d}{dz} s_M(z) ).$

This can be viewed as a difference equation for the Stieltjes transform of top left minors of ${M}$. Iterating this equation, and formally replacing the difference equation by a differential equation in the large ${n}$ limit, we see that when ${n}$ is large and ${k \approx e^{-t} n}$ for some ${t \geq 0}$, one expects the top left ${k \times k}$ minor ${A_k}$ of ${M}$ to have Stieltjes transform

$\displaystyle s_{A_k}(z) \approx s( t, z ) \ \ \ \ \ (6)$

where ${s(t,z)}$ solves the Burgers-type equation

$\displaystyle \partial_t s(t,z) = s(t,z) - s(t,z)^{-1} \frac{d}{dz} s(t,z) \ \ \ \ \ (7)$

with initial data ${s(0,z) = s_M(z)}$.

Example 1 If ${M}$ is a constant multiple ${M = cI_n}$ of the identity, then ${s_M(z) = \frac{1}{c-z}}$. One checks that ${s(t,z) = \frac{1}{c-z}}$ is a steady state solution to (7), which is unsurprising given that all minors of ${M}$ are also ${c}$ times the identity.

Example 2 If ${M}$ is GUE normalised so that each entry has variance ${\sigma^2/n}$, then by the semi-circular law (see previous notes) one has ${s_M(z) \approx \frac{-z + \sqrt{z^2-4\sigma^2}}{2\sigma^2} = -\frac{2}{z + \sqrt{z^2-4\sigma^2}}}$ (using an appropriate branch of the square root). One can then verify the self-similar solution

$\displaystyle s(t,z) = \frac{-z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}{2\sigma^2 e^{-t}} = -\frac{2}{z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}$

to (7), which is consistent with the fact that a top ${k \times k}$ minor of ${M}$ also has the law of GUE, with each entry having variance ${\sigma^2 / n \approx \sigma^2 e^{-t} / k}$ when ${k \approx e^{-t} n}$.

One can justify the approximation (6) given a sufficiently good well-posedness theory for the equation (7). We will not do so here, but will note that (as with the classical inviscid Burgers equation) the equation can be solved exactly (formally, at least) by the method of characteristics. For any initial position ${z_0}$, we consider the characteristic flow ${t \mapsto z(t)}$ formed by solving the ODE

$\displaystyle \frac{d}{dt} z(t) = s(t,z(t))^{-1} \ \ \ \ \ (8)$

with initial data ${z(0) = z_0}$, ignoring for this discussion the problems of existence and uniqueness. Then from the chain rule, the equation (7) implies that

$\displaystyle \frac{d}{dt} s( t, z(t) ) = s(t,z(t))$

and thus ${s(t,z(t)) = e^t s(0,z_0)}$. Inserting this back into (8) we see that

$\displaystyle z(t) = z_0 + s(0,z_0)^{-1} (1-e^{-t})$

and thus (7) may be solved implicitly via the equation

$\displaystyle s(t, z_0 + s(0,z_0)^{-1} (1-e^{-t}) ) = e^t s(0, z_0) \ \ \ \ \ (9)$

for all ${t}$ and ${z_0}$.

Remark 3 In practice, the equation (9) may stop working when ${z_0 + s(0,z_0)^{-1} (1-e^{-t})}$ crosses the real axis, as (7) does not necessarily hold in this region. It is a cute exercise (ultimately coming from the Cauchy-Schwarz inequality) to show that this crossing always happens, for instance if ${z_0}$ has positive imaginary part then ${z_0 + s(0,z_0)^{-1}}$ necessarily has negative or zero imaginary part.

Example 4 Suppose we have ${s(0,z) = \frac{1}{c-z}}$ as in Example 1. Then (9) becomes

$\displaystyle s( t, z_0 + (c-z_0) (1-e^{-t}) ) = \frac{e^t}{c-z_0}$

for any ${t,z_0}$, which after making the change of variables ${z = z_0 + (c-z_0) (1-e^{-t}) = c - e^{-t} (c - z_0)}$ becomes

$\displaystyle s(t, z ) = \frac{1}{c-z}$

as in Example 1.

Example 5 Suppose we have

$\displaystyle s(0,z) = \frac{-z + \sqrt{z^2-4\sigma^2}}{2\sigma^2} = -\frac{2}{z + \sqrt{z^2-4\sigma^2}}.$

as in Example 2. Then (9) becomes

$\displaystyle s(t, z_0 - \frac{z_0 + \sqrt{z_0^2-4\sigma^2}}{2} (1-e^{-t}) ) = e^t \frac{-z_0 + \sqrt{z_0^2-4\sigma^2}}{2\sigma^2}.$

If we write

$\displaystyle z := z_0 - \frac{z_0 + \sqrt{z_0^2-4\sigma^2}}{2} (1-e^{-t})$

$\displaystyle = \frac{(1+e^{-t}) z_0 - (1-e^{-t}) \sqrt{z_0^2-4\sigma^2}}{2}$

one can calculate that

$\displaystyle z^2 - 4 \sigma^2 e^{-t} = (\frac{(1-e^{-t}) z_0 - (1+e^{-t}) \sqrt{z_0^2-4\sigma^2}}{2})^2$

and hence

$\displaystyle \frac{-z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}{2\sigma^2 e^{-t}} = e^t \frac{-z_0 + \sqrt{z_0^2-4\sigma^2}}{2\sigma^2}$

which gives

$\displaystyle s(t,z) = \frac{-z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}{2\sigma^2 e^{-t}}. \ \ \ \ \ (10)$

One can recover the spectral measure ${\mu}$ from the Stieltjes transform ${s(z)}$ as the weak limit of ${x \mapsto \frac{1}{\pi} \mathrm{Im} s(x+i\varepsilon)}$ as ${\varepsilon \rightarrow 0}$; we write this informally as

$\displaystyle d\mu(x) = \frac{1}{\pi} \mathrm{Im} s(x+i0^+)\ dx.$

In this informal notation, we have for instance that

$\displaystyle \delta_c(x) = \frac{1}{\pi} \mathrm{Im} \frac{1}{c-x-i0^+}\ dx$

which can be interpreted as the fact that the Cauchy distributions ${\frac{1}{\pi} \frac{\varepsilon}{(c-x)^2+\varepsilon^2}}$ converge weakly to the Dirac mass at ${c}$ as ${\varepsilon \rightarrow 0}$. Similarly, the spectral measure associated to (10) is the semicircular measure ${\frac{1}{2\pi \sigma^2 e^{-t}} (4 \sigma^2 e^{-t}-x^2)_+^{1/2}}$.

If we let ${\mu_t}$ be the spectral measure associated to ${s(t,\cdot)}$, then the curve ${e^{-t} \mapsto \mu_t}$ from ${(0,1]}$ to the space of measures is the high-dimensional limit ${n \rightarrow \infty}$ of a Gelfand-Tsetlin pattern (discussed in this previous post), if the pattern is randomly generated amongst all matrices ${M}$ with spectrum asymptotic to ${\mu_0}$ as ${n \rightarrow \infty}$. For instance, if ${\mu_0 = \delta_c}$, then the curve is ${\alpha \mapsto \delta_c}$, corresponding to a pattern that is entirely filled with ${c}$‘s. If instead ${\mu_0 = \frac{1}{2\pi \sigma^2} (4\sigma^2-x^2)_+^{1/2}}$ is a semicircular distribution, then the pattern is

$\displaystyle \alpha \mapsto \frac{1}{2\pi \sigma^2 \alpha} (4\sigma^2 \alpha -x^2)_+^{1/2},$

thus at height ${\alpha}$ from the top, the pattern is semicircular on the interval ${[-2\sigma \sqrt{\alpha}, 2\sigma \sqrt{\alpha}]}$. The interlacing property of Gelfand-Tsetlin patterns translates to the claim that ${\alpha \mu_\alpha(-\infty,\lambda)}$ (resp. ${\alpha \mu_\alpha(\lambda,\infty)}$) is non-decreasing (resp. non-increasing) in ${\alpha}$ for any fixed ${\lambda}$. In principle one should be able to establish these monotonicity claims directly from the PDE (7) or from the implicit solution (9), but it was not clear to me how to do so.

An interesting example of such a limiting Gelfand-Tsetlin pattern occurs when ${\mu_0 = \frac{1}{2} \delta_{-1} + \frac{1}{2} \delta_1}$, which corresponds to ${M}$ being ${2P-I}$, where ${P}$ is an orthogonal projection to a random ${n/2}$-dimensional subspace of ${{\bf C}^n}$. Here we have

$\displaystyle s(0,z) = \frac{1}{2} \frac{1}{-1-z} + \frac{1}{2} \frac{1}{1-z} = \frac{z}{1-z^2}$

and so (9) in this case becomes

$\displaystyle s(t, z_0 + \frac{1-z_0^2}{z_0} (1-e^{-t}) ) = \frac{e^t z_0}{1-z_0^2}$

A tedious calculation then gives the solution

$\displaystyle s(t,z) = \frac{(2e^{-t}-1)z + \sqrt{z^2 - 4e^{-t}(1-e^{-t})}}{2e^{-t}(1-z^2)}. \ \ \ \ \ (11)$

For ${\alpha = e^{-t} > 1/2}$, there are simple poles at ${z=-1,+1}$, and the associated measure is

$\displaystyle \mu_\alpha = \frac{2\alpha-1}{2\alpha} \delta_{-1} + \frac{2\alpha-1}{2\alpha} \delta_1 + \frac{1}{2\pi \alpha(1-x^2)} (4\alpha(1-\alpha)-x^2)_+^{1/2}\ dx.$

This reflects the interlacing property, which forces ${\frac{2\alpha-1}{2\alpha} \alpha n}$ of the ${\alpha n}$ eigenvalues of the ${\alpha n \times \alpha n}$ minor to be equal to ${-1}$ (resp. ${+1}$). For ${\alpha = e^{-t} \leq 1/2}$, the poles disappear and one just has

$\displaystyle \mu_\alpha = \frac{1}{2\pi \alpha(1-x^2)} (4\alpha(1-\alpha)-x^2)_+^{1/2}\ dx.$

For ${\alpha=1/2}$, one has an inverse semicircle distribution

$\displaystyle \mu_{1/2} = \frac{1}{\pi} (1-x^2)_+^{-1/2}.$

There is presumably a direct geometric explanation of this fact (basically describing the singular values of the product of two random orthogonal projections to half-dimensional subspaces of ${{\bf C}^n}$), but I do not know of one off-hand.

The evolution of ${s(t,z)}$ can also be understood using the ${R}$-transform and ${S}$-transform from free probability. Formally, letlet ${z(t,s)}$ be the inverse of ${s(t,z)}$, thus

$\displaystyle s(t,z(t,s)) = s$

for all ${t,s}$, and then define the ${R}$-transform

$\displaystyle R(t,s) := z(t,-s) - \frac{1}{s}.$

The equation (9) may be rewritten as

$\displaystyle z( t, e^t s ) = z(0,s) + s^{-1} (1-e^{-t})$

and hence

$\displaystyle R(t, -e^t s) = R(0, -s)$

or equivalently

$\displaystyle R(t,s) = R(0, e^{-t} s). \ \ \ \ \ (12)$

See these previous notes for a discussion of free probability topics such as the ${R}$-transform.

Example 6 If ${s(t,z) = \frac{1}{c-z}}$ then the ${R}$ transform is ${R(t,s) = c}$.

Example 7 If ${s(t,z)}$ is given by (10), then the ${R}$ transform is

$\displaystyle R(t,s) = \sigma^2 e^{-t} s.$

Example 8 If ${s(t,z)}$ is given by (11), then the ${R}$ transform is

$\displaystyle R(t,s) = \frac{-1 + \sqrt{1 + 4 s^2 e^{-2t}}}{2 s e^{-t}}.$

This simple relationship (12) is essentially due to Nica and Speicher (thanks to Dima Shylakhtenko for this reference). It has the remarkable consequence that when ${\alpha = 1/m}$ is the reciprocal of a natural number ${m}$, then ${\mu_{1/m}}$ is the free arithmetic mean of ${m}$ copies of ${\mu}$, that is to say ${\mu_{1/m}}$ is the free convolution ${\mu \boxplus \dots \boxplus \mu}$ of ${m}$ copies of ${\mu}$, pushed forward by the map ${\lambda \rightarrow \lambda/m}$. In terms of random matrices, this is asserting that the top ${n/m \times n/m}$ minor of a random matrix ${M}$ has spectral measure approximately equal to that of an arithmetic mean ${\frac{1}{m} (M_1 + \dots + M_m)}$ of ${m}$ independent copies of ${M}$, so that the process of taking top left minors is in some sense a continuous analogue of the process of taking freely independent arithmetic means. There ought to be a geometric proof of this assertion, but I do not know of one. In the limit ${m \rightarrow \infty}$ (or ${\alpha \rightarrow 0}$), the ${R}$-transform becomes linear and the spectral measure becomes semicircular, which is of course consistent with the free central limit theorem.

In a similar vein, if one defines the function

$\displaystyle \omega(t,z) := \alpha \int_{\bf R} \frac{zx}{1-zx}\ d\mu_\alpha(x) = e^{-t} (- 1 - z^{-1} s(t, z^{-1}))$

and inverts it to obtain a function ${z(t,\omega)}$ with

$\displaystyle \omega(t, z(t,\omega)) = \omega$

for all ${t, \omega}$, then the ${S}$-transform ${S(t,\omega)}$ is defined by

$\displaystyle S(t,\omega) := \frac{1+\omega}{\omega} z(t,\omega).$

Writing

$\displaystyle s(t,z) = - z^{-1} ( 1 + e^t \omega(t, z^{-1}) )$

for any ${t}$, ${z}$, we have

$\displaystyle z_0 + s(0,z_0)^{-1} (1-e^{-t}) = z_0 \frac{\omega(0,z_0^{-1})+e^{-t}}{\omega(0,z_0^{-1})+1}$

and so (9) becomes

$\displaystyle - z_0^{-1} \frac{\omega(0,z_0^{-1})+1}{\omega(0,z_0^{-1})+e^{-t}} (1 + e^{t} \omega(t, z_0^{-1} \frac{\omega(0,z_0^{-1})+1}{\omega(0,z_0^{-1})+e^{-t}}))$

$\displaystyle = - e^t z_0^{-1} (1 + \omega(0, z_0^{-1}))$

which simplifies to

$\displaystyle \omega(t, z_0^{-1} \frac{\omega(0,z_0^{-1})+1}{\omega(0,z_0^{-1})+e^{-t}})) = \omega(0, z_0^{-1});$

replacing ${z_0}$ by ${z(0,\omega)^{-1}}$ we obtain

$\displaystyle \omega(t, z(0,\omega) \frac{\omega+1}{\omega+e^{-t}}) = \omega$

and thus

$\displaystyle z(0,\omega)\frac{\omega+1}{\omega+e^{-t}} = z(t, \omega)$

and hence

$\displaystyle S(0, \omega) = \frac{\omega+e^{-t}}{\omega+1} S(t, \omega).$

One can compute ${\frac{\omega+e^{-t}}{\omega+1}}$ to be the ${S}$-transform of the measure ${(1-\alpha) \delta_0 + \alpha \delta_1}$; from the link between ${S}$-transforms and free products (see e.g. these notes of Guionnet), we conclude that ${(1-\alpha)\delta_0 + \alpha \mu_\alpha}$ is the free product of ${\mu_1}$ and ${(1-\alpha) \delta_0 + \alpha \delta_1}$. This is consistent with the random matrix theory interpretation, since ${(1-\alpha)\delta_0 + \alpha \mu_\alpha}$ is also the spectral measure of ${PMP}$, where ${P}$ is the orthogonal projection to the span of the first ${\alpha n}$ basis elements, so in particular ${P}$ has spectral measure ${(1-\alpha) \delta_0 + \alpha \delta_1}$. If ${M}$ is unitarily invariant then (by a fundamental result of Voiculescu) it is asymptotically freely independent of ${P}$, so the spectral measure of ${PMP = P^{1/2} M P^{1/2}}$ is asymptotically the free product of that of ${M}$ and of ${P}$.

Szemerédi’s theorem asserts that all subsets of the natural numbers of positive density contain arbitrarily long arithmetic progressions.  Roth’s theorem is the special case when one considers arithmetic progressions of length three.  Both theorems have many important proofs using tools from additive combinatorics, (higher order) Fourier analysis, (hyper) graph regularity theory, and ergodic theory.  However, the original proof by Endre Szemerédi, while extremely intricate, was purely combinatorial (and in particular “elementary”) and almost entirely self-contained, except for an invocation of the van der Waerden theorem.  It is also notable for introducing a prototype of what is now known as the Szemerédi regularity lemma.

Back in 2005, I rewrote Szemerédi’s original proof in order to understand it better, however my rewrite ended up being about the same length as the original argument and was probably only usable to myself.  In 2012, after Szemerédi was awarded the Abel prize, I revisited this argument with the intention to try to write up a more readable version of the proof, but ended up just presenting some ingredients of the argument in a blog post, rather than try to rewrite the whole thing.  In that post, I suspected that the cleanest way to write up the argument would be through the language of nonstandard analysis (perhaps in an iterated hyperextension that could handle various hierarchies of infinitesimals), but was unable to actually achieve any substantial simplifications by passing to the nonstandard world.

A few weeks ago, I participated in a week-long workshop at the American Institute of Mathematics on “Nonstandard methods in combinatorial number theory”, and spent some time in a working group with Shabnam Akhtari, Irfam Alam, Renling Jin, Steven Leth, Karl Mahlburg, Paul Potgieter, and Henry Towsner to try to obtain a manageable nonstandard version of Szemerédi’s original proof.  We didn’t end up being able to do so – in fact there are now signs that perhaps nonstandard analysis is not the optimal framework in which to place this argument – but we did at least clarify the existing standard argument, to the point that I was able to go back to my original rewrite of the proof and present it in a more civilised form, which I am now uploading here as an unpublished preprint.   There are now a number of simplifications to the proof.  Firstly, one no longer needs the full strength of the regularity lemma; only the simpler “weak” regularity lemma of Frieze and Kannan is required.  Secondly, the proof has been “factored” into a number of stand-alone propositions of independent interest, in particular involving just (families of) one-dimensional arithmetic progressions rather than the complicated-looking multidimensional arithmetic progressions that occur so frequently in the original argument of Szemerédi.  Finally, the delicate manipulations of densities and epsilons via double counting arguments in Szemerédi’s original paper have been abstracted into a certain key property of families of arithmetic progressions that I call the “double counting property”.

The factoring mentioned above is particularly simple in the case of proving Roth’s theorem, which is now presented separately in the above writeup.  Roth’s theorem seeks to locate a length three progression ${(P(1),P(2),P(3)) = (a, a+r, a+2r)}$ in which all three elements lie in a single set.  This will be deduced from an easier variant of the theorem in which one locates (a family of) length three progressions in which just the first two elements ${P(1), P(2)}$ of the progression lie in a good set (and some other properties of the family are also required).  This is in turn derived from an even easier variant in which now just the first element of the progression is required to be in the good set.

More specifically, Roth’s theorem is now deduced from

Theorem 1.5.  Let ${L}$ be a natural number, and let ${S}$ be a set of integers of upper density at least ${1-1/10L}$.  Then, whenever ${S}$ is partitioned into finitely many colour classes, there exists a colour class ${A}$ and a family ${(P_l(1),P_l(2),P_l(3))_{l=1}^L}$ of 3-term arithmetic progressions with the following properties:

1. For each ${l}$, ${P_l(1)}$ and ${P_l(2)}$ lie in ${A}$.
2. For each ${l}$, ${P_l(3)}$ lie in ${S}$.
3. The ${P_l(3)}$ for ${l=1,\dots,L}$ are in arithmetic progression.

The situation in this theorem is depicted by the following diagram, in which elements of $A$ are in blue and elements of $S$ are in grey:

Theorem 1.5 is deduced in turn from the following easier variant:

Theorem 1.6.  Let ${L}$ be a natural number, and let ${S}$ be a set of integers of upper density at least ${1-1/10L}$.  Then, whenever ${S}$ is partitioned into finitely many colour classes, there exists a colour class ${A}$ and a family ${(P_l(1),P_l(2),P_l(3))_{l=1}^L}$ of 3-term arithmetic progressions with the following properties:

1. For each ${l}$, ${P_l(1)}$ lie in ${A}$.
2. For each ${l}$, ${P_l(2)}$ and ${P_l(3)}$ lie in ${S}$.
3. The ${P_l(2)}$ for ${l=1,\dots,L}$ are in arithmetic progression.

The situation here is described by the figure below.

Theorem 1.6 is easy to prove.  To derive Theorem 1.5 from Theorem 1.6, or to derive Roth’s theorem from Theorem 1.5, one uses double counting arguments, van der Waerden’s theorem, and the weak regularity lemma, largely as described in this previous blog post; see the writeup for the full details.  (I would be interested in seeing a shorter proof of Theorem 1.5 though that did not go through these arguments, and did not use the more powerful theorems of  Roth or Szemerédi.)

Fix a non-negative integer ${k}$. Define an (weak) integer partition of length ${k}$ to be a tuple ${\lambda = (\lambda_1,\dots,\lambda_k)}$ of non-increasing non-negative integers ${\lambda_1 \geq \dots \geq \lambda_k \geq 0}$. (Here our partitions are “weak” in the sense that we allow some parts of the partition to be zero. Henceforth we will omit the modifier “weak”, as we will not need to consider the more usual notion of “strong” partitions.) To each such partition ${\lambda}$, one can associate a Young diagram consisting of ${k}$ left-justified rows of boxes, with the ${i^{th}}$ row containing ${\lambda_i}$ boxes. A semi-standard Young tableau (or Young tableau for short) ${T}$ of shape ${\lambda}$ is a filling of these boxes by integers in ${\{1,\dots,k\}}$ that is weakly increasing along rows (moving rightwards) and strictly increasing along columns (moving downwards). The collection of such tableaux will be denoted ${{\mathcal T}_\lambda}$. The weight ${|T|}$ of a tableau ${T}$ is the tuple ${(n_1,\dots,n_k)}$, where ${n_i}$ is the number of occurrences of the integer ${i}$ in the tableau. For instance, if ${k=3}$ and ${\lambda = (6,4,2)}$, an example of a Young tableau of shape ${\lambda}$ would be

$\displaystyle \begin{tabular}{|c|c|c|c|c|c|} \hline 1 & 1 & 1 & 2 & 3 & 3 \\ \cline{1-6} 2 & 2 & 2 &3\\ \cline{1-4} 3 & 3\\ \cline{1-2} \end{tabular}$

The weight here would be ${|T| = (3,4,5)}$.

To each partition ${\lambda}$ one can associate the Schur polynomial ${s_\lambda(u_1,\dots,u_k)}$ on ${k}$ variables ${u = (u_1,\dots,u_k)}$, which we will define as

$\displaystyle s_\lambda(u) := \sum_{T \in {\mathcal T}_\lambda} u^{|T|}$

using the multinomial convention

$\displaystyle (u_1,\dots,u_k)^{(n_1,\dots,n_k)} := u_1^{n_1} \dots u_k^{n_k}.$

Thus for instance the Young tableau ${T}$ given above would contribute a term ${u_1^3 u_2^4 u_3^5}$ to the Schur polynomial ${s_{(6,4,2)}(u_1,u_2,u_3)}$. In the case of partitions of the form ${(n,0,\dots,0)}$, the Schur polynomial ${s_{(n,0,\dots,0)}}$ is just the complete homogeneous symmetric polynomial ${h_n}$ of degree ${n}$ on ${k}$ variables:

$\displaystyle s_{(n,0,\dots,0)}(u_1,\dots,u_k) := \sum_{n_1,\dots,n_k \geq 0: n_1+\dots+n_k = n} u_1^{n_1} \dots u_k^{n_k},$

thus for instance

$\displaystyle s_{(3,0)}(u_1,u_2) = u_1^3 + u_1^2 u_2 + u_1 u_2^2 + u_2^3.$

Schur polyomials are ubiquitous in the algebraic combinatorics of “type ${A}$ objects” such as the symmetric group ${S_k}$, the general linear group ${GL_k}$, or the unitary group ${U_k}$. For instance, one can view ${s_\lambda}$ as the character of an irreducible polynomial representation of ${GL_k({\bf C})}$ associated with the partition ${\lambda}$. However, we will not focus on these interpretations of Schur polynomials in this post.

This definition of Schur polynomials allows for a way to describe the polynomials recursively. If ${k > 1}$ and ${T}$ is a Young tableau of shape ${\lambda = (\lambda_1,\dots,\lambda_k)}$, taking values in ${\{1,\dots,k\}}$, one can form a sub-tableau ${T'}$ of some shape ${\lambda' = (\lambda'_1,\dots,\lambda'_{k-1})}$ by removing all the appearances of ${k}$ (which, among other things, necessarily deletes the ${k^{th}}$ row). For instance, with ${T}$ as in the previous example, the sub-tableau ${T'}$ would be

$\displaystyle \begin{tabular}{|c|c|c|c|} \hline 1 & 1 & 1 & 2 \\ \cline{1-4} 2 & 2 & 2 \\ \cline{1-3} \end{tabular}$

and the reduced partition ${\lambda'}$ in this case is ${(4,3)}$. As Young tableaux are required to be strictly increasing down columns, we can see that the reduced partition ${\lambda'}$ must intersperse the original partition ${\lambda}$ in the sense that

$\displaystyle \lambda_{i+1} \leq \lambda'_i \leq \lambda_i \ \ \ \ \ (1)$

for all ${1 \leq i \leq k-1}$; we denote this interspersion relation as ${\lambda' \prec \lambda}$ (though we caution that this is not intended to be a partial ordering). In the converse direction, if ${\lambda' \prec \lambda}$ and ${T'}$ is a Young tableau with shape ${\lambda'}$ with entries in ${\{1,\dots,k-1\}}$, one can form a Young tableau ${T}$ with shape ${\lambda}$ and entries in ${\{1,\dots,k\}}$ by appending to ${T'}$ an entry of ${k}$ in all the boxes that appear in the ${\lambda}$ shape but not the ${\lambda'}$ shape. This one-to-one correspondence leads to the recursion

$\displaystyle s_\lambda(u) = \sum_{\lambda' \prec \lambda} s_{\lambda'}(u') u_k^{|\lambda| - |\lambda'|} \ \ \ \ \ (2)$

where ${u = (u_1,\dots,u_k)}$, ${u' = (u_1,\dots,u_{k-1})}$, and the size ${|\lambda|}$ of a partition ${\lambda = (\lambda_1,\dots,\lambda_k)}$ is defined as ${|\lambda| := \lambda_1 + \dots + \lambda_k}$.

One can use this recursion (2) to prove some further standard identities for Schur polynomials, such as the determinant identity

$\displaystyle s_\lambda(u) V(u) = \det( u_i^{\lambda_j+k-j} )_{1 \leq i,j \leq k} \ \ \ \ \ (3)$

for ${u=(u_1,\dots,u_k)}$, where ${V(u)}$ denotes the Vandermonde determinant

$\displaystyle V(u) := \prod_{1 \leq i < j \leq k} (u_i - u_j), \ \ \ \ \ (4)$

or the Jacobi-Trudi identity

$\displaystyle s_\lambda(u) = \det( h_{\lambda_j - j + i}(u) )_{1 \leq i,j \leq k}, \ \ \ \ \ (5)$

with the convention that ${h_d(u) = 0}$ if ${d}$ is negative. Thus for instance

$\displaystyle s_{(1,1,0,\dots,0)}(u) = h_1^2(u) - h_0(u) h_2(u) = \sum_{1 \leq i < j \leq k} u_i u_j.$

We review the (standard) derivation of these identities via (2) below the fold. Among other things, these identities show that the Schur polynomials are symmetric, which is not immediately obvious from their definition.

One can also iterate (2) to write

$\displaystyle s_\lambda(u) = \sum_{() = \lambda^0 \prec \lambda^1 \prec \dots \prec \lambda^k = \lambda} \prod_{j=1}^k u_j^{|\lambda^j| - |\lambda^{j-1}|} \ \ \ \ \ (6)$

where the sum is over all tuples ${\lambda^1,\dots,\lambda^k}$, where each ${\lambda^j}$ is a partition of length ${j}$ that intersperses the next partition ${\lambda^{j+1}}$, with ${\lambda^k}$ set equal to ${\lambda}$. We will call such a tuple an integral Gelfand-Tsetlin pattern based at ${\lambda}$.

One can generalise (6) by introducing the skew Schur functions

$\displaystyle s_{\lambda/\mu}(u) := \sum_{\mu = \lambda^i \prec \dots \prec \lambda^k = \lambda} \prod_{j=i+1}^k u_j^{|\lambda^j| - |\lambda^{j-1}|} \ \ \ \ \ (7)$

for ${u = (u_{i+1},\dots,u_k)}$, whenever ${\lambda}$ is a partition of length ${k}$ and ${\mu}$ a partition of length ${i}$ for some ${0 \leq i \leq k}$, thus the Schur polynomial ${s_\lambda}$ is also the skew Schur polynomial ${s_{\lambda /()}}$ with ${i=0}$. (One could relabel the variables here to be something like ${(u_1,\dots,u_{k-i})}$ instead, but this labeling seems slightly more natural, particularly in view of identities such as (8) below.)

By construction, we have the decomposition

$\displaystyle s_{\lambda/\nu}(u_{i+1},\dots,u_k) = \sum_\mu s_{\mu/\nu}(u_{i+1},\dots,u_j) s_{\lambda/\mu}(u_{j+1},\dots,u_k) \ \ \ \ \ (8)$

whenever ${0 \leq i \leq j \leq k}$, and ${\nu, \mu, \lambda}$ are partitions of lengths ${i,j,k}$ respectively. This gives another recursive way to understand Schur polynomials and skew Schur polynomials. For instance, one can use it to establish the generalised Jacobi-Trudi identity

$\displaystyle s_{\lambda/\mu}(u) = \det( h_{\lambda_j - j - \mu_i + i}(u) )_{1 \leq i,j \leq k}, \ \ \ \ \ (9)$

with the convention that ${\mu_i = 0}$ for ${i}$ larger than the length of ${\mu}$; we do this below the fold.

The Schur polynomials (and skew Schur polynomials) are “discretised” (or “quantised”) in the sense that their parameters ${\lambda, \mu}$ are required to be integer-valued, and their definition similarly involves summation over a discrete set. It turns out that there are “continuous” (or “classical”) analogues of these functions, in which the parameters ${\lambda,\mu}$ now take real values rather than integers, and are defined via integration rather than summation. One can view these continuous analogues as a “semiclassical limit” of their discrete counterparts, in a manner that can be made precise using the machinery of geometric quantisation, but we will not do so here.

The continuous analogues can be defined as follows. Define a real partition of length ${k}$ to be a tuple ${\lambda = (\lambda_1,\dots,\lambda_k)}$ where ${\lambda_1 \geq \dots \geq \lambda_k \geq 0}$ are now real numbers. We can define the relation ${\lambda' \prec \lambda}$ of interspersion between a length ${k-1}$ real partition ${\lambda' = (\lambda'_1,\dots,\lambda'_{k-1})}$ and a length ${k}$ real partition ${\lambda = (\lambda_1,\dots,\lambda_{k})}$ precisely as before, by requiring that the inequalities (1) hold for all ${1 \leq i \leq k-1}$. We can then define the continuous Schur functions ${S_\lambda(x)}$ for ${x = (x_1,\dots,x_k) \in {\bf R}^k}$ recursively by defining

$\displaystyle S_{()}() = 1$

and

$\displaystyle S_\lambda(x) = \int_{\lambda' \prec \lambda} S_{\lambda'}(x') \exp( (|\lambda| - |\lambda'|) x_k ) \ \ \ \ \ (10)$

for ${k \geq 1}$ and ${\lambda}$ of length ${k}$, where ${x' := (x_1,\dots,x_{k-1})}$ and the integral is with respect to ${k-1}$-dimensional Lebesgue measure, and ${|\lambda| = \lambda_1 + \dots + \lambda_k}$ as before. Thus for instance

$\displaystyle S_{(\lambda_1)}(x_1) = \exp( \lambda_1 x_1 )$

and

$\displaystyle S_{(\lambda_1,\lambda_2)}(x_1,x_2) = \int_{\lambda_2}^{\lambda_1} \exp( \lambda'_1 x_1 + (\lambda_1+\lambda_2-\lambda'_1) x_2 )\ d\lambda'_1.$

More generally, we can define the continuous skew Schur functions ${S_{\lambda/\mu}(x)}$ for ${\lambda}$ of length ${k}$, ${\mu}$ of length ${j \leq k}$, and ${x = (x_{j+1},\dots,x_k) \in {\bf R}^{k-j}}$ recursively by defining

$\displaystyle S_{\mu/\mu}() = 1$

and

$\displaystyle S_{\lambda/\mu}(x) = \int_{\lambda' \prec \lambda} S_{\lambda'/\mu}(x') \exp( (|\lambda| - |\lambda'|) x_k )$

for ${k > j}$. Thus for instance

$\displaystyle S_{(\lambda_1,\lambda_2,\lambda_3)/(\mu_1,\mu_2)}(x_3) = 1_{\lambda_3 \leq \mu_2 \leq \lambda_2 \leq \mu_1 \leq \lambda_1} \exp( x_3 (\lambda_1+\lambda_2+\lambda_3 - \mu_1 - \mu_2 ))$

and

$\displaystyle S_{(\lambda_1,\lambda_2,\lambda_3)/(\mu_1)}(x_2, x_3) = \int_{\lambda_2 \leq \lambda'_2 \leq \lambda_2, \mu_1} \int_{\mu_1, \lambda_2 \leq \lambda'_1 \leq \lambda_1}$

$\displaystyle \exp( x_2 (\lambda'_1+\lambda'_2 - \mu_1) + x_3 (\lambda_1+\lambda_2+\lambda_3 - \lambda'_1 - \lambda'_2))\ d\lambda'_1 d\lambda'_2.$

By expanding out the recursion, one obtains the analogue

$\displaystyle S_\lambda(x) = \int_{\lambda^1 \prec \dots \prec \lambda^k = \lambda} \exp( \sum_{j=1}^k x_j (|\lambda^j| - |\lambda^{j-1}|))\ d\lambda^1 \dots d\lambda^{k-1},$

of (6), and more generally one has

$\displaystyle S_{\lambda/\mu}(x) = \int_{\mu = \lambda^i \prec \dots \prec \lambda^k = \lambda} \exp( \sum_{j=i+1}^k x_j (|\lambda^j| - |\lambda^{j-1}|))\ d\lambda^{i+1} \dots d\lambda^{k-1}.$

We will call the tuples ${(\lambda^1,\dots,\lambda^k)}$ in the first integral real Gelfand-Tsetlin patterns based at ${\lambda}$. The analogue of (8) is then

$\displaystyle S_{\lambda/\nu}(x_{i+1},\dots,x_k) = \int S_{\mu/\nu}(x_{i+1},\dots,x_j) S_{\lambda/\mu}(x_{j+1},\dots,x_k)\ d\mu$

where the integral is over all real partitions ${\mu}$ of length ${j}$, with Lebesgue measure.

By approximating various integrals by their Riemann sums, one can relate the continuous Schur functions to their discrete counterparts by the limiting formula

$\displaystyle N^{-k(k-1)/2} s_{\lfloor N \lambda \rfloor}( \exp[ x/N ] ) \rightarrow S_\lambda(x) \ \ \ \ \ (11)$

as ${N \rightarrow \infty}$ for any length ${k}$ real partition ${\lambda = (\lambda_1,\dots,\lambda_k)}$ and any ${x = (x_1,\dots,x_k) \in {\bf R}^k}$, where

$\displaystyle \lfloor N \lambda \rfloor := ( \lfloor N \lambda_1 \rfloor, \dots, \lfloor N \lambda_k \rfloor )$

and

$\displaystyle \exp[x/N] := (\exp(x_1/N), \dots, \exp(x_k/N)).$

More generally, one has

$\displaystyle N^{j(j-1)/2-k(k-1)/2} s_{\lfloor N \lambda \rfloor / \lfloor N \mu \rfloor}( \exp[ x/N ] ) \rightarrow S_{\lambda/\mu}(x)$

as ${N \rightarrow \infty}$ for any length ${k}$ real partition ${\lambda}$, any length ${j}$ real partition ${\mu}$ with ${0 \leq j \leq k}$, and any ${x = (x_{j+1},\dots,x_k) \in {\bf R}^{k-j}}$.

As a consequence of these limiting formulae, one expects all of the discrete identities above to have continuous counterparts. This is indeed the case; below the fold we shall prove the discrete and continuous identities in parallel. These are not new results by any means, but I was not able to locate a good place in the literature where they are explicitly written down, so I thought I would try to do so here (primarily for my own internal reference, but perhaps the calculations will be worthwhile to some others also).

The determinant ${\det_n(A)}$ of an ${n \times n}$ matrix (with coefficients in an arbitrary field) obey many useful identities, starting of course with the fundamental multiplicativity ${\det_n(AB) = \det_n(A) \det_n(B)}$ for ${n \times n}$ matrices ${A,B}$. This multiplicativity can in turn be used to establish many further identities; in particular, as shown in this previous post, it implies the Schur determinant identity

$\displaystyle \det_{n+k}\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det_n(A) \det_k( D - C A^{-1} B ) \ \ \ \ \ (1)$

whenever ${A}$ is an invertible ${n \times n}$ matrix, ${B}$ is an ${n \times k}$ matrix, ${C}$ is a ${k \times n}$ matrix, and ${D}$ is a ${k \times k}$ matrix. The matrix ${D - CA^{-1} B}$ is known as the Schur complement of the block ${A}$.

I only recently discovered that this identity in turn immediately implies what I always found to be a somewhat curious identity, namely the Dodgson condensation identity (also known as the Desnanot-Jacobi identity)

$\displaystyle \det_n(M) \det_{n-2}(M^{1,n}_{1,n}) = \det_{n-1}( M^1_1 ) \det_{n-1}(M^n_n)$

$\displaystyle - \det_{n-1}(M^1_n) \det_{n-1}(M^n_1)$

for any ${n \geq 3}$ and ${n \times n}$ matrix ${M}$, where ${M^i_j}$ denotes the ${n-1 \times n-1}$ matrix formed from ${M}$ by removing the ${i^{th}}$ row and ${j^{th}}$ column, and similarly ${M^{i,i'}_{j,j'}}$ denotes the ${n-2 \times n-2}$ matrix formed from ${M}$ by removing the ${i^{th}}$ and ${(i')^{th}}$ rows and ${j^{th}}$ and ${(j')^{th}}$ columns. Thus for instance when ${n=3}$ we obtain

$\displaystyle \det_3 \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \cdot e$

$\displaystyle = \det_2 \begin{pmatrix} e & f \\ h & i \end{pmatrix} \cdot \det_2 \begin{pmatrix} a & b \\ d & e \end{pmatrix}$

$\displaystyle - \det_2 \begin{pmatrix} b & c \\ e & f \end{pmatrix} \cdot \det_2 \begin{pmatrix} d & e \\ g & h \end{pmatrix}$

for any scalars ${a,b,c,d,e,f,g,h,i}$. (Charles Dodgson, better known by his pen name Lewis Caroll, is of course also known for writing “Alice in Wonderland” and “Through the Looking Glass“.)

The derivation is not new; it is for instance noted explicitly in this paper of Brualdi and Schneider, though I do not know if this is the earliest place in the literature where it can be found. (EDIT: Apoorva Khare has pointed out to me that the original arguments of Dodgson can be interpreted as implicitly following this derivation.) I thought it is worth presenting the short derivation here, though.

Firstly, by swapping the first and ${(n-1)^{th}}$ rows, and similarly for the columns, it is easy to see that the Dodgson condensation identity is equivalent to the variant

$\displaystyle \det_n(M) \det_{n-2}(M^{n-1,n}_{n-1,n}) = \det_{n-1}( M^{n-1}_{n-1} ) \det_{n-1}(M^n_n) \ \ \ \ \ (2)$

$\displaystyle - \det_{n-1}(M^{n-1}_n) \det_{n-1}(M^n_{n-1}).$

Now write

$\displaystyle M = \begin{pmatrix} A & B_1 & B_2 \\ C_1 & d_{11} & d_{12} \\ C_2 & d_{21} & d_{22} \end{pmatrix}$

where ${A}$ is an ${n-2 \times n-2}$ matrix, ${B_1, B_2}$ are ${n-2 \times 1}$ column vectors, ${C_1, C_2}$ are ${1 \times n-2}$ row vectors, and ${d_{11}, d_{12}, d_{21}, d_{22}}$ are scalars. If ${A}$ is invertible, we may apply the Schur determinant identity repeatedly to conclude that

$\displaystyle \det_n(M) = \det_{n-2}(A) \det_2 \begin{pmatrix} d_{11} - C_1 A^{-1} B_1 & d_{12} - C_1 A^{-1} B_2 \\ d_{21} - C_2 A^{-1} B_1 & d_{22} - C_2 A^{-1} B_2 \end{pmatrix}$

$\displaystyle \det_{n-2} (M^{n-1,n}_{n-1,n}) = \det_{n-2}(A)$

$\displaystyle \det_{n-1}( M^{n-1}_{n-1} ) = \det_{n-2}(A) (d_{22} - C_2 A^{-1} B_2 )$

$\displaystyle \det_{n-1}( M^{n-1}_{n} ) = \det_{n-2}(A) (d_{21} - C_2 A^{-1} B_1 )$

$\displaystyle \det_{n-1}( M^{n}_{n-1} ) = \det_{n-2}(A) (d_{12} - C_1 A^{-1} B_2 )$

$\displaystyle \det_{n-1}( M^{n}_{n} ) = \det_{n-2}(A) (d_{11} - C_1 A^{-1} B_1 )$

and the claim (2) then follows by a brief calculation (and the explicit form ${\det_2 \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad-bc}$ of the ${2 \times 2}$ determinant). To remove the requirement that ${A}$ be invertible, one can use a limiting argument, noting that one can work without loss of generality in an algebraically closed field, and in such a field, the set of invertible matrices is dense in the Zariski topology. (In the case when the scalars are reals or complexes, one can just use density in the ordinary topology instead if desired.)

The same argument gives the more general determinant identity of Sylvester

$\displaystyle \det_n(M) \det_{n-k}(M^S_S)^{k-1} = \det_k \left( \det_{n-k+1}(M^{S \backslash \{i\}}_{S \backslash \{j\}}) \right)_{i,j \in S}$

whenever ${n > k \geq 1}$, ${S}$ is a ${k}$-element subset of ${\{1,\dots,n\}}$, and ${M^S_{S'}}$ denotes the matrix formed from ${M}$ by removing the rows associated to ${S}$ and the columns associated to ${S'}$. (The Dodgson condensation identity is basically the ${k=2}$ case of this identity.)

A closely related proof of (2) proceeds by elementary row and column operations. Observe that if one adds some multiple of one of the first ${n-2}$ rows of ${M}$ to one of the last two rows of ${M}$, then the left and right sides of (2) do not change. If the minor ${A}$ is invertible, this allows one to reduce to the case where the components ${C_1,C_2}$ of the matrix vanish. Similarly, using elementary column operations instead of row operations we may assume that ${B_1,B_2}$ vanish. All matrices involved are now block-diagonal and the identity follows from a routine computation.

The latter approach can also prove the cute identity

$\displaystyle \det_2 \begin{pmatrix} \det_n( X_1, Y_1, A ) & \det_n( X_1, Y_2, A ) \\ \det_n(X_2, Y_1, A) & \det_n(X_2,Y_2, A) \end{pmatrix} = \det_n( X_1,X_2,A) \det_n(Y_1,Y_2,A)$

for any ${n \geq 2}$, any ${n \times 1}$ column vectors ${X_1,X_2,Y_1,Y_2}$, and any ${n \times n-2}$ matrix ${A}$, which can for instance be found in page 7 of this text of Karlin. Observe that both sides of this identity are unchanged if one adds some multiple of any column of ${A}$ to one of ${X_1,X_2,Y_1,Y_2}$; for generic ${A}$, this allows one to reduce ${X_1,X_2,Y_1,Y_2}$ to have only the first two entries allowed to be non-zero, at which point the determinants split into ${2 \times 2}$ and ${n -2 \times n-2}$ determinants and we can reduce to the ${n=2}$ case (eliminating the role of ${A}$). One can now either proceed by a direct computation, or by observing that the left-hand side is quartilinear in ${X_1,X_2,Y_1,Y_2}$ and antisymmetric in ${X_1,X_2}$ and ${Y_1,Y_2}$ which forces it to be a scalar multiple of ${\det_2(X_1,X_2) \det_2(Y_1,Y_2)}$, at which point one can test the identity at a single point (e.g. ${X_1=Y_1 = e_1}$ and ${X_2=Y_2=e_2}$ for the standard basis ${e_1,e_2}$) to conclude the argument. (One can also derive this identity from the Sylvester determinant identity but I think the calculations are a little messier if one goes by that route. Conversely, one can recover the Dodgson condensation identity from Karlin’s identity by setting ${X_1=e_1}$, ${X_2=e_2}$ (for instance) and then permuting some rows and columns.)

In one of the earliest posts on this blog, I talked about the ability to “arbitrage” a disparity of symmetry in an inequality, and in particular to “amplify” such an inequality into a stronger one. (The principle can apply to other mathematical statements than inequalities, with the “hypothesis” and “conclusion” of that statement generally playing the role of the “right-hand side” and “left-hand side” of an inequality, but for sake of discussion I will restrict attention here to inequalities.) One can formalise this principle as follows. Many inequalities in analysis can be expressed in the form

$\displaystyle A(f) \leq B(f) \ \ \ \ \ (1)$

for all ${f}$ in some space ${X}$ (in many cases ${X}$ will be a function space, and ${f}$ a function in that space), where ${A(f)}$ and ${B(f)}$ are some functionals of ${f}$ (that is to say, real-valued functions of ${f}$). For instance, ${B(f)}$ might be some function space norm of ${f}$ (e.g. an ${L^p}$ norm), and ${A(f)}$ might be some function space norm of some transform of ${f}$. In addition, we assume we have some group ${G}$ of symmetries ${T: X \rightarrow X}$ acting on the underlying space. For instance, if ${X}$ is a space of functions on some spatial domain, the group might consist of translations (e.g. ${Tf(x) = f(x-h)}$ for some shift ${h}$), or perhaps dilations with some normalisation (e.g. ${Tf(x) = \frac{1}{\lambda^\alpha} f(\frac{x}{\lambda})}$ for some dilation factor ${\lambda > 0}$ and some normalisation exponent ${\alpha \in {\bf R}}$, which can be thought of as the dimensionality of length one is assigning to ${f}$). If we have

$\displaystyle A(Tf) = A(f)$

for all symmetries ${T \in G}$ and all ${f \in X}$, we say that ${A}$ is invariant with respect to the symmetries in ${G}$; otherwise, it is not.

Suppose we know that the inequality (1) holds for all ${f \in X}$, but that there is an imbalance of symmetry: either ${A}$ is ${G}$-invariant and ${B}$ is not, or vice versa. Suppose first that ${A}$ is ${G}$-invariant and ${B}$ is not. Substituting ${f}$ by ${Tf}$ in (1) and taking infima, we can then amplify (1) to the stronger inequality

$\displaystyle A(f) \leq \inf_{T \in G} B(Tf).$

In particular, it is often the case that there is a way to send ${T}$ off to infinity in such a way that the functional ${B(Tf)}$ has a limit ${B_\infty(f)}$, in which case we obtain the amplification

$\displaystyle A(f) \leq B_\infty(f) \ \ \ \ \ (2)$

of (1). Note that these amplified inequalities will now be ${G}$-invariant on both sides (assuming that the way in which we take limits as ${T \rightarrow \infty}$ is itself ${G}$-invariant, which it often is in practice). Similarly, if ${B}$ is ${G}$-invariant but ${A}$ is not, we may instead amplify (1) to

$\displaystyle \sup_{T \in G} A(Tf) \leq B(f)$

and in particular (if ${A(Tf)}$ has a limit ${A_\infty(f)}$ as ${T \rightarrow \infty}$)

$\displaystyle A_\infty(f) \leq B(f). \ \ \ \ \ (3)$

If neither ${A(f)}$ nor ${B(f)}$ has a ${G}$-symmetry, one can still use the ${G}$-symmetry by replacing ${f}$ by ${Tf}$ and taking a limit to conclude that

$\displaystyle A_\infty(f) \leq B_\infty(f),$

though now this inequality is not obviously stronger than the original inequality (1) (for instance it could well be trivial). In some cases one can also average over ${G}$ instead of taking a limit as ${T \rightarrow \infty}$, thus averaging a non-invariant inequality into an invariant one.

As discussed in the previous post, this use of amplification gives rise to a general principle about inequalities: the most efficient inequalities are those in which the left-hand side and right-hand side enjoy the same symmetries. It is certainly possible to have true inequalities that have an imbalance of symmetry, but as shown above, such inequalities can always be amplified to more efficient and more symmetric inequalities. In the case when limits such as ${A_\infty}$ and ${B_\infty}$ exist, the limiting functionals ${A_\infty(f)}$ and ${B_\infty(f)}$ are often simpler in form, or more tractable analytically, than their non-limiting counterparts ${A(f)}$ and ${B(f)}$ (this is one of the main reasons why we take limits at infinity in the first place!), and so in many applications there is really no reason to use the weaker and more complicated inequality (1), when stronger, simpler, and more symmetric inequalities such as (2), (3) are available. Among other things, this explains why many of the most useful and natural inequalities one sees in analysis are dimensionally consistent.

One often tries to prove inequalities (1) by directly chaining together simpler inequalities. For instance, one might attempt to prove (1) by by first bounding ${A(f)}$ by some auxiliary quantity ${C(f)}$, and then bounding ${C(f)}$ by ${B(f)}$, thus obtaining (1) by chaining together two inequalities

$\displaystyle A(f) \leq C(f) \leq B(f). \ \ \ \ \ (4)$

A variant of the above principle then asserts that when proving inequalities by such direct methods, one should, whenever possible, try to maintain the symmetries that are present in both sides of the inequality. Why? Well, suppose that we ignored this principle and tried to prove (1) by establishing (4) for some ${C}$ that is not ${G}$-invariant. Assuming for sake of argument that (4) were actually true, we could amplify the first half ${A(f) \leq C(f)}$ of this inequality to conclude that

$\displaystyle A(f) \leq \inf_{T \in G} C(Tf)$

and also amplify the second half ${C(f) \leq B(f)}$ of the inequality to conclude that

$\displaystyle \sup_{T \in G} C(Tf) \leq B(f)$

and hence (4) amplifies to

$\displaystyle A(f) \leq \inf_{T \in G} C(Tf) \leq \sup_{T \in G} C(Tf) \leq B(f). \ \ \ \ \ (5)$

Let’s say for sake of argument that all the quantities involved here are positive numbers (which is often the case in analysis). Then we see in particular that

$\displaystyle \frac{\sup_{T \in G} C(Tf)}{\inf_{T \in G} C(Tf)} \leq \frac{B(f)}{A(f)}. \ \ \ \ \ (6)$

Informally, (6) asserts that in order for the strategy (4) used to prove (1) to work, the extent to which ${C}$ fails to be ${G}$-invariant cannot exceed the amount of “room” present in (1). In particular, when dealing with those “extremal” ${f}$ for which the left and right-hand sides of (1) are comparable to each other, one can only have a bounded amount of non-${G}$-invariance in the functional ${C}$. If ${C}$ fails so badly to be ${G}$-invariant that one does not expect the left-hand side of (6) to be at all bounded in such extremal situations, then the strategy of proving (1) using the intermediate quantity ${C}$ is doomed to failure – even if one has already produced some clever proof of one of the two inequalities ${A(f) \leq C(f)}$ or ${C(f) \leq B(f)}$ needed to make this strategy work. And even if it did work, one could amplify (4) to a simpler inequality

$\displaystyle A(f) \leq C_\infty(f) \leq B(f) \ \ \ \ \ (7)$

(assuming that the appropriate limit ${C_\infty(f) = \lim_{T \rightarrow \infty} C(Tf)}$ existed) which would likely also be easier to prove (one can take whatever proofs one had in mind of the inequalities in (4), conjugate them by ${T}$, and take a limit as ${T \rightarrow \infty}$ to extract a proof of (7)).

Here are some simple (but somewhat contrived) examples to illustrate these points. Suppose one wishes to prove the inequality

$\displaystyle xy \leq x^2 + y^2 \ \ \ \ \ (8)$

for all ${x,y>0}$. Both sides of this inequality are invariant with respect to interchanging ${x}$ with ${y}$, so the principle suggests that when proving this inequality directly, one should only use sub-inequalities that are also invariant with respect to this interchange. However, in this particular case there is enough “room” in the inequality that it is possible (though somewhat unnatural) to violate this principle. For instance, one could decide (for whatever reason) to start with the inequality

$\displaystyle 0 \leq (x - y/2)^2 = x^2 - xy + y^2/4$

to conclude that

$\displaystyle xy \leq x^2 + y^2/4$

and then use the obvious inequality ${x^2 + y^2/4 \leq x^2+y^2}$ to conclude the proof. Here, the intermediate quantity ${x^2 + y^2/4}$ is not invariant with respect to interchange of ${x}$ and ${y}$, but the failure is fairly mild (changing ${x}$ and ${y}$ only modifies the quantity ${x^2 + y^2/4}$ by a multiplicative factor of ${4}$ at most), and disappears completely in the most extremal case ${x=y}$, which helps explain why one could get away with using this quantity in the proof here. But it would be significantly harder (though still not impossible) to use non-symmetric intermediaries to prove the sharp version

$\displaystyle xy \leq \frac{x^2 + y^2}{2}$

of (8) (that is to say, the arithmetic mean-geometric mean inequality). Try it!

Similarly, consider the task of proving the triangle inequality

$\displaystyle |z+w| \leq |z| + |w| \ \ \ \ \ (9)$

for complex numbers ${z, w}$. One could try to leverage the triangle inequality ${|x+y| \leq |x| + |y|}$ for real numbers by using the crude estimate

$\displaystyle |z+w| \leq |\hbox{Re}(z+w)| + |\hbox{Im}(z+w)|$

and then use the real triangle inequality to obtain

$\displaystyle |\hbox{Re}(z+w)| \leq |\hbox{Re}(z)| + |\hbox{Re}(w)|$

and

$\displaystyle |\hbox{Im}(z+w)| \leq |\hbox{Im}(z)| + |\hbox{Im}(w)|$

and then finally use the inequalities

$\displaystyle |\hbox{Re}(z)|, |\hbox{Im}(z)| \leq |z| \ \ \ \ \ (10)$

and

$\displaystyle |\hbox{Re}(w)|, |\hbox{Im}(w)| \leq |w| \ \ \ \ \ (11)$

but when one puts this all together at the end of the day, one loses a factor of two:

$\displaystyle |z+w| \leq 2(|z| + |w|).$

One can “blame” this loss on the fact that while the original inequality (9) was invariant with respect to phase rotation ${(z,w) \mapsto (e^{i\theta} z, e^{i\theta} w)}$, the intermediate expressions we tried to use when proving it were not, leading to inefficient estimates. One can try to be smarter than this by using Pythagoras’ theorem ${|z|^2 = |\hbox{Re}(z)|^2 + |\hbox{Im}(z)|^2}$; this reduces the loss from ${2}$ to ${\sqrt{2}}$ but does not eliminate it completely, which is to be expected as one is still using non-invariant estimates in the proof. But one can remove the loss completely by using amplification; see the previous blog post for details (we also give a reformulation of this amplification below).

Here is a slight variant of the above example. Suppose that you had just learned in class to prove the triangle inequality

$\displaystyle (\sum_{n=1}^\infty |a_n+b_n|^2)^{1/2} \leq (\sum_{n=1}^\infty |a_n|^2)^{1/2} + (\sum_{n=1}^\infty |b_n|^2)^{1/2} \ \ \ \ \ (12)$

for (say) real square-summable sequences ${(a_n)_{n=1}^\infty}$, ${(b_n)_{n=1}^\infty}$, and was tasked to conclude the corresponding inequality

$\displaystyle (\sum_{n \in {\bf Z}} |a_n+b_n|^2)^{1/2} \leq (\sum_{n \in {\bf Z}} |a_n|^2)^{1/2} + (\sum_{n \in {\bf Z}} |b_n|^2)^{1/2} \ \ \ \ \ (13)$

for doubly infinite square-summable sequences ${(a_n)_{n \in {\bf Z}}, (b_n)_{n \in {\bf Z}}}$. The quickest way to do this is of course to exploit a bijection between the natural numbers ${1,2,\dots}$ and the integers, but let us say for sake of argument that one was unaware of such a bijection. One could then proceed instead by splitting the integers into the positive integers and the non-positive integers, and use (12) on each component separately; this is very similar to the strategy of proving (9) by splitting a complex number into real and imaginary parts, and will similarly lose a factor of ${2}$ or ${\sqrt{2}}$. In this case, one can “blame” this loss on the abandonment of translation invariance: both sides of the inequality (13) are invariant with respect to shifting the sequences ${(a_n)_{n \in {\bf Z}}}$, ${(b_n)_{n \in {\bf Z}}}$ by some shift ${h}$ to arrive at ${(a_{n-h})_{n \in {\bf Z}}, (b_{n-h})_{n \in {\bf Z}}}$, but the intermediate quantities caused by splitting the integers into two subsets are not invariant. Another way of thinking about this is that the splitting of the integers gives a privileged role to the origin ${n=0}$, whereas the inequality (13) treats all values of ${n}$ equally thanks to the translation invariance, and so using such a splitting is unnatural and not likely to lead to optimal estimates. On the other hand, one can deduce (13) from (12) by sending this symmetry to infinity; indeed, after applying a shift to (12) we see that

$\displaystyle (\sum_{n=-N}^\infty |a_n+b_n|^2)^{1/2} \leq (\sum_{n=-N}^\infty |a_n|^2)^{1/2} + (\sum_{n=-N}^\infty |b_n|^2)^{1/2}$

for any ${N}$, and on sending ${N \rightarrow \infty}$ we obtain (13) (one could invoke the monotone convergence theorem here to justify the limit, though in this case it is simple enough that one can just use first principles).

Note that the principle of preserving symmetry only applies to direct approaches to proving inequalities such as (1). There is a complementary approach, discussed for instance in this previous post, which is to spend the symmetry to place the variable ${f}$ “without loss of generality” in a “normal form”, “convenient coordinate system”, or a “good gauge”. Abstractly: suppose that there is some subset ${Y}$ of ${X}$ with the property that every ${f \in X}$ can be expressed in the form ${f = Tg}$ for some ${T \in G}$ and ${g \in Y}$ (that is to say, ${X = GY}$). Then, if one wishes to prove an inequality (1) for all ${f \in X}$, and one knows that both sides ${A(f), B(f)}$ of this inequality are ${G}$-invariant, then it suffices to check (1) just for those ${f}$ in ${Y}$, as this together with the ${G}$-invariance will imply the same inequality (1) for all ${f}$ in ${GY=X}$. By restricting to those ${f}$ in ${Y}$, one has given up (or spent) the ${G}$-invariance, as the set ${Y}$ will in typical not be preserved by the group action ${G}$. But by the same token, by eliminating the invariance, one also eliminates the prohibition on using non-invariant proof techniques, and one is now free to use a wider range of inequalities in order to try to establish (1). Of course, such inequalities should make crucial use of the restriction ${f \in Y}$, for if they did not, then the arguments would work in the more general setting ${f \in X}$, and then the previous principle would again kick in and warn us that the use of non-invariant inequalities would be inefficient. Thus one should “spend” the symmetry wisely to “buy” a restriction ${f \in Y}$ that will be of maximal utility in calculations (for instance by setting as many annoying factors and terms in one’s analysis to be ${0}$ or ${1}$ as possible).

As a simple example of this, let us revisit the complex triangle inequality (9). As already noted, both sides of this inequality are invariant with respect to the phase rotation symmetry ${(z,w) \mapsto (e^{i\theta} z, e^{i\theta} w)}$. This seems to limit one to using phase-rotation-invariant techniques to establish the inequality, in particular ruling out the use of real and imaginary parts as discussed previously. However, we can instead spend the phase rotation symmetry to restrict to a special class of ${z}$ and ${w}$. It turns out that the most efficient way to spend the symmetry is to achieve the normalisation of ${z+w}$ being a nonnegative real; this is of course possible since any complex number ${z+w}$ can be turned into a nonnegative real by multiplying by an appropriate phase ${e^{i\theta}}$. Once ${z+w}$ is a nonnegative real, the imaginary part disappears and we have

$\displaystyle |z+w| = \hbox{Re}(z+w) = \hbox{Re}(z) + \hbox{Re}(w),$

and the triangle inequality (9) is now an immediate consequence of (10), (11). (But note that if one had unwisely spent the symmetry to normalise, say, ${z}$ to be a non-negative real, then one is no closer to establishing (9) than before one had spent the symmetry.)