
Suppose we have an {n \times n} matrix {M} that is expressed in block-matrix form as

\displaystyle  M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}

where {A} is an {(n-k) \times (n-k)} matrix, {B} is an {(n-k) \times k} matrix, {C} is a {k \times (n-k)} matrix, and {D} is a {k \times k} matrix for some {0 < k < n}. If {A} is invertible, we can use the technique of Schur complementation to express the inverse of {M} (if it exists) in terms of the inverse of {A}, and the other components {B,C,D} of course. Indeed, to solve the equation

\displaystyle  M \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix},

where {x, a} are {(n-k) \times 1} column vectors and {y,b} are {k \times 1} column vectors, we can expand this out as a system

\displaystyle  Ax + By = a

\displaystyle  Cx + Dy = b.

Using the invertibility of {A}, we can write the first equation as

\displaystyle  x = A^{-1} a - A^{-1} B y \ \ \ \ \ (1)

and substituting this into the second equation yields

\displaystyle  (D - C A^{-1} B) y = b - C A^{-1} a

and thus (assuming that {D - CA^{-1} B} is invertible)

\displaystyle  y = - (D - CA^{-1} B)^{-1} CA^{-1} a + (D - CA^{-1} B)^{-1} b

and then inserting this back into (1) gives

\displaystyle  x = (A^{-1} + A^{-1} B (D - CA^{-1} B)^{-1} C A^{-1}) a - A^{-1} B (D - CA^{-1} B)^{-1} b.

Comparing this with

\displaystyle  \begin{pmatrix} x \\ y \end{pmatrix} = M^{-1} \begin{pmatrix} a \\ b \end{pmatrix},

we have managed to express the inverse of {M} as

\displaystyle  M^{-1} =

\displaystyle  \begin{pmatrix} A^{-1} + A^{-1} B (D - CA^{-1} B)^{-1} C A^{-1} & - A^{-1} B (D - CA^{-1} B)^{-1} \\ - (D - CA^{-1} B)^{-1} CA^{-1} & (D - CA^{-1} B)^{-1} \end{pmatrix}. \ \ \ \ \ (2)
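As a sanity check on (2), here is a minimal Python sketch (not part of the original argument) that assembles {M^{-1}} from the blocks in exact rational arithmetic and compares it with a directly computed inverse. The helper names (`minor`, `det`, `inverse`, `block_inverse`, and so on) and the particular {3 \times 3} test matrix with {k=1} are illustrative assumptions.

```python
from fractions import Fraction

def minor(m, i, j):
    """Delete row i and column j (0-based) from the matrix m."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(m) if r != i]

def det(m):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if not m:
        return Fraction(1)
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def inverse(m):
    """Exact inverse via the adjugate formula."""
    d, n = det(m), len(m)
    return [[(-1) ** (i + j) * det(minor(m, j, i)) / d for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b, sign=1):
    """Entrywise a + sign * b."""
    return [[x + sign * y for x, y in zip(r, s)] for r, s in zip(a, b)]

def frac(rows):
    return [[Fraction(x) for x in row] for row in rows]

def block_inverse(A, B, C, D):
    """Assemble M^{-1} from the blocks of M using formula (2)."""
    Ai = inverse(A)
    Si = inverse(add(D, matmul(C, matmul(Ai, B)), sign=-1))  # (D - C A^{-1} B)^{-1}
    AiB, CAi = matmul(Ai, B), matmul(C, Ai)
    top_left = add(Ai, matmul(AiB, matmul(Si, CAi)))
    top_right = [[-x for x in row] for row in matmul(AiB, Si)]
    bottom_left = [[-x for x in row] for row in matmul(Si, CAi)]
    top = [r + s for r, s in zip(top_left, top_right)]
    bottom = [r + s for r, s in zip(bottom_left, Si)]
    return top + bottom

# Hypothetical test data: n = 3, k = 1.
A = frac([[2, 1], [1, 3]])
B = frac([[1], [0]])
C = frac([[0, 1]])
D = frac([[4]])
M = [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]

M_inv_from_blocks = block_inverse(A, B, C, D)
print(M_inv_from_blocks == inverse(M))
```

Exact fractions are used so that the comparison is an equality rather than an approximate one; for larger matrices one would of course switch to a numerical library.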

One can consider the inverse problem: given the inverse {M^{-1}} of {M}, does one have a nice formula for the inverse {A^{-1}} of the minor {A}? Trying to recover this directly from (2) looks somewhat messy. However, one can proceed as follows. Let {U} denote the {n \times k} matrix

\displaystyle  U := \begin{pmatrix} 0 \\ I_k \end{pmatrix}

(with {I_k} the {k \times k} identity matrix), and let {V} be its transpose:

\displaystyle  V := \begin{pmatrix} 0 & I_k \end{pmatrix}.

Then for any scalar {t} (which we identify with {t} times the identity matrix), one has

\displaystyle  M + UtV = \begin{pmatrix} A & B \\ C & D+t \end{pmatrix},

and hence by (2)

\displaystyle  (M+UtV)^{-1} =

\displaystyle \begin{pmatrix} A^{-1} + A^{-1} B (D + t - CA^{-1} B)^{-1} C A^{-1} & - A^{-1} B (D + t- CA^{-1} B)^{-1} \\ - (D + t - CA^{-1} B)^{-1} CA^{-1} & (D + t - CA^{-1} B)^{-1} \end{pmatrix}.

Note that the inverses here will exist for {t} large enough. Taking limits as {t \rightarrow \infty}, we conclude that

\displaystyle  \lim_{t \rightarrow \infty} (M+UtV)^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & 0 \end{pmatrix}.

On the other hand, by the Woodbury matrix identity (discussed in this previous blog post), we have

\displaystyle  (M+UtV)^{-1} = M^{-1} - M^{-1} U (t^{-1} + V M^{-1} U)^{-1} V M^{-1}

and hence on taking limits and comparing with the preceding identity, one has

\displaystyle  \begin{pmatrix} A^{-1} & 0 \\ 0 & 0 \end{pmatrix} = M^{-1} - M^{-1} U (V M^{-1} U)^{-1} V M^{-1}.

This achieves the aim of expressing the inverse {A^{-1}} of the minor in terms of the inverse of the full matrix. Taking traces and rearranging, we conclude in particular that

\displaystyle  \mathrm{tr} A^{-1} = \mathrm{tr} M^{-1} - \mathrm{tr} (V M^{-2} U) (V M^{-1} U)^{-1}. \ \ \ \ \ (3)

In the {k=1} case, this can be simplified to

\displaystyle  \mathrm{tr} A^{-1} = \mathrm{tr} M^{-1} - \frac{e_n^T M^{-2} e_n}{e_n^T M^{-1} e_n} \ \ \ \ \ (4)

where {e_n} is the {n^{th}} basis column vector.
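The identity (4) is easy to test on a small example. The following Python sketch (an illustration only, with hypothetical helper names and an arbitrarily chosen {3 \times 3} matrix) computes both sides in exact rational arithmetic.

```python
from fractions import Fraction

def minor(m, i, j):
    """Delete row i and column j (0-based) from the matrix m."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(m) if r != i]

def det(m):
    """Determinant by Laplace expansion along the first row."""
    if not m:
        return Fraction(1)
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def inverse(m):
    """Exact inverse via the adjugate formula."""
    d, n = det(m), len(m)
    return [[(-1) ** (i + j) * det(minor(m, j, i)) / d for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

# Hypothetical test matrix; A is its top left (n-1) x (n-1) minor.
M = [[Fraction(x) for x in row] for row in [[2, 1, 1], [1, 3, 0], [0, 1, 4]]]
n = len(M)
A = [row[:n - 1] for row in M[:n - 1]]

M_inv = inverse(M)
M_inv_sq = matmul(M_inv, M_inv)

lhs = trace(inverse(A))
# e_n^T X e_n is just the bottom right entry of X.
rhs = trace(M_inv) - M_inv_sq[n - 1][n - 1] / M_inv[n - 1][n - 1]
print(lhs, rhs)
```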

We can apply this identity to understand how the spectrum of an {n \times n} random matrix {M} relates to that of its top left {n-1 \times n-1} minor {A}. Subtracting any complex multiple {z} of the identity from {M} (and hence from {A}), we can relate the Stieltjes transform {s_M(z) := \frac{1}{n} \mathrm{tr}(M-z)^{-1}} of {M} with the Stieltjes transform {s_A(z) := \frac{1}{n-1} \mathrm{tr}(A-z)^{-1}} of {A}:

\displaystyle  s_A(z) = \frac{n}{n-1} s_M(z) - \frac{1}{n-1} \frac{e_n^T (M-z)^{-2} e_n}{e_n^T (M-z)^{-1} e_n} \ \ \ \ \ (5)

At this point we begin to proceed informally. Assume for sake of argument that the random matrix {M} is Hermitian, with distribution that is invariant under conjugation by the unitary group {U(n)}; for instance, {M} could be drawn from the Gaussian Unitary Ensemble (GUE), or alternatively {M} could be of the form {M = U D U^*} for some real diagonal matrix {D} and {U} a unitary matrix drawn randomly from {U(n)} using Haar measure. To fix normalisations we will assume that the eigenvalues of {M} are typically of size {O(1)}. Then {A} is also Hermitian and {U(n-1)}-invariant. Furthermore, the law of {e_n^T (M-z)^{-1} e_n} will be the same as the law of {u^* (M-z)^{-1} u}, where {u} is now drawn uniformly from the unit sphere (independently of {M}). Diagonalising {M} into eigenvalues {\lambda_j} and eigenvectors {v_j}, we have

\displaystyle u^* (M-z)^{-1} u = \sum_{j=1}^n \frac{|u^* v_j|^2}{\lambda_j - z}.

One can think of {u} as a random (complex) Gaussian vector, divided by the magnitude of that vector (which, by the Chernoff inequality, will concentrate around {\sqrt{n}}). Thus the coefficients {u^* v_j} with respect to the orthonormal basis {v_1,\dots,v_n} can be thought of as independent (complex) Gaussian random variables, divided by that magnitude. Using this and the Chernoff inequality again, we see (for {z} distance {\sim 1} away from the real axis at least) that one has the concentration of measure

\displaystyle  u^* (M-z)^{-1} u \approx \frac{1}{n} \sum_{j=1}^n \frac{1}{\lambda_j - z}

and thus

\displaystyle  e_n^T (M-z)^{-1} e_n \approx \frac{1}{n} \mathrm{tr} (M-z)^{-1} = s_M(z)

(that is to say, the diagonal entries of {(M-z)^{-1}} are roughly constant). Similarly we have

\displaystyle  e_n^T (M-z)^{-2} e_n \approx \frac{1}{n} \mathrm{tr} (M-z)^{-2} = \frac{d}{dz} s_M(z).

Inserting this into (5) and discarding terms of size {O(1/n^2)}, we thus conclude the approximate relationship

\displaystyle  s_A(z) \approx s_M(z) + \frac{1}{n} ( s_M(z) - s_M(z)^{-1} \frac{d}{dz} s_M(z) ).

This can be viewed as a difference equation for the Stieltjes transform of top left minors of {M}. Iterating this equation, and formally replacing the difference equation by a differential equation in the large {n} limit, we see that when {n} is large and {k \approx e^{-t} n} for some {t \geq 0}, one expects the top left {k \times k} minor {A_k} of {M} to have Stieltjes transform

\displaystyle  s_{A_k}(z) \approx s( t, z ) \ \ \ \ \ (6)

where {s(t,z)} solves the Burgers-type equation

\displaystyle  \partial_t s(t,z) = s(t,z) - s(t,z)^{-1} \frac{d}{dz} s(t,z) \ \ \ \ \ (7)

with initial data {s(0,z) = s_M(z)}.

Example 1 If {M} is a constant multiple {M = cI_n} of the identity, then {s_M(z) = \frac{1}{c-z}}. One checks that {s(t,z) = \frac{1}{c-z}} is a steady state solution to (7), which is unsurprising given that all minors of {M} are also {c} times the identity.

Example 2 If {M} is GUE normalised so that each entry has variance {\sigma^2/n}, then by the semi-circular law (see previous notes) one has {s_M(z) \approx \frac{-z + \sqrt{z^2-4\sigma^2}}{2\sigma^2} = -\frac{2}{z + \sqrt{z^2-4\sigma^2}}} (using an appropriate branch of the square root). One can then verify the self-similar solution

\displaystyle  s(t,z) = \frac{-z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}{2\sigma^2 e^{-t}} = -\frac{2}{z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}

to (7), which is consistent with the fact that a top {k \times k} minor of {M} also has the law of GUE, with each entry having variance {\sigma^2 / n \approx \sigma^2 e^{-t} / k} when {k \approx e^{-t} n}.
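Both examples can be checked numerically against (7). The sketch below (an illustration, not a proof) approximates the derivatives by central finite differences at one arbitrarily chosen point in the upper half-plane. The function names, the step size `h`, the sample point, and the way the branch of the square root is selected (as a product of two principal square roots, which behaves like {z} at infinity when {z} has positive imaginary part) are all assumptions of this sketch.

```python
import cmath
import math

def s_const(t, z, c=2.0):
    """Example 1: Stieltjes transform of a point mass at c (independent of t)."""
    return 1.0 / (c - z)

def s_gue(t, z, sigma=1.0):
    """Example 2: the self-similar semicircular solution with variance sigma^2 e^{-t}."""
    v = sigma ** 2 * math.exp(-t)
    r = 2.0 * math.sqrt(v)
    # Branch of sqrt(z^2 - 4v) asymptotic to z, assuming Im z > 0.
    root = cmath.sqrt(z - r) * cmath.sqrt(z + r)
    return (-z + root) / (2.0 * v)

def pde_residual(s, t, z, h=1e-5):
    """Finite-difference estimate of  d_t s - (s - s^{-1} d_z s)  at the point (t, z)."""
    ds_dt = (s(t + h, z) - s(t - h, z)) / (2 * h)
    ds_dz = (s(t, z + h) - s(t, z - h)) / (2 * h)
    value = s(t, z)
    return ds_dt - (value - ds_dz / value)

sample_t, sample_z = 0.4, 0.3 + 1.0j
print(abs(pde_residual(s_const, sample_t, sample_z)))
print(abs(pde_residual(s_gue, sample_t, sample_z)))
```

Both residuals should be tiny compared with the individual terms, limited only by the finite-difference error.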

One can justify the approximation (6) given a sufficiently good well-posedness theory for the equation (7). We will not do so here, but will note that (as with the classical inviscid Burgers equation) the equation can be solved exactly (formally, at least) by the method of characteristics. For any initial position {z_0}, we consider the characteristic flow {t \mapsto z(t)} formed by solving the ODE

\displaystyle  \frac{d}{dt} z(t) = s(t,z(t))^{-1} \ \ \ \ \ (8)

with initial data {z(0) = z_0}, ignoring for this discussion the problems of existence and uniqueness. Then from the chain rule, the equation (7) implies that

\displaystyle  \frac{d}{dt} s( t, z(t) ) = s(t,z(t))

and thus {s(t,z(t)) = e^t s(0,z_0)}. Inserting this back into (8) we see that

\displaystyle  z(t) = z_0 + s(0,z_0)^{-1} (1-e^{-t})

and thus (7) may be solved implicitly via the equation

\displaystyle  s(t, z_0 + s(0,z_0)^{-1} (1-e^{-t}) ) = e^t s(0, z_0) \ \ \ \ \ (9)

for all {t} and {z_0}.

Remark 3 In practice, the equation (9) may stop working when {z_0 + s(0,z_0)^{-1} (1-e^{-t})} crosses the real axis, as (7) does not necessarily hold in this region. It is a cute exercise (ultimately coming from the Cauchy-Schwarz inequality) to show that this crossing always happens, for instance if {z_0} has positive imaginary part then {z_0 + s(0,z_0)^{-1}} necessarily has negative or zero imaginary part.

Example 4 Suppose we have {s(0,z) = \frac{1}{c-z}} as in Example 1. Then (9) becomes

\displaystyle  s( t, z_0 + (c-z_0) (1-e^{-t}) ) = \frac{e^t}{c-z_0}

for any {t,z_0}, which after making the change of variables {z = z_0 + (c-z_0) (1-e^{-t}) = c - e^{-t} (c - z_0)} becomes

\displaystyle  s(t, z ) = \frac{1}{c-z}

as in Example 1.

Example 5 Suppose we have

\displaystyle  s(0,z) = \frac{-z + \sqrt{z^2-4\sigma^2}}{2\sigma^2} = -\frac{2}{z + \sqrt{z^2-4\sigma^2}}.

as in Example 2. Then (9) becomes

\displaystyle  s(t, z_0 - \frac{z_0 + \sqrt{z_0^2-4\sigma^2}}{2} (1-e^{-t}) ) = e^t \frac{-z_0 + \sqrt{z_0^2-4\sigma^2}}{2\sigma^2}.

If we write

\displaystyle  z := z_0 - \frac{z_0 + \sqrt{z_0^2-4\sigma^2}}{2} (1-e^{-t})

\displaystyle  = \frac{(1+e^{-t}) z_0 - (1-e^{-t}) \sqrt{z_0^2-4\sigma^2}}{2}

one can calculate that

\displaystyle  z^2 - 4 \sigma^2 e^{-t} = (\frac{(1-e^{-t}) z_0 - (1+e^{-t}) \sqrt{z_0^2-4\sigma^2}}{2})^2

and hence

\displaystyle  \frac{-z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}{2\sigma^2 e^{-t}} = e^t \frac{-z_0 + \sqrt{z_0^2-4\sigma^2}}{2\sigma^2}

which gives

\displaystyle  s(t,z) = \frac{-z + \sqrt{z^2 - 4\sigma^2 e^{-t}}}{2\sigma^2 e^{-t}}. \ \ \ \ \ (10)

One can recover the spectral measure {\mu} from the Stieltjes transform {s(z)} as the weak limit of {x \mapsto \frac{1}{\pi} \mathrm{Im} s(x+i\varepsilon)} as {\varepsilon \rightarrow 0}; we write this informally as

\displaystyle  d\mu(x) = \frac{1}{\pi} \mathrm{Im} s(x+i0^+)\ dx.

In this informal notation, we have for instance that

\displaystyle  \delta_c(x) = \frac{1}{\pi} \mathrm{Im} \frac{1}{c-x-i0^+}\ dx

which can be interpreted as the fact that the Cauchy distributions {\frac{1}{\pi} \frac{\varepsilon}{(c-x)^2+\varepsilon^2}} converge weakly to the Dirac mass at {c} as {\varepsilon \rightarrow 0}. Similarly, the spectral measure associated to (10) is the semicircular measure {\frac{1}{2\pi \sigma^2 e^{-t}} (4 \sigma^2 e^{-t}-x^2)_+^{1/2}}.
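To see this inversion formula in action, one can evaluate {\frac{1}{\pi} \mathrm{Im}\, s(x+i\varepsilon)} for a small but positive {\varepsilon} and compare with the semicircular density. The following Python sketch does this for the {t=0}, {\sigma=1} case; the function names, the value of `eps`, and the sample points are illustrative assumptions.

```python
import cmath
import math

def s_semicircle(z, v=1.0):
    """Stieltjes transform of the semicircle law with variance v, for Im z > 0."""
    r = 2.0 * math.sqrt(v)
    # Product of principal square roots: the branch of sqrt(z^2 - 4v) that is ~ z at infinity.
    return (-z + cmath.sqrt(z - r) * cmath.sqrt(z + r)) / (2.0 * v)

def density_from_stieltjes(s, x, eps=1e-8):
    """Approximate the density at x by (1/pi) Im s(x + i eps)."""
    return s(complex(x, eps)).imag / math.pi

def semicircle_density(x, v=1.0):
    return math.sqrt(max(4.0 * v - x * x, 0.0)) / (2.0 * math.pi * v)

for x in (0.0, 1.0, -1.5, 3.0):
    print(x, density_from_stieltjes(s_semicircle, x), semicircle_density(x))
```

The last sample point lies outside the support {[-2,2]}, where both quantities should be (essentially) zero.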

If we let {\mu_t} be the spectral measure associated to {s(t,\cdot)}, then the curve {e^{-t} \mapsto \mu_t} from {(0,1]} to the space of measures is the high-dimensional limit {n \rightarrow \infty} of a Gelfand-Tsetlin pattern (discussed in this previous post), if the pattern is randomly generated amongst all matrices {M} with spectrum asymptotic to {\mu_0} as {n \rightarrow \infty}. For instance, if {\mu_0 = \delta_c}, then the curve is {\alpha \mapsto \delta_c}, corresponding to a pattern that is entirely filled with {c}'s. If instead {\mu_0 = \frac{1}{2\pi \sigma^2} (4\sigma^2-x^2)_+^{1/2}} is a semicircular distribution, then the pattern is

\displaystyle  \alpha \mapsto \frac{1}{2\pi \sigma^2 \alpha} (4\sigma^2 \alpha -x^2)_+^{1/2},

thus at height {\alpha} from the top, the pattern is semicircular on the interval {[-2\sigma \sqrt{\alpha}, 2\sigma \sqrt{\alpha}]}. The interlacing property of Gelfand-Tsetlin patterns translates to the claim that {\alpha \mu_\alpha(-\infty,\lambda)} and {\alpha \mu_\alpha(\lambda,\infty)} are both non-decreasing in {\alpha} for any fixed {\lambda}, since interlacing prevents the number of eigenvalues of a minor on either side of {\lambda} from decreasing when the minor is enlarged. In principle one should be able to establish these monotonicity claims directly from the PDE (7) or from the implicit solution (9), but it was not clear to me how to do so.

An interesting example of such a limiting Gelfand-Tsetlin pattern occurs when {\mu_0 = \frac{1}{2} \delta_{-1} + \frac{1}{2} \delta_1}, which corresponds to {M} being {2P-I}, where {P} is an orthogonal projection to a random {n/2}-dimensional subspace of {{\bf C}^n}. Here we have

\displaystyle  s(0,z) = \frac{1}{2} \frac{1}{-1-z} + \frac{1}{2} \frac{1}{1-z} = \frac{z}{1-z^2}

and so (9) in this case becomes

\displaystyle  s(t, z_0 + \frac{1-z_0^2}{z_0} (1-e^{-t}) ) = \frac{e^t z_0}{1-z_0^2}

A tedious calculation then gives the solution

\displaystyle  s(t,z) = \frac{(2e^{-t}-1)z + \sqrt{z^2 - 4e^{-t}(1-e^{-t})}}{2e^{-t}(1-z^2)}. \ \ \ \ \ (11)
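Rather than reproduce the tedious calculation, one can at least test (11) numerically against the implicit equation (9). The Python sketch below follows a single characteristic starting from an arbitrarily chosen {z_0} in the upper half-plane, with {t} small enough that the characteristic stays there. The function names, the sample values, and the choice of branch for the square root (a product of two principal square roots) are assumptions of this sketch.

```python
import cmath
import math

def s0(z):
    """Initial data s(0, z) = z / (1 - z^2)."""
    return z / (1 - z * z)

def s_two_point(t, z):
    """The proposed solution (11), for Im z > 0."""
    a = math.exp(-t)
    r = 2.0 * math.sqrt(a * (1.0 - a))
    # Branch of sqrt(z^2 - 4a(1-a)) that is asymptotic to z at infinity.
    root = cmath.sqrt(z - r) * cmath.sqrt(z + r)
    return ((2 * a - 1) * z + root) / (2 * a * (1 - z * z))

def characteristic_endpoint(t, z0):
    """z(t) = z_0 + s(0, z_0)^{-1} (1 - e^{-t})."""
    return z0 + (1 - math.exp(-t)) / s0(z0)

t, z0 = 0.3, 0.5 + 2.0j
z = characteristic_endpoint(t, z0)

lhs = s_two_point(t, z)          # s(t, z(t)) according to (11)
rhs = math.exp(t) * s0(z0)       # e^t s(0, z_0) according to (9)
print(z, lhs, rhs)
```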

For {\alpha = e^{-t} > 1/2}, there are simple poles at {z=-1,+1}, and the associated measure is

\displaystyle  \mu_\alpha = \frac{2\alpha-1}{2\alpha} \delta_{-1} + \frac{2\alpha-1}{2\alpha} \delta_1 + \frac{1}{2\pi \alpha(1-x^2)} (4\alpha(1-\alpha)-x^2)_+^{1/2}\ dx.

This reflects the interlacing property, which forces {\frac{2\alpha-1}{2\alpha} \alpha n} of the {\alpha n} eigenvalues of the {\alpha n \times \alpha n} minor to be equal to {-1} (resp. {+1}). For {\alpha = e^{-t} \leq 1/2}, the poles disappear and one just has

\displaystyle  \mu_\alpha = \frac{1}{2\pi \alpha(1-x^2)} (4\alpha(1-\alpha)-x^2)_+^{1/2}\ dx.

For {\alpha=1/2}, one has an inverse semicircle distribution

\displaystyle  \mu_{1/2} = \frac{1}{\pi} (1-x^2)_+^{-1/2}\ dx.

There is presumably a direct geometric explanation of this fact (basically describing the singular values of the product of two random orthogonal projections to half-dimensional subspaces of {{\bf C}^n}), but I do not know of one off-hand.

The evolution of {s(t,z)} can also be understood using the {R}-transform and {S}-transform from free probability. Formally, let {z(t,s)} be the inverse of {s(t,z)}, thus

\displaystyle  s(t,z(t,s)) = s

for all {t,s}, and then define the {R}-transform

\displaystyle  R(t,s) := z(t,-s) - \frac{1}{s}.

The equation (9) may be rewritten as

\displaystyle  z( t, e^t s ) = z(0,s) + s^{-1} (1-e^{-t})

and hence

\displaystyle  R(t, -e^t s) = R(0, -s)

or equivalently

\displaystyle  R(t,s) = R(0, e^{-t} s). \ \ \ \ \ (12)

See these previous notes for a discussion of free probability topics such as the {R}-transform.

Example 6 If {s(t,z) = \frac{1}{c-z}} then the {R} transform is {R(t,s) = c}.

Example 7 If {s(t,z)} is given by (10), then the {R} transform is

\displaystyle  R(t,s) = \sigma^2 e^{-t} s.

Example 8 If {s(t,z)} is given by (11), then the {R} transform is

\displaystyle  R(t,s) = \frac{-1 + \sqrt{1 + 4 s^2 e^{-2t}}}{2 s e^{-t}}.

This simple relationship (12) is essentially due to Nica and Speicher (thanks to Dima Shlyakhtenko for this reference). It has the remarkable consequence that when {\alpha = 1/m} is the reciprocal of a natural number {m}, then {\mu_{1/m}} is the free arithmetic mean of {m} copies of {\mu}, that is to say {\mu_{1/m}} is the free convolution {\mu \boxplus \dots \boxplus \mu} of {m} copies of {\mu}, pushed forward by the map {\lambda \rightarrow \lambda/m}. In terms of random matrices, this is asserting that the top {n/m \times n/m} minor of a random matrix {M} has spectral measure approximately equal to that of an arithmetic mean {\frac{1}{m} (M_1 + \dots + M_m)} of {m} independent copies of {M}, so that the process of taking top left minors is in some sense a continuous analogue of the process of taking freely independent arithmetic means. There ought to be a geometric proof of this assertion, but I do not know of one. In the limit {m \rightarrow \infty} (or {\alpha \rightarrow 0}), the {R}-transform becomes linear and the spectral measure becomes semicircular, which is of course consistent with the free central limit theorem.

In a similar vein, if one defines the function

\displaystyle  \omega(t,z) := \alpha \int_{\bf R} \frac{zx}{1-zx}\ d\mu_\alpha(x) = e^{-t} (- 1 - z^{-1} s(t, z^{-1}))

and inverts it to obtain a function {z(t,\omega)} with

\displaystyle  \omega(t, z(t,\omega)) = \omega

for all {t, \omega}, then the {S}-transform {S(t,\omega)} is defined by

\displaystyle  S(t,\omega) := \frac{1+\omega}{\omega} z(t,\omega).

Writing

\displaystyle  s(t,z) = - z^{-1} ( 1 + e^t \omega(t, z^{-1}) )

for any {t}, {z}, we have

\displaystyle  z_0 + s(0,z_0)^{-1} (1-e^{-t}) = z_0 \frac{\omega(0,z_0^{-1})+e^{-t}}{\omega(0,z_0^{-1})+1}

and so (9) becomes

\displaystyle  - z_0^{-1} \frac{\omega(0,z_0^{-1})+1}{\omega(0,z_0^{-1})+e^{-t}} (1 + e^{t} \omega(t, z_0^{-1} \frac{\omega(0,z_0^{-1})+1}{\omega(0,z_0^{-1})+e^{-t}}))

\displaystyle = - e^t z_0^{-1} (1 + \omega(0, z_0^{-1}))

which simplifies to

\displaystyle  \omega(t, z_0^{-1} \frac{\omega(0,z_0^{-1})+1}{\omega(0,z_0^{-1})+e^{-t}}) = \omega(0, z_0^{-1});

replacing {z_0} by {z(0,\omega)^{-1}} we obtain

\displaystyle  \omega(t, z(0,\omega) \frac{\omega+1}{\omega+e^{-t}}) = \omega

and thus

\displaystyle  z(0,\omega)\frac{\omega+1}{\omega+e^{-t}} = z(t, \omega)

and hence

\displaystyle  S(0, \omega) = \frac{\omega+e^{-t}}{\omega+1} S(t, \omega).

Equivalently, {S(t,\omega) = \frac{\omega+1}{\omega+e^{-t}} S(0,\omega)}. One can compute {\frac{\omega+1}{\omega+e^{-t}}} (with {\alpha = e^{-t}}) to be the {S}-transform of the measure {(1-\alpha) \delta_0 + \alpha \delta_1}; from the link between {S}-transforms and free products (see e.g. these notes of Guionnet), we conclude that {(1-\alpha)\delta_0 + \alpha \mu_\alpha} is the free product of {\mu_1} and {(1-\alpha) \delta_0 + \alpha \delta_1}. This is consistent with the random matrix theory interpretation, since {(1-\alpha)\delta_0 + \alpha \mu_\alpha} is also the spectral measure of {PMP}, where {P} is the orthogonal projection to the span of the first {\alpha n} basis elements, so in particular {P} has spectral measure {(1-\alpha) \delta_0 + \alpha \delta_1}. If {M} is unitarily invariant then (by a fundamental result of Voiculescu) it is asymptotically freely independent of {P}, so the spectral measure of {PMP = P^{1/2} M P^{1/2}} is asymptotically the free product of that of {M} and of {P}.

Fix a non-negative integer {k}. Define a (weak) integer partition of length {k} to be a tuple {\lambda = (\lambda_1,\dots,\lambda_k)} of non-increasing non-negative integers {\lambda_1 \geq \dots \geq \lambda_k \geq 0}. (Here our partitions are “weak” in the sense that we allow some parts of the partition to be zero. Henceforth we will omit the modifier “weak”, as we will not need to consider the more usual notion of “strong” partitions.) To each such partition {\lambda}, one can associate a Young diagram consisting of {k} left-justified rows of boxes, with the {i^{th}} row containing {\lambda_i} boxes. A semi-standard Young tableau (or Young tableau for short) {T} of shape {\lambda} is a filling of these boxes by integers in {\{1,\dots,k\}} that is weakly increasing along rows (moving rightwards) and strictly increasing along columns (moving downwards). The collection of such tableaux will be denoted {{\mathcal T}_\lambda}. The weight {|T|} of a tableau {T} is the tuple {(n_1,\dots,n_k)}, where {n_i} is the number of occurrences of the integer {i} in the tableau. For instance, if {k=3} and {\lambda = (6,4,2)}, an example of a Young tableau of shape {\lambda} would be

\displaystyle  \begin{tabular}{|c|c|c|c|c|c|} \hline 1 & 1 & 1 & 2 & 3 & 3 \\ \cline{1-6} 2 & 2 & 2 &3\\ \cline{1-4} 3 & 3\\ \cline{1-2} \end{tabular}

The weight here would be {|T| = (3,4,5)}.

To each partition {\lambda} one can associate the Schur polynomial {s_\lambda(u_1,\dots,u_k)} on {k} variables {u = (u_1,\dots,u_k)}, which we will define as

\displaystyle  s_\lambda(u) := \sum_{T \in {\mathcal T}_\lambda} u^{|T|}

using the multinomial convention

\displaystyle (u_1,\dots,u_k)^{(n_1,\dots,n_k)} := u_1^{n_1} \dots u_k^{n_k}.

Thus for instance the Young tableau {T} given above would contribute a term {u_1^3 u_2^4 u_3^5} to the Schur polynomial {s_{(6,4,2)}(u_1,u_2,u_3)}. In the case of partitions of the form {(n,0,\dots,0)}, the Schur polynomial {s_{(n,0,\dots,0)}} is just the complete homogeneous symmetric polynomial {h_n} of degree {n} on {k} variables:

\displaystyle  s_{(n,0,\dots,0)}(u_1,\dots,u_k) := \sum_{n_1,\dots,n_k \geq 0: n_1+\dots+n_k = n} u_1^{n_1} \dots u_k^{n_k},

thus for instance

\displaystyle  s_{(3,0)}(u_1,u_2) = u_1^3 + u_1^2 u_2 + u_1 u_2^2 + u_2^3.
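The tableau definition translates directly into a brute-force enumeration. The following Python sketch (a naive illustration with hypothetical function names, only practical for small shapes) lists all semi-standard Young tableaux of a given shape and collects the monomials {u^{|T|}} into a dictionary of exponent tuples.

```python
from collections import Counter

def tableaux(shape, k):
    """Yield every semi-standard Young tableau of the given shape with entries in 1..k,
    each as a list of rows."""
    cells = [(i, j) for i, row_len in enumerate(shape) for j in range(row_len)]

    def fill(index, rows):
        if index == len(cells):
            yield [list(r) for r in rows]
            return
        i, j = cells[index]
        lowest = 1
        if j > 0:
            lowest = max(lowest, rows[i][j - 1])      # weakly increasing along rows
        if i > 0:
            lowest = max(lowest, rows[i - 1][j] + 1)  # strictly increasing down columns
        for value in range(lowest, k + 1):
            rows[i].append(value)
            yield from fill(index + 1, rows)
            rows[i].pop()

    yield from fill(0, [[] for _ in shape])

def weight(tableau, k):
    """The tuple (n_1, ..., n_k) counting occurrences of each integer."""
    counts = Counter(v for row in tableau for v in row)
    return tuple(counts[i] for i in range(1, k + 1))

def schur_monomials(shape, k):
    """Map each exponent tuple to its coefficient in s_shape(u_1, ..., u_k)."""
    return Counter(weight(t, k) for t in tableaux(shape, k))

example = [[1, 1, 1, 2, 3, 3], [2, 2, 2, 3], [3, 3]]
print(weight(example, 3))
print(dict(schur_monomials((3, 0), 2)))
```

Filling the boxes row by row means that the entry to the left and the entry above are always available when a box is filled, which is all the two monotonicity conditions require.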

Schur polynomials are ubiquitous in the algebraic combinatorics of “type {A} objects” such as the symmetric group {S_k}, the general linear group {GL_k}, or the unitary group {U_k}. For instance, one can view {s_\lambda} as the character of an irreducible polynomial representation of {GL_k({\bf C})} associated with the partition {\lambda}. However, we will not focus on these interpretations of Schur polynomials in this post.

This definition of Schur polynomials allows for a way to describe the polynomials recursively. If {k > 1} and {T} is a Young tableau of shape {\lambda = (\lambda_1,\dots,\lambda_k)}, taking values in {\{1,\dots,k\}}, one can form a sub-tableau {T'} of some shape {\lambda' = (\lambda'_1,\dots,\lambda'_{k-1})} by removing all the appearances of {k} (which, among other things, necessarily deletes the {k^{th}} row). For instance, with {T} as in the previous example, the sub-tableau {T'} would be

\displaystyle  \begin{tabular}{|c|c|c|c|} \hline 1 & 1 & 1 & 2 \\ \cline{1-4} 2 & 2 & 2 \\ \cline{1-3} \end{tabular}

and the reduced partition {\lambda'} in this case is {(4,3)}. As Young tableaux are required to be strictly increasing down columns, we can see that the reduced partition {\lambda'} must intersperse the original partition {\lambda} in the sense that

\displaystyle  \lambda_{i+1} \leq \lambda'_i \leq \lambda_i \ \ \ \ \ (1)

for all {1 \leq i \leq k-1}; we denote this interspersion relation as {\lambda' \prec \lambda} (though we caution that this is not intended to be a partial ordering). In the converse direction, if {\lambda' \prec \lambda} and {T'} is a Young tableau with shape {\lambda'} with entries in {\{1,\dots,k-1\}}, one can form a Young tableau {T} with shape {\lambda} and entries in {\{1,\dots,k\}} by appending to {T'} an entry of {k} in all the boxes that appear in the {\lambda} shape but not the {\lambda'} shape. This one-to-one correspondence leads to the recursion

\displaystyle  s_\lambda(u) = \sum_{\lambda' \prec \lambda} s_{\lambda'}(u') u_k^{|\lambda| - |\lambda'|} \ \ \ \ \ (2)

where {u = (u_1,\dots,u_k)}, {u' = (u_1,\dots,u_{k-1})}, and the size {|\lambda|} of a partition {\lambda = (\lambda_1,\dots,\lambda_k)} is defined as {|\lambda| := \lambda_1 + \dots + \lambda_k}.
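The recursion (2) gives a compact way to evaluate Schur polynomials at a numerical point. Here is a minimal Python sketch of it; the function names `interspersing` and `schur` are hypothetical, and no attempt is made at efficiency.

```python
from itertools import product

def interspersing(lam):
    """Yield all partitions lam' of length len(lam) - 1 with
    lam[i+1] <= lam'[i] <= lam[i] (0-based indices)."""
    ranges = [range(lam[i + 1], lam[i] + 1) for i in range(len(lam) - 1)]
    yield from product(*ranges)

def schur(lam, u):
    """Evaluate s_lam(u) via recursion (2); lam and u are tuples of the same length."""
    if not lam:
        return 1  # the empty partition contributes the constant 1
    return sum(
        schur(mu, u[:-1]) * u[-1] ** (sum(lam) - sum(mu))
        for mu in interspersing(lam)
    )

print(schur((3, 0), (2, 3)))
print(schur((2, 1, 0), (2, 3, 5)))
```

Note that any tuple produced by `interspersing` is automatically non-increasing, since each entry is sandwiched between consecutive parts of the original partition.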

One can use this recursion (2) to prove some further standard identities for Schur polynomials, such as the determinant identity

\displaystyle  s_\lambda(u) V(u) = \det( u_i^{\lambda_j+k-j} )_{1 \leq i,j \leq k} \ \ \ \ \ (3)

for {u=(u_1,\dots,u_k)}, where {V(u)} denotes the Vandermonde determinant

\displaystyle  V(u) := \prod_{1 \leq i < j \leq k} (u_i - u_j), \ \ \ \ \ (4)

or the Jacobi-Trudi identity

\displaystyle  s_\lambda(u) = \det( h_{\lambda_j - j + i}(u) )_{1 \leq i,j \leq k}, \ \ \ \ \ (5)

with the convention that {h_d(u) = 0} if {d} is negative. Thus for instance

\displaystyle s_{(1,1,0,\dots,0)}(u) = h_1^2(u) - h_0(u) h_2(u) = \sum_{1 \leq i < j \leq k} u_i u_j.

We review the (standard) derivation of these identities via (2) below the fold. Among other things, these identities show that the Schur polynomials are symmetric, which is not immediately obvious from their definition.
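Before turning to the proofs, the identities (3) and (5) can be spot-checked numerically. The Python sketch below evaluates both determinants at an arbitrary integer point; the function names are hypothetical, and the determinant is computed by the Leibniz expansion, which is only sensible for very small {k}.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, permutations
from math import prod

def h(d, u):
    """Complete homogeneous symmetric polynomial h_d(u), with h_d = 0 for d < 0."""
    if d < 0:
        return 0
    return sum(prod(c) for c in combinations_with_replacement(u, d))

def det(m):
    """Determinant via the Leibniz expansion over permutations."""
    n = len(m)
    total = 0
    for perm in permutations(range(n)):
        inversions = sum(perm[a] > perm[b] for a in range(n) for b in range(a + 1, n))
        total += (-1) ** inversions * prod(m[i][perm[i]] for i in range(n))
    return total

def vandermonde(u):
    """V(u) = prod_{i<j} (u_i - u_j), as in (4)."""
    return prod(u[i] - u[j] for i in range(len(u)) for j in range(i + 1, len(u)))

def schur_bialternant(lam, u):
    """s_lam(u) from the determinant identity (3); requires distinct u_i."""
    k = len(u)
    numerator = det([[u[i] ** (lam[j] + k - 1 - j) for j in range(k)] for i in range(k)])
    return Fraction(numerator, vandermonde(u))

def schur_jacobi_trudi(lam, u):
    """s_lam(u) from the Jacobi-Trudi identity (5)."""
    k = len(u)
    return det([[h(lam[j] - j + i, u) for j in range(k)] for i in range(k)])

u = (2, 3, 5)
print(schur_bialternant((2, 1, 0), u), schur_jacobi_trudi((2, 1, 0), u))
```

The indices in the code are 0-based, so the exponent {\lambda_j + k - j} of (3) becomes `lam[j] + k - 1 - j`, while the subscript {\lambda_j - j + i} of (5) is unchanged.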

One can also iterate (2) to write

\displaystyle  s_\lambda(u) = \sum_{() = \lambda^0 \prec \lambda^1 \prec \dots \prec \lambda^k = \lambda} \prod_{j=1}^k u_j^{|\lambda^j| - |\lambda^{j-1}|} \ \ \ \ \ (6)

where the sum is over all tuples {\lambda^1,\dots,\lambda^k}, where each {\lambda^j} is a partition of length {j} that intersperses the next partition {\lambda^{j+1}}, with {\lambda^k} set equal to {\lambda}. We will call such a tuple an integral Gelfand-Tsetlin pattern based at {\lambda}.

One can generalise (6) by introducing the skew Schur functions

\displaystyle  s_{\lambda/\mu}(u) := \sum_{\mu = \lambda^i \prec \dots \prec \lambda^k = \lambda} \prod_{j=i+1}^k u_j^{|\lambda^j| - |\lambda^{j-1}|} \ \ \ \ \ (7)

for {u = (u_{i+1},\dots,u_k)}, whenever {\lambda} is a partition of length {k} and {\mu} a partition of length {i} for some {0 \leq i \leq k}, thus the Schur polynomial {s_\lambda} is also the skew Schur polynomial {s_{\lambda /()}} with {i=0}. (One could relabel the variables here to be something like {(u_1,\dots,u_{k-i})} instead, but this labeling seems slightly more natural, particularly in view of identities such as (8) below.)

By construction, we have the decomposition

\displaystyle  s_{\lambda/\nu}(u_{i+1},\dots,u_k) = \sum_\mu s_{\mu/\nu}(u_{i+1},\dots,u_j) s_{\lambda/\mu}(u_{j+1},\dots,u_k) \ \ \ \ \ (8)

whenever {0 \leq i \leq j \leq k}, and {\nu, \mu, \lambda} are partitions of lengths {i,j,k} respectively. This gives another recursive way to understand Schur polynomials and skew Schur polynomials. For instance, one can use it to establish the generalised Jacobi-Trudi identity

\displaystyle  s_{\lambda/\mu}(u) = \det( h_{\lambda_j - j - \mu_i + i}(u) )_{1 \leq i,j \leq k}, \ \ \ \ \ (9)

with the convention that {\mu_i = 0} for {i} larger than the length of {\mu}; we do this below the fold.

The Schur polynomials (and skew Schur polynomials) are “discretised” (or “quantised”) in the sense that their parameters {\lambda, \mu} are required to be integer-valued, and their definition similarly involves summation over a discrete set. It turns out that there are “continuous” (or “classical”) analogues of these functions, in which the parameters {\lambda,\mu} now take real values rather than integers, and are defined via integration rather than summation. One can view these continuous analogues as a “semiclassical limit” of their discrete counterparts, in a manner that can be made precise using the machinery of geometric quantisation, but we will not do so here.

The continuous analogues can be defined as follows. Define a real partition of length {k} to be a tuple {\lambda = (\lambda_1,\dots,\lambda_k)} where {\lambda_1 \geq \dots \geq \lambda_k \geq 0} are now real numbers. We can define the relation {\lambda' \prec \lambda} of interspersion between a length {k-1} real partition {\lambda' = (\lambda'_1,\dots,\lambda'_{k-1})} and a length {k} real partition {\lambda = (\lambda_1,\dots,\lambda_{k})} precisely as before, by requiring that the inequalities (1) hold for all {1 \leq i \leq k-1}. We can then define the continuous Schur functions {S_\lambda(x)} for {x = (x_1,\dots,x_k) \in {\bf R}^k} recursively by defining

\displaystyle  S_{()}() = 1

and

\displaystyle  S_\lambda(x) = \int_{\lambda' \prec \lambda} S_{\lambda'}(x') \exp( (|\lambda| - |\lambda'|) x_k ) \ \ \ \ \ (10)

for {k \geq 1} and {\lambda} of length {k}, where {x' := (x_1,\dots,x_{k-1})} and the integral is with respect to {k-1}-dimensional Lebesgue measure, and {|\lambda| = \lambda_1 + \dots + \lambda_k} as before. Thus for instance

\displaystyle  S_{(\lambda_1)}(x_1) = \exp( \lambda_1 x_1 )

and

\displaystyle  S_{(\lambda_1,\lambda_2)}(x_1,x_2) = \int_{\lambda_2}^{\lambda_1} \exp( \lambda'_1 x_1 + (\lambda_1+\lambda_2-\lambda'_1) x_2 )\ d\lambda'_1.

More generally, we can define the continuous skew Schur functions {S_{\lambda/\mu}(x)} for {\lambda} of length {k}, {\mu} of length {j \leq k}, and {x = (x_{j+1},\dots,x_k) \in {\bf R}^{k-j}} recursively by defining

\displaystyle  S_{\mu/\mu}() = 1

and

\displaystyle  S_{\lambda/\mu}(x) = \int_{\lambda' \prec \lambda} S_{\lambda'/\mu}(x') \exp( (|\lambda| - |\lambda'|) x_k )

for {k > j}. Thus for instance

\displaystyle  S_{(\lambda_1,\lambda_2,\lambda_3)/(\mu_1,\mu_2)}(x_3) = 1_{\lambda_3 \leq \mu_2 \leq \lambda_2 \leq \mu_1 \leq \lambda_1} \exp( x_3 (\lambda_1+\lambda_2+\lambda_3 - \mu_1 - \mu_2 ))

and

\displaystyle  S_{(\lambda_1,\lambda_2,\lambda_3)/(\mu_1)}(x_2, x_3) = \int_{\lambda_3 \leq \lambda'_2 \leq \lambda_2, \mu_1} \int_{\mu_1, \lambda_2 \leq \lambda'_1 \leq \lambda_1}

\displaystyle \exp( x_2 (\lambda'_1+\lambda'_2 - \mu_1) + x_3 (\lambda_1+\lambda_2+\lambda_3 - \lambda'_1 - \lambda'_2))\ d\lambda'_1 d\lambda'_2.

By expanding out the recursion, one obtains the analogue

\displaystyle  S_\lambda(x) = \int_{\lambda^1 \prec \dots \prec \lambda^k = \lambda} \exp( \sum_{j=1}^k x_j (|\lambda^j| - |\lambda^{j-1}|))\ d\lambda^1 \dots d\lambda^{k-1},

of (6), and more generally one has

\displaystyle  S_{\lambda/\mu}(x) = \int_{\mu = \lambda^i \prec \dots \prec \lambda^k = \lambda} \exp( \sum_{j=i+1}^k x_j (|\lambda^j| - |\lambda^{j-1}|))\ d\lambda^{i+1} \dots d\lambda^{k-1}.

We will call the tuples {(\lambda^1,\dots,\lambda^k)} in the first integral real Gelfand-Tsetlin patterns based at {\lambda}. The analogue of (8) is then

\displaystyle  S_{\lambda/\nu}(x_{i+1},\dots,x_k) = \int S_{\mu/\nu}(x_{i+1},\dots,x_j) S_{\lambda/\mu}(x_{j+1},\dots,x_k)\ d\mu

where the integral is over all real partitions {\mu} of length {j}, with Lebesgue measure.

By approximating various integrals by their Riemann sums, one can relate the continuous Schur functions to their discrete counterparts by the limiting formula

\displaystyle  N^{-k(k-1)/2} s_{\lfloor N \lambda \rfloor}( \exp[ x/N ] ) \rightarrow S_\lambda(x) \ \ \ \ \ (11)

as {N \rightarrow \infty} for any length {k} real partition {\lambda = (\lambda_1,\dots,\lambda_k)} and any {x = (x_1,\dots,x_k) \in {\bf R}^k}, where

\displaystyle  \lfloor N \lambda \rfloor := ( \lfloor N \lambda_1 \rfloor, \dots, \lfloor N \lambda_k \rfloor )

and

\displaystyle  \exp[x/N] := (\exp(x_1/N), \dots, \exp(x_k/N)).

More generally, one has

\displaystyle  N^{j(j+1)/2-k(k-1)/2} s_{\lfloor N \lambda \rfloor / \lfloor N \mu \rfloor}( \exp[ x/N ] ) \rightarrow S_{\lambda/\mu}(x)

as {N \rightarrow \infty} for any length {k} real partition {\lambda}, any length {j} real partition {\mu} with {0 \leq j \leq k}, and any {x = (x_{j+1},\dots,x_k) \in {\bf R}^{k-j}}.
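The limiting formula (11) can be observed numerically in the simplest non-trivial case {k=2}, where the discrete Schur polynomial is a single sum and the continuous Schur function is a single integral that can be evaluated in closed form (for {x_1 \neq x_2}). The following Python sketch is an illustration only; the function names, the sample values of {\lambda} and {x}, and the choice {N = 10000} are assumptions.

```python
import math

def schur_two_vars(a, b, u1, u2):
    """s_{(a,b)}(u1, u2) = sum_{c=b}^{a} u1^c u2^{a+b-c}, from the recursion (2)."""
    return sum(u1 ** c * u2 ** (a + b - c) for c in range(b, a + 1))

def continuous_schur_two_vars(l1, l2, x1, x2):
    """Closed form of the integral defining S_{(l1,l2)}(x1, x2), valid for x1 != x2."""
    return (math.exp(l1 * x1 + l2 * x2) - math.exp(l2 * x1 + l1 * x2)) / (x1 - x2)

def rescaled_discrete(l1, l2, x1, x2, N):
    """Left-hand side of (11) for k = 2, where N^{-k(k-1)/2} = 1/N."""
    a, b = math.floor(N * l1), math.floor(N * l2)
    return schur_two_vars(a, b, math.exp(x1 / N), math.exp(x2 / N)) / N

lam, x = (2.0, 1.0), (0.5, -0.3)
approx = rescaled_discrete(*lam, *x, N=10000)
exact = continuous_schur_two_vars(*lam, *x)
print(approx, exact)
```

The discrete side is just a Riemann sum for the integral, so the discrepancy should decay like {1/N}.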

As a consequence of these limiting formulae, one expects all of the discrete identities above to have continuous counterparts. This is indeed the case; below the fold we shall prove the discrete and continuous identities in parallel. These are not new results by any means, but I was not able to locate a good place in the literature where they are explicitly written down, so I thought I would try to do so here (primarily for my own internal reference, but perhaps the calculations will be worthwhile to some others also).


The determinant {\det_n(A)} of an {n \times n} matrix (with coefficients in an arbitrary field) obeys many useful identities, starting of course with the fundamental multiplicativity {\det_n(AB) = \det_n(A) \det_n(B)} for {n \times n} matrices {A,B}. This multiplicativity can in turn be used to establish many further identities; in particular, as shown in this previous post, it implies the Schur determinant identity

\displaystyle  \det_{n+k}\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det_n(A) \det_k( D - C A^{-1} B ) \ \ \ \ \ (1)

whenever {A} is an invertible {n \times n} matrix, {B} is an {n \times k} matrix, {C} is a {k \times n} matrix, and {D} is a {k \times k} matrix. The matrix {D - CA^{-1} B} is known as the Schur complement of the block {A}.
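For readers who like to see such identities on a concrete example, here is a small Python sketch checking (1) in exact rational arithmetic. The helper names and the particular blocks (with {n=2} and {k=1}) are illustrative assumptions.

```python
from fractions import Fraction

def minor(m, i, j):
    """Delete row i and column j (0-based) from the matrix m."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(m) if r != i]

def det(m):
    """Determinant by Laplace expansion along the first row."""
    if not m:
        return Fraction(1)
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def inverse(m):
    """Exact inverse via the adjugate formula."""
    d, n = det(m), len(m)
    return [[(-1) ** (i + j) * det(minor(m, j, i)) / d for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def frac(rows):
    return [[Fraction(x) for x in row] for row in rows]

A = frac([[2, 1], [1, 3]])
B = frac([[1], [0]])
C = frac([[0, 1]])
D = frac([[4]])
M = [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]

# Schur complement D - C A^{-1} B of the block A.
CAiB = matmul(C, matmul(inverse(A), B))
schur_complement = [[d - x for d, x in zip(rd, rx)] for rd, rx in zip(D, CAiB)]

lhs = det(M)
rhs = det(A) * det(schur_complement)
print(lhs, rhs)
```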

I only recently discovered that this identity in turn immediately implies what I always found to be a somewhat curious identity, namely the Dodgson condensation identity (also known as the Desnanot-Jacobi identity)

\displaystyle  \det_n(M) \det_{n-2}(M^{1,n}_{1,n}) = \det_{n-1}( M^1_1 ) \det_{n-1}(M^n_n)

\displaystyle - \det_{n-1}(M^1_n) \det_{n-1}(M^n_1)

for any {n \geq 3} and {n \times n} matrix {M}, where {M^i_j} denotes the {n-1 \times n-1} matrix formed from {M} by removing the {i^{th}} row and {j^{th}} column, and similarly {M^{i,i'}_{j,j'}} denotes the {n-2 \times n-2} matrix formed from {M} by removing the {i^{th}} and {(i')^{th}} rows and {j^{th}} and {(j')^{th}} columns. Thus for instance when {n=3} we obtain

\displaystyle  \det_3 \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} \cdot e

\displaystyle  = \det_2 \begin{pmatrix} e & f \\ h & i \end{pmatrix} \cdot \det_2 \begin{pmatrix} a & b \\ d & e \end{pmatrix}

\displaystyle  - \det_2 \begin{pmatrix} b & c \\ e & f \end{pmatrix} \cdot \det_2 \begin{pmatrix} d & e \\ g & h \end{pmatrix}

for any scalars {a,b,c,d,e,f,g,h,i}. (Charles Dodgson, better known by his pen name Lewis Carroll, is of course also known for writing “Alice in Wonderland” and “Through the Looking Glass”.)
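The Dodgson condensation identity is easy to test numerically. The following is a minimal sketch in Python with integer matrices (so the arithmetic is exact); the helpers `det`, `remove`, `dodgson_sides` and the matrices `M3`, `M4` are hypothetical names chosen for illustration, and rows and columns are indexed from {0} rather than {1}.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def remove(m, rows, cols):
    """Delete the given (0-based) rows and columns of m."""
    return [[x for j, x in enumerate(row) if j not in cols]
            for i, row in enumerate(m) if i not in rows]

def dodgson_sides(M):
    """Left and right sides of the condensation identity for the first and last indices."""
    last = len(M) - 1
    lhs = det(M) * det(remove(M, {0, last}, {0, last}))
    rhs = (det(remove(M, {0}, {0})) * det(remove(M, {last}, {last}))
           - det(remove(M, {0}, {last})) * det(remove(M, {last}, {0})))
    return lhs, rhs

M3 = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
M4 = [[2, 1, 0, 3], [1, 3, 1, 0], [0, 2, 4, 1], [5, 0, 1, 2]]
for M in (M3, M4):
    lhs, rhs = dodgson_sides(M)
    print(lhs == rhs)
```

For `M3` both sides equal {-15}: the determinant is {-3}, the central entry is {5}, and the right-hand side is {2 \cdot (-3) - (-3)\cdot(-3)}.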

The derivation is not new; it is for instance noted explicitly in this paper of Brualdi and Schneider, though I do not know if this is the earliest place in the literature where it can be found. (EDIT: Apoorva Khare has pointed out to me that the original arguments of Dodgson can be interpreted as implicitly following this derivation.) I thought it was worth presenting the short derivation here, though.

Firstly, by swapping the first and {(n-1)^{th}} rows, and similarly for the columns, it is easy to see that the Dodgson condensation identity is equivalent to the variant

\displaystyle  \det_n(M) \det_{n-2}(M^{n-1,n}_{n-1,n}) = \det_{n-1}( M^{n-1}_{n-1} ) \det_{n-1}(M^n_n) \ \ \ \ \ (2)

\displaystyle  - \det_{n-1}(M^{n-1}_n) \det_{n-1}(M^n_{n-1}).

Now write

\displaystyle  M = \begin{pmatrix} A & B_1 & B_2 \\ C_1 & d_{11} & d_{12} \\ C_2 & d_{21} & d_{22} \end{pmatrix}

where {A} is an {n-2 \times n-2} matrix, {B_1, B_2} are {n-2 \times 1} column vectors, {C_1, C_2} are {1 \times n-2} row vectors, and {d_{11}, d_{12}, d_{21}, d_{22}} are scalars. If {A} is invertible, we may apply the Schur determinant identity repeatedly to conclude that

\displaystyle  \det_n(M) = \det_{n-2}(A) \det_2 \begin{pmatrix} d_{11} - C_1 A^{-1} B_1 & d_{12} - C_1 A^{-1} B_2 \\ d_{21} - C_2 A^{-1} B_1 & d_{22} - C_2 A^{-1} B_2 \end{pmatrix}

\displaystyle  \det_{n-2} (M^{n-1,n}_{n-1,n}) = \det_{n-2}(A)

\displaystyle  \det_{n-1}( M^{n-1}_{n-1} ) = \det_{n-2}(A) (d_{22} - C_2 A^{-1} B_2 )

\displaystyle  \det_{n-1}( M^{n-1}_{n} ) = \det_{n-2}(A) (d_{21} - C_2 A^{-1} B_1 )

\displaystyle  \det_{n-1}( M^{n}_{n-1} ) = \det_{n-2}(A) (d_{12} - C_1 A^{-1} B_2 )

\displaystyle  \det_{n-1}( M^{n}_{n} ) = \det_{n-2}(A) (d_{11} - C_1 A^{-1} B_1 )

and the claim (2) then follows by a brief calculation (and the explicit form {\det_2 \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad-bc} of the {2 \times 2} determinant). To remove the requirement that {A} be invertible, one can use a limiting argument, noting that one can work without loss of generality in an algebraically closed field, and in such a field, the set of invertible matrices is dense in the Zariski topology. (In the case when the scalars are reals or complexes, one can just use density in the ordinary topology instead if desired.)

The same argument gives the more general determinant identity of Sylvester

\displaystyle  \det_n(M) \det_{n-k}(M^S_S)^{k-1} = \det_k \left( \det_{n-k+1}(M^{S \backslash \{i\}}_{S \backslash \{j\}}) \right)_{i,j \in S}

whenever {n > k \geq 1}, {S} is a {k}-element subset of {\{1,\dots,n\}}, and {M^S_{S'}} denotes the matrix formed from {M} by removing the rows associated to {S} and the columns associated to {S'}. (The Dodgson condensation identity is basically the {k=2} case of this identity.)

A closely related proof of (2) proceeds by elementary row and column operations. Observe that if one adds some multiple of one of the first {n-2} rows of {M} to one of the last two rows of {M}, then the left and right sides of (2) do not change. If the minor {A} is invertible, this allows one to reduce to the case where the components {C_1,C_2} of the matrix vanish. Similarly, using elementary column operations instead of row operations we may assume that {B_1,B_2} vanish. All matrices involved are now block-diagonal and the identity follows from a routine computation.

The latter approach can also prove the cute identity

\displaystyle  \det_2 \begin{pmatrix} \det_n( X_1, Y_1, A ) & \det_n( X_1, Y_2, A ) \\ \det_n(X_2, Y_1, A) & \det_n(X_2,Y_2, A) \end{pmatrix} = \det_n( X_1,X_2,A) \det_n(Y_1,Y_2,A)

for any {n \geq 2}, any {n \times 1} column vectors {X_1,X_2,Y_1,Y_2}, and any {n \times n-2} matrix {A}, which can for instance be found on page 7 of this text of Karlin. Observe that both sides of this identity are unchanged if one adds some multiple of any column of {A} to one of {X_1,X_2,Y_1,Y_2}; for generic {A}, this allows one to reduce {X_1,X_2,Y_1,Y_2} to have only the first two entries allowed to be non-zero, at which point the determinants split into {2 \times 2} and {n -2 \times n-2} determinants and we can reduce to the {n=2} case (eliminating the role of {A}). One can now either proceed by a direct computation, or by observing that the left-hand side is quartilinear in {X_1,X_2,Y_1,Y_2} and antisymmetric in {X_1,X_2} and {Y_1,Y_2} which forces it to be a scalar multiple of {\det_2(X_1,X_2) \det_2(Y_1,Y_2)}, at which point one can test the identity at a single point (e.g. {X_1=Y_1 = e_1} and {X_2=Y_2=e_2} for the standard basis {e_1,e_2}) to conclude the argument. (One can also derive this identity from the Sylvester determinant identity but I think the calculations are a little messier if one goes by that route. Conversely, one can recover the Dodgson condensation identity from Karlin’s identity by setting {X_1=e_1}, {X_2=e_2} (for instance) and then permuting some rows and columns.)
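Here is a small Python sketch that tests Karlin's identity on integer data, including the base case {n=2} at the standard basis mentioned above. The names `det_cols` and `karlin_sides`, and the sample vectors, are hypothetical; the matrix {A} is passed as a list of its {n-2} columns.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def det_cols(*cols):
    """Determinant of the square matrix whose columns are the given vectors."""
    return det([list(row) for row in zip(*cols)])

def karlin_sides(X1, X2, Y1, Y2, A):
    """Both sides of the identity; A is a list of n-2 column vectors of length n."""
    def d(u, v):
        return det_cols(u, v, *A)
    lhs = d(X1, Y1) * d(X2, Y2) - d(X1, Y2) * d(X2, Y1)
    rhs = d(X1, X2) * d(Y1, Y2)
    return lhs, rhs

# An n = 4 example with arbitrary integer entries.
X1, X2 = [1, 0, 2, 1], [0, 3, 1, 1]
Y1, Y2 = [2, 1, 0, 4], [1, 1, 1, 0]
A = [[1, 2, 0, 1], [0, 1, 3, 2]]
lhs, rhs = karlin_sides(X1, X2, Y1, Y2, A)
print(lhs == rhs)

# The n = 2 test point X1 = Y1 = e1, X2 = Y2 = e2 (no columns in A).
print(karlin_sides([1, 0], [0, 1], [1, 0], [0, 1], []))
```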

In July I will be spending a week at Park City, being one of the mini-course lecturers in the Graduate Summer School component of the Park City Summer Session on random matrices. I have chosen to give some lectures on least singular values of random matrices, the circular law, and the Lindeberg exchange method in random matrix theory; this is a slightly different set of topics than I had initially advertised (which was instead about the Lindeberg exchange method and the local relaxation flow method), but after consulting with the other mini-course lecturers I felt that this would be a more complementary set of topics. I have uploaded a draft of my lecture notes (some portion of which is derived from my monograph on the subject); as always, comments and corrections are welcome.

[Update, June 23: notes revised and reformatted to PCMI format. -T.]



Suppose {F: X \rightarrow Y} is a continuous (but nonlinear) map from one normed vector space {X} to another {Y}. The continuity means, roughly speaking, that if {x_0, x \in X} are such that {\|x-x_0\|_X} is small, then {\|F(x)-F(x_0)\|_Y} is also small (though the precise notion of “smallness” may depend on {x} or {x_0}, particularly if {F} is not known to be uniformly continuous). If {F} is known to be differentiable (in, say, the Fréchet sense), then we in fact have a linear bound of the form

\displaystyle  \|F(x)-F(x_0)\|_Y \leq C(x_0) \|x-x_0\|_X

for some {C(x_0)} depending on {x_0}, if {\|x-x_0\|_X} is small enough; one can of course make {C(x_0)} independent of {x_0} (and drop the smallness condition) if {F} is known instead to be Lipschitz continuous.

In many applications in analysis, one would like more explicit and quantitative bounds that estimate quantities like {\|F(x)-F(x_0)\|_Y} in terms of quantities like {\|x-x_0\|_X}. There are a number of ways to do this. First of all, there is of course the trivial estimate arising from the triangle inequality:

\displaystyle  \|F(x)-F(x_0)\|_Y \leq \|F(x)\|_Y + \|F(x_0)\|_Y. \ \ \ \ \ (1)

This estimate is usually not very good when {x} and {x_0} are close together. However, when {x} and {x_0} are far apart, this estimate can be more or less sharp. For instance, if the magnitude of {F} varies so much from {x_0} to {x} that {\|F(x)\|_Y} is more than (say) twice that of {\|F(x_0)\|_Y}, or vice versa, then (1) is sharp up to a multiplicative constant. Also, if {F} is oscillatory in nature, and the distance between {x} and {x_0} exceeds the “wavelength” of the oscillation of {F} at {x_0} (or at {x}), then one also typically expects (1) to be close to sharp. Conversely, if {F} does not vary much in magnitude from {x_0} to {x}, and the distance between {x} and {x_0} is less than the wavelength of any oscillation present in {F}, one expects to be able to improve upon (1).

When {F} is relatively simple in form, one can sometimes proceed simply by substituting {x = x_0 + h}. For instance, if {F: R \rightarrow R} is the squaring function {F(x) = x^2} in a commutative ring {R}, one has

\displaystyle  F(x_0+h) = (x_0+h)^2 = x_0^2 + 2x_0 h+ h^2

and thus

\displaystyle  F(x_0+h) - F(x_0) = 2x_0 h + h^2

or in terms of the original variables {x, x_0} one has

\displaystyle  F(x) - F(x_0) = 2 x_0 (x-x_0) + (x-x_0)^2.

If the ring {R} is not commutative, one has to modify this to

\displaystyle  F(x) - F(x_0) = x_0 (x-x_0) + (x-x_0) x_0 + (x-x_0)^2.

Thus, for instance, if {A, B} are {n \times n} matrices and {\| \|_{op}} denotes the operator norm, one sees from the triangle inequality and the sub-multiplicativity {\| AB\|_{op} \leq \| A \|_{op} \|B\|_{op}} of operator norm that

\displaystyle  \| A^2 - B^2 \|_{op} \leq \| A - B \|_{op} ( 2 \|B\|_{op} + \|A - B \|_{op} ). \ \ \ \ \ (2)

If {F(x)} involves {x} (or various components of {x}) in several places, one can sometimes get a good estimate by “swapping” {x} with {x_0} at each of the places in turn, using a telescoping series. For instance, if we again use the squaring function {F(x) = x^2 = x x} in a non-commutative ring, we have

\displaystyle  F(x) - F(x_0) = x x - x_0 x_0

\displaystyle  = (x x - x_0 x) + (x_0 x - x_0 x_0)

\displaystyle  = (x-x_0) x + x_0 (x-x_0)

which for instance leads to a slight improvement of (2):

\displaystyle  \| A^2 - B^2 \|_{op} \leq \| A - B \|_{op} ( \| A\|_{op} + \|B\|_{op} ).

More generally, for any natural number {n}, one has the identity

\displaystyle  x^n - x_0^n = (x-x_0) (x^{n-1} + x^{n-2} x_0 + \dots + x x_0^{n-2} + x_0^{n-1}) \ \ \ \ \ (3)

in a commutative ring, while in a non-commutative ring one must modify this to

\displaystyle  x^n - x_0^n = \sum_{i=0}^{n-1} x_0^i (x-x_0) x^{n-1-i},

and for matrices one has

\displaystyle  \| A^n - B^n \|_{op} \leq \| A-B\|_{op} ( \|A\|_{op}^{n-1} + \| A\|_{op}^{n-2} \| B\|_{op} + \dots + \|B\|_{op}^{n-1} ).
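To see the non-commutative telescoping identity in action, here is a minimal Python sketch with {2 \times 2} integer matrices. The helpers `matmul`, `matpow`, `add`, `sub`, `telescoped` and the particular matrices `x`, `x0` are hypothetical illustrative choices.

```python
def matmul(p, q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]

def matpow(p, k):
    """k-th power of a square matrix, with the zeroth power the identity."""
    n = len(p)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, p)
    return result

def add(p, q):
    return [[a + b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]

def sub(p, q):
    return [[a - b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]

def telescoped(x, x0, n):
    """The sum of x0^i (x - x0) x^(n-1-i) over i = 0, ..., n-1."""
    total = [[0] * len(x) for _ in x]
    for i in range(n):
        term = matmul(matmul(matpow(x0, i), sub(x, x0)), matpow(x, n - 1 - i))
        total = add(total, term)
    return total

# Two matrices that do not commute.
x = [[1, 1], [0, 1]]
x0 = [[1, 0], [1, 1]]
n = 5
print(telescoped(x, x0, n) == sub(matpow(x, n), matpow(x0, n)))
```

For this pair the commutative factorisation (3) already fails at {n=2}, since {(x-x_0)(x+x_0) - (x^2 - x_0^2) = x x_0 - x_0 x} is non-zero here.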

Exercise 1 If {U} and {V} are unitary {n \times n} matrices, show that the commutator {[U,V] := U V U^{-1} V^{-1}} obeys the inequality

\displaystyle  \| [U,V] - I \|_{op} \leq 2 \| U - I \|_{op} \| V - I \|_{op}.

(Hint: first control {\| UV - VU \|_{op}}.)

Now suppose (for simplicity) that {F: {\bf R}^d \rightarrow {\bf R}^{d'}} is a map between Euclidean spaces. If {F} is continuously differentiable, then one can use the fundamental theorem of calculus to write

\displaystyle  F(x) - F(x_0) = \int_0^1 \frac{d}{dt} F( \gamma(t) )\ dt

where {\gamma: [0,1] \rightarrow {\bf R}^d} is any continuously differentiable path from {x_0} to {x}. For instance, if one uses the straight line path {\gamma(t) := (1-t) x_0 + tx}, one has

\displaystyle  F(x) - F(x_0) = \int_0^1 ((x-x_0) \cdot \nabla F)( (1-t) x_0 + t x )\ dt.

In the one-dimensional case {d=1}, this simplifies to

\displaystyle  F(x) - F(x_0) = (x-x_0) \int_0^1 F'( (1-t) x_0 + t x )\ dt. \ \ \ \ \ (4)

Among other things, this immediately implies the factor theorem for {C^k} functions: if {F} is a {C^k({\bf R})} function for some {k \geq 1} that vanishes at some point {x_0}, then {F(x)} factors as the product of {x-x_0} and some {C^{k-1}} function {G}. Another basic consequence is that if {\nabla F} is uniformly bounded in magnitude by some constant {C}, then {F} is Lipschitz continuous with the same constant {C}.

Applying (4) to the power function {x \mapsto x^n}, we obtain the identity

\displaystyle  x^n - x_0^n = n (x-x_0) \int_0^1 ((1-t) x_0 + t x)^{n-1}\ dt \ \ \ \ \ (5)

which can be compared with (3). Indeed, for {x_0} and {x} close to {1}, one can use logarithms and Taylor expansion to arrive at the approximation {((1-t) x_0 + t x)^{n-1} \approx x_0^{(1-t) (n-1)} x^{t(n-1)}}, so (3) behaves a little like a Riemann sum approximation to (5).
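Both (3) and (5) can be checked exactly for rational inputs. The sketch below is a hedged illustration: the function names `sum_side` and `integral_side` are hypothetical, and the integral in (5) is evaluated by expanding the integrand as a polynomial in {t} with the binomial theorem and integrating term by term, rather than by numerical quadrature.

```python
from fractions import Fraction
from math import comb

def sum_side(x, x0, n):
    """Right-hand side of (3)."""
    return (x - x0) * sum(x ** (n - 1 - i) * x0 ** i for i in range(n))

def integral_side(x, x0, n):
    """Right-hand side of (5), integrating the polynomial in t term by term."""
    h = x - x0
    # ((1-t) x0 + t x)^(n-1) = sum_k C(n-1, k) x0^(n-1-k) h^k t^k,
    # and the integral of t^k over [0, 1] is 1 / (k + 1).
    integral = sum(Fraction(comb(n - 1, k)) * x0 ** (n - 1 - k) * h ** k / (k + 1)
                   for k in range(n))
    return n * h * integral

x, x0, n = Fraction(3, 2), Fraction(-2, 5), 6
print(sum_side(x, x0, n) == integral_side(x, x0, n) == x ** n - x0 ** n)
```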

Exercise 2 For each {i=1,\dots,n}, let {X^{(1)}_i} and {X^{(0)}_i} be random variables taking values in a measurable space {R_i}, and let {F: R_1 \times \dots \times R_n \rightarrow {\bf R}^m} be a bounded measurable function.

  • (i) (Lindeberg exchange identity) Show that

    \displaystyle  \mathop{\bf E} F(X^{(1)}_1,\dots,X^{(1)}_n) - \mathop{\bf E} F(X^{(0)}_1,\dots,X^{(0)}_n)

    \displaystyle = \sum_{i=1}^n \mathop{\bf E} F( X^{(1)}_1,\dots, X^{(1)}_{i-1}, X^{(1)}_i, X^{(0)}_{i+1}, \dots, X^{(0)}_n)

    \displaystyle - \mathop{\bf E} F( X^{(1)}_1,\dots, X^{(1)}_{i-1}, X^{(0)}_i, X^{(0)}_{i+1}, \dots, X^{(0)}_n).

  • (ii) (Knowles-Yin exchange identity) Show that

    \displaystyle  \mathop{\bf E} F(X^{(1)}_1,\dots,X^{(1)}_n) - \mathop{\bf E} F(X^{(0)}_1,\dots,X^{(0)}_n)

    \displaystyle = \int_0^1 \sum_{i=1}^n \mathop{\bf E} F( X^{(t)}_1,\dots, X^{(t)}_{i-1}, X^{(1)}_i, X^{(t)}_{i+1}, \dots, X^{(t)}_n)

    \displaystyle - \mathop{\bf E} F( X^{(t)}_1,\dots, X^{(t)}_{i-1}, X^{(0)}_i, X^{(t)}_{i+1}, \dots, X^{(t)}_n)\ dt,

    where {X^{(t)}_i = 1_{I_i \leq t} X^{(0)}_i + 1_{I_i > t} X^{(1)}_i} is a mixture of {X^{(0)}_i} and {X^{(1)}_i}, with {I_1,\dots,I_n} uniformly drawn from {[0,1]} independently of each other and of the {X^{(0)}_1,\dots,X^{(0)}_n, X^{(1)}_1,\dots,X^{(1)}_n}.

  • (iii) Discuss the relationship between the identities in parts (i), (ii) with the identities (3), (5).

(The identity in (i) is the starting point for the Lindeberg exchange method in probability theory, discussed for instance in this previous post. The identity in (ii) can also be used in the Lindeberg exchange method; the terms in the right-hand side are slightly more symmetric in the indices {1,\dots,n}, which can be a technical advantage in some applications; see this paper of Knowles and Yin for an instance of this.)
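To make the hybrid vectors in part (i) concrete, here is a toy Python sketch that evaluates both sides of the Lindeberg exchange identity exactly. Everything specific in it is an assumption made for illustration: the variables are taken to be jointly independent with the finite laws `law0` and `law1`, the test function `F` is an arbitrary bounded function of the sum, and `expect`, `hybrid` are hypothetical helper names. It illustrates the bookkeeping only and is not a solution to the exercise.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite laws (value -> probability) for X_i^(0) and X_i^(1).
law0 = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
law1 = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}

def F(values):
    """An arbitrary bounded test function of the coordinates."""
    s = sum(values)
    return Fraction(1, 1 + s * s)

def expect(laws):
    """Exact expectation of F when the coordinates are independent with the given laws."""
    total = Fraction(0)
    for outcome in product(*(law.items() for law in laws)):
        prob = Fraction(1)
        for _, p in outcome:
            prob *= p
        total += prob * F([v for v, _ in outcome])
    return total

def hybrid(n, i):
    """Laws of the vector whose first i coordinates are X^(1) and the rest X^(0)."""
    return [law1] * i + [law0] * (n - i)

n = 4
lhs = expect(hybrid(n, n)) - expect(hybrid(n, 0))
rhs = sum(expect(hybrid(n, i)) - expect(hybrid(n, i - 1)) for i in range(1, n + 1))
print(lhs == rhs)
```

In applications the point is that each summand on the right only involves swapping a single coordinate, which is where moment-matching hypotheses can be brought to bear.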

Exercise 3 If {F: {\bf R}^d \rightarrow {\bf R}^{d'}} is continuously {k} times differentiable, establish Taylor’s theorem with remainder

\displaystyle  F(x) = \sum_{j=0}^{k-1} \frac{1}{j!} (((x-x_0) \cdot \nabla)^j F)( x_0 )

\displaystyle + \int_0^1 \frac{(1-t)^{k-1}}{(k-1)!} (((x-x_0) \cdot \nabla)^k F)((1-t) x_0 + t x)\ dt.

If {\nabla^k F} is bounded, conclude that

\displaystyle  |F(x) - \sum_{j=0}^{k-1} \frac{1}{j!} (((x-x_0) \cdot \nabla)^j F)( x_0 )|

\displaystyle \leq \frac{|x-x_0|^k}{k!} \sup_{y \in {\bf R}^d} |\nabla^k F(y)|.

For real scalar functions {F: {\bf R}^d \rightarrow {\bf R}}, the average value of the continuous real-valued function {(x - x_0) \cdot \nabla F((1-t) x_0 + t x)} must be attained at some point {t} in the interval {[0,1]}. We thus conclude the mean-value theorem

\displaystyle  F(x) - F(x_0) = ((x - x_0) \cdot \nabla F)((1-t) x_0 + t x)

for some {t \in [0,1]} (that can depend on {x}, {x_0}, and {F}). This can for instance give a second proof of the fact that continuously differentiable functions {F} with bounded derivative are Lipschitz continuous. However it is worth stressing that the mean-value theorem is only available for real scalar functions; it is false for instance for complex scalar functions. A basic counterexample is given by the function {e(x) := e^{2\pi i x}}; there is no {t \in [0,1]} for which {e(1) - e(0) = e'(t)}. On the other hand, as {e'} has magnitude {2\pi}, we still know from (4) that {e} is Lipschitz of constant {2\pi}, and when combined with (1) we obtain the basic bounds

\displaystyle  |e(x) - e(y)| \leq \min( 2, 2\pi |x-y| )

which are already very useful for many applications.
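A minimal floating-point sketch of these two facts in Python follows; the function names `e` and `gap_and_bound`, the sample points, and the tolerance `1e-12` (to absorb rounding error) are illustrative assumptions.

```python
import cmath
import math

def e(x):
    """The function e(x) = exp(2 pi i x)."""
    return cmath.exp(2j * math.pi * x)

def gap_and_bound(x, y):
    """Return |e(x) - e(y)| together with the bound min(2, 2 pi |x - y|)."""
    return abs(e(x) - e(y)), min(2.0, 2 * math.pi * abs(x - y))

for x, y in [(0.0, 0.01), (0.3, 0.35), (0.0, 0.5), (1.2, 4.7)]:
    gap, bound = gap_and_bound(x, y)
    print(x, y, gap <= bound + 1e-12)

# The mean-value theorem fails for e: e(1) - e(0) vanishes (up to rounding),
# whereas e'(t) = 2 pi i e(t) has magnitude 2 pi for every t.
print(abs(e(1.0) - e(0.0)) < 1e-12)
```

The pairs {(0, 0.5)} and {(1.2, 4.7)} are cases where the trivial bound {2} from (1) is the one that is attained, while for nearby points the Lipschitz bound is the better one.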

Exercise 4 Let {H_0, V} be {n \times n} matrices, and let {t} be a non-negative real.

  • (i) Establish the Duhamel formula

    \displaystyle  e^{t(H_0+V)} = e^{tH_0} + \int_0^t e^{(t-s) H_0} V e^{s (H_0+V)}\ ds

    \displaystyle  = e^{tH_0} + \int_0^t e^{(t-s) (H_0+V)} V e^{s H_0}\ ds

    where {e^A} denotes the matrix exponential of {A}. (Hint: Differentiate {e^{(t-s) H_0} e^{s (H_0+V)}} or {e^{(t-s) (H_0+V)} e^{s H_0}} in {s}.)

  • (ii) Establish the iterated Duhamel formula

    \displaystyle  e^{t(H_0+V)} = e^{tH_0} + \sum_{j=1}^k \int_{0 \leq t_1 \leq \dots \leq t_j \leq t}

    \displaystyle e^{(t-t_j) H_0} V e^{(t_j-t_{j-1}) H_0} V \dots e^{(t_2-t_1) H_0} V e^{t_1 H_0}\ dt_1 \dots dt_j

    \displaystyle  + \int_{0 \leq t_1 \leq \dots \leq t_{k+1} \leq t}

    \displaystyle  e^{(t-t_{k+1}) (H_0+V)} V e^{(t_{k+1}-t_k) H_0} V \dots e^{(t_2-t_1) H_0} V e^{t_1 H_0}\ dt_1 \dots dt_{k+1}

    for any {k \geq 0}.

  • (iii) Establish the infinitely iterated Duhamel formula

    \displaystyle  e^{t(H_0+V)} = e^{tH_0} + \sum_{j=1}^\infty \int_{0 \leq t_1 \leq \dots \leq t_j \leq t}

    \displaystyle e^{(t-t_j) H_0} V e^{(t_j-t_{j-1}) H_0} V \dots e^{(t_2-t_1) H_0} V e^{t_1 H_0}\ dt_1 \dots dt_j.

  • (iv) If {H(t)} is an {n \times n} matrix depending in a continuously differentiable fashion on {t}, establish the variation formula

    \displaystyle  \frac{d}{dt} e^{H(t)} = (F(\mathrm{ad}(H(t))) H'(t)) e^{H(t)}

    where {\mathrm{ad}(H)} is the adjoint representation {\mathrm{ad}(H)(V) = HV - VH} applied to {H}, and {F} is the function

    \displaystyle  F(z) := \int_0^1 e^{sz}\ ds

    (thus {F(z) = \frac{e^z-1}{z}} for non-zero {z}), with {F(\mathrm{ad}(H(t)))} defined using functional calculus.

We remark that further manipulation of (iv) of the above exercise using the fundamental theorem of calculus eventually leads to the Baker-Campbell-Hausdorff-Dynkin formula, as discussed in this previous blog post.

Exercise 5 Let {A, B} be positive definite {n \times n} matrices, and let {Y} be an {n \times n} matrix. Show that there is a unique solution {X} to the Sylvester equation

\displaystyle  AX + X B = Y

which is given by the formula

\displaystyle  X = \int_0^\infty e^{-tA} Y e^{-tB}\ dt.

In the above examples we had applied the fundamental theorem of calculus along linear curves {\gamma(t) = (1-t) x_0 + t x}. However, it is sometimes better to use other curves. For instance, the circular arc {\gamma(t) = \cos(\pi t/2) x_0 + \sin(\pi t/2) x} can be useful, particularly if {x_0} and {x} are “orthogonal” or “independent” in some sense; a good example of this is the proof by Maurey and Pisier of the gaussian concentration inequality, given in Theorem 8 of this previous blog post. In a similar vein, if one wishes to compare a scalar random variable {X} of mean zero and variance one with a Gaussian random variable {G} of mean zero and variance one, it can be useful to introduce the intermediate random variables {\gamma(t) := (1-t)^{1/2} X + t^{1/2} G} (where {X} and {G} are independent); note that these variables have mean zero and variance one, and after coupling them together appropriately they evolve by the Ornstein-Uhlenbeck process, which has many useful properties. For instance, one can use these ideas to establish monotonicity formulae for entropy; see e.g. this paper of Courtade for an example of this and further references. More generally, one can exploit curves {\gamma} that flow according to some geometrically natural ODE or PDE; several examples of this occur famously in Perelman’s proof of the Poincaré conjecture via Ricci flow, discussed for instance in this previous set of lecture notes.

In some cases, it is difficult to compute {F(x)-F(x_0)} or the derivative {\nabla F} directly, but one can instead proceed by implicit differentiation, or some variant thereof. Consider for instance the matrix inversion map {F(A) := A^{-1}} (defined on the open dense subset of {n \times n} matrices consisting of invertible matrices). If one wants to compute {F(B)-F(A)} for {B} close to {A}, one can temporarily write {F(B) - F(A) = E}, thus

\displaystyle  B^{-1} - A^{-1} = E.

Multiplying both sides on the left by {B} to eliminate the {B^{-1}} term, and on the right by {A} to eliminate the {A^{-1}} term, one obtains

\displaystyle  A - B = B E A

and thus on reversing these steps we arrive at the basic identity

\displaystyle  B^{-1} - A^{-1} = B^{-1} (A - B) A^{-1}. \ \ \ \ \ (6)

For instance, if {H_0, V} are {n \times n} matrices, and we consider the resolvents

\displaystyle  R_0(z) := (H_0 - z I)^{-1}; \quad R_V(z) := (H_0 + V - zI)^{-1}

then we have the resolvent identity

\displaystyle  R_V(z) - R_0(z) = - R_V(z) V R_0(z) \ \ \ \ \ (7)

as long as {z} does not lie in the spectrum of {H_0} or {H_0+V} (for instance, if {H_0}, {V} are self-adjoint then one can take {z} to be any strictly complex number). One can iterate this identity to obtain

\displaystyle  R_V(z) = \sum_{j=0}^k (-R_0(z) V)^j R_0(z) + (-R_V(z) V) (-R_0(z) V)^k R_0(z)

for any natural number {k}; in particular, if {R_0(z) V} has operator norm less than one, one has the Neumann series

\displaystyle  R_V(z) = \sum_{j=0}^\infty (-R_0(z) V)^j R_0(z).
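The resolvent identity (7) and its finite iterations are exact algebraic identities, so they can be checked in rational arithmetic. The following Python sketch does this for a {2 \times 2} example; the matrices `H0`, `V`, the real value `z = 10` (chosen only because it lies outside both spectra) and all helper names are hypothetical illustrative choices.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def inverse(m):
    """Exact inverse via the adjugate formula; raises if the matrix is singular."""
    d = Fraction(det(m))
    if d == 0:
        raise ValueError("matrix is not invertible")
    n = len(m)
    def cofactor(i, j):
        sub = [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]
        return (-1) ** (i + j) * det(sub)
    return [[cofactor(j, i) / d for j in range(n)] for i in range(n)]

def matmul(p, q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]

def add(p, q):
    return [[a + b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]

def scale(c, p):
    return [[c * a for a in row] for row in p]

def resolvent(H, z):
    """(H - z I)^{-1}."""
    n = len(H)
    return inverse([[H[i][j] - (z if i == j else 0) for j in range(n)] for i in range(n)])

H0 = [[2, 1], [1, 3]]
V = [[0, 1], [1, 0]]
z = 10  # assumed to lie outside the spectra of H0 and H0 + V
R0 = resolvent(H0, z)
RV = resolvent(add(H0, V), z)

def expansion(k):
    """Right-hand side of the k-fold iterated resolvent identity, remainder included."""
    T = scale(-1, matmul(R0, V))                      # -R_0(z) V
    power = [[int(i == j) for j in range(len(R0))] for i in range(len(R0))]
    total = matmul(power, R0)                         # j = 0 term
    for _ in range(k):
        power = matmul(power, T)                      # now T^j
        total = add(total, matmul(power, R0))
    remainder = matmul(matmul(scale(-1, matmul(RV, V)), power), R0)
    return add(total, remainder)

print(add(RV, scale(-1, R0)) == scale(-1, matmul(matmul(RV, V), R0)))  # identity (7)
print(all(expansion(k) == RV for k in range(5)))
```

The Neumann series corresponds to dropping the remainder term and letting {k \rightarrow \infty}, which is legitimate once {R_0(z) V} has operator norm less than one.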

Similarly, if {A(t)} is a family of invertible matrices that depends in a continuously differentiable fashion on a time variable {t}, then by implicitly differentiating the identity

\displaystyle  A(t) A(t)^{-1} = I

in {t} using the product rule, we obtain

\displaystyle  (\frac{d}{dt} A(t)) A(t)^{-1} + A(t) \frac{d}{dt} A(t)^{-1} = 0

and hence

\displaystyle  \frac{d}{dt} A(t)^{-1} = - A(t)^{-1} (\frac{d}{dt} A(t)) A(t)^{-1}

(this identity may also be easily derived from (6)). One can then use the fundamental theorem of calculus to obtain variants of (6), for instance by using the curve {\gamma(t) = (1-t) A + tB} we arrive at

\displaystyle  B^{-1} - A^{-1} = \int_0^1 ((1-t)A + tB)^{-1} (A-B) ((1-t)A + tB)^{-1}\ dt

assuming that the curve stays entirely within the set of invertible matrices. While this identity may seem more complicated than (6), it is more symmetric, which conveys some advantages. For instance, using this identity it is easy to see that if {A, B} are positive definite with {A>B} in the sense of positive definite matrices (that is, {A-B} is positive definite), then {B^{-1} > A^{-1}}. (Try to prove this using (6) instead!)

Exercise 6 If {A} is an invertible {n \times n} matrix and {u, v} are {n \times 1} vectors, establish the Sherman-Morrison formula

\displaystyle  (A + t uv^T)^{-1} = A^{-1} - \frac{t}{1 + t v^T A^{-1} u} A^{-1} uv^T A^{-1}

whenever {t} is a scalar such that {1 + t v^T A^{-1} u} is non-zero. (See also this previous blog post for more discussion of these sorts of identities.)
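Before attempting the exercise, one can test the Sherman-Morrison formula on a small example. The sketch below uses exact rational arithmetic; the function `sherman_morrison_sides`, the other helper names, and the sample data `A`, `u`, `v`, `t` are hypothetical choices made for illustration.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def inverse(m):
    """Exact inverse via the adjugate formula; raises if the matrix is singular."""
    d = Fraction(det(m))
    if d == 0:
        raise ValueError("matrix is not invertible")
    n = len(m)
    def cofactor(i, j):
        sub = [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]
        return (-1) ** (i + j) * det(sub)
    return [[cofactor(j, i) / d for j in range(n)] for i in range(n)]

def sherman_morrison_sides(A, u, v, t):
    """Return (A + t u v^T)^{-1} computed directly and via the Sherman-Morrison formula."""
    n = len(A)
    Ainv = inverse(A)
    Au = [sum(Ainv[i][k] * u[k] for k in range(n)) for i in range(n)]   # A^{-1} u
    vA = [sum(v[k] * Ainv[k][j] for k in range(n)) for j in range(n)]   # v^T A^{-1}
    denom = 1 + t * sum(v[k] * Au[k] for k in range(n))
    if denom == 0:
        raise ValueError("1 + t v^T A^{-1} u vanishes")
    rhs = [[Ainv[i][j] - t / denom * Au[i] * vA[j] for j in range(n)] for i in range(n)]
    lhs = inverse([[A[i][j] + t * u[i] * v[j] for j in range(n)] for i in range(n)])
    return lhs, rhs

A = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
u, v, t = [1, 2, 0], [0, 1, 1], Fraction(1, 2)
lhs, rhs = sherman_morrison_sides(A, u, v, t)
print(lhs == rhs)
```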

One can use the Cauchy integral formula to extend these identities to other functions of matrices. For instance, if {F: {\bf C} \rightarrow {\bf C}} is an entire function, and {\gamma} is a counterclockwise contour that goes around the spectrum of both {H_0} and {H_0+V}, then we have

\displaystyle  F(H_0+V) = \frac{-1}{2\pi i} \int_\gamma F(z) R_V(z)\ dz

and similarly

\displaystyle  F(H_0) = \frac{-1}{2\pi i} \int_\gamma F(z) R_0(z)\ dz

and hence by (7) one has

\displaystyle  F(H_0+V) - F(H_0) = \frac{1}{2\pi i} \int_\gamma F(z) R_V(z) V R_0(z)\ dz;

similarly, if {H(t)} depends on {t} in a continuously differentiable fashion, then

\displaystyle  \frac{d}{dt} F(H(t)) = \frac{1}{2\pi i} \int_\gamma F(z) (H(t) - zI)^{-1} H'(t) (H(t)-zI)^{-1}\ dz

as long as {\gamma} goes around the spectrum of {H(t)}.

Exercise 7 If {H(t)} is an {n \times n} matrix depending continuously differentiably on {t}, and {F: {\bf C} \rightarrow {\bf C}} is an entire function, establish the tracial chain rule

\displaystyle  \frac{d}{dt} \hbox{tr} F(H(t)) = \hbox{tr}(F'(H(t)) H'(t)).

In a similar vein, given that the logarithm function is the antiderivative of the reciprocal, one can express the matrix logarithm {\log A} of a positive definite matrix by the fundamental theorem of calculus identity

\displaystyle  \log A = \int_0^\infty (I + sI)^{-1} - (A + sI)^{-1}\ ds

(with the constant term {(I+sI)^{-1}} needed to prevent a logarithmic divergence in the integral). Differentiating, we see that if {A(t)} is a family of positive definite matrices depending continuously differentiably on {t}, then

\displaystyle  \frac{d}{dt} \log A(t) = \int_0^\infty (A(t) + sI)^{-1} A'(t) (A(t)+sI)^{-1}\ ds.

This can be used for instance to show that {\log} is a monotone increasing function, in the sense that {\log A> \log B} whenever {A > B > 0} in the sense of positive definite matrices. One can of course integrate this formula to obtain some formulae for the difference {\log A - \log B} of the logarithm of two positive definite matrices {A,B}.

Controlling the difference {A^{1/2} - B^{1/2}} of the square roots of two positive definite matrices {A,B} is trickier; there are multiple ways to proceed. One approach is to use contour integration as before (but one has to take some care to avoid branch cuts of the square root). Another is to express the square root in terms of exponentials via the formula

\displaystyle  A^{1/2} = \frac{1}{\Gamma(-1/2)} \int_0^\infty (e^{-tA} - I) t^{-1/2} \frac{dt}{t}

where {\Gamma} is the gamma function; this formula can be verified by first diagonalising {A} to reduce to the scalar case and using the definition of the Gamma function. Then one has

\displaystyle  A^{1/2} - B^{1/2} = \frac{1}{\Gamma(-1/2)} \int_0^\infty (e^{-tA} - e^{-tB}) t^{-1/2} \frac{dt}{t}

and one can use some of the previous identities to control {e^{-tA} - e^{-tB}}. This is pretty messy though. A third way to proceed is via implicit differentiation. If for instance {A(t)} is a family of positive definite matrices depending continuously differentiably on {t}, we can differentiate the identity

\displaystyle  A(t)^{1/2} A(t)^{1/2} = A(t)

to obtain

\displaystyle  A(t)^{1/2} \frac{d}{dt} A(t)^{1/2} + (\frac{d}{dt} A(t)^{1/2}) A(t)^{1/2} = \frac{d}{dt} A(t).

This can for instance be solved using Exercise 5 to obtain

\displaystyle  \frac{d}{dt} A(t)^{1/2} = \int_0^\infty e^{-sA(t)^{1/2}} A'(t) e^{-sA(t)^{1/2}}\ ds

and this can in turn be integrated to obtain a formula for {A^{1/2} - B^{1/2}}. This is again a rather messy formula, but it does at least demonstrate that the square root is a monotone increasing function on positive definite matrices: {A > B > 0} implies {A^{1/2} > B^{1/2} > 0}.

Several of the above identities for matrices can be (carefully) extended to operators on Hilbert spaces provided that they are sufficiently well behaved (in particular, if they have a good functional calculus, and if various spectral hypotheses are obeyed). We will not attempt to do so here, however.

I just learned (from Emmanuel Kowalski’s blog) that the AMS has just started a repository of open-access mathematics lecture notes.  There are only a few such sets of notes there at present, but hopefully it will grow in the future; I just submitted some old lecture notes of mine from an undergraduate linear algebra course I taught in 2002 (with some updating of format and fixing of various typos).


[Update, Dec 22: my own notes are now on the repository.]

By an odd coincidence, I stumbled upon a second question in as many weeks about power series, and once again the only way I know how to prove the result is by complex methods; once again, I am leaving it here as a challenge to any interested readers, and I would be particularly interested in knowing of a proof that was not based on complex analysis (or thinly disguised versions thereof), or for a reference to previous literature where something like this identity has occurred. (I suspect for instance that something like this may have shown up before in free probability, based on the answer to part (ii) of the problem.)

Here is a purely algebraic form of the problem:

Problem 1 Let {F = F(z)} be a formal function of one variable {z}. Suppose that {G = G(z)} is the formal function defined by

\displaystyle G := \sum_{n=1}^\infty \left( \frac{F^n}{n!} \right)^{(n-1)}

\displaystyle = F + \left(\frac{F^2}{2}\right)' + \left(\frac{F^3}{6}\right)'' + \dots

\displaystyle = F + FF' + (F (F')^2 + \frac{1}{2} F^2 F'') + \dots,

where we use {f^{(k)}} to denote the {k}-fold derivative of {f} with respect to the variable {z}.

  • (i) Show that {F} can be formally recovered from {G} by the formula

    \displaystyle F = \sum_{n=1}^\infty (-1)^{n-1} \left( \frac{G^n}{n!} \right)^{(n-1)}

    \displaystyle = G - \left(\frac{G^2}{2}\right)' + \left(\frac{G^3}{6}\right)'' - \dots

    \displaystyle = G - GG' + (G (G')^2 + \frac{1}{2} G^2 G'') - \dots.

  • (ii) There is a remarkable further formal identity relating {F(z)} with {G(z)} that does not explicitly involve any infinite summation. What is this identity?

To rigorously formulate part (i) of this problem, one could work in the commutative differential ring of formal infinite series generated by polynomial combinations of {F} and its derivatives (with no constant term). Part (ii) is a bit trickier to formulate in this abstract ring; the identity in question is easier to state if {F, G} are formal power series, or (even better) convergent power series, as it involves operations such as composition or inversion that can be more easily defined in those latter settings.

To illustrate Problem 1(i), let us compute up to third order in {F}, using {{\mathcal O}(F^4)} to denote any quantity involving four or more factors of {F} and its derivatives, and similarly for other exponents than {4}. Then we have

\displaystyle G = F + FF' + (F (F')^2 + \frac{1}{2} F^2 F'') + {\mathcal O}(F^4)

and hence

\displaystyle G' = F' + (F')^2 + FF'' + {\mathcal O}(F^3)

\displaystyle G'' = F'' + {\mathcal O}(F^2);

multiplying, we have

\displaystyle GG' = FF' + F (F')^2 + F^2 F'' + F (F')^2 + {\mathcal O}(F^4)


\displaystyle G (G')^2 + \frac{1}{2} G^2 G'' = F (F')^2 + \frac{1}{2} F^2 F'' + {\mathcal O}(F^4)

and hence after a lot of canceling

\displaystyle G - GG' + (G (G')^2 + \frac{1}{2} G^2 G'') = F + {\mathcal O}(F^4).

Thus Problem 1(i) holds up to errors of {{\mathcal O}(F^4)} at least. In principle one can continue verifying Problem 1(i) to increasingly high order in {F}, but the computations rapidly become quite lengthy, and I do not know of a direct way to ensure that one always obtains the required cancellation at the end of the computation.

Problem 1(i) can also be posed in formal power series: if

\displaystyle F(z) = a_1 z + a_2 z^2 + a_3 z^3 + \dots

is a formal power series with no constant term with complex coefficients {a_1, a_2, \dots} with {|a_1|<1}, then one can verify that the series

\displaystyle G := \sum_{n=1}^\infty \left( \frac{F^n}{n!} \right)^{(n-1)}

makes sense as a formal power series with no constant term, thus

\displaystyle G(z) = b_1 z + b_2 z^2 + b_3 z^3 + \dots.

For instance it is not difficult to show that {b_1 = \frac{a_1}{1-a_1}}. If one further has {|b_1| < 1}, then it turns out that

\displaystyle F = \sum_{n=1}^\infty (-1)^{n-1} \left( \frac{G^n}{n!} \right)^{(n-1)}

as formal power series. Currently the only way I know how to show this is by first proving the claim for power series with a positive radius of convergence using the Cauchy integral formula, but even this is a bit tricky unless one has managed to guess the identity in (ii) first. (In fact, the way I discovered this problem was by first trying to solve (a variant of) the identity in (ii) by Taylor expansion in the course of attacking another problem, and obtaining the transform in Problem 1 as a consequence.)
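One case in which Problem 1(i) can be tested by a finite computation is when {a_1 = 0}: then {F^n} vanishes to order {2n}, so {(F^n/n!)^{(n-1)}} vanishes to order {n+1} and only finitely many terms contribute to each coefficient of {G} (and similarly for the inverse transform, since then {b_1=0} as well). The Python sketch below checks the round trip up to {z^6} in exact arithmetic under this extra assumption; the names `mul`, `deriv`, `transform` and the sample series {F = z^2 + z^3} are hypothetical, and this is of course only a consistency check rather than a proof.

```python
from fractions import Fraction
from math import factorial

def mul(p, q):
    """Product of two polynomials given as coefficient lists (index = degree)."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def deriv(p, k):
    """k-fold derivative of a polynomial."""
    for _ in range(k):
        p = [i * c for i, c in enumerate(p)][1:]
    return p

def transform(f, order, sign):
    """Coefficients up to z^order of sum_n sign^(n-1) (f^n / n!)^(n-1).

    Assumes f has no z^0 or z^1 term, so that the n-th term starts at z^(n+1)
    and only n <= order - 1 can contribute.
    """
    total = [Fraction(0)] * (order + 1)
    power = [Fraction(1)]
    for n in range(1, order):
        power = mul(power, f)          # f^n, computed exactly
        term = deriv(power, n - 1)
        for m in range(min(order, len(term) - 1) + 1):
            total[m] += sign ** (n - 1) * term[m] / factorial(n)
    return total

order = 6
F = [0, 0, 1, 1]                       # F(z) = z^2 + z^3, so a_1 = 0
G = transform(F, order, +1)            # G truncated at z^6
F_back = transform(G, order, -1)       # inverse transform applied to G
print(G[:5])
print(F_back == F + [0] * (order + 1 - len(F)))
```

Here the first few coefficients of {G} work out to {G(z) = z^2 + 3z^3 + 10z^4 + \dots}, in agreement with a hand computation from the first three terms of the series.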

The transform that takes {F} to {G} resembles both the exponential function

\displaystyle \exp(F) = \sum_{n=0}^\infty \frac{F^n}{n!}

and Taylor’s formula

\displaystyle F(z) = \sum_{n=0}^\infty \frac{F^{(n)}(0)}{n!} z^n

but does not seem to be directly connected to either (this is more apparent once one knows the identity in (ii)).

Kronecker is famously reported to have said, “God created the natural numbers; all else is the work of man”. The truth of this statement (literal or otherwise) is debatable; but one can certainly view the other standard number systems {{\bf Z}, {\bf Q}, {\bf R}, {\bf C}} as (iterated) completions of the natural numbers {{\bf N}} in various senses. For instance:

  • The integers {{\bf Z}} are the additive completion of the natural numbers {{\bf N}} (the minimal additive group that contains a copy of {{\bf N}}).
  • The rationals {{\bf Q}} are the multiplicative completion of the integers {{\bf Z}} (the minimal field that contains a copy of {{\bf Z}}).
  • The reals {{\bf R}} are the metric completion of the rationals {{\bf Q}} (the minimal complete metric space that contains a copy of {{\bf Q}}).
  • The complex numbers {{\bf C}} are the algebraic completion of the reals {{\bf R}} (the minimal algebraically closed field that contains a copy of {{\bf R}}).

These descriptions of the standard number systems are elegant and conceptual, but not entirely suitable for constructing the number systems in a non-circular manner from more primitive foundations. For instance, one cannot quite define the reals {{\bf R}} from scratch as the metric completion of the rationals {{\bf Q}}, because the definition of a metric space itself requires the notion of the reals! (One can of course construct {{\bf R}} by other means, for instance by using Dedekind cuts or by using uniform spaces in place of metric spaces.) The definition of the complex numbers as the algebraic completion of the reals does not suffer from such a circularity issue, but a certain amount of field theory is required to work with this definition initially. For the purposes of quickly constructing the complex numbers, it is thus more traditional to first define {{\bf C}} as a quadratic extension of the reals {{\bf R}}, and more precisely as the extension {{\bf C} = {\bf R}(i)} formed by adjoining a square root {i} of {-1} to the reals, that is to say a solution to the equation {i^2+1=0}. It is not immediately obvious that this extension is in fact algebraically closed; this is the content of the famous fundamental theorem of algebra, which we will prove later in this course.
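To make the quadratic extension concrete, the following is a minimal sketch that models an element {a+bi} of {{\bf R}(i)} as a pair of floats, with multiplication determined solely by the relation {i^2 = -1}. The class name `RAdjoinI` is hypothetical, and floating-point numbers stand in for the reals here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RAdjoinI:
    """Element a + b*i of R(i), stored as the pair (a, b)."""
    a: float
    b: float

    def __add__(self, other):
        return RAdjoinI(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        # (a + bi)(c + di) = (ac - bd) + (ad + bc) i, using only i^2 = -1
        return RAdjoinI(self.a * other.a - self.b * other.b,
                        self.a * other.b + self.b * other.a)


i = RAdjoinI(0.0, 1.0)
one = RAdjoinI(1.0, 0.0)
print(i * i + one)  # the left-hand side of i^2 + 1 = 0
```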

The two equivalent definitions of {{\bf C}} – as the algebraic closure, and as a quadratic extension, of the reals respectively – each reveal important features of the complex numbers in applications. Because {{\bf C}} is algebraically closed, all polynomials over the complex numbers split completely, which leads to a good spectral theory for both finite-dimensional matrices and infinite-dimensional operators; in particular, one expects to be able to diagonalise most matrices and operators. Applying this theory to constant coefficient ordinary differential equations leads to a unified theory of such solutions, in which real-variable ODE behaviour such as exponential growth or decay, polynomial growth, and sinusoidal oscillation all become aspects of a single object, the complex exponential {z \mapsto e^z} (or more generally, the matrix exponential {A \mapsto \exp(A)}). Applying this theory more generally to diagonalise arbitrary translation-invariant operators over some locally compact abelian group, one arrives at Fourier analysis, which is thus most naturally phrased in terms of complex-valued functions rather than real-valued ones. If one drops the assumption that the underlying group is abelian, one instead discovers the representation theory of unitary representations, which is simpler to study than the real-valued counterpart of orthogonal representations. For closely related reasons, the theory of complex Lie groups is simpler than that of real Lie groups.

Meanwhile, the fact that the complex numbers are a quadratic extension of the reals lets one view the complex numbers geometrically as a two-dimensional plane over the reals (the Argand plane). Whereas a point singularity in the real line disconnects that line, a point singularity in the Argand plane leaves the rest of the plane connected (although, importantly, the punctured plane is no longer simply connected). As we shall see, this fact causes singularities in complex analytic functions to be better behaved than singularities of real analytic functions, ultimately leading to the powerful residue calculus for computing complex integrals. Remarkably, this calculus, when combined with the quintessentially complex-variable technique of contour shifting, can also be used to compute some (though certainly not all) definite integrals of real-valued functions that would be much more difficult to compute by purely real-variable methods; this is a prime example of Hadamard’s famous dictum that “the shortest path between two truths in the real domain passes through the complex domain”.

Another important geometric feature of the Argand plane is the angle between two tangent vectors to a point in the plane. As it turns out, the operation of multiplication by a complex scalar preserves the magnitude and orientation of such angles; the same fact is true for any non-degenerate complex analytic mapping, as can be seen by performing a Taylor expansion to first order. This fact ties the study of complex mappings closely to that of the conformal geometry of the plane (and more generally, of two-dimensional surfaces and domains). In particular, one can use complex analytic maps to conformally transform one two-dimensional domain to another, leading among other things to the famous Riemann mapping theorem, and to the classification of Riemann surfaces.
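The angle-preservation claim for multiplication by a scalar can be checked numerically. The sketch below uses Python's built-in complex numbers; the helper `signed_angle` and the particular values of `c`, `u`, `v` are arbitrary choices made for illustration. Complex conjugation is included for contrast, since it preserves the magnitude of the angle but reverses its orientation.

```python
import cmath


def signed_angle(u, v):
    """Signed angle from the tangent vector u to the tangent vector v."""
    return cmath.phase(v / u)


c = 2 - 3j                  # a non-zero complex scalar
u, v = 1 + 1j, -2 + 0.5j    # two tangent vectors at a point

before = signed_angle(u, v)
after = signed_angle(c * u, c * v)
flipped = signed_angle(u.conjugate(), v.conjugate())
print(before, after, flipped)
```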

If one Taylor expands complex analytic maps to second order rather than first order, one discovers a further important property of these maps, namely that they are harmonic. This fact makes the class of complex analytic maps extremely rigid and well behaved analytically; indeed, the entire theory of elliptic PDE now comes into play, giving useful properties such as elliptic regularity and the maximum principle. In fact, due to the magic of residue calculus and contour shifting, we already obtain these properties for maps that are merely complex differentiable rather than complex analytic, which leads to the striking fact that complex differentiable functions are automatically analytic (in contrast to the real-variable case, in which real differentiable functions can be very far from being analytic).

The geometric structure of the complex numbers (and more generally of complex manifolds and complex varieties), when combined with the algebraic closure of the complex numbers, leads to the beautiful subject of complex algebraic geometry, which motivates the much more general theory developed in modern algebraic geometry. However, we will not develop the algebraic geometry aspects of complex analysis here.

Last, but not least, because of the good behaviour of Taylor series in the complex plane, complex analysis is an excellent setting in which to manipulate various generating functions, particularly Fourier series {\sum_n a_n e^{2\pi i n \theta}} (which can be viewed as boundary values of power (or Laurent) series {\sum_n a_n z^n}), as well as Dirichlet series {\sum_n \frac{a_n}{n^s}}. The theory of contour integration provides a very useful dictionary between the asymptotic behaviour of the sequence {a_n}, and the complex analytic behaviour of the Dirichlet or Fourier series, particularly with regard to its poles and other singularities. This turns out to be a particularly handy dictionary in analytic number theory, for instance relating the distribution of the primes to the Riemann zeta function. Nowadays, many of the analytic number theory results first obtained through complex analysis (such as the prime number theorem) can also be obtained by more “real-variable” methods; however the complex-analytic viewpoint is still extremely valuable and illuminating.

We will frequently touch upon many of these connections to other fields of mathematics in these lecture notes. However, these are mostly side remarks intended to provide context, and it is certainly possible to skip most of these tangents and focus purely on the complex analysis material in these notes if desired.

Note: complex analysis is a very visual subject, and one should draw plenty of pictures while learning it. I am however not planning to put too many pictures in these notes, partly as it is somewhat inconvenient to do so on this blog from a technical perspective, but also because pictures that you draw on your own are likely to be far more useful to you than pictures that were supplied by someone else.


[This blog post was written jointly by Terry Tao and Will Sawin.]

In the previous blog post, one of us (Terry) implicitly introduced a notion of rank for tensors which is a little different from the usual notion of tensor rank, and which (following BCCGNSU) we will call “slice rank”. This notion of rank could then be used to encode the Croot-Lev-Pach-Ellenberg-Gijswijt argument that uses the polynomial method to control capsets.

Afterwards, several papers have applied the slice rank method to further problems – to control tri-colored sum-free sets in abelian groups (BCCGNSU, KSS) and from there to the triangle removal lemma in vector spaces over finite fields (FL), to control sunflowers (NS), and to bound progression-free sets in {p}-groups (P).

In this post we investigate the notion of slice rank more systematically. In particular, we show how to give lower bounds for the slice rank. In many cases, we can show that the upper bounds on slice rank given in the aforementioned papers are sharp to within a subexponential factor. This still leaves open the possibility of getting a better bound for the original combinatorial problem using the slice rank of some other tensor, but for very long arithmetic progressions (at least eight terms), we show that the slice rank method cannot improve over the trivial bound using any tensor.

It will be convenient to work in a “basis independent” formalism, namely working in the category of abstract finite-dimensional vector spaces over a fixed field {{\bf F}}. (In the applications to the capset problem one takes {{\bf F}={\bf F}_3} to be the finite field of three elements, but most of the discussion here applies to arbitrary fields.) Given {k} such vector spaces {V_1,\dots,V_k}, we can form the tensor product {\bigotimes_{i=1}^k V_i}, generated by the tensor products {v_1 \otimes \dots \otimes v_k} with {v_i \in V_i} for {i=1,\dots,k}, subject to the constraint that the tensor product operation {(v_1,\dots,v_k) \mapsto v_1 \otimes \dots \otimes v_k} is multilinear. For each {1 \leq j \leq k}, we have the smaller tensor products {\bigotimes_{1 \leq i \leq k: i \neq j} V_i}, as well as the {j^{th}} tensor product

\displaystyle \otimes_j: V_j \times \bigotimes_{1 \leq i \leq k: i \neq j} V_i \rightarrow \bigotimes_{i=1}^k V_i

defined in the obvious fashion. Elements of {\bigotimes_{i=1}^k V_i} of the form {v_j \otimes_j v_{\hat j}} for some {v_j \in V_j} and {v_{\hat j} \in \bigotimes_{1 \leq i \leq k: i \neq j} V_i} will be called rank one functions, and the slice rank (or rank for short) {\hbox{rank}(v)} of an element {v} of {\bigotimes_{i=1}^k V_i} is defined to be the least nonnegative integer {r} such that {v} is a linear combination of {r} rank one functions. If {V_1,\dots,V_k} are finite-dimensional, then the rank is always well defined as a non-negative integer (in fact it cannot exceed {\min( \hbox{dim}(V_1), \dots, \hbox{dim}(V_k))}). It is also clearly subadditive:

\displaystyle \hbox{rank}(v+w) \leq \hbox{rank}(v) + \hbox{rank}(w). \ \ \ \ \ (1)


For {k=1}, {\hbox{rank}(v)} is {0} when {v} is zero, and {1} otherwise. For {k=2}, {\hbox{rank}(v)} is the usual rank of the {2}-tensor {v \in V_1 \otimes V_2} (which can for instance be identified with a linear map from the dual space {V_1^*} to {V_2}). The usual notion of tensor rank for higher order tensors uses complete tensor products {v_1 \otimes \dots \otimes v_k}, {v_i \in V_i} as the rank one objects, rather than {v_j \otimes_j v_{\hat j}}, giving a rank that is greater than or equal to the slice rank studied here.

From basic linear algebra we have the following equivalences:

Lemma 1 Let {V_1,\dots,V_k} be finite-dimensional vector spaces over a field {{\bf F}}, let {v} be an element of {V_1 \otimes \dots \otimes V_k}, and let {r} be a non-negative integer. Then the following are equivalent:

  • (i) One has {\hbox{rank}(v) \leq r}.
  • (ii) One has a representation of the form

    \displaystyle v = \sum_{j=1}^k \sum_{s \in S_j} v_{j,s} \otimes_j v_{\hat j,s}

    where {S_1,\dots,S_k} are finite sets of total cardinality {|S_1|+\dots+|S_k|} at most {r}, and for each {1 \leq j \leq k} and {s \in S_j}, {v_{j,s} \in V_j} and {v_{\hat j,s} \in \bigotimes_{1 \leq i \leq k: i \neq j} V_i}.

  • (iii) One has

    \displaystyle v \in \sum_{j=1}^k U_j \otimes_j \bigotimes_{1 \leq i \leq k: i \neq j} V_i

    where for each {j=1,\dots,k}, {U_j} is a subspace of {V_j} of total dimension {\hbox{dim}(U_1)+\dots+\hbox{dim}(U_k)} at most {r}, and we view {U_j \otimes_j \bigotimes_{1 \leq i \leq k: i \neq j} V_i} as a subspace of {\bigotimes_{i=1}^k V_i} in the obvious fashion.

  • (iv) (Dual formulation) There exist subspaces {W_j} of the dual space {V_j^*} for {j=1,\dots,k}, of total dimension at least {\hbox{dim}(V_1)+\dots+\hbox{dim}(V_k) - r}, such that {v} is orthogonal to {\bigotimes_{j=1}^k W_j}, in the sense that one has the vanishing

    \displaystyle \langle \bigotimes_{j=1}^k w_j, v \rangle = 0

    for all {w_j \in W_j}, where {\langle, \rangle: \bigotimes_{j=1}^k V_j^* \times \bigotimes_{j=1}^k V_j \rightarrow {\bf F}} is the obvious pairing.

Proof: The equivalence of (i) and (ii) is clear from the definition. To get from (ii) to (iii) one simply takes {U_j} to be the span of the {v_{j,s}}, and conversely to get from (iii) to (ii) one takes the {v_{j,s}} to be a basis of the {U_j} and computes {v_{\hat j,s}} by using a basis for the space {\sum_{j=1}^k U_j \otimes_j \bigotimes_{1 \leq i \leq k: i \neq j} V_i} consisting entirely of functions of the form {v_{j,s} \otimes_j e} for various {e}. To pass from (iii) to (iv) one takes {W_j} to be the annihilator {\{ w_j \in V_j^*: \langle w_j, v_j \rangle = 0 \ \forall v_j \in U_j \}} of {U_j}, and conversely to pass from (iv) to (iii) one takes {U_j} to be the annihilator of {W_j} in {V_j}. \Box

One corollary of the formulation (iv) is that the set of tensors of slice rank at most {r} is Zariski closed (if the field {{\bf F}} is algebraically closed), and so the slice rank itself is a lower semi-continuous function. This is in contrast to the usual tensor rank, which is not necessarily semicontinuous.

Corollary 2 Let {V_1,\dots, V_k} be finite-dimensional vector spaces over an algebraically closed field {{\bf F}}. Let {r} be a nonnegative integer. The set of elements of {V_1 \otimes \dots \otimes V_k} of slice rank at most {r} is closed in the Zariski topology.

Proof: In view of the equivalence of (i) and (iv) in Lemma 1, this set is the union over tuples of integers {d_1,\dots,d_k} with {d_1 + \dots + d_k \geq \hbox{dim}(V_1)+\dots+\hbox{dim}(V_k) - r} of the projection from {\hbox{Gr}(d_1, V_1) \times \dots \times \hbox{Gr}(d_k, V_k) \times ( V_1 \otimes \dots \otimes V_k)} of the set of tuples {(W_1,\dots,W_k, v)} with { v} orthogonal to {W_1 \otimes \dots \otimes W_k}, where {\hbox{Gr}(d,V)} is the Grassmannian parameterizing {d}-dimensional subspaces of {V}.

One can check directly that the set of tuples {(W_1,\dots,W_k, v)} with { v} orthogonal to {W_1 \otimes \dots \otimes W_k} is Zariski closed in {\hbox{Gr}(d_1, V_1) \times \dots \times \hbox{Gr}(d_k, V_k) \times (V_1 \otimes \dots \otimes V_k)} using a set of equations of the form {\langle \bigotimes_{j=1}^k w_j, v \rangle = 0} locally on {\hbox{Gr}(d_1, V_1) \times \dots \times \hbox{Gr}(d_k, V_k) }. Hence, because the Grassmannian is a complete variety, the projection of this set to {V_1 \otimes \dots \otimes V_k} is also Zariski closed. So the finite union over tuples {d_1,\dots,d_k} of these projections is also Zariski closed. \Box


We also have good behaviour with respect to linear transformations:

Lemma 3 Let {V_1,\dots,V_k, W_1,\dots,W_k} be finite-dimensional vector spaces over a field {{\bf F}}, let {v} be an element of {V_1 \otimes \dots \otimes V_k}, and for each {1 \leq j \leq k}, let {\phi_j: V_j \rightarrow W_j} be a linear transformation, with {\bigotimes_{j=1}^k \phi_j: \bigotimes_{j=1}^k V_j \rightarrow \bigotimes_{j=1}^k W_j} the tensor product of these maps. Then

\displaystyle \hbox{rank}( (\bigotimes_{j=1}^k \phi_j)(v) ) \leq \hbox{rank}(v). \ \ \ \ \ (2)


Furthermore, if the {\phi_j} are all injective, then one has equality in (2).

Thus, for instance, the rank of a tensor {v \in \bigotimes_{j=1}^k V_j} is intrinsic in the sense that it is unaffected by any enlargements of the spaces {V_1,\dots,V_k}.

Proof: The bound (2) is clear from the formulation (ii) of rank in Lemma 1. For equality, apply (2) to the injective {\phi_j}, as well as to some arbitrarily chosen left inverses {\phi_j^{-1}: W_j \rightarrow V_j} of the {\phi_j}. \Box

Computing the rank of a tensor is difficult in general; however, the problem becomes a combinatorial one if one has a suitably sparse representation of that tensor in some basis, where we will measure sparsity by the property of being an antichain.

Proposition 4 Let {V_1,\dots,V_k} be finite-dimensional vector spaces over a field {{\bf F}}. For each {1 \leq j \leq k}, let {(v_{j,s})_{s \in S_j}} be a linearly independent set in {V_j} indexed by some finite set {S_j}. Let {\Gamma} be a subset of {S_1 \times \dots \times S_k}.

Let {v \in \bigotimes_{j=1}^k V_j} be a tensor of the form

\displaystyle v = \sum_{(s_1,\dots,s_k) \in \Gamma} c_{s_1,\dots,s_k} v_{1,s_1} \otimes \dots \otimes v_{k,s_k} \ \ \ \ \ (3)


where for each {(s_1,\dots,s_k)}, {c_{s_1,\dots,s_k}} is a coefficient in {{\bf F}}. Then one has

\displaystyle \hbox{rank}(v) \leq \min_{\Gamma = \Gamma_1 \cup \dots \cup \Gamma_k} |\pi_1(\Gamma_1)| + \dots + |\pi_k(\Gamma_k)| \ \ \ \ \ (4)


where the minimum ranges over all coverings of {\Gamma} by sets {\Gamma_1,\dots,\Gamma_k}, and {\pi_j: S_1 \times \dots \times S_k \rightarrow S_j} for {j=1,\dots,k} are the projection maps.

Now suppose that the coefficients {c_{s_1,\dots,s_k}} are all non-zero, that each of the {S_j} are equipped with a total ordering {\leq_j}, and {\Gamma'} is the set of maximal elements of {\Gamma}, thus there do not exist distinct {(s_1,\dots,s_k) \in \Gamma'}, {(t_1,\dots,t_k) \in \Gamma} such that {s_j \leq t_j} for all {j=1,\dots,k}. Then one has

\displaystyle \hbox{rank}(v) \geq \min_{\Gamma' = \Gamma_1 \cup \dots \cup \Gamma_k} |\pi_1(\Gamma_1)| + \dots + |\pi_k(\Gamma_k)|. \ \ \ \ \ (5)


In particular, if {\Gamma} is an antichain (i.e. every element is maximal), then equality holds in (4).

Proof: By Lemma 3 (or by enlarging the bases {v_{j,s_j}}), we may assume without loss of generality that each of the {V_j} is spanned by the {v_{j,s_j}}. By relabeling, we can also assume that each {S_j} is of the form

\displaystyle S_j = \{1,\dots,|S_j|\}

with the usual ordering, and by Lemma 3 we may take each {V_j} to be {{\bf F}^{|S_j|}}, with {v_{j,s_j} = e_{s_j}} the standard basis.

Let {r} denote the rank of {v}. To show (4), it suffices to show the inequality

\displaystyle r \leq |\pi_1(\Gamma_1)| + \dots + |\pi_k(\Gamma_k)| \ \ \ \ \ (6)


for any covering of {\Gamma} by {\Gamma_1,\dots,\Gamma_k}. By removing repeated elements we may assume that the {\Gamma_i} are disjoint. For each {1 \leq j \leq k}, the tensor

\displaystyle \sum_{(s_1,\dots,s_k) \in \Gamma_j} c_{s_1,\dots,s_k} e_{s_1} \otimes \dots \otimes e_{s_k}

can (after collecting terms) be written as

\displaystyle \sum_{s_j \in \pi_j(\Gamma_j)} e_{s_j} \otimes_j v_{\hat j,s_j}

for some {v_{\hat j, s_j} \in \bigotimes_{1 \leq i \leq k: i \neq j} {\bf F}^{|S_i|}}. Summing and using (1), we conclude the inequality (6).

Now assume that the {c_{s_1,\dots,s_k}} are all non-zero and that {\Gamma'} is the set of maximal elements of {\Gamma}. To conclude the proposition, it suffices to show that the reverse inequality

\displaystyle r \geq |\pi_1(\Gamma_1)| + \dots + |\pi_k(\Gamma_k)| \ \ \ \ \ (7)


 holds for some {\Gamma_1,\dots,\Gamma_k} covering {\Gamma'}. By Lemma 1(iv), there exist subspaces {W_j} of {({\bf F}^{|S_j|})^*} whose dimension {d_j := \hbox{dim}(W_j)} sums to

\displaystyle \sum_{j=1}^k d_j = \sum_{j=1}^k |S_j| - r \ \ \ \ \ (8)


such that {v} is orthogonal to {\bigotimes_{j=1}^k W_j}.

Let {1 \leq j \leq k}. Using Gaussian elimination, one can find a basis {w_{j,1},\dots,w_{j,d_j}} of {W_j} whose representation in the standard dual basis {e^*_{1},\dots,e^*_{|S_j|}} of {({\bf F}^{|S_j|})^*} is in row-echelon form. That is to say, there exist natural numbers

\displaystyle 1 \leq s_{j,1} < \dots < s_{j,d_j} \leq |S_j|

such that for all {1 \leq t \leq d_j}, {w_{j,t}} is a linear combination of the dual vectors {e^*_{s_{j,t}},\dots,e^*_{|S_j|}}, with the {e^*_{s_{j,t}}} coefficient equal to one.
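The Gaussian elimination step can be sketched as follows, working over the rationals for concreteness (the argument in the text applies over any field). The function name `row_echelon_basis` is hypothetical, and the pivot columns it returns are {0}-indexed, whereas the text uses {1}-indexed positions {s_{j,t}}.

```python
from fractions import Fraction


def row_echelon_basis(rows):
    """Return (basis, pivots) for the row space of `rows` over Q.

    Each basis vector vanishes before its pivot column and has pivot
    entry 1; the pivot columns are strictly increasing.
    """
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    pivots = []
    r = 0  # number of basis vectors found so far
    for col in range(ncols):
        if r == len(m):
            break
        # look for a row at or below r with a non-zero entry in this column
        pivot_row = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot_row is None:
            continue
        m[r], m[pivot_row] = m[pivot_row], m[r]
        lead = m[r][col]
        m[r] = [x / lead for x in m[r]]  # normalise the pivot entry to 1
        for i in range(r + 1, len(m)):
            factor = m[i][col]
            if factor != 0:
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        pivots.append(col)
        r += 1
    return m[:r], pivots


basis, pivots = row_echelon_basis([[0, 2, 4, 6], [0, 1, 2, 4], [1, 0, 0, 1]])
print(pivots)
for w in basis:
    print(w)
```

Only the rows below the current pivot are cleared, which is all that the row-echelon property used in the proof requires.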

We now claim that {\prod_{j=1}^k \{ s_{j,t}: 1 \leq t \leq d_j \}} is disjoint from {\Gamma'}. Suppose for contradiction that this were not the case, thus there exists {1 \leq t_j \leq d_j} for each {1 \leq j \leq k} such that

\displaystyle (s_{1,t_1}, \dots, s_{k,t_k}) \in \Gamma'.

As {\Gamma'} is the set of maximal elements of {\Gamma}, this implies that

\displaystyle (s'_1,\dots,s'_k) \not \in \Gamma

for any tuple {(s'_1,\dots,s'_k) \in \prod_{j=1}^k \{ s_{j,t_j}, \dots, |S_j|\}} other than {(s_{1,t_1}, \dots, s_{k,t_k})}. On the other hand, we know that {w_{j,t_j}} is a linear combination of {e^*_{s_{j,t_j}},\dots,e^*_{|S_j|}}, with the {e^*_{s_{j,t_j}}} coefficient one. We conclude that the tensor product {\bigotimes_{j=1}^k w_{j,t_j}} is equal to

\displaystyle \bigotimes_{j=1}^k e^*_{s_{j,t_j}}

plus a linear combination of other tensor products {\bigotimes_{j=1}^k e^*_{s'_j}} with {(s'_1,\dots,s'_k)} not in {\Gamma}. Taking inner products with (3), we conclude that {\langle v, \bigotimes_{j=1}^k w_{j,t_j}\rangle = c_{s_{1,t_1},\dots,s_{k,t_k}} \neq 0}, contradicting the fact that {v} is orthogonal to {\bigotimes_{j=1}^k W_j}. Thus we have {\prod_{j=1}^k \{ s_{j,t}: 1 \leq t \leq d_j \}} disjoint from {\Gamma'}.

For each {1 \leq j \leq k}, let {\Gamma_j} denote the set of tuples {(s_1,\dots,s_k)} in {\Gamma'} with {s_j} not in the set {\{ s_{j,t}: 1 \leq t \leq d_j \}}. From the previous discussion we see that the {\Gamma_j} cover {\Gamma'}, and we clearly have {|\pi_j(\Gamma_j)| \leq |S_j| - d_j}, and hence from (8) we have (7) as claimed. \Box

As an instance of this proposition, we recover the computation of diagonal rank from the previous blog post:

Example 5 Let {V_1,\dots,V_k} be finite-dimensional vector spaces over a field {{\bf F}} for some {k \geq 2}. Let {d} be a natural number, and for {1 \leq j \leq k}, let {e_{j,1},\dots,e_{j,d}} be a linearly independent set in {V_j}. Let {c_1,\dots,c_d} be non-zero coefficients in {{\bf F}}. Then

\displaystyle \sum_{t=1}^d c_t e_{1,t} \otimes \dots \otimes e_{k,t}

has rank {d}. Indeed, one applies the proposition with {S_1,\dots,S_k} all equal to {\{1,\dots,d\}}, with {\Gamma} the diagonal in {S_1 \times \dots \times S_k}; this is an antichain if we give one of the {S_i} the standard ordering, and another of the {S_i} the opposite ordering (and ordering the remaining {S_i} arbitrarily). In this case, the {\pi_j} are all bijective, and so it is clear that the minimum in (4) is simply {d}.
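For small {\Gamma}, the combinatorial minimum in (4) can be found by brute force. The sketch below (with the hypothetical function name `covering_bound`) assigns each element of {\Gamma} to exactly one of the sets {\Gamma_1,\dots,\Gamma_k}; this loses nothing, since shrinking the sets in a covering to make them disjoint cannot increase any {|\pi_j(\Gamma_j)|}. The running time is exponential in {|\Gamma|}, so this is only an illustration.

```python
from itertools import product


def covering_bound(gamma, k):
    """Minimum of |pi_1(G_1)| + ... + |pi_k(G_k)| over partitions of gamma."""
    gamma = list(gamma)
    best = None
    for assignment in product(range(k), repeat=len(gamma)):
        cost = 0
        for j in range(k):
            # project the elements assigned to Gamma_j onto coordinate j
            projected = {s[j] for s, a in zip(gamma, assignment) if a == j}
            cost += len(projected)
        if best is None or cost < best:
            best = cost
    return best


d, k = 3, 3
diagonal = [(t,) * k for t in range(d)]   # the set Gamma of Example 5
line = [(0, 0), (0, 1), (0, 2)]           # all elements share a first coordinate
print(covering_bound(diagonal, k), covering_bound(line, 2))
```

On the diagonal every projection is injective, so every assignment has cost {d}; on the second set, placing everything in {\Gamma_1} gives cost {1}.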

The combinatorial minimisation problem in the above proposition can be solved asymptotically when working with tensor powers, using the notion of the Shannon entropy {h(X)} of a discrete random variable {X}.

Proposition 6 Let {V_1,\dots,V_k} be finite-dimensional vector spaces over a field {{\bf F}}. For each {1 \leq j \leq k}, let {(v_{j,s})_{s \in S_j}} be a linearly independent set in {V_j} indexed by some finite set {S_j}. Let {\Gamma} be a non-empty subset of {S_1 \times \dots \times S_k}.

Let {v \in \bigotimes_{j=1}^k V_j} be a tensor of the form (3) for some coefficients {c_{s_1,\dots,s_k}}. For each natural number {n}, let {v^{\otimes n}} be the tensor power of {n} copies of {v}, viewed as an element of {\bigotimes_{j=1}^k V_j^{\otimes n}}. Then

\displaystyle \hbox{rank}(v^{\otimes n}) \leq \exp( (H + o(1)) n ) \ \ \ \ \ (9)


as {n \rightarrow \infty}, where {H} is the quantity

\displaystyle H = \hbox{sup}_{(X_1,\dots,X_k)} \hbox{min}( h(X_1), \dots, h(X_k) ) \ \ \ \ \ (10)


and {(X_1,\dots,X_k)} range over the random variables taking values in {\Gamma}.

Now suppose that the coefficients {c_{s_1,\dots,s_k}} are all non-zero and that each of the {S_j} are equipped with a total ordering {\leq_j}. Let {\Gamma'} be the set of maximal elements of {\Gamma} in the product ordering, and let {H' = \hbox{sup}_{(X_1,\dots,X_k)} \hbox{min}( h(X_1), \dots, h(X_k) ) } where {(X_1,\dots,X_k)} range over random variables taking values in {\Gamma'}. Then

\displaystyle \hbox{rank}(v^{\otimes n}) \geq \exp( (H' + o(1)) n ) \ \ \ \ \ (11)


as {n \rightarrow \infty}. In particular, if the maximizer in (10) is supported on the maximal elements of {\Gamma} (which always holds if {\Gamma} is an antichain in the product ordering), then equality holds in (9).


Proof: It will suffice to show that

\displaystyle \min_{\Gamma^n = \Gamma_{n,1} \cup \dots \cup \Gamma_{n,k}} |\pi_{n,1}(\Gamma_{n,1})| + \dots + |\pi_{n,k}(\Gamma_{n,k})| = \exp( (H + o(1)) n ) \ \ \ \ \ (12)


as {n \rightarrow \infty}, where {\pi_{n,j}: \prod_{i=1}^k S_i^n \rightarrow S_j^n} is the projection map. Then the same thing will apply to {\Gamma'} and {H'}. Then applying Proposition 4, using the lexicographical ordering on {S_j^n} and noting that, if {\Gamma'} are the maximal elements of {\Gamma}, then {\Gamma'^n} are the maximal elements of {\Gamma^n}, we obtain both (9) and (11).

We first prove the lower bound. By compactness (and the continuity properties of entropy), we can find a random variable {(X_1,\dots,X_k)} taking values in {\Gamma} such that

\displaystyle H = \hbox{min}( h(X_1), \dots, h(X_k) ). \ \ \ \ \ (13)


Let {\varepsilon = o(1)} be a small positive quantity that goes to zero sufficiently slowly with {n}. Let {\Sigma = \Sigma_{X_1,\dots,X_k} \subset \Gamma^n} denote the set of all tuples {(a_1, \dots, a_n)} in {\Gamma^n} that are within {\varepsilon} of being distributed according to the law of {(X_1,\dots,X_k)}, in the sense that for all {a \in \Gamma}, one has

\displaystyle |\frac{|\{ 1 \leq l \leq n: a_l = a \}|}{n} - {\bf P}( (X_1,\dots,X_k) = a )| \leq \varepsilon.

By the asymptotic equipartition property, the cardinality of {\Sigma} can be computed to be

\displaystyle |\Sigma| = \exp( (h( X_1,\dots,X_k)+o(1)) n ) \ \ \ \ \ (14)


if {\varepsilon} goes to zero slowly enough. Similarly one has

\displaystyle |\pi_{n,j}(\Sigma)| = \exp( (h( X_j)+o(1)) n ),

and for each {s_{n,j} \in \pi_{n,j}(\Sigma)}, one has

\displaystyle |\{ \sigma \in \Sigma: \pi_{n,j}(\sigma) = s_{n,j} \}| \leq \exp( (h( X_1,\dots,X_k)-h(X_j)+o(1)) n ). \ \ \ \ \ (15)


Now let {\Gamma^n = \Gamma_{n,1} \cup \dots \cup \Gamma_{n,k}} be an arbitrary covering of {\Gamma^n}. By the pigeonhole principle, there exists {1 \leq j \leq k} such that

\displaystyle |\Gamma_{n,j} \cap \Sigma| \geq \frac{1}{k} |\Sigma|

and hence by (14), (15)

\displaystyle |\pi_{n,j}( \Gamma_{n,j} \cap \Sigma)| \geq \frac{1}{k} \exp( (h( X_j)+o(1)) n )

which by (13) implies that

\displaystyle |\pi_{n,1}(\Gamma_{n,1})| + \dots + |\pi_{n,k}(\Gamma_{n,k})| \geq \exp( (H + o(1)) n )

(noting that the {\frac{1}{k}} factor can be absorbed into the {o(1)} error). This gives the lower bound in (12).

Now we prove the upper bound. We can cover {\Gamma^n} by {O(\exp(o(n)))} sets of the form {\Sigma_{X_1,\dots,X_k}} for various choices of random variables {(X_1,\dots,X_k)} taking values in {\Gamma}. For each such random variable {(X_1,\dots,X_k)}, we can find {1 \leq j \leq k} such that {h(X_j) \leq H}; we then place all of {\Sigma_{X_1,\dots,X_k}} in {\Gamma_{n,j}}. It is then clear that the {\Gamma_{n,j}} cover {\Gamma^n} and that

\displaystyle |\pi_{n,j}(\Gamma_{n,j})| \leq \exp( (H+o(1)) n )

for all {j=1,\dots,k}, giving the required upper bound. \Box
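To experiment with the quantity inside the supremum in (10), one can compute the marginal entropies of a given distribution on {\Gamma}. The sketch below uses natural logarithms, to match the {\exp} in (9); the function names and the dictionary representation of the distribution are assumptions made for illustration. Any single distribution only gives a lower bound for {H}; finding the supremum is a separate optimisation problem.

```python
from collections import defaultdict
from math import log


def marginal_entropies(dist, k):
    """Shannon entropies h(X_1), ..., h(X_k) of the coordinates of (X_1,...,X_k).

    `dist` maps each tuple in Gamma to its probability.
    """
    entropies = []
    for j in range(k):
        marginal = defaultdict(float)
        for t, p in dist.items():
            marginal[t[j]] += p
        entropies.append(sum(p * log(1 / p) for p in marginal.values() if p > 0))
    return entropies


def min_marginal_entropy(dist, k):
    """The quantity min(h(X_1), ..., h(X_k)) appearing in (10)."""
    return min(marginal_entropies(dist, k))


d, k = 4, 3
uniform_on_diagonal = {(t,) * k: 1 / d for t in range(d)}
print(min_marginal_entropy(uniform_on_diagonal, k), log(d))
```

For the diagonal of Example 5, each {X_j} takes at most {d} values, so {h(X_j) \leq \log d}, and the uniform distribution attains this; thus {H = \log d} in that case, consistent with the {n^{th}} tensor power of a diagonal tensor being a diagonal tensor with {d^n} non-zero entries.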

It is of interest to compute the quantity {H} in (10). We have the following criterion for when a maximiser occurs:

Proposition 7 Let {S_1,\dots,S_k} be finite sets, and {\Gamma \subset S_1 \times \dots \times S_k} be non-empty. Let {H} be the quantity in (10). Let {(X_1,\dots,X_k)} be a random variable taking values in {\Gamma}, and let {\Gamma^* \subset \Gamma} denote the essential range of {(X_1,\dots,X_k)}, that is to say the set of tuples {(t_1,\dots,t_k)\in \Gamma} such that {{\bf P}( X_1=t_1, \dots, X_k = t_k)} is non-zero. Then the following are equivalent:

  • (i) {(X_1,\dots,X_k)} attains the maximum in (10).
  • (ii) There exist weights {w_1,\dots,w_k \geq 0} and a finite quantity {D \geq 0}, such that {w_j=0} whenever {h(X_j) > \min(h(X_1),\dots,h(X_k))}, and such that

    \displaystyle \sum_{j=1}^k w_j \log \frac{1}{{\bf P}(X_j = t_j)} \leq D \ \ \ \ \ (16)

    for all {(t_1,\dots,t_k) \in \Gamma}, with equality if {(t_1,\dots,t_k) \in \Gamma^*}. (In particular, {w_j} must vanish if there exists a {t_j \in \pi_j(\Gamma)} with {{\bf P}(X_j=t_j)=0}.)

Furthermore, when (i) and (ii) hold, one has

\displaystyle D = H \sum_{j=1}^k w_j. \ \ \ \ \ (17)
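Condition (16) is easy to test numerically for a candidate distribution and a candidate choice of weights. The sketch below (function name hypothetical) returns the constant {D} when (16) holds and `None` otherwise. It assumes the weights supplied are not all zero, as in the proof, and it does not check the separate requirement that {w_j = 0} whenever {h(X_j)} exceeds the minimum.

```python
from collections import defaultdict
from math import log, isclose


def criterion_constant(gamma, dist, weights, tol=1e-9):
    """Return D if (16) holds for these weights, else None."""
    k = len(weights)
    marginals = [defaultdict(float) for _ in range(k)]
    for t, p in dist.items():
        for j in range(k):
            marginals[j][t[j]] += p

    def A(t):
        # left-hand side of (16) at the tuple t
        total = 0.0
        for j, w in enumerate(weights):
            if w == 0:
                continue  # a zero weight contributes nothing
            p = marginals[j].get(t[j], 0.0)
            if p == 0:
                return float("inf")
            total += w * log(1 / p)
        return total

    support = [t for t, p in dist.items() if p > 0]  # the essential range
    D = A(support[0])
    if any(not isclose(A(s), D, abs_tol=tol) for s in support):
        return None  # equality fails somewhere on the essential range
    if any(A(t) > D + tol for t in gamma):
        return None  # the inequality fails somewhere on Gamma
    return D


gamma = [(0, 0), (0, 1), (1, 0)]
candidate = {(0, 1): 0.5, (1, 0): 0.5}
uniform = {t: 1 / 3 for t in gamma}
print(criterion_constant(gamma, candidate, (1, 1)))
print(criterion_constant(gamma, uniform, (1, 1)))
```

In this small example both marginals of `candidate` are uniform on {\{0,1\}}, so {h(X_1)=h(X_2)=\log 2}, which is the largest value possible; hence {H = \log 2}, and the returned constant {2 \log 2} agrees with (17) for the weights {w_1=w_2=1}. The uniform distribution on all of {\Gamma} fails the equality in (16) for these weights.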


Proof: We first show that (i) implies (ii). The function {p \mapsto p \log \frac{1}{p}} is concave on {[0,1]}. As a consequence, if we define {C} to be the set of tuples {(h_1,\dots,h_k) \in [0,+\infty)^k} such that there exists a random variable {(Y_1,\dots,Y_k)} taking values in {\Gamma} with {h(Y_j) \geq h_j}, then {C} is convex. On the other hand, by (10), {C} is disjoint from the orthant {(H,+\infty)^k}. Thus, by the hyperplane separation theorem, we conclude that there exists a half-space

\displaystyle \{ (h_1,\dots,h_k) \in {\bf R}^k: w_1 h_1 + \dots + w_k h_k \geq c \},

where {w_1,\dots,w_k} are reals that are not all zero, and {c} is another real, which contains {(h(X_1),\dots,h(X_k))} on its boundary and {(H,+\infty)^k} in its interior, such that {C} avoids the interior of the half-space. Since {(h(X_1),\dots,h(X_k))} is also on the boundary of {(H,+\infty)^k}, we see that the {w_j} are non-negative, and that {w_j = 0} whenever {h(X_j) \neq H}.

By construction, the quantity

\displaystyle w_1 h(Y_1) + \dots + w_k h(Y_k)

is maximised when {(Y_1,\dots,Y_k) = (X_1,\dots,X_k)}. At this point we could use the method of Lagrange multipliers to obtain the required constraints, but because we have some boundary conditions on the {(Y_1,\dots,Y_k)} (namely, that the probability that they attain a given element of {\Gamma} has to be non-negative) we will work things out by hand. Let {t = (t_1,\dots,t_k)} be an element of {\Gamma}, and {s = (s_1,\dots,s_k)} an element of {\Gamma^*}. For {\varepsilon>0} small enough, we can form a random variable {(Y_1,\dots,Y_k)} taking values in {\Gamma}, whose probability distribution is the same as that for {(X_1,\dots,X_k)} except that the probability of attaining {(t_1,\dots,t_k)} is increased by {\varepsilon}, and the probability of attaining {(s_1,\dots,s_k)} is decreased by {\varepsilon}. If there is any {j} for which {{\bf P}(X_j = t_j)=0} and {w_j \neq 0}, then one can check that

\displaystyle w_1 h(Y_1) + \dots + w_k h(Y_k) - (w_1 h(X_1) + \dots + w_k h(X_k)) \gg \varepsilon \log \frac{1}{\varepsilon}

for sufficiently small {\varepsilon}, contradicting the maximality of {(X_1,\dots,X_k)}; thus we have {{\bf P}(X_j = t_j) > 0} whenever {w_j \neq 0}. Taylor expansion then gives

\displaystyle w_1 h(Y_1) + \dots + w_k h(Y_k) - (w_1 h(X_1) + \dots + w_k h(X_k)) = (A_t - A_s) \varepsilon + O(\varepsilon^2)

for small {\varepsilon}, where

\displaystyle A_t := \sum_{j=1}^k w_j \log \frac{1}{{\bf P}(X_j = t_j)}

and similarly for {A_s}. We conclude that {A_t \leq A_s} for all {s \in \Gamma^*} and {t \in \Gamma}, thus there exists a quantity {D} such that {A_s = D} for all {s \in \Gamma^*}, and {A_t \leq D} for all {t \in \Gamma}. By construction {D} must be nonnegative. Sampling {(t_1,\dots,t_k)} using the distribution of {(X_1,\dots,X_k)}, one has

\displaystyle \sum_{j=1}^k w_j \log \frac{1}{{\bf P}(X_j = t_j)} = D

almost surely; taking expectations we conclude that

\displaystyle \sum_{j=1}^k w_j \sum_{t_j \in S_j} {\bf P}( X_j = t_j) \log \frac{1}{{\bf P}(X_j = t_j)} = D.

The inner sum is {h(X_j)}, which equals {H} when {w_j} is non-zero, giving (17).

Now we show conversely that (ii) implies (i). As noted previously, the function {p \mapsto p \log \frac{1}{p}} is concave on {[0,1]}, with derivative {\log \frac{1}{p} - 1}. This gives the inequality

\displaystyle q \log \frac{1}{q} \leq p \log \frac{1}{p} + (q-p) ( \log \frac{1}{p} - 1 ) \ \ \ \ \ (18)


for any {0 \leq p,q \leq 1} (note the right-hand side may be infinite when {p=0} and {q>0}). Let {(Y_1,\dots,Y_k)} be any random variable taking values in {\Gamma}. Applying the above inequality with {p = {\bf P}(X_j = t_j)} and {q = {\bf P}( Y_j = t_j )}, multiplying by {w_j}, and summing over {j=1,\dots,k} and {t_j \in S_j} gives

\displaystyle \sum_{j=1}^k w_j h(Y_j) \leq \sum_{j=1}^k w_j h(X_j)

\displaystyle + \sum_{j=1}^k \sum_{t_j \in S_j} w_j ({\bf P}(Y_j = t_j) - {\bf P}(X_j = t_j)) ( \log \frac{1}{{\bf P}(X_j=t_j)} - 1 ).

By construction, one has

\displaystyle \sum_{j=1}^k w_j h(X_j) = \min(h(X_1),\dots,h(X_k)) \sum_{j=1}^k w_j

and

\displaystyle \sum_{j=1}^k w_j h(Y_j) \geq \min(h(Y_1),\dots,h(Y_k)) \sum_{j=1}^k w_j

so to prove that {\min(h(Y_1),\dots,h(Y_k)) \leq \min(h(X_1),\dots,h(X_k))} (which would give (i)), it suffices to show that

\displaystyle \sum_{j=1}^k \sum_{t_j \in S_j} w_j ({\bf P}(Y_j = t_j) - {\bf P}(X_j = t_j)) ( \log \frac{1}{{\bf P}(X_j=t_j)} - 1 ) \leq 0,

or equivalently that the quantity

\displaystyle \sum_{j=1}^k \sum_{t_j \in S_j} w_j {\bf P}(Y_j = t_j) ( \log \frac{1}{{\bf P}(X_j=t_j)} - 1 )

is maximised when {(Y_1,\dots,Y_k) = (X_1,\dots,X_k)}. Since

\displaystyle \sum_{j=1}^k \sum_{t_j \in S_j} w_j {\bf P}(Y_j = t_j) = \sum_{j=1}^k w_j

it suffices to show this claim for the quantity

\displaystyle \sum_{j=1}^k \sum_{t_j \in S_j} w_j {\bf P}(Y_j = t_j) \log \frac{1}{{\bf P}(X_j=t_j)}.

One can view this quantity as

\displaystyle {\bf E}_{(Y_1,\dots,Y_k)} \sum_{j=1}^k w_j \log \frac{1}{{\bf P}_{X_j}(X_j=Y_j)}.

By (ii), this quantity is bounded by {D}, with equality if {(Y_1,\dots,Y_k)} is equal to {(X_1,\dots,X_k)} (and is in particular ranging in {\Gamma^*}), giving the claim. \Box

The second half of the proof of Proposition 7 only uses the marginal distributions {{\bf P}(X_j=t_j)} and the equation (16), not the actual distribution of {(X_1,\dots,X_k)}, so it can also be used to prove an upper bound on {H} when the exact maximizing distribution is not known, given suitable probability distributions in each variable. The logarithm of the probability distribution here plays the role that the weight functions do in BCCGNSU.

Remark 8 Suppose one is in the situation of (i) and (ii) above; assume the nondegeneracy condition that {H} is positive (or equivalently that {D} is positive). We can assign a “degree” {d_j(t_j)} to each element {t_j \in S_j} by the formula

\displaystyle d_j(t_j) := w_j \log \frac{1}{{\bf P}(X_j = t_j)}, \ \ \ \ \ (19)


then every tuple {(t_1,\dots,t_k)} in {\Gamma} has total degree at most {D}, and those tuples in {\Gamma^*} have degree exactly {D}. In particular, every tuple in {\Gamma^n} has degree at most {nD}, and hence by (17), each such tuple has a {j}-component of degree less than or equal to {nHw_j} for some {j} with {w_j>0}. On the other hand, we can compute from (19) and the fact that {h(X_j) = H} for {w_j > 0} that {Hw_j = {\bf E} d_j(X_j)}. Thus, by asymptotic equipartition, and assuming {w_j \neq 0}, the number of “monomials” in {S_j^n} of total degree at most {nHw_j} is at most {\exp( (h(X_j)+o(1)) n )}; one can use (19) and (18) to show that this is in fact an equality. This gives a direct way to cover {\Gamma^n} by sets {\Gamma_{n,1},\dots,\Gamma_{n,k}} with {|\pi_j(\Gamma_{n,j})| \leq \exp( (H+o(1)) n)}, which is in the spirit of the Croot-Lev-Pach-Ellenberg-Gijswijt arguments from the previous post.

We can now show that the rank computation for the capset problem is sharp:

Proposition 9 Let {V_1^{\otimes n} = V_2^{\otimes n} = V_3^{\otimes n}} denote the space of functions from {{\bf F}_3^n} to {{\bf F}_3}. Then the function {(x,y,z) \mapsto \delta_{0^n}(x+y+z)} from {{\bf F}_3^n \times {\bf F}_3^n \times {\bf F}_3^n} to {{\bf F}_3}, viewed as an element of {V_1^{\otimes n} \otimes V_2^{\otimes n} \otimes V_3^{\otimes n}}, has rank {\exp( (H^*+o(1)) n )} as {n \rightarrow \infty}, where {H^* \approx 1.013455} is given by the formula

\displaystyle H^* = \alpha \log \frac{1}{\alpha} + \beta \log \frac{1}{\beta} + \gamma \log \frac{1}{\gamma} \ \ \ \ \ (20)

where

\displaystyle \alpha = \frac{32}{3(15 + \sqrt{33})} \approx 0.51419

\displaystyle \beta = \frac{4(\sqrt{33}-1)}{3(15+\sqrt{33})} \approx 0.30495

\displaystyle \gamma = \frac{(\sqrt{33}-1)^2}{6(15+\sqrt{33})} \approx 0.18086.

Proof: In {{\bf F}_3 \times {\bf F}_3 \times {\bf F}_3}, we have

\displaystyle \delta_0(x+y+z) = 1 - (x+y+z)^2

\displaystyle = (1-x^2) - y^2 - z^2 + xy + yz + zx.
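The expansion above can be confirmed by brute force over all 27 triples in {{\bf F}_3^3}. The following is only a sanity check, not part of the argument; the helper names `delta0` and `expanded` are ours.

```python
from itertools import product

def delta0(t):
    """Kronecker delta at 0 on F_3, with values in {0, 1}."""
    return 1 if t % 3 == 0 else 0

def expanded(x, y, z):
    """(1 - x^2) - y^2 - z^2 + xy + yz + zx, reduced mod 3."""
    return ((1 - x * x) - y * y - z * z + x * y + y * z + z * x) % 3

# Compare delta_0(x+y+z), 1 - (x+y+z)^2 and the expanded form on all of F_3^3.
identity_holds = all(
    delta0(x + y + z) == (1 - (x + y + z) ** 2) % 3 == expanded(x, y, z)
    for x, y, z in product(range(3), repeat=3)
)
```

The cross terms appear with coefficient {+1} because {-2 = 1} in {{\bf F}_3}.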

Thus, if we let {V_1=V_2=V_3} be the space of functions from {{\bf F}_3} to {{\bf F}_3} (with domain variable denoted {x,y,z} respectively), and define the basis functions

\displaystyle v_{1,0} := 1; v_{1,1} := x; v_{1,2} := x^2

\displaystyle v_{2,0} := 1; v_{2,1} := y; v_{2,2} := y^2

\displaystyle v_{3,0} := 1; v_{3,1} := z; v_{3,2} := z^2

of {V_1,V_2,V_3} indexed by {S_1=S_2=S_3 := \{ 0,1,2\}} (with the usual ordering), respectively, and set {\Gamma \subset S_1 \times S_2 \times S_3} to be the set

\displaystyle \{ (2,0,0), (0,2,0), (0,0,2), (1,1,0), (0,1,1), (1,0,1),(0,0,0) \}

then {\delta_0(x+y+z)} is a linear combination of the {v_{1,t_1} \otimes v_{2,t_2} \otimes v_{3,t_3}} with {(t_1,t_2,t_3) \in \Gamma}, and all coefficients non-zero. Then we have {\Gamma'= \{ (2,0,0), (0,2,0), (0,0,2), (1,1,0), (0,1,1), (1,0,1) \}}. We will show that the quantity {H} of (10) agrees with the quantity {H^*} of (20), and that the optimizing distribution is supported on {\Gamma'}, so that by Proposition 6 the rank of {\delta_{0^n}(x+y+z)} is {\exp( (H+o(1)) n)}.

To compute the quantity at (10), we use the criterion in Proposition 7. We take {(X_1,X_2,X_3)} to be the random variable taking values in {\Gamma} that attains each of the values {(2,0,0), (0,2,0), (0,0,2)} with a probability of {\gamma \approx 0.18086}, and each of {(1,1,0), (0,1,1), (1,0,1)} with a probability of {\alpha - 2\gamma = \beta/2 \approx 0.15247}; then each of the {X_j} attains the values of {0,1,2} with probabilities {\alpha,\beta,\gamma} respectively, so in particular {h(X_1)=h(X_2)=h(X_3)} is equal to the quantity {H^*} in (20). If we now set {w_1 = w_2 = w_3 := 1} and

\displaystyle D := 2\log \frac{1}{\alpha} + \log \frac{1}{\gamma} = \log \frac{1}{\alpha} + 2 \log \frac{1}{\beta} = 3H^* \approx 3.04036

we can verify the condition (16) with equality for all {(t_1,t_2,t_3) \in \Gamma'}, which from (17) gives {H=H^*} as desired. \Box
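The numerical claims in this proof are easy to double-check in floating point. The sketch below assumes that condition (16) is the bound {\sum_j w_j \log \frac{1}{{\bf P}(X_j=t_j)} \leq D} on {\Gamma}, as in the proof of Proposition 7; the names `dist`, `marginal` and `degree` are ours.

```python
from math import log, sqrt

r = sqrt(33)
alpha = 32 / (3 * (15 + r))
beta = 4 * (r - 1) / (3 * (15 + r))
gamma = (r - 1) ** 2 / (6 * (15 + r))

# Distribution of (X_1, X_2, X_3) on Gamma' described in the proof.
dist = {
    (2, 0, 0): gamma, (0, 2, 0): gamma, (0, 0, 2): gamma,
    (1, 1, 0): beta / 2, (0, 1, 1): beta / 2, (1, 0, 1): beta / 2,
}

def marginal(j):
    """Probabilities that X_j equals 0, 1, 2 respectively."""
    m = [0.0, 0.0, 0.0]
    for t, p in dist.items():
        m[t[j]] += p
    return m

prob = (alpha, beta, gamma)            # claimed marginal of each X_j
H_star = sum(p * log(1 / p) for p in prob)
D = 3 * H_star

def degree(t):
    """Left-hand side of the assumed form of (16) with w_1 = w_2 = w_3 = 1."""
    return sum(log(1 / prob[s]) for s in t)
```

In this check every tuple of {\Gamma'} has degree {D} up to rounding error, while the remaining tuple {(0,0,0)} has degree {3 \log \frac{1}{\alpha} \approx 1.995}, strictly below {D}.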

This statement already follows from the result of Kleinberg-Sawin-Speyer, which gives a “tri-colored sum-free set” in {\mathbb F_3^n} of size {\exp((H^*+o(1))n)}, as the slice rank of this tensor is an upper bound for the size of a tri-colored sum-free set. If one were to go over the proofs more carefully to evaluate the subexponential factors, this argument would give a stronger lower bound than KSS, as it does not deal with the substantial loss that comes from Behrend’s construction. However, because it actually constructs a set, the KSS result rules out more possible approaches to give an exponential improvement of the upper bound for capsets. The lower bound on slice rank shows that the bound cannot be improved using only the slice rank of this particular tensor, whereas KSS shows that the bound cannot be improved using any method that does not take advantage of the “single-colored” nature of the problem.

We can also show that the slice rank upper bound in a result of Naslund-Sawin is similarly sharp:

Proposition 10 Let {V_1^{\otimes n} = V_2^{\otimes n} = V_3^{\otimes n}} denote the space of functions from {\{0,1\}^n} to {\mathbb C}. Then the function {(x,y,z) \mapsto \prod_{i=1}^n (x_i+y_i+z_i-1)} from {\{0,1\}^n \times \{0,1\}^n \times \{0,1\}^n} to {\mathbb C}, viewed as an element of {V_1^{\otimes n} \otimes V_2^{\otimes n} \otimes V_3^{\otimes n}}, has slice rank {(3/2^{2/3})^n e^{o(n)}}.

Proof: Let {v_{1,0}=1} and {v_{1,1}=x} be a basis for the space {V_1} of functions on {\{0,1\}}, itself indexed by {S_1=\{0,1\}}. Choose similar bases for {V_2} and {V_3}, with {v_{2,0}=1, v_{2,1}=y} and {v_{3,0}=1,v_{3,1}=z-1}.

Set {\Gamma = \{(1,0,0),(0,1,0),(0,0,1)\}}. Then {x+y+z-1} is a linear combination of the {v_{1,t_1} \otimes v_{2,t_2} \otimes v_{3,t_3}} with {(t_1,t_2,t_3) \in \Gamma}, and all coefficients non-zero. Order {S_1,S_2,S_3} the usual way so that {\Gamma} is an antichain. We will show that the quantity {H} of (10) is {\log(3/2^{2/3})}, so that applying the last statement of Proposition 6, we conclude that the slice rank of the function under consideration is {\exp( (\log(3/2^{2/3})+o(1)) n)= (3/2^{2/3})^n e^{o(n)}}.

Let {(X_1,X_2,X_3)} be the random variable taking values in {\Gamma} that attains each of the values {(1,0,0),(0,1,0),(0,0,1)} with a probability of {1/3}. Then each of the {X_i} attains the value {1} with probability {1/3} and {0} with probability {2/3}, so

\displaystyle h(X_1)=h(X_2)=h(X_3) = (1/3) \log (3) + (2/3) \log(3/2) = \log 3 - (2/3) \log 2= \log (3/2^{2/3}).

Setting {w_1=w_2=w_3=1} and {D=3 \log(3/2^{2/3})=3 \log 3 - 2 \log 2}, we can verify the condition (16) with equality for all {(t_1,t_2,t_3) \in \Gamma'}, which from (17) gives {H=\log (3/2^{2/3})} as desired. \Box
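As with Proposition 9, both the decomposition of {x+y+z-1} and the entropy computation can be checked mechanically. This is a minimal sketch under the same assumption about the form of (16); the names `basis`, `tensor_sum` and `degrees` are ours.

```python
from itertools import product
from math import log

GAMMA = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

# Basis functions on {0,1}: v_{j,0} = 1, v_{1,1} = x, v_{2,1} = y, v_{3,1} = z - 1.
basis = [
    [lambda x: 1, lambda x: x],
    [lambda y: 1, lambda y: y],
    [lambda z: 1, lambda z: z - 1],
]

def tensor_sum(x, y, z):
    """Sum of v_{1,t1}(x) v_{2,t2}(y) v_{3,t3}(z) over (t1, t2, t3) in GAMMA."""
    return sum(
        basis[0][t1](x) * basis[1][t2](y) * basis[2][t3](z)
        for t1, t2, t3 in GAMMA
    )

# x + y + z - 1 is the sum of the three basis tensors indexed by GAMMA.
decomposition_ok = all(
    tensor_sum(x, y, z) == x + y + z - 1
    for x, y, z in product((0, 1), repeat=3)
)

# Uniform distribution on GAMMA: each X_j equals 1 with probability 1/3.
marg = {1: 1 / 3, 0: 2 / 3}
H = sum(p * log(1 / p) for p in marg.values())
D = 3 * H
degrees = [sum(log(1 / marg[s]) for s in t) for t in GAMMA]
```

Each entry of `degrees` equals {\log 3 + 2 \log(3/2) = 3 \log 3 - 2 \log 2}, matching {D}.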

We used a slightly different method in each of the last two results. In the first one, we use the most natural bases for all three vector spaces, and distinguish {\Gamma} from its set of maximal elements {\Gamma'}. In the second one we modify one basis element slightly, with {v_{3,1}=z-1} instead of the more obvious choice {z}, which allows us to work with {\Gamma = \{(1,0,0),(0,1,0),(0,0,1)\}} instead of {\Gamma=\{(1,0,0),(0,1,0),(0,0,1),(0,0,0)\}}. Because {\Gamma} is an antichain, we do not need to distinguish {\Gamma} and {\Gamma'}. Both methods in fact work with either problem, and they are both about equally difficult, but we include both as either might turn out to be substantially more convenient in future work.

Proposition 11 Let {k \geq 8} be a natural number and let {G} be a finite abelian group. Let {{\bf F}} be any field. Let {V_1 = \dots = V_k} denote the space of functions from {G} to {{\bf F}}.

Let {F} be any {{\bf F}}-valued function on {G^k} that is nonzero only when the {k} elements of {G} form a {k}-term arithmetic progression, and is nonzero on every {k}-term constant progression.

Then the slice rank of {F} is {|G|}.

Proof: We apply Proposition 4, using the standard bases of {V_1,\dots,V_k}. Let {\Gamma} be the support of {F}. Suppose that we have {k} orderings on {G} such that the constant progressions are maximal elements of {\Gamma} and thus all constant progressions lie in {\Gamma'}. Then for any partition {\Gamma_1,\dots, \Gamma_k} of {\Gamma'}, {\Gamma_j} can contain at most {|\pi_j(\Gamma_j)|} constant progressions, and as all {|G|} constant progressions must lie in one of the {\Gamma_j}, we must have {\sum_{j=1}^k |\pi_j(\Gamma_j)| \geq |G|}. By Proposition 4, this implies that the slice rank of {F} is at least {|G|}. Since {F} is a {|G| \times \dots \times |G|} tensor, the slice rank is at most {|G|}, hence exactly {|G|}.

So it is sufficient to find {k} orderings on {G} such that the constant progressions are maximal elements of {\Gamma}. We make several simplifying reductions: We may as well assume that {\Gamma} consists of all the {k}-term arithmetic progressions, because if the constant progressions are maximal among the set of all progressions then they are maximal among its subset {\Gamma}. So we are looking for an ordering in which the constant progressions are maximal among all {k}-term arithmetic progressions. We may as well assume that {G} is cyclic, because if for each cyclic group we have an ordering where constant progressions are maximal, on an arbitrary finite abelian group the lexicographic product of these orderings is an ordering for which the constant progressions are maximal. We may assume {k=8}, as if we have an {8}-tuple of orderings where constant progressions are maximal, we may add arbitrary orderings and the constant progressions will remain maximal.

So it is sufficient to find {8} orderings on the cyclic group {\mathbb Z/n} such that the constant progressions are maximal elements of the set of {8}-term progressions in {\mathbb Z/n} in the {8}-fold product ordering. To do that, let the first, second, third, and fifth orderings be the usual order on {\{0,\dots,n-1\}} and let the fourth, sixth, seventh, and eighth orderings be the reverse of the usual order on {\{0,\dots,n-1\}}.

Then let {(c,c,c,c,c,c,c,c)} be a constant progression and for contradiction assume that {(a,a+b,a+2b,a+3b,a+4b,a+5b,a+6b,a+7b)} is a progression greater than {(c,c,c,c,c,c,c,c)} in this ordering. We may assume that {c \in [0, (n-1)/2]}, because otherwise we may reverse the order of the progression, which has the effect of reversing all eight orderings, and then apply the transformation {x \rightarrow n-1-x}, which again reverses the eight orderings, bringing us back to the original problem but with {c \in [0,(n-1)/2]}.

Take a representative of the residue class {b} in the interval {[-n/2,n/2]}. We will abuse notation and call this {b}. Observe that {a+b, a+2b,} {a+3b}, and {a+5b} are all contained in the interval {[0,c]} modulo {n}. Take a representative of the residue class {a} in the interval {[0,c]}. Then {a+b} is in the interval {[mn,mn+c]} for some {m}. The distance between any distinct pair of intervals of this type is greater than {n/2}, but the distance between {a} and {a+b} is at most {n/2}, so {a+b} is in the interval {[0,c]}. By the same reasoning, {a+2b} is in the interval {[0,c]}. Therefore {|b| \leq c/2< n/4}. But then the distance between {a+2b} and {a+4b} is at most {n/2}, so by the same reasoning {a+4b} is in the interval {[0,c]}. Because {a+3b} is between {a+2b} and {a+4b}, it also lies in the interval {[0,c]}. Because {a+3b} is in the interval {[0,c]}, and by assumption it is congruent mod {n} to a number in the set {\{0,\dots,n-1\}} greater than or equal to {c}, it must be exactly {c}. Then, remembering that {a+2b} and {a+4b} lie in {[0,c]}, we have {c-b \leq c} and {c+b \leq c}, so {b=0}, hence {a=c}, thus {(a,\dots,a+7b)=(c,\dots,c)}, which contradicts the assumption that {(a,\dots,a+7b)>(c,\dots,c)}. \Box
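For small moduli, the maximality claim can also be confirmed by exhaustive search. The sketch below takes the orderings exactly as stated before the proof (usual order in the first, second, third and fifth coordinates, reversed order in the others) and looks for a progression that dominates a constant progression without being equal to it; the function names are ours, and the search is of course no substitute for the proof.

```python
USUAL = (0, 1, 2, 4)      # zero-based positions: first, second, third, fifth
REVERSED = (3, 5, 6, 7)   # zero-based positions: fourth, sixth, seventh, eighth

def dominates(a, b, c, n):
    """True if (a, a+b, ..., a+7b) mod n is >= (c, ..., c) in the product order."""
    terms = [(a + i * b) % n for i in range(8)]
    return (all(terms[i] >= c for i in USUAL)
            and all(terms[i] <= c for i in REVERSED))

def counterexamples(n):
    """Triples (a, b, c) giving a progression strictly above (c, ..., c)."""
    return [
        (a, b, c)
        for c in range(n) for a in range(n) for b in range(n)
        if dominates(a, b, c, n) and (a, b) != (c, 0)
    ]

# No counterexample should be found for any small modulus.
no_counterexamples = all(counterexamples(n) == [] for n in range(1, 31))
```

The search is cubic in {n}, so it is only practical for small moduli.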

In fact, given a {k}-term progression mod {n} and a constant, we can form a {k}-term binary sequence with a {1} for each step of the progression that is greater than the constant and a {0} for each step that is less. Because a rotation map, viewed as a dynamical system, has zero topological entropy, the number of {k}-term binary sequences that appear grows subexponentially in {k}. Hence there must be, for large enough {k}, at least one sequence that does not appear. In this proof we exploit a sequence that does not appear for {k=8}.

A capset in the vector space {{\bf F}_3^n} over the finite field {{\bf F}_3} of three elements is a subset {A} of {{\bf F}_3^n} that does not contain any lines {\{ x,x+r,x+2r\}}, where {x,r \in {\bf F}_3^n} and {r \neq 0}. A basic problem in additive combinatorics (discussed in one of the very first posts on this blog) is to obtain good upper and lower bounds for the maximal size of a capset in {{\bf F}_3^n}.

Trivially, one has {|A| \leq 3^n}. Using Fourier methods (and the density increment argument of Roth), the bound of {|A| \leq O( 3^n / n )} was obtained by Meshulam, and improved only as late as 2012 to {O( 3^n /n^{1+c})} for some absolute constant {c>0} by Bateman and Katz. But in a very recent breakthrough, Ellenberg (and independently Gijswijt) obtained the exponentially superior bound {|A| \leq O( 2.756^n )}, using a version of the polynomial method recently introduced by Croot, Lev, and Pach. (In the converse direction, a construction of Edel gives capsets as large as {(2.2174)^n}.) Given the success of the polynomial method in superficially similar problems such as the finite field Kakeya problem (discussed in this previous post), it was natural to wonder whether this method could be applicable to the cap set problem (see for instance this MathOverflow comment of mine on this from 2010), but it took a surprisingly long time before Croot, Lev, and Pach were able to identify the precise variant of the polynomial method that would actually work here.

The proof of the capset bound is very short (Ellenberg’s and Gijswijt’s preprints are both 3 pages long, and Croot-Lev-Pach is 6 pages), but I thought I would present a slight reformulation of the argument which treats the three points on a line in {{\bf F}_3} symmetrically (as opposed to treating the third point differently from the first two, as is done in the Ellenberg and Gijswijt papers; Croot-Lev-Pach also treat the middle point of a three-term arithmetic progression differently from the two endpoints, although this is a very natural thing to do in their context of {({\bf Z}/4{\bf Z})^n}). The basic starting point is this: if {A} is a capset, then one has the identity

\displaystyle \delta_{0^n}( x+y+z ) = \sum_{a \in A} \delta_a(x) \delta_a(y) \delta_a(z) \ \ \ \ \ (1)


for all {(x,y,z) \in A^3}, where {\delta_a(x) := 1_{a=x}} is the Kronecker delta function, which we view as taking values in {{\bf F}_3}. Indeed, (1) reflects the fact that the equation {x+y+z=0} has solutions precisely when {x,y,z} are either all equal, or form a line, and the latter is ruled out precisely when {A} is a capset.
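To see the identity (1) in action, one can test it on a small example. The sketch below uses the four-element set {\{0,1\}^2 \subset {\bf F}_3^2}, which is a capset, and contrasts it with a full line, where the identity fails; the helper names are ours.

```python
from itertools import product

def add(*vs):
    """Coordinatewise sum of vectors over F_3."""
    return tuple(sum(c) % 3 for c in zip(*vs))

def is_capset(A):
    """True if A contains no line {x, x+r, x+2r} with r != 0."""
    S = set(A)
    n = len(A[0])
    for x in S:
        for r in product(range(3), repeat=n):
            if any(r) and add(x, r) in S and add(x, r, r) in S:
                return False
    return True

def identity_holds(A):
    """Check (1) on A^3, with both sides read as elements of F_3."""
    zero = (0,) * len(A[0])
    for x, y, z in product(A, repeat=3):
        lhs = 1 if add(x, y, z) == zero else 0
        rhs = sum(1 for a in A if a == x == y == z) % 3
        if lhs != rhs:
            return False
    return True

cap = [(0, 0), (0, 1), (1, 0), (1, 1)]   # a capset in F_3^2
line = [(0, 0), (1, 0), (2, 0)]          # a full line, hence not a capset
```

On the line, the triple {((0,0),(1,0),(2,0))} sums to zero without being constant, which is exactly the kind of solution a capset excludes.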

To exploit (1), we will show that the left-hand side of (1) is “low rank” in some sense, while the right-hand side is “high rank”. Recall that a function {F: A \times A \rightarrow {\bf F}} taking values in a field {{\bf F}} is of rank one if it is non-zero and of the form {(x,y) \mapsto f(x) g(y)} for some {f,g: A \rightarrow {\bf F}}, and that the rank of a general function {F: A \times A \rightarrow {\bf F}} is the least number of rank one functions needed to express {F} as a linear combination. More generally, if {k \geq 2}, we define the rank of a function {F: A^k \rightarrow {\bf F}} to be the least number of “rank one” functions of the form

\displaystyle (x_1,\dots,x_k) \mapsto f(x_i) g(x_1,\dots,x_{i-1},x_{i+1},\dots,x_k)

for some {i=1,\dots,k} and some functions {f: A \rightarrow {\bf F}}, {g: A^{k-1} \rightarrow {\bf F}}, that are needed to generate {F} as a linear combination. For instance, when {k=3}, the rank one functions take the form {(x,y,z) \mapsto f(x) g(y,z)}, {(x,y,z) \mapsto f(y) g(x,z)}, {(x,y,z) \mapsto f(z) g(x,y)}, and linear combinations of {r} such rank one functions will give a function of rank at most {r}.

It is a standard fact in linear algebra that the rank of a diagonal matrix is equal to the number of non-zero entries. This phenomenon extends to higher dimensions:

Lemma 1 (Rank of diagonal hypermatrices) Let {k \geq 2}, let {A} be a finite set, let {{\bf F}} be a field, and for each {a \in A}, let {c_a \in {\bf F}} be a coefficient. Then the rank of the function

\displaystyle (x_1,\dots,x_k) \mapsto \sum_{a \in A} c_a \delta_a(x_1) \dots \delta_a(x_k) \ \ \ \ \ (2)


is equal to the number of non-zero coefficients {c_a}.

Proof: We induct on {k}. As mentioned above, the case {k=2} follows from standard linear algebra, so suppose now that {k>2} and the claim has already been proven for {k-1}.

It is clear that the function (2) has rank at most equal to the number of non-zero {c_a} (since the summands on the right-hand side are rank one functions), so it suffices to establish the lower bound. By deleting from {A} those elements {a \in A} with {c_a=0} (which cannot increase the rank), we may assume without loss of generality that all the {c_a} are non-zero. Now suppose for contradiction that (2) has rank at most {|A|-1}, then we obtain a representation

\displaystyle \sum_{a \in A} c_a \delta_a(x_1) \dots \delta_a(x_k)

\displaystyle = \sum_{i=1}^k \sum_{\alpha \in I_i} f_{i,\alpha}(x_i) g_{i,\alpha}( x_1,\dots,x_{i-1},x_{i+1},\dots,x_k) \ \ \ \ \ (3)


for some sets {I_1,\dots,I_k} of cardinalities adding up to at most {|A|-1}, and some functions {f_{i,\alpha}: A \rightarrow {\bf F}} and {g_{i,\alpha}: A^{k-1} \rightarrow {\bf F}}.

Consider the space of functions {h: A \rightarrow {\bf F}} that are orthogonal to all the {f_{k,\alpha}}, {\alpha \in I_k} in the sense that

\displaystyle \sum_{x \in A} f_{k,\alpha}(x) h(x) = 0

for all {\alpha \in I_k}. This space is a vector space whose dimension {d} is at least {|A| - |I_k|}. A basis of this space generates a {d \times |A|} coordinate matrix of full rank, which implies that there is at least one non-singular {d \times d} minor. This implies that there exists a function {h: A \rightarrow {\bf F}} in this space which is nowhere vanishing on some subset {A'} of {A} of cardinality at least {|A|-|I_k|}.

If we multiply (3) by {h(x_k)} and sum in {x_k}, we conclude that

\displaystyle \sum_{a \in A} c_a h(a) \delta_a(x_1) \dots \delta_a(x_{k-1})

\displaystyle = \sum_{i=1}^{k-1} \sum_{\alpha \in I_i} f_{i,\alpha}(x_i)\tilde g_{i,\alpha}( x_1,\dots,x_{i-1},x_{i+1},\dots,x_{k-1})

where

\displaystyle \tilde g_{i,\alpha}(x_1,\dots,x_{i-1},x_{i+1},\dots,x_{k-1})

\displaystyle := \sum_{x_k \in A} g_{i,\alpha}(x_1,\dots,x_{i-1},x_{i+1},\dots,x_k) h(x_k).

The right-hand side has rank at most {|A|-1-|I_k|}, since the summands are rank one functions. On the other hand, from the induction hypothesis the left-hand side has rank at least {|A|-|I_k|}, giving the required contradiction. \Box

On the other hand, we have the following (symmetrised version of a) beautifully simple observation of Croot, Lev, and Pach:

Lemma 2 On {({\bf F}_3^n)^3}, the rank of the function {(x,y,z) \mapsto \delta_{0^n}(x+y+z)} is at most {3N}, where

\displaystyle N := \sum_{a,b,c \geq 0: a+b+c=n, b+2c \leq 2n/3} \frac{n!}{a!b!c!}.

Proof: Using the identity {\delta_0(x) = 1 - x^2} for {x \in {\bf F}_3}, we have

\displaystyle \delta_{0^n}(x+y+z) = \prod_{i=1}^n (1 - (x_i+y_i+z_i)^2).

The right-hand side is clearly a polynomial of degree {2n} in {x,y,z}, which is then a linear combination of monomials

\displaystyle x_1^{i_1} \dots x_n^{i_n} y_1^{j_1} \dots y_n^{j_n} z_1^{k_1} \dots z_n^{k_n}

with {i_1,\dots,i_n,j_1,\dots,j_n,k_1,\dots,k_n \in \{0,1,2\}} with

\displaystyle i_1 + \dots + i_n + j_1 + \dots + j_n + k_1 + \dots + k_n \leq 2n.

In particular, from the pigeonhole principle, at least one of {i_1 + \dots + i_n, j_1 + \dots + j_n, k_1 + \dots + k_n} is at most {2n/3}.

Consider the contribution of the monomials for which {i_1 + \dots + i_n \leq 2n/3}. We can regroup this contribution as

\displaystyle \sum_\alpha f_\alpha(x) g_\alpha(y,z)

where {\alpha} ranges over those {(i_1,\dots,i_n) \in \{0,1,2\}^n} with {i_1 + \dots + i_n \leq 2n/3}, {f_\alpha} is the monomial

\displaystyle f_\alpha(x_1,\dots,x_n) := x_1^{i_1} \dots x_n^{i_n}

and {g_\alpha: {\bf F}_3^n \times {\bf F}_3^n \rightarrow {\bf F}_3} is some explicitly computable function whose exact form will not be of relevance to our argument. The number of such {\alpha} is equal to {N}, so this contribution has rank at most {N}. The remaining contributions arising from the cases {j_1 + \dots + j_n \leq 2n/3} and {k_1 + \dots + k_n \leq 2n/3} similarly have rank at most {N} (grouping the monomials so that each monomial is only counted once), so the claim follows. \Box

Upon restricting from {({\bf F}_3^n)^3} to {A^3}, the rank of {(x,y,z) \mapsto \delta_{0^n}(x+y+z)} is still at most {3N}. The two lemmas then combine to give the Ellenberg-Gijswijt bound

\displaystyle |A| \leq 3N.
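For concreteness, the quantity {N} from Lemma 2 can be computed exactly for moderate {n}. The sketch below evaluates the multinomial sum and, for small {n}, cross-checks it against a direct count of exponent vectors {(i_1,\dots,i_n) \in \{0,1,2\}^n} with {i_1+\dots+i_n \leq 2n/3}; the function names are ours.

```python
from itertools import product
from math import factorial

def N_formula(n):
    """Multinomial sum over a + b + c = n with b + 2c <= 2n/3."""
    total = 0
    for b in range(n + 1):
        for c in range(n + 1 - b):
            if 3 * (b + 2 * c) <= 2 * n:      # b + 2c <= 2n/3, in integers
                a = n - b - c
                total += factorial(n) // (factorial(a) * factorial(b) * factorial(c))
    return total

def N_bruteforce(n):
    """Number of exponent vectors in {0,1,2}^n with coordinate sum <= 2n/3."""
    return sum(1 for e in product(range(3), repeat=n) if 3 * sum(e) <= 2 * n)

# The bound |A| <= 3N beats the trivial bound 3^n once n is moderately large.
beats_trivial_at_30 = 3 * N_formula(30) < 3 ** 30
```

For very small {n} the bound {3N} is weaker than {3^n}; the gain only appears once the exponential saving outweighs the factor of {3}.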

All that remains is to compute the asymptotic behaviour of {N}. This can be done using the general tool of Cramer’s theorem, but can also be derived from Stirling’s formula (discussed in this previous post). Indeed, if {a = (\alpha+o(1)) n}, {b = (\beta+o(1)) n}, {c = (\gamma+o(1)) n} for some {\alpha,\beta,\gamma \geq 0} summing to {1}, Stirling’s formula gives

\displaystyle \frac{n!}{a!b!c!} = \exp( n (h(\alpha,\beta,\gamma) + o(1)) )

where {h} is the entropy function

\displaystyle h(\alpha,\beta,\gamma) = \alpha \log \frac{1}{\alpha} + \beta \log \frac{1}{\beta} + \gamma \log \frac{1}{\gamma}.

We then have

\displaystyle N = \exp( n (X + o(1)) )

where {X} is the maximum entropy {h(\alpha,\beta,\gamma)} subject to the constraints

\displaystyle \alpha,\beta,\gamma \geq 0; \alpha+\beta+\gamma=1; \beta+2\gamma \leq 2/3.

A routine Lagrange multiplier computation shows that the maximum occurs when

\displaystyle \alpha = \frac{32}{3(15 + \sqrt{33})}

\displaystyle \beta = \frac{4(\sqrt{33}-1)}{3(15+\sqrt{33})}

\displaystyle \gamma = \frac{(\sqrt{33}-1)^2}{6(15+\sqrt{33})}

and {h(\alpha,\beta,\gamma)} is approximately {1.013455}, giving rise to the claimed bound of {O( 2.756^n )}.
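The Lagrange multiplier computation can be double-checked numerically. At an interior maximum on the face {\beta+2\gamma = 2/3}, stationarity forces {\beta^2 = \alpha\gamma}, which together with the two linear constraints pins down the stated values. The sketch below checks these relations and compares the entropy against a crude grid search over the feasible region; the grid resolution is an arbitrary choice of ours.

```python
from math import exp, log, sqrt

r = sqrt(33)
alpha = 32 / (3 * (15 + r))
beta = 4 * (r - 1) / (3 * (15 + r))
gamma = (r - 1) ** 2 / (6 * (15 + r))

def h(a, b, c):
    """Entropy of the probability vector (a, b, c), natural logarithm."""
    return sum(p * log(1 / p) for p in (a, b, c) if p > 0)

h_max = h(alpha, beta, gamma)
growth = exp(h_max)          # base of the exponential bound

def grid_max(steps=300):
    """Largest entropy over grid points with b + 2c <= 2/3 and a, b, c > 0."""
    best = 0.0
    for i in range(1, steps):
        for j in range(1, steps - i):
            if 3 * (i + 2 * j) <= 2 * steps:   # b + 2c <= 2/3, in integers
                b, c = i / steps, j / steps
                best = max(best, h(1 - b - c, b, c))
    return best
```

The value of `growth` comes out just above {2.755}, consistent with the rounded constant {2.756} used in the bound.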

Remark 3 As noted in the Ellenberg and Gijswijt papers, the above argument extends readily to other fields than {{\bf F}_3} to control the maximal size of a subset of {{\bf F}^n} that has no non-trivial solutions to the equation {ax+by+cz=0}, where {a,b,c \in {\bf F}} are non-zero constants that sum to zero. Of course one replaces the function {(x,y,z) \mapsto \delta_{0^n}(x+y+z)} in Lemma 2 by {(x,y,z) \mapsto \delta_{0^n}(ax+by+cz)} in this case.

Remark 4 This symmetrised formulation suggests that one possible way to improve slightly on the numerical quantity {2.756} is to find a more efficient way to decompose {\delta_{0^n}(x+y+z)} into rank one functions; however, I was not able to do so (though such improvements are reminiscent of the Strassen type algorithms for fast matrix multiplication).

Remark 5 It is tempting to see if this method can get non-trivial upper bounds for sets {A} with no length {4} progressions, in (say) {{\bf F}_5^n}. One can run the above arguments, replacing the function

\displaystyle (x,y,z) \mapsto \delta_{0^n}(x+y+z)

by

\displaystyle (x,y,z,w) \mapsto \delta_{0^n}(x-2y+z) \delta_{0^n}(y-2z+w);

this leads to the bound {|A| \leq 4N} where

\displaystyle N := \sum_{a,b,c,d,e \geq 0: a+b+c+d+e=n, b+2c+3d+4e \leq 2n} \frac{n!}{a!b!c!d!e!}.

Unfortunately, {N} is asymptotic to {\frac{1}{2} 5^n} and so this bound is in fact slightly worse than the trivial bound {|A| \leq 5^n}! However, there is a slim chance that there is a more efficient way to decompose {\delta_{0^n}(x-2y+z) \delta_{0^n}(y-2z+w)} into rank one functions that would give a non-trivial bound on {A}. I experimented with a few possible such decompositions but unfortunately without success.
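The reason {N} is so close to {\frac{1}{2} 5^n} is a symmetry: {N} counts vectors in {\{0,\dots,4\}^n} with coordinate sum at most {2n}, and the reflection {v \mapsto (4-v_1,\dots,4-v_n)} swaps sums below {2n} with sums above {2n}. Hence {2N - 5^n} is exactly the number of vectors with sum equal to {2n}. A short sketch confirming this, with function names of our choosing:

```python
def count_by_sum(n, q=5):
    """counts[s] = number of vectors in {0,...,q-1}^n with coordinate sum s."""
    counts = [1]
    for _ in range(n):
        new = [0] * (len(counts) + q - 1)
        for s, c in enumerate(counts):
            for d in range(q):
                new[s + d] += c
        counts = new
    return counts

def N5(n):
    """The quantity N of Remark 5: vectors in {0,...,4}^n with sum <= 2n."""
    return sum(count_by_sum(n)[: 2 * n + 1])

# The symmetry identity 2N - 5^n = #{sum = 2n}, checked for small n.
symmetry_ok = all(
    2 * N5(n) - 5 ** n == count_by_sum(n)[2 * n] for n in range(1, 16)
)
```

In particular {4N > 2 \cdot 5^n} for every {n}, so this version of the argument can never beat the trivial bound.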

Remark 6 Return now to the capset problem. Since Lemma 1 is valid for any field {{\bf F}}, one could perhaps hope to get better bounds by viewing the Kronecker delta function {\delta} as taking values in another field than {{\bf F}_3}, such as the complex numbers {{\bf C}}. However, as soon as one works in a field of characteristic other than {3}, one can adjoin a cube root {\omega} of unity, and one now has the Fourier decomposition

\displaystyle \delta_{0^n}(x+y+z) = \frac{1}{3^n} \sum_{\xi \in {\bf F}_3^n} \omega^{\xi \cdot x} \omega^{\xi \cdot y} \omega^{\xi \cdot z}.

Moving to the Fourier basis, we conclude from Lemma 1 that the function {(x,y,z) \mapsto \delta_{0^n}(x+y+z)} on {{\bf F}_3^n} now has rank exactly {3^n}, and so one cannot improve upon the trivial bound of {|A| \leq 3^n} by this method using fields of characteristic other than three as the range field. So it seems one has to stick with {{\bf F}_3} (or the algebraic closure thereof).
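The Fourier decomposition above is easy to confirm numerically over {{\bf C}} for small {n}. The sketch below takes {\omega = e^{2\pi i/3}} and compares both sides on all of {({\bf F}_3^2)^3}; the function names and the choice {n=2} are ours.

```python
import cmath
from itertools import product

omega = cmath.exp(2j * cmath.pi / 3)   # primitive cube root of unity

def delta_side(x, y, z):
    """delta_{0^n}(x + y + z), viewed as a complex number."""
    return 1.0 if all((a + b + c) % 3 == 0 for a, b, c in zip(x, y, z)) else 0.0

def fourier_side(x, y, z):
    """3^{-n} times the sum over xi of omega^{xi.x} omega^{xi.y} omega^{xi.z}."""
    n = len(x)
    total = 0
    for xi in product(range(3), repeat=n):
        e = sum(k * (a + b + c) for k, a, b, c in zip(xi, x, y, z)) % 3
        total += omega ** e
    return total / 3 ** n

points = list(product(range(3), repeat=2))
max_error = max(
    abs(fourier_side(x, y, z) - delta_side(x, y, z))
    for x, y, z in product(points, repeat=3)
)
```

Each of the {3^n} summands is a product of a function of {x}, a function of {y} and a function of {z}, which is what makes the expansion diagonal in the Fourier basis.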

Thanks to Jordan Ellenberg and Ben Green for helpful discussions.