
Let ${G}$ be a finite set of order ${N}$; in applications ${G}$ will typically be something like a finite abelian group, such as the cyclic group ${{\bf Z}/N{\bf Z}}$. Let us define a ${1}$-bounded function to be a function ${f: G \rightarrow {\bf C}}$ such that ${|f(n)| \leq 1}$ for all ${n \in G}$. There are many seminorms ${\| \|}$ of interest that one places on functions ${f: G \rightarrow {\bf C}}$, such as the Gowers uniformity seminorms ${\| \|_k}$ for ${k \geq 1}$ (which are genuine norms for ${k \geq 2}$); many of these are bounded by ${1}$ on ${1}$-bounded functions, and all seminorms in this post will be implicitly assumed to obey this property.

In additive combinatorics, a significant role is played by inverse theorems, which abstractly take the following form for certain choices of seminorm ${\| \|}$, some parameters ${\eta, \varepsilon>0}$, and some class ${{\mathcal F}}$ of ${1}$-bounded functions:

Theorem 1 (Inverse theorem template) If ${f}$ is a ${1}$-bounded function with ${\|f\| \geq \eta}$, then there exists ${F \in {\mathcal F}}$ such that ${|\langle f, F \rangle| \geq \varepsilon}$, where ${\langle,\rangle}$ denotes the usual inner product

$\displaystyle \langle f, F \rangle := {\bf E}_{n \in G} f(n) \overline{F(n)}.$

Informally, one should think of ${\eta}$ as being somewhat small but fixed independently of ${N}$, ${\varepsilon}$ as being somewhat smaller but depending only on ${\eta}$ (and on the seminorm), and ${{\mathcal F}}$ as representing the “structured functions” for these choices of parameters. There is some flexibility in exactly how to choose the class ${{\mathcal F}}$ of structured functions, but intuitively an inverse theorem should become more powerful when this class is small. Accordingly, let us define the ${(\eta,\varepsilon)}$-entropy of the seminorm ${\| \|}$ to be the least cardinality of ${{\mathcal F}}$ for which such an inverse theorem holds. Seminorms with low entropy are ones for which inverse theorems can be expected to be a useful tool. This concept arose in some discussions I had with Ben Green many years ago, but never appeared in print, so I decided to record some observations we had on this concept here on this blog.

Lebesgue norms ${\| f\|_{L^p} := ({\bf E}_{n \in G} |f(n)|^p)^{1/p}}$ for ${1 < p < \infty}$ have exponentially large entropy (and so inverse theorems are not expected to be useful in this case):

Proposition 2 (${L^p}$ norm has exponentially large inverse entropy) Let ${1 < p < \infty}$ and ${0 < \eta < 1}$. Then the ${(\eta,\eta^p/4)}$-entropy of ${\| \|_{L^p}}$ is at most ${(1+8/\eta^p)^N}$. Conversely, for any ${\varepsilon>0}$, the ${(\eta,\varepsilon)}$-entropy of ${\| \|_{L^p}}$ is at least ${\exp( c \varepsilon^2 N)}$ for some absolute constant ${c>0}$.

Proof: If ${f}$ is ${1}$-bounded with ${\|f\|_{L^p} \geq \eta}$, then we have

$\displaystyle |\langle f, |f|^{p-2} f \rangle| \geq \eta^p$

and hence by the triangle inequality we have

$\displaystyle |\langle f, F \rangle| \geq \eta^p/2$

where ${F}$ is either the real or imaginary part of ${|f|^{p-2} f}$, which takes values in ${[-1,1]}$. If we let ${\tilde F}$ be ${F}$ rounded to the nearest multiple of ${\eta^p/4}$, then by the triangle inequality again we have

$\displaystyle |\langle f, \tilde F \rangle| \geq \eta^p/4.$

There are only at most ${1+8/\eta^p}$ possible values for each value ${\tilde F(n)}$ of ${\tilde F}$, and hence at most ${(1+8/\eta^p)^N}$ possible choices for ${\tilde F}$. This gives the first claim.

Now suppose that there is an ${(\eta,\varepsilon)}$-inverse theorem for some ${{\mathcal F}}$ of cardinality ${M}$. If we let ${f}$ be a random sign function (so the ${f(n)}$ are independent random variables taking values in ${-1,+1}$ with equal probability), then there is a random ${F \in {\mathcal F}}$ such that

$\displaystyle |\langle f, F \rangle| \geq \varepsilon$

and hence by the pigeonhole principle there is a deterministic ${F \in {\mathcal F}}$ such that

$\displaystyle {\bf P}( |\langle f, F \rangle| \geq \varepsilon ) \geq 1/M.$

On the other hand, from the Hoeffding inequality one has

$\displaystyle {\bf P}( |\langle f, F \rangle| \geq \varepsilon ) \ll \exp( - c \varepsilon^2 N )$

for some absolute constant ${c}$, hence

$\displaystyle M \geq \exp( c \varepsilon^2 N )$

as claimed. $\Box$
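To make the rounding construction in the first half of this proof concrete, here is a small numerical sketch in Python (the parameters and variable names are illustrative, not part of the argument) that builds the discretised dual function for the ${L^4}$ norm and checks the claimed correlation:

```python
import numpy as np

# A minimal numerical sketch (illustrative, not part of the proof): for the
# L^p norm with p = 4, build the rounded dual function \tilde F from a random
# 1-bounded f and check the correlation |<f, \tilde F>| >= eta^p/4.
rng = np.random.default_rng(0)
N, p, eta = 1000, 4, 0.5

f = rng.uniform(-1, 1, N) + 1j * rng.uniform(-1, 1, N)
f /= np.maximum(np.abs(f), 1.0)                # enforce 1-boundedness
inner = lambda a, b: np.mean(a * np.conj(b))   # <a,b> = E_n a(n) conj(b(n))

assert np.mean(np.abs(f) ** p) ** (1 / p) >= eta   # ||f||_{L^p} >= eta

g = np.abs(f) ** (p - 2) * f                   # the dual function |f|^{p-2} f
# take whichever of the real/imaginary parts carries at least half of the
# correlation |<f, g>| = E|f|^p >= eta^p
F = g.real if abs(inner(f, g.real)) >= abs(inner(f, g.imag)) else g.imag
Ftilde = np.round(F / (eta ** p / 4)) * (eta ** p / 4)   # round to the grid

print(abs(inner(f, Ftilde)), eta ** p / 4)     # first value should dominate
```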

Most seminorms of interest in additive combinatorics, such as the Gowers uniformity norms, are bounded by some finite ${L^p}$ norm thanks to Hölder’s inequality, so from the above proposition and the obvious monotonicity properties of entropy, we conclude that all Gowers norms on finite abelian groups ${G}$ have at most exponential inverse theorem entropy. But we can do significantly better than this:

• For the ${U^1}$ seminorm ${\|f\|_{U^1(G)} := |{\bf E}_{n \in G} f(n)|}$, one can simply take ${{\mathcal F} = \{1\}}$ to consist of the constant function ${1}$, and the ${(\eta,\eta)}$-entropy is clearly equal to ${1}$ for any ${0 < \eta < 1}$.
• For the ${U^2}$ norm, the standard Fourier-analytic inverse theorem asserts that if ${\|f\|_{U^2(G)} \geq \eta}$ then ${|\langle f, e(\xi \cdot) \rangle| \geq \eta^2}$ for some Fourier character ${\xi \in \hat G}$. Thus the ${(\eta,\eta^2)}$-entropy is at most ${N}$. (A numerical sketch of this inverse theorem appears after this list.)
• For the ${U^k({\bf Z}/N{\bf Z})}$ norm on cyclic groups for ${k > 2}$, the inverse theorem proved by Green, Ziegler, and myself gives an ${(\eta,\varepsilon)}$-inverse theorem for some ${\varepsilon \gg_{k,\eta} 1}$ and ${{\mathcal F}}$ consisting of nilsequences ${n \mapsto F(g(n) \Gamma)}$ for some filtered nilmanifold ${G/\Gamma}$ of degree ${k-1}$ in a finite collection of cardinality ${O_{\eta,k}(1)}$, some polynomial sequence ${g: {\bf Z} \rightarrow G}$ (which, as subsequently observed by Candela and Sisask (see also Manners), one can choose to be ${N}$-periodic), and some Lipschitz function ${F: G/\Gamma \rightarrow {\bf C}}$ of Lipschitz norm ${O_{\eta,k}(1)}$. By the Arzelà-Ascoli theorem, the number of possible ${F}$ (up to uniform errors of size at most ${\varepsilon/2}$, say) is ${O_{\eta,k}(1)}$. By standard arguments one can also ensure that the coefficients of the polynomial ${g}$ are ${O_{\eta,k}(1)}$, and then by periodicity there are only ${O(N^{O_{\eta,k}(1)})}$ such polynomials. As a consequence, the ${(\eta,\varepsilon)}$-entropy is of polynomial size ${O_{\eta,k}( N^{O_{\eta,k}(1)} )}$ (a fact that seems to have first been implicitly observed in Lemma 6.2 of this paper of Frantzikinakis; thanks to Ben Green for this reference). One can obtain more precise dependence on ${\eta,k}$ using the quantitative version of this inverse theorem due to Manners; back-of-the-envelope calculations using Section 5 of that paper suggest to me that one can take ${\varepsilon = \eta^{O_k(1)}}$ to be polynomial in ${\eta}$ and the entropy to be of the order ${O_k( N^{\exp(\exp(\eta^{-O_k(1)}))} )}$, or alternatively one can reduce the entropy to ${O_k( \exp(\exp(\eta^{-O_k(1)})) N^{\eta^{-O_k(1)}})}$ at the cost of degrading ${\varepsilon}$ to ${1/\exp\exp( O(\eta^{-O(1)}))}$.
• If one replaces the cyclic group ${{\bf Z}/N{\bf Z}}$ by a vector space ${{\bf F}_p^n}$ over some fixed finite field ${{\bf F}_p}$ of prime order (so that ${N=p^n}$), then the inverse theorem of Ziegler and myself (available in both high and low characteristic) allows one to obtain an ${(\eta,\varepsilon)}$-inverse theorem for some ${\varepsilon \gg_{k,\eta} 1}$ and ${{\mathcal F}}$ the collection of non-classical degree ${k-1}$ polynomial phases from ${{\bf F}_p^n}$ to ${S^1}$, which one can normalize to equal ${1}$ at the origin, and then by the classification of such polynomials one can calculate that the ${(\eta,\varepsilon)}$ entropy is of quasipolynomial size ${\exp( O_{p,k}(n^{k-1}) ) = \exp( O_{p,k}( \log^{k-1} N ) )}$ in ${N}$. By using the recent work of Gowers and Milicevic, one can make the dependence on ${p,k}$ here more precise, but we will not perform these calculations here.
• For the ${U^3(G)}$ norm on an arbitrary finite abelian group, the recent inverse theorem of Jamneshan and myself gives (after some calculations) a bound of the polynomial form ${O( q^{O(n^2)} N^{\exp(\eta^{-O(1)})})}$ on the ${(\eta,\varepsilon)}$-entropy for some ${\varepsilon \gg \eta^{O(1)}}$, which one can improve slightly to ${O( q^{O(n^2)} N^{\eta^{-O(1)}})}$ if one degrades ${\varepsilon}$ to ${1/\exp(\eta^{-O(1)})}$, where ${q}$ is the maximal order of an element of ${G}$, and ${n}$ is the rank (the number of elements needed to generate ${G}$). This bound is polynomial in ${N}$ in the cyclic group case and quasipolynomial in general.
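As a concrete illustration of the ${U^2}$ item above, the following sketch (with illustrative parameters) computes ${\|f\|_{U^2}}$ on ${{\bf Z}/N{\bf Z}}$ via the Fourier identity ${\|f\|_{U^2}^4 = \sum_\xi |\hat f(\xi)|^4}$ and locates a character correlating with ${f}$ at the level ${\|f\|_{U^2}^2}$:

```python
import numpy as np

# Numerical sketch of the U^2 inverse theorem on Z/NZ: by Plancherel one has
# ||f||_{U^2}^4 = sum_xi |fhat(xi)|^4, and sum_xi |fhat(xi)|^2 <= 1 for
# 1-bounded f, so max_xi |fhat(xi)| >= ||f||_{U^2}^2.
rng = np.random.default_rng(1)
N = 512
n = np.arange(N)
# a linear phase plus noise, normalised to stay 1-bounded
f = 0.7 * np.exp(2j * np.pi * 37 * n / N) + 0.3 * rng.uniform(-1, 1, N)
f /= np.maximum(np.abs(f), 1.0)

fhat = np.fft.fft(f) / N                 # fhat(xi) = E_n f(n) e(-xi n / N)
u2 = np.sum(np.abs(fhat) ** 4) ** 0.25   # ||f||_{U^2}
xi = np.argmax(np.abs(fhat))             # the structured character
print(xi, np.abs(fhat[xi]), u2 ** 2)     # correlation >= ||f||_{U^2}^2
```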

For general finite abelian groups ${G}$, we do not yet have an inverse theorem of comparable power to the ones mentioned above that give polynomial or quasipolynomial upper bounds on the entropy. However, there is a cheap argument that at least gives some subexponential bounds:

Proposition 3 (Cheap subexponential bound) Let ${k \geq 2}$ and ${0 < \eta < 1/2}$, and let ${G}$ be a finite abelian group of order ${N}$. Then the ${(\eta,c_k \eta^{O_k(1)})}$-entropy of ${\| \|_{U^k(G)}}$ is at most ${O( \exp( \eta^{-O_k(1)} N^{1 - \frac{k+1}{2^k-1}} ))}$ for some ${c_k > 0}$.

Proof: (Sketch) We use a standard random sampling argument, of the type used for instance by Croot-Sisask or Briet-Gopi (thanks to Ben Green for this latter reference). We can assume that ${N \geq \eta^{-C_k}}$ for some sufficiently large ${C_k>0}$, since otherwise the claim follows from Proposition 2.

Let ${A}$ be a random subset of ${G}$, with the events ${n \in A}$ being iid with some probability ${0 < p < 1}$ to be chosen later, conditioned on the event ${|A| \leq 2pN}$. Let ${f}$ be a ${1}$-bounded function. By a standard second moment calculation, we see that with probability at least ${1/2}$, we have

$\displaystyle \|f\|_{U^k(G)}^{2^k} = {\bf E}_{n, h_1,\dots,h_k \in G} f(n) \prod_{\omega \in \{0,1\}^k \backslash \{0\}} {\mathcal C}^{|\omega|} \frac{1}{p} 1_A f(n + \omega \cdot h)$

$\displaystyle + O((\frac{1}{N^{k+1} p^{2^k-1}})^{1/2}).$

Thus, by the triangle inequality, if we choose ${p := C \eta^{-2^{k+1}/(2^k-1)} / N^{\frac{k+1}{2^k-1}}}$ for some sufficiently large ${C = C_k > 0}$, then for any ${1}$-bounded ${f}$ with ${\|f\|_{U^k(G)} \geq \eta/2}$, one has with probability at least ${1/2}$ that

$\displaystyle |{\bf E}_{n, h_1,\dots,h_k \in G} f(n) \prod_{\omega \in \{0,1\}^k \backslash \{0\}} {\mathcal C}^{|\omega|} \frac{1}{p} 1_A f(n + \omega \cdot h)|$

$\displaystyle \geq \eta^{2^k}/2^{2^k+1}.$

We can write the left-hand side as ${|\langle f, F \rangle|}$ where ${F}$ is the randomly sampled dual function

$\displaystyle F(n) := {\bf E}_{h_1,\dots,h_k \in G} \prod_{\omega \in \{0,1\}^k \backslash \{0\}} {\mathcal C}^{|\omega|+1} \frac{1}{p} 1_A f(n + \omega \cdot h).$

Unfortunately, ${F}$ is not ${1}$-bounded in general, but we have

$\displaystyle \|F\|_{L^2(G)}^2 \leq {\bf E}_{n, h_1,\dots,h_k ,h'_1,\dots,h'_k \in G}$

$\displaystyle \prod_{\omega \in \{0,1\}^k \backslash \{0\}} \frac{1}{p} 1_A(n + \omega \cdot h) \frac{1}{p} 1_A(n + \omega \cdot h')$

and the right-hand side can be shown to be ${1+o(1)}$ on the average, so we can condition on the event that the right-hand side is ${O(1)}$ without significant loss in failure probability.

If we then let ${\tilde f_A}$ be ${1_A f}$ rounded to the nearest Gaussian integer multiple of ${\eta^{2^k}/2^{2^{10k}}}$ in the unit disk, one has from the triangle inequality that

$\displaystyle |\langle f, \tilde F \rangle| \geq \eta^{2^k}/2^{2^k+2}$

where ${\tilde F}$ is the discretised randomly sampled dual function

$\displaystyle \tilde F(n) := {\bf E}_{h_1,\dots,h_k \in G} \prod_{\omega \in \{0,1\}^k \backslash \{0\}} {\mathcal C}^{|\omega|+1} \frac{1}{p} \tilde f_A(n + \omega \cdot h).$

For any given ${A}$, there are at most ${2pN}$ places ${n}$ where ${\tilde f_A(n)}$ can be non-zero, and in those places there are ${O_k( \eta^{-2^{k}})}$ possible values for ${\tilde f_A(n)}$. Thus, if we let ${{\mathcal F}_A}$ be the collection of all possible ${\tilde F}$ associated to a given ${A}$, the cardinality of this set is ${O( \exp( \eta^{-O_k(1)} N^{1 - \frac{k+1}{2^k-1}} ) )}$, and for any ${f}$ with ${\|f\|_{U^k(G)} \geq \eta/2}$, we have

$\displaystyle \sup_{\tilde F \in {\mathcal F}_A} |\langle f, \tilde F \rangle| \geq \eta^{2^k}/2^{2^k+2}$

with probability at least ${1/2}$.

Now we remove the failure probability by independent resampling. By rounding to the nearest Gaussian integer multiple of ${c_k \eta^{2^k}}$ in the unit disk for a sufficiently small ${c_k>0}$, one can find a family ${{\mathcal G}}$ of cardinality ${O( \eta^{-O_k(N)})}$ consisting of ${1}$-bounded functions ${\tilde f}$ of ${U^k(G)}$ norm at least ${\eta/2}$ such that for every ${1}$-bounded ${f}$ with ${\|f\|_{U^k(G)} \geq \eta}$ there exists ${\tilde f \in {\mathcal G}}$ such that

$\displaystyle \|f-\tilde f\|_{L^\infty(G)} \leq \eta^{2^k}/2^{2^k+3}.$

Now, let ${A_1,\dots,A_M}$ be independent samples of ${A}$ for some ${M}$ to be chosen later. By the preceding discussion, we see that with probability at least ${1 - 2^{-M}}$, we have

$\displaystyle \sup_{\tilde F \in \bigcup_{j=1}^M {\mathcal F}_{A_j}} |\langle \tilde f, \tilde F \rangle| \geq \eta^{2^k}/2^{2^k+2}$

for any given ${\tilde f \in {\mathcal G}}$, so by the union bound, if we choose ${M = \lfloor C N \log \frac{1}{\eta} \rfloor}$ for a large enough ${C = C_k}$, we can find ${A_1,\dots,A_M}$ such that

$\displaystyle \sup_{\tilde F \in \bigcup_{j=1}^M {\mathcal F}_{A_j}} |\langle \tilde f, \tilde F \rangle| \geq \eta^{2^k}/2^{2^k+2}$

for all ${\tilde f \in {\mathcal G}}$, and hence by the triangle inequality

$\displaystyle \sup_{\tilde F \in \bigcup_{j=1}^M {\mathcal F}_{A_j}} |\langle f, \tilde F \rangle| \geq \eta^{2^k}/2^{2^k+3}.$

Taking ${{\mathcal F}}$ to be the union of the ${{\mathcal F}_{A_j}}$ (applying some truncation and rescaling to these ${L^2}$-bounded functions to make them ${L^\infty}$-bounded, and then ${1}$-bounded), we obtain the claim. $\Box$
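To illustrate the sampling step in the simplest case ${k=2}$, here is a toy computation (all parameters illustrative) showing that replacing the shifted copies of ${f}$ by ${\frac{1}{p} 1_A f}$ only slightly perturbs the Gowers average:

```python
import numpy as np

# Toy illustration of the random sampling step for k = 2: replacing the
# shifted copies of f in the U^2 Gowers average by (1/p) 1_A f perturbs the
# average by roughly (N^3 p^3)^{-1/2} plus a small diagonal bias O(1/(pN)).
rng = np.random.default_rng(2)
N, p = 64, 0.5
n = np.arange(N)
f = np.exp(2j * np.pi * 3 * n / N)       # a linear phase: ||f||_{U^2} = 1
g = (rng.random(N) < p) / p * f          # the sampled copy (1/p) 1_A f

total = 0.0 + 0.0j
for h1 in range(N):
    for h2 in range(N):
        total += np.sum(f * np.conj(g[(n + h1) % N])
                          * np.conj(g[(n + h2) % N])
                          * g[(n + h1 + h2) % N])
sampled = total / N ** 3
print(sampled.real, abs(sampled - 1.0))  # close to ||f||_{U^2}^4 = 1
```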

One way to obtain lower bounds on the inverse theorem entropy is to produce a collection of almost orthogonal functions with large norm. More precisely:

Proposition 4 Let ${\| \|}$ be a seminorm, let ${0 < \varepsilon \leq \eta < 1}$, and suppose that one has a collection ${f_1,\dots,f_M}$ of ${1}$-bounded functions with ${\|f_i\| \geq \eta}$ for all ${i=1,\dots,M}$, such that for each ${i}$ one has ${|\langle f_i, f_j \rangle| \leq \varepsilon^2/2}$ for all but at most ${L}$ choices of ${j \in \{1,\dots,M\}}$. Then the ${(\eta, \varepsilon)}$-entropy of ${\| \|}$ is at least ${\varepsilon^2 M / 2L}$.

Proof: Suppose we have an ${(\eta,\varepsilon)}$-inverse theorem with some family ${{\mathcal F}}$. Then for each ${i=1,\dots,M}$ there is ${F_i \in {\mathcal F}}$ such that ${|\langle f_i, F_i \rangle| \geq \varepsilon}$. By the pigeonhole principle, there is thus ${F \in {\mathcal F}}$ such that ${|\langle f_i, F \rangle| \geq \varepsilon}$ for all ${i}$ in a subset ${I}$ of ${\{1,\dots,M\}}$ of cardinality at least ${M/|{\mathcal F}|}$:

$\displaystyle |I| \geq M / |{\mathcal F}|.$

We can sum this to obtain

$\displaystyle |\sum_{i \in I} c_i \langle f_i, F \rangle| \geq |I| \varepsilon$

for some complex numbers ${c_i}$ of unit magnitude. By Cauchy-Schwarz, this implies

$\displaystyle \| \sum_{i \in I} c_i f_i \|_{L^2(G)}^2 \geq |I|^2 \varepsilon^2$

and hence by the triangle inequality

$\displaystyle \sum_{i,j \in I} |\langle f_i, f_j \rangle| \geq |I|^2 \varepsilon^2.$

On the other hand, by hypothesis we can bound the left-hand side by ${|I| (L + \varepsilon^2 |I|/2)}$. Rearranging, we conclude that

$\displaystyle |I| \leq 2 L / \varepsilon^2$

and hence

$\displaystyle |{\mathcal F}| \geq \varepsilon^2 M / 2L$

giving the claim. $\Box$

Thus for instance:

• For the ${U^2(G)}$ norm, one can take ${f_1,\dots,f_M}$ to be the family of linear exponential phases ${n \mapsto e(\xi \cdot n)}$ with ${M = N}$ and ${L=1}$, and obtain a linear lower bound of ${\varepsilon^2 N/2}$ for the ${(\eta,\varepsilon)}$-entropy, thus matching the upper bound of ${N}$ up to constants when ${\varepsilon}$ is fixed.
• For the ${U^k({\bf Z}/N{\bf Z})}$ norm, a similar calculation using polynomial phases of degree ${k-1}$, combined with the Weyl sum estimates, gives a lower bound of ${\gg_{k,\varepsilon} N^{k-1}}$ for the ${(\eta,\varepsilon)}$-entropy for any fixed ${\eta,\varepsilon}$; by considering nilsequences as well, together with nilsequence equidistribution theory, one can replace the exponent ${k-1}$ here by some quantity that goes to infinity as ${\eta \rightarrow 0}$, though I have not attempted to calculate the exact rate.
• For the ${U^k({\bf F}_p^n)}$ norm, another similar calculation using polynomial phases of degree ${k-1}$ should give a lower bound of ${\gg_{p,k,\eta,\varepsilon} \exp( c_{p,k,\eta,\varepsilon} n^{k-1} )}$ for the ${(\eta,\varepsilon)}$-entropy, though I have not fully performed the calculation.

We close with one final example. Suppose ${G}$ is a product ${G = A \times B}$ of two sets ${A,B}$ of cardinality ${\asymp \sqrt{N}}$, and we consider the Gowers box norm

$\displaystyle \|f\|_{\Box^2(G)}^4 := {\bf E}_{a,a' \in A; b,b' \in B} f(a,b) \overline{f}(a,b') \overline{f}(a',b) f(a',b').$

One possible choice of class ${{\mathcal F}}$ here is the collection of indicators ${1_{U \times V}}$ of “rectangles” ${U \times V}$ with ${U \subset A}$, ${V \subset B}$ (cf. this previous blog post on cut norms). By standard calculations, one can use this class to show that the ${(\eta, \eta^4/10)}$-entropy of ${\| \|_{\Box^2(G)}}$ is ${O( \exp( O(\sqrt{N}) ) )}$, and a variant of the proof of the second part of Proposition 2 shows that this is the correct order of growth in ${N}$. In contrast, a modification of Proposition 3 only gives an upper bound of the form ${O( \exp( O( N^{2/3} ) ) )}$ (the bottleneck is ensuring that the randomly sampled dual functions stay bounded in ${L^2}$), which shows that while this cheap bound is not optimal, it can still broadly give the correct “type” of bound (specifically, intermediate growth between polynomial and exponential).

Let ${P(z) = z^n + a_{n-1} z^{n-1} + \dots + a_0}$ be a monic polynomial of degree ${n}$ with complex coefficients. Then by the fundamental theorem of algebra, we can factor ${P}$ as

$\displaystyle P(z) = (z-z_1) \dots (z-z_n) \ \ \ \ \ (1)$

for some complex zeroes ${z_1,\dots,z_n}$ (possibly with repetition).

Now suppose we evolve ${P}$ with respect to time by heat flow, creating a function ${P(t,z)}$ of two variables with given initial data ${P(0,z) = P(z)}$ for which

$\displaystyle \partial_t P(t,z) = \partial_{zz} P(t,z). \ \ \ \ \ (2)$

On the space of polynomials of degree at most ${n}$, the operator ${\partial_{zz}}$ is nilpotent, and one can solve this equation explicitly both forwards and backwards in time by the Taylor series

$\displaystyle P(t,z) = \sum_{j=0}^\infty \frac{t^j}{j!} \partial_z^{2j} P(0,z).$

For instance, if one starts with a quadratic ${P(0,z) = z^2 + bz + c}$, then the polynomial evolves by the formula

$\displaystyle P(t,z) = z^2 + bz + (c+2t).$

As the polynomial ${P(t)}$ evolves in time, the zeroes ${z_1(t),\dots,z_n(t)}$ evolve also. Assuming for sake of discussion that the zeroes are simple, the inverse function theorem tells us that the zeroes will (locally, at least) evolve smoothly in time. What are the dynamics of this evolution?

For instance, in the quadratic case, the quadratic formula tells us that the zeroes are

$\displaystyle z_1(t) = \frac{-b + \sqrt{b^2 - 4(c+2t)}}{2}$

and

$\displaystyle z_2(t) = \frac{-b - \sqrt{b^2 - 4(c+2t)}}{2}$

after arbitrarily choosing a branch of the square root. If ${b,c}$ are real and the discriminant ${b^2 - 4c}$ is initially positive, we see that we start with two real zeroes centred around ${-b/2}$, which then approach each other until time ${t = \frac{b^2-4c}{8}}$, at which point the roots collide and then move off from each other in an imaginary direction.

In the general case, we can obtain the equations of motion by implicitly differentiating the defining equation

$\displaystyle P( t, z_i(t) ) = 0$

in time using (2) to obtain

$\displaystyle \partial_{zz} P( t, z_i(t) ) + \partial_t z_i(t) \partial_z P(t,z_i(t)) = 0.$

To simplify notation we drop the explicit dependence on time, thus

$\displaystyle \partial_{zz} P(z_i) + (\partial_t z_i) \partial_z P(z_i)= 0.$

From (1) and the product rule, we see that

$\displaystyle \partial_z P( z_i ) = \prod_{j:j \neq i} (z_i - z_j)$

and

$\displaystyle \partial_{zz} P( z_i ) = 2 \sum_{k:k \neq i} \prod_{j:j \neq i,k} (z_i - z_j)$

(where all indices are understood to range over ${1,\dots,n}$) leading to the equations of motion

$\displaystyle \partial_t z_i = \sum_{k:k \neq i} \frac{2}{z_k - z_i}, \ \ \ \ \ (3)$

at least when one avoids those times in which there is a repeated zero. In the case when the zeroes ${z_i}$ are real, each term ${\frac{2}{z_k-z_i}}$ represents a (first-order) attraction in the dynamics between ${z_i}$ and ${z_k}$, but the dynamics are more complicated for complex zeroes (e.g. purely imaginary zeroes will experience repulsion rather than attraction, as one already sees in the quadratic example). Curiously, this system resembles that of Dyson Brownian motion (except with the Brownian motion part removed, and time reversed). I learned of the connection between the ODE (3) and the heat equation from this paper of Csordas, Smith, and Varga, but perhaps it has been mentioned in earlier literature as well.
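The equations of motion (3) are easy to test numerically; the following sketch (assuming simple real zeroes, with an illustrative initial configuration) evolves a polynomial by the finite Taylor series above and compares its roots with one small Euler step of (3):

```python
import math
import numpy as np

def heat_flow(P, t):
    """P(t,z) = sum_j t^j/j! d^{2j}P/dz^{2j}; a finite sum, since repeated
    differentiation eventually annihilates a polynomial."""
    Q, term, j = np.poly1d([0.0]), np.poly1d(P), 0
    while np.any(term.coeffs != 0):
        Q = Q + (t ** j / math.factorial(j)) * term
        term = term.deriv(2)              # apply d^2/dz^2
        j += 1
    return Q

def euler_step(x, dt):
    """One Euler step of the equations of motion (3)."""
    v = np.array([sum(2.0 / (x[k] - x[i]) for k in range(len(x)) if k != i)
                  for i in range(len(x))])
    return x + dt * v

x0 = np.array([-2.0, 0.5, 1.0, 3.0])      # simple real zeroes
P0 = np.poly1d(x0, r=True)                # monic polynomial with these zeroes
dt = 1e-4
print(np.sort(np.roots(heat_flow(P0, dt).coeffs).real))
print(np.sort(euler_step(x0, dt)))        # the two agree up to O(dt^2)
```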

One interesting consequence of these equations is that if the zeroes are real at some time, then they will stay real as long as the zeroes do not collide. Let us now restrict attention to the case of real simple zeroes, in which case we will rename the zeroes as ${x_i}$ instead of ${z_i}$, and order them as ${x_1 < \dots < x_n}$. The evolution

$\displaystyle \partial_t x_i = \sum_{k:k \neq i} \frac{2}{x_k - x_i}$

can now be thought of as reverse gradient flow for the “entropy”

$\displaystyle H := -\sum_{i,j: i \neq j} \log |x_i - x_j|,$

(which is also essentially the logarithm of the discriminant of the polynomial) since we have

$\displaystyle \partial_t x_i = \frac{\partial H}{\partial x_i}.$

In particular, we have the monotonicity formula

$\displaystyle \partial_t H = 4E$

where ${E}$ is the “energy”

$\displaystyle E := \frac{1}{4} \sum_i (\frac{\partial H}{\partial x_i})^2$

$\displaystyle = \sum_i (\sum_{k:k \neq i} \frac{1}{x_k-x_i})^2$

$\displaystyle = \sum_{i,k: i \neq k} \frac{1}{(x_k-x_i)^2} + 2 \sum_{i,j,k: i,j,k \hbox{ distinct}} \frac{1}{(x_k-x_i)(x_j-x_i)}$

$\displaystyle = \sum_{i,k: i \neq k} \frac{1}{(x_k-x_i)^2}$

where in the last line we use the antisymmetrisation identity

$\displaystyle \frac{1}{(x_k-x_i)(x_j-x_i)} + \frac{1}{(x_i-x_j)(x_k-x_j)} + \frac{1}{(x_j-x_k)(x_i-x_k)} = 0.$

Among other things, this shows that as one goes backwards in time, the entropy decreases, and so no collisions can occur to the past, only in the future, which is of course consistent with the attractive nature of the dynamics. As ${H}$ is a convex function of the positions ${x_1,\dots,x_n}$, one expects ${H}$ to also evolve in a convex manner in time, that is to say the energy ${E}$ should be increasing. This is indeed the case:

Exercise 1 Show that

$\displaystyle \partial_t E = 2 \sum_{i,j: i \neq j} (\frac{2}{(x_i-x_j)^2} - \sum_{k: i,j,k \hbox{ distinct}} \frac{1}{(x_k-x_i)(x_k-x_j)})^2.$

Symmetric polynomials of the zeroes are polynomial functions of the coefficients and should thus evolve in a polynomial fashion. One can compute this explicitly in simple cases. For instance, the center of mass is an invariant:

$\displaystyle \partial_t \frac{1}{n} \sum_i x_i = 0.$

The variance decreases linearly:

Exercise 2 Establish the virial identity

$\displaystyle \partial_t \sum_{i,j} (x_i-x_j)^2 = - 4n^2(n-1).$

As the variance (which is proportional to ${\sum_{i,j} (x_i-x_j)^2}$) cannot become negative, this identity shows that “finite time blowup” must occur – that the zeroes must collide at or before the time ${\frac{1}{4n^2(n-1)} \sum_{i,j} (x_i-x_j)^2}$.
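As a quick sanity check on this identity, one can compare both sides numerically at a single configuration:

```python
import numpy as np

# Numerical check of the virial identity of Exercise 2 at one configuration:
# d/dt sum_{i,j} (x_i - x_j)^2 = 2 sum_{i,j} (x_i - x_j)(v_i - v_j), with v
# given by the equations of motion (3); this should equal -4 n^2 (n-1).
x = np.array([-2.0, 0.5, 1.0, 3.0])
n = len(x)
v = np.array([sum(2.0 / (x[k] - x[i]) for k in range(n) if k != i)
              for i in range(n)])
dS = 2 * sum((x[i] - x[j]) * (v[i] - v[j])
             for i in range(n) for j in range(n))
print(dS, -4 * n ** 2 * (n - 1))          # both equal -192 here
```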

Exercise 3 Show that the Stieltjes transform

$\displaystyle s(t,z) = \sum_i \frac{1}{x_i - z}$

solves the viscous Burgers equation

$\displaystyle \partial_t s = \partial_{zz} s - 2 s \partial_z s,$

either by using the original heat equation (2) and the identity ${s = - \partial_z P / P}$, or else by using the equations of motion (3). This relation between the Burgers equation and the heat equation is known as the Cole-Hopf transformation.

The paper of Csordas, Smith, and Varga mentioned previously gives some other bounds on the lifespan of the dynamics; roughly speaking, they show that if there is one pair of zeroes that are much closer to each other than to the other zeroes then they must collide in a short amount of time (unless there is a collision occurring even earlier at some other location). Their argument extends also to situations where there are an infinite number of zeroes, which they apply to get new results on Newman’s conjecture in analytic number theory. I would be curious to know of further places in the literature where this dynamics has been studied.

Let ${X}$ and ${Y}$ be two random variables taking values in the same (discrete) range ${R}$, and let ${E}$ be some subset of ${R}$, which we think of as the set of “bad” outcomes for either ${X}$ or ${Y}$. If ${X}$ and ${Y}$ have the same probability distribution, then clearly

$\displaystyle {\bf P}( X \in E ) = {\bf P}( Y \in E ).$

In particular, if it is rare for ${Y}$ to lie in ${E}$, then it is also rare for ${X}$ to lie in ${E}$.

If ${X}$ and ${Y}$ do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance ${\delta(X,Y)}$ between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

$\displaystyle {\bf P}(Y \in E) - \delta(X,Y) \leq {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \delta(X,Y) \ \ \ \ \ (1)$

for any ${E \subset R}$. In particular, if it is rare for ${Y}$ to lie in ${E}$, and ${X,Y}$ are close in total variation, then it is also rare for ${X}$ to lie in ${E}$.

A basic inequality in information theory is Pinsker’s inequality

$\displaystyle \delta(X,Y) \leq \sqrt{\frac{1}{2} D_{KL}(X||Y)}$

where the Kullback-Leibler divergence ${D_{KL}(X||Y)}$ is defined by the formula

$\displaystyle D_{KL}(X||Y) = \sum_{x \in R} {\bf P}( X=x ) \log \frac{{\bf P}(X=x)}{{\bf P}(Y=x)}.$

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that ${D_{KL}(X||Y)}$ is non-negative (Gibbs’ inequality), and vanishes if and only if ${X}$, ${Y}$ have the same distribution; thus one can think of ${D_{KL}(X||Y)}$ as a measure of how close the distributions of ${X}$ and ${Y}$ are to each other, although one should caution that this is not a symmetric notion of distance, as ${D_{KL}(X||Y) \neq D_{KL}(Y||X)}$ in general. Inserting Pinsker’s inequality into (1), we see for instance that

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \sqrt{\frac{1}{2} D_{KL}(X||Y)}.$

Thus, if ${X}$ is close to ${Y}$ in the Kullback-Leibler sense, and it is rare for ${Y}$ to lie in ${E}$, then it is rare for ${X}$ to lie in ${E}$ as well.

We can specialise this inequality to the case when ${Y}$ is a uniform random variable ${U}$ on a finite range ${R}$ of some cardinality ${N}$, in which case the Kullback-Leibler divergence ${D_{KL}(X||U)}$ simplifies to

$\displaystyle D_{KL}(X||U) = \log N - {\bf H}(X)$

where

$\displaystyle {\bf H}(X) := \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}$

is the Shannon entropy of ${X}$. Again, a routine application of Jensen’s inequality shows that ${{\bf H}(X) \leq \log N}$, with equality if and only if ${X}$ is uniformly distributed on ${R}$. The above inequality then becomes

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(U \in E) + \sqrt{\frac{1}{2}(\log N - {\bf H}(X))}. \ \ \ \ \ (2)$

Thus, if ${E}$ is a small fraction of ${R}$ (so that it is rare for ${U}$ to lie in ${E}$), and the entropy of ${X}$ is very close to the maximum possible value of ${\log N}$, then it is rare for ${X}$ to lie in ${E}$ also.

The inequality (2) is only useful when the entropy ${{\bf H}(X)}$ is close to ${\log N}$ in the sense that ${{\bf H}(X) = \log N - O(1)}$, otherwise the bound is worse than the trivial bound of ${{\bf P}(X \in E) \leq 1}$. In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy ${{\bf H}(X)}$ was allowed to be smaller than ${\log N - O(1)}$. More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality) Let ${X}$ be a random variable taking values in a finite range ${R}$ of cardinality ${N}$, let ${U}$ be a uniformly distributed random variable in ${R}$, and let ${E}$ be a subset of ${R}$. Then

$\displaystyle {\bf P}(X \in E) \leq \frac{(\log N - {\bf H}(X)) + \log 2}{\log 1/{\bf P}(U \in E)}.$

Proof: Consider the conditional entropy ${{\bf H}(X | 1_{X \in E} )}$. On the one hand, we have

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf H}(X, 1_{X \in E}) - {\bf H}(1_{X \in E} )$

$\displaystyle = {\bf H}(X) - {\bf H}(1_{X \in E})$

$\displaystyle \geq {\bf H}(X) - \log 2$

by Jensen’s inequality. On the other hand, one has

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf P}(X \in E) {\bf H}(X | X \in E )$

$\displaystyle + (1-{\bf P}(X \in E)) {\bf H}(X | X \not \in E)$

$\displaystyle \leq {\bf P}(X \in E) \log |E| + (1-{\bf P}(X \in E)) \log N$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{N}{|E|}$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{1}{{\bf P}(U \in E)},$

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. $\Box$
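The lemma is easy to test numerically; the following sketch (with illustrative parameters) compares ${{\bf P}(X \in E)}$ against the bound of Lemma 1 for a random variable that is uniform inside a small set ${E}$ with some probability and uniform outside ${E}$ otherwise:

```python
import numpy as np

# Numerical check of Lemma 1 (illustrative parameters): X places mass q on a
# small set E and is uniform elsewhere; the lemma's bound must exceed q.
N, sizeE, q = 1000, 10, 0.3
p = np.full(N, (1 - q) / (N - sizeE))
p[:sizeE] = q / sizeE                    # identify E with the first outcomes

H = -np.sum(p * np.log(p))               # Shannon entropy H(X), natural log
PUE = sizeE / N                          # P(U in E) for uniform U
bound = (np.log(N) - H + np.log(2)) / np.log(1 / PUE)
print(q, bound)                          # 0.3 <= bound (about 0.32 here)
```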

Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

$\displaystyle {\bf P}(X \in E) \leq \frac{D(X||Y) + \log 2}{\log 1/{\bf P}(Y \in E)}$

for arbitrary random variables ${X,Y}$ taking values in the same discrete range ${R}$, which follows from the data processing inequality

$\displaystyle D( f(X)||f(Y)) \leq D(X|| Y)$

for arbitrary functions ${f}$, applied to the indicator function ${f = 1_E}$. Indeed one has

$\displaystyle D( 1_E(X) || 1_E(Y) ) = {\bf P}(X \in E) \log \frac{{\bf P}(X \in E)}{{\bf P}(Y \in E)}$

$\displaystyle + {\bf P}(X \not \in E) \log \frac{{\bf P}(X \not \in E)}{{\bf P}(Y \not \in E)}$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - h( {\bf P}(X \in E) )$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - \log 2$

where ${h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}}$ is the entropy function.

Thus, for instance, if one has

$\displaystyle {\bf H}(X) \geq \log N - o(K)$

and

$\displaystyle {\bf P}(U \in E) \leq \exp( - K )$

for some ${K}$ much larger than ${1}$ (so that ${1/K = o(1)}$), then

$\displaystyle {\bf P}(X \in E) = o(1).$

More informally: if the entropy of ${X}$ is somewhat close to the maximum possible value of ${\log N}$, and it is exponentially rare for a uniform variable to lie in ${E}$, then it is still somewhat rare for ${X}$ to lie in ${E}$. The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable ${X}$ which is uniformly distributed inside a small set ${E}$ with some probability ${p}$ and uniformly distributed outside of ${E}$ with probability ${1-p}$, for some parameter ${0 \leq p \leq 1}$.

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

$\displaystyle F(U) \approx {\bf E} F(U)$

with exponentially high probability, where ${U}$ is a uniform distribution and ${F}$ is some reasonable function of ${U}$. Combining this with the above lemma, we can then obtain approximations of the form

$\displaystyle F(X) \approx {\bf E} F(U) \ \ \ \ \ (3)$

with somewhat high probability, if the entropy of ${X}$ is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable ${X}$ did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

$\displaystyle \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+j+p) 1_{p|n+j}$

$\displaystyle \approx \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+j+p)}{p}$

for “most” choices of ${n}$ and a suitable choice of ${H}$ (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as ${\sum_{n \leq x} \frac{\lambda(n)\lambda(n+1)}{n}}$ through the multiplicativity of ${\lambda}$, while the right-hand side, being a linear correlation involving two parameters ${j,p}$ rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.

There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object ${A}$; each loop of the algorithm replaces ${A}$ with some better version ${A'}$ of itself, until some desired property of ${A}$ is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object ${A}$ often degrade some other finitary quantity ${B}$ in the process, and an infinite iteration would then have the undesirable effect of making ${B}$ infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a monotonicity property, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

• The mass increment argument. This is perhaps the most familiar way to ensure termination: make each improved object ${A'}$ “heavier” than the previous one ${A}$ by some non-trivial amount (e.g. by ensuring that the cardinality of ${A'}$ is strictly greater than that of ${A}$, thus ${|A'| \geq |A|+1}$). Dually, one can try to force the amount of “mass” remaining “outside” of ${A}$ in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of ${A}$ that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. ${A}$ is maximal in some ${L^1}$ sense), and this should force some desirable property on ${A}$.
• The density increment argument. This is a variant of the mass increment argument, in which one increments the “density” of ${A}$ rather than the “mass”. For instance, ${A}$ might be contained in some ambient space ${P}$, and one seeks to improve ${A}$ to ${A'}$ (and ${P}$ to ${P'}$) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. ${|A'|/|P'| \geq |A|/|P| + c}$ for some ${c>0}$). On the other hand, the density of ${A}$ is clearly bounded above by ${1}$. As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside ${A}$, and then “zoom in” to a region of increased density by reducing ${A}$ and ${P}$ appropriately. Eventually no further usable density fluctuation remains (i.e. ${A}$ is uniformly distributed), and this should force some desirable property on ${A}$.
• The energy increment argument. This is an “${L^2}$” analogue of the “${L^1}$“-based mass increment argument (or the “${L^\infty}$“-based density increment argument), in which one seeks to increment the amount of “energy” that ${A}$ captures from some reference object ${X}$, or (equivalently) to decrement the amount of energy of ${X}$ which is still “orthogonal” to ${A}$. Here ${A}$ and ${X}$ are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of ${A}$ remains to be added (i.e. ${A}$ is maximal in some ${L^2}$ sense), and this should force some desirable property on ${A}$.
• The rank reduction argument. Here, one seeks to make each new object ${A'}$ to have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within ${A}$, and use them to collapse ${A}$ to an object ${A'}$ of lower rank. Eventually, no further usable collisions within ${A}$ remain, and this should force some desirable property on ${A}$.

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various monotonicity formulae for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of concentration compactness, which is used heavily in PDE, do have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an entropy compression argument, and only applies to probabilistic algorithms which require a certain collection ${R}$ of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object ${A}$ (which may also have been generated randomly) and some portion of the random string ${R}$ to (deterministically) create a better object ${A'}$ (and a shorter random string ${R'}$, formed by throwing away those bits of ${R}$ that were used in the loop). The key point is to design the algorithm to be partially reversible, in the sense that given ${A'}$ and ${R'}$ and some additional data ${H'}$ that logs the cumulative history of the algorithm up to this point, one can reconstruct ${A}$ together with the remaining portion ${R}$ not already contained in ${R'}$. Thus, each stage of the argument compresses the information-theoretic content of the string ${A+R}$ into the string ${A'+R'+H'}$ in a lossless fashion. However, a random variable such as ${A+R}$ cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of ${A+R}$, and if the length of ${A'+R'+H'}$ is significantly less than that of ${A+R}$ (i.e. we need the marginal growth in the length of the history file ${H'}$ per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object ${A}$ which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of ${A}$. Here, the failure of the algorithm to halt leads to new information that can be used to “compress” ${A}$ (or more precisely, the full state ${A+R}$) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.

As many readers may already know, my good friend and fellow mathematical blogger Tim Gowers, having wrapped up work on the Princeton Companion to Mathematics (which I believe is now in press), has begun another mathematical initiative, namely a “Tricks Wiki” to act as a repository for mathematical tricks and techniques.    Tim has already started the ball rolling with several seed articles on his own blog, and asked me to also contribute some articles.  (As I understand it, these articles will be migrated to the Wiki in a few months, once it is fully set up, and then they will evolve with edits and contributions by anyone who wishes to pitch in, in the spirit of Wikipedia; in particular, articles are not intended to be permanently authored or signed by any single contributor.)

So today I’d like to start by extracting some material from an old post of mine on “Amplification, arbitrage, and the tensor power trick” (as well as from some of the comments), and converting it to the Tricks Wiki format, while also taking the opportunity to add a few more examples.

Title: The tensor power trick

Quick description: If one wants to prove an inequality $X \leq Y$ for some non-negative quantities X, Y, but can only see how to prove a quasi-inequality $X \leq CY$ that loses a multiplicative constant C, then try to replace all objects involved in the problem by “tensor powers” of themselves and apply the quasi-inequality to those powers.  If all goes well, one can show that $X^M \leq C Y^M$ for all $M \geq 1$, with a constant C which is independent of M, which implies that $X \leq Y$ as desired by taking $M^{th}$ roots and then taking limits as $M \to \infty$.

Having established the monotonicity of the Perelman reduced volume in the previous lecture (after first heuristically justifying this monotonicity in Lecture 9), we now show how this can be used to establish $\kappa$-noncollapsing of Ricci flows, thus giving a second proof of Theorem 2 from Lecture 7. Of course, we already proved (a stronger version of) this theorem in Lecture 8, using the Perelman entropy, but this second proof is also important, because the reduced volume is a more localised quantity (due to the weight $e^{-l_{(0,x_0)}}$ in its definition), and so one can in fact establish local versions of the non-collapsing theorem, which turn out to be important when we study ancient $\kappa$-noncollapsed solutions later in Perelman’s proof, because such solutions need not be compact and so cannot be controlled by global quantities (such as the Perelman entropy).

The route to $\kappa$-noncollapsing via reduced volume proceeds by the following scheme:

Non-collapsing at time t=0 (1)

$\Downarrow$

Large reduced volume at time t=0 (2)

$\Downarrow$

Large reduced volume at later times t (3)

$\Downarrow$

Non-collapsing at later times t (4)

The implication $(2) \implies (3)$ is the monotonicity of the Perelman reduced volume. In this lecture we discuss the other two implications, $(1) \implies (2)$ and $(3) \implies (4)$.

Our arguments here are based on Perelman’s first paper, Kleiner-Lott’s notes, and Morgan-Tian’s book, though the material in the Morgan-Tian book differs in some key respects from the other two texts. A closely related presentation of these topics also appears in the paper of Cao-Zhu.

It occurred to me recently that the mathematical blog medium may be a good venue not just for expository “short stories” on mathematical concepts or results, but also for more technical discussions of individual mathematical “tricks”, which would otherwise not be significant enough to warrant a publication-length (and publication-quality) article. So I thought today that I would discuss the amplification trick in harmonic analysis and combinatorics (and in particular, in the study of estimates); this trick takes an established estimate involving an arbitrary object (such as a function f), and obtains a stronger (or amplified) estimate by transforming the object in a well-chosen manner (often involving some new parameters) into a new object, applying the estimate to that new object, and seeing what that estimate says about the original object (after optimising the parameters or taking a limit). The amplification trick works particularly well for estimates which enjoy some sort of symmetry on one side of the estimate that is not represented on the other side; indeed, it can be viewed as a way to “arbitrage” differing amounts of symmetry between the left- and right-hand sides of an estimate. It can also be used in the contrapositive, amplifying a weak counterexample to an estimate into a strong counterexample. This trick also sheds some light as to why dimensional analysis works; an estimate which is not dimensionally consistent can often be amplified into a stronger estimate which is dimensionally consistent; in many cases, this new estimate is so strong that it cannot in fact be true, and thus dimensionally inconsistent inequalities tend to be either false or inefficient, which is why we rarely see them. (More generally, any inequality on which a group acts on either the left or right-hand side can often be “decomposed” into the “isotypic components” of the group action, either by the amplification trick or by other related tools, such as Fourier analysis.)

The amplification trick is a deceptively simple one, but it can become particularly powerful when one is arbitraging an unintuitive symmetry, such as symmetry under tensor powers. Indeed, the “tensor power trick”, which can eliminate constants and even logarithms in an almost magical manner, can lead to some interesting proofs of sharp inequalities, which are difficult to establish by more direct means.

The most familiar example of the amplification trick in action is probably the textbook proof of the Cauchy-Schwarz inequality

$|\langle v, w \rangle| \leq \|v\| \|w\|$ (1)

for vectors v, w in a complex Hilbert space. To prove this inequality, one might start by exploiting the obvious inequality

$\|v-w\|^2 \geq 0$ (2)

but after expanding everything out, one only gets the weaker inequality

$\hbox{Re} \langle v, w \rangle \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$. (3)

Now (3) is weaker than (1) for two reasons; the left-hand side is smaller, and the right-hand side is larger (thanks to the arithmetic mean-geometric mean inequality). However, we can amplify (3) by arbitraging some symmetry imbalances. Firstly, observe that the phase rotation symmetry $v \mapsto e^{i\theta} v$ preserves the RHS of (3) but not the LHS. We exploit this by replacing v by $e^{i\theta} v$ in (3) for some phase $\theta$ to be chosen later, to obtain

$\hbox{Re} e^{i\theta} \langle v, w \rangle \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$.

Now we are free to choose $\theta$ at will (as long as it is real, of course), so it is natural to choose $\theta$ to optimise the inequality, which in this case means to make the left-hand side as large as possible. This is achieved by choosing $e^{i\theta}$ to cancel the phase of $\langle v, w \rangle$, and we obtain

$|\langle v, w \rangle| \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$. (4)

This is closer to (1); we have fixed the left-hand side, but the right-hand side is still too weak. But we can amplify further, by exploiting an imbalance in a different symmetry, namely the homogenisation symmetry $(v,w) \mapsto (\lambda v, \frac{1}{\lambda} w)$ for a scalar $\lambda > 0$, which preserves the left-hand side but not the right. Inserting this transform into (4) we conclude that

$|\langle v, w \rangle| \leq \frac{\lambda^2}{2} \|v\|^2 + \frac{1}{2\lambda^2} \|w\|^2$

where $\lambda > 0$ is at our disposal to choose. We can optimise in $\lambda$ by minimising the right-hand side, and indeed one easily sees that the minimum (or infimum, if one of v and w vanishes) is $\|v\| \|w\|$ (which is achieved when $\lambda = \sqrt{\|w\|/\|v\|}$ when $v,w$ are non-zero, or in an asymptotic limit $\lambda \to 0$ or $\lambda \to \infty$ in the degenerate cases), and so we have amplified our way to the Cauchy-Schwarz inequality (1). [See also this discussion by Tim Gowers on the Cauchy-Schwarz inequality.]
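Both optimisation steps can be seen numerically; in the following sketch with random vectors (all names illustrative), the optimised right-hand side collapses exactly to $\|v\| \|w\|$:

```python
import numpy as np

# Numerical illustration of the two amplification steps: after the phase
# rotation (absorbed into taking |<v,w>|) and the optimal scaling
# lambda = sqrt(||w||/||v||), the right-hand side of (4) becomes ||v|| ||w||.
rng = np.random.default_rng(3)
v = rng.normal(size=4) + 1j * rng.normal(size=4)
w = rng.normal(size=4) + 1j * rng.normal(size=4)

ip = np.vdot(w, v)                       # <v, w> = sum_i v_i conj(w_i)
nv, nw = np.linalg.norm(v), np.linalg.norm(w)
lam = np.sqrt(nw / nv)                   # the optimising choice of lambda
rhs = lam ** 2 / 2 * nv ** 2 + nw ** 2 / (2 * lam ** 2)
print(abs(ip), rhs, nv * nw)             # |<v,w>| <= rhs = ||v|| ||w||
```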

[This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”. – T.]

The entropy-influence conjecture seeks to relate two somewhat different measures of how concentrated the Fourier coefficients of a boolean function are, namely the total influence and the entropy.

We begin by defining the total influence. Let $\{-1,+1\}^n$ be the discrete cube, i.e. the set of $\pm 1$ vectors $(x_1,\ldots,x_n)$ of length n. A boolean function is any function $f: \{-1,+1\}^n \to \{-1,+1\}$ from the discrete cube to {-1,+1}. One can think of such functions as “voting methods”, which take the preferences of n voters (+1 for yes, -1 for no) as input and return a yes/no verdict as output. For instance, if n is odd, the “majority vote” function $\hbox{sgn}(x_1+\ldots+x_n)$ returns +1 if there are more +1 variables than -1, or -1 otherwise, whereas if $1 \leq k \leq n$, the “$k^{th}$ dictator” function returns the value $x_k$ of the $k^{th}$ variable.

We give the cube $\{-1,+1\}^n$ the uniform probability measure $\mu$ (thus we assume that the n voters vote randomly and independently). Given any boolean function f and any variable $1 \leq k \leq n$, define the influence $I_k(f)$ of the $k^{th}$ variable to be the quantity

$I_k(f) := \mu \{ x \in \{-1,+1\}^n: f(\sigma_k(x)) \neq f(x) \}$

where $\sigma_k(x)$ is the element of the cube formed by flipping the sign of the $k^{th}$ variable. Informally, $I_k(f)$ measures the probability that the $k^{th}$ voter could actually determine the outcome of an election; it is sometimes referred to as the Banzhaf power index. The total influence I(f) of f (also known as the average sensitivity and the edge-boundary density) is then defined as

$I(f) := \sum_{k=1}^n I_k(f).$

Thus for instance a dictator function has total influence 1, whereas majority vote has total influence comparable to $\sqrt{n}$. The influence can range between 0 (for constant functions +1, -1) and n (for the parity function $x_1 \ldots x_n$ or its negation). If f has mean zero (i.e. it is equal to +1 half of the time), then the edge-isoperimetric inequality asserts that $I(f) \geq 1$ (with equality if and only if there is a dictatorship), whilst the Kahn-Kalai-Linial (KKL) theorem asserts that $I_k(f) \gg \frac{\log n}{n}$ for some k. There is a result of Friedgut that if $I(f)$ is bounded by A (say) and $\varepsilon > 0$, then f is within a distance $\varepsilon$ (in $L^1$ norm) of another boolean function g which only depends on $O_{A,\varepsilon}(1)$ of the variables (such functions are known as juntas).
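For small n these influences can be computed exhaustively; here is a brute-force sketch (illustrative, for n = 5):

```python
import itertools
import numpy as np

def total_influence(f, n):
    """I(f) = sum_k P_x( f(flip_k x) != f(x) ) over uniform x in {-1,+1}^n."""
    count = 0
    for x in itertools.product((-1, 1), repeat=n):
        x = np.array(x)
        fx = f(x)
        for k in range(n):
            y = x.copy()
            y[k] = -y[k]                  # flip the k-th coordinate
            count += (f(y) != fx)
    return count / 2 ** n

n = 5
print(total_influence(lambda x: x[0], n))              # dictator: 1.0
print(total_influence(lambda x: int(np.prod(x)), n))   # parity: n = 5.0
print(total_influence(lambda x: np.sign(x.sum()), n))  # majority: 1.875
```

The majority value 1.875 is comparable to $\sqrt{5} \approx 2.24$, consistent with the $\sqrt{n}$ behaviour mentioned above.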

[This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”. – T.]

This is a problem in discrete and convex geometry. It seeks to quantify the intuitively obvious fact that large convex bodies are so “fat” that they cannot avoid “detection” by a small number of observation points. More precisely, we fix a dimension d and make the following definition (introduced by Haussler and Welzl):

• Definition: Let $X \subset {\Bbb R}^d$ be a finite set of points, and let $0 < \epsilon < 1$. We say that a finite set $Y \subset {\Bbb R}^d$ is a weak $\epsilon$-net for X (with respect to convex bodies) if, whenever B is a convex body which is large in the sense that $|B \cap X| > \epsilon |X|$, then B contains at least one point of Y. (If Y is contained in X, we say that Y is a strong $\epsilon$-net for X with respect to convex bodies.)

For example, in one dimension, if $X = \{1,\ldots,N\}$, and $Y = \{ \epsilon N, 2 \epsilon N, \ldots, k \epsilon N \}$ where k is the integer part of $1/\epsilon$, then Y is a weak $\epsilon$-net for X with respect to convex bodies. Thus we see that even when the original set X is very large, one can create an $\epsilon$-net of size as small as $O(1/\epsilon)$. Strong $\epsilon$-nets are of importance in computational learning theory, and are fairly well understood via Vapnik-Chervonenkis (or VC) theory; however, the theory of weak $\epsilon$-nets is still not completely satisfactory.
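Since convex bodies in one dimension are just intervals, this example can be verified by brute force; a sketch with illustrative parameters:

```python
import numpy as np

# Brute-force check of the one-dimensional example: convex bodies in R are
# intervals, so it suffices to test every interval [a, b] that captures more
# than eps*N points of X = {1,...,N}.
N, eps = 200, 0.1
X = np.arange(1, N + 1)
step = int(eps * N)
Y = np.arange(step, N + 1, step)         # {eps N, 2 eps N, ..., k eps N}

ok = all(((Y >= a) & (Y <= b)).any()
         for a in X for b in X[X >= a]
         if (b - a + 1) > eps * N)
print(ok)                                # True: Y is a weak eps-net
```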

One can ask what happens in higher dimensions, for instance when X is a discrete cube $X = \{1,\ldots,N\}^d$. It is not too hard to cook up $\epsilon$-nets of size $O_d(1/\epsilon^d)$ (by using tools such as Minkowski’s theorem), but in fact one can create $\epsilon$-nets of size as small as $O( \frac{1}{\epsilon} \log \frac{1}{\epsilon} )$ simply by taking a random subset of X of this cardinality and observing that “up to errors of $\epsilon$“, the total number of essentially different ways a convex body can meet X grows at most polynomially in $1/\epsilon$. (This is a very typical application of the probabilistic method.) On the other hand, since X can contain roughly $1/\epsilon$ disjoint convex bodies, each of which contains at least $\epsilon$ of the points in X, we see that no $\epsilon$-net can have size much smaller than $1/\epsilon$.

Now consider the situation in which X is now an arbitrary finite set, rather than a discrete cube. More precisely, let $f(\epsilon,d)$ be the least number such that every finite set X possesses at least one weak $\epsilon$-net for X with respect to convex bodies of cardinality at most $f(\epsilon,d)$. (One can also replace the finite set X with an arbitrary probability measure; the two formulations are equivalent.) Informally, f is the least number of “guards” one needs to place to prevent a convex body from covering more than $\epsilon$ of any given territory.

• Problem 1: For fixed d, what is the correct rate of growth of f as $\epsilon \to 0$?