
There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object ${A}$; each loop of the algorithm replaces ${A}$ with some better version ${A'}$ of itself, until some desired property of ${A}$ is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object ${A}$ often degrade some other finitary quantity ${B}$ in the process, and an infinite iteration would then have the undesirable effect of making ${B}$ infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a monotonicity property, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

• The mass increment argument. This is perhaps the most familiar way to ensure termination: make each improved object ${A'}$ “heavier” than the previous one ${A}$ by some non-trivial amount (e.g. by ensuring that the cardinality of ${A'}$ is strictly greater than that of ${A}$, thus ${|A'| \geq |A|+1}$). Dually, one can try to force the amount of “mass” remaining “outside” of ${A}$ in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of ${A}$ that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. ${A}$ is maximal in some ${L^1}$ sense), and this should force some desirable property on ${A}$.
• The density increment argument. This is a variant of the mass increment argument, in which one increments the “density” of ${A}$ rather than the “mass”. For instance, ${A}$ might be contained in some ambient space ${P}$, and one seeks to improve ${A}$ to ${A'}$ (and ${P}$ to ${P'}$) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. ${|A'|/|P'| \geq |A|/|P| + c}$ for some ${c>0}$). On the other hand, the density of ${A}$ is clearly bounded above by ${1}$. As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside ${A}$, and then “zoom in” to a region of increased density by reducing ${A}$ and ${P}$ appropriately. Eventually no further usable density fluctuation remains (i.e. ${A}$ is uniformly distributed), and this should force some desirable property on ${A}$.
• The energy increment argument. This is an “${L^2}$” analogue of the “${L^1}$”-based mass increment argument (or the “${L^\infty}$”-based density increment argument), in which one seeks to increment the amount of “energy” that ${A}$ captures from some reference object ${X}$, or (equivalently) to decrement the amount of energy of ${X}$ which is still “orthogonal” to ${A}$. Here ${A}$ and ${X}$ are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of ${A}$ remains to be added (i.e. ${A}$ is maximal in some ${L^2}$ sense), and this should force some desirable property on ${A}$.
• The rank reduction argument. Here, one seeks to make each new object ${A'}$ have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within ${A}$, and use them to collapse ${A}$ to an object ${A'}$ of lower rank. Eventually, no further usable collisions within ${A}$ remain, and this should force some desirable property on ${A}$.
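As a toy illustration of the first (mass increment) flavour, consider the greedy construction of a maximal matching in a graph: each loop adds one edge to ${A}$, so the “mass” ${|A|}$ increases by exactly one while staying bounded by half the number of vertices, forcing termination; on halting, no usable edge remains, which is exactly the maximality property desired. Here is a minimal Python sketch (the function name and input format are illustrative, not taken from any particular source):

```python
# Mass increment in action: greedily grow a matching A one edge at a time.
# Each loop strictly increases |A|, which is bounded above by n/2, so the
# algorithm halts; on halting, A is a maximal matching.

def greedy_maximal_matching(n, edges):
    """n vertices labelled 0..n-1; edges is a list of vertex pairs."""
    A = []                # current matching (the object being improved)
    used = set()          # vertices already covered by A
    improved = True
    while improved:
        improved = False
        for (u, v) in edges:
            if u not in used and v not in used:
                A.append((u, v))      # mass increment: |A'| >= |A| + 1
                used.update((u, v))
                improved = True
                break
    assert 2 * len(A) <= n            # the mass stays bounded throughout
    return A

# Example: the path 0-1-2-3 yields the maximal matching [(0, 1), (2, 3)].
```

The bound on the number of loops is immediate: the mass increases by at least one per loop and can never exceed `n/2`.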

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various monotonicity formulae for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of concentration compactness, which is used heavily in PDE, do have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an entropy compression argument, and only applies to probabilistic algorithms which require a certain collection ${R}$ of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object ${A}$ (which may also have been generated randomly) and some portion of the random string ${R}$ to (deterministically) create a better object ${A'}$ (and a shorter random string ${R'}$, formed by throwing away those bits of ${R}$ that were used in the loop). The key point is to design the algorithm to be partially reversible, in the sense that given ${A'}$ and ${R'}$ and some additional data ${H'}$ that logs the cumulative history of the algorithm up to this point, one can reconstruct ${A}$ together with the remaining portion ${R}$ not already contained in ${R'}$. Thus, each stage of the argument compresses the information-theoretic content of the string ${A+R}$ into the string ${A'+R'+H'}$ in a lossless fashion. However, a random variable such as ${A+R}$ cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of ${A+R}$, and if the length of ${A'+R'+H'}$ is significantly less than that of ${A+R}$ (i.e. we need the marginal growth in the length of the history file ${H'}$ per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.
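To make the shape of such an algorithm concrete, here is a heavily simplified Python sketch, specialised to satisfiability: each loop consumes fresh random bits to resample the variables of a violated clause, and appends one entry to a history log (playing the role of ${H'}$ above). This is only a sketch of the resampling idea (closer in form to the later Moser-Tardos algorithm than to Moser’s original recursive procedure), and all names in it are illustrative:

```python
import random

# Simplified resampling sketch for the entropy compression discussion,
# specialised to satisfiability.  Each loop consumes fresh random bits and
# logs which clause was repaired (the history H').  The entropy compression
# argument bounds the number of loops: if the log grows by fewer bits per
# loop than the randomness consumed per loop, a long run would losslessly
# compress the random string below its Shannon entropy, a contradiction.

def violated(clause, assignment):
    """clause is a list of signed literals: +3 means x3, -3 means (not x3)."""
    return not any((lit > 0) == assignment[abs(lit)] for lit in clause)

def moser_solve(n, clauses, rng=random.Random(0)):
    # random initial assignment (part of the random input R)
    assignment = {i: rng.random() < 0.5 for i in range(1, n + 1)}
    history = []                        # log of which clause was repaired
    while True:
        bad = [c for c in clauses if violated(c, assignment)]
        if not bad:
            return assignment, history  # all clauses satisfied: halt
        clause = bad[0]
        history.append(clauses.index(clause))
        for lit in clause:              # fresh random bits for this loop
            assignment[abs(lit)] = rng.random() < 0.5
```

On a small satisfiable instance such as `[[1, 2], [-1, 3], [-2, -3]]` this halts almost surely; the point of the entropy argument is the quantitative bound on `len(history)` when each clause shares variables with few others.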

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object ${A}$ which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of ${A}$. Here, the failure of the algorithm to halt leads to new information that can be used to “compress” ${A}$ (or more precisely, the full state ${A+R}$) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.

As many readers may already know, my good friend and fellow mathematical blogger Tim Gowers, having wrapped up work on the Princeton Companion to Mathematics (which I believe is now in press), has begun another mathematical initiative, namely a “Tricks Wiki” to act as a repository for mathematical tricks and techniques.    Tim has already started the ball rolling with several seed articles on his own blog, and asked me to also contribute some articles.  (As I understand it, these articles will be migrated to the Wiki in a few months, once it is fully set up, and then they will evolve with edits and contributions by anyone who wishes to pitch in, in the spirit of Wikipedia; in particular, articles are not intended to be permanently authored or signed by any single contributor.)

So today I’d like to start by extracting some material from an old post of mine on “Amplification, arbitrage, and the tensor power trick” (as well as from some of the comments), and converting it to the Tricks Wiki format, while also taking the opportunity to add a few more examples.

Title: The tensor power trick

Quick description: If one wants to prove an inequality $X \leq Y$ for some non-negative quantities X, Y, but can only see how to prove a quasi-inequality $X \leq CY$ that loses a multiplicative constant C, then try to replace all objects involved in the problem by “tensor powers” of themselves and apply the quasi-inequality to those powers.  If all goes well, one can show that $X^M \leq C Y^M$ for all $M \geq 1$, with a constant C which is independent of M, which implies that $X \leq Y$ as desired by taking $M^{th}$ roots and then taking limits as $M \to \infty$.
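Spelled out, the final limiting step is the following harmless-looking computation (assuming, as above, that the constant survives the tensor powers uniformly):

```latex
% If the quasi-inequality survives tensor powers with a uniform constant,
%   X^M \le C\, Y^M \qquad \text{for all } M \ge 1,
% then taking M-th roots gives
%   X \le C^{1/M} Y,
% and since C^{1/M} = e^{(\log C)/M} \to 1 as M \to \infty, the constant
% disappears in the limit:
\[
  X \;\le\; \lim_{M \to \infty} C^{1/M}\, Y \;=\; Y.
\]
```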

Having established the monotonicity of the Perelman reduced volume in the previous lecture (after first heuristically justifying this monotonicity in Lecture 9), we now show how this can be used to establish $\kappa$-noncollapsing of Ricci flows, thus giving a second proof of Theorem 2 from Lecture 7. Of course, we already proved (a stronger version of) this theorem in Lecture 8, using the Perelman entropy, but this second proof is also important, because the reduced volume is a more localised quantity (due to the weight $e^{-l_{(0,x_0)}}$ in its definition), and so one can in fact establish local versions of the non-collapsing theorem, which turn out to be important when we study ancient $\kappa$-noncollapsed solutions later in Perelman’s proof, because such solutions need not be compact and so cannot be controlled by global quantities (such as the Perelman entropy).

The route to $\kappa$-noncollapsing via reduced volume proceeds by the following scheme:

Non-collapsing at time t=0 (1)

$\Downarrow$

Large reduced volume at time t=0 (2)

$\Downarrow$

Large reduced volume at later times t (3)

$\Downarrow$

Non-collapsing at later times t (4)

The implication $(2) \implies (3)$ is the monotonicity of Perelman reduced volume. In this lecture we discuss the other two implications, $(1) \implies (2)$ and $(3) \implies (4)$.

Our arguments here are based on Perelman’s first paper, Kleiner-Lott’s notes, and Morgan-Tian’s book, though the material in the Morgan-Tian book differs in some key respects from the other two texts. A closely related presentation of these topics also appears in the paper of Cao-Zhu.

It occurred to me recently that the mathematical blog medium may be a good venue not just for expository “short stories” on mathematical concepts or results, but also for more technical discussions of individual mathematical “tricks”, which would otherwise not be significant enough to warrant a publication-length (and publication-quality) article. So I thought today that I would discuss the amplification trick in harmonic analysis and combinatorics (and in particular, in the study of estimates); this trick takes an established estimate involving an arbitrary object (such as a function f), and obtains a stronger (or amplified) estimate by transforming the object in a well-chosen manner (often involving some new parameters) into a new object, applying the estimate to that new object, and seeing what that estimate says about the original object (after optimising the parameters or taking a limit). The amplification trick works particularly well for estimates which enjoy some sort of symmetry on one side of the estimate that is not represented on the other side; indeed, it can be viewed as a way to “arbitrage” differing amounts of symmetry between the left- and right-hand sides of an estimate. It can also be used in the contrapositive, amplifying a weak counterexample to an estimate into a strong counterexample. This trick also sheds some light as to why dimensional analysis works; an estimate which is not dimensionally consistent can often be amplified into a stronger estimate which is dimensionally consistent; in many cases, this new estimate is so strong that it cannot in fact be true, and thus dimensionally inconsistent inequalities tend to be either false or inefficient, which is why we rarely see them. (More generally, any inequality on which a group acts on either the left or right-hand side can often be “decomposed” into the “isotypic components” of the group action, either by the amplification trick or by other related tools, such as Fourier analysis.)

The amplification trick is a deceptively simple one, but it can become particularly powerful when one is arbitraging an unintuitive symmetry, such as symmetry under tensor powers. Indeed, the “tensor power trick”, which can eliminate constants and even logarithms in an almost magical manner, can lead to some interesting proofs of sharp inequalities, which are difficult to establish by more direct means.

The most familiar example of the amplification trick in action is probably the textbook proof of the Cauchy-Schwarz inequality

$|\langle v, w \rangle| \leq \|v\| \|w\|$ (1)

for vectors v, w in a complex Hilbert space. To prove this inequality, one might start by exploiting the obvious inequality

$\|v-w\|^2 \geq 0$ (2)

but after expanding everything out, one only gets the weaker inequality

$\hbox{Re} \langle v, w \rangle \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$. (3)

Now (3) is weaker than (1) for two reasons; the left-hand side is smaller, and the right-hand side is larger (thanks to the arithmetic mean-geometric mean inequality). However, we can amplify (3) by arbitraging some symmetry imbalances. Firstly, observe that the phase rotation symmetry $v \mapsto e^{i\theta} v$ preserves the RHS of (3) but not the LHS. We exploit this by replacing v by $e^{i\theta} v$ in (3) for some phase $\theta$ to be chosen later, to obtain

$\hbox{Re} e^{i\theta} \langle v, w \rangle \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$.

Now we are free to choose $\theta$ at will (as long as it is real, of course), so it is natural to choose $\theta$ to optimise the inequality, which in this case means to make the left-hand side as large as possible. This is achieved by choosing $e^{i\theta}$ to cancel the phase of $\langle v, w \rangle$, and we obtain

$|\langle v, w \rangle| \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$. (4)

This is closer to (1); we have fixed the left-hand side, but the right-hand side is still too weak. But we can amplify further, by exploiting an imbalance in a different symmetry, namely the homogenisation symmetry $(v,w) \mapsto (\lambda v, \frac{1}{\lambda} w)$ for a scalar $\lambda > 0$, which preserves the left-hand side but not the right. Inserting this transform into (4) we conclude that

$|\langle v, w \rangle| \leq \frac{\lambda^2}{2} \|v\|^2 + \frac{1}{2\lambda^2} \|w\|^2$

where $\lambda > 0$ is at our disposal to choose. We can optimise in $\lambda$ by minimising the right-hand side, and indeed one easily sees that the minimum (or infimum, if one of v and w vanishes) is $\|v\| \|w\|$ (which is achieved when $\lambda = \sqrt{\|w\|/\|v\|}$ when $v,w$ are non-zero, or in an asymptotic limit $\lambda \to 0$ or $\lambda \to \infty$ in the degenerate cases), and so we have amplified our way to the Cauchy-Schwarz inequality (1). [See also this discussion by Tim Gowers on the Cauchy-Schwarz inequality.]
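As a numerical sanity check, one can verify in a short Python snippet (with arbitrary test vectors of my own choosing) that the optimal choice $\lambda = \sqrt{\|w\|/\|v\|}$ collapses the right-hand side of the amplified inequality to $\|v\| \|w\|$, which indeed dominates $|\langle v, w\rangle|$:

```python
import math

# Check the amplification steps numerically: with the optimal choice
# lambda = sqrt(|w|/|v|), the right-hand side
# (lambda^2/2)|v|^2 + (1/(2 lambda^2))|w|^2 equals |v||w|, which
# dominates |<v,w>|.  The vectors v, w are arbitrary test data in C^2.

def inner(v, w):
    return sum(a * b.conjugate() for a, b in zip(v, w))

def norm(v):
    return math.sqrt(sum(abs(a) ** 2 for a in v))

v = [1 + 2j, 3 - 1j]
w = [2 - 1j, 1j]

lam = math.sqrt(norm(w) / norm(v))                # optimal lambda
rhs = (lam ** 2 / 2) * norm(v) ** 2 + (1 / (2 * lam ** 2)) * norm(w) ** 2

assert abs(rhs - norm(v) * norm(w)) < 1e-12       # optimised RHS = |v||w|
assert abs(inner(v, w)) <= rhs + 1e-12            # Cauchy-Schwarz holds
```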

[This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”. - T.]

The entropy-influence conjecture seeks to relate two somewhat different measures of how concentrated the Fourier coefficients of a boolean function are, namely the total influence and the entropy.

We begin by defining the total influence. Let $\{-1,+1\}^n$ be the discrete cube, i.e. the set of $\pm 1$ vectors $(x_1,\ldots,x_n)$ of length n. A boolean function is any function $f: \{-1,+1\}^n \to \{-1,+1\}$ from the discrete cube to {-1,+1}. One can think of such functions as “voting methods”, which take the preferences of n voters (+1 for yes, -1 for no) as input and return a yes/no verdict as output. For instance, if n is odd, the “majority vote” function $\hbox{sgn}(x_1+\ldots+x_n)$ returns +1 if there are more +1 variables than -1, or -1 otherwise, whereas if $1 \leq k \leq n$, the “$k^{th}$ dictator” function returns the value $x_k$ of the $k^{th}$ variable.

We give the cube $\{-1,+1\}^n$ the uniform probability measure $\mu$ (thus we assume that the n voters vote randomly and independently). Given any boolean function f and any variable $1 \leq k \leq n$, define the influence $I_k(f)$ of the $k^{th}$ variable to be the quantity

$I_k(f) := \mu \{ x \in \{-1,+1\}^n: f(\sigma_k(x)) \neq f(x) \}$

where $\sigma_k(x)$ is the element of the cube formed by flipping the sign of the $k^{th}$ variable. Informally, $I_k(f)$ measures the probability that the $k^{th}$ voter could actually determine the outcome of an election; it is sometimes referred to as the Banzhaf power index. The total influence I(f) of f (also known as the average sensitivity and the edge-boundary density) is then defined as

$I(f) := \sum_{k=1}^n I_k(f).$

Thus for instance a dictator function has total influence 1, whereas majority vote has total influence comparable to $\sqrt{n}$. The influence can range between 0 (for constant functions +1, -1) and n (for the parity function $x_1 \ldots x_n$ or its negation). If f has mean zero (i.e. it is equal to +1 half of the time), then the edge-isoperimetric inequality asserts that $I(f) \geq 1$ (with equality if and only if f is a dictator function or the negation of one), whilst the Kahn-Kalai-Linial (KKL) theorem asserts that $I_k(f) \gg \frac{\log n}{n}$ for some k. There is a result of Friedgut that if $I(f)$ is bounded by A (say) and $\varepsilon > 0$, then f is within a distance $\varepsilon$ (in $L^1$ norm) of another boolean function g which only depends on $O_{A,\varepsilon}(1)$ of the variables (such functions are known as juntas).
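These quantities are small enough to compute exactly by brute force for small n; a short Python snippet, directly from the definitions above (the helper names are my own):

```python
from itertools import product

# Brute-force computation of the influences I_k(f) and the total influence
# I(f) of a boolean function f on the discrete cube {-1,+1}^n.

def influence(f, n, k):
    # probability, over uniform x, that flipping the k-th coordinate changes f
    flips = sum(1 for x in product((-1, 1), repeat=n)
                if f(x) != f(x[:k] + (-x[k],) + x[k + 1:]))
    return flips / 2 ** n

def total_influence(f, n):
    return sum(influence(f, n, k) for k in range(n))

dictator = lambda x: x[0]
majority = lambda x: 1 if sum(x) > 0 else -1       # n odd
parity = lambda x: x[0] * x[1] * x[2]              # n = 3 here

# For n = 3: the dictator has total influence 1; parity has total influence
# n = 3 (every flip changes the sign); each majority voter is pivotal exactly
# when the other two voters split, giving influence 1/2 each and total 3/2.
```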

[This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”. - T.]

This is a problem in discrete and convex geometry. It seeks to quantify the intuitively obvious fact that large convex bodies are so “fat” that they cannot avoid “detection” by a small number of observation points. More precisely, we fix a dimension d and make the following definition (introduced by Haussler and Welzl):

• Definition: Let $X \subset {\Bbb R}^d$ be a finite set of points, and let $0 < \epsilon < 1$. We say that a finite set $Y \subset {\Bbb R}^d$ is a weak $\epsilon$-net for X (with respect to convex bodies) if, whenever B is a convex body which is large in the sense that $|B \cap X| > \epsilon |X|$, then B contains at least one point of Y. (If Y is contained in X, we say that Y is a strong $\epsilon$-net for X with respect to convex bodies.)

For example, in one dimension, if $X = \{1,\ldots,N\}$, and $Y = \{ \epsilon N, 2 \epsilon N, \ldots, k \epsilon N \}$ where k is the integer part of $1/\epsilon$, then Y is a weak $\epsilon$-net for X with respect to convex bodies. Thus we see that even when the original set X is very large, one can create an $\epsilon$-net of size as small as $O(1/\epsilon)$. Strong $\epsilon$-nets are of importance in computational learning theory, and are fairly well understood via Vapnik-Chervonenkis (or VC) theory; however, the theory of weak $\epsilon$-nets is still not completely satisfactory.
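The one-dimensional example can be checked mechanically: a convex body in one dimension is an interval, so it suffices to verify that every discrete interval capturing more than an $\epsilon$ fraction of X straddles a multiple of $\epsilon N$. A hypothetical Python sketch (the function name and interface are illustrative):

```python
# Verify the 1D example: X = {1,...,N}, and Y the multiples of eps*N up to N.
# In one dimension a convex body is an interval, so we test every discrete
# interval [a, b] containing more than eps*N points of X.

def is_weak_net(N, eps):
    step = eps * N
    Y = [i * step for i in range(1, int(1 / eps) + 1)]
    for a in range(1, N + 1):
        for b in range(a, N + 1):
            if b - a + 1 > eps * N:                   # |B ∩ X| > eps |X|
                if not any(a <= y <= b for y in Y):   # B must contain a net point
                    return False
    return True
```

For instance, `is_weak_net(20, 0.25)` confirms that the four points {5, 10, 15, 20} form a weak 1/4-net for {1, ..., 20}.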

One can ask what happens in higher dimensions, for instance when X is a discrete cube $X = \{1,\ldots,N\}^d$. It is not too hard to cook up $\epsilon$-nets of size $O_d(1/\epsilon^d)$ (by using tools such as Minkowski’s theorem), but in fact one can create $\epsilon$-nets of size as small as $O( \frac{1}{\epsilon} \log \frac{1}{\epsilon} )$ simply by taking a random subset of X of this cardinality and observing that “up to errors of $\epsilon$“, the total number of essentially different ways a convex body can meet X grows at most polynomially in $1/\epsilon$. (This is a very typical application of the probabilistic method.) On the other hand, since X can contain roughly $1/\epsilon$ disjoint convex bodies, each of which contains at least $\epsilon$ of the points in X, we see that no $\epsilon$-net can have size much smaller than $1/\epsilon$.

Now consider the situation in which X is now an arbitrary finite set, rather than a discrete cube. More precisely, let $f(\epsilon,d)$ be the least number such that every finite set X possesses at least one weak $\epsilon$-net for X with respect to convex bodies of cardinality at most $f(\epsilon,d)$. (One can also replace the finite set X with an arbitrary probability measure; the two formulations are equivalent.) Informally, f is the least number of “guards” one needs to place to prevent a convex body from covering more than $\epsilon$ of any given territory.

• Problem 1: For fixed d, what is the correct rate of growth of f as $\epsilon \to 0$?