You are currently browsing the tag archive for the ‘entropy’ tag.

Let and be two random variables taking values in the same (discrete) range , and let be some subset of , which we think of as the set of “bad” outcomes for either or . If and have the same probability distribution, then clearly

In particular, if it is rare for to lie in , then it is also rare for to lie in .

If and do not have exactly the same probability distribution, but their probability distributions are *close* to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

for any . In particular, if it is rare for to lie in , and are close in total variation, then it is also rare for to lie in .

A basic inequality in information theory is Pinsker’s inequality

where the Kullback-Leibler divergence is defined by the formula

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that is non-negative (Gibbs’ inequality), and vanishes if and only if , have the same distribution; thus one can think of as a measure of how close the distributions of and are to each other, although one should caution that this is not a symmetric notion of distance, as in general. Inserting Pinsker’s inequality into (1), we see for instance that

Thus, if is close to in the Kullback-Leibler sense, and it is rare for to lie in , then it is rare for to lie in as well.

We can specialise this inequality to the case when a uniform random variable on a finite range of some cardinality , in which case the Kullback-Leibler divergence simplifies to

where

is the Shannon entropy of . Again, a routine application of Jensen’s inequality shows that , with equality if and only if is uniformly distributed on . The above inequality then becomes

Thus, if is a small fraction of (so that it is rare for to lie in ), and the entropy of is very close to the maximum possible value of , then it is rare for to lie in also.

The inequality (2) is only useful when the entropy is close to in the sense that , otherwise the bound is worse than the trivial bound of . In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy was allowed to be smaller than . More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality)Let be a random variable taking values in a finite range of cardinality , let be a uniformly distributed random variable in , and let be a subset of . Then

*Proof:* Consider the conditional entropy . On the one hand, we have

by Jensen’s inequality. On the other hand, one has

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim.

Remark 2As noted in comments, this inequality can be viewed as a special case of the more general inequalityfor arbitrary random variables taking values in the same discrete range , which follows from the data processing inequality

for arbitrary functions , applied to the indicator function . Indeed one has

where is the entropy function.

Thus, for instance, if one has

and

for some much larger than (so that ), then

More informally: if the entropy of is *somewhat* close to the maximum possible value of , and it is *exponentially* rare for a uniform variable to lie in , then it is still *somewhat* rare for to lie in . The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable which is uniformly distributed inside a small set with some probability and uniformly distributed outside of with probability , for some parameter .

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

with exponentially high probability, where is a uniform distribution and is some reasonable function of . Combining this with the above lemma, we can then obtain approximations of the form

with somewhat high probability, if the entropy of is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

for “most” choices of and a suitable choice of (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as through the multiplicativity of , while the right-hand side, being a linear correlation involving two parameters rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.

There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object ; each loop of the algorithm replaces with some better version of itself, until some desired property of is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object often degrade some other finitary quantity in the process, and an infinite iteration would then have the undesirable effect of making infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a *monotonicity property*, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

- The
*mass increment argument*. This is perhaps the most familiar way to ensure termination: make each improved object “heavier” than the previous one by some non-trivial amount (e.g. by ensuring that the cardinality of is strictly greater than that of , thus ). Dually, one can try to force the amount of “mass” remaining “outside” of in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of , and add them to to form , thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. is*maximal*in some sense), and this should force some desirable property on . - The
*density increment argument*. This is a variant of the mass increment argument, in which one increments the “density” of rather than the “mass”. For instance, might be contained in some ambient space , and one seeks to improve to (and to ) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. for some ). On the other hand, the density of is clearly bounded above by . As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside , and then “zoom in” to a region of increased density by reducing and appropriately. Eventually no further usable density fluctuation remains (i.e. is*uniformly distributed*), and this should force some desirable property on . - The
*energy increment argument*. This is an “” analogue of the ““-based mass increment argument (or the ““-based density increment argument), in which one seeks to increments the amount of “energy” that captures from some reference object , or (equivalently) to decrement the amount of energy of which is still “orthogonal” to . Here and are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to , and add them to to form , thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of remains to be added (i.e. is*maximal*in some sense), and this should force some desirable property on . - The
*rank reduction argument*. Here, one seeks to make each new object to have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within , and use them to collapse to an object of lower rank. Eventually, no further usable collisions within remain, and this should force some desirable property on .

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various *monotonicity formulae* for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of *concentration compactness*, which is used heavily in PDE, does have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an *entropy compression argument*, and only applies to probabilistic algorithms which require a certain collection of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object (which may also have been generated randomly) and some portion of the random string to (deterministically) create a better object (and a shorter random string , formed by throwing away those bits of that were used in the loop). The key point is to design the algorithm to be partially *reversible*, in the sense that given and and some additional data that logs the cumulative *history* of the algorithm up to this point, one can reconstruct together with the remaining portion not already contained in . Thus, each stage of the argument *compresses* the information-theoretic content of the string into the string in a lossless fashion. However, a random variable such as cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of , and if the length of is significantly less than that of (i.e. we need the marginal growth in the length of the history file per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of . Here, the failure of the algorithm to halt leads to new information that can be used to “compress” (or more precisely, the full state ) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.

As many readers may already know, my good friend and fellow mathematical blogger Tim Gowers, having wrapped up work on the Princeton Companion to Mathematics (which I believe is now in press), has begun another mathematical initiative, namely a “Tricks Wiki” to act as a repository for mathematical tricks and techniques. Tim has already started the ball rolling with several seed articles on his own blog, and asked me to also contribute some articles. (As I understand it, these articles will be migrated to the Wiki in a few months, once it is fully set up, and then they will evolve with edits and contributions by anyone who wishes to pitch in, in the spirit of Wikipedia; in particular, articles are not intended to be permanently authored or signed by any single contributor.)

So today I’d like to start by extracting some material from an old post of mine on “Amplification, arbitrage, and the tensor power trick” (as well as from some of the comments), and converting it to the Tricks Wiki format, while also taking the opportunity to add a few more examples.

**Title: The tensor power trick
**

**Quick description: **If one wants to prove an inequality for some non-negative quantities X, Y, but can only see how to prove a quasi-inequality that loses a multiplicative constant C, then try to replace all objects involved in the problem by “tensor powers” of themselves and apply the quasi-inequality to those powers. If all goes well, one can show that for all , with a constant C which is *independent* of M, which implies that as desired by taking roots and then taking limits as .

Having established the monotonicity of the Perelman reduced volume in the previous lecture (after first heuristically justifying this monotonicity in Lecture 9), we now show how this can be used to establish -noncollapsing of Ricci flows, thus giving a second proof of Theorem 2 from Lecture 7. Of course, we already proved (a stronger version) of this theorem already in Lecture 8, using the Perelman entropy, but this second proof is also important, because the reduced volume is a more localised quantity (due to the weight in its definition and so one can in fact establish *local *versions of the non-collapsing theorem which turn out to be important when we study ancient -noncollapsing solutions later in Perelman’s proof, because such solutions need not be compact and so cannot be controlled by global quantities (such as the Perelman entropy).

The route to -noncollapsing via reduced volume proceeds by the following scheme:

Non-collapsing at time t=0 (1)

Large reduced volume at time t=0 (2)

Large reduced volume at later times t (3)

Non-collapsing at later times t (4)

The implication is the monotonicity of Perelman reduced volume. In this lecture we discuss the other two implications , and ).

Our arguments here are based on Perelman’s first paper, Kleiner-Lott’s notes, and Morgan-Tian’s book, though the material in the Morgan-Tian book differs in some key respects from the other two texts. A closely related presentation of these topics also appears in the paper of Cao-Zhu.

It occurred to me recently that the mathematical blog medium may be a good venue not just for expository “short stories” on mathematical concepts or results, but also for more technical discussions of individual mathematical “tricks”, which would otherwise not be significant enough to warrant a publication-length (and publication-quality) article. So I thought today that I would discuss the *amplification trick* in harmonic analysis and combinatorics (and in particular, in the study of estimates); this trick takes an established estimate involving an arbitrary object (such as a function f), and obtains a stronger (or *amplified*) estimate by transforming the object in a well-chosen manner (often involving some new parameters) into a new object, applying the estimate to that new object, and seeing what that estimate says about the original object (after optimising the parameters or taking a limit). The amplification trick works particularly well for estimates which enjoy some sort of symmetry on one side of the estimate that is not represented on the other side; indeed, it can be viewed as a way to “arbitrage” differing amounts of symmetry between the left- and right-hand sides of an estimate. It can also be used in the contrapositive, amplifying a weak counterexample to an estimate into a strong counterexample. This trick also sheds some light as to why dimensional analysis works; an estimate which is not dimensionally consistent can often be amplified into a stronger estimate which is dimensionally consistent; in many cases, this new estimate is so strong that it cannot in fact be true, and thus dimensionally inconsistent inequalities tend to be either false or inefficient, which is why we rarely see them. (More generally, any inequality on which a group acts on either the left or right-hand side can often be “decomposed” into the “isotypic components” of the group action, either by the amplification trick or by other related tools, such as Fourier analysis.)

The amplification trick is a deceptively simple one, but it can become particularly powerful when one is arbitraging an unintuitive symmetry, such as symmetry under tensor powers. Indeed, the “tensor power trick”, which can eliminate constants and even logarithms in an almost magical manner, can lead to some interesting proofs of sharp inequalities, which are difficult to establish by more direct means.

The most familiar example of the amplification trick in action is probably the textbook proof of the Cauchy-Schwarz inequality

(1)

for vectors v, w in a complex Hilbert space. To prove this inequality, one might start by exploiting the obvious inequality

(2)

but after expanding everything out, one only gets the weaker inequality

. (3)

Now (3) is weaker than (1) for two reasons; the left-hand side is smaller, and the right-hand side is larger (thanks to the arithmetic mean-geometric mean inequality). However, we can amplify (3) by arbitraging some symmetry imbalances. Firstly, observe that the phase rotation symmetry preserves the RHS of (3) but not the LHS. We exploit this by replacing v by in (3) for some phase to be chosen later, to obtain

.

Now we are free to choose at will (as long as it is real, of course), so it is natural to choose to optimise the inequality, which in this case means to make the left-hand side as large as possible. This is achieved by choosing to cancel the phase of , and we obtain

(4)

This is closer to (1); we have fixed the left-hand side, but the right-hand side is still too weak. But we can amplify further, by exploiting an imbalance in a different symmetry, namely the homogenisation symmetry for a scalar , which preserves the left-hand side but not the right. Inserting this transform into (4) we conclude that

where is at our disposal to choose. We can optimise in by minimising the right-hand side, and indeed one easily sees that the minimum (or infimum, if one of v and w vanishes) is (which is achieved when when are non-zero, or in an asymptotic limit or in the degenerate cases), and so we have amplified our way to the Cauchy-Schwarz inequality (1). [See also this discussion by Tim Gowers on the Cauchy-Schwarz inequality.]

[*This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”.* – T.]

The *entropy-influence conjecture* seeks to relate two somewhat different measures as to how a boolean function has concentrated Fourier coefficients, namely the total influence and the entropy.

We begin by defining the total influence. Let be the discrete cube, i.e. the set of vectors of length n. A *boolean function* is any function from the discrete cube to {-1,+1}. One can think of such functions as “voting methods”, which take the preferences of n voters (+1 for yes, -1 for no) as input and return a yes/no verdict as output. For instance, if n is odd, the “majority vote” function returns +1 if there are more +1 variables than -1, or -1 otherwise, whereas if , the “ dictator” function returns the value of the variable.

We give the cube the uniform probability measure (thus we assume that the n voters vote randomly and independently). Given any boolean function f and any variable , define the *influence* of the variable to be the quantity

where is the element of the cube formed by flipping the sign of the variable. Informally, measures the probability that the voter could actually determine the outcome of an election; it is sometimes referred to as the Banzhaf power index. The *total influence* I(f) of f (also known as the *average sensitivity* and the *edge-boundary density*) is then defined as

Thus for instance a dictator function has total influence 1, whereas majority vote has total influence comparable to . The influence can range between 0 (for constant functions +1, -1) and n (for the parity function or its negation). If f has mean zero (i.e. it is equal to +1 half of the time), then the *edge-isoperimetric inequality* asserts that (with equality if and only if there is a dictatorship), whilst the Kahn-Kalai-Linial (KKL) theorem asserts that for some k. There is a result of Friedgut that if is bounded by A (say) and , then f is within a distance (in norm) of another boolean function g which only depends on of the variables (such functions are known as juntas).

[*This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”.* – T.]

This is a problem in discrete and convex geometry. It seeks to quantify the intuitively obvious fact that large convex bodies are so “fat” that they cannot avoid “detection” by a small number of observation points. More precisely, we fix a dimension d and make the following definition (introduced by Haussler and Welzl):

**Definition**: Let be a finite set of points, and let . We say that a finite set is a*weak -net*for X (with respect to convex bodies) if, whenever B is a convex body which is large in the sense that , then B contains at least one point of Y. (If Y is contained in X, we say that Y is a*strong -net*for X with respect to convex bodies.)

For example, in one dimension, if , and where k is the integer part of , then Y is a weak -net for X with respect to convex bodies. Thus we see that even when the original set X is very large, one can create a -net of size as small as . Strong -nets are of importance in computational learning theory, and are fairly well understood via Vapnik-Chervonenkis (or VC) theory; however, the theory of weak -nets is still not completely satisfactory.

One can ask what happens in higher dimensions, for instance when X is a discrete cube . It is not too hard to cook up -nets of size (by using tools such as Minkowski’s theorem), but in fact one can create -nets of size as small as simply by taking a *random* subset of X of this cardinality and observing that “up to errors of “, the total number of essentially different ways a convex body can meet X grows at most polynomially in . (This is a very typical application of the probabilistic method.) On the other hand, since X can contain roughly disjoint convex bodies, each of which contains at least of the points in X, we see that no -net can have size much smaller than .

Now consider the situation in which X is now an arbitrary finite set, rather than a discrete cube. More precisely, let be the least number such that *every* finite set X possesses at least one weak -net for X with respect to convex bodies of cardinality at most . (One can also replace the finite set X with an arbitrary probability measure; the two formulations are equivalent.) Informally, f is the least number of “guards” one needs to place to prevent a convex body from covering more than of any given territory.

**Problem 1**: For fixed d, what is the correct rate of growth of f as ?

## Recent Comments