
There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object ${A}$; each loop of the algorithm replaces ${A}$ with some better version ${A'}$ of itself, until some desired property of ${A}$ is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object ${A}$ often degrade some other finitary quantity ${B}$ in the process, and an infinite iteration would then have the undesirable effect of making ${B}$ infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a monotonicity property, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

• The mass increment argument. This is perhaps the most familiar way to ensure termination: make each improved object ${A'}$ “heavier” than the previous one ${A}$ by some non-trivial amount (e.g. by ensuring that the cardinality of ${A'}$ is strictly greater than that of ${A}$, thus ${|A'| \geq |A|+1}$). Dually, one can try to force the amount of “mass” remaining “outside” of ${A}$ in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of ${A}$ that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. ${A}$ is maximal in some ${L^1}$ sense), and this should force some desirable property on ${A}$.
• The density increment argument. This is a variant of the mass increment argument, in which one increments the “density” of ${A}$ rather than the “mass”. For instance, ${A}$ might be contained in some ambient space ${P}$, and one seeks to improve ${A}$ to ${A'}$ (and ${P}$ to ${P'}$) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. ${|A'|/|P'| \geq |A|/|P| + c}$ for some ${c>0}$). On the other hand, the density of ${A}$ is clearly bounded above by ${1}$. As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside ${A}$, and then “zoom in” to a region of increased density by reducing ${A}$ and ${P}$ appropriately. Eventually no further usable density fluctuation remains (i.e. ${A}$ is uniformly distributed), and this should force some desirable property on ${A}$.
• The energy increment argument. This is an “${L^2}$” analogue of the “${L^1}$”-based mass increment argument (or the “${L^\infty}$”-based density increment argument), in which one seeks to increment the amount of “energy” that ${A}$ captures from some reference object ${X}$, or (equivalently) to decrement the amount of energy of ${X}$ which is still “orthogonal” to ${A}$. Here ${A}$ and ${X}$ are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of ${A}$ remains to be added (i.e. ${A}$ is maximal in some ${L^2}$ sense), and this should force some desirable property on ${A}$.
• The rank reduction argument. Here, one seeks to make each new object ${A'}$ have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within ${A}$, and use them to collapse ${A}$ to an object ${A'}$ of lower rank. Eventually, no further usable collisions within ${A}$ remain, and this should force some desirable property on ${A}$.
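As a toy illustration of the first (mass increment) flavour, consider the greedy construction of a maximal matching in a graph: each loop adds one edge to ${A}$, so the “mass” ${|A|}$ increases by exactly one while staying bounded by half the number of vertices, forcing termination; on halting, no usable edge remains, which is exactly the maximality property desired. Here is a minimal Python sketch (the function name and input format are illustrative, not taken from any particular source):

```python
# Mass increment in action: greedily grow a matching A one edge at a time.
# Each loop strictly increases |A|, which is bounded above by n/2, so the
# algorithm halts; on halting, A is a maximal matching.

def greedy_maximal_matching(n, edges):
    """n vertices labelled 0..n-1; edges is a list of vertex pairs."""
    A = []                # current matching (the object being improved)
    used = set()          # vertices already covered by A
    improved = True
    while improved:
        improved = False
        for (u, v) in edges:
            if u not in used and v not in used:
                A.append((u, v))      # mass increment: |A'| >= |A| + 1
                used.update((u, v))
                improved = True
                break
    assert 2 * len(A) <= n            # the mass stays bounded throughout
    return A

# Example: the path 0-1-2-3 yields the maximal matching [(0, 1), (2, 3)].
```

The bound on the number of loops is immediate: the mass increases by at least one per loop and can never exceed `n/2`.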

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various monotonicity formulae for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of concentration compactness, which is used heavily in PDE, do have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an entropy compression argument, and only applies to probabilistic algorithms which require a certain collection ${R}$ of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object ${A}$ (which may also have been generated randomly) and some portion of the random string ${R}$ to (deterministically) create a better object ${A'}$ (and a shorter random string ${R'}$, formed by throwing away those bits of ${R}$ that were used in the loop). The key point is to design the algorithm to be partially reversible, in the sense that given ${A'}$ and ${R'}$ and some additional data ${H'}$ that logs the cumulative history of the algorithm up to this point, one can reconstruct ${A}$ together with the remaining portion ${R}$ not already contained in ${R'}$. Thus, each stage of the argument compresses the information-theoretic content of the string ${A+R}$ into the string ${A'+R'+H'}$ in a lossless fashion. However, a random variable such as ${A+R}$ cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of ${A+R}$, and if the length of ${A'+R'+H'}$ is significantly less than that of ${A+R}$ (i.e. we need the marginal growth in the length of the history file ${H'}$ per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.
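To make the shape of such an algorithm concrete, here is a heavily simplified Python sketch, specialised to satisfiability: each loop consumes fresh random bits to resample the variables of a violated clause, and appends one entry to a history log (playing the role of ${H'}$ above). This is only a sketch of the resampling idea (closer in form to the later Moser-Tardos algorithm than to Moser’s original recursive procedure), and all names in it are illustrative:

```python
import random

# Simplified resampling sketch for the entropy compression discussion,
# specialised to satisfiability.  Each loop consumes fresh random bits and
# logs which clause was repaired (the history H').  The entropy compression
# argument bounds the number of loops: if the log grows by fewer bits per
# loop than the randomness consumed per loop, a long run would losslessly
# compress the random string below its Shannon entropy, a contradiction.

def violated(clause, assignment):
    """clause is a list of signed literals: +3 means x3, -3 means (not x3)."""
    return not any((lit > 0) == assignment[abs(lit)] for lit in clause)

def moser_solve(n, clauses, rng=random.Random(0)):
    # random initial assignment (part of the random input R)
    assignment = {i: rng.random() < 0.5 for i in range(1, n + 1)}
    history = []                        # log of which clause was repaired
    while True:
        bad = [c for c in clauses if violated(c, assignment)]
        if not bad:
            return assignment, history  # all clauses satisfied: halt
        clause = bad[0]
        history.append(clauses.index(clause))
        for lit in clause:              # fresh random bits for this loop
            assignment[abs(lit)] = rng.random() < 0.5
```

On a small satisfiable instance such as `[[1, 2], [-1, 3], [-2, -3]]` this halts almost surely; the point of the entropy argument is the quantitative bound on `len(history)` when each clause shares variables with few others.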

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object ${A}$ which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of ${A}$. Here, the failure of the algorithm to halt leads to new information that can be used to “compress” ${A}$ (or more precisely, the full state ${A+R}$) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.

As many readers may already know, my good friend and fellow mathematical blogger Tim Gowers, having wrapped up work on the Princeton Companion to Mathematics (which I believe is now in press), has begun another mathematical initiative, namely a “Tricks Wiki” to act as a repository for mathematical tricks and techniques.    Tim has already started the ball rolling with several seed articles on his own blog, and asked me to also contribute some articles.  (As I understand it, these articles will be migrated to the Wiki in a few months, once it is fully set up, and then they will evolve with edits and contributions by anyone who wishes to pitch in, in the spirit of Wikipedia; in particular, articles are not intended to be permanently authored or signed by any single contributor.)

So today I’d like to start by extracting some material from an old post of mine on “Amplification, arbitrage, and the tensor power trick” (as well as from some of the comments), and converting it to the Tricks Wiki format, while also taking the opportunity to add a few more examples.

Title: The tensor power trick

Quick description: If one wants to prove an inequality $X \leq Y$ for some non-negative quantities X, Y, but can only see how to prove a quasi-inequality $X \leq CY$ that loses a multiplicative constant C, then try to replace all objects involved in the problem by “tensor powers” of themselves and apply the quasi-inequality to those powers.  If all goes well, one can show that $X^M \leq C Y^M$ for all $M \geq 1$, with a constant C which is independent of M, which implies that $X \leq Y$ as desired by taking $M^{th}$ roots and then taking limits as $M \to \infty$.
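Spelled out, the final limiting step is the following harmless-looking computation (assuming, as above, that the constant survives the tensor powers uniformly):

```latex
% If the quasi-inequality survives tensor powers with a uniform constant,
%   X^M \le C\, Y^M \qquad \text{for all } M \ge 1,
% then taking M-th roots gives
%   X \le C^{1/M} Y,
% and since C^{1/M} = e^{(\log C)/M} \to 1 as M \to \infty, the constant
% disappears in the limit:
\[
  X \;\le\; \lim_{M \to \infty} C^{1/M}\, Y \;=\; Y.
\]
```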

Having established the monotonicity of the Perelman reduced volume in the previous lecture (after first heuristically justifying this monotonicity in Lecture 9), we now show how this can be used to establish $\kappa$-noncollapsing of Ricci flows, thus giving a second proof of Theorem 2 from Lecture 7. Of course, we already proved (a stronger version of) this theorem in Lecture 8, using the Perelman entropy, but this second proof is also important, because the reduced volume is a more localised quantity (due to the weight $e^{-l_{(0,x_0)}}$ in its definition), and so one can in fact establish local versions of the non-collapsing theorem, which turn out to be important when we study ancient $\kappa$-noncollapsed solutions later in Perelman’s proof, because such solutions need not be compact and so cannot be controlled by global quantities (such as the Perelman entropy).

The route to $\kappa$-noncollapsing via reduced volume proceeds by the following scheme:

Non-collapsing at time t=0 (1)

$\Downarrow$

Large reduced volume at time t=0 (2)

$\Downarrow$

Large reduced volume at later times t (3)

$\Downarrow$

Non-collapsing at later times t (4)

The implication $(2) \implies (3)$ is the monotonicity of Perelman reduced volume. In this lecture we discuss the other two implications, $(1) \implies (2)$ and $(3) \implies (4)$.

Our arguments here are based on Perelman’s first paper, Kleiner-Lott’s notes, and Morgan-Tian’s book, though the material in the Morgan-Tian book differs in some key respects from the other two texts. A closely related presentation of these topics also appears in the paper of Cao-Zhu.

It occurred to me recently that the mathematical blog medium may be a good venue not just for expository “short stories” on mathematical concepts or results, but also for more technical discussions of individual mathematical “tricks”, which would otherwise not be significant enough to warrant a publication-length (and publication-quality) article. So I thought today that I would discuss the amplification trick in harmonic analysis and combinatorics (and in particular, in the study of estimates); this trick takes an established estimate involving an arbitrary object (such as a function f), and obtains a stronger (or amplified) estimate by transforming the object in a well-chosen manner (often involving some new parameters) into a new object, applying the estimate to that new object, and seeing what that estimate says about the original object (after optimising the parameters or taking a limit). The amplification trick works particularly well for estimates which enjoy some sort of symmetry on one side of the estimate that is not represented on the other side; indeed, it can be viewed as a way to “arbitrage” differing amounts of symmetry between the left- and right-hand sides of an estimate. It can also be used in the contrapositive, amplifying a weak counterexample to an estimate into a strong counterexample. This trick also sheds some light as to why dimensional analysis works; an estimate which is not dimensionally consistent can often be amplified into a stronger estimate which is dimensionally consistent; in many cases, this new estimate is so strong that it cannot in fact be true, and thus dimensionally inconsistent inequalities tend to be either false or inefficient, which is why we rarely see them. (More generally, any inequality on which a group acts on either the left or right-hand side can often be “decomposed” into the “isotypic components” of the group action, either by the amplification trick or by other related tools, such as Fourier analysis.)

The amplification trick is a deceptively simple one, but it can become particularly powerful when one is arbitraging an unintuitive symmetry, such as symmetry under tensor powers. Indeed, the “tensor power trick”, which can eliminate constants and even logarithms in an almost magical manner, can lead to some interesting proofs of sharp inequalities, which are difficult to establish by more direct means.

The most familiar example of the amplification trick in action is probably the textbook proof of the Cauchy-Schwarz inequality

$|\langle v, w \rangle| \leq \|v\| \|w\|$ (1)

for vectors v, w in a complex Hilbert space. To prove this inequality, one might start by exploiting the obvious inequality

$\|v-w\|^2 \geq 0$ (2)

but after expanding everything out, one only gets the weaker inequality

$\hbox{Re} \langle v, w \rangle \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$. (3)

Now (3) is weaker than (1) for two reasons; the left-hand side is smaller, and the right-hand side is larger (thanks to the arithmetic mean-geometric mean inequality). However, we can amplify (3) by arbitraging some symmetry imbalances. Firstly, observe that the phase rotation symmetry $v \mapsto e^{i\theta} v$ preserves the RHS of (3) but not the LHS. We exploit this by replacing v by $e^{i\theta} v$ in (3) for some phase $\theta$ to be chosen later, to obtain

$\hbox{Re} e^{i\theta} \langle v, w \rangle \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$.

Now we are free to choose $\theta$ at will (as long as it is real, of course), so it is natural to choose $\theta$ to optimise the inequality, which in this case means to make the left-hand side as large as possible. This is achieved by choosing $e^{i\theta}$ to cancel the phase of $\langle v, w \rangle$, and we obtain

$|\langle v, w \rangle| \leq \frac{1}{2} \|v\|^2 + \frac{1}{2} \|w\|^2$. (4)

This is closer to (1); we have fixed the left-hand side, but the right-hand side is still too weak. But we can amplify further, by exploiting an imbalance in a different symmetry, namely the homogenisation symmetry $(v,w) \mapsto (\lambda v, \frac{1}{\lambda} w)$ for a scalar $\lambda > 0$, which preserves the left-hand side but not the right. Inserting this transform into (4) we conclude that

$|\langle v, w \rangle| \leq \frac{\lambda^2}{2} \|v\|^2 + \frac{1}{2\lambda^2} \|w\|^2$

where $\lambda > 0$ is at our disposal to choose. We can optimise in $\lambda$ by minimising the right-hand side, and indeed one easily sees that the minimum (or infimum, if one of v and w vanishes) is $\|v\| \|w\|$ (which is achieved when $\lambda = \sqrt{\|w\|/\|v\|}$ when $v,w$ are non-zero, or in an asymptotic limit $\lambda \to 0$ or $\lambda \to \infty$ in the degenerate cases), and so we have amplified our way to the Cauchy-Schwarz inequality (1). [See also this discussion by Tim Gowers on the Cauchy-Schwarz inequality.]
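As a numerical sanity check, one can verify in a short Python snippet (with arbitrary test vectors of my own choosing) that the optimal choice $\lambda = \sqrt{\|w\|/\|v\|}$ collapses the right-hand side of the amplified inequality to $\|v\| \|w\|$, which indeed dominates $|\langle v, w\rangle|$:

```python
import math

# Check the amplification steps numerically: with the optimal choice
# lambda = sqrt(|w|/|v|), the right-hand side
# (lambda^2/2)|v|^2 + (1/(2 lambda^2))|w|^2 equals |v||w|, which
# dominates |<v,w>|.  The vectors v, w are arbitrary test data in C^2.

def inner(v, w):
    return sum(a * b.conjugate() for a, b in zip(v, w))

def norm(v):
    return math.sqrt(sum(abs(a) ** 2 for a in v))

v = [1 + 2j, 3 - 1j]
w = [2 - 1j, 1j]

lam = math.sqrt(norm(w) / norm(v))                # optimal lambda
rhs = (lam ** 2 / 2) * norm(v) ** 2 + (1 / (2 * lam ** 2)) * norm(w) ** 2

assert abs(rhs - norm(v) * norm(w)) < 1e-12       # optimised RHS = |v||w|
assert abs(inner(v, w)) <= rhs + 1e-12            # Cauchy-Schwarz holds
```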

[This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”. - T.]

The entropy-influence conjecture seeks to relate two somewhat different measures of how concentrated the Fourier coefficients of a boolean function are, namely the total influence and the entropy.

We begin by defining the total influence. Let $\{-1,+1\}^n$ be the discrete cube, i.e. the set of $\pm 1$ vectors $(x_1,\ldots,x_n)$ of length n. A boolean function is any function $f: \{-1,+1\}^n \to \{-1,+1\}$ from the discrete cube to {-1,+1}. One can think of such functions as “voting methods”, which take the preferences of n voters (+1 for yes, -1 for no) as input and return a yes/no verdict as output. For instance, if n is odd, the “majority vote” function $\hbox{sgn}(x_1+\ldots+x_n)$ returns +1 if there are more +1 variables than -1, or -1 otherwise, whereas if $1 \leq k \leq n$, the “$k^{th}$ dictator” function returns the value $x_k$ of the $k^{th}$ variable.

We give the cube $\{-1,+1\}^n$ the uniform probability measure $\mu$ (thus we assume that the n voters vote randomly and independently). Given any boolean function f and any variable $1 \leq k \leq n$, define the influence $I_k(f)$ of the $k^{th}$ variable to be the quantity

$I_k(f) := \mu \{ x \in \{-1,+1\}^n: f(\sigma_k(x)) \neq f(x) \}$

where $\sigma_k(x)$ is the element of the cube formed by flipping the sign of the $k^{th}$ variable. Informally, $I_k(f)$ measures the probability that the $k^{th}$ voter could actually determine the outcome of an election; it is sometimes referred to as the Banzhaf power index. The total influence I(f) of f (also known as the average sensitivity and the edge-boundary density) is then defined as

$I(f) := \sum_{k=1}^n I_k(f).$

Thus for instance a dictator function has total influence 1, whereas majority vote has total influence comparable to $\sqrt{n}$. The influence can range between 0 (for constant functions +1, -1) and n (for the parity function $x_1 \ldots x_n$ or its negation). If f has mean zero (i.e. it is equal to +1 half of the time), then the edge-isoperimetric inequality asserts that $I(f) \geq 1$ (with equality if and only if f is a dictator function or the negation of one), whilst the Kahn-Kalai-Linial (KKL) theorem asserts that $I_k(f) \gg \frac{\log n}{n}$ for some k. There is a result of Friedgut that if $I(f)$ is bounded by A (say) and $\varepsilon > 0$, then f is within a distance $\varepsilon$ (in $L^1$ norm) of another boolean function g which only depends on $O_{A,\varepsilon}(1)$ of the variables (such functions are known as juntas).
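These quantities are small enough to compute exactly by brute force for small n; a short Python snippet, directly from the definitions above (the helper names are my own):

```python
from itertools import product

# Brute-force computation of the influences I_k(f) and the total influence
# I(f) of a boolean function f on the discrete cube {-1,+1}^n.

def influence(f, n, k):
    # probability, over uniform x, that flipping the k-th coordinate changes f
    flips = sum(1 for x in product((-1, 1), repeat=n)
                if f(x) != f(x[:k] + (-x[k],) + x[k + 1:]))
    return flips / 2 ** n

def total_influence(f, n):
    return sum(influence(f, n, k) for k in range(n))

dictator = lambda x: x[0]
majority = lambda x: 1 if sum(x) > 0 else -1       # n odd
parity = lambda x: x[0] * x[1] * x[2]              # n = 3 here

# For n = 3: the dictator has total influence 1; parity has total influence
# n = 3 (every flip changes the sign); each majority voter is pivotal exactly
# when the other two voters split, giving influence 1/2 each and total 3/2.
```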

[This post is authored by Gil Kalai, who has kindly “guest blogged” this week’s “open problem of the week”. - T.]

This is a problem in discrete and convex geometry. It seeks to quantify the intuitively obvious fact that large convex bodies are so “fat” that they cannot avoid “detection” by a small number of observation points. More precisely, we fix a dimension d and make the following definition (introduced by Haussler and Welzl):

• Definition: Let $X \subset {\Bbb R}^d$ be a finite set of points, and let $0 < \epsilon < 1$. We say that a finite set $Y \subset {\Bbb R}^d$ is a weak $\epsilon$-net for X (with respect to convex bodies) if, whenever B is a convex body which is large in the sense that $|B \cap X| > \epsilon |X|$, then B contains at least one point of Y. (If Y is contained in X, we say that Y is a strong $\epsilon$-net for X with respect to convex bodies.)

For example, in one dimension, if $X = \{1,\ldots,N\}$, and $Y = \{ \epsilon N, 2 \epsilon N, \ldots, k \epsilon N \}$ where k is the integer part of $1/\epsilon$, then Y is a weak $\epsilon$-net for X with respect to convex bodies. Thus we see that even when the original set X is very large, one can create an $\epsilon$-net of size as small as $O(1/\epsilon)$. Strong $\epsilon$-nets are of importance in computational learning theory, and are fairly well understood via Vapnik-Chervonenkis (or VC) theory; however, the theory of weak $\epsilon$-nets is still not completely satisfactory.
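The one-dimensional example can be checked mechanically: a convex body in one dimension is an interval, so it suffices to verify that every discrete interval capturing more than an $\epsilon$ fraction of X straddles a multiple of $\epsilon N$. A hypothetical Python sketch (the function name and interface are illustrative):

```python
# Verify the 1D example: X = {1,...,N}, and Y the multiples of eps*N up to N.
# In one dimension a convex body is an interval, so we test every discrete
# interval [a, b] containing more than eps*N points of X.

def is_weak_net(N, eps):
    step = eps * N
    Y = [i * step for i in range(1, int(1 / eps) + 1)]
    for a in range(1, N + 1):
        for b in range(a, N + 1):
            if b - a + 1 > eps * N:                   # |B ∩ X| > eps |X|
                if not any(a <= y <= b for y in Y):   # B must contain a net point
                    return False
    return True
```

For instance, `is_weak_net(20, 0.25)` confirms that the four points {5, 10, 15, 20} form a weak 1/4-net for {1, ..., 20}.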

One can ask what happens in higher dimensions, for instance when X is a discrete cube $X = \{1,\ldots,N\}^d$. It is not too hard to cook up $\epsilon$-nets of size $O_d(1/\epsilon^d)$ (by using tools such as Minkowski’s theorem), but in fact one can create $\epsilon$-nets of size as small as $O( \frac{1}{\epsilon} \log \frac{1}{\epsilon} )$ simply by taking a random subset of X of this cardinality and observing that “up to errors of $\epsilon$“, the total number of essentially different ways a convex body can meet X grows at most polynomially in $1/\epsilon$. (This is a very typical application of the probabilistic method.) On the other hand, since X can contain roughly $1/\epsilon$ disjoint convex bodies, each of which contains at least $\epsilon$ of the points in X, we see that no $\epsilon$-net can have size much smaller than $1/\epsilon$.

Now consider the situation in which X is now an arbitrary finite set, rather than a discrete cube. More precisely, let $f(\epsilon,d)$ be the least number such that every finite set X possesses at least one weak $\epsilon$-net for X with respect to convex bodies of cardinality at most $f(\epsilon,d)$. (One can also replace the finite set X with an arbitrary probability measure; the two formulations are equivalent.) Informally, f is the least number of “guards” one needs to place to prevent a convex body from covering more than $\epsilon$ of any given territory.

• Problem 1: For fixed d, what is the correct rate of growth of f as $\epsilon \to 0$?