You are currently browsing the category archive for the ‘math.IT’ category.

Let and be two random variables taking values in the same (discrete) range , and let be some subset of , which we think of as the set of “bad” outcomes for either or . If and have the same probability distribution, then clearly

In particular, if it is rare for to lie in , then it is also rare for to lie in .

If and do not have exactly the same probability distribution, but their probability distributions are *close* to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

for any . In particular, if it is rare for to lie in , and are close in total variation, then it is also rare for to lie in .

A basic inequality in information theory is Pinsker’s inequality

where the Kullback-Leibler divergence is defined by the formula

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that is non-negative (Gibbs’ inequality), and vanishes if and only if , have the same distribution; thus one can think of as a measure of how close the distributions of and are to each other, although one should caution that this is not a symmetric notion of distance, as in general. Inserting Pinsker’s inequality into (1), we see for instance that

Thus, if is close to in the Kullback-Leibler sense, and it is rare for to lie in , then it is rare for to lie in as well.

We can specialise this inequality to the case when a uniform random variable on a finite range of some cardinality , in which case the Kullback-Leibler divergence simplifies to

where

is the Shannon entropy of . Again, a routine application of Jensen’s inequality shows that , with equality if and only if is uniformly distributed on . The above inequality then becomes

Thus, if is a small fraction of (so that it is rare for to lie in ), and the entropy of is very close to the maximum possible value of , then it is rare for to lie in also.

The inequality (2) is only useful when the entropy is close to in the sense that , otherwise the bound is worse than the trivial bound of . In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy was allowed to be smaller than . More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality)Let be a random variable taking values in a finite range of cardinality , let be a uniformly distributed random variable in , and let be a subset of . Then

*Proof:* Consider the conditional entropy . On the one hand, we have

by Jensen’s inequality. On the other hand, one has

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim.

Remark 2As noted in comments, this inequality can be viewed as a special case of the more general inequalityfor arbitrary random variables taking values in the same discrete range , which follows from the data processing inequality

for arbitrary functions , applied to the indicator function . Indeed one has

where is the entropy function.

Thus, for instance, if one has

and

for some much larger than (so that ), then

More informally: if the entropy of is *somewhat* close to the maximum possible value of , and it is *exponentially* rare for a uniform variable to lie in , then it is still *somewhat* rare for to lie in . The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable which is uniformly distributed inside a small set with some probability and uniformly distributed outside of with probability , for some parameter .

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

with exponentially high probability, where is a uniform distribution and is some reasonable function of . Combining this with the above lemma, we can then obtain approximations of the form

with somewhat high probability, if the entropy of is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

for “most” choices of and a suitable choice of (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as through the multiplicativity of , while the right-hand side, being a linear correlation involving two parameters rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.

A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:

Theorem 1 (Plünnecke-Ruzsa inequality)Let be finite non-empty subsets of an additive group , such that for all and some scalars . Then there exists a subset of such that .

The proof uses graph-theoretic techniques. Setting , we obtain a useful corollary: if has small doubling in the sense that , then we have for all , where is the sum of copies of .

In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as in are replaced with discrete random variables taking values in , and (the logarithm of) cardinality of a set is replaced by Shannon entropy of a random variable . (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.

I recently discovered, however, that buried in a classic paper of Kaimonovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 1:

Theorem 2 (Entropy Plünnecke-Ruzsa inequality)Let be independent random variables of finite entropy taking values in an additive group , such that for all and some scalars . Then .

In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set to a subset , but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if , then for all , where are independent copies of ; this improves slightly over the analogous combinatorial inequality. Indeed, the function is concave (this can be seen by using the version of Theorem 2 (or (2) below) to show that the quantity is decreasing in ).

Theorem 2 is actually a quick consequence of the *submodularity inequality*

in information theory, which is valid whenever are discrete random variables such that and each determine (i.e. is a function of , and also a function of ), and and jointly determine (i.e is a function of and ). To apply this, let be independent discrete random variables taking values in . Observe that the pairs and each determine , and jointly determine . Applying (1) we conclude that

which after using the independence of simplifies to the *sumset submodularity inequality*

(this inequality was also recently observed by Madiman; it is the case of Theorem 2). As a corollary of this inequality, we see that if , then

and Theorem 2 follows by telescoping series.

The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of than ; see this paper and this paper respectively.

I am posting here four more of my Mahler lectures, each of which is based on earlier talks of mine:

- Compressed sensing. This is an updated and reformatted version of my ANZIAM talk on this topic.
- Discrete random matrices. This talk is a survey on recent developments on the universality phenomenon in random matrices, including work of myself and Van Vu. It covers similar material to my Netanyahu lecture, which has not previously appeared in electronic form.
- Recent progress in additive prime number theory. This is an updated and reformatted version of my AMS lecture on this topic.
- Recent progress on the Kakeya conjecture. This is an updated and reformatted version of my Fefferman conference lecture.

As always, comments, corrections, and other feedback are welcome.

There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object ; each loop of the algorithm replaces with some better version of itself, until some desired property of is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object often degrade some other finitary quantity in the process, and an infinite iteration would then have the undesirable effect of making infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a *monotonicity property*, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

- The
*mass increment argument*. This is perhaps the most familiar way to ensure termination: make each improved object “heavier” than the previous one by some non-trivial amount (e.g. by ensuring that the cardinality of is strictly greater than that of , thus ). Dually, one can try to force the amount of “mass” remaining “outside” of in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of , and add them to to form , thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. is*maximal*in some sense), and this should force some desirable property on . - The
*density increment argument*. This is a variant of the mass increment argument, in which one increments the “density” of rather than the “mass”. For instance, might be contained in some ambient space , and one seeks to improve to (and to ) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. for some ). On the other hand, the density of is clearly bounded above by . As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside , and then “zoom in” to a region of increased density by reducing and appropriately. Eventually no further usable density fluctuation remains (i.e. is*uniformly distributed*), and this should force some desirable property on . - The
*energy increment argument*. This is an “” analogue of the ““-based mass increment argument (or the ““-based density increment argument), in which one seeks to increments the amount of “energy” that captures from some reference object , or (equivalently) to decrement the amount of energy of which is still “orthogonal” to . Here and are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to , and add them to to form , thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of remains to be added (i.e. is*maximal*in some sense), and this should force some desirable property on . - The
*rank reduction argument*. Here, one seeks to make each new object to have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within , and use them to collapse to an object of lower rank. Eventually, no further usable collisions within remain, and this should force some desirable property on .

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various *monotonicity formulae* for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of *concentration compactness*, which is used heavily in PDE, does have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an *entropy compression argument*, and only applies to probabilistic algorithms which require a certain collection of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object (which may also have been generated randomly) and some portion of the random string to (deterministically) create a better object (and a shorter random string , formed by throwing away those bits of that were used in the loop). The key point is to design the algorithm to be partially *reversible*, in the sense that given and and some additional data that logs the cumulative *history* of the algorithm up to this point, one can reconstruct together with the remaining portion not already contained in . Thus, each stage of the argument *compresses* the information-theoretic content of the string into the string in a lossless fashion. However, a random variable such as cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of , and if the length of is significantly less than that of (i.e. we need the marginal growth in the length of the history file per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of . Here, the failure of the algorithm to halt leads to new information that can be used to “compress” (or more precisely, the full state ) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.

The most fundamental unsolved problem in complexity theory is undoubtedly the P=NP problem, which asks (roughly speaking) whether a problem which can be solved by a non-deterministic polynomial-time (NP) algorithm, can also be solved by a deterministic polynomial-time (P) algorithm. The general belief is that , i.e. there exist problems which can be solved by non-deterministic polynomial-time algorithms but not by deterministic polynomial-time algorithms.

One reason why the question is so difficult to resolve is that a certain generalisation of this question has an affirmative answer in some cases, and a negative answer in other cases. More precisely, if we give all the algorithms access to an oracle, then for one choice of this oracle, all the problems that are solvable by non-deterministic polynomial-time algorithms that calls (), can also be solved by a deterministic polynomial-time algorithm algorithm that calls (), thus ; but for another choice of this oracle, there exist problems solvable by non-deterministic polynomial-time algorithms that call , which *cannot* be solved by a deterministic polynomial-time algorithm that calls , thus . One particular consequence of this result (which is due to Baker, Gill, and Solovay) is that there cannot be any *relativisable* proof of either or , where “relativisable” means that the proof would also work without any changes in the presence of an oracle.

The Baker-Gill-Solovay result was quite surprising, but the idea of the proof turns out to be rather simple. To get an oracle such that , one basically sets to be a powerful simulator that can simulate non-deterministic machines (and, furthermore, can also simulate *itself*); it turns out that any PSPACE-complete oracle would suffice for this task. To get an oracle for which , one has to be a bit sneakier, setting to be a query device for a sparse set of random (or high-complexity) strings, which are too complex to be guessed at by any deterministic polynomial-time algorithm.

Unfortunately, the simple idea of the proof can be obscured by various technical details (e.g. using Turing machines to define and precisely), which require a certain amount of time to properly absorb. To help myself try to understand this result better, I have decided to give a sort of “allegory” of the proof, based around a (rather contrived) story about various students trying to pass a multiple choice test, which avoids all the technical details but still conveys the basic ideas of the argument. This allegory was primarily for my own benefit, but I thought it might also be of interest to some readers here (and also has some tangential relation to the proto-polymath project of determinstically finding primes), so I reproduce it below.

*[This post should have appeared several months ago, but I didn’t have a link to the newsletter at the time, and I subsequently forgot about it until now. -T.]*

Last year, Emmanuel Candés and I were two of the recipients of the 2008 IEEE Information Theory Society Paper Award, for our paper “Near-optimal signal recovery from random projections: universal encoding strategies?” published in IEEE Inf. Thy.. (The other recipient is David Donoho, for the closely related paper “Compressed sensing” in the same journal.) These papers helped initiate the modern subject of *compressed sensing*, which I have talked about earlier on this blog, although of course they also built upon a number of important precursor results in signal recovery, high-dimensional geometry, Fourier analysis, linear programming, and probability. As part of our response to this award, Emmanuel and I wrote a short piece commenting on these developments, entitled “Reflections on compressed sensing“, which appears in the Dec 2008 issue of the IEEE Information Theory newsletter. In it we place our results in the context of these precursor results, and also mention some of the many active directions (theoretical, numerical, and applied) that compressed sensing is now developing in.

Emmanuel Candés and I have just uploaded to the arXiv our paper “The power of convex relaxation: near-optimal matrix completion“, submitted to IEEE Inf. Theory. In this paper we study the *matrix completion problem*, which one can view as a sort of “non-commutative” analogue of the *sparse recovery problem* studied in the field of compressed sensing, although there are also some other significant differences between the two problems. The sparse recovery problem seeks to recover a sparse vector from some linear measurements , where A is a known matrix. For general x, classical linear algebra tells us that if m < n, then the problem here is underdetermined and has multiple solutions; but under the additional assumption that x is *sparse* (most of the entries are zero), it turns out (under various hypotheses on the measurement matrix A, and in particular if A contains a sufficient amount of “randomness” or “incoherence”) that exact recovery becomes possible in the underdetermined case. Furthermore, recovery is not only theoretically possible, but is also computationally practical in many cases; in particular, under some assumptions on A, one can recover x by minimising the convex norm over all solutions to Ax=b.

Now we turn to the matrix completion problem. Instead of an unknown vector , we now have an unknown matrix (we use the shorthand here). We will take a specific type of underdetermined linear measurement of M, namely we pick a random subset of the matrix array of some cardinality , and form the random sample of M.

Of course, with no further information on M, it is impossible to complete the matrix M from the partial information – we only have pieces of information and need . But suppose we also know that M is *low-rank*, e.g. has rank less than r; this is an analogue of sparsity, but for matrices rather than vectors. Then, in principle, we have reduced the number of degrees of freedom for M from to something more like , and so (in analogy with compressed sensing) one may now hope to perform matrix completion with a much smaller fraction of samples, and in particular with m close to .

This type of problem comes up in several real-world applications, most famously in the Netflix prize. The Netflix prize problem is to be able to predict a very large ratings matrix M, whose rows are the customers, whose columns are the movies, and the entries are the rating that each customer would hypothetically assign to each movie. Of course, not every customer has rented every movie from Netflix, and so only a small fraction of this matrix is actually known. However, if one makes the assumption that most customers’ rating preference is determined by only a small number of characteristics of the movie (e.g. genre, lead actor/actresses, director, year, etc.), then the matrix should be (approximately) low rank, and so the above type of analysis should be useful (though of course it is not going to be the *only* tool of use in this messy, real-world problem).

Actually, one expects to need to oversample the matrix by a logarithm or two in order to have a good chance of exact recovery, if one is sampling randomly. This can be seen even in the rank one case r=1, in which is the product of a column vector and a row vector; let’s consider square matrices for simplicity. Observe that if the sampled coordinates completely miss one of the rows of the matrix, then the corresponding element of u has gone completely unmeasured, and one cannot hope to complete this row of the matrix. Thus one needs to sample every row (and also every column) of the matrix. The solution to the coupon collector’s problem then tells us that one needs about samples to achieve this goal. In fact, the theory of Erdős-Rényi random graphs tells us that the bipartite graph induced by becomes almost surely connected beyond this threshold, which turns out to be exactly what is needed to perform matrix completion for rank 1 matrices.

On the other hand, one cannot hope to complete the matrix if some of the singular vectors of the matrix are extremely sparse. For instance, in the Netflix problem, a singularly idiosyncratic customer (or dually, a singularly unclassifiable movie) may give rise to a row or column of M that has no relation to the rest of the matrix, occupying its own separate component of the singular value decomposition of M; such a row or column is then impossible to complete exactly without sampling the entirety of that row or column. Thus, to get exact matrix completion from a small fraction of entries, one needs some sort of *incoherence* assumption on the singular vectors, which spreads them out across all coordinates in a roughly even manner, as opposed to being concentrated on just a few values.

In a recent paper, Candés and Recht proposed solving the matrix completion problem by minimising the *nuclear norm* (or trace norm)

amongst all matrices consistent with the observed data . This nuclear norm is the non-commutative counterpart to the norm for vectors, and so this algorithm is analogous to the minimisation (or *basis pursuit*) algorithm which is effective for compressed sensing (though not the only such algorithm for this task). They showed, roughly speaking, that exact matrix completion (for, say, square matrices for simplicity) is ensured with high probability so long as the singular vectors obey a certain incoherence property (basically, their norm should be close to the minimal possible value, namely ), so long as one had the condition

.

This differs from the presumably optimal threshold of by a factor of about .

The main result of our paper is to mostly eliminate this gap, at the cost of a stronger hypothesis on the matrix being measured:

Main theorem.(Informal statement) Suppose the matrix M has rank r and obeys a certain “strong incoherence property”. Then with high probability, nuclear norm minimisation will recover M from a random sample provided that , where .

A result of a broadly similar nature, but with a rather different recovery algorithm and with a somewhat different range of applicability, was recently established by Keshavan, Oh, and Montanari. The strong incoherence property is somewhat technical, but is related to the Candés-Recht incoherence property and is satisfied by a number of reasonable random matrix models. The exponent O(1) here is reasonably civilised (ranging between 2 and 9, depending on the specific model and parameters being used).

This week I am in Seville, Spain, for a conference in harmonic analysis and related topics. My talk is titled “the uniform uncertainty principle and compressed sensing“. The content of this talk overlaps substantially with my Ostrowski lecture on the same topic; the slides I prepared for the Seville lecture can be found here.

[*Update*, Dec 6: Some people have asked about my other lecture given in Seville, on structure and randomness in the prime numbers. This lecture is largely equivalent to the one posted here.]

Given that there has recently been a lot of discussion on this blog about this logic puzzle, I thought I would make a dedicated post for it (and move all the previous comments to this post). The text here is adapted from an earlier web page of mine from a few years back.

The puzzle has a number of formulations, but I will use this one:

There is an island upon which a tribe resides. The tribe consists of 1000 people, with various eye colours. Yet, their religion forbids them to know their own eye color, or even to discuss the topic; thus, each resident can (and does) see the eye colors of all other residents, but has no way of discovering his or her own (there are no reflective surfaces). If a tribesperson does discover his or her own eye color, then their religion compels them to commit ritual suicide at noon the following day in the village square for all to witness. All the tribespeople are highly logical and devout, and they all know that each other is also highly logical and devout (and they all know that they all know that each other is highly logical and devout, and so forth).

[

Added, Feb 15: for the purposes of this logic puzzle, “highly logical” means that any conclusion that can logically deduced from the information and observations available to an islander, will automatically be known to that islander.]Of the 1000 islanders, it turns out that 100 of them have blue eyes and 900 of them have brown eyes, although the islanders are not initially aware of these statistics (each of them can of course only see 999 of the 1000 tribespeople).

One day, a blue-eyed foreigner visits to the island and wins the complete trust of the tribe.

One evening, he addresses the entire tribe to thank them for their hospitality.

However, not knowing the customs, the foreigner makes the mistake of mentioning eye color in his address, remarking “how unusual it is to see another blue-eyed person like myself in this region of the world”.

What effect, if anything, does this

faux pashave on the tribe?

The interesting thing about this puzzle is that there are two quite plausible arguments here, which give opposing conclusions:

[Note: if you have not seen the puzzle before, I recommend thinking about it first before clicking ahead.]

## Recent Comments