
Let ${X}$ and ${Y}$ be two random variables taking values in the same (discrete) range ${R}$, and let ${E}$ be some subset of ${R}$, which we think of as the set of “bad” outcomes for either ${X}$ or ${Y}$. If ${X}$ and ${Y}$ have the same probability distribution, then clearly

$\displaystyle {\bf P}( X \in E ) = {\bf P}( Y \in E ).$

In particular, if it is rare for ${Y}$ to lie in ${E}$, then it is also rare for ${X}$ to lie in ${E}$.

If ${X}$ and ${Y}$ do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance ${\delta(X,Y)}$ between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

$\displaystyle {\bf P}(Y \in E) - \delta(X,Y) \leq {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \delta(X,Y) \ \ \ \ \ (1)$

for any ${E \subset R}$. In particular, if it is rare for ${Y}$ to lie in ${E}$, and ${X,Y}$ are close in total variation, then it is also rare for ${X}$ to lie in ${E}$.

A basic inequality in information theory is Pinsker’s inequality

$\displaystyle \delta(X,Y) \leq \sqrt{\frac{1}{2} D_{KL}(X||Y)}$

where the Kullback-Leibler divergence ${D_{KL}(X||Y)}$ is defined by the formula

$\displaystyle D_{KL}(X||Y) = \sum_{x \in R} {\bf P}( X=x ) \log \frac{{\bf P}(X=x)}{{\bf P}(Y=x)}.$

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that ${D_{KL}(X||Y)}$ is non-negative (Gibbs’ inequality), and vanishes if and only if ${X}$, ${Y}$ have the same distribution; thus one can think of ${D_{KL}(X||Y)}$ as a measure of how close the distributions of ${X}$ and ${Y}$ are to each other, although one should caution that this is not a symmetric notion of distance, as ${D_{KL}(X||Y) \neq D_{KL}(Y||X)}$ in general. Inserting Pinsker’s inequality into (1), we see for instance that

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \sqrt{\frac{1}{2} D_{KL}(X||Y)}.$

Thus, if ${X}$ is close to ${Y}$ in the Kullback-Leibler sense, and it is rare for ${Y}$ to lie in ${E}$, then it is rare for ${X}$ to lie in ${E}$ as well.
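As a quick numerical sanity check of (1) and of Pinsker's inequality, here is a minimal Python sketch. The two distributions and the set ${E}$ are made-up illustrations (they do not come from any particular application), and logarithms are natural.

```python
import math

def total_variation(p, q):
    """Total variation distance: half the l^1 distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """D_KL(p||q) in nats; terms with p(x) = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical distributions of X and Y on the four-point range R = {0, 1, 2, 3}.
p_x = [0.40, 0.30, 0.20, 0.10]
p_y = [0.25, 0.25, 0.25, 0.25]
bad = {0}  # hypothetical "bad" set E

delta = total_variation(p_x, p_y)
pinsker = math.sqrt(0.5 * kl_divergence(p_x, p_y))
prob_x = sum(p_x[i] for i in bad)
prob_y = sum(p_y[i] for i in bad)

print(f"delta(X,Y) = {delta:.4f}, Pinsker upper bound = {pinsker:.4f}")
print(f"P(X in E) = {prob_x:.4f}, P(Y in E) + delta = {prob_y + delta:.4f}")
```

Swapping the roles of `p_x` and `p_y` in `kl_divergence` changes the Pinsker bound but not `delta`, which illustrates the asymmetry of the divergence mentioned above.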

We can specialise this inequality to the case when ${Y}$ is a uniform random variable ${U}$ on a finite range ${R}$ of some cardinality ${N}$, in which case the Kullback-Leibler divergence ${D_{KL}(X||U)}$ simplifies to

$\displaystyle D_{KL}(X||U) = \log N - {\bf H}(X)$

where

$\displaystyle {\bf H}(X) := \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}$

is the Shannon entropy of ${X}$. Again, a routine application of Jensen’s inequality shows that ${{\bf H}(X) \leq \log N}$, with equality if and only if ${X}$ is uniformly distributed on ${R}$. The above inequality then becomes

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(U \in E) + \sqrt{\frac{1}{2}(\log N - {\bf H}(X))}. \ \ \ \ \ (2)$

Thus, if ${E}$ is a small fraction of ${R}$ (so that it is rare for ${U}$ to lie in ${E}$), and the entropy of ${X}$ is very close to the maximum possible value of ${\log N}$, then it is rare for ${X}$ to lie in ${E}$ also.

The inequality (2) is only useful when the entropy ${{\bf H}(X)}$ is close to ${\log N}$ in the sense that ${{\bf H}(X) = \log N - O(1)}$; otherwise the bound is worse than the trivial bound of ${{\bf P}(X \in E) \leq 1}$. In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy ${{\bf H}(X)}$ was allowed to be smaller than ${\log N - O(1)}$. More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality) Let ${X}$ be a random variable taking values in a finite range ${R}$ of cardinality ${N}$, let ${U}$ be a uniformly distributed random variable in ${R}$, and let ${E}$ be a subset of ${R}$. Then

$\displaystyle {\bf P}(X \in E) \leq \frac{(\log N - {\bf H}(X)) + \log 2}{\log 1/{\bf P}(U \in E)}.$

Proof: Consider the conditional entropy ${{\bf H}(X | 1_{X \in E} )}$. On the one hand, we have

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf H}(X, 1_{X \in E}) - {\bf H}(1_{X \in E} )$

$\displaystyle = {\bf H}(X) - {\bf H}(1_{X \in E})$

$\displaystyle \geq {\bf H}(X) - \log 2$

by Jensen’s inequality. On the other hand, one has

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf P}(X \in E) {\bf H}(X | X \in E )$

$\displaystyle + (1-{\bf P}(X \in E)) {\bf H}(X | X \not \in E)$

$\displaystyle \leq {\bf P}(X \in E) \log |E| + (1-{\bf P}(X \in E)) \log N$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{N}{|E|}$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{1}{{\bf P}(U \in E)},$

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. $\Box$
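One can test Lemma 1 by brute force on a small range. The sketch below checks the inequality for every proper non-empty subset ${E}$ of a six-point range; the distribution `p_x` is an arbitrary illustrative choice, and the cases ${E = \emptyset}$ and ${E = R}$ are skipped because the denominator ${\log 1/{\bf P}(U \in E)}$ is then infinite or zero.

```python
import math
from itertools import combinations

def entropy(p):
    """Shannon entropy in nats."""
    return sum(a * math.log(1.0 / a) for a in p if a > 0)

def lemma_bound(p, bad):
    """Right-hand side of Lemma 1 for the subset `bad` of range(len(p))."""
    n = len(p)
    deficit = math.log(n) - entropy(p)           # log N - H(X)
    return (deficit + math.log(2)) / math.log(n / len(bad))

# Hypothetical distribution of X on a range of N = 6 points.
p_x = [0.35, 0.25, 0.15, 0.10, 0.10, 0.05]

worst_slack = float("inf")
for size in range(1, len(p_x)):                  # proper non-empty subsets E
    for bad in combinations(range(len(p_x)), size):
        prob = sum(p_x[i] for i in bad)          # P(X in E)
        worst_slack = min(worst_slack, lemma_bound(p_x, bad) - prob)

print(f"smallest value of (bound - P(X in E)) over all E: {worst_slack:.4f}")
```

A non-negative value of `worst_slack` means the lemma held for every subset tested; this is of course only a check on one example, not a substitute for the proof.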

Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

$\displaystyle {\bf P}(X \in E) \leq \frac{D(X||Y) + \log 2}{\log 1/{\bf P}(Y \in E)}$

for arbitrary random variables ${X,Y}$ taking values in the same discrete range ${R}$, which follows from the data processing inequality

$\displaystyle D( f(X)||f(Y)) \leq D(X|| Y)$

for arbitrary functions ${f}$, applied to the indicator function ${f = 1_E}$. Indeed one has

$\displaystyle D( 1_E(X) || 1_E(Y) ) = {\bf P}(X \in E) \log \frac{{\bf P}(X \in E)}{{\bf P}(Y \in E)}$

$\displaystyle + {\bf P}(X \not \in E) \log \frac{{\bf P}(X \not \in E)}{{\bf P}(Y \not \in E)}$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - h( {\bf P}(X \in E) )$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - \log 2$

where ${h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}}$ is the entropy function.

Thus, for instance, if one has

$\displaystyle {\bf H}(X) \geq \log N - o(K)$

and

$\displaystyle {\bf P}(U \in E) \leq \exp( - K )$

for some ${K}$ much larger than ${1}$ (so that ${1/K = o(1)}$), then

$\displaystyle {\bf P}(X \in E) = o(1).$

More informally: if the entropy of ${X}$ is somewhat close to the maximum possible value of ${\log N}$, and it is exponentially rare for a uniform variable to lie in ${E}$, then it is still somewhat rare for ${X}$ to lie in ${E}$. The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable ${X}$ which is uniformly distributed inside a small set ${E}$ with some probability ${p}$ and uniformly distributed outside of ${E}$ with probability ${1-p}$, for some parameter ${0 \leq p \leq 1}$.
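The sharpness example can be made concrete. For the mixture just described, the entropy is ${h(p) + p \log |E| + (1-p) \log(N-|E|)}$, so the sketch below evaluates the bound of Lemma 1 and the Pinsker-based bound (2) in closed form. The sizes ${N = 2^{200}}$ and ${|E| = 2^{100}}$ are arbitrary illustrative choices (so that ${K = 100 \log 2}$).

```python
import math

def binary_entropy(u):
    """h(u) = u log(1/u) + (1-u) log(1/(1-u)), in nats."""
    return sum(t * math.log(1.0 / t) for t in (u, 1.0 - u) if t > 0)

def mixture_entropy(p, size_e, size_r):
    """Entropy of X: uniform on E with probability p, uniform outside E otherwise."""
    return (binary_entropy(p)
            + p * math.log(size_e)
            + (1.0 - p) * math.log(size_r - size_e))

size_r = 2 ** 200   # N (hypothetical)
size_e = 2 ** 100   # |E| (hypothetical)
log_inv_prob_u = math.log(size_r) - math.log(size_e)   # log 1/P(U in E)

lemma_bounds = {}
for p in (0.01, 0.1, 0.5):
    deficit = math.log(size_r) - mixture_entropy(p, size_e, size_r)
    lemma_bounds[p] = (deficit + math.log(2)) / log_inv_prob_u
    pinsker_bound = size_e / size_r + math.sqrt(0.5 * deficit)
    print(f"P(X in E) = {p}: Lemma 1 gives {lemma_bounds[p]:.4f}, "
          f"bound (2) gives {pinsker_bound:.4f}")
```

In this family the bound of Lemma 1 exceeds the true probability ${p}$ by roughly ${(\log 2 - h(p))/K}$, whereas (2) becomes trivial once the entropy deficit exceeds ${2}$.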

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

$\displaystyle F(U) \approx {\bf E} F(U)$

with exponentially high probability, where ${U}$ is a uniform distribution and ${F}$ is some reasonable function of ${U}$. Combining this with the above lemma, we can then obtain approximations of the form

$\displaystyle F(X) \approx {\bf E} F(U) \ \ \ \ \ (3)$

with somewhat high probability, if the entropy of ${X}$ is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable ${X}$ did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

$\displaystyle \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+j+p) 1_{p|n+j}$

$\displaystyle \approx \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+j+p)}{p}$

for “most” choices of ${n}$ and a suitable choice of ${H}$ (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as ${\sum_{n \leq x} \frac{\lambda(n)\lambda(n+1)}{n}}$ through the multiplicativity of ${\lambda}$, while the right-hand side, being a linear correlation involving two parameters ${j,p}$ rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.

A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:

Theorem 1 (Plünnecke-Ruzsa inequality) Let ${A, B_1, \ldots, B_m}$ be finite non-empty subsets of an additive group ${G}$, such that ${|A+B_i| \leq K_i |A|}$ for all ${1 \leq i \leq m}$ and some scalars ${K_1,\ldots,K_m \geq 1}$. Then there exists a subset ${A'}$ of ${A}$ such that ${|A' + B_1 + \ldots + B_m| \leq K_1 \ldots K_m |A'|}$.

The proof uses graph-theoretic techniques. Setting ${A=B_1=\ldots=B_m}$, we obtain a useful corollary: if ${A}$ has small doubling in the sense that ${|A+A| \leq K|A|}$, then we have ${|mA| \leq K^m |A|}$ for all ${m \geq 1}$, where ${mA = A + \ldots + A}$ is the sum of ${m}$ copies of ${A}$.
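The corollary ${|mA| \leq K^m |A|}$ is easy to test numerically for sets of integers. The set `a` below is an arbitrary example chosen for illustration, with ${K}$ taken to be the exact doubling constant ${|A+A|/|A|}$.

```python
def sumset(a, b):
    """The sumset A + B = {x + y : x in A, y in B}."""
    return {x + y for x in a for y in b}

def iterated_sumset(a, m):
    """The m-fold sumset mA = A + ... + A (m copies), for m >= 1."""
    result = set(a)
    for _ in range(m - 1):
        result = sumset(result, a)
    return result

a = {0, 1, 2, 3, 10, 11}              # hypothetical finite set of integers
k = len(sumset(a, a)) / len(a)        # doubling constant K

for m in range(1, 6):
    size = len(iterated_sumset(a, m))
    print(f"m = {m}: |mA| = {size}, K^m |A| = {k ** m * len(a):.1f}")
```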

In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as ${A}$ in ${G}$ are replaced with discrete random variables ${X}$ taking values in ${G}$, and (the logarithm of) cardinality ${|A|}$ of a set ${A}$ is replaced by Shannon entropy ${{\Bbb H}(X)}$ of a random variable ${X}$. (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.

I recently discovered, however, that buried in a classic paper of Kaimanovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 1:

Theorem 2 (Entropy Plünnecke-Ruzsa inequality) Let ${X, Y_1, \ldots, Y_m}$ be independent random variables of finite entropy taking values in an additive group ${G}$, such that ${{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i}$ for all ${1 \leq i \leq m}$ and some scalars ${K_1,\ldots,K_m \geq 1}$. Then ${{\Bbb H}(X+Y_1+\ldots+Y_m) \leq {\Bbb H}(X) + \log K_1 \ldots K_m}$.

In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set ${A}$ to a subset ${A'}$, but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if ${{\Bbb H}(X_1+X_2) \leq {\Bbb H}(X) + \log K}$, then ${{\Bbb H}(X_1+\ldots+X_m) \leq {\Bbb H}(X) + (m-1) \log K}$ for all ${m \geq 1}$, where ${X_1,\ldots,X_m}$ are independent copies of ${X}$; this improves slightly over the analogous combinatorial inequality. Indeed, the function ${m \mapsto {\Bbb H}(X_1+\ldots+X_m)}$ is concave (this can be seen by using the ${m=2}$ version of Theorem 2 (or (2) below) to show that the quantity ${{\Bbb H}(X_1+\ldots+X_{m+1})-{\Bbb H}(X_1+\ldots+X_m)}$ is decreasing in ${m}$).

Theorem 2 is actually a quick consequence of the submodularity inequality

$\displaystyle {\Bbb H}(W) + {\Bbb H}(X) \leq {\Bbb H}(Y) + {\Bbb H}(Z) \ \ \ \ \ (1)$

in information theory, which is valid whenever ${X,Y,Z,W}$ are discrete random variables such that ${Y}$ and ${Z}$ each determine ${X}$ (i.e. ${X}$ is a function of ${Y}$, and also a function of ${Z}$), and ${Y}$ and ${Z}$ jointly determine ${W}$ (i.e. ${W}$ is a function of ${Y}$ and ${Z}$). To apply this, let ${X, Y, Z}$ be independent discrete random variables taking values in ${G}$. Observe that the pairs ${(X,Y+Z)}$ and ${(X+Y,Z)}$ each determine ${X+Y+Z}$, and jointly determine ${(X,Y,Z)}$. Applying (1) we conclude that

$\displaystyle {\Bbb H}(X,Y,Z) + {\Bbb H}(X+Y+Z) \leq {\Bbb H}(X,Y+Z) + {\Bbb H}(X+Y,Z)$

which after using the independence of ${X,Y,Z}$ simplifies to the sumset submodularity inequality

$\displaystyle {\Bbb H}(X+Y+Z) + {\Bbb H}(Y) \leq {\Bbb H}(X+Y) + {\Bbb H}(Y+Z) \ \ \ \ \ (2)$

(this inequality was also recently observed by Madiman; it is the ${m=2}$ case of Theorem 2). As a corollary of this inequality, we see that if ${{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i}$, then

$\displaystyle {\Bbb H}(X+Y_1+\ldots+Y_i) \leq {\Bbb H}(X+Y_1+\ldots+Y_{i-1}) + \log K_i,$

and Theorem 2 follows by telescoping series.
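The sumset submodularity inequality (2) can be checked numerically in a cyclic group, where the distribution of a sum of independent variables is a cyclic convolution. The following sketch works in ${{\Bbb Z}/7{\Bbb Z}}$; the three distributions are made up for illustration, and entropies are in nats.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return sum(a * math.log(1.0 / a) for a in p if a > 0)

def convolve(p, q):
    """Distribution of X + Y in Z/nZ for independent X ~ p and Y ~ q."""
    n = len(p)
    out = [0.0] * n
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[(i + j) % n] += a * b
    return out

# Hypothetical independent random variables X, Y, Z in Z/7Z.
p_x = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0]
p_y = [0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0]
p_z = [0.25, 0.25, 0.0, 0.5, 0.0, 0.0, 0.0]

# H(X+Y+Z) + H(Y) versus H(X+Y) + H(Y+Z)
lhs = entropy(convolve(convolve(p_x, p_y), p_z)) + entropy(p_y)
rhs = entropy(convolve(p_x, p_y)) + entropy(convolve(p_y, p_z))
print(f"H(X+Y+Z) + H(Y) = {lhs:.4f} <= H(X+Y) + H(Y+Z) = {rhs:.4f}")
```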

The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of ${X,Y_1,\ldots,Y_m}$ than ${X+Y_i}$; see this paper and this paper respectively.

I am posting here four more of my Mahler lectures, each of which is based on earlier talks of mine:

As always, comments, corrections, and other feedback are welcome.

There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object ${A}$; each loop of the algorithm replaces ${A}$ with some better version ${A'}$ of itself, until some desired property of ${A}$ is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object ${A}$ often degrade some other finitary quantity ${B}$ in the process, and an infinite iteration would then have the undesirable effect of making ${B}$ infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a monotonicity property, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

• The mass increment argument. This is perhaps the most familiar way to ensure termination: make each improved object ${A'}$ “heavier” than the previous one ${A}$ by some non-trivial amount (e.g. by ensuring that the cardinality of ${A'}$ is strictly greater than that of ${A}$, thus ${|A'| \geq |A|+1}$). Dually, one can try to force the amount of “mass” remaining “outside” of ${A}$ in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of ${A}$ that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. ${A}$ is maximal in some ${L^1}$ sense), and this should force some desirable property on ${A}$.
• The density increment argument. This is a variant of the mass increment argument, in which one increments the “density” of ${A}$ rather than the “mass”. For instance, ${A}$ might be contained in some ambient space ${P}$, and one seeks to improve ${A}$ to ${A'}$ (and ${P}$ to ${P'}$) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. ${|A'|/|P'| \geq |A|/|P| + c}$ for some ${c>0}$). On the other hand, the density of ${A}$ is clearly bounded above by ${1}$. As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside ${A}$, and then “zoom in” to a region of increased density by reducing ${A}$ and ${P}$ appropriately. Eventually no further usable density fluctuation remains (i.e. ${A}$ is uniformly distributed), and this should force some desirable property on ${A}$.
• The energy increment argument. This is an “${L^2}$” analogue of the “${L^1}$”-based mass increment argument (or the “${L^\infty}$”-based density increment argument), in which one seeks to increment the amount of “energy” that ${A}$ captures from some reference object ${X}$, or (equivalently) to decrement the amount of energy of ${X}$ which is still “orthogonal” to ${A}$. Here ${A}$ and ${X}$ are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to ${A}$, and add them to ${A}$ to form ${A'}$, thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of ${A}$ remains to be added (i.e. ${A}$ is maximal in some ${L^2}$ sense), and this should force some desirable property on ${A}$.
• The rank reduction argument. Here, one seeks to make each new object ${A'}$ have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within ${A}$, and use them to collapse ${A}$ to an object ${A'}$ of lower rank. Eventually, no further usable collisions within ${A}$ remain, and this should force some desirable property on ${A}$.

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various monotonicity formulae for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of concentration compactness, which is used heavily in PDE, do have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an entropy compression argument, and only applies to probabilistic algorithms which require a certain collection ${R}$ of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object ${A}$ (which may also have been generated randomly) and some portion of the random string ${R}$ to (deterministically) create a better object ${A'}$ (and a shorter random string ${R'}$, formed by throwing away those bits of ${R}$ that were used in the loop). The key point is to design the algorithm to be partially reversible, in the sense that given ${A'}$ and ${R'}$ and some additional data ${H'}$ that logs the cumulative history of the algorithm up to this point, one can reconstruct ${A}$ together with the remaining portion ${R}$ not already contained in ${R'}$. Thus, each stage of the argument compresses the information-theoretic content of the string ${A+R}$ into the string ${A'+R'+H'}$ in a lossless fashion. However, a random variable such as ${A+R}$ cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of ${A+R}$, and if the length of ${A'+R'+H'}$ is significantly less than that of ${A+R}$ (i.e. we need the marginal growth in the length of the history file ${H'}$ per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.
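To illustrate the “partially reversible” feature, here is a toy Python sketch of a resampling loop for a satisfiability problem. It is an assumption-laden simplification and not Moser's actual algorithm: the loop is iterative rather than recursive, the clause list is a made-up instance, and the history is stored naively, so it is not claimed to be short enough for the compression bound. All the sketch shows is that the final assignment together with the history determines the initial assignment, because a clause is only resampled when all of its literals fail, which pins down the values its variables had just beforehand.

```python
import random

def violated(clause, assignment):
    """A clause is a list of (variable, wanted_value); it fails if no literal holds."""
    return all(assignment[v] != want for v, want in clause)

def resample(num_vars, clauses, rng, max_rounds=100_000):
    """Resample the first violated clause until every clause is satisfied."""
    initial = [rng.random() < 0.5 for _ in range(num_vars)]
    assignment = list(initial)
    history = []                                  # indices of resampled clauses
    for _ in range(max_rounds):
        bad = [i for i, c in enumerate(clauses) if violated(c, assignment)]
        if not bad:
            return initial, assignment, history
        for v, _want in clauses[bad[0]]:          # fresh random bits for one clause
            assignment[v] = rng.random() < 0.5
        history.append(bad[0])
    raise RuntimeError("did not terminate within max_rounds")

def rewind(final, history, clauses):
    """Recover the initial assignment from the final one and the history."""
    state = list(final)
    for index in reversed(history):
        for v, want in clauses[index]:
            state[v] = not want                   # the clause was violated before resampling
    return state

# Hypothetical 3-SAT instance on 8 variables.
clauses = [
    [(0, True), (1, True), (2, True)],
    [(2, False), (3, True), (4, True)],
    [(4, False), (5, True), (6, True)],
    [(6, False), (7, True), (0, False)],
]
initial, final, history = resample(8, clauses, random.Random(2024))
print("resampling steps:", len(history))
print("initial assignment recovered:", rewind(final, history, clauses) == initial)
```

The same rewinding idea recovers every intermediate assignment, and hence the bits that were overwritten at each step, which is the sense in which the state ${A+R}$ is compressed losslessly into ${A'+R'+H'}$.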

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object ${A}$ which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of ${A}$. Here, the failure of the algorithm to halt leads to new information that can be used to “compress” ${A}$ (or more precisely, the full state ${A+R}$) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.

The most fundamental unsolved problem in complexity theory is undoubtedly the P=NP problem, which asks (roughly speaking) whether a problem which can be solved by a non-deterministic polynomial-time (NP) algorithm, can also be solved by a deterministic polynomial-time (P) algorithm. The general belief is that ${P \neq NP}$, i.e. there exist problems which can be solved by non-deterministic polynomial-time algorithms but not by deterministic polynomial-time algorithms.

One reason why the ${P \neq NP}$ question is so difficult to resolve is that a certain generalisation of this question has an affirmative answer in some cases, and a negative answer in other cases. More precisely, if we give all the algorithms access to an oracle, then for one choice ${A}$ of this oracle, all the problems that are solvable by non-deterministic polynomial-time algorithms that call ${A}$ (${NP^A}$), can also be solved by a deterministic polynomial-time algorithm that calls ${A}$ (${P^A}$), thus ${P^A = NP^A}$; but for another choice ${B}$ of this oracle, there exist problems solvable by non-deterministic polynomial-time algorithms that call ${B}$, which cannot be solved by a deterministic polynomial-time algorithm that calls ${B}$, thus ${P^B \neq NP^B}$. One particular consequence of this result (which is due to Baker, Gill, and Solovay) is that there cannot be any relativisable proof of either ${P=NP}$ or ${P \neq NP}$, where “relativisable” means that the proof would also work without any changes in the presence of an oracle.

The Baker-Gill-Solovay result was quite surprising, but the idea of the proof turns out to be rather simple. To get an oracle ${A}$ such that ${P^A=NP^A}$, one basically sets ${A}$ to be a powerful simulator that can simulate non-deterministic machines (and, furthermore, can also simulate itself); it turns out that any PSPACE-complete oracle would suffice for this task. To get an oracle ${B}$ for which ${P^B \neq NP^B}$, one has to be a bit sneakier, setting ${B}$ to be a query device for a sparse set of random (or high-complexity) strings, which are too complex to be guessed at by any deterministic polynomial-time algorithm.

Unfortunately, the simple idea of the proof can be obscured by various technical details (e.g. using Turing machines to define ${P}$ and ${NP}$ precisely), which require a certain amount of time to properly absorb. To help myself try to understand this result better, I have decided to give a sort of “allegory” of the proof, based around a (rather contrived) story about various students trying to pass a multiple choice test, which avoids all the technical details but still conveys the basic ideas of the argument. This allegory was primarily for my own benefit, but I thought it might also be of interest to some readers here (and also has some tangential relation to the proto-polymath project of deterministically finding primes), so I reproduce it below.

[This post should have appeared several months ago, but I didn’t have a link to the newsletter at the time, and I subsequently forgot about it until now.  -T.]

Last year, Emmanuel Candès and I were two of the recipients of the 2008 IEEE Information Theory Society Paper Award, for our paper “Near-optimal signal recovery from random projections: universal encoding strategies?” published in IEEE Inf. Thy.  (The other recipient is David Donoho, for the closely related paper “Compressed sensing” in the same journal.)  These papers helped initiate the modern subject of compressed sensing, which I have talked about earlier on this blog, although of course they also built upon a number of important precursor results in signal recovery, high-dimensional geometry, Fourier analysis, linear programming, and probability.  As part of our response to this award, Emmanuel and I wrote a short piece commenting on these developments, entitled “Reflections on compressed sensing”, which appears in the Dec 2008 issue of the IEEE Information Theory newsletter.  In it we place our results in the context of these precursor results, and also mention some of the many active directions (theoretical, numerical, and applied) that compressed sensing is now developing in.

Emmanuel Candès and I have just uploaded to the arXiv our paper “The power of convex relaxation: near-optimal matrix completion“, submitted to IEEE Inf. Theory.  In this paper we study the matrix completion problem, which one can view as a sort of “non-commutative” analogue of the sparse recovery problem studied in the field of compressed sensing, although there are also some other significant differences between the two problems.   The sparse recovery problem seeks to recover a sparse vector $x \in {\Bbb R}^n$ from some linear measurements $Ax = b \in {\Bbb R}^m$, where A is a known $m \times n$ matrix.  For general x, classical linear algebra tells us that if m < n, then the problem here is underdetermined and has multiple solutions; but under the additional assumption that x is sparse (most of the entries are zero), it turns out (under various hypotheses on the measurement matrix A, and in particular if A contains a sufficient amount of “randomness” or “incoherence”) that exact recovery becomes possible in the underdetermined case.  Furthermore, recovery is not only theoretically possible, but is also computationally practical in many cases; in particular, under some assumptions on A, one can recover x by minimising the convex norm $\| x \|_{\ell^1}$ over all solutions to Ax=b.

Now we turn to the matrix completion problem.  Instead of an unknown vector $x \in {\Bbb R}^n$, we now have an unknown matrix $M = (m_{ij})_{i \in [n_1], j \in [n_2]} \in {\Bbb R}^{n_1 \times n_2}$ (we use the shorthand $[n] := \{1,\ldots,n\}$ here). We will take a specific type of underdetermined linear measurement of M, namely we pick a random subset $\Omega \subset [n_1] \times [n_2]$ of the matrix array $[n_1] \times [n_2]$ of some cardinality $1 \leq m \leq n_1 n_2$, and form the random sample $P_\Omega(M) := (m_{ij})_{(i,j) \in \Omega} \in {\Bbb R}^{\Omega}$ of M.

Of course, with no further information on M, it is impossible to complete the matrix M from the partial information $P_\Omega(M)$ – we only have $m$ pieces of information and need $n_1 n_2$.  But suppose we also know that M is low-rank, e.g. has rank less than r; this is an analogue of sparsity, but for matrices rather than vectors.  Then, in principle, we have reduced the number of degrees of freedom for M from $n_1 n_2$ to something more like $O( r \max(n_1,n_2) )$, and so (in analogy with compressed sensing) one may now hope to perform matrix completion with a much smaller fraction of samples, and in particular with m close to $r \max(n_1,n_2)$.

This type of problem comes up in several real-world applications, most famously in the Netflix prize.  The Netflix prize problem is to be able to predict a very large ratings matrix M, whose rows are the customers, whose columns are the movies, and the entries are the rating that each customer would hypothetically assign to each movie.  Of course, not every customer has rented every movie from Netflix, and so only a small fraction $P_\Omega(M)$ of this matrix is actually known.  However, if one makes the assumption that most customers’ rating preference is determined by only a small number of characteristics of the movie (e.g. genre, lead actor/actresses, director, year, etc.), then the matrix should be (approximately) low rank, and so the above type of analysis should be useful (though of course it is not going to be the only tool of use in this messy, real-world problem).

Actually, one expects to need to oversample the matrix by a logarithm or two in order to have a good chance of exact recovery, if one is sampling randomly.  This can be seen even in the rank one case r=1, in which $M=uv^*$ is the product of a column vector and a row vector; let’s consider square matrices $n_1=n_2=n$ for simplicity.  Observe that if the sampled coordinates $\Omega$ completely miss one of the rows of the matrix, then the corresponding element of u has gone completely unmeasured, and one cannot hope to complete this row of the matrix.   Thus one needs to sample every row (and also every column) of the $n \times n$ matrix.  The solution to the coupon collector’s problem then tells us that one needs about $O(n \log n)$ samples to achieve this goal.  In fact, the theory of Erdős-Rényi random graphs tells us that the bipartite graph induced by $\Omega$ becomes almost surely connected beyond this threshold, which turns out to be exactly what is needed to perform matrix completion for rank 1 matrices.
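The role of connectivity in the rank one case can be made explicit: writing ${m_{ij} = u_i v_j}$, each observed entry links a row to a column, and one can propagate the values of ${u}$ and ${v}$ along the edges of the bipartite graph. The Python sketch below does this with exact rational arithmetic. It assumes (as simplifying hypotheses of my own) that all entries of the matrix are non-zero and that ${u_0}$ is normalised to ${1}$; the function name `complete_rank_one` and the sample data are hypothetical.

```python
from collections import deque
from fractions import Fraction

def complete_rank_one(n_rows, n_cols, observed):
    """Complete a rank one matrix with non-zero entries from observed[(i, j)] values.

    Returns the full matrix, or None if the bipartite graph of observed
    entries does not connect every row and column to row 0.
    """
    row_edges = {i: [] for i in range(n_rows)}
    col_edges = {j: [] for j in range(n_cols)}
    for i, j in observed:
        row_edges[i].append(j)
        col_edges[j].append(i)
    u = {0: Fraction(1)}                 # normalisation u_0 = 1
    v = {}
    queue = deque([("row", 0)])
    while queue:                         # breadth-first search over rows and columns
        kind, idx = queue.popleft()
        if kind == "row":
            for j in row_edges[idx]:
                if j not in v:
                    v[j] = Fraction(observed[idx, j]) / u[idx]
                    queue.append(("col", j))
        else:
            for i in col_edges[idx]:
                if i not in u:
                    u[i] = Fraction(observed[i, idx]) / v[idx]
                    queue.append(("row", i))
    if len(u) < n_rows or len(v) < n_cols:
        return None
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical example: M = u v^T with u = (1, 2, 3) and v = (4, 5, 6).
full = [[4, 5, 6], [8, 10, 12], [12, 15, 18]]
sample = {(0, 0): 4, (1, 0): 8, (1, 1): 10, (2, 1): 15, (2, 2): 18}
print(complete_rank_one(3, 3, sample) == full)

# Dropping the only observed entry of the last column disconnects the graph.
sparser = {key: val for key, val in sample.items() if key != (2, 2)}
print(complete_rank_one(3, 3, sparser))
```

In the first call the five observed entries form a spanning tree of the bipartite graph, which is the minimum possible number of samples for a ${3 \times 3}$ matrix; in the second call the third column is never sampled, so completion fails.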

On the other hand, one cannot hope to complete the matrix if some of the singular vectors of the matrix are extremely sparse.  For instance, in the Netflix problem, a singularly idiosyncratic customer (or dually, a singularly unclassifiable movie) may give rise to a row or column of M that has no relation to the rest of the matrix, occupying its own separate component of the singular value decomposition of M; such a row or column is then impossible to complete exactly without sampling the entirety of that row or column.  Thus, to get exact matrix completion from a small fraction of entries, one needs some sort of incoherence assumption on the singular vectors, which spreads them out across all coordinates in a roughly even manner, as opposed to being concentrated on just a few values.

In a recent paper, Candès and Recht proposed solving the matrix completion problem by minimising the nuclear norm (or trace norm)

$\|M\|_* = \sum_{i=1}^{\min(n_1,n_2)} \sigma_i(M) = \hbox{tr}( M M^*)^{1/2}$

amongst all matrices consistent with the observed data $P_\Omega(M)$.  This nuclear norm is the non-commutative counterpart to the $\ell^1$ norm for vectors, and so this algorithm is analogous to the $\ell^1$ minimisation (or basis pursuit) algorithm which is effective for compressed sensing (though not the only such algorithm for this task).  They showed, roughly speaking, that exact matrix completion (for square matrices $n_1=n_2=n$, say) is ensured with high probability provided that the singular vectors obey a certain incoherence property (basically, their $\ell^\infty$ norm should be close to the minimal possible value, namely $O(1/\sqrt{n})$), and that one has the condition

$m \gg n^{1.2} r \log n$.

This differs from the presumably optimal threshold of $nr \log n$ by a factor of about $n^{0.2}$.

The main result of our paper is to mostly eliminate this gap, at the cost of a stronger hypothesis on the matrix being measured:

Main theorem. (Informal statement)  Suppose the $n_1 \times n_2$ matrix M has rank r and obeys a certain “strong incoherence property”.  Then with high probability, nuclear norm minimisation will recover M from a random sample $P_\Omega(M)$ provided that $m \gg n r \log^{O(1)} n$, where $n := \max(n_1,n_2)$.

A result of a broadly similar nature, but with a rather different recovery algorithm and with a somewhat different range of applicability, was recently established by Keshavan, Oh, and Montanari.  The strong incoherence property is somewhat technical, but is related to the Candès-Recht incoherence property and is satisfied by a number of reasonable random matrix models.  The exponent O(1) here is reasonably civilised (ranging between 2 and 9, depending on the specific model and parameters being used).
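
To make the nuclear norm used above concrete, here is a small worked example (the matrix is chosen purely for illustration and does not come from the paper).  Take

$M = \begin{pmatrix} 3 & 0 \\ 0 & 4 \end{pmatrix}.$

Its singular values are $\sigma_1(M) = 4$ and $\sigma_2(M) = 3$, so $\|M\|_* = 4 + 3 = 7$.  Equivalently, the matrix square root $(MM^*)^{1/2} = \begin{pmatrix} 3 & 0 \\ 0 & 4 \end{pmatrix}$ has trace $7$.  By way of comparison, the Frobenius norm $(\hbox{tr}(MM^*))^{1/2} = \sqrt{9+16} = 5$ is the $\ell^2$ norm of the singular values, and the operator norm $4$ is their $\ell^\infty$ norm; the nuclear norm is their $\ell^1$ norm, which is the sense in which it plays the role of the $\ell^1$ norm for vectors.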

This week I am in Seville, Spain, for a conference in harmonic analysis and related topics.  My talk is titled “the uniform uncertainty principle and compressed sensing”.  The content of this talk overlaps substantially with my Ostrowski lecture on the same topic; the slides I prepared for the Seville lecture can be found here.

[Update, Dec 6: Some people have asked about my other lecture given in Seville, on structure and randomness in the prime numbers.  This lecture is largely equivalent to the one posted here.]

Given that there has recently been a lot of discussion on this blog about this logic puzzle, I thought I would make a dedicated post for it (and move all the previous comments to this post). The text here is adapted from an earlier web page of mine from a few years back.

The puzzle has a number of formulations, but I will use this one:

There is an island upon which a tribe resides. The tribe consists of 1000 people, with various eye colours. Yet, their religion forbids them to know their own eye color, or even to discuss the topic; thus, each resident can (and does) see the eye colors of all other residents, but has no way of discovering his or her own (there are no reflective surfaces). If a tribesperson does discover his or her own eye color, then their religion compels them to commit ritual suicide at noon the following day in the village square for all to witness. All the tribespeople are highly logical and devout, and they all know that each other is also highly logical and devout (and they all know that they all know that each other is highly logical and devout, and so forth).

[Added, Feb 15: for the purposes of this logic puzzle, “highly logical” means that any conclusion that can be logically deduced from the information and observations available to an islander will automatically be known to that islander.]

Of the 1000 islanders, it turns out that 100 of them have blue eyes and 900 of them have brown eyes, although the islanders are not initially aware of these statistics (each of them can of course only see 999 of the 1000 tribespeople).

One day, a blue-eyed foreigner visits the island and wins the complete trust of the tribe.

One evening, he addresses the entire tribe to thank them for their hospitality.

However, not knowing the customs, the foreigner makes the mistake of mentioning eye color in his address, remarking “how unusual it is to see another blue-eyed person like myself in this region of the world”.

What effect, if anything, does this faux pas have on the tribe?

The interesting thing about this puzzle is that there are two quite plausible arguments here, which give opposing conclusions:

[Note: if you have not seen the puzzle before, I recommend thinking about it first before clicking ahead.]