
Just a short post to note that the Norwegian Academy of Science and Letters has just announced that the 2017 Abel prize has been awarded to Yves Meyer, “for his pivotal role in the development of the mathematical theory of wavelets”.  The actual prize ceremony will be in Oslo in May.

I am actually in Oslo myself currently, having just presented Meyer’s work at the announcement ceremony (and also having written a brief description of some of his work).  The Abel prize has a somewhat unintuitive (and occasionally misunderstood) arrangement in which the presenter of the laureate’s work is selected independently of the laureate (I think in part so that the choice of presenter gives no clues as to the identity of the laureate).  In particular, like other presenters before me (who in recent years have included Timothy Gowers, Jordan Ellenberg, and Alex Bellos), I agreed to present the laureate’s work before knowing who the laureate was!  But in this case the task was very easy, because Meyer’s areas of (both pure and applied) harmonic analysis and PDE fell rather squarely within my own area of expertise.  (I had previously written about some other work of Meyer in this blog post.)  Indeed I had learned about Meyer’s wavelet constructions as a graduate student while taking a course from Ingrid Daubechies.  Daubechies also made extremely important contributions to the theory of wavelets, but due to a conflict of interest (as per the guidelines for the prize committee) arising from Daubechies’ presidency of the International Mathematical Union (which nominates the majority of the members of the Abel prize committee, who then serve for two years) from 2011 to 2014 (and her continuing service ex officio on the IMU executive committee from 2015 to 2018), she will not be eligible for the prize until 2021 at the earliest, and so I do not think this prize should necessarily be construed as a judgement on the relative contributions of Meyer and Daubechies to this field.  (In any case I fully agree with the Abel prize committee’s citation of Meyer’s pivotal role in the development of the theory of wavelets.)

[Update, Mar 28: link to prize committee guidelines and clarification of the extent of Daubechies’ conflict of interest added. -T]

Given a random variable {X} that takes on only finitely many values, we can define its Shannon entropy by the formula

\displaystyle  H(X) := \sum_x \mathbf{P}(X=x) \log \frac{1}{\mathbf{P}(X=x)}

with the convention that {0 \log \frac{1}{0} = 0}. (In some texts, one uses the logarithm to base {2} rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables {X,Y} taking on finitely many values, the joint variable {(X,Y)} is also a random variable taking on finitely many values, and also has an entropy {H(X,Y)}. It obeys the Shannon inequalities

\displaystyle  H(X), H(Y) \leq H(X,Y) \leq H(X) + H(Y)

so we can define some further nonnegative quantities, the mutual information

\displaystyle  I(X:Y) := H(X) + H(Y) - H(X,Y)

and the conditional entropies

\displaystyle  H(X|Y) := H(X,Y) - H(Y); \quad H(Y|X) := H(X,Y) - H(X).

More generally, given three random variables {X,Y,Z}, one can define the conditional mutual information

\displaystyle  I(X:Y|Z) := H(X|Z) + H(Y|Z) - H(X,Y|Z)

and the last of the Shannon entropy inequalities asserts that this quantity is also non-negative.

The mutual information {I(X:Y)} is a measure of the extent to which {X} and {Y} fail to be independent; indeed, it is not difficult to show that {I(X:Y)} vanishes if and only if {X} and {Y} are independent. Similarly, {I(X:Y|Z)} vanishes if and only if {X} and {Y} are conditionally independent relative to {Z}. At the other extreme, {H(X|Y)} is a measure of the extent to which {X} fails to depend on {Y}; indeed, it is not difficult to show that {H(X|Y)=0} if and only if {X} is determined by {Y} in the sense that there is a deterministic function {f} such that {X = f(Y)}. In a related vein, if {X} and {X'} are equivalent in the sense that there are deterministic functional relationships {X = f(X')}, {X' = g(X)} between the two variables, then {X} is interchangeable with {X'} for the purposes of computing the above quantities, thus for instance {H(X) = H(X')}, {H(X,Y) = H(X',Y)}, {I(X:Y) = I(X':Y)}, {I(X:Y|Z) = I(X':Y|Z)}, etc..
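As a quick aside, all of these quantities are easy to compute directly from a finite joint distribution. The following minimal Python sketch (the helper functions and the example distribution are my own, not anything from the discussion above) illustrates the definitions:

```python
from math import log

def H(pmf):
    """Shannon entropy (natural log) of a pmf given as a dict {outcome: probability}."""
    return sum(p * log(1 / p) for p in pmf.values() if p > 0)

def marginals(joint):
    """Marginal pmfs of a joint pmf on pairs (x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

# A joint distribution of two correlated bits.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px, py = marginals(joint)
I_XY = H(px) + H(py) - H(joint)   # mutual information I(X:Y)
H_X_given_Y = H(joint) - H(py)    # conditional entropy H(X|Y)
print(I_XY, H_X_given_Y)          # both nonnegative, as the Shannon inequalities require
```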

One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables {X} being considered come from restricting a single random (and uniformly distributed) boolean function {F: \Omega \rightarrow \{0,1\}} on a given finite domain {\Omega} to some subset {A} of {\Omega}:

\displaystyle  X = F \downharpoonright_A.

In this case, {X} has the law of a random uniformly distributed boolean function from {A} to {\{0,1\}}, and the entropy here can be easily computed to be {|A| \log 2}, where {|A|} denotes the cardinality of {A}. If {X} is the restriction of {F} to {A}, and {Y} is the restriction of {F} to {B}, then the joint variable {(X,Y)} is equivalent to the restriction of {F} to {A \cup B}. If one discards the normalisation factor {\log 2}, one then obtains the following dictionary between entropy and the combinatorics of finite sets:

Random variables {X,Y,Z} | Finite sets {A,B,C}
Entropy {H(X)} | Cardinality {|A|}
Joint variable {(X,Y)} | Union {A \cup B}
Mutual information {I(X:Y)} | Intersection cardinality {|A \cap B|}
Conditional entropy {H(X|Y)} | Set difference cardinality {|A \backslash B|}
Conditional mutual information {I(X:Y|Z)} | {|(A \cap B) \backslash C|}
{X, Y} independent | {A, B} disjoint
{X} determined by {Y} | {A} a subset of {B}
{X,Y} conditionally independent relative to {Z} | {A \cap B \subset C}
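One can check this dictionary numerically on a small example. The following self-contained Python sketch (the choice of domain and of the sets {A}, {B} is mine) enumerates all boolean functions on a four-element domain and compares the resulting entropies with the corresponding cardinalities:

```python
from itertools import product
from collections import Counter
from math import log

def H(pmf):
    return sum(p * log(1 / p) for p in pmf.values() if p > 0)

def marginal(joint, i):
    m = Counter()
    for k, p in joint.items():
        m[k[i]] += p
    return m

Omega = range(4)
A, B = (0, 1, 2), (1, 2, 3)

# Enumerate all 2^|Omega| boolean functions F, each with equal probability, and
# record the joint law of the restrictions X = F|_A and Y = F|_B.
joint = Counter()
for values in product((0, 1), repeat=len(Omega)):
    F = dict(zip(Omega, values))
    joint[(tuple(F[a] for a in A), tuple(F[b] for b in B))] += 1.0 / 2 ** len(Omega)

HX, HY, HXY = H(marginal(joint, 0)), H(marginal(joint, 1)), H(joint)
print(HX / log(2), len(A))                              # H(X)/log 2   vs  |A| = 3
print(HXY / log(2), len(set(A) | set(B)))               # H(X,Y)/log 2 vs  |A ∪ B| = 4
print((HX + HY - HXY) / log(2), len(set(A) & set(B)))   # I(X:Y)/log 2 vs  |A ∩ B| = 2
```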

Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality {H(X,Y) \leq H(X)+H(Y)} becomes the union bound {|A \cup B| \leq |A| + |B|}, and the definition of mutual information becomes the inclusion-exclusion formula

\displaystyle  |A \cap B| = |A| + |B| - |A \cup B|.

For a more advanced example, consider the data processing inequality that asserts that if {X, Z} are conditionally independent relative to {Y}, then {I(X:Z) \leq I(X:Y)}. Specialising to sets, this now says that if {A, C} are disjoint outside of {B}, then {|A \cap C| \leq |A \cap B|}; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if {A} and {C} are not necessarily disjoint outside of {B}, then a consideration of Venn diagrams gives the more general inequality

\displaystyle  |A \cap C| \leq |A \cap B| + |(A \cap C) \backslash B|

and a further inspection of the diagram then reveals the more precise identity

\displaystyle  |A \cap C| + |(A \cap B) \backslash C| = |A \cap B| + |(A \cap C) \backslash B|.

Using the dictionary in the reverse direction, one is then led to conjecture the identity

\displaystyle  I( X : Z ) + I( X : Y | Z ) = I( X : Y ) + I( X : Z | Y )

which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
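Both sides of this identity are simply {I(X : (Y,Z))}, by the chain rule for mutual information; it is also easy to sanity-check numerically, as in the following Python sketch (the randomly generated joint distribution and the helper names are mine):

```python
import random
from math import log
from itertools import product

def H(pmf):
    return sum(p * log(1 / p) for p in pmf.values() if p > 0)

def Hm(joint, coords):
    """Entropy of the marginal on the given coordinates of a joint pmf on tuples."""
    m = {}
    for k, p in joint.items():
        kk = tuple(k[i] for i in coords)
        m[kk] = m.get(kk, 0.0) + p
    return H(m)

# A randomly chosen joint distribution of (X, Y, Z) on {0,1}^3.
rng = random.Random(0)
w = [rng.random() for _ in range(8)]
total = sum(w)
joint = {k: wi / total for k, wi in zip(product((0, 1), repeat=3), w)}

I_XZ   = Hm(joint, (0,)) + Hm(joint, (2,)) - Hm(joint, (0, 2))
I_XY   = Hm(joint, (0,)) + Hm(joint, (1,)) - Hm(joint, (0, 1))
I_XY_Z = Hm(joint, (0, 2)) + Hm(joint, (1, 2)) - Hm(joint, (0, 1, 2)) - Hm(joint, (2,))
I_XZ_Y = Hm(joint, (0, 1)) + Hm(joint, (1, 2)) - Hm(joint, (0, 1, 2)) - Hm(joint, (1,))
print(abs((I_XZ + I_XY_Z) - (I_XY + I_XZ_Y)) < 1e-12)   # True
```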

On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set {C} is preserved by unions:

\displaystyle  A \cap C = B \cap C = \emptyset \implies (A \cup B) \cap C = \emptyset.

Indeed, one has the union bound

\displaystyle  |(A \cup B) \cap C| \leq |A \cap C| + |B \cap C|. \ \ \ \ \ (1)

Applying the dictionary in the reverse direction, one might now conjecture that if {X} was independent of {Z} and {Y} was independent of {Z}, then {(X,Y)} should also be independent of {Z}, and furthermore that

\displaystyle  I(X,Y:Z) \leq I(X:Z) + I(Y:Z)

but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take {X, Y \in {\bf F}_2} to be independent, uniformly distributed random elements of the finite field {{\bf F}_2} of two elements, and take {Z := X+Y} to be the sum of these two field elements. One can easily check that each of {X} and {Y} is separately independent of {Z}, but the joint variable {(X,Y)} determines {Z} and thus is not independent of {Z}.
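Here is a quick Python verification of this counterexample (a small self-contained sketch; the helper names are mine):

```python
from math import log

def H(pmf):
    return sum(p * log(1 / p) for p in pmf.values() if p > 0)

def I(joint):
    """Mutual information of a joint pmf on pairs (a, b)."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return H(pa) + H(pb) - H(joint)

# X, Y independent fair bits in F_2, and Z = X + Y.
triples = [(x, y, (x + y) % 2) for x in (0, 1) for y in (0, 1)]
I_XZ  = I({(x, z): 0.25 for (x, y, z) in triples})
I_YZ  = I({(y, z): 0.25 for (x, y, z) in triples})
I_XYZ = I({((x, y), z): 0.25 for (x, y, z) in triples})
print(I_XZ, I_YZ)       # both 0: X and Y are each independent of Z
print(I_XYZ, log(2))    # but I(X,Y:Z) = log 2, so (X,Y) is far from independent of Z
```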

From the inclusion-exclusion identities

\displaystyle  |A \cap C| = |A| + |C| - |A \cup C|

\displaystyle  |B \cap C| = |B| + |C| - |B \cup C|

\displaystyle  |(A \cup B) \cap C| = |A \cup B| + |C| - |A \cup B \cup C|

\displaystyle  |A \cap B \cap C| = |A| + |B| + |C| - |A \cup B| - |B \cup C| - |A \cup C| + |A \cup B \cup C|

one can check that (1) is equivalent to the trivial lower bound {|A \cap B \cap C| \geq 0}. The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection {A \cap B \cap C}. (Even the double intersection {A \cap B} only exists information theoretically in a “virtual” sense; the mutual information {I(X:Y)} allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)

However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities {H(X), H(Y), H(X,Y), I(X:Y), H(X|Y), H(Y|X)} associated to just two variables {X,Y} are those that are also necessarily obeyed by their combinatorial analogues {|A|, |B|, |A \cup B|, |A \cap B|, |A \backslash B|, |B \backslash A|}. (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)

One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let {S} be some finite-dimensional vector space over a finite field {{\mathbf F}}, and let {f: S \rightarrow {\mathbf F}} be a random linear functional on {S}, selected uniformly among all such functions. Every subspace {U} of {S} then gives rise to a random variable {X = X_U: U \rightarrow {\mathbf F}} formed by restricting {f} to {U}. This random variable is also distributed uniformly amongst all linear functions on {U}, and its entropy can be easily computed to be {\mathrm{dim}(U) \log |\mathbf{F}|}. Given two random variables {X, Y} formed by restricting {f} to {U, V} respectively, the joint random variable {(X,Y)} determines the random linear function {f} on the union {U \cup V} of the two spaces, and thus by linearity on the Minkowski sum {U+V} as well; thus {(X,Y)} is equivalent to the restriction of {f} to {U+V}. In particular, {H(X,Y) = \mathrm{dim}(U+V) \log |\mathbf{F}|}. This implies that {I(X:Y) = \mathrm{dim}(U \cap V) \log |\mathbf{F}|} and also {H(X|Y) = \mathrm{dim}(\pi_V(U)) \log |\mathbf{F}|}, where {\pi_V: S \rightarrow S/V} is the quotient map. After discarding the normalising constant {\log |\mathbf{F}|}, this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:

Random variables {X,Y,Z} | Subspaces {U,V,W}
Entropy {H(X)} | Dimension {\mathrm{dim}(U)}
Joint variable {(X,Y)} | Sum {U+V}
Mutual information {I(X:Y)} | Dimension of intersection {\mathrm{dim}(U \cap V)}
Conditional entropy {H(X|Y)} | Dimension of projection {\mathrm{dim}(\pi_V(U))}
Conditional mutual information {I(X:Y|Z)} | {\mathrm{dim}(\pi_W(U) \cap \pi_W(V))}
{X, Y} independent | {U, V} transverse ({U \cap V = \{0\}})
{X} determined by {Y} | {U} a subspace of {V}
{X,Y} conditionally independent relative to {Z} | {\pi_W(U)}, {\pi_W(V)} transverse

The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking {S} to be the vector space {\mathbf{F}_2^\Omega} over the finite field {\mathbf{F}_2} of two elements, and only considering those subspaces {U} that are coordinate subspaces {U = {\bf F}_2^A} associated to various subsets {A} of {\Omega}.

As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of {\log |\mathbf{F}|}). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).

The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces {U,V,W} such that {U} and {V} are separately transverse to {W}, but their sum {U+V} is not; for instance, in a two-dimensional vector space {{\bf F}^2}, one can take {U,V,W} to be the one-dimensional subspaces spanned by {(0,1)}, {(1,0)}, and {(1,1)} respectively. Note that this is essentially the same counterexample from before (which took {{\bf F}} to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces {U,V,W} (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables {X,Y,Z} (as well as the corresponding quantities on the above table).
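For the record, here is a quick Python check of this two-dimensional example over {{\bf F}_2}, using a small Gaussian elimination routine (the helper functions are mine):

```python
def rank_mod2(rows):
    """Rank over F_2 of a list of 0/1 vectors, by Gaussian elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def dim_sum(*spaces):   # dim of the sum of the spaces spanned by the given generators
    return rank_mod2([r for s in spaces for r in s])

def dim_cap(S, T):      # dim(S ∩ T) = dim S + dim T - dim(S + T)
    return rank_mod2(S) + rank_mod2(T) - dim_sum(S, T)

U, V, W = [(0, 1)], [(1, 0)], [(1, 1)]      # three lines in F_2^2
print(dim_cap(U, W), dim_cap(V, W))         # 0 0: U and V are each transverse to W
print(dim_cap([(0, 1), (1, 0)], W))         # 1:   but U + V = F_2^2 is not
```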

However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality

\displaystyle  \mathrm{dim}(U \cap V) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V)) + \mathrm{dim}(\pi_X(U) \cap \pi_X(V)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (2)

for any subspaces {U,V,W,X}. This is easiest to see when the three terms on the right-hand side vanish; then {\pi_W(U), \pi_W(V)} are transverse, which implies that {U\cap V \subset W}; similarly {U \cap V \subset X}. But {W} and {X} are transverse, and this clearly implies that {U} and {V} are themselves transverse. To prove the general case of Ingleton’s inequality, one can define {Y := U \cap V} and use {\mathrm{dim}(\pi_W(Y)) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V))} (and similarly for {X} instead of {W}) to reduce to establishing the inequality

\displaystyle  \mathrm{dim}(Y) \leq \mathrm{dim}(\pi_W(Y)) + \mathrm{dim}(\pi_X(Y)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (3)

which can be rearranged using {\mathrm{dim}(\pi_W(Y)) = \mathrm{dim}(Y) - \mathrm{dim}(W) + \mathrm{dim}(\pi_Y(W))} (and similarly for {X} instead of {W}) and {\mathrm{dim}(W \cap X) = \mathrm{dim}(W) + \mathrm{dim}(X) - \mathrm{dim}(W + X)} as

\displaystyle  \mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W)) + \mathrm{dim}(\pi_Y(X)) + \mathrm{dim}(Y)

but this is clear since {\mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W) + \pi_Y(X)) + \mathrm{dim}(Y)}.

Returning to the entropy setting, the analogue

\displaystyle  H( V ) \leq H( V | Z ) + H(V | W ) + I(Z:W)

of (3) is true (exercise!), but the analogue

\displaystyle  I(X:Y) \leq I(X:Y|Z) + I(X:Y|W) + I(Z:W) \ \ \ \ \ (4)

of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then {X,Y} are conditionally independent relative to {Z}, and relative to {W}, and {Z} and {W} are independent, and the claim (4) would then be asserting that {X} and {Y} are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take {Z,W} to be independent uniform variables from {\mathbf{F}_2}, and take {X} and {Y} to be (say) {ZW} and {(1-Z)(1-W)} respectively (thus {X, Y} are the indicators of the events {Z=W=1} and {Z=W=0} respectively). Once one conditions on either {Z} or {W}, one of {X,Y} has positive conditional entropy and the other has zero entropy, and so {X, Y} are conditionally independent relative to either {Z} or {W}; also, {Z} and {W} are independent of each other. But {X} and {Y} are not independent of each other (they cannot be simultaneously equal to {1}). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces {U, V} has a well-defined intersection {U \cap V} that is also a subspace, whereas for arbitrary random variables {X, Y}, there does not necessarily exist the analogue of an intersection, namely a “common information” random variable {V} that has the entropy of {I(X:Y)} and is determined either by {X} or by {Y}.
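This counterexample is also easy to verify numerically; the following self-contained Python sketch (helper names mine) computes the quantities appearing in (4) for these variables:

```python
from math import log

def H(pmf):
    return sum(p * log(1 / p) for p in pmf.values() if p > 0)

def Hm(joint, coords):
    """Entropy of the marginal on the given coordinates of a joint pmf on tuples."""
    m = {}
    for k, p in joint.items():
        kk = tuple(k[i] for i in coords)
        m[kk] = m.get(kk, 0.0) + p
    return H(m)

def I(joint, a, b, c=()):
    """Conditional mutual information I(a:b|c) = H(a,c) + H(b,c) - H(a,b,c) - H(c)."""
    return Hm(joint, a + c) + Hm(joint, b + c) - Hm(joint, a + b + c) - Hm(joint, c)

# Coordinates (X, Y, Z, W): Z, W independent fair bits, X = ZW, Y = (1-Z)(1-W).
joint = {(z * w, (1 - z) * (1 - w), z, w): 0.25 for z in (0, 1) for w in (0, 1)}

lhs = I(joint, (0,), (1,))
rhs = I(joint, (0,), (1,), (2,)) + I(joint, (0,), (1,), (3,)) + I(joint, (2,), (3,))
print(lhs, rhs)   # lhs is about 0.085 > 0 while rhs = 0, so (4) fails here
```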

I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality

\displaystyle  I(X:Y) \leq 2 I(X:Y|Z) + I(X:Z|Y) + I(Y:Z|X) + I(X:Y|W) + I(Z:W).

One can however still use these simpler models of Shannon entropy to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that

\displaystyle  |\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n}| \leq \varepsilon \log x \ \ \ \ \ (5)

whenever {x} was sufficiently large depending on {\varepsilon>0}, where {\lambda} is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale {H} between {1} and {x}, one can form certain random variables {X_H, Y_H}. The random variable {X_H} is a sign pattern of the form {(\lambda(n+1),\dots,\lambda(n+H))} where {n} is a random number chosen from {1} to {x} (with logarithmic weighting). The random variable {Y_H} is the tuple {(n \hbox{ mod } p)_{p \sim \varepsilon^2 H}} of reductions of {n} to primes {p} comparable to {\varepsilon^2 H}. Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of {\lambda}, the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound

\displaystyle  I( X_H : Y_H ) \gg \varepsilon^7 \frac{H}{\log H}

on the mutual information between {X_H} and {Y_H}. From translation invariance, this also gives the more general lower bound

\displaystyle  I( X_{H_0,H} : Y_H ) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (6)

for any {H_0}, where {X_{H_0,H}} denotes the shifted sign pattern {(\lambda(n+H_0+1),\dots,\lambda(n+H_0+H))}. On the other hand, one had the entropy bounds

\displaystyle  H( X_{H_0,H} ), H(Y_H) \ll H

and from concatenating sign patterns one could see that {X_{H_0,H+H'}} is equivalent to the joint random variable {(X_{H_0,H}, X_{H_0+H,H'})} for any {H_0,H,H'}. Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once {H} was allowed to become sufficiently large compared to {\varepsilon}, but the bound was quite weak (coming ultimately from the unboundedness of {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}} as the interval {[H_-,H_+]} of values of {H} under consideration becomes large), something of the order of {H \sim \exp\exp\exp(\varepsilon^{-7})}; the quantity {H} needs at various junctures to be less than a small power of {\log x}, so the relationship between {x} and {\varepsilon} becomes essentially quadruple exponential in nature, {x \sim \exp\exp\exp\exp(\varepsilon^{-7})}. The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate {H(X_{kH})/kH} of the mean entropy, in that this quantity decreased by {\gg \frac{\varepsilon^7}{\log H}} as {k} increased from {1} to {\log H}, basically by dividing {X_{kH}} into {k} components {X_{jH, H}}, {j=0,\dots,k-1} and observing from (6) that each of these shares a bit of common information with the same variable {Y_H}. This is relatively clear when one works in a set model, in which {Y_H} is modeled by a set {B_H} of size {O(H)}, and {X_{H_0,H}} is modeled by a set of the form

\displaystyle  X_{H_0,H} = \bigcup_{H_0 < h \leq H_0+H} A_h

for various sets {A_h} of size {O(1)} (also there is some translation symmetry that maps {A_h} to a shift {A_{h+1}} while preserving all of the {B_H}).

However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables {Y_H} are basically jointly independent as {H} ranges over dyadic values that are much smaller than {\log x}, which in the set model corresponds to the {B_H} all being disjoint. One can then establish a variant

\displaystyle  I( X_{H_0,H} : Y_H | (Y_{H'})_{H' < H}) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (7)

of (6), which in the set model roughly speaking asserts that each {B_H} claims a portion of the {\bigcup_{H_0 < h \leq H_0+H} A_h} of cardinality {\gg \varepsilon^7 \frac{H}{\log H}} that is not claimed by previous choices of {B_H}. This leads to a more efficient contradiction (relying on the unboundedness of {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j}} rather than {\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}) that looks like it removes one order of exponential growth, thus the relationship between {x} and {\varepsilon} is now {x \sim \exp\exp\exp(\varepsilon^{-7})}. Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form

\displaystyle  \frac{1}{2H} H(X_{2H} | (Y_{H'})_{H' \leq 2H}) \leq \frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H}) - \frac{c \varepsilon^7}{\log H}

for a small constant {c>0}, which on iterating and using the boundedness of {\frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H})} gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges {H} greater than {\exp( (\log\log x)^{\varepsilon^{7}} )} or so, although at this range the theorem is not significantly simpler than the general case).
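As an aside, the logarithmically averaged correlation in (5) is easy to explore numerically (though of course no finite computation says anything rigorous about the asymptotic claim). Here is a self-contained Python sketch; the cutoff {x} and the trial-division factorisation are my own illustrative choices:

```python
from math import log

def liouville(n):
    """Liouville function lambda(n) = (-1)^Omega(n), computed by trial division."""
    count, d = 0, 2
    while d * d <= n:
        while n % d == 0:
            n //= d
            count += 1
        d += 1
    if n > 1:
        count += 1
    return -1 if count % 2 else 1

x = 10**5
lam = [0] + [liouville(n) for n in range(1, x + 2)]          # lam[n] = lambda(n)
S = sum(lam[n] * lam[n + 1] / n for n in range(1, x + 1))    # the sum in (5)
print(S, S / log(x))    # the correlation, and its size relative to the trivial bound log x
```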

Let {X} and {Y} be two random variables taking values in the same (discrete) range {R}, and let {E} be some subset of {R}, which we think of as the set of “bad” outcomes for either {X} or {Y}. If {X} and {Y} have the same probability distribution, then clearly

\displaystyle  {\bf P}( X \in E ) = {\bf P}( Y \in E ).

In particular, if it is rare for {Y} to lie in {E}, then it is also rare for {X} to lie in {E}.

If {X} and {Y} do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance {\delta(X,Y)} between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

\displaystyle  {\bf P}(Y \in E) - \delta(X,Y) \leq {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \delta(X,Y) \ \ \ \ \ (1)

for any {E \subset R}. In particular, if it is rare for {Y} to lie in {E}, and {X,Y} are close in total variation, then it is also rare for {X} to lie in {E}.

A basic inequality in information theory is Pinsker’s inequality

\displaystyle  \delta(X,Y) \leq \sqrt{\frac{1}{2} D_{KL}(X||Y)}

where the Kullback-Leibler divergence {D_{KL}(X||Y)} is defined by the formula

\displaystyle  D_{KL}(X||Y) = \sum_{x \in R} {\bf P}( X=x ) \log \frac{{\bf P}(X=x)}{{\bf P}(Y=x)}.

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that {D_{KL}(X||Y)} is non-negative (Gibbs’ inequality), and vanishes if and only if {X}, {Y} have the same distribution; thus one can think of {D_{KL}(X||Y)} as a measure of how close the distributions of {X} and {Y} are to each other, although one should caution that this is not a symmetric notion of distance, as {D_{KL}(X||Y) \neq D_{KL}(Y||X)} in general. Inserting Pinsker’s inequality into (1), we see for instance that

\displaystyle  {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \sqrt{\frac{1}{2} D_{KL}(X||Y)}.

Thus, if {X} is close to {Y} in the Kullback-Leibler sense, and it is rare for {Y} to lie in {E}, then it is rare for {X} to lie in {E} as well.
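As a quick illustration, here is a small Python sketch (with randomly generated distributions of my own choosing) that checks Pinsker’s inequality numerically:

```python
import random
from math import log, sqrt

def random_pmf(n, rng):
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]

def tv(p, q):   # total variation distance between two pmfs on the same range
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):   # Kullback-Leibler divergence D(p||q), natural logarithm
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

rng = random.Random(1)
p, q = random_pmf(10, rng), random_pmf(10, rng)
print(tv(p, q), sqrt(0.5 * kl(p, q)))   # the first is at most the second (Pinsker)
```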

We can specialise this inequality to the case when {Y} is a uniform random variable {U} on a finite range {R} of some cardinality {N}, in which case the Kullback-Leibler divergence {D_{KL}(X||U)} simplifies to

\displaystyle  D_{KL}(X||U) = \log N - {\bf H}(X)

where

\displaystyle  {\bf H}(X) := \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}

is the Shannon entropy of {X}. Again, a routine application of Jensen’s inequality shows that {{\bf H}(X) \leq \log N}, with equality if and only if {X} is uniformly distributed on {R}. The above inequality then becomes

\displaystyle  {\bf P}(X \in E) \leq {\bf P}(U \in E) + \sqrt{\frac{1}{2}(\log N - {\bf H}(X))}. \ \ \ \ \ (2)

Thus, if {E} is a small fraction of {R} (so that it is rare for {U} to lie in {E}), and the entropy of {X} is very close to the maximum possible value of {\log N}, then it is rare for {X} to lie in {E} also.

The inequality (2) is only useful when the entropy {{\bf H}(X)} is close to {\log N} in the sense that {{\bf H}(X) = \log N - O(1)}, otherwise the bound is worse than the trivial bound of {{\bf P}(X \in E) \leq 1}. In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy {{\bf H}(X)} was allowed to be smaller than {\log N - O(1)}. More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality) Let {X} be a random variable taking values in a finite range {R} of cardinality {N}, let {U} be a uniformly distributed random variable in {R}, and let {E} be a subset of {R}. Then

\displaystyle  {\bf P}(X \in E) \leq \frac{(\log N - {\bf H}(X)) + \log 2}{\log 1/{\bf P}(U \in E)}.

Proof: Consider the conditional entropy {{\bf H}(X | 1_{X \in E} )}. On the one hand, we have

\displaystyle  {\bf H}(X | 1_{X \in E} ) = {\bf H}(X, 1_{X \in E}) - {\bf H}(1_{X \in E} )

\displaystyle  = {\bf H}(X) - {\bf H}(1_{X \in E})

\displaystyle  \geq {\bf H}(X) - \log 2

by Jensen’s inequality. On the other hand, one has

\displaystyle  {\bf H}(X | 1_{X \in E} ) = {\bf P}(X \in E) {\bf H}(X | X \in E )

\displaystyle  + (1-{\bf P}(X \in E)) {\bf H}(X | X \not \in E)

\displaystyle  \leq {\bf P}(X \in E) \log |E| + (1-{\bf P}(X \in E)) \log N

\displaystyle  = \log N - {\bf P}(X \in E) \log \frac{N}{|E|}

\displaystyle  = \log N - {\bf P}(X \in E) \log \frac{1}{{\bf P}(U \in E)},

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. \Box
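Here is a small numerical sanity check of Lemma 1 in Python (the distribution, the set {E}, and the parameters are arbitrary choices of mine):

```python
import random
from math import log

def H(p):
    return sum(pi * log(1 / pi) for pi in p if pi > 0)

rng = random.Random(2)
N = 1000
E = set(range(50))                          # a "bad" set of 50 outcomes out of N = 1000
w = [rng.random() ** 3 for _ in range(N)]   # a moderately non-uniform distribution for X
total = sum(w)
p = [wi / total for wi in w]

P_X_in_E = sum(p[i] for i in E)
P_U_in_E = len(E) / N
bound = ((log(N) - H(p)) + log(2)) / log(1 / P_U_in_E)
print(P_X_in_E, bound)    # the first number is at most the second, as Lemma 1 asserts
```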

Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

\displaystyle  {\bf P}(X \in E) \leq \frac{D(X||Y) + \log 2}{\log 1/{\bf P}(Y \in E)}

for arbitrary random variables {X,Y} taking values in the same discrete range {R}, which follows from the data processing inequality

\displaystyle  D( f(X)||f(Y)) \leq D(X|| Y)

for arbitrary functions {f}, applied to the indicator function {f = 1_E}. Indeed one has

\displaystyle  D( 1_E(X) || 1_E(Y) ) = {\bf P}(X \in E) \log \frac{{\bf P}(X \in E)}{{\bf P}(Y \in E)}

\displaystyle + {\bf P}(X \not \in E) \log \frac{{\bf P}(X \not \in E)}{{\bf P}(Y \not \in E)}

\displaystyle  \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - h( {\bf P}(X \in E) )

\displaystyle  \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - \log 2

where {h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}} is the entropy function.

Thus, for instance, if one has

\displaystyle  {\bf H}(X) \geq \log N - o(K)

and

\displaystyle  {\bf P}(U \in E) \leq \exp( - K )

for some {K} much larger than {1} (so that {1/K = o(1)}), then

\displaystyle  {\bf P}(X \in E) = o(1).

More informally: if the entropy of {X} is somewhat close to the maximum possible value of {\log N}, and it is exponentially rare for a uniform variable to lie in {E}, then it is still somewhat rare for {X} to lie in {E}. The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable {X} which is uniformly distributed inside a small set {E} with some probability {p} and uniformly distributed outside of {E} with probability {1-p}, for some parameter {0 \leq p \leq 1}.
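For instance, with this near-extremal {X} one can compare {{\bf P}(X \in E)} against the bound of Lemma 1 directly; a short Python sketch (with illustrative parameters of my own choosing) gives numbers that are quite close:

```python
from math import log

N, sizeE, p = 10**6, 10, 0.3            # illustrative parameters (mine)
h = p * log(1 / p) + (1 - p) * log(1 / (1 - p))            # entropy of the indicator 1_{X in E}
H_X = h + p * log(sizeE) + (1 - p) * log(N - sizeE)        # entropy of the mixture X
bound = ((log(N) - H_X) + log(2)) / log(N / sizeE)         # Lemma 1's bound on P(X in E)
print(p, bound)     # 0.3 versus roughly 0.307: the lemma is close to sharp here
```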

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

\displaystyle  F(U) \approx {\bf E} F(U)

with exponentially high probability, where {U} is a uniform distribution and {F} is some reasonable function of {U}. Combining this with the above lemma, we can then obtain approximations of the form

\displaystyle  F(X) \approx {\bf E} F(U) \ \ \ \ \ (3)

with somewhat high probability, if the entropy of {X} is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable {X} did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

\displaystyle  \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+j+p) 1_{p|n+j} \approx \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+j+p)}{p}

for “most” choices of {n} and a suitable choice of {H} (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as {\sum_{n \leq x} \frac{\lambda(n)\lambda(n+1)}{n}} through the multiplicativity of {\lambda}, while the right-hand side, being a linear correlation involving two parameters {j,p} rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.

A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:

Theorem 1 (Plünnecke-Ruzsa inequality) Let {A, B_1, \ldots, B_m} be finite non-empty subsets of an additive group {G}, such that {|A+B_i| \leq K_i |A|} for all {1 \leq i \leq m} and some scalars {K_1,\ldots,K_m \geq 1}. Then there exists a subset {A'} of {A} such that {|A' + B_1 + \ldots + B_m| \leq K_1 \ldots K_m |A'|}.

The proof uses graph-theoretic techniques. Setting {A=B_1=\ldots=B_m}, we obtain a useful corollary: if {A} has small doubling in the sense that {|A+A| \leq K|A|}, then we have {|mA| \leq K^m |A|} for all {m \geq 1}, where {mA = A + \ldots + A} is the sum of {m} copies of {A}.

In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as {A} in {G} are replaced with discrete random variables {X} taking values in {G}, and (the logarithm of) cardinality {|A|} of a set {A} is replaced by Shannon entropy {{\Bbb H}(X)} of a random variable {X}. (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.

I recently discovered, however, that buried in a classic paper of Kaimanovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 1:

Theorem 2 (Entropy Plünnecke-Ruzsa inequality) Let {X, Y_1, \ldots, Y_m} be independent random variables of finite entropy taking values in an additive group {G}, such that {{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i} for all {1 \leq i \leq m} and some scalars {K_1,\ldots,K_m \geq 1}. Then {{\Bbb H}(X+Y_1+\ldots+Y_m) \leq {\Bbb H}(X) + \log K_1 \ldots K_m}.

In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set {A} to a subset {A'}, but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if {{\Bbb H}(X_1+X_2) \leq {\Bbb H}(X) + \log K}, then {{\Bbb H}(X_1+\ldots+X_m) \leq {\Bbb H}(X) + (m-1) \log K} for all {m \geq 1}, where {X_1,\ldots,X_m} are independent copies of {X}; this improves slightly over the analogous combinatorial inequality. Indeed, the function {m \mapsto {\Bbb H}(X_1+\ldots+X_m)} is concave (this can be seen by using the {m=2} version of Theorem 2 (or (2) below) to show that the quantity {{\Bbb H}(X_1+\ldots+X_{m+1})-{\Bbb H}(X_1+\ldots+X_m)} is decreasing in {m}).

Theorem 2 is actually a quick consequence of the submodularity inequality

\displaystyle  {\Bbb H}(W) + {\Bbb H}(X) \leq {\Bbb H}(Y) + {\Bbb H}(Z) \ \ \ \ \ (1)

in information theory, which is valid whenever {X,Y,Z,W} are discrete random variables such that {Y} and {Z} each determine {X} (i.e. {X} is a function of {Y}, and also a function of {Z}), and {Y} and {Z} jointly determine {W} (i.e. {W} is a function of {Y} and {Z}). To apply this, let {X, Y, Z} be independent discrete random variables taking values in {G}. Observe that the pairs {(X,Y+Z)} and {(X+Y,Z)} each determine {X+Y+Z}, and jointly determine {(X,Y,Z)}. Applying (1) we conclude that

\displaystyle  {\Bbb H}(X,Y,Z) + {\Bbb H}(X+Y+Z) \leq {\Bbb H}(X,Y+Z) + {\Bbb H}(X+Y,Z)

which after using the independence of {X,Y,Z} simplifies to the sumset submodularity inequality

\displaystyle  {\Bbb H}(X+Y+Z) + {\Bbb H}(Y) \leq {\Bbb H}(X+Y) + {\Bbb H}(Y+Z) \ \ \ \ \ (2)

(this inequality was also recently observed by Madiman; it is the {m=2} case of Theorem 2). As a corollary of this inequality, we see that if {{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i}, then

\displaystyle  {\Bbb H}(X+Y_1+\ldots+Y_i) \leq {\Bbb H}(X+Y_1+\ldots+Y_{i-1}) + \log K_i,

and Theorem 2 follows by telescoping.
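The inequality (2) is easy to test numerically for independent integer-valued random variables; here is a self-contained Python sketch (the random distributions and helper names are mine):

```python
import random
from math import log

def H(pmf):
    return sum(p * log(1 / p) for p in pmf.values() if p > 0)

def sum_law(*laws):
    """Law of the sum of independent variables with the given pmfs (dicts value -> prob)."""
    out = {0: 1.0}
    for law in laws:
        new = {}
        for a, pa in out.items():
            for b, pb in law.items():
                new[a + b] = new.get(a + b, 0.0) + pa * pb
        out = new
    return out

def random_law(support, rng):
    w = [rng.random() for _ in support]
    total = sum(w)
    return {v: wi / total for v, wi in zip(support, w)}

rng = random.Random(3)
X, Y, Z = (random_law(range(5), rng) for _ in range(3))
lhs = H(sum_law(X, Y, Z)) + H(Y)
rhs = H(sum_law(X, Y)) + H(sum_law(Y, Z))
print(lhs <= rhs + 1e-12)   # True: the sumset submodularity inequality (2)
```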

The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of {X,Y_1,\ldots,Y_m} than {X+Y_i}; see this paper and this paper respectively.


I am posting here four more of my Mahler lectures, each of which is based on earlier talks of mine:

As always, comments, corrections, and other feedback are welcome.

There are many situations in combinatorics in which one is running some sort of iteration algorithm to continually “improve” some object {A}; each loop of the algorithm replaces {A} with some better version {A'} of itself, until some desired property of {A} is attained and the algorithm halts. In order for such arguments to yield a useful conclusion, it is often necessary that the algorithm halts in a finite amount of time, or (even better), in a bounded amount of time. (In general, one cannot use infinitary iteration tools, such as transfinite induction or Zorn’s lemma, in combinatorial settings, because the iteration processes used to improve some target object {A} often degrade some other finitary quantity {B} in the process, and an infinite iteration would then have the undesirable effect of making {B} infinite.)

A basic strategy to ensure termination of an algorithm is to exploit a monotonicity property, or more precisely to show that some key quantity keeps increasing (or keeps decreasing) with each loop of the algorithm, while simultaneously staying bounded. (Or, as the economist Herbert Stein was fond of saying, “If something cannot go on forever, it must stop.”)

Here are four common flavours of this monotonicity strategy:

  • The mass increment argument. This is perhaps the most familiar way to ensure termination: make each improved object {A'} “heavier” than the previous one {A} by some non-trivial amount (e.g. by ensuring that the cardinality of {A'} is strictly greater than that of {A}, thus {|A'| \geq |A|+1}). Dually, one can try to force the amount of “mass” remaining “outside” of {A} in some sense to decrease at every stage of the iteration. If there is a good upper bound on the “mass” of {A} that stays essentially fixed throughout the iteration process, and a lower bound on the mass increment at each stage, then the argument terminates. Many “greedy algorithm” arguments are of this type. The proof of the Hahn decomposition theorem in measure theory also falls into this category. The general strategy here is to keep looking for useful pieces of mass outside of {A}, and add them to {A} to form {A'}, thus exploiting the additivity properties of mass. Eventually no further usable mass remains to be added (i.e. {A} is maximal in some {L^1} sense), and this should force some desirable property on {A}.
  • The density increment argument. This is a variant of the mass increment argument, in which one increments the “density” of {A} rather than the “mass”. For instance, {A} might be contained in some ambient space {P}, and one seeks to improve {A} to {A'} (and {P} to {P'}) in such a way that the density of the new object in the new ambient space is better than that of the previous object (e.g. {|A'|/|P'| \geq |A|/|P| + c} for some {c>0}). On the other hand, the density of {A} is clearly bounded above by {1}. As long as one has a sufficiently good lower bound on the density increment at each stage, one can conclude an upper bound on the number of iterations in the algorithm. The prototypical example of this is Roth’s proof of his theorem that every set of integers of positive upper density contains an arithmetic progression of length three. The general strategy here is to keep looking for useful density fluctuations inside {A}, and then “zoom in” to a region of increased density by reducing {A} and {P} appropriately. Eventually no further usable density fluctuation remains (i.e. {A} is uniformly distributed), and this should force some desirable property on {A}.
  • The energy increment argument. This is an “{L^2}” analogue of the “{L^1}“-based mass increment argument (or the “{L^\infty}“-based density increment argument), in which one seeks to increment the amount of “energy” that {A} captures from some reference object {X}, or (equivalently) to decrement the amount of energy of {X} which is still “orthogonal” to {A}. Here {A} and {X} are related somehow to a Hilbert space, and the energy involves the norm on that space. A classic example of this type of argument is the existence of orthogonal projections onto closed subspaces of a Hilbert space; this leads among other things to the construction of conditional expectation in measure theory, which then underlies a number of arguments in ergodic theory, as discussed for instance in this earlier blog post. Another basic example is the standard proof of the Szemerédi regularity lemma (where the “energy” is often referred to as the “index”). These examples are related; see this blog post for further discussion. The general strategy here is to keep looking for useful pieces of energy orthogonal to {A}, and add them to {A} to form {A'}, thus exploiting square-additivity properties of energy, such as Pythagoras’ theorem. Eventually, no further usable energy outside of {A} remains to be added (i.e. {A} is maximal in some {L^2} sense), and this should force some desirable property on {A}.
  • The rank reduction argument. Here, one seeks to make each new object {A'} to have a lower “rank”, “dimension”, or “order” than the previous one. A classic example here is the proof of the linear algebra fact that given any finite set of vectors, there exists a linearly independent subset which spans the same subspace; the proof of the more general Steinitz exchange lemma is in the same spirit. The general strategy here is to keep looking for “collisions” or “dependencies” within {A}, and use them to collapse {A} to an object {A'} of lower rank. Eventually, no further usable collisions within {A} remain, and this should force some desirable property on {A}.

Much of my own work in additive combinatorics relies heavily on at least one of these types of arguments (and, in some cases, on a nested combination of two or more of them). Many arguments in nonlinear partial differential equations also have a similar flavour, relying on various monotonicity formulae for solutions to such equations, though the objective in PDE is usually slightly different, in that one wants to keep control of a solution as one approaches a singularity (or as some time or space coordinate goes off to infinity), rather than to ensure termination of an algorithm. (On the other hand, many arguments in the theory of concentration compactness, which is used heavily in PDE, do have the same algorithm-terminating flavour as the combinatorial arguments; see this earlier blog post for more discussion.)

Recently, a new species of monotonicity argument was introduced by Moser, as the primary tool in his elegant new proof of the Lovász local lemma. This argument could be dubbed an entropy compression argument, and only applies to probabilistic algorithms which require a certain collection {R} of random “bits” or other random choices as part of the input, thus each loop of the algorithm takes an object {A} (which may also have been generated randomly) and some portion of the random string {R} to (deterministically) create a better object {A'} (and a shorter random string {R'}, formed by throwing away those bits of {R} that were used in the loop). The key point is to design the algorithm to be partially reversible, in the sense that given {A'} and {R'} and some additional data {H'} that logs the cumulative history of the algorithm up to this point, one can reconstruct {A} together with the remaining portion {R} not already contained in {R'}. Thus, each stage of the argument compresses the information-theoretic content of the string {A+R} into the string {A'+R'+H'} in a lossless fashion. However, a random variable such as {A+R} cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of {A+R}, and if the length of {A'+R'+H'} is significantly less than that of {A+R} (i.e. we need the marginal growth in the length of the history file {H'} per iteration to be less than the marginal amount of randomness used per iteration), then there is a limit as to how many times the algorithm can be run, much as there is a limit as to how many times a random data file can be compressed before no further length reduction occurs.
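To make the setting concrete, here is a toy Python sketch of the kind of randomized fixing procedure that this argument analyses, in the special case of boolean satisfiability (this is the simple “resample a violated clause” loop in the style of Moser and Tardos, rather than Moser’s original recursive procedure, and the instance and parameters are my own illustrative choices):

```python
import random

def moser_tardos_ksat(clauses, n_vars, rng):
    """Resample the variables of some violated clause until every clause is satisfied.
    A clause is a list of literals: +i means variable i must be True, -i means False."""
    assignment = [rng.random() < 0.5 for _ in range(n_vars + 1)]   # 1-indexed variables

    def violated(clause):
        return all(assignment[abs(l)] != (l > 0) for l in clause)

    steps = 0
    while True:
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assignment, steps
        for l in rng.choice(bad):               # resample the variables of one violated clause
            assignment[abs(l)] = rng.random() < 0.5
        steps += 1

# A small 3-SAT instance whose clauses use disjoint variables, so the local lemma
# condition is comfortably satisfied.
rng = random.Random(4)
n_vars, clauses = 30, []
for i in range(1, 28, 3):
    clauses.append([v * rng.choice((1, -1)) for v in (i, i + 1, i + 2)])
assignment, steps = moser_tardos_ksat(clauses, n_vars, rng)
print("satisfied after", steps, "resampling steps")
```

When each clause shares variables with only a few others (as the local lemma requires), the compression argument shows that such resampling loops must terminate quickly; the toy instance above is of course far easier than the general case.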

It is interesting to compare this method with the ones discussed earlier. In the previous methods, the failure of the algorithm to halt led to a new iteration of the object {A} which was “heavier”, “denser”, captured more “energy”, or “lower rank” than the previous instance of {A}. Here, the failure of the algorithm to halt leads to new information that can be used to “compress” {A} (or more precisely, the full state {A+R}) into a smaller amount of space. I don’t know yet of any application of this new type of termination strategy to the fields I work in, but one could imagine that it could eventually be of use (perhaps to show that solutions to PDE with sufficiently “random” initial data can avoid singularity formation?), so I thought I would discuss it here.

Below the fold I give a special case of Moser’s argument, based on a blog post of Lance Fortnow on this topic.


The most fundamental unsolved problem in complexity theory is undoubtedly the P=NP problem, which asks (roughly speaking) whether a problem which can be solved by a non-deterministic polynomial-time (NP) algorithm, can also be solved by a deterministic polynomial-time (P) algorithm. The general belief is that {P \neq NP}, i.e. there exist problems which can be solved by non-deterministic polynomial-time algorithms but not by deterministic polynomial-time algorithms.

One reason why the {P \neq NP} question is so difficult to resolve is that a certain generalisation of this question has an affirmative answer in some cases, and a negative answer in other cases. More precisely, if we give all the algorithms access to an oracle, then for one choice {A} of this oracle, all the problems that are solvable by non-deterministic polynomial-time algorithms that call {A} ({NP^A}) can also be solved by a deterministic polynomial-time algorithm that calls {A} ({P^A}), thus {P^A = NP^A}; but for another choice {B} of this oracle, there exist problems solvable by non-deterministic polynomial-time algorithms that call {B}, which cannot be solved by a deterministic polynomial-time algorithm that calls {B}, thus {P^B \neq NP^B}. One particular consequence of this result (which is due to Baker, Gill, and Solovay) is that there cannot be any relativisable proof of either {P=NP} or {P \neq NP}, where “relativisable” means that the proof would also work without any changes in the presence of an oracle.

The Baker-Gill-Solovay result was quite surprising, but the idea of the proof turns out to be rather simple. To get an oracle {A} such that {P^A=NP^A}, one basically sets {A} to be a powerful simulator that can simulate non-deterministic machines (and, furthermore, can also simulate itself); it turns out that any PSPACE-complete oracle would suffice for this task. To get an oracle {B} for which {P^B \neq NP^B}, one has to be a bit sneakier, setting {B} to be a query device for a sparse set of random (or high-complexity) strings, which are too complex to be guessed at by any deterministic polynomial-time algorithm.

Unfortunately, the simple idea of the proof can be obscured by various technical details (e.g. using Turing machines to define {P} and {NP} precisely), which require a certain amount of time to properly absorb. To help myself try to understand this result better, I have decided to give a sort of “allegory” of the proof, based around a (rather contrived) story about various students trying to pass a multiple choice test, which avoids all the technical details but still conveys the basic ideas of the argument. This allegory was primarily for my own benefit, but I thought it might also be of interest to some readers here (and also has some tangential relation to the proto-polymath project of deterministically finding primes), so I reproduce it below.


[This post should have appeared several months ago, but I didn’t have a link to the newsletter at the time, and I subsequently forgot about it until now.  -T.]

Last year, Emmanuel Candès and I were two of the recipients of the 2008 IEEE Information Theory Society Paper Award, for our paper “Near-optimal signal recovery from random projections: universal encoding strategies?” published in IEEE Inf. Thy.  (The other recipient is David Donoho, for the closely related paper “Compressed sensing” in the same journal.)  These papers helped initiate the modern subject of compressed sensing, which I have talked about earlier on this blog, although of course they also built upon a number of important precursor results in signal recovery, high-dimensional geometry, Fourier analysis, linear programming, and probability.  As part of our response to this award, Emmanuel and I wrote a short piece commenting on these developments, entitled “Reflections on compressed sensing“, which appears in the Dec 2008 issue of the IEEE Information Theory newsletter.  In it we place our results in the context of these precursor results, and also mention some of the many active directions (theoretical, numerical, and applied) that compressed sensing is now developing in.

Emmanuel Candès and I have just uploaded to the arXiv our paper “The power of convex relaxation: near-optimal matrix completion“, submitted to IEEE Inf. Theory.  In this paper we study the matrix completion problem, which one can view as a sort of “non-commutative” analogue of the sparse recovery problem studied in the field of compressed sensing, although there are also some other significant differences between the two problems.  The sparse recovery problem seeks to recover a sparse vector x \in {\Bbb R}^n from some linear measurements Ax = b \in {\Bbb R}^m, where A is a known m \times n matrix.  For general x, classical linear algebra tells us that if m < n, then the problem here is underdetermined and has multiple solutions; but under the additional assumption that x is sparse (most of the entries are zero), it turns out (under various hypotheses on the measurement matrix A, and in particular if A contains a sufficient amount of “randomness” or “incoherence”) that exact recovery becomes possible in the underdetermined case.  Furthermore, recovery is not only theoretically possible, but is also computationally practical in many cases; in particular, under some assumptions on A, one can recover x by minimising the convex norm \| x \|_{\ell^1} over all solutions to Ax=b.

Now we turn to the matrix completion problem.  Instead of an unknown vector x \in {\Bbb R}^n, we now have an unknown matrix M = (m_{ij})_{i \in [n_1], j \in [n_2]} \in {\Bbb R}^{n_1 \times n_2} (we use the shorthand [n] := \{1,\ldots,n\} here). We will take a specific type of underdetermined linear measurement of M, namely we pick a random subset \Omega \subset [n_1] \times [n_2] of the matrix array [n_1] \times [n_2] of some cardinality 1 \leq m \leq n_1 n_2, and form the random sample P_\Omega(M) := (m_{ij})_{(i,j) \in \Omega} \in {\Bbb R}^{\Omega} of M.

Of course, with no further information on M, it is impossible to complete the matrix M from the partial information P_\Omega(M) – we only have m pieces of information and need n_1 n_2.  But suppose we also know that M is low-rank, e.g. has rank less than r; this is an analogue of sparsity, but for matrices rather than vectors.  Then, in principle, we have reduced the number of degrees of freedom for M from n_1 n_2 to something more like O( r \max(n_1,n_2) ), and so (in analogy with compressed sensing) one may now hope to perform matrix completion with a much smaller fraction of samples, and in particular with m close to r \max(n_1,n_2).

This type of problem comes up in several real-world applications, most famously in the Netflix prize.  The Netflix prize problem is to be able to predict a very large ratings matrix M, whose rows are the customers, whose columns are the movies, and the entries are the rating that each customer would hypothetically assign to each movie.  Of course, not every customer has rented every movie from Netflix, and so only a small fraction P_\Omega(M) of this matrix is actually known.  However, if one makes the assumption that most customers’ rating preference is determined by only a small number of characteristics of the movie (e.g. genre, lead actor/actresses, director, year, etc.), then the matrix should be (approximately) low rank, and so the above type of analysis should be useful (though of course it is not going to be the only tool of use in this messy, real-world problem).

Actually, one expects to need to oversample the matrix by a logarithm or two in order to have a good chance of exact recovery, if one is sampling randomly.  This can be seen even in the rank one case r=1, in which M=uv^* is the product of a column vector and a row vector; let’s consider square matrices n_1=n_2=n for simplicity.  Observe that if the sampled coordinates \Omega completely miss one of the rows of the matrix, then the corresponding element of u has gone completely unmeasured, and one cannot hope to complete this row of the matrix.   Thus one needs to sample every row (and also every column) of the n \times n matrix.  The solution to the coupon collector’s problem then tells us that one needs about O(n \log n) samples to achieve this goal.  In fact, the theory of Erdős-Rényi random graphs tells us that the bipartite graph induced by \Omega becomes almost surely connected beyond this threshold, which turns out to be exactly what is needed to perform matrix completion for rank 1 matrices.

On the other hand, one cannot hope to complete the matrix if some of the singular vectors of the matrix are extremely sparse.  For instance, in the Netflix problem, a singularly idiosyncratic customer (or dually, a singularly unclassifiable movie) may give rise to a row or column of M that has no relation to the rest of the matrix, occupying its own separate component of the singular value decomposition of M; such a row or column is then impossible to complete exactly without sampling the entirety of that row or column.  Thus, to get exact matrix completion from a small fraction of entries, one needs some sort of incoherence assumption on the singular vectors, which spreads them out across all coordinates in a roughly even manner, as opposed to being concentrated on just a few values.

In a recent paper, Candès and Recht proposed solving the matrix completion problem by minimising the nuclear norm (or trace norm)

\|M\|_* = \sum_{i=1}^{\min(n_1,n_2)} \sigma_i(M) = \hbox{tr}( (M M^*)^{1/2} )

amongst all matrices consistent with the observed data P_\Omega(M).  This nuclear norm is the non-commutative counterpart to the \ell^1 norm for vectors, and so this algorithm is analogous to the \ell^1 minimisation (or basis pursuit) algorithm which is effective for compressed sensing (though not the only such algorithm for this task).  They showed, roughly speaking, that exact matrix completion (for, say, square matrices n_1=n_2=n for simplicity) is ensured with high probability so long as the singular vectors obey a certain incoherence property (basically, their \ell^\infty norm should be close to the minimal possible value, namely O(1/\sqrt{n})), so long as one had the condition

m \gg n^{1.2} r \log n.

This differs from the presumably optimal threshold of nr \log n by a factor of about n^{0.2}.

The main result of our paper is to mostly eliminate this gap, at the cost of a stronger hypothesis on the matrix being measured:

Main theorem. (Informal statement)  Suppose the n_1 \times n_2 matrix M has rank r and obeys a certain “strong incoherence property”.  Then with high probability, nuclear norm minimisation will recover M from a random sample P_\Omega(M) provided that m \gg n r \log^{O(1)} n, where n := \max(n_1,n_2).

A result of a broadly similar nature, but with a rather different recovery algorithm and with a somewhat different range of applicability, was recently established by Keshavan, Oh, and Montanari.  The strong incoherence property is somewhat technical, but is related to the Candès-Recht incoherence property and is satisfied by a number of reasonable random matrix models.  The exponent O(1) here is reasonably civilised (ranging between 2 and 9, depending on the specific model and parameters being used).
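To see the nuclear norm heuristic in action, here is a minimal numerical sketch in Python (this assumes the numpy and cvxpy packages; the dimension, rank, and sampling fraction are illustrative choices of mine rather than anything from the paper):

```python
import numpy as np
import cvxpy as cp   # assumes the cvxpy convex optimisation package is installed

rng = np.random.default_rng(0)
n, r = 40, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # a random rank-2 matrix

m = int(0.4 * n * n)                           # observe 40% of the entries, chosen at random
idx = rng.choice(n * n, size=m, replace=False)
omega = [(i // n, i % n) for i in idx]

X = cp.Variable((n, n))
constraints = [X[i, j] == M[i, j] for (i, j) in omega]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

print(np.linalg.norm(X.value - M) / np.linalg.norm(M))   # typically tiny: M is recovered up to solver tolerance
```

In practice one would of course use a more scalable solver for large matrices, but the convex program above is exactly the nuclear norm minimisation discussed in the post.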


This week I am in Seville, Spain, for a conference in harmonic analysis and related topics.  My talk is titled “the uniform uncertainty principle and compressed sensing“.  The content of this talk overlaps substantially with my Ostrowski lecture on the same topic; the slides I prepared for the Seville lecture can be found here.

[Update, Dec 6: Some people have asked about my other lecture given in Seville, on structure and randomness in the prime numbers.  This lecture is largely equivalent to the one posted here.]
