You are currently browsing the tag archive for the ‘compressed sensing’ tag.

[This post should have appeared several months ago, but I didn’t have a link to the newsletter at the time, and I subsequently forgot about it until now.  -T.]

Last year, Emmanuel Candés and I were two of the recipients of the 2008 IEEE Information Theory Society Paper Award, for our paper “Near-optimal signal recovery from random projections: universal encoding strategies?” published in IEEE Inf. Thy..  (The other recipient is David Donoho, for the closely related paper “Compressed sensing” in the same journal.)  These papers helped initiate the modern subject of compressed sensing, which I have talked about earlier on this blog, although of course they also built upon a number of important precursor results in signal recovery, high-dimensional geometry, Fourier analysis, linear programming, and probability.  As part of our response to this award, Emmanuel and I wrote a short piece commenting on these developments, entitled “Reflections on compressed sensing“, which appears in the Dec 2008 issue of the IEEE Information Theory newsletter.  In it we place our results in the context of these precursor results, and also mention some of the many active directions (theoretical, numerical, and applied) that compressed sensing is now developing in.

Emmanuel Candés and I have just uploaded to the arXiv our paper “The power of convex relaxation: near-optimal matrix completion“, submitted to IEEE Inf. Theory.  In this paper we study the matrix completion problem, which one can view as a sort of “non-commutative” analogue of the sparse recovery problem studied in the field of compressed sensing, although there are also some other significant differences between the two problems.   The sparse recovery problem seeks to recover a sparse vector $x \in {\Bbb R}^n$ from some linear measurements $Ax = b \in {\Bbb R}^m$, where A is a known $m \times n$ matrix.  For general x, classical linear algebra tells us that if m < n, then the problem here is underdetermined and has multiple solutions; but under the additional assumption that x is sparse (most of the entries are zero), it turns out (under various hypotheses on the measurement matrix A, and in particular if A contains a sufficient amount of “randomness” or “incoherence”) that exact recovery becomes possible in the underdetermined case.  Furthermore, recovery is not only theoretically possible, but is also computationally practical in many cases; in particular, under some assumptions on A, one can recover x by minimising the convex norm $\| x \|_{\ell^1}$ over all solutions to Ax=b.

Now we turn to the matrix completion problem.  Instead of an unknown vector $x \in {\Bbb R}^n$, we now have an unknown matrix $M = (m_{ij})_{i \in [n_1], j \in [n_2]} \in {\Bbb R}^{n_1 \times n_2}$ (we use the shorthand $[n] := \{1,\ldots,n\}$ here). We will take a specific type of underdetermined linear measurement of M, namely we pick a random subset $\Omega \subset [n_1] \times [n_2]$ of the matrix array $[n_1] \times [n_2]$ of some cardinality $1 \leq m \leq n_1 n_2$, and form the random sample $P_\Omega(M) := (m_{ij})_{(i,j) \in \Omega} \in {\Bbb R}^{\Omega}$ of M.

Of course, with no further information on M, it is impossible to complete the matrix M from the partial information $P_\Omega(M)$ – we only have $m$ pieces of information and need $n_1 n_2$.  But suppose we also know that M is low-rank, e.g. has rank less than r; this is an analogue of sparsity, but for matrices rather than vectors.  Then, in principle, we have reduced the number of degrees of freedom for M from $n_1 n_2$ to something more like $O( r \max(n_1,n_2) )$, and so (in analogy with compressed sensing) one may now hope to perform matrix completion with a much smaller fraction of samples, and in particular with m close to $r \max(n_1,n_2)$.

This type of problem comes up in several real-world applications, most famously in the Netflix prize.  The Netflix prize problem is to be able to predict a very large ratings matrix M, whose rows are the customers, whose columns are the movies, and the entries are the rating that each customer would hypothetically assign to each movie.  Of course, not every customer has rented every movie from Netflix, and so only a small fraction $P_\Omega(M)$ of this matrix is actually known.  However, if one makes the assumption that most customers’ rating preference is determined by only a small number of characteristics of the movie (e.g. genre, lead actor/actresses, director, year, etc.), then the matrix should be (approximately) low rank, and so the above type of analysis should be useful (though of course it is not going to be the only tool of use in this messy, real-world problem).

Actually, one expects to need to oversample the matrix by a logarithm or two in order to have a good chance of exact recovery, if one is sampling randomly.  This can be seen even in the rank one case r=1, in which $M=uv^*$ is the product of a column vector and a row vector; let’s consider square matrices $n_1=n_2=n$ for simplicity.  Observe that if the sampled coordinates $\Omega$ completely miss one of the rows of the matrix, then the corresponding element of u has gone completely unmeasured, and one cannot hope to complete this row of the matrix.   Thus one needs to sample every row (and also every column) of the $n \times n$ matrix.  The solution to the coupon collector’s problem then tells us that one needs about $O(n \log n)$ samples to achieve this goal.  In fact, the theory of Erdős-Rényi random graphs tells us that the bipartite graph induced by $\Omega$ becomes almost surely connected beyond this threshold, which turns out to be exactly what is needed to perform matrix completion for rank 1 matrices.

On the other hand, one cannot hope to complete the matrix if some of the singular vectors of the matrix are extremely sparse.  For instance, in the Netflix problem, a singularly idiosyncratic customer (or dually, a singularly unclassifiable movie) may give rise to a row or column of M that has no relation to the rest of the matrix, occupying its own separate component of the singular value decomposition of M; such a row or column is then impossible to complete exactly without sampling the entirety of that row or column.  Thus, to get exact matrix completion from a small fraction of entries, one needs some sort of incoherence assumption on the singular vectors, which spreads them out across all coordinates in a roughly even manner, as opposed to being concentrated on just a few values.

In a recent paper, Candés and Recht proposed solving the matrix completion problem by minimising the nuclear norm (or trace norm)

$\|M\|_* = \sum_{i=1}^{\min(n_1,n_2)} \sigma_i(M) = \hbox{tr}( M M^*)^{1/2}$

amongst all matrices consistent with the observed data $P_\Omega(M)$.  This nuclear norm is the non-commutative counterpart to the $\ell^1$ norm for vectors, and so this algorithm is analogous to the $\ell^1$ minimisation (or basis pursuit) algorithm which is effective for compressed sensing (though not the only such algorithm for this task).  They showed, roughly speaking, that exact matrix completion (for, say, square matrices $n_1=n_2=n$ for simplicity) is ensured with high probability so long as the singular vectors obey a certain incoherence property (basically, their $\ell^\infty$ norm should be close to the minimal possible value, namely $O(1/\sqrt{n})$), so long as one had the condition

$m \gg n^{1.2} r \log n$.

This differs from the presumably optimal threshold of $nr \log n$ by a factor of about $n^{0.2}$.

The main result of our paper is to mostly eliminate this gap, at the cost of a stronger hypothesis on the matrix being measured:

Main theorem. (Informal statement)  Suppose the $n_1 \times n_2$ matrix M has rank r and obeys a certain “strong incoherence property”.  Then with high probability, nuclear norm minimisation will recover M from a random sample $P_\Omega(M)$ provided that $m \gg n r \log^{O(1)} n$, where $n := \max(n_1,n_2)$.

A result of a broadly similar nature, but with a rather different recovery algorithm and with a somewhat different range of applicability, was recently established by Keshavan, Oh, and Montanari.  The strong incoherence property is somewhat technical, but is related to the Candés-Recht incoherence property and is satisfied by a number of reasonable random matrix models.  The exponent O(1) here is reasonably civilised (ranging between 2 and 9, depending on the specific model and parameters being used).

This week I am in Seville, Spain, for a conference in harmonic analysis and related topics.  My talk is titled “the uniform uncertainty principle and compressed sensing“.  The content of this talk overlaps substantially with my Ostrowski lecture on the same topic; the slides I prepared for the Seville lecture can be found here.

[Update, Dec 6: Some people have asked about my other lecture given in Seville, on structure and randomness in the prime numbers.  This lecture is largely equivalent to the one posted here.]

Over two years ago, Emmanuel Candés and I submitted the paper “The Dantzig selector: Statistical estimation when $p$ is much
larger than $n$
” to the Annals of Statistics. This paper, which appeared last year, proposed a new type of selector (which we called the Dantzig selector, due to its reliance on the linear programming methods to which George Dantzig, who had died as we were finishing our paper, had contributed so much to) for statistical estimation, in the case when the number $p$ of unknown parameters is much larger than the number $n$ of observations. More precisely, we considered the problem of obtaining a reasonable estimate $\beta^*$ for an unknown vector $\beta \in {\Bbb R}^p$ of parameters given a vector $y = X \beta + z \in {\Bbb R}^n$ of measurements, where $X$ is a known $n \times p$ predictor matrix and $z$ is a (Gaussian) noise error with some variance $\sigma^2$. We assumed that the predictor matrix X obeyed the restricted isometry property (RIP, also known as UUP), which roughly speaking asserts that $X\beta$ has norm comparable to $\beta$ whenever the vector $\beta$ is sparse. This RIP property is known to hold for various ensembles of random matrices of interest; see my earlier blog post on this topic.

Our selection algorithm, inspired by our previous work on compressed sensing, chooses the estimated parameters $\beta^*$ to have minimal $l^1$ norm amongst all vectors which are consistent with the data in the sense that the residual vector $r := y - X \beta^*$ obeys the condition

$\| X^* r \|_\infty \leq \lambda$, where $\lambda := C \sqrt{\log p} \sigma$ (1)

(one can check that such a condition is obeyed with high probability in the case that $\beta^* = \beta$, thus the true vector of parameters is feasible for this selection algorithm). This selector is similar, though not identical, to the more well-studied lasso selector in the literature, which minimises the $l^1$ norm of $\beta^*$ penalised by the $l^2$ norm of the residual.

A simple model case arises when n=p and X is the identity matrix, thus the observations are given by a simple additive noise model $y_i = \beta_i + z_i$. In this case, the Dantzig selector $\beta^*$ is given by the hard soft thresholding formula

$\beta^*_i = \max(|y_i| - \lambda, 0 ) \hbox{sgn}(y_i).$

The mean square error ${\Bbb E}( \| \beta - \beta^* \|^2 )$ for this selector can be computed to be roughly

$\lambda^2 + \sum_{i=1}^n \min( |y_i|^2, \lambda^2)$ (2)

and one can show that this is basically best possible (except for constants and logarithmic factors) amongst all selectors in this model. More generally, the main result of our paper was that under the assumption that the predictor matrix obeys the RIP, the mean square error of the Dantzig selector is essentially equal to (2) and thus close to best possible.

After accepting our paper, the Annals of Statistics took the (somewhat uncommon) step of soliciting responses to the paper from various experts in the field, and then soliciting a rejoinder to these responses from Emmanuel and I. Recently, the Annals posted these responses and rejoinder on the arXiv:

This week I am in Australia, attending the ANZIAM annual meeting in Katoomba, New South Wales (in the picturesque Blue Mountains). I gave an overview talk on some recent developments in compressed sensing, particularly with regards to the basis pursuit approach to recovering sparse (or compressible) signals from incomplete measurements. The slides for my talk can be found here, with some accompanying pictures here. (There are of course by now many other presentations of compressed sensing on-line; see for instance this page at Rice.)

There was an interesting discussion after the talk. Some members of the audience asked the very good question as to whether any a priori information about a signal (e.g. some restriction about the support) could be incorporated to improve the performance of compressed sensing; a related question was whether one could perform an adaptive sequence of measurements to similarly improve performance. I don’t have good answers to these questions. Another pointed out that the restricted isometry property was like a local “well-conditioning” property for the matrix, which only applied when one viewed a few columns at a time.

This problem in compressed sensing is an example of a derandomisation problem: take an object which, currently, can only be constructed efficiently by a probabilistic method, and figure out a deterministic construction of comparable strength and practicality. (For a general comparison of probabilistic and deterministic algorithms, I can point you to these slides by Avi Wigderson).

I will define exactly what UUP matrices (the UUP stands for “uniform uncertainty principle“) are later in this post. For now, let us just say that they are a generalisation of (rectangular) orthogonal matrices, in which the columns are locally almost orthogonal rather than globally perfectly orthogonal. Because of this, it turns out that one can pack significantly more columns into a UUP matrix than an orthogonal matrix, while still capturing many of the desirable features of orthogonal matrices, such as stable and computable invertibility (as long as one restricts attention to sparse or compressible vectors). Thus UUP matrices can “squash” sparse vectors from high-dimensional space into a low-dimensional while still being able to reconstruct those vectors; this property underlies many of the recent results on compressed sensing today.

There are several constructions of UUP matrices known today (e.g. random normalised Gaussian matrices, random normalised Bernoulli matrices, or random normalised minors of a discrete Fourier transform matrix) but (if one wants the sparsity parameter to be large) they are all probabilistic in nature; in particular, these constructions are not 100% guaranteed to actually produce a UUP matrix, although in many cases the failure rate can be proven to be exponentially small in the size of the matrix. Furthermore, there is no fast (e.g. sub-exponential time) algorithm known to test whether any given matrix is UUP or not. The failure rate is small enough that this is not a problem for most applications (especially since many compressed sensing applications are for environments which are already expected to be noisy in many other ways), but is slightly dissatisfying from a theoretical point of view. One is thus interested in finding a deterministic construction which can locate UUP matrices in a reasonably rapid manner. (One could of course simply search through all matrices in a given class and test each one for the UUP property, but this is an exponential-time algorithm, and thus totally impractical for applications.) In analogy with error-correcting codes, it may be that algebraic or number-theoretic constructions may hold the most promise for such deterministic UUP matrices (possibly assuming some unproven conjectures on exponential sums); this has already been accomplished by de Vore for UUP matrices with small sparsity parameter.

For much of last week I was in Leiden, Holland, giving one of the Ostrowski prize lectures at the annual meeting of the Netherlands mathematical congress. My talk was not on the subject of the prize (arithmetic progressions in primes), as this was covered by a talk of Ben Green there, but rather on a certain “uniform uncertainty principle” in Fourier analysis, and its relation to compressed sensing; this is work which is joint with Emmanuel Candes and also partly with Justin Romberg.

I’ve had a number of people ask me (especially in light of some recent publicity) exactly what “compressed sensing” means, and how a “single pixel camera” could possibly work (and how it might be advantageous over traditional cameras in certain circumstances). There is a large literature on the subject, but as the field is relatively recent, there does not yet appear to be a good non-technical introduction to the subject. So here’s my stab at the topic, which should hopefully be accessible to a non-mathematical audience.

For sake of concreteness I’ll primarily discuss the camera application, although compressed sensing is a more general measurement paradigm which is applicable to other contexts than imaging (e.g. astronomy, MRI, statistical selection, etc.), as I’ll briefly remark upon at the end of this post.