This problem in compressed sensing is an example of a derandomisation problem: take an object which, currently, can only be constructed efficiently by a probabilistic method, and figure out a deterministic construction of comparable strength and practicality. (For a general comparison of probabilistic and deterministic algorithms, I can point you to these slides by Avi Wigderson).
I will define exactly what UUP matrices (the UUP stands for “uniform uncertainty principle“) are later in this post. For now, let us just say that they are a generalisation of (rectangular) orthogonal matrices, in which the columns are locally almost orthogonal rather than globally perfectly orthogonal. Because of this, it turns out that one can pack significantly more columns into a UUP matrix than an orthogonal matrix, while still capturing many of the desirable features of orthogonal matrices, such as stable and computable invertibility (as long as one restricts attention to sparse or compressible vectors). Thus UUP matrices can “squash” sparse vectors from a high-dimensional space into a low-dimensional space while still being able to reconstruct those vectors; this property underlies many of the recent results on compressed sensing.
There are several constructions of UUP matrices known today (e.g. random normalised Gaussian matrices, random normalised Bernoulli matrices, or random normalised minors of a discrete Fourier transform matrix) but (if one wants the sparsity parameter to be large) they are all probabilistic in nature; in particular, these constructions are not 100% guaranteed to actually produce a UUP matrix, although in many cases the failure rate can be proven to be exponentially small in the size of the matrix. Furthermore, there is no fast (e.g. sub-exponential time) algorithm known to test whether any given matrix is UUP or not. The failure rate is small enough that this is not a problem for most applications (especially since many compressed sensing applications are for environments which are already expected to be noisy in many other ways), but is slightly dissatisfying from a theoretical point of view. One is thus interested in finding a deterministic construction which can locate UUP matrices in a reasonably rapid manner. (One could of course simply search through all matrices in a given class and test each one for the UUP property, but this is an exponential-time algorithm, and thus totally impractical for applications.) In analogy with error-correcting codes, it may be that algebraic or number-theoretic constructions may hold the most promise for such deterministic UUP matrices (possibly assuming some unproven conjectures on exponential sums); this has already been accomplished by de Vore for UUP matrices with small sparsity parameter.
Before we define UUP matrices explicitly, let us first recall what a (rectangular) orthogonal matrix is. We will view an $m \times n$ matrix (m rows and n columns) as a collection $v_1, \ldots, v_n$ of column vectors in the (complex) vector space $\mathbb{C}^m$, or equivalently as a means of linearly transforming an n-dimensional vector $(a_1,\ldots,a_n)$ into an m-dimensional vector $\sum_{j=1}^n a_j v_j$. I will call such a matrix orthogonal if these column vectors are orthonormal, i.e. they all have unit length $\|v_j\| = 1$ and are orthogonal to each other: $\langle v_j, v_k \rangle = 0$ whenever $j \neq k$.
Orthonormal vectors have several pleasant properties. One of them is Pythagoras’ theorem

$\| \sum_{j=1}^n a_j v_j \|^2 = \sum_{j=1}^n |a_j|^2$ (1)

valid for all complex numbers $a_1,\ldots,a_n$. In other words, the linear encoding $(a_1,\ldots,a_n) \mapsto w := \sum_{j=1}^n a_j v_j$ is an isometry. This implies that such an encoding can be inverted in a stable manner: given the encoded vector $w = \sum_{j=1}^n a_j v_j$ one can uniquely recover the original coefficients $a_1,\ldots,a_n$, and furthermore that small changes in w will not cause large fluctuations in $a_1,\ldots,a_n$. Indeed, one can reconstruct the coefficients $a_j$ quickly and explicitly by the formula

$a_j = \langle w, v_j \rangle$. (2)
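For readers who like to experiment, here is a minimal numerical illustration of (1) and (2) in Python/numpy; it is just a sketch, taking the first few columns of a random unitary matrix so that the columns are orthonormal by construction.

```python
import numpy as np

# Take the first n columns of a random m-by-m unitary matrix, so that the
# columns v_1, ..., v_n are orthonormal in C^m by construction.
rng = np.random.default_rng(1)
m, n = 8, 5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m)))
V = Q[:, :n]

a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
w = V @ a                                    # w = sum_j a_j v_j

print(np.isclose(np.linalg.norm(w) ** 2, np.sum(np.abs(a) ** 2)))  # Pythagoras (1)
print(np.allclose(V.conj().T @ w, a))        # reconstruction (2): a_j = <w, v_j>
```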
One would like to make n as large as possible, and m as small as possible, so that one can transform vectors of as high a dimension as possible while using only as low-dimensional a space as possible to store the transformed vectors. There is however a basic obstruction to this, which is that an orthogonal matrix can only exist when $n \leq m$; for if n is larger than m, then there are too many vectors $v_1,\ldots,v_n$ to remain linearly independent in $\mathbb{C}^m$, and one must have a non-trivial linear dependence $\sum_{j=1}^n a_j v_j = 0$ for some coefficients $a_1,\ldots,a_n$, not all zero, which is inconsistent with (1).
One can try to circumvent this restriction by weakening the condition (1) to (say)

$0.9 \sum_{j=1}^n |a_j|^2 \leq \| \sum_{j=1}^n a_j v_j \|^2 \leq 1.1 \sum_{j=1}^n |a_j|^2$ (3)

for all complex numbers $a_1,\ldots,a_n$. (The constants 0.9 and 1.1 are not terribly important for this discussion.) Thus we only require that Pythagoras’ theorem hold approximately rather than exactly; this is equivalent to requiring that the transpose of this matrix forms a frame. (In harmonic analysis, one would say that the vectors $v_1,\ldots,v_n$ are almost orthogonal rather than perfectly orthogonal.) This enlarges the class of matrices that one can consider, but unfortunately does not remove the condition $n \leq m$, since the linear dependence argument which showed that $n > m$ was incompatible with (1) also shows that $n > m$ is incompatible with (3).
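In matrix terms, condition (3) for all coefficient vectors is simply the statement that every squared singular value of the matrix lies between 0.9 and 1.1. Here is a short numpy check (just an illustrative sketch):

```python
import numpy as np

def satisfies_condition_3(A, lower=0.9, upper=1.1):
    """Check condition (3) for *all* coefficient vectors: this holds exactly
    when every squared singular value of A lies in [lower, upper]."""
    sq = np.linalg.svd(A, compute_uv=False) ** 2
    return bool(sq.min() >= lower and sq.max() <= upper)

# Note: if n > m then the smallest singular value is 0 and the check always
# fails -- this is the n <= m obstruction described above.
```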
It turns out, though, that one can pack more than m vectors into $\mathbb{C}^m$ if one localises the almost orthogonality condition (3) so that it only holds for sparse sets of coefficients $a_1,\ldots,a_n$. Specifically, we fix a parameter S (less than m), and say that the matrix obeys the UUP with sparsity S if one has the almost orthogonality condition (3) for any set of coefficients $a_1,\ldots,a_n$ such that at most S of the $a_j$ are non-zero. [The UUP is also known as the Restricted Isometry Property (RIP) in the literature.] In other words, we only assume that any S of the n vectors $v_1,\ldots,v_n$ are almost orthogonal at one time. (It is important here that we require almost orthogonality rather than perfect orthogonality, since as soon as a set of vectors are pairwise perfectly orthogonal, they are of course jointly perfectly orthogonal. In contrast, the constants 0.9 and 1.1 in the UUP condition will deteriorate as S increases, so that local almost orthogonality does not imply global almost orthogonality.) The UUP property is more powerful (and hence more useful) when S is large; in particular one would like to approach the “information-theoretic limit” when S is comparable in magnitude to m.
Roughly speaking, a set of vectors which obey the UUP are “just as good” as an orthonormal set of vectors, so long as one doesn’t look at more than S of these vectors at a time. For instance, one can easily show that the map $(a_1,\ldots,a_n) \mapsto \sum_j a_j v_j$ is still injective as long as one restricts attention to input vectors which are S/2-sparse or better (i.e. at most S/2 of the coefficients are allowed to be non-zero). This still leaves the question of how to efficiently recover the sparse coefficients $a_j$ from the transformed vector $w = \sum_j a_j v_j$. The algorithm (2) is no longer accurate; however, if the coefficients are just a little bit sparser than S/2 (e.g. S/3 will do), then one can instead use the algorithm of basis pursuit to recover the coefficients $a_j$ perfectly. Namely, it turns out that among all the possible representations $w = \sum_j c_j v_j$ of w, the one which minimises the $\ell^1$ norm $\sum_j |c_j|$ will be the one which matches the S/3-sparse representation $\sum_j a_j v_j$ exactly. (This has an interesting geometric interpretation: if we normalise all the vectors $v_j$ to have unit length, then this result says that the simplest (sparsest) way to get from 0 to w by moving in the directions $v_1,\ldots,v_n$ is also the shortest way to get there.) There are also some related results regarding coefficients which are merely compressible instead of sparse, but these are a bit more technical; see my paper with Emmanuel Candes for details.
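In the real-valued case, basis pursuit is just a linear program, so it is easy to try out; here is a minimal sketch using scipy (the helper name basis_pursuit and the particular parameters are only for this illustration, and this is not the solver used in the papers cited above):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, w):
    """Solve min ||c||_1 subject to A c = w via the standard LP reformulation
    c = u - v with u, v >= 0 (real-valued case only)."""
    m, n = A.shape
    cost = np.ones(2 * n)                        # sum(u) + sum(v) = ||c||_1
    A_eq = np.hstack([A, -A])                    # A (u - v) = w
    res = linprog(cost, A_eq=A_eq, b_eq=w, bounds=(0, None))
    return res.x[:n] - res.x[n:]

# tiny demo: recover a sparse vector from a few random Gaussian measurements
rng = np.random.default_rng(0)
m, n, s = 40, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)     # random normalised Gaussian columns
a = np.zeros(n)
a[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
w = A @ a
print(np.allclose(a, basis_pursuit(A, w), atol=1e-6))   # succeeds with high probability
```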
It turns out that UUP matrices can have many more columns than rows; indeed, as shown by Donoho, and by Candes and myself, n can be exponentially larger than m (roughly speaking, one can take n as large as $S e^{cm/S}$ for some absolute constant c > 0). (Subsequent proofs also appeared by Candes, Rudelson, myself, and Vershynin and by Baraniuk, Davenport, de Vore, and Wakin.) The construction is in fact very easy; one simply selects the vectors $v_1,\ldots,v_n$ randomly, either as random unit vectors or as random normalised Gaussian vectors (so all coefficients of each $v_j$ are iid Gaussians with mean zero and variance 1/m). The point is that in a high-dimensional space such as $\mathbb{C}^m$, any two randomly selected vectors are very likely to be almost orthogonal to each other; for instance, it is an easy computation that the dot product between two random normalised Gaussian vectors has a variance of only O(1/m), even though the vectors themselves have a magnitude very close to 1. Note though that control of these dot products is really only enough to obtain the UUP for relatively small S, e.g. $S = O(\sqrt{m})$. For large S, one needs slightly more advanced tools, such as large deviation bounds on the singular values of rectangular Gaussian matrices (which are closely related to the Johnson-Lindenstrauss lemma).
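The near-orthogonality of random Gaussian vectors is easy to see numerically; a quick simulation (the constants are only heuristic):

```python
import numpy as np

# Normalised Gaussian vectors in R^m have length close to 1, while pairwise
# dot products have mean ~0 and variance ~1/m.
rng = np.random.default_rng(2)
m, trials = 500, 2000
X = rng.standard_normal((trials, m)) / np.sqrt(m)   # iid N(0, 1/m) entries
Y = rng.standard_normal((trials, m)) / np.sqrt(m)

print(np.mean(np.linalg.norm(X, axis=1)))           # close to 1
print(np.var(np.sum(X * Y, axis=1)), 1.0 / m)       # dot-product variance ~ 1/m
```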
The results for small sparsity S are relatively easy to duplicate by deterministic means. In particular, the paper of de Vore mentioned earlier uses a polynomial construction to obtain UUP matrices with S close to $\sqrt{m}$, and n equal to an arbitrarily large power of m, essentially by ensuring that all the column vectors have a low inner product with each other (of magnitude roughly $1/\sqrt{m}$ or so, matching what the random construction gives, and almost certainly best possible). But to get to larger values of S (and in particular, to situations in which S is comparable to m) may require a more refined calculation (possibly involving higher moments of the Gramian matrix, as was done in a paper of Candes, Romberg, and myself in the random case). Alternatively, one may rely on conjecture rather than rigorous results; for instance, it could well be that the matrices of de Vore satisfy the UUP for far larger sparsities S than are rigorously proven in that paper.
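For what it’s worth, this kind of coherence-based certification of the UUP for small S is easy to automate: by Gershgorin’s circle theorem, if the (unit-length) columns have pairwise inner products of size at most $\mu$, then every S-column Gram matrix has eigenvalues within $(S-1)\mu$ of 1, so condition (3) holds whenever $(S-1)\mu \leq 0.1$. A sketch (the function name and the slack parameter are just for this illustration):

```python
import numpy as np

def coherence_certified_sparsity(A, slack=0.1):
    """Normalise the columns, compute the coherence mu = max_{j != k} |<v_j, v_k>|,
    and return the largest S with (S - 1) * mu <= slack; by Gershgorin this
    certifies condition (3) with constants (1 - slack, 1 + slack).  Note this
    route cannot certify sparsities beyond about 1/mu ~ sqrt(m)."""
    V = A / np.linalg.norm(A, axis=0)
    G = np.abs(V.conj().T @ V)
    np.fill_diagonal(G, 0.0)
    mu = G.max()
    return A.shape[1] if mu == 0 else int(slack / mu) + 1
```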
An alternate approach, and one of interest in its own right, is to work on improving the time it takes to verify that a given matrix (possibly one of a special form) obeys the UUP. The brute-force approach of checking the singular values of every set of S column vectors requires a run time comparable to $\binom{n}{S}$ or worse, which is quite poor. (A variant approach has recently been proposed by Sharon, Wright, and Ma but has similar run time costs.) But perhaps there exist specially structured matrices for which the UUP is easier to verify, and for which it is still likely that the UUP holds. This would give a probabilistic algorithm for producing rigorously certified UUP matrices with a reasonable average-case run time.
[Update, July 9: Typos corrected.]
29 comments
5 July, 2007 at 7:13 am
Igor Carron
Dear Terry,
You write :
“..Alternatively, one may rely on conjecture rather than rigorous results; for instance, it could well be that the matrices of de Vore satisfy the UUP for far larger sparsities S than are rigorously proven in that paper…”
Has there been any hint from numerical work to support that conjecture :i.e. DeVore’s construction leading to higher than sqrt(m) bound ?
Also, DeVore states “The Restricted Isometry Property is just a sufficient condition
to guarantee that a matrix has good performance on classes. Two matrices can have exactly the same performance on classes and yet one will satisfy RIP and the other not. So there may be a more direct avenue to constructing good CS matrices by not going through RIP. ”
I wonder if you could comment on that ? I mean, are you unnecessarily restricting yourself when using RIP as a way to find good CS matrices ?
Also, a totally unrelated question, do you think that the normality (in the sense of commutation with its conjugate) or non-normality of matrix \phi has an influence on its ability to fulfill RIP ?
Igor.
5 July, 2007 at 1:13 pm
CS Boy
I think that another issue with the use of (random) CS matrices is that everybody uses pseudo-randomly generated matrices. The point is that one needs the sensing matrix at decoding time, and usually one just keeps track of the seed of the random generator. Maybe the difficulty in derandomizing CS is to have a bound on the actual length of this random seed needed for the recovery to be effective.
6 July, 2007 at 12:28 pm
Terence Tao
Dear Igor,
As I mentioned in the main post, it’s hard to verify the RIP numerically for large S because of the exponentially large number of different combinations of S column vectors one has to consider. I would not be surprised though if numerics showed that the RIP held for _most_ S-tuples of column vectors for de Vore’s matrices (or indeed for any other reasonable construction). Unfortunately, this “99%-RIP” type property seems to not be of use in applications; the problem is that if even a single S-tuple of column vectors behaved badly, they could in principle provide a better l^1 minimiser than the true sparse signal in a large number of cases, even when the true signal involves a completely different S-tuple of vectors.
My guess though is that de Vore’s construction will have to be modified a little bit – it is optimised for sqrt(m) in some sense.
There are of course a few CS results which do not require the RIP property, for instance my first paper with Emmanuel on recovery of sparse signals worked even for oversampling ratios in which the RIP is only conjectured to be true. But in order to get robust results (i.e. ones which work for compressible signals rather than sparse, or in the presence of noise), as well as universal results (ones which work for all sparse/compressible data rather than generic sparse/compressible data) it seems that something very close to the RIP is needed.
Dear CS Boy: It is conceivable that a sufficiently sophisticated pseudorandom number generator could generate a good random matrix with probability 1, rather than very close to 1, but it would be hopeless with current technology to establish this rigorously (it comes dangerously close to trying to prove $P = BPP$). But perhaps this may form a basis for a conjectural deterministic construction of UUP matrices.
9 July, 2007 at 5:44 am
nimbusgear
Dear Terry, i’ve been following your posts for awhile now, and while i find a good portion of what you’re describing as a bit too intricate for myself to grasp at this point in time, i do pick up on some of it.. in this case, you mentioned the following feature:
“…The point is that in a high-dimensional space such as Cm, any two randomly selected vectors are very likely to be almost orthogonal to each other…”
is this implying that there is a level of informational reduction occurring?
where if you were to, for example, need to describe an angular measurement between two high dimensional vectors, you’d not need to specify 360*granularity degrees, but rather something like 3.6*granularity degrees?
particularly, this is the type of research that i’m really working on understanding fully.. the mechanism of compression in regards to the selection of the algorithm itself, being able to multiply the resultant information of the data.
9 July, 2007 at 5:58 am
Ajay Bangla
Dear Prof. Tao,
You write :
“It turns out that UUP matrices can have many more rows than columns;…”
Shouldn’t it be UUP matrices can have many more columns than rows?
~Ajay
9 July, 2007 at 7:55 am
Terence Tao
Dear nimbusgear,
It is true that the angle between two vectors usually does not convey much information about those two vectors (except in those rare cases in which the angle is quite small). An analogy would be with an n-bit random string of 0s and 1s. Instead of measuring an “angle”, suppose one measures the total number of 1s in this string. In principle, this number could range all the way from 0 to n (just like angles can range from 0 to 180 degrees). However, in practice a “concentration of measure” occurs, and the total number of 1s is very close to n/2 (typically deviating from this mean by at most O(sqrt(n)), thanks to the law of large numbers or the central limit theorem), just like how the angle concentrates near 90 degrees with a standard deviation of O(1/sqrt(n)). Indeed, the Shannon entropy of this number-of-1s measurement is closer to (1/2) log n than to log n.
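For concreteness, a quick numpy simulation of this concentration phenomenon (the constants are only heuristic):

```python
import numpy as np

# The number of 1s in a random n-bit string stays within O(sqrt(n)) of n/2,
# and the angle between two random vectors in R^n stays within O(1/sqrt(n))
# radians of 90 degrees.
rng = np.random.default_rng(3)
n, trials = 10000, 2000

ones = rng.integers(0, 2, size=(trials, n)).sum(axis=1)
print(ones.mean(), ones.std(), np.sqrt(n) / 2)        # mean ~ n/2, std ~ sqrt(n)/2

x = rng.standard_normal((trials, n))
y = rng.standard_normal((trials, n))
cosines = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
angles = np.degrees(np.arccos(cosines))
print(angles.mean(), angles.std())                    # mean ~ 90 degrees, small spread
```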
Dear Ajay: thanks for the correction!
18 July, 2007 at 1:53 am
Yet Another Machine Learning Blog » Compressed sensing [Pierre Dangauthier]
[…] in hight dimensional spaces, 2 random vectors are almost always orthogonal (with a slightly modified version of orthogonality), which guarantee non-correlation with high […]
23 August, 2007 at 6:01 am
Yet Another Machine Learning Blog » Why 2 random vectors are orthogonal in high dimention? [Pierre Dangauthier]
[…] gives the following graph. We clearly see the concentration phenomena that Terry was talking about. When s gets bigger, we obtain a narrower distribution, and the probability that a_1 is […]
22 March, 2008 at 11:21 am
The Dantzig selector: Statistical estimation when p is much larger than n « What’s new
[…] 22 March, 2008 in math.ST, update Tags: compressed sensing, Dantzig selector, estimation, lasso selector, linear programming, mean square error, restricted isometry property More than two years ago, Emmanuel Candés and I submitted the paper “The Dantzig selector: Statistical estimation when p is much larger than n” to the Annals of Statistics. This paper, which appeared last year, proposed a new type of selector (which we called the Dantzig selector, due to its reliance on the linear programming methods to which George Dantzig, who had died as we were finishing our paper, had contributed so much) for statistical estimation, in the case when the number p of unknown parameters is much larger than the number n of observations. More precisely, we considered the problem of obtaining a reasonable estimate for an unknown vector of parameters given a vector of measurements $y = X\beta + z$, where X is a known predictor matrix and z is a (Gaussian) noise error with some variance $\sigma^2$. We assumed that the data matrix X obeyed the restricted isometry property (RIP), which roughly speaking asserts that $X\beta$ has norm comparable to $\|\beta\|$ whenever the vector $\beta$ is sparse. This RIP property is known to hold for various ensembles of random matrices of interest; see my earlier blog post on this topic. […]
2 June, 2008 at 2:40 am
CS_newuser
Dear Prof. Terrence,
Please see the following quote from the above
“For instance, one can easily show that the map……… is still injective as long as one restricts attention to input vectors which are S/2-sparse or better (i.e. at most S/2 of the coefficients are allowed to be non-zero). This still leaves the question of how to efficiently recover the sparse coefficients from the transformed vector . The algorithm (2) is no longer accurate, however if the coefficients are just a little bit sparser than S/2 (e.g. S/3 will do) then one can instead use the algorithm of basis pursuit to recover the coefficients perfectly.”
As the signal is already S-sparse, what do S/2 and S/3 have to do with it? Please explain. Should it instead be m/2 or m/3 rather than S/2 or S/3?
I am new to this subject, so in case my question sounds absurd, please forgive me however I would still request you please clarify it for me.
2 June, 2008 at 4:43 pm
Terence Tao
Dear CS_newuser,
One can of course form the linear combination $\sum_j a_j v_j$ for any choice of coefficients $a_1,\ldots,a_n$ that one wishes, whether they are S/2-sparse, S-sparse, or not sparse at all. If they happen to be S-sparse or better, the UUP says something about the norm of this linear combination, but it is possible for two different S-sparse combinations $\sum_j a_j v_j$, $\sum_j b_j v_j$ to sum to the same value.
If however one considers S/2-sparse combinations $\sum_j a_j v_j$, $\sum_j b_j v_j$, then their difference will be an S-sparse combination or better. Applying the UUP to the difference, we see that if the coefficients $a_j$ are not identical to the coefficients $b_j$, then the difference has non-zero norm, and so we have injectivity for S/2-sparse combinations.
1 September, 2008 at 9:45 am
Kanke
Dr Tao,
I have a quick question: does the parameter S for a UUP matrix have any possibility to be greater than m? The range of S should be [1, n] (my opinion). If someone finds a UUP matrix with S larger than m, can we still use basis pursuit to recover the S-sparse signal exactly? Thanks.
2 September, 2008 at 9:18 am
Terence Tao
Dear Kanke,
It is not possible for S to exceed m, because if one has more than m vectors in $\mathbb{C}^m$, then there must be a non-trivial linear dependence amongst those vectors, which is not compatible with the UUP (or RIP) condition (3).
21 April, 2009 at 8:11 pm
Alex
Dr Tao:
Have there been any follow-ups on this topic? I have not been able to find any literature on efficient methods for determining the RIP property of these matrices, in order to use the test in a practical scenario. Also, have other necessary conditions been given for perfect recovery, given that the RIP is only a sufficient condition?
Thank you
18 August, 2009 at 9:14 am
oz
Dear Prof. Tao,
I have a question which is somewhat related to this post. Are there any known results on sparse sensing matrices (whether deterministic or randomized)?
I’ll try to explain what I mean: It is known that random Bernoulli or Gaussian matrices satisfy the RIP with high probability. These matrices are dense. In many applications I would imagine that the number of non-zero entries in the matrix is itself of considerable cost (as they represent linear combinations of the signal, and each such entry requires access to a given coordinate of the signal, which can cost some hardware/time/etc.). So how low can you go with the matrix density and still keep the RIP property? For example, if we take random matrices whose entries are zero with probability 1-p, and 1 or -1 each with probability p/2, how low can we take p to be? Clearly, if the matrix becomes too sparse, then eventually we reach the point where each row of the matrix measures only a few (or just one) coordinates of the signal, and then we’ll need the number of rows to be roughly equal to the length of the input signal (far more than its sparsity). Thus the breakdown of RIP occurs sooner (i.e. before the matrix becomes completely sparse). Is it known when this happens? Is it a gradual decline, or is there some ‘phase transition’ at a threshold p? (Or some other threshold, depending on the matrix size, e.g. maybe you need log N non-zero entries in each row?)
Thank you,
OZ
18 August, 2009 at 9:35 am
Terence Tao
There are some combinatorial examples of sparse matrices with good compressed sensing properties, though one uses a slight variant of RIP in this case, see
Click to access report.pdf
but I would imagine that the proof of RIP for iid matrices (which ultimately comes down to control on the singular values) would also work in the sparse case (for instance, in my first paper with Van Vu on the circular law, we found that the relevant singular value bounds for dense matrices could be adapted to sparse matrices without difficulty).
18 August, 2009 at 3:35 pm
oz
Thanks a lot for your rapid response – the papers look indeed quite relevant
(though will need some time to digest them..)
Best,
OZ
29 April, 2010 at 1:53 pm
husveges
Dear Dr. Tao,
You describe UUP matrices with normalized columns and give a nice geometric interpretation that l1 minimization will retrieve not only the minimal coefficients of the columns in the l1 sense, but also the sparsest combination.
In Candes’ and Romberg’s l1 Magic guide they generate a Gaussian matrix with normalized rows, and then proceed with l1 minimization:
Click to access l1magic.pdf
Why are they normalizing the rows? Does the same geometric intuition hold?
29 April, 2010 at 10:26 pm
Kishor
Dear Sir
I read in many places (e.g. http://sites.google.com/site/igorcarron2/cs) that checking whether a family of matrices satisfies RIP is NP-hard. If that is the case, then do we expect to find a deterministic construction of matrices satisfying RIP?
Thanks
Kishor
18 February, 2012 at 4:59 am
anon
Showing NP-hardness of a property does not mean it is necessarily hard to produce satisfying instances. For example, checking if a Boolean formula is satisfiable is NP-hard, yet it is very simple to give tons of constructions of satisfiable Boolean formulas of any size.
17 February, 2012 at 8:04 pm
The road to deterministic matrices with the restricted isometry property « Short, Fat Matrices
[…] To reconstruct a sparse vector from relatively few measurements , it has become popular to find an estimate of minimal 1-norm. This estimate happens to be the sparsest vector in the preimage of provided satisfies the restricted isometry property (RIP). In words, an RIP matrix acts as a near-isometry on sufficiently sparse vectors, and since its introduction in 2004, this property has become an important subject of matrix design (see Terry Tao’s blog post on the problem). […]
2 May, 2012 at 8:55 am
Certifying the restricted isometry property is hard « Short, Fat Matrices
[…] “An alternate approach, and one of interest in its own right, is to work on improving the time it takes to verify that a given matrix (possibly one of a special form) obeys the UUP.” – Tao […]
26 May, 2013 at 10:27 am
Anonymous
The work by DeVore is just a repackaging of the Nisan-Wigderson designs studied a long time ago. Unfortunately, he never cites those references.
2 December, 2013 at 12:21 pm
Deterministic RIP matrices: Breaking the square-root bottleneck | Short, Fat Matrices
[…] years after Terry Tao posed this problem on his blog, the problem was partially solved in this breakthrough […]
4 December, 2013 at 7:52 pm
Deterministic RIP matrices: a new cooperative online project | Relax and Conquer
[…] started thinking about the problem of building deterministic RIP matrices 3 years ago, when, as a first year graduate student at Princeton, I was sharing an […]
31 July, 2014 at 12:10 pm
The restricted isometry property. | I Can't Believe It's Not Random!
[…] dimensions which satisfies the restricted isometry property was explicitly formulated by Tao in his blog (he uses the term UUP instead of RIP, standing for uniform uncertainty principle). This paper […]
28 October, 2014 at 2:57 pm
Conditional Restricted Isometries | Relax and Conquer
[…] problem of building deterministic RIP matrices that break the squareroot bottleneck was posed by Terry Tao in his blog more than 7 years ago. Since then, only one deterministic construction is known to break this bottleneck, due to Bourgain […]
27 December, 2015 at 7:56 am
Genius at Play: The Curious Mind of John Horton Conway | Short, Fat Matrices
[…] keep 4 research problems in mind: A big problem (an open problem like matrix multiplication or deterministic RIP), a workable problem (the sort of problem that puts food on the table like clustering and […]
26 August, 2016 at 3:26 am
reza
hello sir
I read the articles and don’t get why exactly picking the rows of a matrix needs to be random. What is the benefit of randomness? And what about the deterministic approach that some people suggest (especially in communications, for subsampling the Fourier matrix)?