I’ve had a number of people ask me (especially in light of some recent publicity) exactly what “compressed sensing” means, and how a “single pixel camera” could possibly work (and how it might be advantageous over traditional cameras in certain circumstances). There is a large literature on the subject, but as the field is relatively recent, there does not yet appear to be a good non-technical introduction to the subject. So here’s my stab at the topic, which should hopefully be accessible to a non-mathematical audience.

For sake of concreteness I’ll primarily discuss the camera application, although compressed sensing is a more general measurement paradigm which is applicable to other contexts than imaging (e.g. astronomy, MRI, statistical selection, etc.), as I’ll briefly remark upon at the end of this post.

The purpose of a camera is, of course, to record images. To simplify the discussion, let us think of an image as a rectangular array, e.g. a 1024 x 2048 array of pixels (thus there are 2 megapixels in all). To ignore the (minor) issue of colour, let us assume that we are just taking a black-and-white picture, so that each pixel is measured in grayscale as an integer (e.g. an 8-bit integer from 0 to 255, or a 16-bit integer from 0 to 65535) which signifies the intensity of each pixel.

Now, to oversimplify things quite a bit, a traditional digital camera would take one measurement of intensity for each of its pixels (so, about 2 million measurements in the above example), resulting in a relatively large image file (2MB if one uses 8-bit grayscale, or 4MB if one uses 16-bit grayscale). Mathematically, this file can be represented by a very high-dimensional vector of numbers (in this example, the dimension is about 2 million).

Before I get to the new story of “compressed sensing”, I have to first quickly review the somewhat older story of plain old “compression”. (Those who already know how image compression works can skip forward a few paragraphs.)

The images described above can take up a lot of disk space on the camera (or on some computer where the images are later uploaded), and also take a non-trivial amount of time (and energy) to transfer from one medium to another. So, it is common practice to get the camera to *compress* the image, from an initial large size (e.g. 2MB) to a much smaller size (e.g. 200KB, which is 10% of the size). The thing is that while the space of *all *images has 2MB worth of “degrees of freedom” or “entropy”, the space of all *interesting *images is much smaller, and can be stored using much less space, especially if one is willing to throw away some of the quality of the image. (Indeed, if one generates an image at random, one will almost certainly not get an interesting image; instead, one will just get random noise looking much like the static one can get on TV screens.)

How can one compress an image? There are many ways, some of which are rather technical, but let me try to give a non-technical (and slightly inaccurate) sketch of how it is done. It is quite typical for an image to have a large featureless component – for instance, in a landscape, up to half of the picture might be taken up by a monochromatic sky background. Suppose for instance that we locate a large square, say pixels, which are all exactly the same colour – e.g. all white. Without compression, this square would take 10,000 bytes to store (using 8-bit grayscale); however, instead, one can simply record the dimensions and location of the square, and note a single colour with which to paint the entire square; this will require only four or five bytes in all to record, leading to a massive space saving. Now in practice, we don’t get such an impressive gain in compression, because even apparently featureless regions have some small colour variation between them. So, given a featureless square, what one can do is record the *average* colour of that square, and then subtract that average off from the image, leaving a small residual error. One can then locate more squares where the average colour is significant, and subtract those off as well. If one does this a couple times, eventually the only stuff left will be very small in magnitude (intensity), and not noticeable to the human eye. So we can throw away the rest of the image and record only the size, location, and intensity of the “significant” squares of the image. We can then reverse this process later and reconstruct a slightly lower-quality replica of the original image, which uses much less space.

Now, the above algorithm is not all that effective in practice, as it does not cope well with sharp transitions from one colour to another. It turns out to be better to work not with average colours in squares, but rather with average colour *imbalances* in squares – the extent to which the intensity on (say) the right half of the square is higher on average than the intensity on the left. One can formalise this by using the (two-dimensional) Haar wavelet system. It then turns out that one can work with “smoother” wavelet systems which are less susceptible to artefacts, but this is a technicality which we will not discuss here. But all of these systems lead to similar schemes: one represents the original image as a linear superposition of various “wavelets” (the analogues of the coloured squares in the preceding paragraph), stores all the significant (large magnitude) wavelet coefficients, and throws away (or “thresholds”) all the rest. This type of “hard wavelet coefficient thresholding” compression algorithm is not nearly as sophisticated as the ones actually used in practice (for instance in the JPEG 2000 standard) but it is somewhat illustrative of the general principles in compression.

To summarise (and to oversimplify somewhat), the original image may have two million degrees of freedom, and in particular if one wants to express this image in terms of wavelets then one would need thus need two million different wavelets in order to reconstruct all images perfectly. However, the typical *interesting* image is very *sparse* or *compressible *in the wavelet basis: perhaps only a hundred thousand of the wavelets already capture all the notable features of the image, with the remaining 1.9 million wavelets only contributing a very small amount of “random noise” which is largely invisible to most observers. (This is not always the case: heavily *textured* images – e.g. images containing hair, fur, etc. – are not particularly compressible in the wavelet basis, and pose a challenge for image compression algorithms. But that is another story.)

Now, if we (or the camera) knew in advance which hundred thousand of the 2 million wavelet coefficients are going to be the important ones, then the camera could just measure those coefficients and not even bother trying to measure the rest. (It is possible to measure a single coefficient by applying a suitable “filter” or “mask” to the image, and making a single intensity measurement to what comes out.) However, the camera does not know which of the coefficients are going to be the key ones, so it must instead measure all 2 million pixels, convert the image to a wavelet basis, locate the hundred thousand dominant wavelet coefficients to keep, and throw away the rest. (This is of course only a caricature of how the image compression algorithm really works, but we will use it for sake of discussion.)

Now, of course, modern digital cameras work pretty well, and why should we try to improve on something which isn’t obviously broken? Indeed, the above algorithm, in which one collects an enormous amount of data but only saves a fraction of it, works just fine for consumer photography. Furthermore, with data storage becoming quite cheap, it is now often feasible to use modern cameras to take many images with no compression whatsoever. Also, the computing power required to perform the compression is manageable, even if it does contribute to the notoriously battery-draining energy consumption level of these cameras. However, there are non-consumer imaging applications in which this type of data collection paradigm is infeasible, most notably in sensor networks. If one wants to collect data using thousands of sensors, which each need to stay *in situ* for long periods of time such as months, then it becomes necessary to make the sensors as cheap and as low-power as possible – which in particular rules out the use of devices which require heavy computer processing power at the sensor end (although – and this is important – we are still allowed the luxury of all the computer power that modern technology affords us at the *receiver *end, where all the data is collected and processed). For these types of applications, one needs a data collection paradigm which is as “dumb” as possible (and which is also robust with respect to, say, the loss of 10% of the sensors, or with respect to various types of noise or data corruption).

This is where *compressed sensing* comes in. The main philosophy is this: if one only needs a 100,000 components to recover most of the image, why not just take a 100,000 measurements instead of 2 million? (In practice, we would allow a safety margin, e.g. taking 300,000 measurements, to allow for all sorts of issues, ranging from noise to aliasing to breakdown of the recovery algorithm.) In principle, this could lead to a power consumption saving of up to an order of magnitude, which may not mean much for consumer photography but can be of real importance in sensor networks.

But, as I said before, the camera does not know in advance which hundred thousand of the two million wavelet coefficients are the important ones that one needs to save. What if the camera selects a completely different set of 100,000 (or 300,000) wavelets, and thus loses all the interesting information in the image?

The solution to this problem is both simple and unintuitive. It is to make 300,000 measurements which are *totally unrelated* to the wavelet basis – despite all that I have said above regarding how this is the best basis in which to view and compress images. In fact, the best types of measurements to make are (pseudo-)*random* measurements – generating, say, 300,000 random “mask” images and measuring the extent to which the actual image resembles each of the masks. Now, these measurements (or “correlations”) between the image and the masks are likely to be all very small, and very random. But – and this is the key point – each one of the 2 million possible wavelets which comprise the image will generate their own distinctive “signature” inside these random measurements, as they will correlate positively against some of the masks, negatively against others, and be uncorrelated with yet more masks. But (with overwhelming probability) each of the 2 million signatures will be distinct; furthermore, it turns out that arbitrary linear combinations of up to a hundred thousand of these signatures will still be distinct from each other (from a linear algebra perspective, this is because two randomly chosen 100,000-dimensional subspaces of a 300,000 dimensional ambient space will be almost certainly disjoint from each other). Because of this, it is possible *in principle* to recover the image (or at least the 100,000 most important components of the image) from these 300,000 random measurements. In short, we are constructing a linear algebra analogue of a hash function.

There are however two technical problems with this approach. Firstly, there is the issue of noise: an image is not perfectly the sum of 100,000 wavelet coefficients, but also has small contributions from the other 1.9 million coefficients also. These small contributions could conceivably disguise the contribution of the 100,000 wavelet signatures as coming from a completely unrelated set of 100,000 wavelet signatures; this is a type of “aliasing” problem. The second problem is how to use the 300,000 measurements obtained to recover the image.

Let us focus on the latter problem first. If we knew which 100,000 of the 2 million wavelets involved were, then we could use standard linear algebra methods (Gaussian elimination, least squares, etc.) to recover the signal. (Indeed, this is one of the great advantages of linear encodings – they are much easier to invert than nonlinear ones. Most hash functions are practically impossible to invert – which is an advantage in cryptography, but not in signal recovery.) However, as stated before, we don’t know in advance which wavelets are involved. How can we find out? A naive least-squares approach gives horrible results which involve all 2 million coefficients and thus lead to very noisy and grainy images. One could perform a brute-force search instead, applying linear algebra once for each of the possible set of 100,000 key coefficients, but this turns out to take an insanely impractical amount of time (there are roughly combinations to consider!) and in any case this type of brute-force search turns out to be NP-complete in general (it contains problems such as subset-sum as a special case). Fortunately, however, there are two much more feasible ways to recover the data:

*Matching pursuit*: locate a wavelet whose signature seems to correlate with the data collected; remove all traces of that signature from the data; and repeat until we have totally “explained” the data collected in terms of wavelet signatures.*Basis pursuit*(or minimisation): Out of all the possible combinations of wavelets which would fit the data collected, find the one which is “sparsest” in the sense that the total sum of the magnitudes of all the coefficients is as small as possible. (It turns out that this particular minimisation tends to force most of the coefficients to vanish.) This type of minimisation can be computed in reasonable time via convex optimisation methods such as the simplex method.

Note that these image recovery algorithms do require a non-trivial (though not ridiculous) amount of computer processing power, but this is not a problem for applications such as sensor networks since this recovery is done on the receiver end (which has access to powerful computers) rather than the sensor end (which does not).

There are now rigorous results which show that these approaches can reconstruct the original signals perfectly or almost-perfectly with very high probability of success, given various compressibility or sparsity hypotheses on the original image. The matching pursuit algorithm tends to be somewhat faster, but the basis pursuit algorithm seems to be more robust with respect to noise. Exploring the exact range of applicability of these methods is still a highly active current area of research. (Sadly, there does not seem to be an application to ; the type of sparse recovery problems which are NP-complete are the total opposite (as far as the measurement matrix is concerned) with the type of sparse recovery problems which can be treated by the above methods.)

As compressed sensing is still a fairly new field (especially regarding the rigorous mathematical results), it is still a bit premature to expect developments here to appear in actual sensors. However, there are proof-of-concept prototypes already, most notably the single-pixel camera developed at Rice.

Finally, I should remark that compressed sensing, being an abstract mathematical idea rather than a specific concrete recipe, can be applied to many other types of contexts than just imaging. Some examples include:

- Magnetic resonance imaging (MRI). In medicine, MRI attempts to recover an image (in this case, the water density distribution in a human body) by taking a large but finite number of measurements (basically taking a discretised Radon transform (or x-ray transform) of the body), and then reprocessing the data. Because of the large number of measurements needed, the procedure is lengthy for the patient. Compressed sensing techniques can reduce the number of measurements required significantly, leading to faster imaging (possibly even to real-time imaging, i.e. MRI videos rather than static MRI). Furthermore, one can trade off the number of measurements against the quality of the image, so that by using the same number of measurements as one traditionally does, one may be able to get much finer scales of resolution.
- Astronomy. Many astronomical phenomena (e.g. pulsars) have various frequency oscillation behaviours which make them very sparse or compressible in the frequency domain. Compressed sensing techniques then allow one to measure these phenomena in the time domain (i.e. by recording telescope data) and being able to reconstruct the original signal accurately even from incomplete and noisy data (e.g. if weather, lack of telescope time, or simply the rotation of the earth prevents a complete time-series of data).
- Linear coding. Compressed sensing also gives a simple way for multiple transmitters to combine their output in an error-correcting way, so that even if a significant fraction of the output is lost or corrupted, the original transmission can still be recovered. For instance, one can transmit 1000 bits of information by encoding them using a random linear code into a stream of 3000 bits; and then it will turn out that even if, say, 300 of the bits (chosen adversarially) are then corrupted, the original message can be reconstructed perfectly with essentially no chance of error. The relationship with compressed sensing arises by viewing the corruption itself as the sparse signal (it is only concentrated on 300 of the 3000 bits).

Many of these applications are still only theoretical, but nevertheless the potential of these algorithms to impact so many types of measurement and signal processing is rather exciting. From a personal viewpoint, it is particularly satisfying to see work arising from pure mathematics (e.g. estimates on the determinant or singular values of Fourier minors) end up having potential application to the real world.

[*Update*, April 15: For some explicit examples of how compressed sensing works on test images, see this page.]

## 102 comments

Comments feed for this article

13 April, 2007 at 9:38 pm

randomwalkerOn anyone else’s blog this would be an excellent article, but you should be proving deep and awe-inspiring theorems rather than engaging in pedagogy for non-technical audiences.

Re. P \neq NP: were you comparing it with learning-parity-with-noise?

13 April, 2007 at 10:33 pm

Harald Hanche-OlsenRandomwalker, I am sure that Terence Tao is perfectly able to prioritize his own time. His track record so far proves it. It’s an interesting phenomenon to me that so many successful people have an unusually broad range of interests, and somehow find the time to pursue them. I believe this may be a help, rather than a hindrance, since having broad interests is helpful both in finding good research topics and in solving them. (I might worry too, if this blog got filled up with the sort of post we see here, but that doesn’t seem likely.)

13 April, 2007 at 10:43 pm

Marktough crowd :-)

13 April, 2007 at 11:28 pm

randomwalkerHarald: you’re right. I should think before speaking :)

13 April, 2007 at 11:58 pm

marcoI hope Terence Tao will keep publishing posts of this kind: they are very well written, accessible to non-technical people, and full of hints for those who want to dig a bit deeper.

That’s more than enough for me to read them.

14 April, 2007 at 1:15 am

AnonymousIf I were the author, I would be shocked or even saddened by the first comment …

14 April, 2007 at 2:08 am

Radu GrigoreThank you for this great post.

Here is something you might find mildly interesting. A month ago TopCoder organized a marathon (i.e., one week long) match in which people had to write a program that reconstructs a 100×100 grid of `densities’ based on the information gathered from 10000 (noisy) measurements. A few tried the Radon transform but none used it in the end because it was hard to implement and didn’t work well at all with so few measurements. I wonder if people would have done better if they would have known about compressed sensing and read some of the articles you link to here…

14 April, 2007 at 4:51 am

aravindthanks for your detailed description as usual, terry. you say: “… two randomly chosen 100,000-dimensional subspaces of a 300,000 dimensional ambient space will be almost certainly disjoint from each other”. i must be missing something, but from the “birthday phenomeon”, isn’t the above true only if the two subspaces have dimension up to about sqrt(300,000), i.e., much smaller than 100,000?

14 April, 2007 at 10:46 am

Brian Bensonrandomwalker, to expand on Professor Hanche-Olsen’s comments, it may seem naively true that the opportunity cost of Professor Tao writing this post is quite high; however, this argument is a rather short-sighted. It is my interpretation that the post above seeks to increase overall public awareness of and interest in both Professor Tao’s work and the relatively new field of compressed sensing. This increase in public awareness is likely to increase the demand for further works by Professor Tao and promote future work in the area of compressed sensing. This is not to mention the educational value of the post to the community as a whole.

14 April, 2007 at 2:17 pm

Igor CarronRadu, the current results seem to indicate that you need to know exactly the random functions or masks onto which the measurements are made. In other words, the amount of information needed to reconstruct the signal includes all the random functions or the masks on which the original signal was projected on. Therefore, being given only a signal corrupted with noise is not, as far as I understand, a case where one could reconstruct the original signal. Furthermore, nothing in the topcoder problem statement says anything about these densities being sparse in any basis. If these densities are random spikes, then less than 10000 random masks won’t do any good with the current results. For instance, if one uses a one-pixel camera to image some random brownian motion in a microscope application, it would take a long time to get a full picture. It might be better to flip those micro-mirrors one at a time!

14 April, 2007 at 2:54 pm

Greg Kuperbergaravind: No, subspaces are different from subsets. Technically speaking, the subspaces will be disjoint with probability 1. You should instead consider a suitable notion of approximate intersection, so that if one of the angles between the subspaces is merely moderately acute, then that counts as tantamount to an intersection. A “moderately acute” angle is typically an angle less than, say, 80 degrees.

Nonetheless, if you have two 1/3-dimensional random subspaces of a large vector spaces, the chance of even an approximate intersection is exponentially small. This phenomenon arises, for instance, in quantum computing, where subsets are commonly replaced by subspaces for various purposes.

14 April, 2007 at 3:08 pm

Greg KuperbergBrian Benson: You could be right about the merit of Terry’s posts here, but in a way you’re also reading more into the question than is really there. I have met a few Fields Medalists such as Terry, and the truth is that they put their pants on just like everyone else. They can procrastinate, or get distracted, or otherwise be ordinary human beings.

On the other hand, it is true that Terry in particular does seem to be more efficient in his work than many other mathematicians. It is also true that many award-winning scientists and mathematicians get inundated with attention, most of which is perfectly respectable, but it gets to be too much and they are forced to prioritize.

14 April, 2007 at 3:58 pm

Top Posts « WordPress.com[...] Compressed sensing and single-pixel cameras I’ve had a number of people ask me (especially in light of some recent publicity) exactly what “compressed […] [...]

14 April, 2007 at 7:33 pm

Terence TaoAravind: there is an analogy between the birthday paradox (which regards subsets of a finite set) and the intersection of random subspaces (which regards subspaces of a finite dimensional vector space). The birthday paradox tells you that if you have a set with N elements and you pick two random subsets of size much larger than , then there will very likely be a collision, but if you pick random subsets of size much smaller than , then there is very little chance of a collision.

Very informally, you can think of a d-dimensional space as having elements, and an r-dimensional subspace as having elements. And, as with birthday, if you pick two random subspaces of dimension r bigger than d/2, there is necessarily a non-trivial intersection, whereas if you pick subspaces of dimension r less than d/2, there is almost surely no intersection except at the origin.

High-dimensional geometry has a number of unintuitive properties which distinguish it from low-dimensional geometry. One of them is that in very high dimensions, any two randomly chosen directions (or one deterministic direction and one random direction) are very very likely to be very close to orthogonal. (For instance, the angle between two random vectors in is going to deviate by more than 1 degree from a right angle with a probability something like – i.e. it is very likely to deviate in low dimension but almost never deviates in very high dimensions.) This fact is also related to the law of large numbers.

Randomwalker: There may well be a connection between parity-with-noise and sparsest-solution-of-linear-systems, but the connection I had in mind was that one can encode the subset-sum problem as a very special case of the sparsest-solution-of-linear-systems problem, showing that the latter problem is NP-hard in the worst case. On the other hand, it seems that this problem can be solved in polynomial time by basis pursuit in the “average case” for suitable definitions of “average case”. As I understand it, this is a fairly typical state of affairs (a general class of problems being NP-hard worst-case but P or P# in average-case).

14 April, 2007 at 8:26 pm

techbuffetThanks for the great explanation. I have heard about compressed sensing for a while, but could not quite wrap my mind around it. I think I understand the basic idea now.

15 April, 2007 at 11:08 am

Ostrowski lecture: The uniform uncertainty principle and compressed sensing « What’s new[...] mentioned in the previous post here, compressed sensing is a relatively new measurement paradigm which seeks to capture the [...]

16 April, 2007 at 9:58 am

piotrNice post! One comment: in addition to camera, MRI, astronomy, and linear coding, random linear measurements also have applications to so-called “data stream algorithms”. For example, the signal x could represent a pattern of traffic passing through a network router, with x_i being the number of packets, say, going to destination i. Since x is very high-dimensional, one cannot really store it explicitly. However, one can store the measurement vector Ax instead, and use it to recover an approximation of x. The linearity of measurements is crucial here, since it enables maintaining Ax under linear updates to x. E.g., if you see a new packet with destination i, you can easily compute A(x+e_i)=Ax + Ae_i from Ax. So everything stays in the measurement space, so to speak.

17 April, 2007 at 11:07 am

Igor CarronAnother comment. I was looking at the some of the most advanced work on modeling the brain, and there is an uncanny parallel (if not match) between at least one model of visual processing and compressed sensing.

T. Serre, A. Oliva and T. Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Science, 104(15), pp. 6424-6429, April 2007

http://www.pnas.org/cgi/content/abstract/104/15/6424?etoc

The model is a layered one where at every layer, cells respond to a specific and increasingly complex stimuli. The interesting bit is that the model shows, with painstaking parallel with current biology understanding, that it is in fact one way (no need to compute all the coefficients first and then threshold them) and moreover, at every step there is a random picking of some of the stimuli to be used by the next layer.

As stated in the abstract, the model “can predict the level and the pattern of performance achieved by humans on a rapid masked animal vs. non-animal categorization task.”

18 April, 2007 at 8:22 am

Chris HillmanFascinating thread!

Re astronomy, here’s an interesting proposal: http://news.bbc.co.uk/2/hi/science/nature/6566317.stm One cannot but ask: what if these miniature gliders carried a single pixel camera and had some way to communicate with an orbiter which could relay information back to Earth?

Many years ago, I coauthored an argument which I hoped to present to ESA, via a private thinktank, to the effect that it makes good sense to move toward space exploration missions using a highly redundant fleet of small, simple, relatively cheap spacecraft each designed to do just one thing. As far as I know, this never went anywhere.

19 April, 2007 at 1:43 pm

chris forresterregarding compression.. do you think that it would be possible to create an infinitely compressible system using arbitrary data?

that is.. do you think you could utilize a stream to provide a folding mechanism for the data itself?

22 April, 2007 at 9:43 am

RDGreat writeup!!

But I do not quite understand the following:

Sadly, there does not seem to be an

application to P\neq NP; ………… be

treated by the above methods

Terry: Could you please give a few more details?

23 April, 2007 at 7:37 am

Terence TaoDear RD,

Suppose you are given a matrix A and a vector b and want to solve the problem “find the sparsest vector x for which Ax=b”.

For certain matrices A (basically, those in which there are plenty of linear dependencies between a relatively small number of columns) this problem is NP-hard.

What the work of ourselves and others shows is that for other types of matrices A (where the columns are close to being orthogonal, e.g. random matrices typically satisfy this), the problem can be solved in polynomial time by linear programming methods.

Unfortunately, these two classes of matrices seem to be totally disjoint, so there is no obvious progress on here. As I said in an earlier comment, this situation (NP-hard in worst case, P in average case) seems to be rather common.

7 September, 2011 at 12:26 pm

mttpdJust a quick question — regarding the comment: “As I understand it, this is a fairly typical state of affairs (a general class of problems being NP-hard worst-case but P or P# in average-case).”

Is it in any way related to smoothed analysis?

http://www.cs.yale.edu/homes/spielman/SmoothedAnalysis/framework.html

23 April, 2007 at 8:57 am

RDTerry: Thanks for the explanation.

I understand this is a new area and there are possibly

many open problems.

Is there any particular (doable!) open problem that you think is interesting?

23 April, 2007 at 4:12 pm

AndyDear Prof. Tao,

Thanks so much for taking time to write this post! I feel it is very helpful.

Regarding your comment on high-dimensional geometry, a couple of months ago, I was actually playing around with the angle \theta between one fixed vector and another randomly picked vector on n-dim unit ball and trying to calculate the probability Pr( cos(\theta)>1/n^(0.5-\epsilon)), I found when \epsilon>0, this probability decays to zero exponentially fast with n, and if \epsilon

23 April, 2007 at 7:23 pm

chris forresterWhat about using arbitrary data’s harmonic alignments with one another to prove P != NP? If you’re into this idea, please check out http://nimbusgear.wordpress.com/

to get an idea of what i’m talking about.

From what i can tell, because of the abstraction levels, this is a really good way of representing the problem in terms of something that is incredibly easy to check, but may take forever to determine the answer of.

It’s a.. holistic.. approach. My own application was to produce a mechanism for the infinite compression of arbitrary data, and i think the mechanism i discovered/shaped will also be of use in the P != NP problem.

24 April, 2007 at 6:00 am

AnonymousDoes this potentially provide a better way than bicubic interpolation to resize an image from N x N to 2N x 2N? If so, maybe Photoshop should include it as a filter.

24 April, 2007 at 7:21 am

Terence TaoDear Andy,

There is a glitch in the commenting parser: if you have a post which contains both > and < then everything in between gets interpreted as HTML and thus disappears. You can use > and < instead.

Dear Chris: there are significant theoretical limits as to lossless compression of arbitrary data, even if one asssumes infinite computational time. For instance, for digital data, a simple counting argument shows that any lossless compression algorithm must lengthen at least some input data. For analog data one can do something similar once one realises that measurement error (and other sources of error) are non-zero. (Incidentally, this is a common pitfall regarding attempts to prove P=NP; if one wants to run an algorithm using an analog computer, the algorithm will only be equivalent to a P algorithm if _all_ parameters are polynomial size; not just the data input and running time, but also the measurement accuracy, the energy consumption, temperature, etc.)

Dear Anonymous: In principle one could use compressed sensing techniques for image resizing, but I believe that special-purpose image processing tools customised for this specific task will perform better. Also, the N x N array of measurements do not look particularly “random” with respect to the wavelet basis that one expects the image to be sparse or compressible with respect to, and so these algorithms will not perform as well as they do in the random measurement case.

24 April, 2007 at 1:17 pm

AndyDear Prof. Tao,

Sorry for the above message and thanks a lot for the help :) I will just type my question again:

Regarding your comment on high-dimensional geometry, a couple of months ago, I was actually playing around with the angle between one fixed vector and another randomly picked vector on n-dim unit ball and trying to calculate the probability

, I found when , this probability decays to zero exponentially fast with n, and if , it goes to one. Only if , there is a nontrivial probability around 0.1576 and 0.1675 asymptotically. I couldn’t get a good interpretation of what is happening when dimension is high and why is special. It looks like the surface area of the unit ball is becoming more and more concentrated around the equator when n grows big. I’m interested to know more about its relationship to the law of large numbers and more generally about high dimensional geometry. It will be great if you could point me to any reference or explain briefly. Thanks so much!!

24 April, 2007 at 3:27 pm

Denis CharlesNice article! You are right about the gulf between the worst-case hardness and average case hardness of a problem. To my knowledge the widest such gulf (for a natural problem) occurs for the computation of Groebner bases. This problem is known to be EXPSPACE-hard in the worst case (best time bounds would be doubly exponential), whereas (conjecturally) on average the problem is polynomial time. The conjecture being a conjecture of Bayer on the Mumford regularity of ideals. I would be interested if anyone knows of a wider gulf for a natural problem. It is easy to create artificial problems where the worst case is at least as hard as the Halting problem and the average case is polytime by a technique called padding but that is uninteresting.

On the other hand there are natural problems where the average case hardness and worst case hardness are believed to coincide. A famous instance being approximating the shortest non-zero lattice vector in L_2 norm.

25 April, 2007 at 9:06 am

Terence TaoDear Andy,

If you fix one of the vectors to be, say, , and use Euler angles, one soon sees that has a probability distribution of for some constant , and so by changing variables has a probability distribution of . Note that this is a constant measure for n=3; this is a theorem of Archimedes, who liked it so much he engraved it on his tombstone. (It is also related to symplectic geometry, toric varieties, and moment mappings, but that’s another story.) When n is large, this measure concentrates in the region , and in this region we can approximate . This means that asymptotically, is normalised around the origin with a variance of . If you play around with the erf function that probably gives you a close match to your numerical results.

The connection with the central limit theorem arises when instead of considering the cosine of the angle between two unit vectors, one considers the closely related problem of computing the dot product between two independent normalised Gaussian vectors and , where the are iid standard Gaussians of mean zero and variance 1. The CLT easily shows that and are almost normally distributed with mean 1 and variance , while is almost normally distributed with mean zero and variance .

25 April, 2007 at 1:11 pm

chris forresterIndeed, i did much studying as to pidgeons and holes and the impossibility of arbitrary compression of information.

however, to expand that specific argument entirely, one would have to say that there could be no meta-algorithms that are composed of multiple ways of encoding the information structures. literally, that there would be a balance of informational structures that couldn’t be compressed by any algorithm, whatsoever.

this, i’ve found, is not a true precept, since we have something that can convert information from one form to another form, losslessly.

in such a manner, by utilizing the abstraction of the mathematical base switch, to provide an alternative look at the information without needing to encode an entirely new method of observation (re: algorithm), we find a meta-construct already there in the perspective of how to view the information at all.

my experiment runs like so:

Taking F as the arbitrary data, somewhere in the hundreds of digital base 2 bits.

Since the information is already digitally encoded, it forms a digital wave when you take a look at from the vantage point of a specific base, lets call that E. There is then A, the number of elements involved which form the digital waveshape itself.

This number of elements has associated with it its own limitations, which are namely the number of integer bases that produce that many elements from the arbitrary information given.

Using base 2 for E is in fact the worse case scenario, since the next step involves a sort of skimming over the digital waveform.

That is, to percieve the digital signature as a waveform, one finds that for a specific F, there are multiple E’s, a set Edelta, to percieve F which results in A digits.

These A digits in bases Edelta, naturally, totally compose F, with zero loss.

Now, there is an offset, O, which is the same for all A digits that are in bases Edelta. Above that offset, for all A digits, is information in base D.

At some point, for a specific Efinal and Afinal, there is a waveform D which, combined with the offset O, the information to specify E and the information to specify A, is within ‘compression’ range.

For lower density F information, this is not that useful. However, the entire algorithm can be run *per binary digit*, which is where the F from Cloud C comes into play. By keeping the entirety of the information that you would want to compress, say an mp3 file, in cloud C, and always moving one single bit at a time to F, then running the algorithm on F, one always finds a way to compress C.

This of course hinges upon the fact that we have more than we originally thought we had:

One: that the cloud C, and thusly F, is of suitable length. this is a result of the offset O needing a specific number of bits to encode in the first place. Thus, A must be 3 or higher, and since it is Edelta being checked, this needs to have a suitable number of opportunities to be able to reexpress the information in F.

Result: yes. information is not infinitely compressible, since at some point, base 2 becomes the most efficient way of perceiving the information. however, this is only a few hundred bits or less. for *anything* over this, there is a guaranteed ‘compression’ by means of digital wave re-expression methodology.

Thusly, the pidgeons and holes theory does not pan out, since we already have something in the informational system itself that allows for lossless reorientation of the information. And, we can choose between how the information is represented on a case-by-case basis, which reveals, as the possibilities increase, an increasing number of ways to choose between the representations.

The reason i come up to the P and NP thing with this is that checking per binary bit, all ranges of A, Edelta, is definitely deterministic. To say when the actual D + O + E + A is, bit-wise, less than F, however is non-deterministic. All that can be proved is to say that, eventually, it will happen. At this point, D + O + E + A can even be moved back into C with an ‘expand fold’ marker.

Thus, folding information, in terms of spacial harmonics.

To summarize:

Re-expression utilizing very high bases to recompose the information into different formats.

Sorting through those recompositions to find minimal change of relative heights, as percieved in two dimensions.

Encoding the change in heights, the offset, the bases of both the height changes and the offset plus the height changes, and the number of digits utilized.

In terms of algorithms, since it’s never attempted to fold the non-re-expressional informational spectrum, and guaranteed to always find foldable structures in the larger spectrum, it reveals an infinite compression mechanism, of higher density information.

27 April, 2007 at 9:04 am

Mark MeckesAndy:

The best-known application in high-dimensional geometry of the phenomenon you’ve observed is Dvoretzky’s theorem (in a given proof by Vitali Milman in the early 1970s). Possibly the most accessible reference for this is Lectures 7-9 of K. Ball, “An elementary introduction to modern convex geometry”, in this volume.

27 April, 2007 at 7:28 pm

KalebergThis approach reminds me of early X ray astronomy sensors, before they developed grazing lenses. They used a sheet of X ray blocking metal with a random pattern of holes punched in it and rotated the sheet while collecting a single pixel of data, the overall intensity. I think they could also move the sheet further and closer to the single sensor in a sort of “trombone” arrangement. I never worked with this, but I had a roommate who worked on the SAS-C team, and he tried explaining how the sensor, or perhaps an improved sensor, might work.

I’m not sure of the mathematics of this approach, since their “wavelet basis” was rather simple, and not completely random. Still, they did get the sensor count down to one, they traded off satellite based computing for ground computing, and they did get some neat images, included Cygnus X-1.

28 April, 2007 at 1:48 pm

Andres Corrada-EmmanuelThank you for this introduction to compressed sensing. I want to point out that the statement on the applicability of this technique to sensor networks needs to be tempered by an actual comparison between the power requirements of on-board processing versus transmission of data. For some sensor networks the greater cost is transmission. This will increasingly become so in the future since network bandwidths are physically limited or turned off by users that want ‘information’ not sensor data. In cases like that, it would pay to process data on-board and then relay extracted information rather than raw data itself. For example, an aerial mapping appliance would process all aerial images on-board and relay a map of the area with a ortho-rectified image on top. This would give a user all the information they require to do an impressive number of tasks and would have a compression ratio running in the 1000:1 range.

28 April, 2007 at 9:48 pm

Terence TaoDear Andres,

Thanks for your comments! Compressed sensing actually has comparable transmission costs to traditional processed sensing methods; I think you might have been comparing processed sensing with

non-compressednon-processed sensing, which indeed would be inefficient on many levels.Suppose for instance that a traditional sensor in a sensor network collects 1,000,000 bytes of data, processes the data on-board, and extracts 1,000 bytes to transmit, giving the 1000:1 compression ratio you mention. I agree that the non-compressed non-processed sensing alternative of simply transmitting all 1,000,000 bytes collected without any on-board processing would be expensive and inefficient. However, this is not what compressed sensing does; instead of collecting so much data, it would only collect, say, 3,000 bytes of measurements to transmit, and computers at the user end will extract the 1,000 bytes of processed data they need from the 3,000 transmitted bytes via one of the known data recovery algorithms for compressed sensing. Thus the data transmission costs in this case are comparable to a traditional sensor, but the data collection and processing costs are much lower. (It may seem in this example that the compressed sensing method requires three times as much data transmitted, but actually the gap can be a bit lower in noisy environments, because the traditional sensor will need to lengthen (and possibly retransmit) the data to enable error detection or correction, whereas the compressed sensing data will not need any such lengthening, as there is already significant redundancy contained within the transmitted data, and the recovery algorithms are robust with respect to loss or corruption of a small fraction of this data.)

30 April, 2007 at 9:03 pm

Sacha“It’s an interesting phenomenon to me that so many successful people have an unusually broad range of interests, and somehow find the time to pursue them. I believe this may be a help, rather than a hindrance, since having broad interests is helpful both in finding good research topics and in solving them. (I might worry too, if this blog got filled up with the sort of post we see here, but that doesn’t seem likely.)”

Of course it’s great to have a broad range of interests – it helps keep one’s mind active and alert!

Why would anyone be concerned if Terry started filling his blog with non-technical or non-mathematically related work? That sounds snobbish to me.

1 May, 2007 at 3:40 pm

AndyI don’t want to take too much space here, but really want to thank Terry and Mark for their comment and help!

21 May, 2007 at 8:32 pm

GovindThank you Prof. Tao for the blog. It did help me to understand the bascis easily so that now I can attack the real leterature.

6 June, 2007 at 8:36 pm

qlingDear Prof. Tao,

Thanks for your excellent story here. I am studying compressed sensing now and encountering some problem. My focus is on the required sampling size for random Fourier measurement ensemble to recover the sparse signal with l-1 minimization. [1] gave an estimation K=O(M(logN)^6), where K is the sampling size, M is the sparsity of the signal, and N is the dimension of the signal. [2] provided another estimation K=O(MlogN) by assuming a Bernoulli model. In fact, I am not sure if it is a universal conclusion. [3] introduced the result in [2] by saying:

‘The conjectured optimal estimate would be K=O(MlogN), which is known to hold for nonuniversal measurements, i.e. for one sparse signal f and for a random set \Omega (the cardinality of \Omega is K).’

But this comment makes me even more confused … In [4], the authors directly pointed out K=O(MlogN) without giving any further explanation. So I hope you can help me to clarify this question.

Thanks for your time!

[1] Candes E, Tao T: Near-optimal signal recovery from random projections: universal encoding strategies

[2] Candes E, Romberg J, Tao T: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information

[3] Rudelson M, Vershynin R: Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements

[4] Candes E, Romberg J: Practical signal recovery from random projections

7 June, 2007 at 9:44 am

Terence TaoDear Qing Ling,

There is a distinction between uniform recovery and non-uniform

recovery. A set of frequencies Omega can uniformly recover sparse

signals if it can recover

allsignals of a given sparsity. If itcan only recover

almost allsignals of a given sparsity (i.e. it canrecover randomly chosen sparse signals with high probability) then we

only have non-uniform recovery.

The results of [1] and [3] give uniform recovery, while [2] gives

non-uniform recovery, so these two types of results are not directly

comparable.

See also my page

http://www.math.ucla.edu/~tao/preprints/sparse.html

for some further discussion.

2 July, 2007 at 10:50 am

Open question: deterministic UUP matrices « What’s new[...] July 2nd, 2007 in question This problem in compressed sensing is an example of a derandomisation problem: take an object which, currently, can only be [...]

3 July, 2007 at 6:15 am

What is angular momentum (in quantum physics)? « As I know…[...] (in quantum physics)? Posted by hoahoangphuong under Quantum physics This problem in compressed sensing is an example of a derandomisation problem: take an object which, currently, can only be [...]

10 July, 2007 at 11:52 am

Simfish / InquilineKea’s Blog[...] April 13th, 2007 at 10:33 pm [...]

18 July, 2007 at 11:45 pm

bhauthPhysical application of desired transforms would be as effective as digital application. “Similar” transforms would be useful in many cases as well. Collected (complete) data must also often be transformed to be useful. (synthetic aperture radar, CAT scans, GRENOUILLE, etc) However, if the measurements are of *random* transforms of the data, then they bear no special relation to the desired compressed data. That means that every piece of the data can be recovered equally well (statistically speaking) as the “desired” pieces. Information theory tells us we cannot get something for nothing. The low correlations mean data will be degraded, as it has finite precision. Information content of a set of numbers is equal to the effective number of digits of them. How is compressed sensing more than a randomly chosen trade of number quality for number quantity?

19 July, 2007 at 12:10 am

bhauthIf the idea is that many single pixel cameras are fitted with masks that approximate wavelet transforms, then

Practically speaking, this is less effective than an equivalent cost in standard cameras.

You can’t do wavelet transforms without subtraction, which needs coherent light.

The compression part of image compression is not done by wavelets. The wavelets concentrate smoothness that can then be compressed by other means taking advantage of that smoothness. There is no way to bring the smoothness together in a single pixel camera, so if you could do the wavelet transforms with masks, and the cameras were cheap enough to be worthwhile, you would still be saving no bits unless the cameras communicate before transmission.

19 July, 2007 at 7:06 am

Terence TaoDear bhauth,

From an information-theoretic perspective, the reason why compressed sensing is a “win” is that the signals one is studying (e.g. images) have special structure which reduces their entropy be significantly less than the number of bits used to store them (before compression). For instance, in the imaging example given above, an image needs 2 million bytes to store it (in uncompressed form), but the structure of the image is such that only 100,000 bytes of information are needed to reconstruct the image to within acceptable levels of error. Because of this, it is information-theoretically possible to make, say, 300,000 random measurements and not lose any further signal quality.

You can find more about the single-pixel camera at the Rice web page:

http://www.dsp.ece.rice.edu/cs/cscamera/

The masks are not simulating wavelet transforms; instead, they are simulating an inner product with a random 0-1 vector. No subtraction is required.

19 July, 2007 at 3:33 pm

bhauthWell, you should be cautious when telling a Fields medalist something is unworkable, but video compression is my specialty and my understanding is that while linear transforms can compress images in special cases with carefully chosen (*non* random) bases, they are not useful for general purpose image compression. In practical terms that would mean that while you could reconstruct images from your single pixel camera they would statistically be no better than images from a standard camera of resolution equal to the number of measurements. As far as I can tell, that’s the case with the results on that page.

19 July, 2007 at 7:10 pm

Terence TaoDear bhauth,

I think you are confusing compressed sensing with the more traditional thresholded transform coding which appears for instance in the JPEG-2000 protocol. There, one takes many wavelet coefficients but then only retains the largest ones by applying a thresholding to obtain the compressed output. In compressed sensing, one bypasses the thresholding step and makes the compressed measurements directly, using random measurements rather than wavelet ones. I agree that it is unintuitive, but the random linear measurements do in fact capture almost as much information as a comparable number of carefully chosen (thresholded) wavelet measurements. From an information-theoretic perspective it is not so surprising; an image which can be compressed down to a 100K file with acceptable losses in quality has roughly 100K bytes of entropy and thus should, in principle, be recoverable (with comparable quality) from slightly more than 100K random measurements. The remarkable thing is that (as long as one over-samples by a factor of three or so) that this recovery can be achieved in a practical amount of computer time.

Some examples of this applied to actual images can be seen at

http://www.acm.caltech.edu/l1magic/examples.html

11 August, 2009 at 9:58 pm

AJI really wonder if CS would work for stationary sources. This could be the key to link entropy with sparsity. Images are non stationary.

19 July, 2007 at 9:13 pm

bhauthA minimum-norm algorithm? That is something quite different from what I understood you to be saying in this post. I’ll think about that for a bit. Computational feasibility isn’t so surprising, though.

19 July, 2007 at 10:19 pm

bhauthOK, the norm you’re using isn’t the optimal computationally feasible norm for images in general. I’ll think about what better ones would be some more when I’m less busy.

29 July, 2007 at 3:58 am

nicolasI am very thrilled to discover that you are interested in Information theory problems. being a mathematician by education I have way too often been put off by professionals who like theories not for their power, but for their complexity.

This is not the case for Information theory, which I view as one of the biggest problems around. It is transverse, potent, and critical in many other areas of science.

Interestingly I tried to develop sensor compressing on my own in 1994 when I started engineering school, but did not have the necessary math skills to develop it. Also, I had few ways to reach the math literature we now have on the internet, and even fewer ways to connect with like-minded individuals.

Now a Fields medalist has a blog, and I find, serendipitously, that he works on a subject that interests me….

Nicolas

4 September, 2007 at 1:47 am

rohitj’s Research » Compressed Sensing in Data Streams[...] Compressed sensing has emerged as a field of great interest in past very few years, and the Rice’s one pixel camera has surely proven to be a great application. Let me try to explain what compressed sensing [...]

5 September, 2007 at 4:02 pm

Stephen DownesGreat article on a subject I never even knew existed (and I am an avid digital photographer). Posts like this are a real service, and help people like me understand (to some degree of depth) the issues that lie behind the things that interest us.

26 September, 2007 at 4:01 am

John SidlesAll of these comments apply to quantum model order reduction (QMOR) too. To translate “The space of all interesting [quantum wave functions] is much smaller, and can be stored using much less space, especially if one is willing to throw away some of the [high-order correlations of the wave function]. Indeed, if one generates [a wave function] at random, one will almost certainly not get an interesting [quantum wave function].”

3 October, 2007 at 12:57 pm

skibumHello Professor Tao.

I was wondering if you might be willing to explain to me why in the Decoding by Linear Programming paper you must restrict attention to the c_j coeffs. being real?

I haven’t completely internalized all of the nitty-gritty detail of the proof yet, but wouldn’t this be quite restrictive in that any signals of interest (if using the DFT as the orthonormal basis with respect to which the signal is sparse) would have to be zero-phase? Or is it that the coefficients must be real in order to recast the l1-min problem as a linear program? (I believe the Candes tutorial says some such about complex being an SOCP program).

I’m just an MS student who has yet to take a course on optimization so perhaps this is where I’m missing something, but if you could provide a brief answer to this question, it’d be greatly appreciated.

Thanks!

3 October, 2007 at 2:31 pm

AdamDear Skibum: “The devil is in the details” – “Random House Dictionary

of popular proverbs and sayings (hardcover)” by Gregory Y. Titelman

(Random House Reference; 1st edition 03.05.1996).

3 October, 2007 at 3:08 pm

Terence TaoDear Skibum,

The results in my paper with Emmanuel can be largely extended to complex-valued signals with only minor changes (many constants may need to be adjusted by a factor of two, in particular). One “cheap” way to do this is simply to replace an n-dimensional complex state space with a 2n-dimensional real state space by breaking everything up into real and imaginary parts. Though, as you point out, l^1 minimisation in the resulting 2n-dimensional space is not quite the same as l^1 minimisation in the original complex n-dimensional space, because the complex numbers are usually measured in l^2 norm rather than l^1 norm. But either type of minimisation should give the same type of results. (Minimising complex l^1, with the complex numbers measured in l^2 norm, is not a particularly clean linear program, though still fairly feasible; for computational purposes it is probably better to measure complex numbers in l^1 norm, even if this is more artificial.)

1 November, 2007 at 6:17 am

CharlieThis was readable for someone with no background in math or “imaging.” Thanks for the simplification, which is very necessary “to see work arising from pure mathematics … end up having potential application to the real world.” Working at a psychiatry neuroimaging lab, I am a huge fan of any research moving toward increasing temporal and/or spatial resolution for MRI. Plus, ‘maths’ are apparently interesting too.

Thanks!

18 January, 2008 at 10:24 am

Smartest Person Blogging ? « Evil Fish[...] as an example. And how to build one pixel camera as another. The blog is at [...]

10 February, 2008 at 3:01 am

exdjDear Folks,

I think it’s wonderful that Terry managed to find time to give me a hint towards the solution of a homework problem found in his “problem solving ” course designed primarily top get UCLA students ready for the Putnam. The problem: Suppose all points in R^2 were colored , say, red white or blue.Prove that it is always possible to find to points one inch apart with the same color. Problems of this type,usually use proof by contradiction, as this one did.

17 July, 2008 at 8:38 am

Infant on compressed sensing « On the learning curve…[...] to share his thoughts to the world, in spite of being busy with his work and research. In this blog entry I came across, he describe the problem of compressed sensing with a very very simple example. He [...]

30 July, 2008 at 7:05 pm

AdamHi Prof Tao,

I have a few questions for you, if you don’t mind answering:

1) With respect to “Decoding by Linear Programming,” isn’t the decoding application somewhat, with all due respect, limited? In particular, since you are operating on the field of real numbers (whereas codes are typically binary, etc.). Has any work been done on extending your results to codes that are in a more realistic system?

I’m no coding theorist, but just curious. It seems to me that perhaps the general problem of compressing and encoding in one step (as opposed to the usual transforming scheme) is a better application for CS (as you two have mentioned), along with general ill-posed inverse problems.

2) Also, I am a little confused as to what the best bounds are for the restricted isometry constants. In particular, your “Decoding by LP” paper showed that:

delta_s + delta_2s + delta_3s < 1 is sufficient (and actually tighter than that if you use the cosine of the angle between subspaces constants … sorry forgot what these are called), but Candes has shown in other papers which he claims have improved the bounds a bit (see http://www.acm.caltech.edu/~emmanuel/papers/RIP.pdf) that delta_2s < sqrt(2) – 1 is sufficient. Obviously the restricted isometry constants are monotonically increasing, but I’m not quite sure from first glance how delta_2s < ~.4142 is necessarily better than the work in Decoding by LP.

3) Finally, I really like the approach of decoupling the problem such that the restricted isometry constants to imply the equivalence of the l0 relaxation. That said, verifying the RIP for any given matrix is also NP-hard, correct? Thus, the work thus far has focused on random constructions: gaussian ensembles, binary ensembles, randomly selecting fourier, noiselet, etc. coefficients that are incoherent with the sparsifying basis. The problem subsequently becomes probabilistic in nature; there’s no guarantee of recovery, but just a high probability thereof. That said, my question is: has any work been done on deterministic ensembles that obey the RIP?

4) Donoho’s work, it seems, has focused largely on higher dimensional geometry. I must admit most of that is fairly over my head, lol. Are the RIP coefficients at all related to what he has been doing?

Anyway, sorry for all of the questions; just curious. I’m sure you’re very busy, but if you have a moment and wouldn’t mind enlightening all of us, it’d be extremely appreciated. Thanks…

- Adam

30 July, 2008 at 7:36 pm

AdamAnd with all that typing, I just saw your other page on deterministic matrices. Sorry for question (3)!!!

I think you might be right that the approach of developing a (fast) algorithm that can verify the RIP would be the most promising. In particular, at the very least it would provide a quick method to justify using l1 for inverse problems in which there is known to be a sparse representation for the signal class of interest…

30 July, 2008 at 8:01 pm

Terence TaoDear Adam,

Regarding (1), it is true that all known CS-based methods right now are based over the reals, and so do not work so well for binary codes (though one could in principle imagine deploying a CS-based linear code over a network such as the internet, as each internet packet could store a high-precision real number; it does not seem though that this is particularly superior to existing transmission protocols, though). Interestingly, though, Feldman has some independent work in which he decodes some binary codes by relaxing the decoding problem into a linear programming problem (accepting some pseudo-codewords in the process) and solving that instead. There is also a certain amount of followup literature on CS-based linear codes, that can for instance be found at the compressedsensing.com website.

Regarding (2), the two conditions are not directly comparable, but I think when one compares the conditions with experimental or theoretical estimates on the isometry constants, Emmanuel’s condition comes out slightly ahead (and I think it has since been sharpened further by Cai, Xu, and Zhang). There seems to be a reasonably sharp threshold behaviour with the isometry constants – they are usually either very small or very large, with not much of a transition in between – and so getting the 3S down to a 2S in principle extends the range of S for which CS works by about 50%, which tends to be enough of a cushion to absorb losses elsewhere.

Donoho’s work in (4) uses the special properties of the gaussian ensemble to obtain slightly sharper results than what one can get from the RIP, which is a general-purpose axiom that covers more ensembles but has a slightly sub-optimal efficiency. (There is a paper of Candes, Rudelson, Vershynin, and myself in which we compare the two approaches; there are also several other papers in the literature obtaining more precise results in the gaussian case, including a quite sharp threshold for when l^1 minimisation breaks down.)

11 August, 2009 at 9:50 pm

AJThis is only for response to coding theory.

Even though binary codes have been well studied. They are of limited use. Codes have to be carved from lattices in real world.

One can life the binary codes (or finite field codes) to complex numbers. It may no longer be non-linear. An important problem is to lift the codes so that

1) they are linear

2) if the hamming distance is small b/w the finite field codewords, then their euclidean distance have to be larger. This strategy comes close to achieving the best possoble performance. It is extremely hard to do with finite field codes, but an elementary exercise in function field codes (convolutional codes)

30 July, 2008 at 10:28 pm

AdamWow thanks for the quick reply! I’ll check out that Candes/Rudelson/Vershynin/Yourself paper…

If I might ask a couple more questions which I’ve been contemplating….

I’m thinking about inverse problems where the dimensionality is quite high (as in an image, for instance). Merely storing the Gaussian matrix becomes near impossible for, say, a 512×512 image. Computationally it’s unwieldy too. One could much more efficiently store the Bernoulli ensemble, but still AFAIK there’s no fast algorithm for applying it.

I presume this is why the incoherent measurements of the noiselet-haar (or other db filters) system are of interest. That said, instead of as a guarantee for exact reconstruction, it’s . So two questions really,

1) I thought I read something about the latter requirement being overly pessimistic on the lower bound for m. What’s the current guess for that power on the log? I thought you had it at 6 in your initial Fourier paper, and then it lowered to 4….

2) Let’s say that 4 is as low as we can go. I saw that Baraniuk had an ICASSP paper effectively on Toeplitz matrices with iid Gaussian entries, and thus FFT could be used, but the paper didn’t seem to contain any proofs… just sort of “hey this works.” Out of curiousity, do you know if anyone has rigorously analyzed the Toeplitz gaussian ensemble? Seems like an interesting topic to see where the lower bound for m lies if we reduce the DOF of the matrix substantially as in a Toeplitz structure…

30 July, 2008 at 10:35 pm

AdamPoint of clarification: I’m looking at a problem where I have , where the T is a convolution operator. I meant to say my gaussian matrix would be 512^2 by 512^2, roughly….. hence I can’t store it.

10 October, 2008 at 12:26 pm

Small samples, and the margin of error « What’s new[...] to sample the entire population to get a good reading. (The same general philosophy underlies compressed sensing, but that’s another [...]

15 October, 2008 at 1:35 am

JatkeshaDear Prof. Tao,

Thanks for the wonderful expository article on compressed sensing. It was really helpful for someone with limited mathematical background to understand what compressed sensing is all about.

I do have a query though. I would be glad if you can answer it for me.

Doesn’t compressed sensing help to solve problems in inverse theory? Most of the problems in inverse theory involve a large amount of uncertainty because of a large amount of noise as well as linearized methods especially in Geophysics. Does compressed sensing help to solve linear and non-linear inverse problems?

Apologies if the query sounds too trivial.

Thanks and warm regards.

-Jatkesha

15 October, 2008 at 10:53 am

Terence TaoDear Jatkesha,

Compressed sensing is still largely restricted to linear problems at present, and requires the matrix relating the unknown variables to the actual measurements to be completely known. This is the situation for instance in image recovery from incomplete measurements; I’m not sure if this qualifies as an inverse problem though. There are one or two papers on in situ compressed sensing at http://www.compressedsensing.com/ which may be slightly related to your query.

30 December, 2008 at 10:53 am

JoshProf. Tao,

Thank you for this explanation, it continues to be very helpful. After perusing the literature, I have not been able to discern how the aliasing problem is addressed. Could you please provide a brief explanation or a pointer to a reasonably accessible paper?

Thank you,

Josh

10 April, 2009 at 3:20 pm

liaHi Prof. Tao!

In regard to your comment, “Many astronomical phenomena (e.g. pulsars) have various frequency oscillation behaviours which make them very sparse or compressible in the frequency domain,” I tried to find a reference on this, but couldn’t manage to do so. Do you know of a good reference?

Thanks!

Lia

7 May, 2009 at 4:29 am

Compressed Sampling « Research Jottings[...] Terrence Tao has a really good description of CS on his blog. Check it [...]

1 July, 2009 at 7:04 am

Igor CarronAdam,

You might be interested in Circulant and Toeplitz Matrices in Compressed Sensing by Holger Rauhut

In general, when it comes to finding a specific sub-subject within compressed sensing, you might want to do a search on the nuit blanche blog

The search box is on the top left corner.

Igor.

11 August, 2009 at 6:36 pm

AJSparsity implies low entropy. But Low entropy does not imply sparsity. To information theorists entropy is object of interest.

CS is a curious tool in signal processing. But it is unclear whether there is a generalization of CS to make it related to entropy of sources. If so this generalization of CS would be a powerful tool in information theory.

1 October, 2009 at 7:47 pm

AJProfessor Tao:

You have a great blog. I have a question on compressed sensing. In CS one takes r measurements and tries to find a vector of length n which has considerably fewer non-zeros than r.

My question is the following. One can imagine several scenarios in which one may not be interested in the entire vector of length n. But say a subset of size m of the vector. In this scenario, does the CS setting still hold. Should one still take r measurements? m could in fact be smaller than r.

Do you have any leads on this problem?

Thank you.

3 October, 2009 at 10:47 pm

Terence TaoReducing the number of measurements one is interested in may possibly reduce the run time of the recovery algorithm, but does not reduce the number of measurements one needs to reconstruct the data.

Imagine for instance that one could reconstruct any m of the n components of the data vector from r given measurements. By dividing the n components into about n/m groups of m each, and applying one’s reconstruction algorithm to each group, one sees that one can then reconstruct the entire data vector from the same initial set of r given measurements (but possibly with the run time of the reconstruction increased by a factor of O(n/m)). So one cannot gain any reduction in the number r of measurements this way (except in the trivial case m=0, of course).

4 October, 2009 at 1:17 pm

AJProfessor:

Conceptually your answer makes sense. However, the reason I have specifically asked this question is because of the existence of locally decodable codes and the applicability of CS technique to coding theory.

In coding theory, y = Gx +n is received vector where G is the generator, x is message, n is noise.

Typically one has to process all bits of y to get reliable reconstruction of x.

However LDC offers the following flexibility:

Suppose only bit location i of x has to be be found, a q query LDC needs to access only q locations of y.

Existence of good LDCs have implications in complexity theory. Since CS offers good decoding ability, I was wondering if bounds in CS could be transferred to bounds in LDCs.

http://research.microsoft.com/en-us/um/people/yekhanin/papers/nice_pir_limits.pdf

However known LDCs are non-random. However it might still not be a problem in transferring CS techniques because of the following result:

http://www2.computer.org/portal/web/csdl/doi/10.1109/FOCS.2007.8

I am working along this line of reasoning. I was wondering if my ideas make sense to you and whether it would be worth proceeding in this route.

Thank you,

sincerely,

10 October, 2009 at 12:17 am

Igor CarronAJ,

You might be interested in the latest entry on Nuit Blanche.

http://nuit-blanche.blogspot.com/2009/10/cs-lp-decoding-meets-lp-decoding.html

Alex Dimakis and Pascal Vontobel have a new paper entitled: LP Decoding meets LP Decoding: A Connection between Channel Coding and Compressed Sensing

There is also a mention of another paper by Dongning Guo, Dror Baron, and Shlomo Shamai entitled A Single-letter Characterization of Optimal Noisy Compressed Sensing which uses some of these sparse matrices found in LDPC.

Cheers,

Igor.

10 October, 2009 at 2:47 pm

AJHi Igor:

Thank you. It looks like CS techniques might be useful in complexity theory as well due to connection between random sparse codes and local decoding. But I do not know how one might transfer CS techniques in sparse codes to this setting.

AJ

11 October, 2009 at 8:56 pm

AJProfessor Tao:

Is there an information theoretic meaning to sparseness?

It seems to be that from the following two sequences the correspondence might be tricky.

A=1111111111…….

B=10100100001000000001……….

It seems to me that B somehow carries more information that A even though B is sparse.

Thank you.

1 July, 2010 at 5:32 am

AnonymousHere, A is more sparse than B in an appropriate basis. So it is generally true that a sparse signal carries less information than a less sparse one.

15 December, 2009 at 10:15 pm

科学松鼠会 » 陶哲轩：长大的神童[...] （顺便提一句，由于陶哲轩和很多别的数学家的介入，压缩感知这个领域已经在这一两年来成为应用数学里最热门的领域之一，吸引了人们极大的注意。陶哲轩本人在2007年写过一篇极好的关于这个领域的普及性文章，松鼠会大概会在将来推出这篇文章的中文翻译，敬请期待。） [...]

24 February, 2010 at 9:02 pm

Mini Searches with Answers - Answer My Searches[...] Compressed sensing and single-pixel cameras « What’s newHow digital cameras work [...]

7 March, 2010 at 7:42 pm

Filling In the Blanks « Rich's Random Walks[...] to construct the most plausible “original”. (A more technical description is given in this blog post by Terence Tao, one of the pioneers of the [...]

7 June, 2010 at 8:06 pm

AnonymousProfessor Tao,

Is there a way that we can correlate sparsity or compressibility of the signal to its statistical property such as mean or variance etc.. This is important since, it is not supported on the node to compute number of significant coefficients (K) to compute number of measurements M=cKlog(N/K) . If you could model K=f(V) where V is the variance of the signal then we could determine M=cf(V)log(N/f(V)).

regards

Anonymous

23 June, 2010 at 8:05 am

AnonymousHello

I need help.

Can someone refine a very bad video?

I heard of this method of treatment of photos, but I do not know who to contact.

Thank you very much

24 February, 2011 at 7:56 am

Compressed sensing, compressed MDS, compressed clustering, and my talk tomorrow « Quomodocumque[...] World’s most hurried introduction to compressed sensing (see Terry’s blog for a more thorough account.) [...]

21 July, 2011 at 9:54 am

Compressed Sensing: What is compressed sensing (compressive sampling) in layman's terms? - Quora[...] Topics or PeopleComment Seb Paquet, University prof Terry Tao has a great answer here: http://terrytao.wordpress.com/20… (more) Sign up for free to read the full text. Login if you already have an account.This answer [...]

15 December, 2011 at 9:08 pm

Thought this was cool: 陶哲轩：压缩感知与单像素照相机 « CWYAlpha[...] 科学松鼠会上关于Terence Tao的Compressed sensing and single-pixel cameras的译文《压缩感知与单像素照相机》。 [...]

16 December, 2011 at 12:10 am

Compressed sensing | Quantum Dot[...] Professor Terence Tao: Compressed sensing and single-pixel cameras [...]

20 June, 2012 at 4:56 am

Dave WlsonThank you for posting this Professor. I’ve been racking my head trying to figure out what this is about for ages. A great straightforward well explained article that was truly needed.

10 May, 2013 at 10:05 pm

Bird’s-eye views of Structure and Randomness (Series) | Abstract Art[…] Image compression basics (adapted from “Compressed sensing and single-pixel cameras“) […]

11 May, 2013 at 6:42 pm

压缩感知 | TechTrace[…] 本部分为陶哲轩在一篇blog中对于comprehensive sensing 文章翻译过来的精华部分，提取了重点，并加入了一些理解，这里用来解释CS的概念堪称经典。 […]

11 May, 2013 at 8:40 pm

Abstract Art[…] [1.2] Compressed sensing and single-pixel cameras […]

26 July, 2013 at 5:34 am

Compressive sensing and bioinformatics | Follow the Data[…] work in developing compressed sensing, has much better explanations than I can give, for instance this 2007 blog post that exemplifies the theory with the “single-pixel camera” and this PDF presentation […]

4 September, 2013 at 5:12 am

Scikit study | X Blog[…] 本部分为陶哲轩在一篇blog中对于comprehensive sensing 文章翻译过来的精华部分，提取了重点，并加入了一些理解，这里用来解释CS的概念堪称经典。 […]

8 November, 2013 at 9:23 am

12 interesting things about the 1-norm | The Regularized Singularity[…] you really want an introduction to compressed sensing read Terry Tao’s wonderful, wonderful post (http://terrytao.wordpress.com/2007/04/13/compressed-sensing-and-single-pixel-cameras/) on the single pixel camera. His math papers on related topics are pretty great […]

29 January, 2014 at 10:54 pm

preethi shreeyaHello Professor Tao,

Given: I have a beam fixed at one end and free at the other. It obviously results in vibrations due to the disturbances in the environment or when induced. This system can be represented using a differential equation. The vibrations are sensed using sensors, analyzed and reduced. Now, the sensing is done through compressed sensing technique.

My question is: How is it possible to find the optimal sensor placement (minimal number of sensors along with the details of it’s spatial position) just with the help of the Measurement matrix (that we design in compressed sensing for reconstruction purpose)?

Details: Suppose we want to retrieve x (vibrational info) from: y=cx, where c is the fat matrix (Measurement matrix * sparse matrix). I want to find a relationship that the measurement matrix can hold with the placement of sensors. It’s like an inverse problem, instead of designing a measurement matrix with respect to the sensors placed, I want to place the sensors with the knowledge I have about the measurement matrix.

31 January, 2014 at 10:41 am

(转)初识压缩感知Compressive Sensing | Dream's Secret Base[…] 本部分为陶哲轩在一篇blog中对于comprehensive sensing 文章翻译过来的精华部分，提取了重点，并加入了一些理解，这里用来解释CS的概念堪称经典。 […]