For much of last week I was in Leiden, Holland, giving one of the Ostrowski prize lectures at the annual meeting of the Netherlands mathematical congress. My talk was not on the subject of the prize (arithmetic progressions in primes), as this was covered by a talk of Ben Green there, but rather on a certain “uniform uncertainty principle” in Fourier analysis, and its relation to compressed sensing; this is work which is joint with Emmanuel Candes and also partly with Justin Romberg.

As mentioned in the previous post here, compressed sensing is a relatively new measurement paradigm which seeks to capture the “essential” aspects of a high-dimensional object using as few measurements as possible. There are many contexts in which compressed sensing is potentially useful, but for this talk (which is focussed on theory rather than applications) I will just consider a single toy model arising from Fourier analysis. Specifically, the object we seek to measure will be a complex function $f: {\Bbb Z}/{N \Bbb Z} \to {\Bbb C}$ on the cyclic group of N elements (or if you wish, a column vector of N complex numbers). In practice one should think of N as being of moderately large size, e.g. $N \sim 10^6$. Given such a function, one can form its (discrete) Fourier transform $\hat f: {\Bbb Z}/{N \Bbb Z} \to {\Bbb C}$; for this talk it will be convenient to normalise this Fourier transform as

$\hat f(\xi) = \frac{1}{\sqrt{N}} \sum_{x \in {\Bbb Z}/N{\Bbb Z}} f(x) e^{-2\pi i x\xi/N}.$

We suppose that we can measure some (but perhaps not all) of the Fourier coefficients of f, and ask whether we can reconstruct f from this information; the objective is to use as few Fourier coefficients as possible. More specifically, we fix a set $\Lambda \subset {\Bbb Z}/N{\Bbb Z}$ of “observable” frequencies, and pose the following two questions:

1. Let N be a known integer, let $f: {\Bbb Z}/N {\Bbb Z} \to {\Bbb C}$ be an unknown function, let $\Lambda \subset {\Bbb Z}/N{\Bbb Z}$ a known set of frequencies, and let $c_\xi = \hat f(\xi)$ be a sequence of known Fourier coefficients of f for all $\xi \in \Lambda$. Is it possible to reconstruct f uniquely from this information?
2. If so, what is a practical algorithm for finding f?

For instance, if $\Lambda$ is the whole set of frequencies, i.e. $\Lambda = {\Bbb Z}/N{\Bbb Z}$, then the answer to Q1 is “yes” (because the Fourier transform is injective), and an answer to Q2 is provided by the Fourier inversion formula

$f(x) = \frac{1}{\sqrt{N}} \sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} c_\xi e^{2\pi i x\xi}$

which can be computed quite quickly, for instance by using the fast Fourier transform.

Now we ask what happens when $\Lambda$ is a proper subset of ${\Bbb Z}/N{\Bbb Z}$. Then the answer to Q1, as stated above, is “no” (and so Q2 is moot). One can see this abstractly by a degrees-of-freedom argument: the space of all functions f on N points has N degrees of freedom, but we are only making $|\Lambda|$ measurements, thus leaving $N - |\Lambda|$ remaining degrees of freedom in the unknown function f. If $|\Lambda|$ is strictly less than N, then there are not enough measurements to pin down f precisely. More concretely, we can easily use the Fourier inversion formula to concoct a function f which is not identically zero, but whose Fourier transform vanishes on $\Lambda$ (e.g. consider a plane wave whose frequency lies outside of $\Lambda$). Such a function is indistinguishable from the zero function as far as the known measurements are concerned.

However, we can hope to recover unique solvability for this problem by making an additional hypothesis on the function f. There are many such hypotheses one could make, but for this toy problem we shall simply assume that f is sparse. Specifically, we fix an integer S between 1 and N, and say that a function f is S-sparse if f is non-zero in at most S places, or equivalently if the support $\hbox{supp}(f) := \{ x \in {\Bbb Z}/N{\Bbb Z}: f(x) \neq 0\}$ has cardinality less than or equal to S. We now ask the following modified versions of the above two questions:

1. Let S and N be known integers, let $f: {\Bbb Z}/N{\Bbb Z} \to {\Bbb C}$ be an unknown S-sparse function, let $\Lambda \subset {\Bbb Z}/N{\Bbb Z}$ a known set of frequencies, and let $c_\xi = \hat f(\xi)$ be a sequence of known Fourier coefficients of f for all $\xi \in \Lambda$. Is it possible to reconstruct f uniquely from this information?
2. If so, what is a practical algorithm for finding f?

Note that while we know how sparse f is, we are not given to know exactly where f is sparse – there are S positions out of the N total positions where f might be non-zero, but we do not know which S positions these are. The fact that the support is not known a priori is one of the key difficulties with this problem. Nevertheless, setting that problem aside for the moment, we see that f now has only S degrees of freedom instead of N, and so by repeating the previous analysis one might now hope that the answer to Q1 becomes yes as soon as $|\Lambda| \geq S$, i.e. one takes at least as many measurements as the sparsity of f.

Actually, one needs at least $|\Lambda| \geq 2S$ (if 2S is less than or equal to N), for the following reason. Suppose that $|\Lambda|$ was strictly less than 2S. Then the set of functions supported on {1,…,2S} has more degrees of freedom than are measured by the Fourier coefficients at $\Lambda$. By elementary linear algebra, this therefore means that there exists a function f supported on {1,…,2S} whose Fourier coefficients vanish on $\Lambda$, but is not identically zero. If we split $f = f_1 - f_2$ where $f_1$ is supported on {1,…,S} and $f_2$ is supported on {S+1,…,2S}, then we see that $f_1$ and $f_2$ are two distinct S-sparse functions whose Fourier transforms agreed on Lambda, thus contradicting unique recoverability of f.

One might hope that this necessary condition is close to being sharp, so that the answer to the modified Q1 is yes as soon as $|\Lambda|$ is larger than 2S. By modifying the arguments of the previous paragraph we see that Q1 fails if and only if there exists a non-zero 2S-sparse function whose Fourier transform vanished on all of $\Lambda$, but one can hope that this is not the case, because of the following heuristic:

Uncertainty principle (informal): If a function is sparse and not identically zero, then its Fourier transform should be non-sparse.

This type of principle is motivated by the Heisenberg uncertainty principle in physics; the size of the support of f is a proxy for the spatial uncertainty of f, whereas the size of the support of $\hat f$ is a proxy for the momentum uncertainty of f. There are a number of ways to make this principle precise. One standard one is

Discrete uncertainty principle: If f is not identically zero, then $|\hbox{supp}(f)| \times |\hbox{supp}(\hat f)| \geq N$.

Proof: Combine the Plancherel identity

$\displaystyle \sum_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|^2 = \sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} |\hat f(\xi)|^2$

with the elementary inequalities

$\displaystyle \sup_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)| \leq \frac{1}{\sqrt N} \sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} |\hat f(\xi)|$

$\displaystyle \sum_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|^2 \leq |\hbox{supp}(f)| (\sup_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|)^2$

and

$\sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} |\hat f(\xi)| \leq |\hbox{supp}(\hat f)|^{1/2} (\sum_{\xi \in {\Bbb Z}/N{\Bbb Z}} |\hat f(\xi)|^2)^{1/2}.$

$\Box$

By applying this principle we see that we obtain unique recoverability as soon as $|\Lambda| > N - N/2S$, but this is a rather lousy criterion – for most values of S, it means that one has to measure almost all of the Fourier coefficients to recover f! At that point one may as well measure all of the coefficients so that one can recover both sparse and non-sparse f easily via the Fourier inversion formula.

There are cases in which the condition $|\Lambda| > N - N/2S$ cannot be improved. A good example is provided by the Dirac comb, in which N is a square number, and f is the indicator function of the multiples $\{ 0, \sqrt{N}, 2\sqrt{N}, \ldots, N-\sqrt{N}\}$ of $\sqrt{N}$. As is well known, this function is its own Fourier transform. If we then set $\Lambda$ to be the complement of the multiples of $\sqrt{N}$, then we see that even though we are measuring almost all the frequencies – $N - \sqrt{N}$ of them to be precise – we cannot distinguish the $\sqrt{N}$-sparse Dirac comb f from the zero function; we have unluckily chosen only those frequencies $\Lambda$ which totally fail to intersect the support of the Fourier transform of f.

One can concoct several further counterexamples of this type, but they require various subgroups of ${\Bbb Z}/N{\Bbb Z}$ (such as the multiples of $\sqrt{N}$, when N is square). One can ask what happens when the ambient group has no non-trivial subgroups, i.e. when N is prime. Then things become better, thanks to a classical result of Chebotarev:

Chebotarev lemma: Let N be prime. Then every square minor of the Fourier matrix $(e^{2\pi i jk/N})_{ 1\leq j,k \leq N}$ is invertible.

This lemma has been rediscovered or reproved at least seven times in the literature (I myself rediscovered it once); there is a nice survey paper on Chebotarev’s work which summarises some of this. (A nice connection to the setting of this talk, pointed out to me by Hendrik Lenstra here at Leiden: Chebotarev’s lemma was originally conjectured to be true by Ostrowski.) As a quick corollary of this lemma, we obtain the following improvement to the uncertainty principle in the prime order case.

Uncertainty principle for cyclic groups of prime order: If N is prime and f is not identically zero, then $|\hbox{supp}(f)| + |\hbox{supp}(\hat f)| \geq N+1$.

From this one can quickly show that one does indeed obtain unique recoverability for S-sparse functions in cyclic groups of prime order whenever one has $|\Lambda| \geq 2S$, and that this condition is absolutely sharp. (There is a generalisation of the above uncertainty principle to composite N due to Meshulam, but the results are not as easy to state.)

This settles the (modified) Q1 posed above, at least for groups of prime order. But it does not settle Q2 – the question of exactly how one recovers f from the given data $N, S, \Lambda, (c_\xi)_{\xi \in \Lambda}$. One can consider a number of simple-minded strategies to recover f:

1. Brute force. If we knew precisely what the support $\hbox{supp}(f)$ of f was, we can use linear algebra methods to solve for f in terms of the coefficients $c_\xi$, since we have $|\Lambda|$ equations in S unknowns (and Chebotarev guarantees that this system has maximal rank). So we can simply exhaust all the possible combinations for $\hbox{supp}(f)$ (there are roughly $\binom{N}{S}$ of these) and apply linear algebra to each combination. This works, but is horribly computationally expensive, and is completely impractical once N and S are of any reasonable size, e.g. larger than 1000.
2. $l^0$ minimisation. Out of all the possible functions f which match the given data (i.e. $\hat f(\xi) = c_\xi$ for all $\xi \in \Lambda$), find the sparsest such solution, i.e. the solution which minimises the “$l^0$ norm” $\|f\|_{l^0} := \sum_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|^0 = |\hbox{supp}(f)|$. This works too, but is still impractical: the general problem of finding the sparsest solution to a linear system of equations contains the infamous subset sum decision problem as a special case (we’ll leave this as an exercise to the reader) and so this problem is NP-hard in general. (Note that this does not imply that the original problem Q1 is similarly NP-hard, because that problem involves a specific linear system, which turns out to be rather different from the specific linear system used to encode subset-sum.)
3. $l^2$ minimisation (i.e. the method of least squares). Out of all the possible functions f which match the given data, find the one of least energy, i.e. which minimises the $l^2$ norm $\|f\|_{l^2} := (\sum_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|^2)^{1/2}$. This method has the advantage (unlike 1 and 2) of being extremely easy to carry out; indeed, the minimiser is given explicitly by the formula $f(x) = \frac{1}{\sqrt{N}} \sum_{\xi \in \Lambda} c_\xi e^{2\pi i x \xi/N}$. Unfortunately, this minimiser not guaranteed at all to be S-sparse, and indeed the uncertainty principle suggests in fact that the $l^2$ minimiser will be highly non-sparse.

So we have two approaches to Q2 which work but are computationally infeasible, and one approach which is computationally feasible but doesn’t work. It turns out however that one can take a “best of both worlds” approach halfway between method 2 and method 3, namely:

$l^1$ minimisation (or basis pursuit): Out of all the possible functions f which match the given data, find the one of least energy, i.e. which minimises the $l^1$ norm $\|f\|_{l^1} := \sum_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|$.

The key difference between this minimisation problem and the $l^0$ problem is that the $l^1$ norm is convex, and so this minimisation problem is no longer NP-hard but can be solved in reasonable (though not utterly trivial) time by convex programming techniques such as the simplex method. So the method is computationally feasible; the only question is whether the method actually works to recover the original S-sparse function f.

Before we reveal the answer, we can at least give an informal geometric argument as to why $l^1$ minimisation is more likely to recover a sparse solution than $l^2$ minimisation. The set of all f whose Fourier coefficients match the observed data $c_\xi$ forms an affine subspace of the space of all functions. The $l^2$ minimiser can then be viewed geometrically by taking $l^2$ balls (i.e. Euclidean balls) centred at the origin, and gradually increasing the radius of the ball until the first point of contact with the affine subspace. In general, there is no reason to expect this point of contact to be sparse (i.e. to lie on a high-codimension coordinate subspace). If however we replace $l^2$ with $l^1$, then the Euclidean balls are replaced by octahedra, which are much “pointier” (especially in high dimensions) and whose corners lie on coordinate subspaces. So the point of first contact is now much more likely to be sparse. The idea of using $l^1$ as a “convex relaxation” of $l^0$ is a powerful one in applied mathematics; see for instance this article on the topic.

It turns out that if $\Lambda$ and f are structured in a perverse way, then basis pursuit does not work (and more generally, any algorithm to solve the problem is necessarily very unstable). We already saw the Dirac comb example, which relied on the composite nature of N. But even when N is prime, we can construct pseudo-Dirac comb examples which exhibit the problem: if f is for instance a discretised bump function adapted to an arithmetic progression such as $\{ -\lfloor \sqrt{N} \rfloor, \ldots, \lfloor\sqrt{N}\rfloor\}$, then elementary Fourier analysis reveals that the Fourier transform of f will be highly concentrated (though not completely supported) on a dual progression (which in the above example will also be basically $\{ -\lfloor \sqrt{N} \rfloor, \ldots, \lfloor\sqrt{N}\rfloor\}$, and have a rapidly decreasing tail away from this progression. (This is related to the well-known fact that the Fourier transform of a Schwartz function is again a Schwartz function.) If we pick $\Lambda$ to be far away from this progression – e.g. $\Lambda = \{ \lfloor N/3 \rfloor,\ldots,\lfloor 2N/3 \rfloor\}$, then the Fourier transform will be very small on $\Lambda$. As a consequence, while we know abstractly that exact reconstruction of f is possible if N is a large prime assuming infinite precision in the measurements, any presence of error (e.g. roundoff error) will mean that f is effectively indistinguishable from the zero function. In particular it is not hard to show that basis pursuit fails in general in this case.

The above counterexamples used very structured examples of sets of observed frequencies $\Lambda$, such as arithmetic progressions. On the other hand, it turns out, remarkably enough, that if instead one selects random sets of frequencies $\Lambda$ of some fixed size $|\Lambda|$ (thus choosing uniformly at arndom among all the $\binom{N}{|\Lambda|}$ possibilities), then things become much better. Intuitively, this is because all the counterexamples that obstruct solvability tend to have their Fourier transform supported in very structured sets, and the dichotomy between structure and randomness means that a random subset of ${\Bbb Z}/N{\Bbb Z}$ is likely to contain a proportional fraction of all structured sets. One specific manifestation of this is

Uniform Uncertainty Principle (UUP): If Lambda is a random set with $|\Lambda| \gg S \log^4 N$, then with overwhelming probability (at least $1 - O(N^{-A})$ for any fixed A), we have the approximate local Plancherel identity

$\displaystyle \sum_{\xi \in \Lambda} |\hat f(\xi)|^2 \approx \frac{|\Lambda|}{N} \sum_{x \in {\Bbb Z}/N{\Bbb Z}} |f(x)|^2$

for all 4S-sparse functions f, where by $X \approx Y$ we mean that X and Y differ by at most 10%. (N is not required to be prime.)

The above formulation is a little imprecise; see this paper (which had $\log^6 N$ instead of $\log^4 N$) or this one for more rigorous versions. This principle asserts that if the random set $\Lambda$ is just a little bit bigger than S (by a couple logs), then it is not possible for the Fourier transform of an S-sparse function to avoid $\Lambda$, and moreover the set $\Lambda$ must receive its “fair share” of the $l^2$ energy, as predicted by Plancherel’s theorem. The “uniform” nature of this principle rests of the fact that it applies for all S-sparse functions f, with no exceptions. For a single function f, this type of localisation of the Plancherel identity is quite easy to prove using Chernoff’s inequality, and uses the same phenomena that explain why Monte Carlo integration is so effective. To extend this to all sparse f, the main strategy (first used in this type of context by Bourgain) is to exploit the fact that the set of sparse f has low metric entropy and so can be described efficiently by a relatively small number of functions. (The principle cannot possibly extend to all functions f, since it is certainly possible to create non-zero functions whose Fourier transform vanishes everywhere on $\Lambda$.)

By using this principle (and variants of this principle), one can indeed show that basis pursuit works:

Theorem [CRT], [CT]. Suppose $\Lambda$ is a random set with $|\Lambda| \gg S \log N$. Then any given S-sparse function f will be recovered exactly by $l^1$ minimisation with overwhelming probability. If one makes the stronger hypothesis $|\Lambda| \gg S \log^4 N$, then with overwhelming probability all S-sparse functions will be recovered exactly by $l^1$ minimisation. (Again, N is not required to be prime.)

Roughly speaking, the idea in the latter result is to use the UUP to show that the Fourier coefficients of any sparse (or $l^1$-bounded) competitor with disjoint support to the true S-sparse solution is going to be rather orthogonal to the true solution, and thus unlikely to be present in an $l^1$ minimiser. The former result is more delicate and combinatorial, and requires computing high moments of random Fourier minors.

The method is rather robust; there is some followup work which demonstrates stability of the basis pursuit method with respect to several types of noise. See also the ICM proceedings article of Emmanuel Candes for a recent survey. It can also be abstracted from this toy Fourier problem to a more general problem of recovering sparse or compressible data from few measurements. As long as the measurement matrix obeys an appropriate generalisation of the UUP, the basis pursuit methods are quite effective (both in theory, in numerical experiments, and more recently in laboratory prototypes).