Over two years ago, Emmanuel Candès and I submitted the paper “The Dantzig selector: Statistical estimation when p is much larger than n” to the Annals of Statistics. This paper, which appeared last year, proposed a new type of selector (which we called the Dantzig selector, due to its reliance on the linear programming methods to which George Dantzig, who had died as we were finishing our paper, had contributed so much) for statistical estimation in the case when the number p of unknown parameters is much larger than the number n of observations. More precisely, we considered the problem of obtaining a reasonable estimate \beta^* for an unknown vector \beta \in {\Bbb R}^p of parameters given a vector y = X \beta + z \in {\Bbb R}^n of measurements, where X is a known n \times p predictor matrix and z is a (Gaussian) noise error with some variance \sigma^2. We assumed that the predictor matrix X obeyed the restricted isometry property (RIP, also known as UUP), which roughly speaking asserts that X\beta has norm comparable to \beta whenever the vector \beta is sparse. This RIP property is known to hold for various ensembles of random matrices of interest; see my earlier blog post on this topic.

Our selection algorithm, inspired by our previous work on compressed sensing, chooses the estimated parameters \beta^* to have minimal l^1 norm amongst all vectors which are consistent with the data in the sense that the residual vector r := y - X \beta^* obeys the condition

\| X^* r \|_\infty \leq \lambda, where \lambda := C \sqrt{\log p} \sigma (1)

(one can check that such a condition is obeyed with high probability in the case that \beta^* = \beta, thus the true vector of parameters is feasible for this selection algorithm). This selector is similar, though not identical, to the more well-studied lasso selector in the literature, which minimises the l^1 norm of \beta^* penalised by the l^2 norm of the residual.
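For concreteness, here is a minimal sketch (not the implementation used in the paper, and not optimised) of how the Dantzig selector can be computed in practice by recasting it as a linear program and handing it to a general-purpose solver such as SciPy's linprog. The constant C = 2 and the small synthetic example at the end are illustrative choices only.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve  min ||beta||_1  subject to  ||X^T (y - X beta)||_inf <= lam
    by recasting it as a linear program in (beta, u), where u_i bounds |beta_i|."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    I, Z = np.eye(p), np.zeros((p, p))
    # objective: minimise sum(u), which equals ||beta||_1 at the optimum
    c = np.concatenate([np.zeros(p), np.ones(p)])
    # constraints:  beta - u <= 0,  -beta - u <= 0,
    #               X^T X beta <= X^T y + lam,  -X^T X beta <= lam - X^T y
    A_ub = np.block([[I, -I], [-I, -I], [XtX, Z], [-XtX, Z]])
    b_ub = np.concatenate([np.zeros(2 * p), Xty + lam, lam - Xty])
    bounds = [(None, None)] * p + [(0, None)] * p
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

# small synthetic example: sparse beta, Gaussian design and Gaussian noise
rng = np.random.default_rng(0)
n, p, sigma = 50, 200, 0.1
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta = np.zeros(p); beta[:5] = 1.0
y = X @ beta + sigma * rng.standard_normal(n)
lam = 2 * np.sqrt(np.log(p)) * sigma      # lambda = C sqrt(log p) sigma with C = 2
beta_star = dantzig_selector(X, y, lam)   # should be a reasonably accurate sparse estimate
```

The auxiliary variables u_i bound |\beta_i| from above, so minimising their sum is the same as minimising the l^1 norm of \beta^* subject to the constraint (1).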

A simple model case arises when n=p and X is the identity matrix, thus the observations are given by a simple additive noise model y_i = \beta_i + z_i. In this case, the Dantzig selector \beta^* is given by the soft thresholding formula

\beta^*_i = \max(|y_i| - \lambda, 0 )  \hbox{sgn}(y_i).
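
In code this thresholding map is a one-liner; the following NumPy rendering (an illustration, not taken from the paper) simply implements the formula above componentwise.

```python
import numpy as np

def soft_threshold(y, lam):
    # beta*_i = max(|y_i| - lam, 0) * sgn(y_i): shrink each observation towards zero by lam
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
```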

The mean square error {\Bbb E}( \| \beta - \beta^* \|^2 ) for this selector can be computed to be roughly

\lambda^2 + \sum_{i=1}^n  \min( |\beta_i|^2, \lambda^2) (2)

and one can show that this is basically best possible (except for constants and logarithmic factors) amongst all selectors in this model. More generally, the main result of our paper was that under the assumption that the predictor matrix obeys the RIP, the mean square error of the Dantzig selector is essentially equal to (2) and thus close to best possible.
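As a quick numerical sanity check (not part of the paper or the discussion), one can simulate the diagonal model above and compare the empirical mean square error of soft thresholding against the bound (2); the sparsity pattern, signal size, and number of trials below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 1000, 1.0
lam = np.sqrt(2 * np.log(n)) * sigma      # lambda = C sqrt(log p) sigma with C = sqrt(2), p = n

# a sparse truth: a few large coefficients, the rest zero
beta = np.zeros(n); beta[:20] = 5.0

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# empirical mean square error over repeated noise draws
trials = 200
mse = np.mean([np.sum((soft_threshold(beta + sigma * rng.standard_normal(n), lam) - beta) ** 2)
               for _ in range(trials)])

# the risk bound (2): lambda^2 + sum_i min(beta_i^2, lambda^2)
bound = lam ** 2 + np.sum(np.minimum(beta ** 2, lam ** 2))
print(mse, bound)
```

In such an experiment the two quantities should agree up to a modest constant factor, consistent with the near-optimality claim.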

After accepting our paper, the Annals of Statistics took the (somewhat uncommon) step of soliciting responses to the paper from various experts in the field, and then soliciting a rejoinder to these responses from Emmanuel and me. Recently, the Annals posted these responses and rejoinder on the arXiv:

  1. Bickel compared these results with recent results on the lasso selector by Bunea-Tsybakov-Wegkamp and Meinshausen-Yu, commented on the naturality of the constraint (1), raised the issue of collinearity (which is precluded by the RIP hypothesis, but can of course occur in practice), and proposed the use of tools such as cross-validation to estimate the quantity \lambda appearing in the constraint (1) (a generic sketch of such a cross-validation procedure is given after this list).
  2. Efron, Hastie, and Tibshirani performed numerics to compare the accuracy of the Dantzig selector and the lasso selector; the performance was broadly rather similar, but the Dantzig selector appeared to have some artefacts arising from the constraint (1).
  3. Cai and Lv asked whether the \sqrt{\log p} factor in (1) was too conservative in the case when p was extremely large compared to n, leading to an overly sparse model \beta^*; similarly, they raised the question of whether the logarithmic losses in (2) were sharp. Concerns were also raised about the verifiability of the RIP condition in this case (which is also the case in which collinearity becomes likely). There were also some issues raised concerning speed and robustness of the implementation of the Dantzig selector.
  4. Ritov raised a more philosophical point, as to whether the prediction error (which is essentially {\Bbb E} \| X \beta - X \beta^* \|^2 in this model) is a better indicator of accuracy than the loss {\Bbb E} \| \beta - \beta^* \|^2.
  5. Meinshausen, Rocha, and Yu compared our results with similar (though not perfectly analogous) existing results for the lasso selector, as well as a comparative analysis in low-dimensional situations. Their numerical experiments also suggest that the \lambda parameter in (1) needs to be tuned by cross-validation, and that the Dantzig selector has particular difficulties with collinearity.
  6. Friedlander and Saunders focused on the speed of implementation of the Dantzig selector, concluding that using general-purpose linear programming solvers (e.g. the simplex method) was moderately computationally intensive in practice.
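
Since several of the discussants (and the rejoinder below) mention cross-validation as a way of tuning \lambda, here is a generic K-fold cross-validation sketch (not taken from any of the discussion pieces): the function fit stands for any selector that returns a coefficient vector from training data and a value of \lambda, for instance the Dantzig selector sketch earlier in this post.

```python
import numpy as np

def cv_choose_lambda(X, y, lambdas, fit, n_folds=5, seed=0):
    """Pick lambda from the candidate list `lambdas` by K-fold cross-validation
    on held-out prediction error.  `fit(X_train, y_train, lam)` is any selector
    returning a coefficient vector, e.g. the dantzig_selector sketch above."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    errs = np.zeros(len(lambdas))
    for test in folds:
        train = np.setdiff1d(idx, test)
        for j, lam in enumerate(lambdas):
            beta_hat = fit(X[train], y[train], lam)
            errs[j] += np.sum((y[test] - X[test] @ beta_hat) ** 2)
    return lambdas[int(np.argmin(errs))]
```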

Finally, Candès and I gave a rejoinder to these responses. Our main points were:

  1. Regarding collinearity (which in particular implies breakdown of the RIP), an accurate estimation of the parameters \beta is essentially hopeless no matter what selector one uses, and it is indeed more profitable to focus on estimating X \beta instead; but our selector is aimed at applications (such as imaging, analog-to-digital conversion, or genomics) in which \beta is the variable of interest rather than X \beta. It is certainly of interest to relax the RIP hypothesis, however, and to see whether one can still obtain good estimates for X \beta in this case. (The canonical selector always gives a near-optimal estimate for X \beta, but is NP-hard (and thus infeasible in practice) to compute.)
  2. We agree with the points made that the parameters in (1) need to be tuned further to optimise performance, and in particular that cross-validation is an eminently sensible idea for this purpose. Note that in many applications (e.g. imaging), the variance \sigma^2 can be specified from the design of the application.
  3. The mean-square error for the Dantzig selector does tend to underperform that for the lasso selector slightly in many cases, but (as noted in our original paper) this can be compensated for by using a slightly more complicated Gauss-Dantzig selector, in which the Dantzig selector is used to locate the “active” parameters, to which one then applies a classical least-squares regression method (a rough sketch of this two-stage idea is given after this list).
  4. The running time of off-the-shelf implementations of the Dantzig selector is indeed somewhat of a concern for large data sets (e.g. 10^6 measurements). But the same could have been said of the lasso selector when it was first introduced, and we expect to see faster ways to implement the algorithm in the future.
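
To illustrate the two-stage idea in point 3, here is a rough sketch of the refitting step (a rendering for illustration, not the exact procedure from the paper): given any first-stage Dantzig estimate beta_ds (for instance from the linear-programming sketch earlier), one keeps the coordinates that are non-negligible and re-fits them by ordinary least squares; the tolerance tol is an arbitrary illustrative choice.

```python
import numpy as np

def gauss_dantzig_refit(X, y, beta_ds, tol=1e-6):
    """Second stage of a Gauss-Dantzig-style estimate: treat the coordinates
    where the first-stage Dantzig estimate beta_ds is non-negligible as the
    active set, then re-fit those coordinates by ordinary least squares."""
    support = np.where(np.abs(beta_ds) > tol)[0]
    beta = np.zeros(X.shape[1])
    if support.size > 0:
        # the unpenalised refit removes the l^1 shrinkage bias on the selected set
        beta[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return beta
```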

It was an interesting and challenging new experience for Emmanuel and me to engage in a formal discussion over one of our papers in a journal venue; I suppose this sort of thing is more common in the applied sciences such as statistics, but it seems to be rather rare in pure mathematics.

Incidentally, after this rejoinder was submitted, more recent work has appeared showing that the lasso selector enjoys similar performance guarantees to the Dantzig selector: see this paper of Bickel, Ritov, and Tsybakov. Also, a nice way to perform cross-validation for general compressed sensing problems via the Johnson-Lindenstrauss lemma was noted very recently by Ward.

[Update, Mar 22: The journal issue (Vol. 35, No. 6, 2007) in which all these articles appear can be found here.]

[Update, Mar 29: I have learned also of the recent paper of James, Radchenko, and Lv, which introduces a new “DASSO” algorithm for computing the Dantzig selector in time comparable to that of the best known Lasso algorithms, and provides further theoretical connections between the two selectors.]