Van Vu and I have just uploaded to the arXiv our paper “Random matrices: The distribution of the smallest singular values“, submitted to Geom. Func. Anal.   This paper concerns the least singular value \sigma_n(M_n) of a random n \times n matrix M_n = (a_{ij})_{1 \leq i,j \leq n} with iid entries, which for simplicity we will take to be real (we also have analogues for complex random matrices), with mean zero and variance one.  A typical model to keep in mind here is the Bernoulli model, in which each a_{ij} is equal to +1 or -1 with an equal probability of each, while a privileged role will be played by the gaussian model, in which each a_{ij} \equiv N(0,1) is given the standard gaussian distribution.
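(As a concrete illustration, and purely for the reader who likes to experiment, here is a small numpy sketch of the two models just described; this is not from the paper, and the matrix size n is arbitrary.)

```python
# A minimal sketch (not from the paper; the size n is arbitrary) of the two models:
# the Bernoulli ensemble and the gaussian ensemble, together with the least singular
# value sigma_n of one sample from each.
import numpy as np

rng = np.random.default_rng(0)
n = 200

M_bernoulli = rng.choice([-1.0, 1.0], size=(n, n))   # iid entries +1 or -1, prob. 1/2 each
M_gaussian = rng.standard_normal((n, n))             # iid standard gaussian N(0,1) entries

# sigma_n(M) is the smallest singular value, i.e. the last entry of the SVD spectrum.
print(np.linalg.svd(M_bernoulli, compute_uv=False)[-1],
      np.linalg.svd(M_gaussian, compute_uv=False)[-1])
```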

The distribution of the least singular value \sigma_n(M_n), which is of importance in smoothed analysis and also has intrinsic interest within the field of random matrices, has been intensively studied in recent years.  For instance, in the Bernoulli case, there have been several recent papers on the singularity probability {\Bbb P}( \sigma_n(M_n) = 0 ); it is not hard to obtain a lower bound of (\frac{1}{2}+o(1))^n, and this is conjectured to be the correct asymptotic.  The best upper bound so far is by Bourgain, Vu, and Wood, who obtain (\frac{1}{\sqrt{2}}+o(1))^n.

Upper and lower tail bounds have also been obtained, starting with the breakthrough paper of Rudelson (building upon some earlier work on rectangular matrices by Litvak, Pajor, Rudelson, and Tomczak-Jaegermann), with subsequent papers by Van and myself, by Rudelson, and also by Rudelson and Vershynin.  To oversimplify somewhat, the conclusion of this work is that the least singular value \sigma_n(M_n) has size comparable to 1/\sqrt{n} with high probability.  The techniques are based in part on inverse Littlewood-Offord theorems.

However, in the case of the gaussian ensemble, we know more than just the expected size of the least singular value; we know its asymptotic distribution.  Indeed, it was shown by Edelman in this case that one has

{\Bbb P}( \sigma_n(M_n) \leq t/\sqrt{n} ) = 1 - e^{-t^2/2 - t} + o(1) (1)

for any fixed t > 0.  This computation was highly algebraic in nature, relying on special identities that are available only for extremely symmetric random matrix ensembles, such as the gaussian random matrix model; in particular, it is not obvious at all that the Bernoulli ensemble necessarily obeys the same distribution as the gaussian one.  Nevertheless, motivated in part by this computation, Spielman and Teng conjectured that the bound

{\Bbb P}( \sigma_n(M_n) \leq t/\sqrt{n} ) \leq t + c^n

should hold for some c < 1 for, say, the Bernoulli ensemble.    This conjecture was verified up to losses of a multiplicative constant by Rudelson and Vershynin.

The main result of our paper is to show that the distribution of the least singular value is in fact universal, being asymptotically the same for all iid (real) random matrix models with the same mean and variance, assuming a sufficient number of moment conditions.  In particular, the asymptotic (1) for the gaussian ensemble is also true for the Bernoulli ensemble. Furthermore, the error term o(1) can be shown to be of the shape O(n^{-c}) for some c > 0, which in turn confirms the Spielman-Teng conjecture (without a loss of constant) in the polynomial size regime t \geq n^{-c'} for some c' > 0.  We also have some further results for other low singular values (e.g. \sigma_{n-k}(M_n) for fixed k), but they are harder to state, and we will not do so here.
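One can already get a feel for this universality numerically.  The following Monte Carlo sketch (again not from the paper; the size n and the number of trials are arbitrary, and finite-size effects are visible at these scales) compares the empirical distribution of \sqrt{n} \sigma_n(M_n) for the Bernoulli and gaussian ensembles against the limit 1 - e^{-t^2/2 - t} appearing in (1).

```python
# A rough Monte Carlo sketch (illustrative only; n and the number of trials are
# arbitrary) comparing the empirical distribution of sqrt(n) * sigma_n(M_n) for the
# Bernoulli and gaussian ensembles against the limit 1 - exp(-t^2/2 - t) from (1).
import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 500

def scaled_least_sv(sampler):
    """Return sqrt(n) * sigma_n for one sample matrix drawn by `sampler`."""
    M = sampler((n, n))
    return np.sqrt(n) * np.linalg.svd(M, compute_uv=False)[-1]

bernoulli = np.array([scaled_least_sv(lambda shape: rng.choice([-1.0, 1.0], size=shape))
                      for _ in range(trials)])
gaussian = np.array([scaled_least_sv(rng.standard_normal) for _ in range(trials)])

for t in (0.5, 1.0, 2.0):
    print(t,
          np.mean(bernoulli <= t),      # empirical P(sigma_n <= t/sqrt(n)), Bernoulli
          np.mean(gaussian <= t),       # empirical P(sigma_n <= t/sqrt(n)), gaussian
          1 - np.exp(-t**2 / 2 - t))    # limiting value from Edelman's formula (1)
```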

To our knowledge, this is the first universality result for the “hard edge” of the spectrum (i.e. the least few singular values) for iid square matrix models.  [For rectangular matrices, where the hard edge is bounded away from zero, universality was recently established by Feldheim and Sodin.] The bulk distribution for the singular values of such matrices has been known for some time (it is governed by the famous Marchenko-Pastur law), while the distribution at the “soft edge” of the spectrum (i.e. the largest few singular values) was established to be universal by Soshnikov (here the distribution is governed by the Tracy-Widom law for the top singular value, and by the Airy kernel for the next few singular values).  Both of these results are basically obtained by the moment method (or close relatives of this method, such as the resolvent method).  However, the moment method is not effective for discerning the hard edge of the spectrum, since the singular values here are very small compared with the bulk and so have a negligible influence on the moments.  [In the rectangular case, where the hard edge is bounded away from zero, the moment method becomes available again, though the application of it is quite delicate; see the Feldheim-Sodin paper for details.] Instead, we proceed by observing a certain central limit theorem type behaviour for the geometry of the columns of M_n, which is enough to give us the desired universality; more details on the proof lie below the fold.

— Proof strategy —

As with our earlier paper on the universality of the circular law, we do not attempt to prove (1) for general ensembles such as the Bernoulli ensemble (say) directly.  Instead, we use a Lindeberg-style approach to separate the analytic difficulties from the algebraic ones by first comparing the least singular value \sigma_n(M_n) of a Bernoulli matrix M_n with the corresponding least singular value \sigma_n(G_n) of a gaussian matrix G_n.  If we can show a universality property

\sigma_n(M_n) \approx \sigma_n(G_n) (2)

where we use X \approx Y to denote the informal statement that two random variables X, Y have asymptotically the same distribution (after appropriate renormalisation, and in some appropriate metric), then the claim (1) for the Bernoulli matrix M_n will follow from the existing work of Edelman establishing (1) for the gaussian matrix G_n.

To establish the approximation (2) for the least singular value, the first thing we do is flip it around, so that we now seek an approximation for the largest singular values of the inverse matrices:

\sigma_1(M_n^{-1}) \approx \sigma_1( G_n^{-1} ). (3)

Experience has shown that largest singular values are easier to compute than least singular values (for instance, the largest singular value is the same thing as the operator norm and can be computed by, say, the power method).  But there is a huge price to pay: whereas the matrices M_n, G_n had iid entries and were thus tractable to manipulate, the inverse matrices M_n^{-1}, G_n^{-1} have highly coupled entries and are quite difficult to manipulate; in particular, it is not easy at all to control high moments of M_n^{-1}.  So the moment method does not work terribly well (although the second moment of M_n^{-1} is computable, and already useful – we used this “negative second moment identity” to good effect in our earlier paper).
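(As an aside, here is a small sketch, not from the paper and with arbitrary parameters, of the flipped viewpoint (3) in action: it estimates \sigma_n(M_n) via the power method applied to (M_n^{-1})^T M_n^{-1}, implemented with linear solves rather than an explicit inverse.)

```python
# A small sketch (not from the paper; n and the iteration count are arbitrary) of the
# flipped viewpoint: sigma_n(M) = 1/sigma_1(M^{-1}), with sigma_1(M^{-1}) estimated by
# the power method applied to (M^{-1})^T M^{-1}.
import numpy as np

rng = np.random.default_rng(2)
n = 300
M = rng.standard_normal((n, n))

v = rng.standard_normal(n)
v /= np.linalg.norm(v)
for _ in range(200):
    # One power-method step for (M^{-1})^T M^{-1}: w = M^{-1} v, then v = M^{-T} w.
    w = np.linalg.solve(M, v)
    v = np.linalg.solve(M.T, w)
    v /= np.linalg.norm(v)

sigma1_of_inverse = np.linalg.norm(np.linalg.solve(M, v))   # ~ sigma_1(M^{-1})
print(1.0 / sigma1_of_inverse,                              # power-method estimate of sigma_n(M)
      np.linalg.svd(M, compute_uv=False)[-1])               # direct SVD value, for comparison
```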

So we do something else.  It turns out that the singular values of M_n tend to be spaced apart with spacing about 1/\sqrt{n}; in particular, \sigma_{n-k+1}(M_n) is heuristically about k / \sqrt{n}.  (This intuition is supported by the Marchenko-Pastur law, and is also backed up by the tail estimates of Rudelson and Vershynin mentioned earlier.)  Inverting this, we see that the k^{th} largest singular value of M_n^{-1} should be about \sqrt{n} / k.  Thus we see that (after dividing out the normalising factor \sqrt{n}) the matrix M_n^{-1} should behave like a compact operator; its singular values decay to zero at a noticeable rate.  In particular, M_n^{-1} should behave approximately like a finite rank matrix.  (More accurately, “finite” should be “bounded uniformly in n”.)
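This spacing heuristic is easy to test numerically; the following sketch (illustrative only, with an arbitrary n and a gaussian sample) compares \sqrt{n} \sigma_{n-k+1}(M_n) with k for the smallest few singular values.

```python
# A quick numerical look (illustrative only; n is arbitrary) at the spacing heuristic:
# sqrt(n) * sigma_{n-k+1}(M_n) should be roughly of order k for the smallest few
# singular values.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
M = rng.standard_normal((n, n))
svals = np.linalg.svd(M, compute_uv=False)   # returned in decreasing order

for k in range(1, 6):
    # svals[n - k] is sigma_{n-k+1}(M_n); the heuristic predicts a value of order k.
    print(k, np.sqrt(n) * svals[n - k])
```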

At this point, we use the philosophy from the property testing literature for finite rank matrices, which tells us that statistics of a large finite rank matrix (and in particular, its largest singular value) can often be computed accurately from a small random sample of that matrix.  Indeed, if we pick a small integer s (in practice s \sim n^\varepsilon for some small \varepsilon > 0) and let B(M_n) be the s \times n matrix formed by taking s rows of M_n^{-1} at random (actually, given the iid nature of M_n, we can just take the first s rows), then we expect (from the approximate finite rank nature of M_n^{-1})

\sigma_1(M_n^{-1}) \approx \sqrt{\frac{n}{s}} \sigma_1( B(M_n) ).

It turns out that one can justify this by an application of the moment method, coupled with some estimates on the size of the rows of M_n^{-1} which can be obtained by variants of the machinery used in previous papers (in particular, we need some new estimates on distances between random vectors and random hyperplanes).
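Here is a quick numerical sanity check of this sampling heuristic (illustrative only; the choices of n and s are arbitrary, and the agreement is only up to a multiplicative factor close to 1 at these finite sizes).

```python
# A quick numerical sanity check (illustrative only; n and s are arbitrary) of the
# sampling heuristic above: sigma_1(M_n^{-1}) should be comparable to
# sqrt(n/s) * sigma_1(B(M_n)), where B(M_n) consists of the first s rows of M_n^{-1}.
import numpy as np

rng = np.random.default_rng(4)
n, s = 1000, 20
M = rng.choice([-1.0, 1.0], size=(n, n))   # a Bernoulli sample (invertible with high probability)
Minv = np.linalg.inv(M)

B = Minv[:s, :]                                   # the s x n matrix B(M_n)
lhs = np.linalg.svd(Minv, compute_uv=False)[0]    # sigma_1(M_n^{-1})
rhs = np.sqrt(n / s) * np.linalg.svd(B, compute_uv=False)[0]

# The two quantities should typically agree up to a multiplicative factor close to 1;
# the approximation is only heuristic at these finite sizes.
print(lhs, rhs)
```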

Of course, everything we say about the Bernoulli ensemble has counterparts for the gaussian ensemble, and so we are reduced to showing that

\sigma_1(B(M_n)) \approx \sigma_1(B(G_n)).

To show this, we invert again, viewing the largest singular value of B(M_n) as the reciprocal of the least non-trivial singular value of the (pseudo-)inverse of B(M_n).  Indeed, some elementary linear algebra tells us that

\sigma_1(B(M_n)) = \sigma_s(M_{n,s})^{-1}

where the s \times s matrix M_{n,s} is formed by projecting the first s columns X_1,\ldots,X_s of M_n onto the s-dimensional space V defined as the orthocomplement of the span of the remaining columns X_{s+1},\ldots,X_n, and then expressing those vectors using some arbitrarily chosen orthonormal basis for V.  [This fact is perhaps most familiar in the simple case s=1, when it is saying that the size of a row of M_n^{-1} is the reciprocal of the distance from the corresponding column of M_n to the hyperplane spanned by the other columns of M_n.]
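Unlike the heuristics above, this identity is exact, and can be checked numerically; the following sketch (illustrative only, with arbitrary n and s and a gaussian sample) verifies it.

```python
# A numerical check (illustrative only; n and s are arbitrary) of the identity above:
# sigma_1 of the s x n matrix of the first s rows of M_n^{-1} equals the reciprocal of
# sigma_s of the s x s matrix M_{n,s} obtained by projecting the first s columns of M_n
# onto the orthocomplement V of the span of the remaining columns, written in an
# orthonormal basis of V.
import numpy as np

rng = np.random.default_rng(5)
n, s = 50, 4
M = rng.standard_normal((n, n))

# Left-hand side: sigma_1(B(M_n)).
B = np.linalg.inv(M)[:s, :]
lhs = np.linalg.svd(B, compute_uv=False)[0]

# Right-hand side: build an orthonormal basis of V and form M_{n,s}.
Q, _ = np.linalg.qr(M[:, s:], mode='complete')
V_basis = Q[:, n - s:]                 # last s columns of Q span the orthocomplement V
M_ns = V_basis.T @ M[:, :s]            # coordinates of the projected columns X_1,...,X_s
rhs = 1.0 / np.linalg.svd(M_ns, compute_uv=False)[-1]

print(lhs, rhs)   # these should agree up to rounding error
```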

Of course, we have similar computations for the gaussian ensemble; so we are now reduced to showing that

\sigma_s(M_{n,s}) \approx \sigma_s( G_{n,s} ).

It looks like we are more or less back to where we started (cf. (2)), but with s \times s matrices instead of n \times n matrices.  Indeed, the projected gaussian matrix G_{n,s} can easily be checked to have exactly the same distribution as the s \times s gaussian matrix G_s, basically because the projection of an n-dimensional gaussian random vector to an s-dimensional subspace is an s-dimensional gaussian vector.

But now for the punchline: the projected Bernoulli matrix M_{n,s} also has distribution very close to the gaussian matrix G_s, because the projection of an n-dimensional Bernoulli random vector to a (sufficiently non-degenerate) s-dimensional subspace also behaves like an s-dimensional gaussian vector.  This is basically a consequence of the central limit theorem; for technical reasons we need a modification of the Berry-Esséen version of this theorem.  One has to show that the space V is sufficiently non-degenerate, but this can be done by a modification of techniques already in the literature (specifically, we rely on Talagrand’s inequality and a quantitative form of the Marchenko-Pastur law).  Once this is done, the claim follows from the stability properties of \sigma_s (specifically, we use the Hoffman-Wielandt inequality).
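The following sketch (illustrative only; dimensions and trial counts arbitrary, and the subspace is taken to be a random one rather than the V arising in the argument) shows this central limit behaviour in action: the coordinates of a projected Bernoulli vector have moments close to those of a projected gaussian vector, i.e. close to the standard gaussian moments.

```python
# An illustrative sketch (dimensions and trial counts arbitrary) of the central limit
# behaviour used above: the coordinates of the projection of an n-dimensional Bernoulli
# vector onto a generic s-dimensional subspace look like independent N(0,1) variables,
# just as the coordinates of a projected gaussian vector do exactly.
import numpy as np

rng = np.random.default_rng(6)
n, s, trials = 500, 2, 5000

# A "generic" s-dimensional subspace: orthonormal basis from a random gaussian matrix.
P, _ = np.linalg.qr(rng.standard_normal((n, s)))        # P is n x s, orthonormal columns

bern = rng.choice([-1.0, 1.0], size=(trials, n)) @ P    # projected Bernoulli vectors
gauss = rng.standard_normal((trials, n)) @ P            # projected gaussian vectors

# Compare the first few moments of one coordinate with the N(0,1) values 0, 1, 0, 3.
for k in (1, 2, 3, 4):
    print(k, np.mean(bern[:, 0] ** k), np.mean(gauss[:, 0] ** k))
```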