Now we turn attention to another important spectral statistic, the least singular value of an matrix (or, more generally, the least non-trivial singular value of a matrix with ). This quantity controls the invertibility of . Indeed, is invertible precisely when is non-zero, and the operator norm of is given by . This quantity is also related to the condition number of , which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of (and more generally, of the shifts for complex ) will be of importance in rigorously establishing the circular law for iid random matrices , as it plays a key role in computing the Stieltjes transform of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.
The least singular value
which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm
at the “soft edge” of the spectrum, that was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of , but are not able to control the “generic” or “incompressible” choices of , for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with for , it gives an upper bound of for the singular value with high probability, thanks to the Marchenko-Pastur law), but again this method begins to break down for square matrices, although one can make some partial headway by considering negative moments such as , though these are more difficult to compute than positive moments .
So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the rows of the matrix , and the hyperplane spanned by the other rows. The reason for this is as follows. First suppose that , so that is non-invertible, and there is a linear dependence between the rows . Thus, one of the will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so .
More generally, if the least singular value is small, one generically expects many of these distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row and the hyperplane spanned by the other rows. This latter quantity is basically the dot product of with a unit normal of this hyperplane.
When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal (which depends on all the rows other than ) is independent of , so even after conditioning to be fixed, the entries of remain independent. As such, the dot product is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)
These methods rely quite strongly on the joint independence on all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).
To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the Bernoulli ensemble of random sign matrices, where are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.
— 1. The epsilon-net argument —
We begin by using the epsilon net argument to establish a lower bound in the rectangular case, due to Litvak, Pajor, Rudelson, and Tomczak-Jaegermann:
for some absolute constant that holds with overwhelming probability from Notes 3 (indeed, one can take any here).
We use the epsilon-net method introduced in Notes 3, but with a smaller value of than used for the largest singular value. We write
Taking to be a maximal -net of the unit sphere in , with to be chosen later, we have that
and thus by (1), we have with overwhelming probability that
and so it suffices to show that
is exponentially small in . From the union bound, we can upper bound this by
So we need to upper bound, for each , the probability
If we let be the rows of , we can write this as
By Markov’s inequality, the only way that this event can hold is if we have
for at least values of . We do not know in advance what the set of is for which this event holds. But the number of possible values of such sets of is at most . Applying the union bound (and paying the entropy cost of ) and using symmetry, we may thus bound the above probability by
Now observe that the random variables are independent, and so we can bound this expression by
where is a random vector of iid Bernoulli signs.
We write , so that is a random walk
To understand this walk, we apply (a slight variant) of the Berry-Esséen theorem from Notes 2:
Exercise 1 Show that
for any and any non-zero . (Actually, for the purposes of this set of notes, it would suffice to establish a weaker form of the Berry-Esséen theorem with replaced by for any fixed ). (Hint: first normalise , then adapt the proof of the Berry-Esséen theorem.)
Conclude in particular that if
for some then
(Hint: condition out all the with .)
Let us temporarily call incompressible if
for somme to be chosen later, and compressible otherwise. If we only look at the incompressible elements of , we can now bound
and comparing this against the entropy cost (2) we obtain an acceptable contribution for small enough, and small enough depending on (here we are crucially using the rectangular condition ).
It remains to deal with the compressible vectors. Observe that such vectors lie within of a sparse unit vector which is only supported in at most positions. The -entropy of these sparse vectors (i.e. the number of balls of radius needed to cover this space) can easily be computed to be of polynomial size in . Meanwhile, we have the following crude bound:
Exercise 2 For any unit vector , show that
for small enough. (Hint: Use the Paley-Zygmund inequality. Bounds on higher moments on can be obtained for instance using Hoeffding’s inequality, or by direct computation.) Use this to show that
for all such and sufficiently small, with independent of and .
Thus the compressible vectors give a net contribution of , which is acceptable. This concludes the proof of Theorem 1.
— 2. Singularity probability —
Now we turn to square Bernoulli matrices . Before we investigate the size of the least singular value, we first tackle the easier problem of bounding the singularity probability
i.e. the probability that is not invertible. The problem of computing this probability exactly is still not completely settled. Since is singular whenever the first two rows (say) are identical, we obtain a lower bound
and it is conjectured that this bound is essentially tight in the sense that
but this remains open; the best bound currently is due to Bourgain, Vu, and Wood, and gives
We will not prove this bound here, but content ourselves with a weaker bound, essentially due to Komlós:
Proposition 2 We have .
To show this, we need the following combinatorial fact, due to Erdös:
Proof: By taking real and imaginary parts we may assume that is real. By eliminating zero coefficients of we may assume that ; reflecting we may then assume that all the are positive. Observe that the set of with forms an antichain in . The claim now easily follows from Sperner’s theorem and Stirling’s formula.
for any non-zero .
Now we prove the theorem. In analogy with the arugments of the previou ssection, we write
(actually we can take since is real). We divide into compressible and incompressible vectors as before, but our definition of compressibility and incompressibility is slightly different now. Also, one has to do a certain amount of technical maneuvering in order to preserve the crucial independence between rows and columns.
Namely, we pick an and call compressible if it is supported on at most coordinates, and incompressible otherwise.
Let us first consider the contribution of the event that for some nonzero compressible . Pick an with this property which is as sparse as possible, say sparse for some . Let us temporarily fix . By paying an entropy cost of , we may assume that it is the first entries that are non-zero for some . This implies that the first columns of have a linear dependence given by ; by minimality, are linearly independent. Thus, is uniquely determined (up to scalar multiples) by . Furthermore, as the matrix formed by has rank , there is some minor which already determines up to constants; by paying another entropy cost of , we may assume that it is the top left minor which does this. In particular, we can now use the first rows to determine up to constants. But the remaining rows are independent of and still need to be orthogonal to ; by Proposition 3 and (3), this happens with probability at most , giving a total cost of
which by Stirling’s formula is acceptable (in fact this gives an exponentially small contribution).
The same argument gives that the event that for some nonzero compressible also has exponentially small probability. The only remaining event to control is the event that for some incompressible , but that and for all nonzero compressible . Call this event .
Since for some incompressible , we see that for at least values of , the row lies in the vector space spanned by the remaining rows of . Let denote the event that holds, and that lies in ; then we see from double counting that
By symmetry, we thus have
To compute , we freeze consider a normal vector to ; note that we can select depending only on . We may assume that an incompressible normal vector exists, since otherwise the event would be empty. We make the crucial observation that is still independent of . By Proposition 3, we thus see that the conditional probability that , for fixed , is . We thus see that , and the claim follows.
Remark 1 Further progress has been made on this problem by a finer analysis of the concentration probability , and in particular in classifying those for which this concentration probability is large (this is known as the inverse Littlewood-Offord problem). Important breakthroughs in this direction were made by Halász (introducing Fourier-analytic tools) and by Kahn, Komlós, and Szemerédi (introducing an efficient “swapping” argument). Van Vu and I introduced tools from additive combinatorics (such as Freiman’s theorem) to obtain further improvements, leading eventually to the results of Bourgain, Vu, and Wood mentioned earlier.
— 3. Lower bound for the least singular value —
Now we return to the least singular value of an iid Bernoulli matrix, and establish a lower bound. Given that there are singular values between and , which is typically of size , one expects the least singular value to be of size about on the average. Another argument supporting this heuristic scomes from the following identity:
Exercise 3 (Negative second moment identity) Let be an invertible matrix, let be the rows of , and let be the columns of . For each , let be the hyperplane spanned by all the rows other than . Show that and .
From Talagrand’s inequality (Notes 1), we expect each to be of size on the average, which suggests that ; this is consistent with the heuristic that the eigenvalues should be roughly evenly spaced in the interval (so that should be about ).
Now we give a rigorous lower bound:
where goes to zero as uniformly in , and goes to zero as for each fixed .
This is a weaker form of a result of Rudelson and Vershynin (which obtains a bound of the form for some ), which builds upon earlier work of Rudelson and of Vu and myself, which obtained variants of the above result.
The scale that we are working at here is too fine to use epsilon net arguments (unless one has a lot of control on the entropy, which can be obtained in some cases thanks to powerful inverse Littlewood-Offord theorems, but is difficult to obtain in general.) We can prove this theorem along similar lines to the arguments in the previous section; we sketch the method as follows. We can take to be small. We write the probability to be estimated as
We can assume that for some absolute constant , as the event that this fails has exponentially small probability.
We pick an (not depending on ) to be chosen later. We call a unit vector compressible if lies within a distance of a -sparse vector. Let us first dispose of the case in which for some compressible . By paying an entropy cost of , we may assume that is within of a vector supported in the first coordinates. Using the operator norm bound on and the triangle inequality, we conclude that
Since has norm comparable to , this implies that the least singular value of the first columns of is . But by Theorem 1, this occurs with probability (if are small enough). So the total probability of the compressible event is at most , which is acceptable if is small enough.
Thus we may assume now that for all compressible unit vectors ; we may similarly assume that for all compressible unit vectors . Indeed, we may also assume that for every , where is with the column removed.
The remaining case is if for some incompressible . Let us call this event . Write , and let be the column of , thus
Letting be the subspace spanned by all the except for , we conclude upon projecting to the orthogonal complement of that
for all (compare with Exercise 3). On the other hand, since is incompressible, we see that for at least values of , and thus
for at least values of . If we let be the event that and (4) both hold, we thus have from double-counting that
and thus by symmetry
(say). However, if holds, then setting to be a unit normal vector to (which is necessarily incompressible, by the hypothesis on ), we have
Again, the crucial point is that and are independent. The incompressibility of , combined with a Berry-Esséen type theorem, then gives
Exercise 4 Show that
(say) if is sufficiently small depending on , and is sufficiently large depending on .
This gives a bound of for if is small enough depending on , and is large enough; this gives the claim.
with high probability for some , and any of polynomial size in . There are several results of this type, with overlapping ranges of generality (and various values of ) by Götze-Tikhimirov, Pan-Zhou, and Vu and myself; the exponent is known to degrade if one has too few moment assumptions on the underlying random matrix . This type of result (with an unspecified ) is important for the circular law, discussed in the next set of lectures.
— 4. Upper bound for the least singular value —
One can complement the lower tail estimate with an upper tail estimate:
We prove this using an argument of Rudelson and Vershynin. Suppose that , then
for all .
We apply this with equal to , where is the orthogonal projection to the space spanned by . On the one hand, we have
and on the other hand we have for any that
which by symmetry can be bounded by
Let be a small quantity to be chosen later. From Talagrand’s inequality (Notes 1) we know that with probability , so we obtain a bound of
Now a key point is that the vectors depend only on and not on ; indeed, they are the dual basis for in . Thus, after conditioning and thus to be fixed, is still a Bernoulli random vector. Applying a Berry-Esséen inequality, we obtain a bound of for the conditional probability that for sufficiently small depending on , unless is compressible (in the sense that, say, it is within of an -sparse vector). But this latter possibility can be controlled (with exponentially small probability) by the same type of arguments as before; we omit the details.
— 5. Asymptotic for the least singular value —
The distribution of singular values of a gaussian random matrix can be computed explicitly by techniques similar to those employed in Notes 6. In particular, if is a real gaussian matrix (with all entries iid with distribution ), it was shown by Edelman that converges in distribution to the distribution as . It turns out that this result can be extended to other ensembles with the same mean and variance. In particular, we have the following result of Van Vu and myself:
We do not establish the limit directly, but instead use Edelman’s result as a black box, focusing instead on establishing the universality of the limiting distribution of , and in particular that this limiting distribution is the same whether one has a Bernoulli ensemble or a gaussian ensemble.
The arguments are somewhat technical and we will not present them in full here, but instead give a sketch of the key ideas.
In previous sections we have already seen the close relationship between the least singular value , and the distances between a row of and the hyperplane spanned by the other rows. It is not hard to use the above machinery to show that as , converges in distribution to the absolute value of a Gaussian regardless of the underlying distribution of the coefficients of (i.e. it is asymptotically universal). The basic point is that one can write as where is a unit normal of (we will assume here that is non-singular, which by previous arguments is true asymptotically almost surely). The previous machinery lets us show that is incompressible with high probability, and then claim then follows from the Berry-Esséen theorem.
Unfortunately, despite the presence of suggestive relationships such as Exercise 3, the asymptotic universality of the distances does not directly imply asymptotic universality of the least singular value. However, it turns out that one can obtain a higher-dimensional version of the universality of the scalar quantities , as follows. For any small (say, for some small ) and any distinct , a modification of the above argument shows that the covariance matrix
of the orthogonal projections of the rows to the complement of the space spanned by the other rows of , is also universal, converging in distribution to the covariance matrix of iid gaussians (note that the convergence of to is the case of this claim). (These covariance matrix distributions are also known as Wishart distributions.) The key point is that one can show that the complement is usually “incompressible” in a certain technical sense, which implies that the projections behave like iid gaussians on that projection thanks to a multidimensional Berry-Esséen theorem.
On the other hand, the covariance matrix (9) is closely related to the inverse matrix :
Exercise 5 Show that (9) is also equal to , where is the matrix formed from the columns of .
In particular, this shows that the singular values of randomly selected columns of have a universal distribution.
Recall our goal is to show that has an asymptotically universal distribution, which is equivalent to asking that has an asymptotically universal distribution. The goal is then to extract the operator norm of from looking at a random minor of this matrix. This comes from the following application of the second moment method:
where is the Frobenius norm.
Recall from Exercise 3 that , so we expect each to have magnitude about . This, together with the Hoeffman-Wielandt inequality (Notes 3b) means that we expect to differ by from . In principle, this gives us asymptotic universality on from the already established universality of .
There is one technical obstacle remaining, however: while we know that each is distributed like a Gaussian, so that each individual is going to be of size with reasonably good probability, in order for the above exercise to be useful, one needs to bound all of the simultaneously with high probability. A naive application of the union bound leads to terrible results here. Fortunately, there is a strong correlation between the : they tend to be large together or small together, or equivalently that the distances tend to be small together or large together. Here is one indication of this:
Lemma 7 For any , one has
where is the orthogonal projection onto the space spanned by .
Proof: We may relabel so that ; then projecting everything by we may assume that . Our goal is now to show that
Recall that is a dual basis to . This implies in particular that
for any vector ; applying this to we obtain
and hence by the triangle inequality
Using the fact that , the claim follows.
In practice, once gets moderately large (e.g. for some small ), one can control the expressions appearing here by Talagrand’s inequality (Notes 1), and so this inequality tells us that once is bounded away from zero for , it is bounded away from zero for all other also. This turns out to be enough to get enough uniform control on the to make Exercise 6 useful, and ultimately to complete the proof of Theorem 6.