Now we turn attention to another important spectral statistic, the *least singular value* of an matrix (or, more generally, the least non-trivial singular value of a matrix with ). This quantity controls the invertibility of . Indeed, is invertible precisely when is non-zero, and the operator norm of is given by . This quantity is also related to the condition number of , which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of (and more generally, of the shifts for complex ) will be of importance in rigorously establishing the *circular law* for iid random matrices , as it plays a key role in computing the *Stieltjes transform* of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.

The least singular value

which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm

at the “soft edge” of the spectrum, that was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of , but are not able to control the “generic” or “incompressible” choices of , for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with for , it gives an upper bound of for the singular value with high probability, thanks to the *Marchenko-Pastur law*), but again this method begins to break down for square matrices, although one can make some partial headway by considering *negative* moments such as , though these are more difficult to compute than positive moments .

So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the rows of the matrix , and the hyperplane spanned by the other rows. The reason for this is as follows. First suppose that , so that is non-invertible, and there is a linear dependence between the rows . Thus, one of the will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so .

More generally, if the least singular value is small, one generically expects many of these distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row and the hyperplane spanned by the other rows. This latter quantity is basically the dot product of with a unit normal of this hyperplane.

When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal (which depends on all the rows other than ) is *independent* of , so even after conditioning to be fixed, the entries of remain independent. As such, the dot product is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)

These methods rely quite strongly on the joint independence on all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).

To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the *Bernoulli ensemble* of random sign matrices, where are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.

** — 1. The epsilon-net argument — **

We begin by using the epsilon net argument to establish a lower bound in the rectangular case, due to Litvak, Pajor, Rudelson, and Tomczak-Jaegermann:

Theorem 1 (Lower bound)Let be an Bernoulli matrix, where for some (independent of ). Then with exponentially high probability (i.e. for some ), one has , where depends only on .

This should be compared with the upper bound

for some absolute constant that holds with overwhelming probability from Notes 3 (indeed, one can take any here).

We use the epsilon-net method introduced in Notes 3, but with a smaller value of than used for the largest singular value. We write

Taking to be a maximal -net of the unit sphere in , with to be chosen later, we have that

and thus by (1), we have with overwhelming probability that

and so it suffices to show that

is exponentially small in . From the union bound, we can upper bound this by

From the volume packing argument we have

So we need to upper bound, for each , the probability

If we let be the rows of , we can write this as

By Markov’s inequality, the only way that this event can hold is if we have

for at least values of . We do not know in advance what the set of is for which this event holds. But the number of possible values of such sets of is at most . Applying the union bound (and paying the entropy cost of ) and using symmetry, we may thus bound the above probability by

Now observe that the random variables are independent, and so we can bound this expression by

where is a random vector of iid Bernoulli signs.

We write , so that is a random walk

To understand this walk, we apply (a slight variant) of the Berry-Esséen theorem from Notes 2:

Exercise 1Show thatfor any and any non-zero . (Actually, for the purposes of this set of notes, it would suffice to establish a weaker form of the Berry-Esséen theorem with replaced by for any fixed ). (

Hint:first normalise , then adapt the proof of the Berry-Esséen theorem.)Conclude in particular that if

for some then

(

Hint:condition out all the with .)

Let us temporarily call *incompressible* if

for somme to be chosen later, and *compressible* otherwise. If we only look at the incompressible elements of , we can now bound

and comparing this against the entropy cost (2) we obtain an acceptable contribution for small enough, and small enough depending on (here we are crucially using the rectangular condition ).

It remains to deal with the compressible vectors. Observe that such vectors lie within of a *sparse* unit vector which is only supported in at most positions. The -entropy of these sparse vectors (i.e. the number of balls of radius needed to cover this space) can easily be computed to be of polynomial size in . Meanwhile, we have the following crude bound:

Exercise 2For any unit vector , show thatfor small enough. (

Hint:Use the Paley-Zygmund inequality. Bounds on higher moments on can be obtained for instance using Hoeffding’s inequality, or by direct computation.) Use this to show thatfor all such and sufficiently small, with independent of and .

Thus the compressible vectors give a net contribution of , which is acceptable. This concludes the proof of Theorem 1.

** — 2. Singularity probability — **

Now we turn to square Bernoulli matrices . Before we investigate the size of the least singular value, we first tackle the easier problem of bounding the *singularity probability*

i.e. the probability that is not invertible. The problem of computing this probability exactly is still not completely settled. Since is singular whenever the first two rows (say) are identical, we obtain a lower bound

and it is conjectured that this bound is essentially tight in the sense that

but this remains open; the best bound currently is due to Bourgain, Vu, and Wood, and gives

We will not prove this bound here, but content ourselves with a weaker bound, essentially due to Komlós:

Proposition 2We have .

To show this, we need the following combinatorial fact, due to Erdös:

Proposition 3 (Erdös Littlewood-Offord theorem)Let be a vector with at least nonzero entries, and let be a random vector of iid Bernoulli signs. Then .

*Proof:* By taking real and imaginary parts we may assume that is real. By eliminating zero coefficients of we may assume that ; reflecting we may then assume that all the are positive. Observe that the set of with forms an antichain in . The claim now easily follows from Sperner’s theorem and Stirling’s formula.

Note that we also have the obvious bound

Now we prove the theorem. In analogy with the arugments of the previou ssection, we write

(actually we can take since is real). We divide into compressible and incompressible vectors as before, but our definition of compressibility and incompressibility is slightly different now. Also, one has to do a certain amount of technical maneuvering in order to preserve the crucial independence between rows and columns.

Namely, we pick an and call *compressible* if it is supported on at most coordinates, and *incompressible* otherwise.

Let us first consider the contribution of the event that for some nonzero compressible . Pick an with this property which is as sparse as possible, say sparse for some . Let us temporarily fix . By paying an entropy cost of , we may assume that it is the first entries that are non-zero for some . This implies that the first columns of have a linear dependence given by ; by minimality, are linearly independent. Thus, is uniquely determined (up to scalar multiples) by . Furthermore, as the matrix formed by has rank , there is some minor which already determines up to constants; by paying another entropy cost of , we may assume that it is the top left minor which does this. In particular, we can now use the first rows to determine up to constants. But the remaining rows are independent of and still need to be orthogonal to ; by Proposition 3 and (3), this happens with probability at most , giving a total cost of

which by Stirling’s formula is acceptable (in fact this gives an exponentially small contribution).

The same argument gives that the event that for some nonzero compressible also has exponentially small probability. The only remaining event to control is the event that for some incompressible , but that and for all nonzero compressible . Call this event .

Since for some incompressible , we see that for at least values of , the row lies in the vector space spanned by the remaining rows of . Let denote the event that holds, and that lies in ; then we see from double counting that

By symmetry, we thus have

To compute , we freeze consider a normal vector to ; note that we can select depending only on . We may assume that an incompressible normal vector exists, since otherwise the event would be empty. We make the crucial observation that is still independent of . By Proposition 3, we thus see that the conditional probability that , for fixed , is . We thus see that , and the claim follows.

Remark 1Further progress has been made on this problem by a finer analysis of the concentration probability , and in particular in classifying those for which this concentration probability is large (this is known as theinverse Littlewood-Offord problem). Important breakthroughs in this direction were made by Halász (introducing Fourier-analytic tools) and by Kahn, Komlós, and Szemerédi (introducing an efficient “swapping” argument). Van Vu and I introduced tools from additive combinatorics (such as Freiman’s theorem) to obtain further improvements, leading eventually to the results of Bourgain, Vu, and Wood mentioned earlier.

** — 3. Lower bound for the least singular value — **

Now we return to the least singular value of an iid Bernoulli matrix, and establish a lower bound. Given that there are singular values between and , which is typically of size , one expects the least singular value to be of size about on the average. Another argument supporting this heuristic scomes from the following identity:

Exercise 3 (Negative second moment identity)Let be an invertible matrix, let be the rows of , and let be the columns of . For each , let be the hyperplane spanned by all the rows other than . Show that and .

From Talagrand’s inequality (Notes 1), we expect each to be of size on the average, which suggests that ; this is consistent with the heuristic that the eigenvalues should be roughly evenly spaced in the interval (so that should be about ).

Now we give a rigorous lower bound:

Theorem 4 (Lower tail estimate for the least singular value)For any , one haswhere goes to zero as uniformly in , and goes to zero as for each fixed .

This is a weaker form of a result of Rudelson and Vershynin (which obtains a bound of the form for some ), which builds upon earlier work of Rudelson and of Vu and myself, which obtained variants of the above result.

The scale that we are working at here is too fine to use epsilon net arguments (unless one has a *lot* of control on the entropy, which can be obtained in some cases thanks to powerful inverse Littlewood-Offord theorems, but is difficult to obtain in general.) We can prove this theorem along similar lines to the arguments in the previous section; we sketch the method as follows. We can take to be small. We write the probability to be estimated as

We can assume that for some absolute constant , as the event that this fails has exponentially small probability.

We pick an (not depending on ) to be chosen later. We call a unit vector *compressible* if lies within a distance of a -sparse vector. Let us first dispose of the case in which for some compressible . By paying an entropy cost of , we may assume that is within of a vector supported in the first coordinates. Using the operator norm bound on and the triangle inequality, we conclude that

Since has norm comparable to , this implies that the least singular value of the first columns of is . But by Theorem 1, this occurs with probability (if are small enough). So the total probability of the compressible event is at most , which is acceptable if is small enough.

Thus we may assume now that for all compressible unit vectors ; we may similarly assume that for all compressible unit vectors . Indeed, we may also assume that for every , where is with the column removed.

The remaining case is if for some incompressible . Let us call this event . Write , and let be the column of , thus

Letting be the subspace spanned by all the except for , we conclude upon projecting to the orthogonal complement of that

for all (compare with Exercise 3). On the other hand, since is incompressible, we see that for at least values of , and thus

for at least values of . If we let be the event that and (4) both hold, we thus have from double-counting that

and thus by symmetry

(say). However, if holds, then setting to be a unit normal vector to (which is necessarily incompressible, by the hypothesis on ), we have

Again, the crucial point is that and are independent. The incompressibility of , combined with a Berry-Esséen type theorem, then gives

Exercise 4Show that(say) if is sufficiently small depending on , and is sufficiently large depending on .

This gives a bound of for if is small enough depending on , and is large enough; this gives the claim.

Remark 2A variant of these arguments, based on inverse Littlewood-Offord theorems rather than the Berry-Esséen theorem, gives the variant estimatewith high probability for some , and any of polynomial size in . There are several results of this type, with overlapping ranges of generality (and various values of ) by Götze-Tikhimirov, Pan-Zhou, and Vu and myself; the exponent is known to degrade if one has too few moment assumptions on the underlying random matrix . This type of result (with an unspecified ) is important for the circular law, discussed in the next set of lectures.

** — 4. Upper bound for the least singular value — **

One can complement the lower tail estimate with an upper tail estimate:

Theorem 5 (Upper tail estimate for the least singular value)For any , one has

We prove this using an argument of Rudelson and Vershynin. Suppose that , then

Next, let be the rows of , and let be the columns of , thus is a dual basis for . From (7) we have

We apply this with equal to , where is the orthogonal projection to the space spanned by . On the one hand, we have

and on the other hand we have for any that

If (8) holds, then for at least half of the , so the probability in (6) can be bounded by

which by symmetry can be bounded by

Let be a small quantity to be chosen later. From Talagrand’s inequality (Notes 1) we know that with probability , so we obtain a bound of

Now a key point is that the vectors depend only on and not on ; indeed, they are the dual basis for in . Thus, after conditioning and thus to be fixed, is still a Bernoulli random vector. Applying a Berry-Esséen inequality, we obtain a bound of for the conditional probability that for sufficiently small depending on , unless is compressible (in the sense that, say, it is within of an -sparse vector). But this latter possibility can be controlled (with exponentially small probability) by the same type of arguments as before; we omit the details.

** — 5. Asymptotic for the least singular value — **

The distribution of singular values of a gaussian random matrix can be computed explicitly by techniques similar to those employed in Notes 6. In particular, if is a real gaussian matrix (with all entries iid with distribution ), it was shown by Edelman that converges in distribution to the distribution as . It turns out that this result can be extended to other ensembles with the same mean and variance. In particular, we have the following result of Van Vu and myself:

Theorem 6If is an iid Bernoulli matrix, then also converges in distribution to as . (In fact there is a polynomial rate of convergence.)

This should be compared with Theorems 4, 5, which show that have a tight sequence of distributions in . The arguments of Van and myself thus provide an alternate proof of these two theorems.

We do not establish the limit directly, but instead use Edelman’s result as a black box, focusing instead on establishing the *universality* of the limiting distribution of , and in particular that this limiting distribution is the same whether one has a Bernoulli ensemble or a gaussian ensemble.

The arguments are somewhat technical and we will not present them in full here, but instead give a sketch of the key ideas.

In previous sections we have already seen the close relationship between the least singular value , and the distances between a row of and the hyperplane spanned by the other rows. It is not hard to use the above machinery to show that as , converges in distribution to the absolute value of a Gaussian regardless of the underlying distribution of the coefficients of (i.e. it is asymptotically universal). The basic point is that one can write as where is a unit normal of (we will assume here that is non-singular, which by previous arguments is true asymptotically almost surely). The previous machinery lets us show that is incompressible with high probability, and then claim then follows from the Berry-Esséen theorem.

Unfortunately, despite the presence of suggestive relationships such as Exercise 3, the asymptotic universality of the distances does not directly imply asymptotic universality of the least singular value. However, it turns out that one can obtain a higher-dimensional version of the universality of the scalar quantities , as follows. For any small (say, for some small ) and any distinct , a modification of the above argument shows that the covariance matrix

of the orthogonal projections of the rows to the complement of the space spanned by the other rows of , is also universal, converging in distribution to the covariance matrix of iid gaussians (note that the convergence of to is the case of this claim). (These covariance matrix distributions are also known as *Wishart distributions*.) The key point is that one can show that the complement is usually “incompressible” in a certain technical sense, which implies that the projections behave like iid gaussians on that projection thanks to a multidimensional Berry-Esséen theorem.

On the other hand, the covariance matrix (9) is closely related to the inverse matrix :

Exercise 5Show that (9) is also equal to , where is the matrix formed from the columns of .

In particular, this shows that the singular values of randomly selected columns of have a universal distribution.

Recall our goal is to show that has an asymptotically universal distribution, which is equivalent to asking that has an asymptotically universal distribution. The goal is then to extract the operator norm of from looking at a random minor of this matrix. This comes from the following application of the second moment method:

Exercise 6Let be an matrix with columns , and let be the matrix formed by taking of the columns at random. Show thatwhere is the Frobenius norm.

Recall from Exercise 3 that , so we expect each to have magnitude about . This, together with the Hoeffman-Wielandt inequality (Notes 3b) means that we expect to differ by from . In principle, this gives us asymptotic universality on from the already established universality of .

There is one technical obstacle remaining, however: while we know that each is distributed like a Gaussian, so that each individual is going to be of size with reasonably good probability, in order for the above exercise to be useful, one needs to bound *all* of the *simultaneously* with high probability. A naive application of the union bound leads to terrible results here. Fortunately, there is a strong correlation between the : they tend to be large together or small together, or equivalently that the distances tend to be small together or large together. Here is one indication of this:

Lemma 7For any , one haswhere is the orthogonal projection onto the space spanned by .

*Proof:* We may relabel so that ; then projecting everything by we may assume that . Our goal is now to show that

Recall that is a dual basis to . This implies in particular that

for any vector ; applying this to we obtain

and hence by the triangle inequality

Using the fact that , the claim follows.

In practice, once gets moderately large (e.g. for some small ), one can control the expressions appearing here by Talagrand’s inequality (Notes 1), and so this inequality tells us that once is bounded away from zero for , it is bounded away from zero for all other also. This turns out to be enough to get enough uniform control on the to make Exercise 6 useful, and ultimately to complete the proof of Theorem 6.

## 15 comments

Comments feed for this article

6 March, 2010 at 11:58 am

andrescaicedoDear Terry, one of your tags reads “Littlewood-Offord theoremslewood-Offord theorems”.

[Corrected, thanks – T.]8 March, 2010 at 8:24 pm

Mads SørensenMr. Tao.

I believe something is missing in Theorem~5; what does one have for ? :-)

[Corrected, thanks – T.]14 March, 2010 at 11:33 am

254A, Notes 8: The circular law « What’s new[…] asymptotically almost surely for the least singular value of . This has the effect of capping off the integrand to be of size . Next, by using Stieltjes transform methods, the convergence of to in an appropriate metric (e.g. the Levi distance metric) was shown to be polynomially fast, so that the distance decayed like for some . The gain can safely absorb the loss, and this leads to a proof of the circular law assuming enough boundedness and continuity hypotheses to ensure the least singular value bound and the convergence rate. This basic paradigm was also followed by later works such as that of Gotze-Tikhomirov, Pan-Zhou, and Tao-Vu, with the main new ingredient being the advances in the understanding of the least singular value (Notes 7). […]

23 March, 2010 at 1:29 pm

GabrielDear Terry,

Given an n\times n matrix A (not necessarily hermitian), what is the relation between the p-Schatten norm (0<p<\infty) and the l_{p} norm of the rows and columns?

How exercise 3 can be rewritten in this terms?

Thanks!

24 March, 2010 at 6:27 pm

Terence TaoThe Schatten norm only has a nice relationship with the norms of the rows and columns when p=2; the 2-Schatten norm (also known as the Hilbert-Schmidt norm, or Frobenius norm) is also the l^2 norm of the matrix entries.

The infinity-Schatten norm is the operator norm, which by Schur’s test is bounded by the geometric mean of the maximal l^1 norm of the rows, and the maximal l^1 norm of the columns. Presumably one can then interpolate with the 2-Schatten norm to get some inequalities for the p-Schatten norm for 2 <= p <= infty (a sort of non-commutative Hausdorff-Young inequality), but that's more or less the only general inequality I am aware of. (Note that for circulant matrices, the p-Schatten norm is basically the l^p norm of the Fourier transform of a row, and the only l^p estimates obeyed by the Fourier transform are given by the Hausdorff-Young inequality, so this suggests that there are no further general inequalities of this type beyond these ones.)

6 March, 2011 at 12:07 am

MustafaSuppose that are independent matrices that are chosen with uniform distribution on the sphere . I want to find a lower bound on

How do I juggle and ?

23 March, 2010 at 1:54 pm

GabrielI also think there is a typo in exercise 3. Shouldn’t it be

$$

\sum_{i=1}^{n}{\sigma_{i}(M)^{-2}}=\sum_{i=1}^{n}{dist(X_{i},V_{i})^{-2}}?

$$

Thanks!

[Corrected, thanks – T.]9 August, 2010 at 5:20 pm

SujitThe line above Hint in Exercise 1 needs to be fixed ($t$ missing)

[Corrected, thanks – T.]13 August, 2010 at 8:16 pm

weiyuHi,

In the paragraph right above Section 5, when you write for the dual basis vectors, do you mean ?

There might be some latex typing error in the formulation of Theorem 6.

Wonderful notes! Especially studying the smallest singular value can be reduced to studying the projection distances.That is beautiful.

[Corrected, thanks – T.]12 May, 2013 at 8:49 pm

Nick CookIn the proof of Proposition 2, the sparse case, for the final sum to be exponentially small we will need the implied constant from the Erdos Littlewood-Offord bound to be less than 1, or alternatively we can use (3) for small values of .

In exercise 1, I think the denominator of the first term should be , not .

Also in that exercise, I see how a combination of having a lot of energy at low levels and conditioning out large components should do the trick, but I don’t see how the specific assumption can work. It seems that only conditioning on for which will still allow the third moment term in the Berry-Esseen bound to be large. If we freeze all variables with , this term is under control, but the first term $\latex r/||x||$ is of order .

It can be done by saying that is incompressible if , and freezing all variables with . This gives a final bound rather than , but this is still enough.

Typo under exercise 1: flip the inequality in the definition of incompressible vs compressible.

13 May, 2013 at 11:50 am

Terence TaoThanks for the corrections!

14 May, 2013 at 12:06 pm

Nick CookTwo more problems:

One is with the correction I just gave! Getting a bound of size is actually not sufficient to beat the entropy cost of size from the net. If we define to be -compressible when , then we get a bound .

The second problem is with the bound written above exercise 1. With this exponent of we still won’t be able to beat the entropy cost (when at least). We can replace it with a bound valid for any positive . Taking large enough depending on , we can beat the entropy factor.

Then for the compressible vectors, these all lie within distance of a -sparse vector, so we can use a -net on the compressible vectors. Using this net, one is able to solve Exercise 2, in the course of which one will fix sufficiently small, after which one can go back and take sufficiently small depending on and .

I think this works – sorry to introduce 2 new parameters! I didn’t see a cleaner way (maybe LiPaRuTo have one).

[Corrections implemented, thanks – T.]2 February, 2018 at 6:17 pm

JohnProf. Tao,

Is that possible to achieve stronger singularity probability by replacing random sign with uniform integers in [-t, t] for a large enough t?

I guess in the paper by Bourgain, Vu and Wood they studied the case when t = O(1). But in general can we hope for the singularity probability to be t^{-\Omega(n)} for arbitrarily large t?

8 May, 2019 at 7:19 pm

keejIn Exercise 3 (the negative second moment identity), should the power of dist be -2 instead of 2? Then the identity would scale correctly under M -> tM.

[Corrected, thanks – T.]10 March, 2020 at 12:41 pm

AnonymousI am wondering if there are studies about the 2-norm (or largest singular value) of a rectangular random matrix M, but with strong dependence structure. Say, M is n by k and is generated by f(X, Y), where one sample iid n samples from the random variable X and k iid samples from Y