Now we turn attention to another important spectral statistic, the least singular value $\sigma_n(M)$ of an $n \times n$ matrix $M$ (or, more generally, the least non-trivial singular value $\sigma_p(M)$ of an $n \times p$ matrix $M$ with $p \leq n$). This quantity controls the invertibility of $M$. Indeed, $M$ is invertible precisely when $\sigma_n(M)$ is non-zero, and the operator norm $\|M^{-1}\|_{op}$ of $M^{-1}$ is given by $1/\sigma_n(M)$. This quantity is also related to the condition number $\sigma_1(M)/\sigma_n(M) = \|M\|_{op} \|M^{-1}\|_{op}$ of $M$, which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of $M$ (and more generally, of the shifts $\frac{1}{\sqrt{n}} M - zI$ for complex $z$) will be of importance in rigorously establishing the circular law for iid random matrices $M$, as it plays a key role in computing the Stieltjes transform $\frac{1}{n} \hbox{tr}( \frac{1}{\sqrt{n}} M - zI )^{-1}$ of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.
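As a quick illustration of these definitions, here is a small numpy sanity check (an illustrative sketch only; the dimension and the Bernoulli ensemble are arbitrary choices, not prescribed by the text) that the least singular value, the norm of the inverse, and the condition number are related as just described.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# A random sign (Bernoulli) matrix, the model ensemble used in these notes.
M = rng.choice([-1.0, 1.0], size=(n, n))

sing = np.linalg.svd(M, compute_uv=False)    # singular values, in decreasing order
sigma_1, sigma_n = sing[0], sing[-1]

# Operator norm of the inverse equals 1/sigma_n (when M is invertible).
inv_norm = np.linalg.norm(np.linalg.inv(M), ord=2)
print(np.isclose(inv_norm, 1.0 / sigma_n))

# Condition number sigma_1/sigma_n, as reported by numpy.
print(np.isclose(np.linalg.cond(M, p=2), sigma_1 / sigma_n))
```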
The least singular value $\sigma_n(M)$, which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm $\|M\|_{op} = \sigma_1(M)$ at the “soft edge” of the spectrum, which was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of the unit vector $x$, but it is not able to control the “generic” or “incompressible” choices of $x$, for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with $p = \lfloor y n \rfloor$ for some fixed $0 < y < 1$, it gives an upper bound of $(1 - \sqrt{y} + o(1)) \sqrt{n}$ for the least singular value with high probability, thanks to the Marchenko-Pastur law), but again this method begins to break down for square matrices, although one can make some partial headway by considering negative moments such as $\hbox{tr} M^{-2}$, though these are more difficult to compute than positive moments $\hbox{tr} M^{2k}$.
So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the rows $X_i$ of the matrix $M$, and the hyperplane spanned by the other $n-1$ rows. The reason for this is as follows. First suppose that $\sigma_n(M) = 0$, so that $M$ is non-invertible, and there is a linear dependence between the rows $X_1,\ldots,X_n$. Thus, one of the $X_i$ will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the $n$ distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so $\sigma_n(M) = 0$.
More generally, if the least singular value $\sigma_n(M)$ is small, one generically expects many of these $n$ distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row $X_i$ and the hyperplane spanned by the other rows. This latter quantity is basically the dot product $X_i \cdot n_i$ of $X_i$ with a unit normal $n_i$ of this hyperplane.
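To make this identity concrete, here is a small numerical check (an illustrative sketch; the dimension and the choice of distinguished row are arbitrary) that the distance from a row to the span of the other rows equals the absolute dot product with a unit normal of that span.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
M = rng.choice([-1.0, 1.0], size=(n, n))

i = 0                                   # distinguished row
Xi = M[i]
others = np.delete(M, i, axis=0)        # the other n-1 rows

# Distance from X_i to the span V of the other rows, via least-squares projection.
coeffs, *_ = np.linalg.lstsq(others.T, Xi, rcond=None)
dist = np.linalg.norm(Xi - others.T @ coeffs)

# A unit normal to V: a unit vector in the kernel of the (n-1) x n matrix of other rows.
_, _, Vt = np.linalg.svd(others)
normal = Vt[-1]

print(np.isclose(dist, abs(Xi @ normal)))
```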
When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal $n_i$ (which depends on all the rows other than $X_i$) is independent of $X_i$, so even after conditioning $n_i$ to be fixed, the entries of $X_i$ remain independent. As such, the dot product $X_i \cdot n_i$ is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal $n_i$ is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)

These methods rely quite strongly on the joint independence of all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).

To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the Bernoulli ensemble $M = (\xi_{ij})_{1 \leq i,j \leq n}$ of random sign matrices, where the $\xi_{ij} = \pm 1$ are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.
— 1. The epsilon-net argument —
We begin by using the epsilon-net argument to establish a lower bound in the rectangular case, due to Litvak, Pajor, Rudelson, and Tomczak-Jaegermann:

Theorem 1 (Lower bound) Let $M$ be an $n \times p$ Bernoulli matrix, where $p \leq (1-\delta) n$ for some $\delta > 0$ (independent of $n$). Then with exponentially high probability (i.e. $1 - O(e^{-cn})$ for some $c > 0$), one has $\sigma_p(M) \geq c \sqrt{n}$, where $c > 0$ depends only on $\delta$.
This should be compared with the upper bound
$$ \|M\|_{op} \leq C \sqrt{n} \ \ \ \ \ (1)$$
for some absolute constant $C$ that holds with overwhelming probability, from Notes 3 (indeed, one can take any $C > 2$ here).
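As an illustrative numerical check (a sketch only; the aspect ratio $p = n/2$, sample sizes and dimensions are arbitrary choices), one can sample rectangular Bernoulli matrices and observe that $\sigma_p(M)/\sqrt{n}$ stays bounded away from zero while $\|M\|_{op}/\sqrt{n}$ stays bounded above, consistent with Theorem 1 and the bound (1).

```python
import numpy as np

rng = np.random.default_rng(2)

for n in (100, 200, 400):
    p = n // 2                                   # aspect ratio p = n/2, i.e. delta = 1/2
    ratios = []
    for _ in range(10):
        M = rng.choice([-1.0, 1.0], size=(n, p))
        s = np.linalg.svd(M, compute_uv=False)
        ratios.append((s[-1] / np.sqrt(n), s[0] / np.sqrt(n)))
    ratios = np.array(ratios)
    # Least and largest singular values, normalised by sqrt(n): both stay of size O(1).
    print(n, ratios[:, 0].min().round(3), ratios[:, 1].max().round(3))
```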
We use the epsilon-net method introduced in Notes 3, but with a smaller value of $\epsilon > 0$ than was used for the largest singular value. We write
$$ \sigma_p(M) = \inf_{x \in S^{p-1}} \|Mx\|.$$
Taking $\Sigma$ to be a maximal $\epsilon$-net of the unit sphere $S^{p-1}$, with $\epsilon > 0$ to be chosen later, we have that
$$ \sigma_p(M) \geq \inf_{y \in \Sigma} \|My\| - \epsilon \|M\|_{op}$$
and thus by (1), we have with overwhelming probability that
$$ \sigma_p(M) \geq \inf_{y \in \Sigma} \|My\| - C \epsilon \sqrt{n},$$
and so it suffices to show that the probability
$$ {\bf P}( \inf_{y \in \Sigma} \|My\| \leq 2 C \epsilon \sqrt{n} )$$
is exponentially small in $n$. From the union bound, we can upper bound this by
$$ \sum_{y \in \Sigma} {\bf P}( \|My\| \leq 2 C \epsilon \sqrt{n} ).$$
From the volume packing argument we have
$$ |\Sigma| \leq O(1/\epsilon)^p. \ \ \ \ \ (2)$$
So we need to upper bound, for each $y \in \Sigma$, the probability
$$ {\bf P}( \|My\| \leq 2 C \epsilon \sqrt{n} ).$$
If we let $Y_1,\ldots,Y_n \in {\bf R}^p$ be the rows of $M$, we can write this as
$$ {\bf P}( \sum_{j=1}^n |Y_j \cdot y|^2 \leq 4 C^2 \epsilon^2 n ).$$
By Markov's inequality, for any parameter $A \geq 1$ (to be chosen later, depending only on $\delta$), the only way that this event can hold is if we have
$$ |Y_j \cdot y| \leq 2 \sqrt{A} C \epsilon$$
for at least $(1 - 1/A) n$ values of $j$. We do not know in advance what the set of $j$ is for which this event holds. But the number of possible values of such sets of $j$ is at most $2^n$. Applying the union bound (and paying the entropy cost of $2^n$) and using symmetry, we may thus bound the above probability by
$$ 2^n {\bf P}( |Y_j \cdot y| \leq 2 \sqrt{A} C \epsilon \hbox{ for all } 1 \leq j \leq (1-1/A) n ).$$
Now observe that the random variables $Y_j \cdot y$ are jointly independent, and so we can bound this expression by
$$ 2^n {\bf P}( |Y \cdot y| \leq 2 \sqrt{A} C \epsilon )^{(1-1/A) n}$$
where $Y = (\xi_1,\ldots,\xi_p)$ is a random vector of iid Bernoulli signs.

We write $y = (y_1,\ldots,y_p)$, so that $Y \cdot y$ is a random walk
$$ Y \cdot y = \xi_1 y_1 + \ldots + \xi_p y_p.$$
To understand this walk, we apply (a slight variant of) the Berry-Esséen theorem from Notes 2:

Exercise 1 Show that
$$ {\bf P}( |Y \cdot y - t| \leq r ) \ll \frac{r}{\|y\|} + \frac{1}{\|y\|^3} \sum_{j=1}^p |y_j|^3$$
for any $r > 0$, any $t \in {\bf R}$, and any non-zero $y \in {\bf R}^p$. (Actually, for the purposes of this set of notes, it would suffice to establish a weaker form of the Berry-Esséen theorem with the error term $\frac{1}{\|y\|^3} \sum_{j=1}^p |y_j|^3$ replaced by $(\frac{1}{\|y\|^3} \sum_{j=1}^p |y_j|^3)^c$ for any fixed $c > 0$.) (Hint: first normalise $\|y\| = 1$, then adapt the proof of the Berry-Esséen theorem.)

Conclude in particular that if
$$ \sum_{j: |y_j| \leq \epsilon_1} |y_j|^2 \geq \epsilon_2^2$$
for some $\epsilon_1, \epsilon_2 > 0$, then
$$ {\bf P}( |Y \cdot y - t| \leq r ) \ll_{\epsilon_2} r + \epsilon_1$$
for any $r > 0$ and $t \in {\bf R}$. (Hint: condition out all the $y_j$ with $|y_j| > \epsilon_1$.)
Let us temporarily call $y$ incompressible if
$$ \sum_{j: |y_j| \leq \epsilon_1} |y_j|^2 \geq \epsilon_2^2$$
for some $\epsilon_1, \epsilon_2 > 0$ to be chosen later, and compressible otherwise. If we only look at the incompressible elements $y$ of $\Sigma$, we can now bound
$$ {\bf P}( |Y \cdot y| \leq 2 \sqrt{A} C \epsilon ) \ll_{A,\epsilon_2} \epsilon + \epsilon_1,$$
and comparing this against the entropy cost (2) we obtain an acceptable contribution for $\epsilon_1 \leq \epsilon$, $A$ large enough depending on $\delta$, and $\epsilon$ small enough depending on $A$, $\delta$, $\epsilon_2$ (here we are crucially using the rectangular condition $p \leq (1-\delta) n$).
It remains to deal with the compressible vectors $y$. Observe that such vectors lie within $O(\epsilon_2)$ of a sparse unit vector which is only supported in at most $1/\epsilon_1^2$ positions. The $\epsilon_2$-entropy of these sparse vectors (i.e. the number of balls of radius $\epsilon_2$ needed to cover this space) can easily be computed to be of polynomial size $O_{\epsilon_1,\epsilon_2}( n^{O(1/\epsilon_1^2)} )$ in $n$. Meanwhile, we have the following crude bound:

Exercise 2 For any unit vector $y$, show that
$$ {\bf P}( |Y \cdot y| \geq \kappa ) \geq \kappa$$
for $\kappa > 0$ small enough (independently of $p$). (Hint: Use the Paley-Zygmund inequality. Bounds on higher moments of $|Y \cdot y|^2$ can be obtained for instance using Hoeffding's inequality, or by direct computation.) Use this to show that
$$ {\bf P}( \|My\| \leq \epsilon' \sqrt{n} ) \leq e^{-c' n}$$
for all such $y$ and all $\epsilon' > 0$ sufficiently small, with $c' > 0$ independent of $\epsilon'$ and $n$.

Thus, using (1) and the triangle inequality to pass from the compressible vectors to the sparse net, the compressible vectors give a net contribution of $O_{\epsilon_1,\epsilon_2}( n^{O(1/\epsilon_1^2)} ) \times e^{-c' n}$, which is acceptable (provided $\epsilon$ and $\epsilon_2$ are small enough). This concludes the proof of Theorem 1.
— 2. Singularity probability —
Now we turn to square Bernoulli matrices $M = (\xi_{ij})_{1 \leq i,j \leq n}$. Before we investigate the size of the least singular value, we first tackle the easier problem of bounding the singularity probability
$$ {\bf P}( \sigma_n(M) = 0 ),$$
i.e. the probability that $M$ is not invertible. The problem of computing this probability exactly is still not completely settled. Since $M$ is singular whenever the first two rows (say) are identical, we obtain a lower bound
$$ {\bf P}( \sigma_n(M) = 0 ) \geq \frac{1}{2^n},$$
and it is conjectured that this bound is essentially tight in the sense that
$$ {\bf P}( \sigma_n(M) = 0 ) = (\frac{1}{2} + o(1))^n,$$
but this remains open; the best bound currently is due to Bourgain, Vu, and Wood, and gives
$$ {\bf P}( \sigma_n(M) = 0 ) \leq (\frac{1}{\sqrt{2}} + o(1))^n.$$
We will not prove this bound here, but content ourselves with a weaker bound, essentially due to Komlós:
Proposition 2 We have ${\bf P}( \sigma_n(M) = 0 ) \ll n^{-1/2}$.
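For small $n$ one can estimate the singularity probability empirically. The sketch below (sample sizes and dimensions are arbitrary choices) counts singular sign matrices by exploiting the fact that the determinant of a sign matrix is an integer, and compares the frequency with the conjectured rate.

```python
import numpy as np

rng = np.random.default_rng(3)

def is_singular(n):
    # The determinant of a sign matrix is an integer, so for small n the floating
    # point determinant is either (near) 0 or at least 1 in absolute value.
    M = rng.choice([-1.0, 1.0], size=(n, n))
    return abs(np.linalg.det(M)) < 0.5

for n in (4, 6, 8, 10):
    trials = 20000
    freq = sum(is_singular(n) for _ in range(trials)) / trials
    print(n, freq, 0.5 ** n)   # compare with the conjectured rate (1/2 + o(1))^n
```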
To show this, we need the following combinatorial fact, due to Erdös:

Proposition 3 (Erdös Littlewood-Offord theorem) Let $v = (v_1,\ldots,v_n)$ be a vector with at least $k$ nonzero entries, and let $Y = (\xi_1,\ldots,\xi_n)$ be a random vector of iid Bernoulli signs. Then ${\bf P}( Y \cdot v = x ) \ll k^{-1/2}$ for any $x$.

Proof: By taking real and imaginary parts we may assume that $v$ is real. By eliminating zero coefficients of $v$ we may assume that $k = n$; reflecting we may then assume that all the $v_i$ are positive. Observe that the set of sign patterns $(\xi_1,\ldots,\xi_n) \in \{-1,+1\}^n$ with $Y \cdot v = x$ forms an antichain in $\{-1,+1\}^n$ (identified with the lattice of subsets of $\{1,\ldots,n\}$). The claim now easily follows from Sperner's theorem and Stirling's formula. $\Box$

Note that we also have the obvious bound
$$ {\bf P}( Y \cdot v = x ) \leq \frac{1}{2} \ \ \ \ \ (3)$$
for any non-zero $v$.
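The following quick simulation (an illustrative sketch; the coefficient vector of all ones is chosen because it is the extremal case, but any nonzero vector may be used) compares the largest atom of the random walk $Y \cdot v$ against the $O(n^{-1/2})$ bound of Proposition 3.

```python
import numpy as np
from collections import Counter
from math import comb

rng = np.random.default_rng(4)

n = 30
v = np.ones(n, dtype=int)        # the extremal case for the Erdos bound

trials = 200000
walks = (rng.choice([-1, 1], size=(trials, n)) * v).sum(axis=1)
max_atom = max(Counter(walks).values()) / trials

# Sperner bound: no atom of Y.v exceeds binom(n, floor(n/2)) / 2^n = O(n^{-1/2}).
print(round(max_atom, 4), round(comb(n, n // 2) / 2 ** n, 4))
```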
Now we prove the proposition. In analogy with the arguments of the previous section, we write
$$ {\bf P}( \sigma_n(M) = 0 ) = {\bf P}( Mx = 0 \hbox{ for some non-zero } x )$$
(actually we can take $x$ to be real, since $M$ is real). We divide into compressible and incompressible vectors as before, but our definition of compressibility and incompressibility is slightly different now. Also, one has to do a certain amount of technical maneuvering in order to preserve the crucial independence between rows and columns.

Namely, we pick an absolute constant $0 < \epsilon \leq 1/2$ and call $x$ compressible if it is supported on at most $\epsilon n$ coordinates, and incompressible otherwise.
Let us first consider the contribution of the event that $Mx = 0$ for some nonzero compressible $x$. Pick an $x$ with this property which is as sparse as possible, say $m$-sparse for some $1 \leq m \leq \epsilon n$. Let us temporarily fix $m$. By paying an entropy cost of $\binom{n}{m}$, we may assume that it is the first $m$ entries of $x$ that are non-zero. This implies that the first $m$ columns $R_1,\ldots,R_m$ of $M$ have a linear dependence given by $x$; by minimality, $R_1,\ldots,R_{m-1}$ are linearly independent. Thus, $x$ is uniquely determined (up to scalar multiples) by $R_1,\ldots,R_m$. Furthermore, as the $n \times (m-1)$ matrix formed by $R_1,\ldots,R_{m-1}$ has rank $m-1$, there is some $(m-1) \times (m-1)$ minor which already determines $x$ up to constants; by paying another entropy cost of $\binom{n}{m-1}$, we may assume that it is the top left minor which does this. In particular, we can now use the first $m-1$ rows $X_1,\ldots,X_{m-1}$ to determine $x$ up to constants. But the remaining $n-m+1$ rows are independent of $X_1,\ldots,X_{m-1}$ and still need to be orthogonal to $x$; by Proposition 3 and (3), this happens with probability at most $\min( O( m^{-1/2} ), \frac{1}{2} )^{n-m+1}$, giving a total cost of
$$ \sum_{1 \leq m \leq \epsilon n} \binom{n}{m} \binom{n}{m-1} \min( O( m^{-1/2} ), \frac{1}{2} )^{n-m+1},$$
which by Stirling's formula is acceptable (in fact this gives an exponentially small contribution).
The same argument gives that the event that $x^T M = 0$ for some nonzero compressible $x$ also has exponentially small probability. The only remaining event to control is the event that $Mx = 0$ for some incompressible $x$, but that $Mx \neq 0$ and $x^T M \neq 0$ for all nonzero compressible $x$. Call this event $E$.

Since $Mx = 0$ for some incompressible $x$ (and every linear dependence among the rows is likewise given by an incompressible coefficient vector, by the definition of $E$), we see that for at least $\epsilon n$ values of $i$, the row $X_i$ lies in the vector space $V_i$ spanned by the remaining $n-1$ rows of $M$. Let $A_i$ denote the event that $E$ holds, and that $X_i$ lies in $V_i$; then we see from double counting that
$$ {\bf P}(E) \leq \frac{1}{\epsilon n} \sum_{i=1}^n {\bf P}(A_i).$$
By symmetry, we thus have
$$ {\bf P}(E) \leq \frac{1}{\epsilon} {\bf P}(A_n).$$
To compute ${\bf P}(A_n)$, we freeze $X_1,\ldots,X_{n-1}$ and consider a normal vector $v$ to $V_n$; note that we can select $v$ depending only on $X_1,\ldots,X_{n-1}$. We may assume that an incompressible normal vector exists, since otherwise the event $A_n$ would be empty (any normal vector to $V_n$ would then be a compressible null vector of $M$, contradicting $E$). We make the crucial observation that $v$ is still independent of $X_n$. By Proposition 3, we thus see that the conditional probability that $X_n \cdot v = 0$, for fixed $X_1,\ldots,X_{n-1}$, is $O( (\epsilon n)^{-1/2} )$. We thus see that ${\bf P}(E) \ll_\epsilon n^{-1/2}$, and the claim follows.
Remark 1 Further progress has been made on this problem by a finer analysis of the concentration probability ${\bf P}( Y \cdot v = x )$, and in particular in classifying those $v$ for which this concentration probability is large (this is known as the inverse Littlewood-Offord problem). Important breakthroughs in this direction were made by Halász (introducing Fourier-analytic tools) and by Kahn, Komlós, and Szemerédi (introducing an efficient “swapping” argument). Van Vu and I introduced tools from additive combinatorics (such as Freiman’s theorem) to obtain further improvements, leading eventually to the results of Bourgain, Vu, and Wood mentioned earlier.
— 3. Lower bound for the least singular value —
Now we return to the least singular value $\sigma_n(M)$ of an iid Bernoulli matrix, and establish a lower bound. Given that there are $n$ singular values between $0$ and $\sigma_1(M)$, which is typically of size $O(\sqrt{n})$, one expects the least singular value to be of size about $n^{-1/2}$ on the average. Another argument supporting this heuristic comes from the following identity:
Exercise 3 (Negative second moment identity) Let $M$ be an invertible $n \times n$ matrix, let $X_1,\ldots,X_n$ be the rows of $M$, and let $R_1,\ldots,R_n$ be the columns of $M^{-1}$. For each $1 \leq i \leq n$, let $V_i$ be the hyperplane spanned by all the rows $X_1,\ldots,X_n$ other than $X_i$. Show that $\|R_i\| = \hbox{dist}(X_i, V_i)^{-1}$ for all $i$, and that
$$ \sum_{i=1}^n \sigma_i(M)^{-2} = \sum_{i=1}^n \hbox{dist}(X_i, V_i)^{-2}.$$
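Here is a quick numerical verification of this identity (an illustrative sketch; the dimension is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
M = rng.choice([-1.0, 1.0], size=(n, n))

sing = np.linalg.svd(M, compute_uv=False)
lhs = np.sum(sing ** -2.0)

# dist(X_i, V_i) computed as the norm of the residual of X_i after projecting
# onto the span of the other rows.
rhs = 0.0
for i in range(n):
    Xi = M[i]
    others = np.delete(M, i, axis=0).T           # n x (n-1), columns span V_i
    coeffs, *_ = np.linalg.lstsq(others, Xi, rcond=None)
    rhs += np.linalg.norm(Xi - others @ coeffs) ** -2.0

print(np.isclose(lhs, rhs))
```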
From Talagrand's inequality (Notes 1), we expect each $\hbox{dist}(X_i,V_i)$ to be of size $O(1)$ on the average, which suggests that $\sum_{i=1}^n \sigma_i(M)^{-2} = O(n)$; this is consistent with the heuristic that the singular values $\sigma_1(M) \geq \ldots \geq \sigma_n(M)$ should be roughly evenly spaced in the interval $[0, 2\sqrt{n}]$ (so that $\sigma_{n-i}(M)$ should be comparable to $(i+1)/\sqrt{n}$).
Now we give a rigorous lower bound:
Theorem 4 (Lower tail estimate for the least singular value) For any $\lambda > 0$, one has
$$ {\bf P}( \sigma_n(M) \leq \lambda / \sqrt{n} ) \ll o_{\lambda \to 0}(1) + o_{n \to \infty; \lambda}(1)$$
where $o_{\lambda \to 0}(1)$ goes to zero as $\lambda \to 0$ uniformly in $n$, and $o_{n \to \infty;\lambda}(1)$ goes to zero as $n \to \infty$ for each fixed $\lambda$.
This is a weaker form of a result of Rudelson and Vershynin (which obtains a bound of the form $O(\lambda) + O(c^n)$ for some $c < 1$), which builds upon earlier work of Rudelson and of Vu and myself, which obtained variants of the above result.

The scale $1/\sqrt{n}$ that we are working at here is too fine to use epsilon-net arguments (unless one has a lot of control on the entropy, which can be obtained in some cases thanks to powerful inverse Littlewood-Offord theorems, but is difficult to obtain in general). We can prove this theorem along similar lines to the arguments in the previous section; we sketch the method as follows. We can take $\lambda$ to be small. We write the probability to be estimated as
$$ {\bf P}( \|Mx\| \leq \lambda/\sqrt{n} \hbox{ for some unit vector } x ).$$
We can assume that $\|M\|_{op} \leq C \sqrt{n}$ for some absolute constant $C$, as the event that this fails has exponentially small probability.
We pick an $\epsilon > 0$ (not depending on $\lambda$) to be chosen later. We call a unit vector $x$ compressible if $x$ lies within a distance $\epsilon$ of a $\epsilon n$-sparse vector. Let us first dispose of the case in which $\|Mx\| \leq \lambda/\sqrt{n}$ for some compressible $x$. By paying an entropy cost of $\binom{n}{\lfloor \epsilon n \rfloor}$, we may assume that $x$ is within $\epsilon$ of a vector $y$ supported in the first $\lfloor \epsilon n \rfloor$ coordinates. Using the operator norm bound on $M$ and the triangle inequality, we conclude that
$$ \|My\| \leq \lambda/\sqrt{n} + C \epsilon \sqrt{n}.$$
Since $y$ has norm comparable to $1$, this implies that the least singular value of the first $\lfloor \epsilon n \rfloor$ columns of $M$ is $O( \epsilon \sqrt{n} )$. But by Theorem 1, this occurs with probability $O(e^{-cn})$ (if $\epsilon$ is small enough). So the total probability of the compressible event is at most $\binom{n}{\lfloor \epsilon n \rfloor} \times O(e^{-cn})$, which is acceptable if $\epsilon$ is small enough.

Thus we may assume now that $\|Mx\| > \lambda/\sqrt{n}$ for all compressible unit vectors $x$; we may similarly assume that $\|M^T x\| > \lambda/\sqrt{n}$ for all compressible unit vectors $x$. Indeed, we may also assume that $\|(M^{(i)})^T x\| > \lambda/\sqrt{n}$ for every compressible unit vector $x$ and every $1 \leq i \leq n$, where $M^{(i)}$ is $M$ with the $i^{th}$ column removed.
The remaining case is if $\|Mx\| \leq \lambda/\sqrt{n}$ for some incompressible $x$. Let us call this event $E$. Write $x = (x_1,\ldots,x_n)$, and let $R_1,\ldots,R_n$ be the columns of $M$, thus
$$ \| x_1 R_1 + \ldots + x_n R_n \| \leq \lambda/\sqrt{n}.$$
Letting $W_i$ be the subspace spanned by all the columns $R_1,\ldots,R_n$ except for $R_i$, we conclude upon projecting to the orthogonal complement of $W_i$ that
$$ |x_i| \hbox{dist}(R_i, W_i) \leq \lambda/\sqrt{n}$$
for all $i$ (compare with Exercise 3). On the other hand, since $x$ is incompressible, we see that $|x_i| \geq \epsilon/\sqrt{n}$ for at least $\epsilon n$ values of $i$, and thus
$$ \hbox{dist}(R_i, W_i) \leq \lambda/\epsilon \ \ \ \ \ (4)$$
for at least $\epsilon n$ values of $i$. If we let $F_i$ be the event that $E$ and (4) both hold, we thus have from double-counting that
$$ {\bf P}(E) \leq \frac{1}{\epsilon n} \sum_{i=1}^n {\bf P}(F_i)$$
and thus by symmetry
$$ {\bf P}(E) \leq \frac{1}{\epsilon} {\bf P}(F_n)$$
(say). However, if $F_n$ holds, then setting $v$ to be a unit normal vector to $W_n$ (which is necessarily incompressible, by the hypothesis on $M^{(n)}$), we have
$$ |R_n \cdot v| \leq \lambda/\epsilon.$$
Again, the crucial point is that $R_n$ and $v$ are independent. The incompressibility of $v$, combined with a Berry-Esséen type theorem, then gives
Exercise 4 Show that
$$ {\bf P}( |R_n \cdot v| \leq \lambda/\epsilon ) \ll_\epsilon \lambda + \frac{1}{\sqrt{n}} \ \ \ \ \ (5)$$
(say) if $\lambda$ is sufficiently small depending on $\epsilon$, and $n$ is sufficiently large depending on $\epsilon$ and $\lambda$.

This gives a bound of $O_\epsilon( \lambda + n^{-1/2} )$ for ${\bf P}(F_n)$, and hence for ${\bf P}(E)$, if $\lambda$ is small enough depending on $\epsilon$, and $n$ is large enough; this gives the claim.
Remark 2 A variant of these arguments, based on inverse Littlewood-Offord theorems rather than the Berry-Esséen theorem, gives the variant estimate
$$ \sigma_n( \frac{1}{\sqrt{n}} M - zI ) \geq n^{-A}$$
with high probability for some $A > 0$, and any $z$ of polynomial size in $n$. There are several results of this type, with overlapping ranges of generality (and various values of $A$), by Götze-Tikhomirov, Pan-Zhou, and Vu and myself; the exponent $A$ is known to degrade if one has too few moment assumptions on the underlying random matrix $M$. This type of result (with an unspecified $A$) is important for the circular law, discussed in the next set of lectures.
— 4. Upper bound for the least singular value —
One can complement the lower tail estimate with an upper tail estimate:
Theorem 5 (Upper tail estimate for the least singular value) For any $\lambda > 0$, one has
$$ {\bf P}( \sigma_n(M) \geq \lambda / \sqrt{n} ) \ll o_{\lambda \to \infty}(1) + o_{n \to \infty;\lambda}(1). \ \ \ \ \ (6)$$
We prove this using an argument of Rudelson and Vershynin. Suppose that $\sigma_n(M) \geq \lambda/\sqrt{n}$, then
$$ \|M^{-1} x\| \leq \frac{\sqrt{n}}{\lambda} \|x\| \ \ \ \ \ (7)$$
for all $x$.

Next, let $X_1,\ldots,X_n$ be the rows of $M$, and let $R_1,\ldots,R_n$ be the columns of $M^{-1}$, thus $R_1,\ldots,R_n$ is a dual basis for $X_1,\ldots,X_n$. From (7) (applied to the transpose $M^T$, which has the same singular values as $M$) we have
$$ \sum_{i=1}^n |x \cdot R_i|^2 \leq \frac{n}{\lambda^2} \|x\|^2. \ \ \ \ \ (8)$$
We apply this with $x$ equal to $X_n - \pi(X_n)$, where $\pi$ is the orthogonal projection to the space $V$ spanned by $X_1,\ldots,X_{n-1}$. On the one hand, we have
$$ \|x\| = \hbox{dist}(X_n, V),$$
and on the other hand we have for any $1 \leq i < n$ that
$$ x \cdot R_i = - X_n \cdot \pi(R_i)$$
(while $x \cdot R_n = 1$). If (8) holds, then
$$ |X_n \cdot \pi(R_i)| \leq \sqrt{2} \hbox{dist}(X_n,V)/\lambda$$
for at least half of the indices $1 \leq i < n$, so the probability in (6) can be bounded by
$$ {\bf P}( |X_n \cdot \pi(R_i)| \leq \sqrt{2} \hbox{dist}(X_n,V)/\lambda \hbox{ for at least half of the } 1 \leq i < n ),$$
which by symmetry can be bounded by
$$ \ll {\bf P}( |X_n \cdot \pi(R_1)| \leq \sqrt{2} \hbox{dist}(X_n,V)/\lambda ).$$
Let $\kappa > 0$ be a small quantity to be chosen later. From Talagrand's inequality (Notes 1) we know that $\hbox{dist}(X_n,V) \leq \kappa^{-1}$ with probability $1 - O(\kappa)$ (say), so we obtain a bound of
$$ \ll {\bf P}( |X_n \cdot \pi(R_1)| \leq \sqrt{2}/(\kappa\lambda) ) + O(\kappa).$$
Now a key point is that the vectors $\pi(R_1),\ldots,\pi(R_{n-1})$ depend only on $X_1,\ldots,X_{n-1}$ and not on $X_n$; indeed, they are the dual basis for $X_1,\ldots,X_{n-1}$ in $V$. Thus, after conditioning $X_1,\ldots,X_{n-1}$ and thus $\pi(R_1)$ to be fixed, $X_n$ is still a Bernoulli random vector. Applying a Berry-Esséen inequality, we obtain a bound of $O( \frac{1}{\kappa^2 \lambda} ) + o_{n \to \infty;\kappa,\lambda}(1)$ for the conditional probability that $|X_n \cdot \pi(R_1)| \leq \sqrt{2}/(\kappa\lambda)$, provided that $\|\pi(R_1)\| \geq \kappa$ (which holds with probability $1 - O(\kappa)$ by another application of Talagrand's inequality), unless $\pi(R_1)$ is compressible (in the sense that, say, it is within $\epsilon$ of an $\epsilon n$-sparse vector for some small absolute constant $\epsilon > 0$). But this latter possibility can be controlled (with exponentially small probability) by the same type of arguments as before; we omit the details. Choosing $\kappa$ to be a suitable negative power of $\lambda$ (e.g. $\kappa := \lambda^{-1/3}$), we obtain a bound of the required form $o_{\lambda \to \infty}(1) + o_{n \to \infty;\lambda}(1)$.
— 5. Asymptotic for the least singular value —
The distribution of the singular values of a gaussian random matrix can be computed explicitly by techniques similar to those employed in Notes 6. In particular, if $M$ is a real gaussian matrix (with all entries iid with distribution $N(0,1)$), it was shown by Edelman that $\sqrt{n} \sigma_n(M)$ converges in distribution to the distribution $\mu_E$ with cumulative distribution function $\mu_E([0,t]) = 1 - e^{-t^2/2 - t}$ as $n \to \infty$. It turns out that this result can be extended to other ensembles with the same mean and variance. In particular, we have the following result of Van Vu and myself:

Theorem 6 If $M$ is an iid Bernoulli matrix, then $\sqrt{n} \sigma_n(M)$ also converges in distribution to $\mu_E$ as $n \to \infty$. (In fact there is a polynomial rate of convergence.)
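A quick simulation (illustrative only; sample sizes and dimensions are arbitrary, and empirical medians are a crude proxy for convergence in distribution) compares $\sqrt{n}\,\sigma_n(M)$ for Bernoulli and gaussian matrices against the Edelman limit described above.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 200, 300

def least_sv(sampler):
    return np.array([np.linalg.svd(sampler(), compute_uv=False)[-1] for _ in range(trials)])

bern = np.sqrt(n) * least_sv(lambda: rng.choice([-1.0, 1.0], size=(n, n)))
gauss = np.sqrt(n) * least_sv(lambda: rng.standard_normal((n, n)))

# Empirical medians, compared with the median of the limit (CDF 1 - exp(-t^2/2 - t)).
t = np.linspace(0, 5, 5001)
limit_median = t[np.searchsorted(1 - np.exp(-t ** 2 / 2 - t), 0.5)]
print(np.median(bern).round(3), np.median(gauss).round(3), limit_median.round(3))
```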
This should be compared with Theorems 4, 5, which show that $\sqrt{n} \sigma_n(M)$ has a tight sequence of distributions in $(0,+\infty)$. The arguments of Van and myself thus provide an alternate proof of these two theorems.

We do not establish the limit directly, but instead use Edelman's result as a black box, focusing instead on establishing the universality of the limiting distribution of $\sqrt{n} \sigma_n(M)$, and in particular that this limiting distribution is the same whether one has a Bernoulli ensemble or a gaussian ensemble.
The arguments are somewhat technical and we will not present them in full here, but instead give a sketch of the key ideas.
In previous sections we have already seen the close relationship between the least singular value $\sigma_n(M)$, and the distances $\hbox{dist}(X_i, V_i)$ between a row $X_i$ of $M$ and the hyperplane $V_i$ spanned by the other $n-1$ rows. It is not hard to use the above machinery to show that as $n \to \infty$, $\hbox{dist}(X_i, V_i)$ converges in distribution to the absolute value $|N(0,1)|$ of a Gaussian regardless of the underlying distribution of the coefficients of $M$ (i.e. it is asymptotically universal). The basic point is that one can write $\hbox{dist}(X_i, V_i)$ as $|X_i \cdot n_i|$ where $n_i$ is a unit normal of $V_i$ (we will assume here that $M$ is non-singular, which by previous arguments is true asymptotically almost surely). The previous machinery lets us show that $n_i$ is incompressible with high probability, and the claim then follows from the Berry-Esséen theorem.
Unfortunately, despite the presence of suggestive relationships such as Exercise 3, the asymptotic universality of the distances $\hbox{dist}(X_i,V_i)$ does not directly imply asymptotic universality of the least singular value. However, it turns out that one can obtain a higher-dimensional version of the universality of the scalar quantities $|X_i \cdot n_i|$, as follows. For any small $k$ (say, $1 \leq k \leq n^c$ for some small absolute constant $c > 0$) and any distinct $i_1,\ldots,i_k \in \{1,\ldots,n\}$, a modification of the above argument shows that the covariance matrix
$$ ( \pi(X_{i_a}) \cdot \pi(X_{i_b}) )_{1 \leq a,b \leq k} \ \ \ \ \ (9)$$
of the orthogonal projections $\pi(X_{i_1}),\ldots,\pi(X_{i_k})$ of the $k$ rows $X_{i_1},\ldots,X_{i_k}$ to the complement $W^\perp$ of the space $W$ spanned by the other $n-k$ rows of $M$, is also universal, converging in distribution to the covariance matrix $( G_a \cdot G_b )_{1 \leq a,b \leq k}$ of $k$ iid gaussian vectors $G_1,\ldots,G_k$ with independent $N(0,1)$ coordinates (note that the convergence of $\hbox{dist}(X_i,V_i)$ to $|N(0,1)|$ is the $k=1$ case of this claim). (These covariance matrix distributions are also known as Wishart distributions.) The key point is that one can show that the complement $W^\perp$ is usually “incompressible” in a certain technical sense, which implies that the projections $\pi(X_{i_a})$ behave like iid gaussians on that projection thanks to a multidimensional Berry-Esséen theorem.
On the other hand, the covariance matrix (9) is closely related to the inverse matrix $M^{-1}$:

Exercise 5 Show that (9) is also equal to $(A^* A)^{-1}$, where $A$ is the $n \times k$ matrix formed from the $i_1,\ldots,i_k$ columns of $M^{-1}$.
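A small numerical check of this identity (an illustrative sketch; dimensions and the choice of index set are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 30, 4
M = rng.choice([-1.0, 1.0], size=(n, n))
idx = [0, 1, 2, 3]                       # the k selected rows/columns

# Projections of the selected rows onto the orthogonal complement of the span
# of the remaining n-k rows.
others = np.delete(M, idx, axis=0)       # (n-k) x n
_, _, Vt = np.linalg.svd(others)
basis = Vt[n - k:]                       # orthonormal basis of the complement, k x n
P = basis.T @ basis                      # projection matrix onto the complement
proj = (P @ M[idx].T).T                  # the k projected rows

gram = proj @ proj.T                     # the covariance matrix (9)

A = np.linalg.inv(M)[:, idx]             # the corresponding k columns of M^{-1}
print(np.allclose(gram, np.linalg.inv(A.T @ A)))
```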
In particular, this shows that the singular values of $k$ randomly selected columns of $M^{-1}$ have a universal distribution.

Recall our goal is to show that $\sqrt{n} \sigma_n(M)$ has an asymptotically universal distribution, which is equivalent to asking that $\frac{1}{\sqrt{n}} \|M^{-1}\|_{op}$ has an asymptotically universal distribution. The goal is then to extract the operator norm of $M^{-1}$ from looking at a random $n \times k$ minor $B$ of this matrix. This comes from the following application of the second moment method:
Exercise 6 Let $A$ be an $n \times n$ matrix with columns $R_1,\ldots,R_n$, and let $B$ be the $n \times k$ matrix formed by taking $k$ of the columns $R_1,\ldots,R_n$ at random. Show that
$$ {\bf E} \| A A^* - \frac{n}{k} B B^* \|_F^2 \leq \frac{n}{k} \sum_{i=1}^n \|R_i\|^4,$$
where $\|\cdot\|_F$ is the Frobenius norm.
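The following quick simulation (an illustrative sketch; the dimensions, sample size, and the choice of test matrix are arbitrary — any fixed matrix can be used, since the expectation is only over the random choice of columns) checks the second moment bound of Exercise 6 by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k, trials = 60, 10, 500

A = rng.standard_normal((n, n)) / np.sqrt(n)             # arbitrary fixed test matrix
target = A @ A.T
bound = (n / k) * np.sum(np.sum(A ** 2, axis=0) ** 2)     # (n/k) * sum_i ||R_i||^4

errs = []
for _ in range(trials):
    cols = rng.choice(n, size=k, replace=False)
    B = A[:, cols]
    errs.append(np.linalg.norm(target - (n / k) * (B @ B.T), 'fro') ** 2)

print(np.mean(errs) <= bound, round(float(np.mean(errs)), 3), round(bound, 3))
```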
Recall from Exercise 3 that $\|R_i\| = \hbox{dist}(X_i,V_i)^{-1}$, so we expect each $\|R_i\|$ to have magnitude about $O(1)$. This, together with the Hoffman-Wielandt inequality (Notes 3b), means that we expect the eigenvalues of $\frac{n}{k} B B^*$ to differ by $O(n/\sqrt{k}) = o(n)$ from those of $A A^* = M^{-1} (M^{-1})^*$, and in particular from $\|M^{-1}\|_{op}^2$, which is of size comparable to $n$. In principle, this gives us asymptotic universality of $\frac{1}{\sqrt{n}} \|M^{-1}\|_{op}$ from the already established universality of $B$.
There is one technical obstacle remaining, however: while we know that each $\hbox{dist}(X_i,V_i)$ is distributed like the absolute value of a Gaussian, so that each individual $\|R_i\|$ is going to be of size $O(1)$ with reasonably good probability, in order for the above exercise to be useful, one needs to bound all of the $\|R_i\|$ simultaneously with high probability. A naive application of the union bound leads to terrible results here. Fortunately, there is a strong correlation between the $\|R_i\|$: they tend to be large together or small together, or equivalently that the distances $\hbox{dist}(X_i,V_i)$ tend to be small together or large together. Here is one indication of this:
Lemma 7 For any $1 \leq i \neq j \leq n$, one has
$$ \frac{1}{\hbox{dist}(X_i,V_i)} \leq \frac{1}{\|\pi(X_i)\|} + \frac{|\pi(X_i) \cdot \pi(X_j)|}{\|\pi(X_i)\|^2} \frac{1}{\hbox{dist}(X_j,V_j)},$$
where $\pi$ is the orthogonal projection onto the orthogonal complement of the space spanned by the rows $X_k$, $k \neq i,j$.

Proof: We may relabel so that $i=1$ and $j=2$; then projecting everything by $\pi$ we may assume that $n = 2$, so that $V_1 = \hbox{span}(X_2)$ and $V_2 = \hbox{span}(X_1)$. Our goal is now to show that
$$ \frac{1}{\hbox{dist}(X_1,V_1)} \leq \frac{1}{\|X_1\|} + \frac{|X_1 \cdot X_2|}{\|X_1\|^2} \frac{1}{\hbox{dist}(X_2,V_2)}.$$
Recall that $R_1, R_2$ is a dual basis to $X_1, X_2$. This implies in particular that
$$ x = (x \cdot X_1) R_1 + (x \cdot X_2) R_2$$
for any vector $x$; applying this to $x = X_1$ we obtain
$$ X_1 = \|X_1\|^2 R_1 + (X_1 \cdot X_2) R_2$$
and hence by the triangle inequality
$$ \|X_1\|^2 \|R_1\| \leq \|X_1\| + |X_1 \cdot X_2| \|R_2\|.$$
Using the fact that $\|R_i\| = 1/\hbox{dist}(X_i,V_i)$, the claim follows. $\Box$
In practice, once $\hbox{dist}(X_j,V_j)$ gets moderately large (e.g. at least $n^{-c}$ for some small $c > 0$), one can control the expressions $\|\pi(X_i)\|$ and $\pi(X_i) \cdot \pi(X_j)$ appearing here by Talagrand's inequality (Notes 1), and so this inequality tells us that once $\hbox{dist}(X_j,V_j)$ is bounded away from zero for one value of $j$, it is bounded away from zero for all other $i$ also. This turns out to be enough to get enough uniform control on the $\|R_i\|$ to make Exercise 6 useful, and ultimately to complete the proof of Theorem 6.