Now we turn attention to another important spectral statistic, the least singular value {\sigma_n(M)} of an {n \times n} matrix {M} (or, more generally, the least non-trivial singular value {\sigma_p(M)} of an {n \times p} matrix with {p \leq n}). This quantity controls the invertibility of {M}. Indeed, {M} is invertible precisely when {\sigma_n(M)} is non-zero, and the operator norm {\|M^{-1}\|_{op}} of {M^{-1}} is given by {1/\sigma_n(M)}. This quantity is also related to the condition number {\sigma_1(M)/\sigma_n(M) = \|M\|_{op} \|M^{-1}\|_{op}} of {M}, which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of {M} (and more generally, of the shifts {\frac{1}{\sqrt{n}} M - zI} for complex {z}) will be of importance in rigorously establishing the circular law for iid random matrices {M}, as it plays a key role in computing the Stieltjes transform {\frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M - zI)^{-1}} of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.

The least singular value

\displaystyle  \sigma_n(M) = \inf_{\|x\|=1} \|Mx\|,

which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm

\displaystyle  \|M\|_{op} = \sigma_1(M) = \sup_{\|x\|=1} \|Mx\|

at the “soft edge” of the spectrum, that was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of {x}, but is not able to control the “generic” or “incompressible” choices of {x}, for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with {p=yn} for some {0 < y < 1}, it gives an upper bound of {(1-\sqrt{y}+o(1))\sqrt{n}} for the least singular value with high probability, thanks to the Marchenko-Pastur law), but again this method begins to break down for square matrices, although one can make some partial headway by considering negative moments such as {\hbox{tr} M^{-2}}, though these are more difficult to compute than positive moments {\hbox{tr} M^k}.

So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the {n} rows {X_1,\ldots,X_n \in {\bf C}^n} of the matrix {M}, and the hyperplane spanned by the other {n-1} rows. The reason for this is as follows. First suppose that {\sigma_n(M)=0}, so that {M} is non-invertible, and there is a linear dependence between the rows {X_1,\ldots,X_n}. Thus, one of the {X_i} will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the {n} distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so {\sigma_n(M)=0}.

More generally, if the least singular value {\sigma_n(M)} is small, one generically expects many of these {n} distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row {X_i} and the hyperplane spanned by the other rows. This latter quantity is basically the dot product of {X_i} with a unit normal {n_i} of this hyperplane.
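
To illustrate the easy direction of this equivalence: since every row other than {X_i} lies in this hyperplane and is thus orthogonal to {n_i}, one has {\|M n_i\| = |X_i \cdot n_i|}, and hence

\displaystyle  \sigma_n(M) \leq |X_i \cdot n_i|

for every {i}; thus the least singular value is bounded above by each of the {n} distances. It is the converse direction that requires most of the work.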

When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal {n_i} (which depends on all the rows other than {X_i}) is independent of {X_i}, so even after conditioning {n_i} to be fixed, the entries of {X_i} remain independent. As such, the dot product {X_i \cdot n_i} is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal {n_i} is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)
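
For instance, for the random sign matrices introduced below, if the normal {n_i} happened to be proportional to {(1,1,0,\ldots,0)}, then {X_i \cdot n_i} would vanish with probability {1/2}, whereas for a “generic” direction such as {\frac{1}{\sqrt{n}}(1,\ldots,1)} it vanishes with probability only {O(n^{-1/2})} (see Proposition 3 below); it is precisely this contrast between structured and generic normal vectors that the arguments below are designed to exploit.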

These methods rely quite strongly on the joint independence of all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).

To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the Bernoulli ensemble {M = (\xi_{ij})_{1 \leq i,j \leq n}} of random sign matrices, where {\xi_{ij} = \pm 1} are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.

— 1. The epsilon-net argument —

We begin by using the epsilon net argument to establish a lower bound in the rectangular case, due to Litvak, Pajor, Rudelson, and Tomczak-Jaegermann:

Theorem 1 (Lower bound) Let {M = (\xi_{ij})_{1\leq i\leq n; 1\leq j \leq p}} be an {n \times p} Bernoulli matrix, where {p \leq (1-\delta) n} for some {\delta > 0} (independent of {n}). Then with exponentially high probability (i.e. {1-O(e^{-cn})} for some {c>0}), one has {\sigma_p(M) \geq c \sqrt{n}}, where {c > 0} depends only on {\delta}.

This should be compared with the upper bound

\displaystyle  \|M\|_{op} = \sigma_1(M) \leq C \sqrt{n} \ \ \ \ \ (1)

for some absolute constant {C} that holds with overwhelming probability from Notes 3 (indeed, one can take any {C>2} here).

We use the epsilon-net method introduced in Notes 3, but with a smaller value of {\epsilon > 0} than used for the largest singular value. We write

\displaystyle  \sigma_p(M) = \inf_{x \in {\bf C}^p: \|x\|=1} \|Mx\|.

Taking {\Sigma} to be a maximal {\epsilon}-net of the unit sphere in {{\bf R}^p} (since {M} has real entries, we may restrict the infimum above to real unit vectors {x}), with {\epsilon > 0} to be chosen later, we have that

\displaystyle  \sigma_p(M) \geq \inf_{x \in \Sigma} \|Mx\| - \epsilon \|M\|_{op}

and thus by (1), we have with overwhelming probability that

\displaystyle  \sigma_p(M) \geq \inf_{x \in \Sigma} \|Mx\| - C \epsilon \sqrt{n},

and so it suffices to show that

\displaystyle  \mathop{\bf P}( \inf_{x \in \Sigma} \|Mx\| \leq 2C\epsilon \sqrt{n} )

is exponentially small in {n}. From the union bound, we can upper bound this by

\displaystyle  \sum_{x \in \Sigma} \mathop{\bf P}( \|Mx\| \leq 2C\epsilon \sqrt{n} ).

From the volume packing argument we have

\displaystyle  |\Sigma| \leq O(1/\epsilon)^p \leq O(1/\epsilon)^{(1-\delta) n}. \ \ \ \ \ (2)

So we need to upper bound, for each {x \in \Sigma}, the probability

\displaystyle  {\bf P}(\|Mx\| \leq 2C\epsilon \sqrt{n}).

If we let {Y_1,\ldots,Y_n \in {\bf C}^p} be the rows of {M}, we can write this as

\displaystyle  {\bf P}( \sum_{j=1}^n |Y_j \cdot x|^2 \leq 4 C^2 \epsilon^2 n ).

By Markov’s inequality (or simple counting), the only way that this event can hold is if we have

\displaystyle  |Y_j \cdot x|^2 \leq 8 C^2 \epsilon^2/\delta

for at least {(1-\delta/2)n} values of {j}, since otherwise the sum {\sum_{j=1}^n |Y_j \cdot x|^2} would already exceed {(\delta/2) n \times 8 C^2 \epsilon^2/\delta = 4 C^2 \epsilon^2 n}. We do not know in advance which set of {j} this event holds for, but the number of possible such sets is at most {2^n}. Applying the union bound (and paying the entropy cost of {2^n}) and using symmetry, we may thus bound the above probability by

\displaystyle  \leq 2^n {\bf P}( |Y_j \cdot x|^2 \leq 8 C^2 \epsilon^2/\delta \hbox{ for } 1 \leq j \leq (1-\delta/2)n ).

Now observe that the random variables {Y_j \cdot x} are independent, and so we can bound this expression by

\displaystyle  \leq 2^n {\bf P}( |Y \cdot x| \leq \sqrt{8/\delta} C \epsilon )^{(1-\delta/2)n}

where {Y = (\xi_1,\ldots,\xi_p)} is a random vector of iid Bernoulli signs.

We write {x = (x_1,\ldots,x_p)}, so that {Y \cdot x} is a random walk

\displaystyle  Y \cdot x = \xi_1 x_1 + \ldots + \xi_p x_p.

To understand this walk, we apply a slight variant of the Berry-Esséen theorem from Notes 2:

Exercise 1 Show that

\displaystyle  \sup_t {\bf P}( |Y \cdot x - t| \leq r ) \ll \frac{r}{\|x\|} + \frac{1}{\|x\|^3} \sum_{j=1}^n |x_j|^3

for any {r>0} and any non-zero {x}. (Actually, for the purposes of this set of notes, it would suffice to establish a weaker form of the Berry-Esséen theorem with {\sum_{j=1}^n |x_j|^3 / \|x\|^3} replaced by {(\sum_{j=1}^n |x_j|^3 / \|x\|^3)^c} for any fixed {c>0}). (Hint: first normalise {\|x\|=1}, then adapt the proof of the Berry-Esséen theorem.)

Conclude in particular that if

\displaystyle  \sum_{j: |x_j| \leq \epsilon} |x_j|^2 \geq \eta^2

for some {\eta>0} then

\displaystyle  \sup_t {\bf P}( |Y \cdot x-t| \leq \sqrt{8/\delta} C \epsilon ) \ll_{\eta,\delta} \epsilon.

(Hint: condition out all the {x_j} with {|x_j| > \epsilon}.)

Let us temporarily call {x} incompressible if

\displaystyle  \sum_{j: |x_j| \leq \epsilon} |x_j|^2 \geq \eta^2

for some {\eta>0} to be chosen later, and compressible otherwise. If we only look at the incompressible elements of {\Sigma}, we can now bound

\displaystyle  {\bf P}(\|Mx\| \leq 2C\epsilon \sqrt{n}) \ll (O_{\eta,\delta}(\epsilon))^{(1-\delta/2)n},

and comparing this against the entropy cost (2) we obtain an acceptable contribution for {\eta} small enough, and {\epsilon} small enough depending on {\eta,\delta} (here we are crucially using the rectangular condition {p \leq (1-\delta) n}).
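
To spell out this comparison: the total contribution of the incompressible elements of {\Sigma} is at most

\displaystyle  O(1/\epsilon)^{(1-\delta)n} \times (O_{\eta,\delta}(\epsilon))^{(1-\delta/2)n} = \left( O_{\eta,\delta}(1)\, \epsilon^{\delta/2} \right)^{n},

which decays exponentially in {n} once {\epsilon} is smaller than a suitable quantity depending on {\eta,\delta}.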

It remains to deal with the compressible vectors. Observe that such vectors lie within {\eta} of a sparse unit vector which is only supported in at most {\epsilon^{-2}} positions. The {\eta}-entropy of these sparse vectors (i.e. the number of balls of radius {\eta} needed to cover this space) can easily be computed to be of polynomial size {O(n^{O_{\epsilon,\eta}(1)})} in {n}. Meanwhile, we have the following crude bound:

Exercise 2 For any unit vector {x}, show that

\displaystyle  {\bf P}( |Y \cdot x| \leq \kappa ) \leq 1-\kappa

for {\kappa > 0} small enough. (Hint: Use the Paley-Zygmund inequality. Bounds on higher moments of {|Y \cdot x|} can be obtained for instance using Hoeffding’s inequality, or by direct computation.) Use this to show that

\displaystyle  \mathop{\bf P}( \|Mx\| \leq 2C\eta \sqrt{n} ) \ll \exp(-cn)

for all such {x} and {\eta} sufficiently small, with {c>0} independent of {\eta} and {n}.

Thus the compressible vectors give a net contribution of {O( n^{O_{\epsilon,\eta}(1)} ) \times \exp(-cn)}, which is acceptable. This concludes the proof of Theorem 1.
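
As a quick numerical sanity check of Theorem 1 (this is purely illustrative and plays no role in the rigorous argument; the dimensions, aspect ratio, and number of trials below are arbitrary choices), one can sample a few rectangular sign matrices and compare their least singular value against {\sqrt{n}}:

import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 400, 0.5, 20
p = int((1 - delta) * n)                              # aspect ratio bounded away from 1
ratios = []
for _ in range(trials):
    M = rng.choice([-1.0, 1.0], size=(n, p))          # n x p random sign matrix
    sigma_p = np.linalg.svd(M, compute_uv=False)[-1]  # least singular value
    ratios.append(sigma_p / np.sqrt(n))
print("sigma_p(M)/sqrt(n) over %d trials: min %.3f, max %.3f"
      % (trials, min(ratios), max(ratios)))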

— 2. Singularity probability —

Now we turn to square Bernoulli matrices {M = (\xi_{ij})_{1 \leq i,j \leq n}}. Before we investigate the size of the least singular value, we first tackle the easier problem of bounding the singularity probability

\displaystyle  {\bf P}(\sigma_n(M) = 0),

i.e. the probability that {M} is not invertible. The problem of computing this probability exactly is still not completely settled. Since {M} is singular whenever the first two rows (say) are identical, we obtain a lower bound

\displaystyle  {\bf P}(\sigma_n(M) = 0) \geq \frac{1}{2^n},

and it is conjectured that this bound is essentially tight in the sense that

\displaystyle  {\bf P}(\sigma_n(M) = 0) = (\frac{1}{2}+o(1))^n,

but this remains open; the best bound currently is due to Bourgain, Vu, and Wood, and gives

\displaystyle  {\bf P}(\sigma_n(M) = 0) \leq (\frac{1}{\sqrt{2}}+o(1))^n.

We will not prove this bound here, but content ourselves with a weaker bound, essentially due to Komlós:

Proposition 2 We have {{\bf P}(\sigma_n(M) = 0) \ll 1/n^{1/2}}.

To show this, we need the following combinatorial fact, due to Erdös:

Proposition 3 (Erdös Littlewood-Offord theorem) Let {x = (x_1,\ldots,x_n)} be a vector with at least {k} nonzero entries, and let {Y = (\xi_1,\ldots,\xi_n)} be a random vector of iid Bernoulli signs. Then {{\bf P}(Y \cdot x = 0) \ll k^{-1/2}}.

Proof: By taking real and imaginary parts we may assume that {x} is real. By eliminating zero coefficients of {x} we may assume that {k=n}; reflecting we may then assume that all the {x_i} are positive. Observe that the set of {Y = (\xi_1,\ldots,\xi_n) \in \{-1,1\}^n} with {Y \cdot x = 0} forms an antichain in {\{-1,1\}^n}. The claim now easily follows from Sperner’s theorem and Stirling’s formula. \Box
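
To spell out the final step: Sperner’s theorem bounds the cardinality of any antichain in {\{-1,1\}^n} by {\binom{n}{\lfloor n/2\rfloor}}, and hence

\displaystyle  {\bf P}(Y \cdot x = 0) \leq \binom{n}{\lfloor n/2\rfloor} 2^{-n} \ll n^{-1/2}

by Stirling’s formula. (The antichain property follows from the positivity of the {x_i}: flipping any sign {\xi_i} from {-1} to {+1} strictly increases {Y \cdot x}, so no two elements of the set are comparable.)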

Note that we also have the obvious bound

\displaystyle  {\bf P}(Y \cdot x = 0) \leq 1/2 \ \ \ \ \ (3)

for any non-zero {x}.

Now we prove the theorem. In analogy with the arguments of the previous section, we write

\displaystyle  {\bf P}(\sigma_n(M) = 0) = {\bf P}( Mx = 0 \hbox{ for some nonzero } x \in {\bf C}^n )

(actually we can take {x \in {\bf R}^n} since {M} is real). We divide into compressible and incompressible vectors as before, but our definition of compressibility and incompressibility is slightly different now. Also, one has to do a certain amount of technical maneuvering in order to preserve the crucial independence between rows and columns.

Namely, we pick an {\epsilon > 0} and call {x} compressible if it is supported on at most {\epsilon n} coordinates, and incompressible otherwise.

Let us first consider the contribution of the event that {Mx=0} for some nonzero compressible {x}. Pick an {x} with this property which is as sparse as possible, say {k}-sparse for some {1 \leq k < \epsilon n}. Let us temporarily fix {k}. By paying an entropy cost of {\binom{n}{k}}, we may assume that it is the first {k} entries of {x} that are non-zero. This implies that the first {k} columns {Y_1,\ldots,Y_k} of {M} have a linear dependence given by {x}; by minimality, {Y_1,\ldots,Y_{k-1}} are linearly independent. Thus, {x} is uniquely determined (up to scalar multiples) by {Y_1,\ldots,Y_k}. Furthermore, as the {n \times k} matrix formed by {Y_1,\ldots,Y_k} has rank {k-1}, there is some {k \times k} minor which already determines {x} up to constants; by paying another entropy cost of {\binom{n}{k}}, we may assume that it is the top left minor which does this. In particular, we can now use the first {k} rows {X_1,\ldots,X_k} to determine {x} up to constants. But the remaining {n-k} rows are independent of {X_1,\ldots,X_k} and still need to be orthogonal to {x}; by Proposition 3 and (3), this happens with probability at most {\min( O( k^{-1/2} ), 1/2 )^{n-k}}, giving a total cost of

\displaystyle  \sum_{1 \leq k \leq \epsilon n} \binom{n}{k}^2 \min( O( k^{-1/2} ), 1/2 )^{n-k},

which by Stirling’s formula is acceptable (in fact this gives an exponentially small contribution).
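
For instance, bounding {\binom{n}{k} \leq (en/k)^k} and using only the crude factor {(1/2)^{n-k}}, each summand is at most

\displaystyle  (en/k)^{2k} 2^{-(n-k)} \leq \exp\left( 2\epsilon n \log(e/\epsilon) - (1-\epsilon) n \log 2 \right),

since {k \mapsto k \log(en/k)} is increasing for {k \leq n}; this is exponentially small in {n} once {\epsilon} is small enough that {2\epsilon \log(e/\epsilon) < (1-\epsilon)\log 2}, and summing over the at most {\epsilon n} values of {k} does not affect this. (Thus the gain of {O(k^{-1/2})} from Proposition 3 is not even needed for the compressible case; it will however be essential for the incompressible case below.)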

The same argument gives that the event that {y^* M=0} for some nonzero compressible {y} also has exponentially small probability. The only remaining event to control is the event that {Mx=0} for some incompressible {x}, but that {Mz \neq 0} and {y^* M \neq 0} for all nonzero compressible {z,y}. Call this event {E}.

Since {M} is singular on the event {E}, there is also a nonzero vector {y} with {y^* M = 0}, and by the definition of {E} this {y} must be incompressible; in other words, the rows {X_1,\ldots,X_n} obey a non-trivial linear relation in which at least {\epsilon n} of the coefficients are non-zero. Thus, for at least {\epsilon n} values of {k \in \{1,\ldots,n\}}, the row {X_k} lies in the vector space {V_k} spanned by the remaining {n-1} rows of {M}. Let {E_k} denote the event that {E} holds, and that {X_k} lies in {V_k}; then we see from double counting that

\displaystyle  {\bf P}(E) \leq \frac{1}{\epsilon n} \sum_{k=1}^n {\bf P}(E_k).

By symmetry, we thus have

\displaystyle  {\bf P}(E) \leq \frac{1}{\epsilon} {\bf P}(E_n).

To compute {{\bf P}(E_n)}, we freeze {X_1,\ldots,X_{n-1}} and consider a normal vector {x} to {V_n}; note that we can select {x} depending only on {X_1,\ldots,X_{n-1}}. We may assume that an incompressible normal vector exists, since otherwise the event {E_n} would be empty. We make the crucial observation that {X_n} is still independent of {x}. By Proposition 3, we thus see that the conditional probability that {X_n \cdot x = 0}, for fixed {X_1,\ldots,X_{n-1}}, is {O_\epsilon(n^{-1/2})}. We thus see that {{\bf P}(E) \ll_\epsilon 1/n^{1/2}}, and the claim follows.
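
As an aside, for small {n} the singularity probability can be estimated empirically; the following crude Monte Carlo sketch (with arbitrary choices of {n} and sample size) is again purely illustrative:

import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 20000
singular = 0
for _ in range(trials):
    M = rng.choice([-1, 1], size=(n, n))
    # matrix_rank is numerically reliable for such small sign matrices
    if np.linalg.matrix_rank(M) < n:
        singular += 1
print("empirical P(singular) for n = %d: %.4f" % (n, singular / trials))
print("crude lower bound 1/2^n from two equal rows: %.4f" % (2.0 ** (-n)))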

Remark 1 Further progress has been made on this problem by a finer analysis of the concentration probability {{\bf P}(Y \cdot x = 0)}, and in particular in classifying those {x} for which this concentration probability is large (this is known as the inverse Littlewood-Offord problem). Important breakthroughs in this direction were made by Halász (introducing Fourier-analytic tools) and by Kahn, Komlós, and Szemerédi (introducing an efficient “swapping” argument). Van Vu and I introduced tools from additive combinatorics (such as Freiman’s theorem) to obtain further improvements, leading eventually to the results of Bourgain, Vu, and Wood mentioned earlier.

— 3. Lower bound for the least singular value —

Now we return to the least singular value {\sigma_n(M)} of an iid Bernoulli matrix, and establish a lower bound. Given that there are {n} singular values between {0} and {\sigma_1(M)}, which is typically of size {O(\sqrt{n})}, one expects the least singular value to be of size about {1/\sqrt{n}} on the average. Another argument supporting this heuristic comes from the following identity:

Exercise 3 (Negative second moment identity) Let {M} be an invertible {n \times n} matrix, let {X_1,\ldots,X_n} be the rows of {M}, and let {R_1,\ldots,R_n} be the columns of {M^{-1}}. For each {1 \leq i \leq n}, let {V_i} be the hyperplane spanned by all the rows {X_1,\ldots,X_n} other than {X_i}. Show that {\|R_i\| = \hbox{dist}(X_i,V_i)^{-1}} and {\sum_{i=1}^n \sigma_i(M)^{-2} = \sum_{i=1}^n \hbox{dist}(X_i,V_i)^{-2}}.
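
Readers who wish to test Exercise 3 numerically can do so with a short computation along the following lines (for a single random sign matrix of arbitrary size; the distances are obtained by least squares projection onto the span of the other rows):

import numpy as np

rng = np.random.default_rng(2)
n = 60
M = rng.choice([-1.0, 1.0], size=(n, n))
R = np.linalg.inv(M)                          # columns R_1,...,R_n of M^{-1}
sing = np.linalg.svd(M, compute_uv=False)
dists = []
for i in range(n):
    others = np.delete(M, i, axis=0)          # the other n-1 rows, spanning V_i
    coeffs = np.linalg.lstsq(others.T, M[i], rcond=None)[0]
    dists.append(np.linalg.norm(M[i] - others.T @ coeffs))   # dist(X_i, V_i)
dists = np.array(dists)
print("max | ||R_i|| - 1/dist(X_i,V_i) | :",
      np.max(np.abs(np.linalg.norm(R, axis=0) - 1.0 / dists)))
print("sum of sigma_i^{-2}:", float(np.sum(sing ** -2.0)))
print("sum of dist^{-2}   :", float(np.sum(dists ** -2.0)))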

From Talagrand’s inequality (Notes 1), we expect each {\hbox{dist}(X_i,V_i)} to be of size {O(1)} on the average, which suggests that {\sum_{i=1}^n \sigma_i(M)^{-2} = O(n)}; this is consistent with the heuristic that the singular values {\sigma_i(M)} should be roughly evenly spaced in the interval {[0,2\sqrt{n}]} (so that {\sigma_{n-i}(M)} should be about {(i+1)/\sqrt{n}}).

Now we give a rigorous lower bound:

Theorem 4 (Lower tail estimate for the least singular value) For any {\lambda > 0}, one has

\displaystyle  {\bf P}( \sigma_n(M) \leq \lambda / \sqrt{n} ) \ll o_{\lambda \rightarrow 0}(1) + o_{n \rightarrow \infty;\lambda}(1)

where {o_{\lambda \rightarrow 0}(1)} goes to zero as {\lambda \rightarrow 0} uniformly in {n}, and {o_{n \rightarrow \infty;\lambda}(1)} goes to zero as {n \rightarrow \infty} for each fixed {\lambda}.

This is a weaker form of a result of Rudelson and Vershynin (which gives a bound of the form {O(\lambda) + O(c^n)} for some {c<1}), building upon earlier work of Rudelson and of Vu and myself that obtained variants of the above result.

The scale {1/\sqrt{n}} that we are working at here is too fine to use epsilon net arguments (unless one has a lot of control on the entropy, which can be obtained in some cases thanks to powerful inverse Littlewood-Offord theorems, but is difficult to obtain in general.) We can prove this theorem along similar lines to the arguments in the previous section; we sketch the method as follows. We can take {\lambda} to be small. We write the probability to be estimated as

\displaystyle  {\bf P}( \|Mx\| \leq \lambda/\sqrt{n} \hbox{ for some unit vector } x \in {\bf C}^n ).

We can assume that {\|M\|_{op} \leq C \sqrt{n}} for some absolute constant {C}, as the event that this fails has exponentially small probability.

We pick an {\epsilon > 0} (not depending on {\lambda}) to be chosen later. We call a unit vector {x \in {\bf C}^n} compressible if {x} lies within a distance {\epsilon} of an {\epsilon n}-sparse vector. Let us first dispose of the case in which {\|Mx\| \leq \lambda \sqrt{n}} for some compressible {x}. By paying an entropy cost of {\binom{n}{\lfloor \epsilon n\rfloor}}, we may assume that {x} is within {\epsilon} of a vector {y} supported in the first {\lfloor \epsilon n\rfloor} coordinates. Using the operator norm bound on {M} and the triangle inequality, we conclude that

\displaystyle  \|My\| \leq (\lambda + C\epsilon) \sqrt{n}.

Since {y} has norm comparable to {1}, this implies that the least singular value of the first {\lfloor \epsilon n \rfloor} columns of {M} is {O( (\lambda + \epsilon)\sqrt{n} )}. But by Theorem 1, this occurs with probability {O(\exp(-cn))} (if {\lambda,\epsilon} are small enough). So the total probability of the compressible event is at most {\binom{n}{\lfloor \epsilon n\rfloor} O(\exp(-cn))}, which is acceptable if {\epsilon} is small enough.

Thus we may assume now that {\|M x \| > \lambda/\sqrt{n}} for all compressible unit vectors {x}; we may similarly assume that {\| y^* M \| > \lambda/\sqrt{n}} for all compressible unit vectors {y}. Indeed, we may also assume that {\|y^* M_i \| > \lambda/\sqrt{n}} for every {i}, where {M_i} is {M} with the {i^{th}} column removed.

The remaining case is if {\|Mx\| \leq \lambda/\sqrt{n}} for some incompressible {x}. Let us call this event {E}. Write {x=(x_1,\ldots,x_n)}, and let {Y_1,\ldots,Y_n} be the columns of {M}, thus

\displaystyle  \| x_1 Y_1 + \ldots + x_n Y_n \| \leq \lambda/\sqrt{n}.

Letting {W_i} be the subspace spanned by all the {Y_1,\ldots,Y_n} except for {Y_i}, we conclude upon projecting to the orthogonal complement of {W_i} that

\displaystyle  |x_i| \hbox{dist}(Y_i,W_i) \leq \lambda/\sqrt{n}

for all {i} (compare with Exercise 3). On the other hand, since {x} is incompressible, we see that {|x_i| \geq \epsilon/\sqrt{n}} for at least {\epsilon n} values of {i} (otherwise, rounding all the coordinates of magnitude less than {\epsilon/\sqrt{n}} down to zero would place {x} within {\epsilon} of an {\epsilon n}-sparse vector), and thus

\displaystyle  \hbox{dist}(Y_i,W_i) \leq \lambda/\epsilon \ \ \ \ \ (4)

for at least {\epsilon n} values of {i}. If we let {E_i} be the event that {E} and (4) both hold, we thus have from double-counting that

\displaystyle  {\bf P}(E) \leq \frac{1}{\epsilon n} \sum_{i=1}^n {\bf P}(E_i)

and thus by symmetry

\displaystyle  {\bf P}(E) \leq \frac{1}{\epsilon} {\bf P}(E_n)

(say). However, if {E_n} holds, then setting {y} to be a unit normal vector to {W_n} (which is necessarily incompressible, by the hypothesis on {M_n}), we have

\displaystyle  |Y_n \cdot y| \leq \lambda/\epsilon.

Again, the crucial point is that {Y_n} and {y} are independent. The incompressibility of {y}, combined with a Berry-Esséen type theorem, then gives

Exercise 4 Show that

\displaystyle  {\bf P}( |Y_n \cdot y| \leq \lambda/\epsilon ) \ll \epsilon^2

(say) if {\lambda} is sufficiently small depending on {\epsilon}, and {n} is sufficiently large depending on {\epsilon}.

This gives a bound of {O(\epsilon)} for {{\bf P}(E)} if {\lambda} is small enough depending on {\epsilon}, and {n} is large enough; this gives the claim.

Remark 2 A variant of these arguments, based on inverse Littlewood-Offord theorems rather than the Berry-Esséen theorem, gives the variant estimate

\displaystyle  \sigma_n( \frac{1}{\sqrt{n}} M_n - z I ) \geq n^{-A} \ \ \ \ \ (5)

with high probability for some {A>0}, and any {z} of polynomial size in {n}. There are several results of this type, with overlapping ranges of generality (and various values of {A}) by Götze-Tikhomirov, Pan-Zhou, and Vu and myself; the exponent {A} is known to degrade if one has too few moment assumptions on the underlying random matrix {M}. This type of result (with an unspecified {A}) is important for the circular law, discussed in the next set of lectures.

— 4. Upper bound for the least singular value —

One can complement the lower tail estimate with an upper tail estimate:

Theorem 5 (Upper tail estimate for the least singular value) For any {\lambda > 0}, one has

\displaystyle  {\bf P}( \sigma_n(M) \geq \lambda / \sqrt{n} ) \ll o_{\lambda \rightarrow \infty}(1) + o_{n \rightarrow \infty;\lambda}(1). \ \ \ \ \ (6)

We prove this using an argument of Rudelson and Vershynin. Suppose that {\sigma_n(M) > \lambda / \sqrt{n}}; then {\|M^{-1}\|_{op} < \sqrt{n}/\lambda}, and hence

\displaystyle  \|y^* M^{-1}\| \leq \sqrt{n} \|y\| / \lambda \ \ \ \ \ (7)

for all {y}.

Next, let {X_1,\ldots,X_n} be the rows of {M}, and let {R_1,\ldots,R_n} be the columns of {M^{-1}}, thus {R_1,\ldots,R_n} is a dual basis for {X_1,\ldots,X_n}. From (7) we have

\displaystyle  \sum_{i=1}^n |y \cdot R_i|^2 \leq n \|y\|^2/\lambda^2.

We apply this with {y} equal to {X_n - \pi_n(X_n)}, where {\pi_n} is the orthogonal projection to the space {V_{n-1}} spanned by {X_1,\ldots,X_{n-1}}. On the one hand, we have

\displaystyle  \|y\|^2 = \hbox{dist}(X_n,V_{n-1})^2

and on the other hand we have for any {1 \leq i < n} that

\displaystyle  y \cdot R_i = - \pi_n(X_n) \cdot R_i = - X_n \cdot \pi_n(R_i)

and so

\displaystyle  \sum_{i=1}^{n-1} |X_n \cdot \pi_n(R_i)|^2 \leq n \hbox{dist}(X_n,V_{n-1})^2 / \lambda^2. \ \ \ \ \ (8)

If (8) holds, then {|X_n \cdot \pi_n(R_i)|^2 = O( \hbox{dist}(X_n,V_{n-1})^2 / \lambda^2 )} for at least half of the {i}, so the probability in (6) can be bounded by

\displaystyle  \ll \frac{1}{n} \sum_{i=1}^{n-1} {\bf P} ( |X_n \cdot \pi_n(R_i)|^2 = O( \hbox{dist}(X_n,V_{n-1})^2 / \lambda^2 ) )

which by symmetry can be bounded by

\displaystyle  \ll {\bf P} ( |X_n \cdot \pi_n(R_1)|^2 = O( \hbox{dist}(X_n,V_{n-1})^2 / \lambda^2 ) ).

Let {\epsilon > 0} be a small quantity to be chosen later. From Talagrand’s inequality (Notes 1) we know that {\hbox{dist}(X_n,V_{n-1}) = O_\epsilon(1)} with probability {1-O(\epsilon)}, so we obtain a bound of

\displaystyle  \ll {\bf P} ( X_n \cdot \pi_n(R_1) = O_\epsilon( 1 / \lambda ) ) + O(\epsilon).

Now a key point is that the vectors {\pi_n(R_1),\ldots,\pi_n(R_{n-1})} depend only on {X_1,\ldots,X_{n-1}} and not on {X_n}; indeed, they are the dual basis for {X_1,\ldots,X_{n-1}} in {V_{n-1}}. Thus, after conditioning {X_1,\ldots,X_{n-1}} and thus {\pi_n(R_1)} to be fixed, {X_n} is still a Bernoulli random vector. Applying a Berry-Esséen inequality, we obtain a bound of {O(\epsilon)} for the conditional probability that {X_n \cdot \pi_n(R_1) = O_\epsilon( 1 / \lambda )} for {\lambda} sufficiently large depending on {\epsilon}, unless {\pi_n(R_1)} is compressible (in the sense that, say, it is within {\epsilon} of an {\epsilon n}-sparse vector). But this latter possibility can be controlled (with exponentially small probability) by the same type of arguments as before; we omit the details.

— 5. Asymptotic for the least singular value —

The distribution of singular values of a gaussian random matrix can be computed explicitly by techniques similar to those employed in Notes 6. In particular, if {M} is a real gaussian matrix (with all entries iid with distribution {N(0,1)_{\bf R}}), it was shown by Edelman that {n \sigma_n(M)^2} converges in distribution to the distribution {\mu_E := \frac{1+\sqrt{x}}{2\sqrt{x}} e^{-x/2 - \sqrt{x}}\ dx} on {(0,+\infty)} as {n \rightarrow \infty}. It turns out that this result can be extended to other ensembles with the same mean and variance. In particular, we have the following result of Van Vu and myself:

Theorem 6 If {M} is an iid Bernoulli matrix, then {n \sigma_n(M)^2} also converges in distribution to {\mu_E} as {n \rightarrow \infty}. (In fact there is a polynomial rate of convergence.)
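
As an empirical illustration of this universality (and nothing more; the matrix size and number of samples below are far too small to draw serious conclusions), one can compare a few empirical quantiles of {n \sigma_n(M)^2} for the Bernoulli and gaussian ensembles:

import numpy as np

rng = np.random.default_rng(3)
n, trials = 100, 300

def n_sigma_min_sq(sampler):
    # sorted samples of n * sigma_n(M)^2 for matrices M drawn via `sampler`
    vals = [n * np.linalg.svd(sampler((n, n)), compute_uv=False)[-1] ** 2
            for _ in range(trials)]
    return np.sort(vals)

bern = n_sigma_min_sq(lambda s: rng.choice([-1.0, 1.0], size=s))
gaus = n_sigma_min_sq(lambda s: rng.standard_normal(size=s))
for q in (0.25, 0.5, 0.75):
    i = int(q * trials)
    print("quantile", q, " Bernoulli:", round(bern[i], 3), " Gaussian:", round(gaus[i], 3))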

This should be compared with Theorems 4, 5, which show that the quantities {\sqrt{n} \sigma_n(M)} form a tight sequence of distributions in {(0,+\infty)}. The arguments of Van Vu and myself thus provide an alternate proof of these two theorems.

We do not establish the limit {\mu_E} directly, but instead use Edelman’s result as a black box, focusing instead on establishing the universality of the limiting distribution of {\sqrt{n} \sigma_n(M)}, and in particular that this limiting distribution is the same whether one has a Bernoulli ensemble or a gaussian ensemble.

The arguments are somewhat technical and we will not present them in full here, but instead give a sketch of the key ideas.

In previous sections we have already seen the close relationship between the least singular value {\sigma_n(M)}, and the distances {\hbox{dist}(X_i,V_i)} between a row {X_i} of {M} and the hyperplane {V_i} spanned by the other {n-1} rows. It is not hard to use the above machinery to show that as {n \rightarrow \infty}, {\hbox{dist}(X_i,V_i)} converges in distribution to the absolute value {|N(0,1)_{\bf R}|} of a Gaussian regardless of the underlying distribution of the coefficients of {M} (i.e. it is asymptotically universal). The basic point is that one can write {\hbox{dist}(X_i,V_i)} as {|X_i \cdot n_i|} where {n_i} is a unit normal of {V_i} (we will assume here that {M} is non-singular, which by previous arguments is true asymptotically almost surely). The previous machinery lets us show that {n_i} is incompressible with high probability, and the claim then follows from the Berry-Esséen theorem.

Unfortunately, despite the presence of suggestive relationships such as Exercise 3, the asymptotic universality of the distances {\hbox{dist}(X_i,V_i)} does not directly imply asymptotic universality of the least singular value. However, it turns out that one can obtain a higher-dimensional version of the universality of the scalar quantities {\hbox{dist}(X_i,V_i)}, as follows. For any small {k} (say, {1 \leq k \leq n^c} for some small {c>0}) and any distinct {i_1,\ldots,i_k \in \{1,\ldots,n\}}, a modification of the above argument shows that the covariance matrix

\displaystyle  ( \pi( X_{i_a} ) \cdot \pi(X_{i_b}) )_{1 \leq a,b \leq k} \ \ \ \ \ (9)

of the orthogonal projections {\pi(X_{i_1}),\ldots,\pi(X_{i_k})} of the {k} rows {X_{i_1},\ldots,X_{i_k}} to the complement {V^\perp_{i_1,\ldots,i_k}} of the space {V_{i_1,\ldots,i_k}} spanned by the other {n-k} rows of {M}, is also universal, converging in distribution to the covariance matrix {( G_a \cdot G_b )_{1 \leq a,b \leq k}} of {k} iid gaussians {G_a \equiv N(0,1)_{\bf R}} (note that the convergence of {\hbox{dist}(X_i,V_i)} to {|N(0,1)_{\bf R}|} is the {k=1} case of this claim). (These covariance matrix distributions are also known as Wishart distributions.) The key point is that one can show that the complement {V^\perp_{i_1,\ldots,i_k}} is usually “incompressible” in a certain technical sense, which implies that the projections {\pi(X_{i_a})} behave like iid gaussian vectors on that subspace, thanks to a multidimensional Berry-Esséen theorem.

On the other hand, the covariance matrix (9) is closely related to the inverse matrix {M^{-1}}:

Exercise 5 Show that (9) is also equal to {(A^* A)^{-1}}, where {A} is the {n \times k} matrix formed from the {i_1,\ldots,i_k} columns of {M^{-1}}.
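
One way to see this: the columns {R_{i_1},\ldots,R_{i_k}} of {M^{-1}} are orthogonal to every row {X_j} with {j \notin \{i_1,\ldots,i_k\}}, and so lie in {V^\perp_{i_1,\ldots,i_k}}; moreover {\pi( X_{i_a} ) \cdot R_{i_b} = X_{i_a} \cdot R_{i_b} = \delta_{ab}}, so that {\pi(X_{i_1}),\ldots,\pi(X_{i_k})} and {R_{i_1},\ldots,R_{i_k}} are dual bases of this {k}-dimensional space, and the Gram matrices of dual bases are inverses of each other.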

In particular, this shows that the singular values of {k} randomly selected columns of {M^{-1}} have a universal distribution.

Recall our goal is to show that {\sqrt{n} \sigma_n(M)} has an asymptotically universal distribution, which is equivalent to asking that {\frac{1}{\sqrt{n}} \|M^{-1}\|_{op}} has an asymptotically universal distribution. The goal is then to extract the operator norm of {M^{-1}} from looking at a random {n \times k} minor {B} of this matrix. This comes from the following application of the second moment method:

Exercise 6 Let {A} be an {n \times n} matrix with columns {R_1,\ldots,R_n}, and let {B} be the {n \times k} matrix formed by taking {k} of the columns {R_1,\ldots,R_n} at random. Show that

\displaystyle  {\bf E} \| A A^* - \frac{n}{k} B B^* \|_F^2 \leq \frac{n}{k} \sum_{j=1}^n \|R_j\|^4,

where {\|\|_F} is the Frobenius norm.
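
One way to proceed here (assuming, say, that the {k} columns are drawn uniformly at random without replacement): writing {\frac{n}{k} B B^* - A A^* = \sum_{j=1}^n (\frac{n}{k} 1_{j \in S} - 1) R_j R_j^*}, where {S} is the random set of selected column indices, the expected squared Frobenius norm expands into terms {{\bf E}[(\frac{n}{k} 1_{j \in S} - 1)(\frac{n}{k} 1_{j' \in S} - 1)]\, |R_j \cdot R_{j'}|^2}; the diagonal terms {j=j'} contribute {(\frac{n}{k}-1) \|R_j\|^4} each, and one can check that the off-diagonal coefficients are non-positive.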

Recall from Exercise 3 that {\|R_k\| = 1 / \hbox{dist}(X_k,V_k)}, so we expect each {\|R_k\|} to have magnitude about {O(1)}. This, together with the Hoffman-Wielandt inequality (Notes 3b), means that we expect {\sigma_1( M^{-1} (M^{-1})^* ) = \sigma_n(M)^{-2}} to differ by {O( n/\sqrt{k} )} from {\frac{n}{k} \sigma_1(B B^*) = \frac{n}{k} \sigma_1(B)^2}. In principle, this gives us asymptotic universality of {\sqrt{n} \sigma_n(M)} from the already established universality of {B}.

There is one technical obstacle remaining, however: while we know that each {\hbox{dist}(X_k,V_k)} is distributed like a Gaussian, so that each individual {R_k} is going to be of size {O(1)} with reasonably good probability, in order for the above exercise to be useful, one needs to bound all of the {R_k} simultaneously with high probability. A naive application of the union bound leads to terrible results here. Fortunately, there is a strong correlation between the {R_k}: they tend to be large together or small together, or equivalently, the distances {\hbox{dist}(X_k,V_k)} tend to be small together or large together. Here is one indication of this:

Lemma 7 For any {1 \leq k < i \leq n}, one has

\displaystyle  \hbox{dist}(X_i, V_i) \geq \frac{\|\pi_i(X_i)\|}{1 + \sum_{j=1}^k \frac{\| \pi_i(X_j) \|}{\hbox{dist}(X_j,V_j)}},

where {\pi_i} is the orthogonal projection onto the orthogonal complement of the space spanned by those rows {X_{k+1},\ldots,X_n} other than {X_i}.

Proof: We may relabel so that {i=k+1}; then projecting everything by {\pi_i} we may assume that {n=k+1} (one easily checks that the distances appearing in the lemma are unaffected by this projection). Our goal is now to show that

\displaystyle  \hbox{dist}(X_n, V_n) \geq \frac{\| X_n\|}{1 + \sum_{j=1}^{n-1} \frac{\| X_j \|}{\hbox{dist}(X_j,V_j)}}.

Recall that {R_1,\ldots,R_n} is a dual basis to {X_1,\ldots,X_n}. This implies in particular that

\displaystyle  x = \sum_{j=1}^n (x \cdot X_j) R_j

for any vector {x}; applying this to {X_n} we obtain

\displaystyle  X_n = \|X_n\|^2 R_n + \sum_{j=1}^{n-1} (X_j \cdot X_n) R_j

and hence by the triangle inequality

\displaystyle  \|X_n\|^2 \|R_n\| \leq \|X_n\| + \sum_{j=1}^{n-1} \|X_j\| \|X_n\| \|R_j\|.

Using the fact that {\|R_j\| = 1/\hbox{dist}(X_j,V_j)}, the claim follows. \Box
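
For instance, if all the projected norms {\|\pi_i(X_j)\|}, {\|\pi_i(X_i)\|} are comparable to {\sqrt{k}} (as one expects from Talagrand’s inequality, since {\pi_i} projects onto a space of dimension about {k} that is independent of {X_j} and {X_i}) and {\hbox{dist}(X_j,V_j) \geq c} for {j=1,\ldots,k}, then the lemma gives a lower bound of the shape {\hbox{dist}(X_i,V_i) \gg c/k}, and hence {\|R_i\| \ll k/c}; a polynomial bound of this sort is the kind of uniform (if lossy) control referred to in the next paragraph.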

In practice, once {k} gets moderately large (e.g. {k=n^c} for some small {c>0}), one can control the expressions {\|\pi_i(X_j)\|} appearing here by Talagrand’s inequality (Notes 1), and so the lemma tells us that once {\hbox{dist}(X_j,V_j)} is bounded away from zero for {j=1,\ldots,k}, then {\hbox{dist}(X_i,V_i)} cannot be too small for the remaining values of {i} either. This turns out to be enough to obtain the required uniform control on the {R_j} to make Exercise 6 useful, and ultimately to complete the proof of Theorem 6.