
Now we turn attention to another important spectral statistic, the least singular value ${\sigma_n(M)}$ of an ${n \times n}$ matrix ${M}$ (or, more generally, the least non-trivial singular value ${\sigma_p(M)}$ of an ${n \times p}$ matrix with ${p \leq n}$). This quantity controls the invertibility of ${M}$. Indeed, ${M}$ is invertible precisely when ${\sigma_n(M)}$ is non-zero, and the operator norm ${\|M^{-1}\|_{op}}$ of ${M^{-1}}$ is given by ${1/\sigma_n(M)}$. This quantity is also related to the condition number ${\sigma_1(M)/\sigma_n(M) = \|M\|_{op} \|M^{-1}\|_{op}}$ of ${M}$, which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of ${M}$ (and more generally, of the shifts ${\frac{1}{\sqrt{n}} M - zI}$ for complex ${z}$) will be of importance in rigorously establishing the circular law for iid random matrices ${M}$, as it plays a key role in computing the Stieltjes transform ${\frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M - zI)^{-1}}$ of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.
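These identities are easy to check numerically. The following is my own illustration (in NumPy, not part of the original notes), using a random sign matrix as a stand-in for a generic invertible ${M}$, verifying that ${\|M^{-1}\|_{op} = 1/\sigma_n(M)}$ and that the condition number equals ${\sigma_1(M)/\sigma_n(M)}$:

```python
import numpy as np

# Numerical sanity check of the identities
#   ||M^{-1}||_op = 1 / sigma_n(M)   and   kappa(M) = sigma_1 / sigma_n,
# with a random sign matrix as a stand-in for a generic invertible M.
rng = np.random.default_rng(0)
n = 200
M = rng.choice([-1.0, 1.0], size=(n, n))

s = np.linalg.svd(M, compute_uv=False)             # singular values, descending
sigma_1, sigma_n = s[0], s[-1]

inv_op_norm = np.linalg.norm(np.linalg.inv(M), 2)  # ||M^{-1}||_op
assert abs(inv_op_norm - 1.0 / sigma_n) < 1e-6 * inv_op_norm

cond = np.linalg.cond(M, 2)                        # condition number sigma_1/sigma_n
assert abs(cond - sigma_1 / sigma_n) < 1e-6 * cond
```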

The least singular value

$\displaystyle \sigma_n(M) = \inf_{\|x\|=1} \|Mx\|,$

which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm

$\displaystyle \|M\|_{op} = \sigma_1(M) = \sup_{\|x\|=1} \|Mx\|$

at the “soft edge” of the spectrum, that was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of ${x}$, but it is not able to control the “generic” or “incompressible” choices of ${x}$, for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with ${p=yn}$ for some ${0 < y < 1}$, it gives an upper bound of ${(1-\sqrt{y}+o(1))\sqrt{n}}$ for the least singular value with high probability, thanks to the Marchenko-Pastur law), but again this method begins to break down for square matrices, although one can make some partial headway by considering negative moments such as ${\hbox{tr} M^{-2}}$, though these are more difficult to compute than positive moments ${\hbox{tr} M^k}$.
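The way a negative moment controls the least singular value can be seen from the identity ${\hbox{tr}((M^* M)^{-1}) = \sum_i \sigma_i(M)^{-2} \geq \sigma_n(M)^{-2}}$, so an upper bound on such a moment yields a lower bound on ${\sigma_n(M)}$. A quick numerical sketch of this inequality (my own illustration in NumPy, not part of the original notes):

```python
import numpy as np

# A negative moment bounds the least singular value from below:
#   tr((M^T M)^{-1}) = sum_i sigma_i^{-2} >= sigma_n^{-2},
# hence sigma_n >= tr((M^T M)^{-1})^{-1/2}.
rng = np.random.default_rng(1)
n = 150
M = rng.choice([-1.0, 1.0], size=(n, n))

s = np.linalg.svd(M, compute_uv=False)
neg_moment = np.trace(np.linalg.inv(M.T @ M))      # sum of sigma_i^{-2}
assert abs(neg_moment - np.sum(s ** -2.0)) < 1e-6 * neg_moment
assert s[-1] >= neg_moment ** -0.5 - 1e-12         # lower bound on sigma_n
```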

So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the ${n}$ rows ${X_1,\ldots,X_n \in {\bf C}^n}$ of the matrix ${M}$, and the hyperplane spanned by the other ${n-1}$ rows. The reason for this is as follows. First suppose that ${\sigma_n(M)=0}$, so that ${M}$ is non-invertible, and there is a linear dependence between the rows ${X_1,\ldots,X_n}$. Thus, one of the ${X_i}$ will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the ${n}$ distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so ${\sigma_n(M)=0}$.

More generally, if the least singular value ${\sigma_n(M)}$ is small, one generically expects many of these ${n}$ distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row ${X_i}$ and the hyperplane spanned by the other rows. This latter quantity is basically the dot product of ${X_i}$ with a unit normal ${n_i}$ of this hyperplane.
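One direction of this equivalence is immediate, and easy to test numerically: if ${n_i}$ is a unit normal to the hyperplane ${H_i}$ spanned by the other rows, then ${M n_i}$ has only one non-zero entry, namely ${X_i \cdot n_i}$, and so ${\sigma_n(M) \leq \|M n_i\| = \hbox{dist}(X_i, H_i)}$ for every ${i}$. Here is a small check of this bound (my own illustration in NumPy, not from the original notes):

```python
import numpy as np

# Check that sigma_n(M) <= dist(X_i, H_i) for every row i, where H_i is
# the span of the other n-1 rows: a unit normal n_i to H_i satisfies
# X_j . n_i = 0 for j != i, so ||M n_i|| = |X_i . n_i| = dist(X_i, H_i).
rng = np.random.default_rng(2)
n = 50
M = rng.choice([-1.0, 1.0], size=(n, n))
sigma_n = np.linalg.svd(M, compute_uv=False)[-1]

for i in range(n):
    others = np.delete(M, i, axis=0)       # the other n-1 rows
    # unit normal to their span: the last right-singular vector of 'others'
    n_i = np.linalg.svd(others)[2][-1]
    dist_i = abs(M[i] @ n_i)               # distance from X_i to H_i
    assert sigma_n <= dist_i + 1e-9
```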

When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal ${n_i}$ (which depends on all the rows other than ${X_i}$) is independent of ${X_i}$, so even after conditioning ${n_i}$ to be fixed, the entries of ${X_i}$ remain independent. As such, the dot product ${X_i \cdot n_i}$ is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal ${n_i}$ is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)
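A quick Monte Carlo experiment (my own illustration, not from the original notes) shows why the “compressible” normals are the problematic case: for a Bernoulli row ${X}$, the small-ball probability ${{\bf P}(X \cdot v = 0)}$ is much larger when ${v}$ is concentrated on a few coordinates than when its mass is spread out, in accordance with the Littlewood-Offord theory:

```python
import numpy as np

# Small-ball probability P(X . v = 0) for Bernoulli X, comparing a
# "compressible" normal v (mass on two coordinates) against an
# "incompressible" one (mass spread over all coordinates).
rng = np.random.default_rng(3)
n, trials = 100, 20000
X = rng.choice([-1.0, 1.0], size=(trials, n))

v_comp = np.zeros(n)
v_comp[:2] = 1.0 / np.sqrt(2.0)            # compressible: two coordinates
v_inc = np.ones(n) / np.sqrt(n)            # incompressible: all coordinates

p_comp = np.mean(np.abs(X @ v_comp) < 1e-9)  # ~ P(xi_1 + xi_2 = 0) = 1/2
p_inc = np.mean(np.abs(X @ v_inc) < 1e-9)    # ~ O(1/sqrt(n)) by the
                                             # Erdos-Littlewood-Offord bound
assert p_comp > p_inc
```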

These methods rely quite strongly on the joint independence on all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).

To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the Bernoulli ensemble ${M = (\xi_{ij})_{1 \leq i,j \leq n}}$ of random sign matrices, where ${\xi_{ij} = \pm 1}$ are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.

Now that we have developed the basic probabilistic tools that we will need, we now turn to the main subject of this course, namely the study of random matrices. There are many random matrix models (aka matrix ensembles) of interest – far too many to all be discussed in a single course. We will thus focus on just a few simple models. First of all, we shall restrict attention to square matrices ${M = (\xi_{ij})_{1 \leq i,j \leq n}}$, where ${n}$ is a (large) integer and the ${\xi_{ij}}$ are real or complex random variables. (One can certainly study rectangular matrices as well, but for simplicity we will only look at the square case.) Then, we shall restrict to three main models:

• Iid matrix ensembles, in which the coefficients ${\xi_{ij}}$ are iid random variables with a single distribution ${\xi_{ij} \equiv \xi}$. We will often normalise ${\xi}$ to have mean zero and unit variance. Examples of iid models include the Bernoulli ensemble (aka random sign matrices) in which the ${\xi_{ij}}$ are signed Bernoulli variables, the real gaussian matrix ensemble in which ${\xi_{ij} \equiv N(0,1)_{\bf R}}$, and the complex gaussian matrix ensemble in which ${\xi_{ij} \equiv N(0,1)_{\bf C}}$.
• Symmetric Wigner matrix ensembles, in which the upper triangular coefficients ${\xi_{ij}}$, ${j \geq i}$ are jointly independent and real, but the lower triangular coefficients ${\xi_{ij}}$, ${j < i}$ are constrained to equal their transposes: ${\xi_{ij}=\xi_{ji}}$. Thus ${M}$ by construction is always a real symmetric matrix. Typically, the strictly upper triangular coefficients will be iid, as will the diagonal coefficients, but the two classes of coefficients may have a different distribution. One example here is the symmetric Bernoulli ensemble, in which both the strictly upper triangular and the diagonal entries are signed Bernoulli variables; another important example is the Gaussian Orthogonal Ensemble (GOE), in which the upper triangular entries have distribution ${N(0,1)_{\bf R}}$ and the diagonal entries have distribution ${N(0,2)_{\bf R}}$. (We will explain the reason for this discrepancy later.)
• Hermitian Wigner matrix ensembles, in which the upper triangular coefficients are jointly independent, with the diagonal entries being real and the strictly upper triangular entries complex, and the lower triangular coefficients ${\xi_{ij}}$, ${j < i}$ are constrained to equal their adjoints: ${\xi_{ij} = \overline{\xi_{ji}}}$. Thus ${M}$ by construction is always a Hermitian matrix. This class of ensembles contains the symmetric Wigner ensembles as a subclass. Another very important example is the Gaussian Unitary Ensemble (GUE), in which all off-diagonal entries have distribution ${N(0,1)_{\bf C}}$, but the diagonal entries have distribution ${N(0,1)_{\bf R}}$.
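As a concrete reference, here is one way to sample from each of the three model classes above (my own sketch in NumPy, not part of the original notes; the symmetry constraints are enforced by reflecting the upper triangle):

```python
import numpy as np

# Sample constructions of the three model classes: an iid Bernoulli
# matrix, a GOE matrix, and a GUE matrix.
rng = np.random.default_rng(4)
n = 100

# iid Bernoulli ensemble: independent +-1 entries
M_iid = rng.choice([-1.0, 1.0], size=(n, n))

# GOE: N(0,1) strictly above the diagonal, N(0,2) on the diagonal,
# lower triangle constrained to equal the transpose
A = rng.standard_normal((n, n))
M_goe = np.triu(A, 1) + np.triu(A, 1).T + np.diag(np.sqrt(2.0) * np.diag(A))
assert np.allclose(M_goe, M_goe.T)            # real symmetric

# GUE: complex N(0,1) strictly above the diagonal (unit total variance),
# real N(0,1) on the diagonal, lower triangle equal to the adjoint
B = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
M_gue = np.triu(B, 1) + np.triu(B, 1).conj().T + np.diag(rng.standard_normal(n))
assert np.allclose(M_gue, M_gue.conj().T)     # Hermitian
```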

Given a matrix ensemble ${M}$, there are many statistics of ${M}$ that one may wish to consider, e.g. the eigenvalues or singular values of ${M}$, the trace and determinant, etc. In these notes we will focus on a basic statistic, namely the operator norm

$\displaystyle \| M \|_{op} := \sup_{x \in {\bf C}^n: |x|=1} |Mx| \ \ \ \ \ (1)$

of the matrix ${M}$. This is an interesting quantity in its own right, but also serves as a basic upper bound on many other quantities. (For instance, ${\|M\|_{op}}$ is also the largest singular value ${\sigma_1(M)}$ of ${M}$ and thus dominates the other singular values; similarly, all eigenvalues ${\lambda_i(M)}$ of ${M}$ clearly have magnitude at most ${\|M\|_{op}}$.) Because of this, it is particularly important to get good upper tail bounds
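Both of these domination claims can be verified directly (my own illustration in NumPy, not from the original notes): the operator norm coincides with the top singular value, and every eigenvalue has magnitude at most the operator norm:

```python
import numpy as np

# The operator norm equals the largest singular value sigma_1(M), and
# dominates every eigenvalue of M in magnitude.
rng = np.random.default_rng(5)
n = 120
M = rng.choice([-1.0, 1.0], size=(n, n))

op = np.linalg.norm(M, 2)                      # operator norm (1)
sigma_1 = np.linalg.svd(M, compute_uv=False)[0]
assert abs(op - sigma_1) < 1e-8 * op

eigs = np.linalg.eigvals(M)
assert np.max(np.abs(eigs)) <= op + 1e-8 * op
```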

$\displaystyle {\bf P}( \|M\|_{op} \geq \lambda ) \leq \ldots$

on this quantity, for various thresholds ${\lambda}$. (Lower tail bounds are also of interest, of course; for instance, they give us confidence that the upper tail bounds are sharp.) Also, as we shall see, the problem of upper bounding ${\|M\|_{op}}$ can be viewed as a non-commutative analogue of upper bounding the quantity ${|S_n|}$ studied in Notes 1. (The analogue of the central limit theorem in Notes 2 is the Wigner semi-circular law, which will be studied in the next set of notes.)

An ${n \times n}$ matrix consisting entirely of ${1}$s has an operator norm of exactly ${n}$, as can for instance be seen from the Cauchy-Schwarz inequality. More generally, any matrix whose entries are all uniformly ${O(1)}$ will have an operator norm of ${O(n)}$ (which can again be seen from Cauchy-Schwarz, or alternatively from Schur’s test, or from a computation of the Frobenius norm). However, this argument does not take advantage of possible cancellations in ${M}$. Indeed, from analogy with concentration of measure, when the entries of the matrix ${M}$ are independent, bounded and have mean zero, we expect the operator norm to be of size ${O(\sqrt{n})}$ rather than ${O(n)}$. We shall see shortly that this intuition is indeed correct. (One can see, though, that the mean zero hypothesis is important; from the triangle inequality we see that if we add the all-ones matrix (for instance) to a random matrix with mean zero, to obtain a random matrix whose coefficients all have mean ${1}$, then at least one of the two random matrices necessarily has operator norm at least ${n/2}$.)
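Both claims are easy to observe experimentally (my own sketch in NumPy, not from the original notes): the all-ones matrix has operator norm exactly ${n}$, while a mean-zero random sign matrix of the same size has operator norm of order ${\sqrt{n}}$ (in fact about ${2\sqrt{n}}$, as these notes go on to show):

```python
import numpy as np

# Cancellation in action: the all-ones n x n matrix has operator norm
# exactly n, but a mean-zero random sign matrix has operator norm O(sqrt(n)).
rng = np.random.default_rng(6)
n = 400

ones = np.ones((n, n))
assert abs(np.linalg.norm(ones, 2) - n) < 1e-6 * n

M = rng.choice([-1.0, 1.0], size=(n, n))
op = np.linalg.norm(M, 2)
assert op < 3.0 * np.sqrt(n)   # O(sqrt(n)), far below the trivial O(n) bound
assert op > np.sqrt(n)         # sigma_1 >= ||M||_F / sqrt(n) = sqrt(n)
```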

As mentioned before, there is an analogy here with the concentration of measure phenomenon, and many of the tools used in the latter (e.g. the moment method) will also appear here. (Indeed, we will be able to use some of the concentration inequalities from Notes 1 directly to help control ${\|M\|_{op}}$ and related quantities.) Similarly, just as many of the tools from concentration of measure could be adapted to help prove the central limit theorem, several of the tools seen here will be of use in deriving the semi-circular law.

The most advanced knowledge we have on the operator norm is given by the Tracy-Widom law, which not only tells us the range in which the operator norm is concentrated (it turns out, for instance, that for a Wigner matrix (with some additional technical assumptions), it is concentrated in the range ${[2\sqrt{n} - O(n^{-1/6}), 2\sqrt{n} + O(n^{-1/6})]}$), but also what its distribution in that range is. While the methods in this set of notes can eventually be pushed to establish this result, this is far from trivial, and will only be briefly discussed here. (We may return to the Tracy-Widom law later in this course, though.)