
Let ${X}$ and ${Y}$ be two random variables taking values in the same (discrete) range ${R}$, and let ${E}$ be some subset of ${R}$, which we think of as the set of “bad” outcomes for either ${X}$ or ${Y}$. If ${X}$ and ${Y}$ have the same probability distribution, then clearly

$\displaystyle {\bf P}( X \in E ) = {\bf P}( Y \in E ).$

In particular, if it is rare for ${Y}$ to lie in ${E}$, then it is also rare for ${X}$ to lie in ${E}$.

If ${X}$ and ${Y}$ do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance ${\delta(X,Y)}$ between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

$\displaystyle {\bf P}(Y \in E) - \delta(X,Y) \leq {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \delta(X,Y) \ \ \ \ \ (1)$

for any ${E \subset R}$. In particular, if it is rare for ${Y}$ to lie in ${E}$, and ${X,Y}$ are close in total variation, then it is also rare for ${X}$ to lie in ${E}$.

A basic inequality in information theory is Pinsker’s inequality

$\displaystyle \delta(X,Y) \leq \sqrt{\frac{1}{2} D_{KL}(X||Y)}$

where the Kullback-Leibler divergence ${D_{KL}(X||Y)}$ is defined by the formula

$\displaystyle D_{KL}(X||Y) = \sum_{x \in R} {\bf P}( X=x ) \log \frac{{\bf P}(X=x)}{{\bf P}(Y=x)}.$

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that ${D_{KL}(X||Y)}$ is non-negative (Gibbs’ inequality), and vanishes if and only if ${X}$, ${Y}$ have the same distribution; thus one can think of ${D_{KL}(X||Y)}$ as a measure of how close the distributions of ${X}$ and ${Y}$ are to each other, although one should caution that this is not a symmetric notion of distance, as ${D_{KL}(X||Y) \neq D_{KL}(Y||X)}$ in general. Inserting Pinsker’s inequality into (1), we see for instance that

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \sqrt{\frac{1}{2} D_{KL}(X||Y)}.$

Thus, if ${X}$ is close to ${Y}$ in the Kullback-Leibler sense, and it is rare for ${Y}$ to lie in ${E}$, then it is rare for ${X}$ to lie in ${E}$ as well.
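As a quick sanity check of (1) and of Pinsker's inequality, one can compute both sides for a small example. The following is only a minimal sketch: the two distributions and the "bad" set are hypothetical values chosen for illustration, and the total variation distance is computed as half the $\ell^1$ distance between the probability mass functions.

```python
import math

def total_variation(p, q):
    """Total variation distance: half the l^1 distance between two pmfs."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """D_KL(p||q) with natural logarithms; terms with p(x) = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical distributions of X and Y on a four-point range R = {0, 1, 2, 3}.
p_x = [0.40, 0.30, 0.20, 0.10]
p_y = [0.25, 0.25, 0.25, 0.25]
bad = [2, 3]  # indices forming the "bad" set E (an illustrative choice)

tv = total_variation(p_x, p_y)
pinsker = math.sqrt(0.5 * kl_divergence(p_x, p_y))
prob_x = sum(p_x[i] for i in bad)
prob_y = sum(p_y[i] for i in bad)

print(f"delta(X,Y) = {tv:.4f}, sqrt(D_KL/2) = {pinsker:.4f}")
print(f"P(X in E) = {prob_x:.4f}, P(Y in E) + sqrt(D_KL/2) = {prob_y + pinsker:.4f}")
```

Swapping the roles of `p_x` and `p_y` in `kl_divergence` gives a different value, which illustrates the asymmetry of the divergence noted above.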

We can specialise this inequality to the case when ${Y}$ is a uniform random variable ${U}$ on a finite range ${R}$ of some cardinality ${N}$, in which case the Kullback-Leibler divergence ${D_{KL}(X||U)}$ simplifies to

$\displaystyle D_{KL}(X||U) = \log N - {\bf H}(X)$

where

$\displaystyle {\bf H}(X) := \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}$

is the Shannon entropy of ${X}$. Again, a routine application of Jensen’s inequality shows that ${{\bf H}(X) \leq \log N}$, with equality if and only if ${X}$ is uniformly distributed on ${R}$. The above inequality then becomes

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(U \in E) + \sqrt{\frac{1}{2}(\log N - {\bf H}(X))}. \ \ \ \ \ (2)$

Thus, if ${E}$ is a small fraction of ${R}$ (so that it is rare for ${U}$ to lie in ${E}$), and the entropy of ${X}$ is very close to the maximum possible value of ${\log N}$, then it is rare for ${X}$ to lie in ${E}$ also.
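The identity ${D_{KL}(X||U) = \log N - {\bf H}(X)}$ and the bound (2) can be checked numerically in the same spirit. This is a sketch under assumed data: the near-uniform distribution `p_x` and the one-point set `bad` are hypothetical.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(X) in nats; outcomes of probability 0 contribute 0."""
    return sum(a * math.log(1.0 / a) for a in p if a > 0)

# Hypothetical near-uniform distribution on a range of cardinality N = 4.
p_x = [0.30, 0.28, 0.22, 0.20]
n = len(p_x)
bad = [3]  # illustrative "bad" set E

# D_KL(X||U) computed directly from the definition, using P(U = x) = 1/N.
kl_to_uniform = sum(a * math.log(a * n) for a in p_x if a > 0)
entropy_gap = math.log(n) - shannon_entropy(p_x)

prob_x = sum(p_x[i] for i in bad)
bound = len(bad) / n + math.sqrt(0.5 * entropy_gap)  # right-hand side of (2)

print(f"D_KL(X||U) = {kl_to_uniform:.6f}, log N - H(X) = {entropy_gap:.6f}")
print(f"P(X in E) = {prob_x:.4f} <= {bound:.4f}")
```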

The inequality (2) is only useful when the entropy ${{\bf H}(X)}$ is close to ${\log N}$ in the sense that ${{\bf H}(X) = \log N - O(1)}$; otherwise the bound is worse than the trivial bound of ${{\bf P}(X \in E) \leq 1}$. In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy ${{\bf H}(X)}$ was allowed to be smaller than ${\log N - O(1)}$. More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality) Let ${X}$ be a random variable taking values in a finite range ${R}$ of cardinality ${N}$, let ${U}$ be a uniformly distributed random variable in ${R}$, and let ${E}$ be a subset of ${R}$. Then

$\displaystyle {\bf P}(X \in E) \leq \frac{(\log N - {\bf H}(X)) + \log 2}{\log 1/{\bf P}(U \in E)}.$

Proof: Consider the conditional entropy ${{\bf H}(X | 1_{X \in E} )}$. On the one hand, we have

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf H}(X, 1_{X \in E}) - {\bf H}(1_{X \in E} )$

$\displaystyle = {\bf H}(X) - {\bf H}(1_{X \in E})$

$\displaystyle \geq {\bf H}(X) - \log 2$

by Jensen’s inequality. On the other hand, one has

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf P}(X \in E) {\bf H}(X | X \in E )$

$\displaystyle + (1-{\bf P}(X \in E)) {\bf H}(X | X \not \in E)$

$\displaystyle \leq {\bf P}(X \in E) \log |E| + (1-{\bf P}(X \in E)) \log N$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{N}{|E|}$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{1}{{\bf P}(U \in E)},$

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. $\Box$
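Lemma 1 is easy to test empirically. The sketch below draws some random probability distributions on a range of size ${N = 1024}$ (the sampling scheme, the seed, and the four-element set ${E}$ are all arbitrary choices made for illustration) and records the largest observed ratio between ${{\bf P}(X \in E)}$ and the right-hand side of the lemma, which should never exceed ${1}$.

```python
import math
import random

def shannon_entropy(p):
    """Shannon entropy in nats; outcomes of probability 0 contribute 0."""
    return sum(a * math.log(1.0 / a) for a in p if a > 0)

def lemma1_bound(p, bad):
    """Right-hand side of Lemma 1: (log N - H(X) + log 2) / log(1 / P(U in E))."""
    n = len(p)
    gap = math.log(n) - shannon_entropy(p)
    return (gap + math.log(2)) / math.log(n / len(bad))

random.seed(0)
n = 1024
bad = range(4)  # E = {0, 1, 2, 3}, so P(U in E) = 4/1024

worst_ratio = 0.0
for _ in range(200):
    # Normalised exponential weights: one hypothetical way to draw a random pmf.
    weights = [random.expovariate(1.0) for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]
    prob = sum(p[i] for i in bad)
    worst_ratio = max(worst_ratio, prob / lemma1_bound(p, bad))

print(f"largest observed P(X in E) / bound = {worst_ratio:.4f}")
```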

Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

$\displaystyle {\bf P}(X \in E) \leq \frac{D(X||Y) + \log 2}{\log 1/{\bf P}(Y \in E)}$

for arbitrary random variables ${X,Y}$ taking values in the same discrete range ${R}$, which follows from the data processing inequality

$\displaystyle D( f(X)||f(Y)) \leq D(X|| Y)$

for arbitrary functions ${f}$, applied to the indicator function ${f = 1_E}$. Indeed one has

$\displaystyle D( 1_E(X) || 1_E(Y) ) = {\bf P}(X \in E) \log \frac{{\bf P}(X \in E)}{{\bf P}(Y \in E)}$

$\displaystyle + {\bf P}(X \not \in E) \log \frac{{\bf P}(X \not \in E)}{{\bf P}(Y \not \in E)}$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - h( {\bf P}(X \in E) )$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - \log 2$

where ${h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}}$ is the entropy function.

Thus, for instance, if one has

$\displaystyle {\bf H}(X) \geq \log N - o(K)$

and

$\displaystyle {\bf P}(U \in E) \leq \exp( - K )$

for some ${K}$ much larger than ${1}$ (so that ${1/K = o(1)}$), then

$\displaystyle {\bf P}(X \in E) = o(1).$

More informally: if the entropy of ${X}$ is somewhat close to the maximum possible value of ${\log N}$, and it is exponentially rare for a uniform variable to lie in ${E}$, then it is still somewhat rare for ${X}$ to lie in ${E}$. The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable ${X}$ which is uniformly distributed inside a small set ${E}$ with some probability ${p}$ and uniformly distributed outside of ${E}$ with probability ${1-p}$, for some parameter ${0 \leq p \leq 1}$.
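To see the near-sharpness concretely, one can evaluate the bound of Lemma 1 on the two-level example just described, using the closed form ${{\bf H}(X) = h(p) + p \log |E| + (1-p) \log(N - |E|)}$ for its entropy. The sizes ${N = 2^{60}}$, ${|E| = 2^{20}}$ and the value ${p = 0.3}$ below are hypothetical choices.

```python
import math

def binary_entropy(u):
    """h(u) = u log(1/u) + (1-u) log(1/(1-u)), with h(0) = h(1) = 0."""
    if u in (0.0, 1.0):
        return 0.0
    return u * math.log(1.0 / u) + (1.0 - u) * math.log(1.0 / (1.0 - u))

# Hypothetical sizes: |R| = 2^60 and |E| = 2^20, so P(U in E) = 2^-40.
range_size, bad_size = 2**60, 2**20
p = 0.3  # probability that X lands in E

# X is uniform on E with probability p and uniform on R \ E otherwise, so
# H(X) = h(p) + p log|E| + (1 - p) log(|R| - |E|).
entropy = (binary_entropy(p)
           + p * math.log(bad_size)
           + (1.0 - p) * math.log(range_size - bad_size))

# Right-hand side of Lemma 1.
bound = (math.log(range_size) - entropy + math.log(2)) / math.log(range_size / bad_size)

print(f"P(X in E) = {p}, Lemma 1 bound = {bound:.5f}")
```

In this example the gap between the bound and ${p}$ comes almost entirely from the term ${\log 2 - h(p)}$ in the numerator, divided by the large quantity ${\log 1/{\bf P}(U \in E)}$.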

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

$\displaystyle F(U) \approx {\bf E} F(U)$

with exponentially high probability, where ${U}$ is a uniform distribution and ${F}$ is some reasonable function of ${U}$. Combining this with the above lemma, we can then obtain approximations of the form

$\displaystyle F(X) \approx {\bf E} F(U) \ \ \ \ \ (3)$

with somewhat high probability, if the entropy of ${X}$ is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable ${X}$ did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

$\displaystyle \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+j+p) 1_{p|n+j}$

$\displaystyle \approx \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+j+p)}{p}$

for “most” choices of ${n}$ and a suitable choice of ${H}$ (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as ${\sum_{n \leq x} \frac{\lambda(n)\lambda(n+1)}{n}}$ through the multiplicativity of ${\lambda}$, while the right-hand side, being a linear correlation involving two parameters ${j,p}$ rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.

Van Vu and I have just uploaded to the arXiv our paper Random matrices: Sharp concentration of eigenvalues, submitted to the Electronic Journal of Probability. As with many of our previous papers, this paper is concerned with the distribution of the eigenvalues ${\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)}$ of a random Wigner matrix ${M_n}$ (such as a matrix drawn from the Gaussian Unitary Ensemble (GUE) or Gaussian Orthogonal Ensemble (GOE)). To simplify the discussion we shall mostly restrict attention to the bulk of the spectrum, i.e. to eigenvalues ${\lambda_i(M_n)}$ with ${\delta n \leq i \leq (1-\delta) n}$ for some fixed ${\delta>0}$, although analogues of most of the results below have also been obtained at the edge of the spectrum.

If we normalise the entries of the matrix ${M_n}$ to have mean zero and variance ${1/n}$, then in the asymptotic limit ${n \rightarrow \infty}$, we have the Wigner semicircle law, which asserts that the eigenvalues are asymptotically distributed according to the semicircular distribution ${\rho_{sc}(x)\ dx}$, where

$\displaystyle \rho_{sc}(x) := \frac{1}{2\pi} (4-x^2)_+^{1/2}.$

An essentially equivalent way of saying this is that for large ${n}$, we expect the ${i^{th}}$ eigenvalue ${\lambda_i(M_n)}$ of ${M_n}$ to stay close to the classical location ${\gamma_i \in [-2,2]}$, defined by the formula

$\displaystyle \int_{-2}^{\gamma_i} \rho_{sc}(x)\ dx = \frac{i}{n}.$

In particular, from the Wigner semicircle law it can be shown that asymptotically almost surely, one has

$\displaystyle \lambda_i(M_n) = \gamma_i + o(1) \ \ \ \ \ (1)$

for all ${1 \leq i \leq n}$.
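The classical locations ${\gamma_i}$ have no simple closed form, but they are easy to compute numerically. The sketch below uses the explicit antiderivative of ${\rho_{sc}}$ together with bisection; the function names and the tolerance are illustrative choices.

```python
import math

def semicircle_cdf(x):
    """Integral of rho_sc from -2 to x, valid for -2 <= x <= 2."""
    return (0.5
            + x * math.sqrt(4.0 - x * x) / (4.0 * math.pi)
            + math.asin(x / 2.0) / math.pi)

def classical_location(i, n, tol=1e-12):
    """Solve semicircle_cdf(gamma) = i/n for gamma in [-2, 2] by bisection."""
    target = i / n
    lo, hi = -2.0, 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if semicircle_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 10
gammas = [classical_location(i, n) for i in range(1, n + 1)]
print([round(g, 4) for g in gammas])
```

By the symmetry of the semicircular distribution, the location with ${i = n/2}$ is the origin, and the locations spread out towards ${\pm 2}$ as ${i}$ approaches ${0}$ or ${n}$, reflecting the lower density of eigenvalues near the edge.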

In the modern study of the spectrum of Wigner matrices (and in particular as a key tool in establishing universality results), it has become of interest to improve the error term in (1) as much as possible. A typical early result in this direction was by Bai, who used the Stieltjes transform method to obtain polynomial convergence rates of the shape ${O(n^{-c})}$ for some absolute constant ${c>0}$; see also the subsequent papers of Alon-Krivelevich-Vu and of Meckes, who were able to obtain such convergence rates (with exponentially high probability) by using concentration of measure tools, such as Talagrand’s inequality. On the other hand, in the case of the GUE ensemble it is known (by this paper of Gustavsson) that ${\lambda_i(M_n)}$ has variance comparable to ${\frac{\log n}{n^2}}$ in the bulk, so that the optimal error term in (1) should be about ${O(\log^{1/2} n/n)}$. (One may think that if one wanted bounds on (1) that were uniform in ${i}$, one would need to enlarge the error term further, but this does not appear to be the case, due to strong correlations between the ${\lambda_i}$; note for instance this recent result of Ben Arous and Bourgade that the largest gap between eigenvalues in the bulk is typically of order ${O(\log^{1/2} n/n)}$.)

A significant advance in this direction was achieved by Erdos, Schlein, and Yau in a series of papers where they used a combination of Stieltjes transform and concentration of measure methods to obtain local semicircle laws which showed, among other things, that one had asymptotics of the form

$\displaystyle N(I) = (1+o(1)) \int_I \rho_{sc}(x)\ dx$

with exponentially high probability for intervals ${I}$ in the bulk that were as short as ${n^{-1+\epsilon}}$ for some ${\epsilon>0}$, where ${N(I)}$ is the number of eigenvalues. These asymptotics are consistent with a good error term in (1), and are already sufficient for many applications, but do not quite imply a strong concentration result for individual eigenvalues ${\lambda_i}$ (basically because they do not preclude long-range or “secular” shifts in the spectrum that involve large blocks of eigenvalues at mesoscopic scales). Nevertheless, this was rectified in a subsequent paper of Erdos, Yau, and Yin, which roughly speaking obtained a bound of the form

$\displaystyle \lambda_i(M_n) = \gamma_i + O( \frac{\log^{O(\log\log n)} n}{n} )$

in the bulk with exponentially high probability, for Wigner matrices obeying some exponential decay conditions on the entries. This was achieved by a rather delicate high moment calculation, in which the contribution of the diagonal entries of the resolvent (whose average forms the Stieltjes transform) was shown to mostly cancel each other out.

As the GUE computations show, this concentration result is sharp up to the quasilogarithmic factor ${\log^{O(\log\log n)} n}$. The main result of this paper is to improve the concentration result to one more in line with the GUE case, namely

$\displaystyle \lambda_i(M_n) = \gamma_i + O( \frac{\log^{O(1)} n}{n} )$

with exponentially high probability (see the paper for a more precise statement of results). The one catch is that an additional hypothesis is required, namely that the entries of the Wigner matrix have vanishing third moment. We also obtain similar results for the edge of the spectrum (but with a different scaling).

Our arguments are rather different from those of Erdos, Yau, and Yin, and thus provide an alternate approach to establishing eigenvalue concentration. The main tool is the Lindeberg exchange strategy, which is also used to prove the Four Moment Theorem (although we do not directly invoke the Four Moment Theorem in our analysis). The main novelty is that this exchange strategy is now used to establish large deviation estimates (i.e. exponentially small tail probabilities) rather than universality of the limiting distribution. Roughly speaking, the basic point is as follows. The Lindeberg exchange strategy seeks to compare a function ${F(X_1,\ldots,X_n)}$ of many independent random variables ${X_1,\ldots,X_n}$ with the same function ${F(Y_1,\ldots,Y_n)}$ of a different set of random variables (which match moments with the original set of variables to some order, such as to second or fourth order) by exchanging the random variables one at a time. Typically, one tries to upper bound expressions such as

$\displaystyle {\bf E} \phi(F(X_1,\ldots,X_n)) - \phi(F(X_1,\ldots,X_{n-1},Y_n))$

for various smooth test functions ${\phi}$, by performing a Taylor expansion in the variable being swapped and taking advantage of the matching moment hypotheses. In previous implementations of this strategy, ${\phi}$ was a bounded test function, which allowed one to get control of the bulk of the distribution of ${F(X_1,\ldots,X_n)}$, and in particular to control probabilities such as

$\displaystyle {\bf P}( a \leq F(X_1,\ldots,X_n) \leq b )$

for various thresholds ${a}$ and ${b}$, but did not give good control on the tail as the error terms tended to be polynomially decaying in ${n}$ rather than exponentially decaying. However, it turns out that one can modify the exchange strategy to deal with moments such as

$\displaystyle {\bf E} (1 + F(X_1,\ldots,X_n)^2)^k$

for various moderately large ${k}$ (e.g. of size comparable to ${\log n}$), obtaining results such as

$\displaystyle {\bf E} (1 + F(Y_1,\ldots,Y_n)^2)^k = (1+o(1)) {\bf E} (1 + F(X_1,\ldots,X_n)^2)^k$

after performing all the relevant exchanges. As such, one can then use large deviation estimates on ${F(X_1,\ldots,X_n)}$ to deduce large deviation estimates on ${F(Y_1,\ldots,Y_n)}$.
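One standard way to pass from such moment comparisons to tail bounds is Markov's inequality; the following worked step is only a sketch of that generic mechanism (the constant ${A}$ is a hypothetical moment bound introduced for illustration), and the paper's actual argument may be organised differently.

```latex
% Markov's inequality applied to the k-th moment of 1 + F^2:
\[
  {\bf P}\bigl( |F(Y_1,\ldots,Y_n)| \geq \lambda \bigr)
  \leq \frac{{\bf E} (1 + F(Y_1,\ldots,Y_n)^2)^k}{(1+\lambda^2)^k}
  = (1+o(1)) \frac{{\bf E} (1 + F(X_1,\ldots,X_n)^2)^k}{(1+\lambda^2)^k}.
\]
% If one knows a moment bound of the (assumed) form
%   E (1 + F(X_1,...,X_n)^2)^k <= A^k,
% then any threshold with 1 + \lambda^2 >= e A gives
\[
  {\bf P}\bigl( |F(Y_1,\ldots,Y_n)| \geq \lambda \bigr) \leq (1+o(1))\, e^{-k}.
\]
```

With ${k}$ comparable to a power of ${\log n}$, the right-hand side decays faster than a fixed power of ${n}$ would suggest from bounded test functions alone, which is the sense in which high moments capture the tail.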

In this paper we also take advantage of a simplification, first noted by Erdos, Yau, and Yin, that Four Moment Theorems become somewhat easier to prove if one works with resolvents ${(M_n-z)^{-1}}$ (and the closely related Stieltjes transform ${s(z) := \frac{1}{n} \hbox{tr}( (M_n-z)^{-1} )}$) rather than with individual eigenvalues, as the Taylor expansions of resolvents are very simple (essentially being a Neumann series). The relationship between the Stieltjes transform and the location of individual eigenvalues can be seen by taking advantage of the identity

$\displaystyle \frac{\pi}{2} - \frac{\pi}{n} N((-\infty,E)) = \int_0^\infty \hbox{Re} s(E + i \eta)\ d\eta$

for any energy level ${E \in {\bf R}}$, which can be verified from elementary calculus. (In practice, we would truncate ${\eta}$ near zero and near infinity to avoid some divergences, but this is a minor technicality.) As such, a concentration result for the Stieltjes transform can be used to establish an analogous concentration result for the eigenvalue counting functions ${N((-\infty,E))}$, which in turn can be used to deduce concentration results for individual eigenvalues ${\lambda_i(M_n)}$ by some basic combinatorial manipulations.
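The identity relating the Stieltjes transform to the counting function can be checked numerically for any fixed list of eigenvalues. The sketch below uses a hypothetical five-point spectrum and energy level, writes ${s(E+i\eta) = \frac{1}{n} \sum_j \frac{1}{\lambda_j - E - i\eta}}$, and integrates the real part over ${\eta}$ by the midpoint rule after the substitution ${\eta = t/(1-t)}$, which maps ${(0,\infty)}$ to ${(0,1)}$ and keeps the integrand bounded when ${E}$ is not an eigenvalue.

```python
import math

def stieltjes_real_integral(eigenvalues, energy, steps=20000):
    """Midpoint rule for int_0^infty Re s(E + i eta) d eta, using eta = t / (1 - t)."""
    n = len(eigenvalues)
    shifts = [lam - energy for lam in eigenvalues]
    total = 0.0
    for m in range(steps):
        t = (m + 0.5) / steps
        # Re 1/(a - i eta) = a / (a^2 + eta^2); multiplying by the Jacobian
        # 1/(1-t)^2 of the substitution gives a / (a^2 (1-t)^2 + t^2).
        total += sum(a / (a * a * (1.0 - t) ** 2 + t * t) for a in shifts)
    return total / (steps * n)

# Hypothetical spectrum and energy level (E is not an eigenvalue).
eigenvalues = [-1.3, -0.4, 0.2, 0.9, 1.6]
energy = 0.5

count_below = sum(1 for lam in eigenvalues if lam < energy)  # N((-infty, E))
lhs = math.pi / 2 - math.pi * count_below / len(eigenvalues)
rhs = stieltjes_real_integral(eigenvalues, energy)

print(f"pi/2 - (pi/n) N = {lhs:.6f}, integral = {rhs:.6f}")
```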

We can now turn attention to one of the centerpiece universality results in random matrix theory, namely the Wigner semi-circle law for Wigner matrices. Recall from previous notes that a Wigner Hermitian matrix ensemble is a random matrix ensemble ${M_n = (\xi_{ij})_{1 \leq i,j \leq n}}$ of Hermitian matrices (thus ${\xi_{ij} = \overline{\xi_{ji}}}$; this includes real symmetric matrices as an important special case), in which the upper-triangular entries ${\xi_{ij}}$, ${j>i}$ are iid complex random variables with mean zero and unit variance, and the diagonal entries ${\xi_{ii}}$ are iid real variables, independent of the upper-triangular entries, with bounded mean and variance. Particular special cases of interest include the Gaussian Orthogonal Ensemble (GOE), the symmetric random sign matrices (aka symmetric Bernoulli ensemble), and the Gaussian Unitary Ensemble (GUE).

In previous notes we saw that the operator norm of ${M_n}$ was typically of size ${O(\sqrt{n})}$, so it is natural to work with the normalised matrix ${\frac{1}{\sqrt{n}} M_n}$. Accordingly, given any ${n \times n}$ Hermitian matrix ${M_n}$, we can form the (normalised) empirical spectral distribution (or ESD for short)

$\displaystyle \mu_{\frac{1}{\sqrt{n}} M_n} := \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(M_n) / \sqrt{n}},$

of ${M_n}$, where ${\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)}$ are the (necessarily real) eigenvalues of ${M_n}$, counting multiplicity. The ESD is a probability measure, which can be viewed as a distribution of the normalised eigenvalues of ${M_n}$.

When ${M_n}$ is a random matrix ensemble, then the ESD ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ is now a random measure – i.e. a random variable taking values in the space ${\hbox{Pr}({\mathbb R})}$ of probability measures on the real line. (Thus, the distribution of ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ is a probability measure on probability measures!)

Now we consider the behaviour of the ESD of a sequence of Hermitian matrix ensembles ${M_n}$ as ${n \rightarrow \infty}$. Recall from Notes 0 that for any sequence of random variables in a ${\sigma}$-compact metrisable space, one can define notions of convergence in probability and convergence almost surely. Specialising these definitions to the case of random probability measures on ${{\mathbb R}}$, and to deterministic limits, we see that a sequence of random ESDs ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ converge in probability (resp. converge almost surely) to a deterministic limit ${\mu \in \hbox{Pr}({\mathbb R})}$ (which, confusingly enough, is a deterministic probability measure!) if, for every test function ${\varphi \in C_c({\mathbb R})}$, the quantities ${\int_{\mathbb R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n}}$ converge in probability (resp. converge almost surely) to ${\int_{\mathbb R} \varphi\ d\mu}$.

Remark 1 As usual, convergence almost surely implies convergence in probability, but not vice versa. In the special case of random probability measures, there is an even weaker notion of convergence, namely convergence in expectation, defined as follows. Given a random ESD ${\mu_{\frac{1}{\sqrt{n}} M_n}}$, one can form its expectation ${{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} \in \hbox{Pr}({\mathbb R})}$, defined via duality (the Riesz representation theorem) as

$\displaystyle \int_{\mathbb R} \varphi\ d{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} := {\bf E} \int_{\mathbb R} \varphi\ d \mu_{\frac{1}{\sqrt{n}} M_n};$

this probability measure can be viewed as the law of a random eigenvalue ${\frac{1}{\sqrt{n}}\lambda_i(M_n)}$ drawn from a random matrix ${M_n}$ from the ensemble. We then say that the ESDs converge in expectation to a limit ${\mu \in \hbox{Pr}({\mathbb R})}$ if ${{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}}$ converges in the vague topology to ${\mu}$, thus

$\displaystyle {\bf E} \int_{\mathbb R} \varphi\ d \mu_{\frac{1}{\sqrt{n}} M_n} \rightarrow \int_{\mathbb R} \varphi\ d\mu$

for all ${\varphi \in C_c({\mathbb R})}$.

In general, these notions of convergence are distinct from each other; but in practice, one often finds in random matrix theory that these notions are effectively equivalent to each other, thanks to the concentration of measure phenomenon.

Exercise 1 Let ${M_n}$ be a sequence of ${n \times n}$ Hermitian matrix ensembles, and let ${\mu}$ be a continuous probability measure on ${{\mathbb R}}$.

• Show that ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ converges almost surely to ${\mu}$ if and only if ${\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)}$ converges almost surely to ${\mu(-\infty,\lambda)}$ for all ${\lambda \in {\mathbb R}}$.
• Show that ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ converges in probability to ${\mu}$ if and only if ${\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)}$ converges in probability to ${\mu(-\infty,\lambda)}$ for all ${\lambda \in {\mathbb R}}$.
• Show that ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ converges in expectation to ${\mu}$ if and only if ${{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)}$ converges to ${\mu(-\infty,\lambda)}$ for all ${\lambda \in {\mathbb R}}$.

We can now state the Wigner semi-circular law.

Theorem 1 (Semicircular law) Let ${M_n}$ be the top left ${n \times n}$ minors of an infinite Wigner matrix ${(\xi_{ij})_{i,j \geq 1}}$. Then the ESDs ${\mu_{\frac{1}{\sqrt{n}} M_n}}$ converge almost surely (and hence also in probability and in expectation) to the Wigner semi-circular distribution

$\displaystyle \mu_{sc} := \frac{1}{2\pi} (4-|x|^2)^{1/2}_+\ dx. \ \ \ \ \ (1)$

A numerical example of this theorem in action can be seen at the MathWorld entry for this law.
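One can also probe Theorem 1 numerically without diagonalising anything, by comparing low moments of the ESD with those of the semi-circular distribution, whose second and fourth moments are ${1}$ and ${2}$ respectively (the first two Catalan numbers). The sketch below does this for a single sample from the symmetric Bernoulli ensemble; the size ${n = 60}$ and the seed are arbitrary illustrative choices, and only the standard library is used.

```python
import random

def symmetric_sign_matrix(n, rng):
    """Sample an n x n real symmetric matrix with independent +-1 entries on and above the diagonal."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            m[i][j] = m[j][i] = rng.choice((-1, 1))
    return m

def normalised_moments(m):
    """Return (1/n) tr (M/sqrt(n))^2 and (1/n) tr (M/sqrt(n))^4 for symmetric M."""
    n = len(m)
    square = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
    second = sum(square[i][i] for i in range(n)) / n**2
    # M^2 is symmetric, so tr M^4 is the sum of the squares of its entries.
    fourth = sum(square[i][j] ** 2 for i in range(n) for j in range(n)) / n**3
    return second, fourth

rng = random.Random(0)
second, fourth = normalised_moments(symmetric_sign_matrix(60, rng))
print(f"second moment = {second:.4f} (semicircle: 1), fourth moment = {fourth:.4f} (semicircle: 2)")
```

For sign matrices the second moment is exactly ${1}$, since every entry squares to ${1}$; the fourth moment is only approximately ${2}$, with a discrepancy that shrinks as ${n}$ grows.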

The semi-circular law nicely complements the upper Bai-Yin theorem from Notes 3, which asserts that (in the case when the entries have finite fourth moment, at least), the matrices ${\frac{1}{\sqrt{n}} M_n}$ almost surely have operator norm at most ${2+o(1)}$. Note that the operator norm is the same thing as the largest magnitude of the eigenvalues. Because the semi-circular distribution (1) is supported on the interval ${[-2,2]}$ with positive density on the interior of this interval, Theorem 1 easily supplies the lower Bai-Yin theorem, that the operator norm of ${\frac{1}{\sqrt{n}} M_n}$ is almost surely at least ${2-o(1)}$, and thus (in the finite fourth moment case) the norm is in fact equal to ${2+o(1)}$. Indeed, we have just shown that the semi-circular law provides an alternate proof of the lower Bai-Yin bound (Proposition 11 of Notes 3).

As will hopefully become clearer in the next set of notes, the semi-circular law is the noncommutative (or free probability) analogue of the central limit theorem, with the semi-circular distribution (1) taking on the role of the normal distribution. Of course, there is a striking difference between the two distributions, in that the former is compactly supported while the latter is merely subgaussian. One reason for this is that the concentration of measure phenomenon is more powerful in the case of ESDs of Wigner matrices than it is for averages of iid variables; compare the concentration of measure results in Notes 3 with those in Notes 1.

There are several ways to prove (or at least to heuristically justify) the semi-circular law. In this set of notes we shall focus on the two most popular methods, the moment method and the Stieltjes transform method, together with a third (heuristic) method based on Dyson Brownian motion (Notes 3b). In the next set of notes we shall also study the free probability method, and in the set of notes after that we use the determinantal processes method (although this method is initially only restricted to highly symmetric ensembles, such as GUE).

Now that we have developed the basic probabilistic tools that we will need, we now turn to the main subject of this course, namely the study of random matrices. There are many random matrix models (aka matrix ensembles) of interest – far too many to all be discussed in a single course. We will thus focus on just a few simple models. First of all, we shall restrict attention to square matrices ${M = (\xi_{ij})_{1 \leq i,j \leq n}}$, where ${n}$ is a (large) integer and the ${\xi_{ij}}$ are real or complex random variables. (One can certainly study rectangular matrices as well, but for simplicity we will only look at the square case.) Then, we shall restrict to three main models:

• Iid matrix ensembles, in which the coefficients ${\xi_{ij}}$ are iid random variables with a single distribution ${\xi_{ij} \equiv \xi}$. We will often normalise ${\xi}$ to have mean zero and unit variance. Examples of iid models include the Bernoulli ensemble (aka random sign matrices) in which the ${\xi_{ij}}$ are signed Bernoulli variables, the real gaussian matrix ensemble in which ${\xi_{ij} \equiv N(0,1)_{\bf R}}$, and the complex gaussian matrix ensemble in which ${\xi_{ij} \equiv N(0,1)_{\bf C}}$.
• Symmetric Wigner matrix ensembles, in which the upper triangular coefficients ${\xi_{ij}}$, ${j \geq i}$ are jointly independent and real, but the lower triangular coefficients ${\xi_{ij}}$, ${j < i}$ are constrained to equal their transposes: ${\xi_{ij}=\xi_{ji}}$. Thus ${M}$ by construction is always a real symmetric matrix. Typically, the strictly upper triangular coefficients will be iid, as will the diagonal coefficients, but the two classes of coefficients may have a different distribution. One example here is the symmetric Bernoulli ensemble, in which both the strictly upper triangular and the diagonal entries are signed Bernoulli variables; another important example is the Gaussian Orthogonal Ensemble (GOE), in which the upper triangular entries have distribution ${N(0,1)_{\bf R}}$ and the diagonal entries have distribution ${N(0,2)_{\bf R}}$. (We will explain the reason for this discrepancy later.)
• Hermitian Wigner matrix ensembles, in which the upper triangular coefficients are jointly independent, with the diagonal entries being real and the strictly upper triangular entries complex, and the lower triangular coefficients ${\xi_{ij}}$, ${j < i}$ are constrained to equal their adjoints: ${\xi_{ij} = \overline{\xi_{ji}}}$. Thus ${M}$ by construction is always a Hermitian matrix. This class of ensembles contains the symmetric Wigner ensembles as a subclass. Another very important example is the Gaussian Unitary Ensemble (GUE), in which all off-diagonal entries have distribution ${N(0,1)_{\bf C}}$, but the diagonal entries have distribution ${N(0,1)_{\bf R}}$.

Given a matrix ensemble ${M}$, there are many statistics of ${M}$ that one may wish to consider, e.g. the eigenvalues or singular values of ${M}$, the trace and determinant, etc. In these notes we will focus on a basic statistic, namely the operator norm

$\displaystyle \| M \|_{op} := \sup_{x \in {\bf C}^n: |x|=1} |Mx| \ \ \ \ \ (1)$

of the matrix ${M}$. This is an interesting quantity in its own right, but also serves as a basic upper bound on many other quantities. (For instance, ${\|M\|_{op}}$ is also the largest singular value ${\sigma_1(M)}$ of ${M}$ and thus dominates the other singular values; similarly, all eigenvalues ${\lambda_i(M)}$ of ${M}$ clearly have magnitude at most ${\|M\|_{op}}$.) Because of this, it is particularly important to get good upper tail bounds

$\displaystyle {\bf P}( \|M\|_{op} \geq \lambda ) \leq \ldots$

on this quantity, for various thresholds ${\lambda}$. (Lower tail bounds are also of interest, of course; for instance, they give us confidence that the upper tail bounds are sharp.) Also, as we shall see, the problem of upper bounding ${\|M\|_{op}}$ can be viewed as a non-commutative analogue of upper bounding the quantity ${|S_n|}$ studied in Notes 1. (The analogue of the central limit theorem in Notes 2 is the Wigner semi-circular law, which will be studied in the next set of notes.)

An ${n \times n}$ matrix consisting entirely of ${1}$s has an operator norm of exactly ${n}$, as can for instance be seen from the Cauchy-Schwarz inequality. More generally, any matrix whose entries are all uniformly ${O(1)}$ will have an operator norm of ${O(n)}$ (which can again be seen from Cauchy-Schwarz, or alternatively from Schur’s test, or from a computation of the Frobenius norm). However, this argument does not take advantage of possible cancellations in ${M}$. Indeed, from analogy with concentration of measure, when the entries of the matrix ${M}$ are independent, bounded and have mean zero, we expect the operator norm to be of size ${O(\sqrt{n})}$ rather than ${O(n)}$. We shall see shortly that this intuition is indeed correct. (One can see, though, that the mean zero hypothesis is important; from the triangle inequality we see that if we add the all-ones matrix (for instance) to a random matrix with mean zero, to obtain a random matrix whose coefficients all have mean ${1}$, then at least one of the two random matrices necessarily has operator norm at least ${n/2}$.)
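The contrast between the all-ones matrix and a mean zero random matrix is easy to observe numerically. The sketch below estimates the operator norm of a real symmetric matrix by power iteration (a standard technique, used here purely for illustration; the size ${n = 50}$, the iteration count and the seed are arbitrary). Since the returned value is ${|Mx|}$ for a unit vector ${x}$, it is always a lower bound for ${\|M\|_{op}}$.

```python
import math
import random

def operator_norm(matrix, iterations=200, seed=0):
    """Estimate ||M||_op for a real symmetric M by power iteration x -> Mx / |Mx|."""
    rng = random.Random(seed)
    n = len(matrix)
    x = [rng.random() + 0.1 for _ in range(n)]  # generic positive starting vector
    for _ in range(iterations):
        y = [sum(row[j] * x[j] for j in range(n)) for row in matrix]
        length = math.sqrt(sum(v * v for v in y))
        x = [v / length for v in y]
    # x is now a unit vector, so |Mx| is a lower bound for the operator norm.
    y = [sum(row[j] * x[j] for j in range(n)) for row in matrix]
    return math.sqrt(sum(v * v for v in y))

n = 50
ones = [[1] * n for _ in range(n)]

rng = random.Random(1)
signs = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        signs[i][j] = signs[j][i] = rng.choice((-1, 1))  # symmetric Bernoulli sample

ones_norm = operator_norm(ones)
sign_norm = operator_norm(signs)
print(f"all-ones: {ones_norm:.3f} (n = {n}); random signs: {sign_norm:.3f} (sqrt(n) = {math.sqrt(n):.3f})")
```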

As mentioned before, there is an analogy here with the concentration of measure phenomenon, and many of the tools used in the latter (e.g. the moment method) will also appear here. (Indeed, we will be able to use some of the concentration inequalities from Notes 1 directly to help control ${\|M\|_{op}}$ and related quantities.) Similarly, just as many of the tools from concentration of measure could be adapted to help prove the central limit theorem, several of the tools seen here will be of use in deriving the semi-circular law.

The most advanced knowledge we have on the operator norm is given by the Tracy-Widom law, which not only tells us where the operator norm is concentrated (it turns out, for instance, that for a Wigner matrix (with some additional technical assumptions), it is concentrated in the range ${[2\sqrt{n} - O(n^{-1/6}), 2\sqrt{n} + O(n^{-1/6})]}$), but also what its distribution in that range is. While the methods in this set of notes can eventually be pushed to establish this result, this is far from trivial, and will only be briefly discussed here. (We may return to the Tracy-Widom law later in this course, though.)

Suppose we have a large number of scalar random variables ${X_1,\ldots,X_n}$, which each have bounded size on average (e.g. their mean and variance could be ${O(1)}$). What can one then say about their sum ${S_n := X_1+\ldots+X_n}$? If each individual summand ${X_i}$ varies in an interval of size ${O(1)}$, then their sum of course varies in an interval of size ${O(n)}$. However, a remarkable phenomenon, known as concentration of measure, asserts that assuming a sufficient amount of independence between the component variables ${X_1,\ldots,X_n}$, this sum sharply concentrates in a much narrower range, typically in an interval of size ${O(\sqrt{n})}$. This phenomenon is quantified by a variety of large deviation inequalities that give upper bounds (often exponential in nature) on the probability that such a combined random variable deviates significantly from its mean. The same phenomenon applies not only to linear expressions such as ${S_n = X_1+\ldots+X_n}$, but more generally to nonlinear combinations ${F(X_1,\ldots,X_n)}$ of such variables, provided that the nonlinear function ${F}$ is sufficiently regular (in particular, if it is Lipschitz, either separately in each variable, or jointly in all variables).

The basic intuition here is that it is difficult for a large number of independent variables ${X_1,\ldots,X_n}$ to “work together” to simultaneously pull a sum ${X_1+\ldots+X_n}$ or a more general combination ${F(X_1,\ldots,X_n)}$ too far away from its mean. Independence here is the key; concentration of measure results typically fail if the ${X_i}$ are too highly correlated with each other.

There are many applications of the concentration of measure phenomenon, but we will focus on a specific application which is useful in the random matrix theory topics we will be studying, namely on controlling the behaviour of random ${n}$-dimensional vectors with independent components, and in particular on the distance between such random vectors and a given subspace.

Once one has a sufficient amount of independence, the concentration of measure tends to be sub-gaussian in nature; thus the probability that one is at least ${\lambda}$ standard deviations from the mean tends to drop off like ${C \exp(-c\lambda^2)}$ for some ${C,c > 0}$. In particular, one is ${O( \log^{1/2} n )}$ standard deviations from the mean with high probability, and ${O( \log^{1/2+\epsilon} n)}$ standard deviations from the mean with overwhelming probability. Indeed, concentration of measure is our primary tool for ensuring that various events hold with overwhelming probability (other moment methods can give high probability, but have difficulty ensuring overwhelming probability).
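A small simulation illustrates the sub-gaussian tail in the simplest case, where ${S_n}$ is a sum of ${n}$ independent random signs (so that the standard deviation is ${\sqrt{n}}$). The sketch compares the empirical frequency of the event ${|S_n| \geq \lambda \sqrt{n}}$ with the Hoeffding-type bound ${2\exp(-\lambda^2/2)}$; the values of ${n}$, ${\lambda}$, the number of trials and the seed are arbitrary illustrative choices.

```python
import math
import random

def tail_frequency(n, lam, trials, rng):
    """Empirical frequency of |S_n| >= lam * sqrt(n) for sums of n independent random signs."""
    threshold = lam * math.sqrt(n)
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        if abs(s) >= threshold:
            hits += 1
    return hits / trials

rng = random.Random(0)
n, lam, trials = 400, 2.0, 1000
frequency = tail_frequency(n, lam, trials, rng)
hoeffding = 2.0 * math.exp(-lam * lam / 2.0)

print(f"empirical P(|S_n| >= {lam} sqrt(n)) = {frequency:.4f}, bound = {hoeffding:.4f}")
```

Note that the trivial range of ${S_n}$ is ${[-n,n]}$, so deviations of size ${\lambda \sqrt{n}}$ with ${\lambda}$ of order ${1}$ already probe a window far narrower than what the summands individually permit.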

This is only a brief introduction to the concentration of measure phenomenon. A systematic study of this topic can be found in this book by Ledoux.