
I’ve just finished writing the first draft of my second book coming out of the 2010 blog posts, namely “Topics in random matrix theory”, which was based primarily on my graduate course in the topic, though it also contains material from some additional posts related to random matrices on the blog.  It is available online here.  As usual, comments and corrections are welcome.  There is also a stub page for the book, which at present does not contain much more than the above link.

In this final set of lecture notes for this course, we leave the realm of self-adjoint matrix ensembles, such as Wigner random matrices, and consider instead the simplest examples of non-self-adjoint ensembles, namely the iid matrix ensembles. (I had also hoped to discuss recent progress in eigenvalue spacing distributions of Wigner matrices, but have run out of time. For readers interested in this topic, I can recommend the recent Bourbaki exposé of Alice Guionnet.)

The basic result in this area is

Theorem 1 (Circular law) Let {M_n} be an {n \times n} iid matrix, whose entries {\xi_{ij}}, {1 \leq i,j \leq n} are iid with a fixed (complex) distribution {\xi_{ij} \equiv \xi} of mean zero and variance one. Then the spectral measure {\mu_{\frac{1}{\sqrt{n}}M_n}} converges both in probability and almost surely to the circular law {\mu_{circ} := \frac{1}{\pi} 1_{|x|^2+|y|^2 \leq 1}\ dx dy}, where {x, y} are the real and imaginary coordinates of the complex plane.

This theorem has a long history; it is analogous to the semi-circular law, but the non-Hermitian nature of the matrices makes the spectrum so unstable that key techniques that are used in the semi-circular case, such as truncation and the moment method, no longer work; significant new ideas are required. In the case of random gaussian matrices, this result was established by Mehta (in the complex case) and by Edelman (in the real case), as was sketched out in Notes. In 1984, Girko laid out a general strategy for establishing the result for non-gaussian matrices, which formed the basis of all future work on the subject; however, a key ingredient in the argument, namely a bound on the least singular value of shifts {\frac{1}{\sqrt{n}} M_n - zI}, was not fully justified at the time. A rigorous proof of the circular law was then established by Bai, assuming additional moment and boundedness conditions on the individual entries. These additional conditions were then slowly removed in a sequence of papers by Götze-Tikhomirov, Girko, Pan-Zhou, and Tao-Vu, with the last moment condition being removed in a paper of myself, Van Vu, and Manjunath Krishnapur.

At present, the known methods used to establish the circular law for general ensembles rely very heavily on the joint independence of all the entries. It is a key challenge to see how to weaken this joint independence assumption.
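
Here is a minimal numerical sketch of Theorem 1 (assuming numpy is available; the matrix size and the choice of random sign entries are purely illustrative choices of mine): it samples an iid matrix, computes the spectrum of {\frac{1}{\sqrt{n}}M_n}, and compares the proportion of eigenvalues in the disk of radius {r} with the circular-law prediction {r^2}.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    # iid matrix with mean zero, variance one entries (here: random signs)
    M = rng.choice([-1.0, 1.0], size=(n, n))
    eigs = np.linalg.eigvals(M / np.sqrt(n))

    # Under the circular law, the fraction of eigenvalues in the disk of
    # radius r should be close to r^2 for 0 <= r <= 1.
    for r in (0.5, 0.8, 1.0):
        frac = np.mean(np.abs(eigs) <= r)
        print(f"r = {r}: empirical fraction {frac:.3f}, circular-law prediction {r**2:.3f}")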

Read the rest of this entry »

Now we turn attention to another important spectral statistic, the least singular value {\sigma_n(M)} of an {n \times n} matrix {M} (or, more generally, the least non-trivial singular value {\sigma_p(M)} of an {n \times p} matrix with {p \leq n}). This quantity controls the invertibility of {M}. Indeed, {M} is invertible precisely when {\sigma_n(M)} is non-zero, and the operator norm {\|M^{-1}\|_{op}} of {M^{-1}} is given by {1/\sigma_n(M)}. This quantity is also related to the condition number {\sigma_1(M)/\sigma_n(M) = \|M\|_{op} \|M^{-1}\|_{op}} of {M}, which is of importance in numerical linear algebra. As we shall see in the next set of notes, the least singular value of {M} (and more generally, of the shifts {\frac{1}{\sqrt{n}} M - zI} for complex {z}) will be of importance in rigorously establishing the circular law for iid random matrices {M}, as it plays a key role in computing the Stieltjes transform {\frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M - zI)^{-1}} of such matrices, which as we have already seen is a powerful tool in understanding the spectra of random matrices.

The least singular value

\displaystyle  \sigma_n(M) = \inf_{\|x\|=1} \|Mx\|,

which sits at the “hard edge” of the spectrum, bears a superficial similarity to the operator norm

\displaystyle  \|M\|_{op} = \sigma_1(M) = \sup_{\|x\|=1} \|Mx\|

at the “soft edge” of the spectrum, that was discussed back in Notes 3, so one may at first think that the methods that were effective in controlling the latter, namely the epsilon-net argument and the moment method, would also work to control the former. The epsilon-net method does indeed have some effectiveness when dealing with rectangular matrices (in which the spectrum stays well away from zero), but the situation becomes more delicate for square matrices; it can control some “low entropy” portions of the infimum that arise from “structured” or “compressible” choices of {x}, but is not able to control the “generic” or “incompressible” choices of {x}, for which new arguments will be needed. As for the moment method, this can give the coarse order of magnitude (for instance, for rectangular matrices with {p=yn} for {0 < y < 1}, it gives an upper bound of {(1-\sqrt{y}+o(1))\sqrt{n}} for the least singular value with high probability, thanks to the Marchenko-Pastur law), but again this method begins to break down for square matrices, although one can make some partial headway by considering negative moments such as {\hbox{tr} M^{-2}}, though these are more difficult to compute than positive moments {\hbox{tr} M^k}.

So one needs to supplement these existing methods with additional tools. It turns out that the key issue is to understand the distance between one of the {n} rows {X_1,\ldots,X_n \in {\bf C}^n} of the matrix {M}, and the hyperplane spanned by the other {n-1} rows. The reason for this is as follows. First suppose that {\sigma_n(M)=0}, so that {M} is non-invertible, and there is a linear dependence between the rows {X_1,\ldots,X_n}. Thus, one of the {X_i} will lie in the hyperplane spanned by the other rows, and so one of the distances mentioned above will vanish; in fact, one expects many of the {n} distances to vanish. Conversely, whenever one of these distances vanishes, one has a linear dependence, and so {\sigma_n(M)=0}.

More generally, if the least singular value {\sigma_n(M)} is small, one generically expects many of these {n} distances to be small also, and conversely. Thus, control of the least singular value is morally equivalent to control of the distance between a row {X_i} and the hyperplane spanned by the other rows. This latter quantity is basically the dot product of {X_i} with a unit normal {n_i} of this hyperplane.
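
To make this relation concrete, here is a minimal numerical sketch (assuming numpy; the matrix size is an arbitrary choice) for the Bernoulli ensemble discussed below. Testing the infimum defining {\sigma_n(M)} at the unit normal {n_i} shows that {\sigma_n(M) \leq \hbox{dist}(X_i, V_i)} for every {i}, where {V_i} is the span of the other rows; the code computes these distances by least squares and checks the inequality.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    # Bernoulli ensemble: independent random signs
    M = rng.choice([-1.0, 1.0], size=(n, n))

    sigma_n = np.linalg.svd(M, compute_uv=False)[-1]   # least singular value

    # distance from the i-th row to the span V_i of the other rows, by least squares
    dists = []
    for i in range(n):
        X_i = M[i]
        others = np.delete(M, i, axis=0)               # the other n-1 rows
        coeffs, *_ = np.linalg.lstsq(others.T, X_i, rcond=None)
        dists.append(np.linalg.norm(X_i - others.T @ coeffs))
    dists = np.array(dists)

    # Testing the infimum defining sigma_n(M) at the unit normal n_i of V_i
    # gives sigma_n(M) <= dist(X_i, V_i) for every i.
    print("sigma_n(M)           =", sigma_n)
    print("min_i dist(X_i, V_i) =", dists.min())
    assert sigma_n <= dists.min() + 1e-8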

When working with random matrices with jointly independent coefficients, we have the crucial property that the unit normal {n_i} (which depends on all the rows other than {X_i}) is independent of {X_i}, so even after conditioning {n_i} to be fixed, the entries of {X_i} remain independent. As such, the dot product {X_i \cdot n_i} is a familiar scalar random walk, and can be controlled by a number of tools, most notably Littlewood-Offord theorems and the Berry-Esséen central limit theorem. As it turns out, this type of control works well except in some rare cases in which the normal {n_i} is “compressible” or otherwise highly structured; but epsilon-net arguments can be used to dispose of these cases. (This general strategy was first developed for the technically simpler singularity problem by Komlós, and then extended to the least singular value problem by Rudelson.)

These methods rely quite strongly on the joint independence of all the entries; it remains a challenge to extend them to more general settings. Even for Wigner matrices, the methods run into difficulty because of the non-independence of some of the entries (although it turns out one can understand the least singular value in such cases by rather different methods).

To simplify the exposition, we shall focus primarily on just one specific ensemble of random matrices, the Bernoulli ensemble {M = (\xi_{ij})_{1 \leq i,j \leq n}} of random sign matrices, where {\xi_{ij} = \pm 1} are independent Bernoulli signs. However, the results can extend to more general classes of random matrices, with the main requirement being that the coefficients are jointly independent.

Read the rest of this entry »

Our study of random matrices, to date, has focused on somewhat general ensembles, such as iid random matrices or Wigner random matrices, in which the distribution of the individual entries of the matrices was essentially arbitrary (as long as certain moments, such as the mean and variance, were normalised). In these notes, we now focus on two much more special, and much more symmetric, ensembles:

  • The Gaussian Unitary Ensemble (GUE), which is an ensemble of random {n \times n} Hermitian matrices {M_n} in which the upper-triangular entries are iid with distribution {N(0,1)_{\bf C}}, and the diagonal entries are iid with distribution {N(0,1)_{\bf R}}, and independent of the upper-triangular ones; and
  • The Gaussian random matrix ensemble, which is an ensemble of random {n \times n} (non-Hermitian) matrices {M_n} whose entries are iid with distribution {N(0,1)_{\bf C}}.

The symmetric nature of these ensembles will allow us to compute the spectral distribution by exact algebraic means, revealing a surprising connection with orthogonal polynomials and with determinantal processes. This will, for instance, recover the semi-circular law for GUE, but will also reveal fine spacing information, such as the distribution of the gap between adjacent eigenvalues, which is largely out of reach of tools such as the Stieltjes transform method and the moment method (although the moment method, with some effort, is able to control the extreme edges of the spectrum).
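
As a small numerical illustration of the fine spacing information mentioned above (a sketch assuming numpy; the matrix size and the bulk window are arbitrary choices of mine), one can sample a GUE matrix as defined above, look at the gaps between adjacent eigenvalues of {\frac{1}{\sqrt{n}}M_n} in the bulk, and observe the eigenvalue repulsion: small normalised gaps are far rarer than they would be for independent points, for which roughly ten percent of gaps fall below {0.1} of the mean. The exact limiting gap distribution is not computed here.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    # GUE: off-diagonal entries N(0,1)_C, diagonal entries N(0,1)_R
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    M = (G + G.conj().T) / 2
    eigs = np.linalg.eigvalsh(M) / np.sqrt(n)          # sorted real eigenvalues

    # gaps between adjacent eigenvalues in the bulk, normalised to have mean one
    bulk = eigs[np.abs(eigs) < 0.5]
    gaps = np.diff(bulk)
    gaps = gaps / gaps.mean()

    # eigenvalue repulsion: small gaps are heavily suppressed compared with
    # independent (Poisson) points, for which about 10% of gaps fall below 0.1
    print("fraction of normalised gaps below 0.1:", np.mean(gaps < 0.1))
    print("fraction of normalised gaps below 0.5:", np.mean(gaps < 0.5))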

Similarly, we will see for the first time the circular law for eigenvalues of non-Hermitian matrices.

There are a number of other highly symmetric ensembles which can also be treated by the same methods, most notably the Gaussian Orthogonal Ensemble (GOE) and the Gaussian Symplectic Ensemble (GSE). However, for simplicity we shall focus just on the above two ensembles. For a systematic treatment of these ensembles, see the text by Deift.

Read the rest of this entry »

In the foundations of modern probability, as laid out by Kolmogorov, the basic objects of study are constructed in the following order:

  1. Firstly, one selects a sample space {\Omega}, whose elements {\omega} represent all the possible states that one’s stochastic system could be in.
  2. Then, one selects a {\sigma}-algebra {{\mathcal B}} of events {E} (modeled by subsets of {\Omega}), and assigns each of these events a probability {{\bf P}(E) \in [0,1]} in a countably additive manner, so that the entire sample space has probability {1}.
  3. Finally, one builds (commutative) algebras of random variables {X} (such as complex-valued random variables, modeled by measurable functions from {\Omega} to {{\bf C}}), and (assuming suitable integrability or moment conditions) one can assign expectations {\mathop{\bf E} X} to each such random variable.

In measure theory, the underlying measure space {\Omega} plays a prominent foundational role, with the measurable sets and measurable functions (the analogues of the events and the random variables) always being viewed as somehow being attached to that space. In probability theory, in contrast, it is the events and their probabilities that are viewed as being fundamental, with the sample space {\Omega} being abstracted away as much as possible, and with the random variables and expectations being viewed as derived concepts. See Notes 0 for further discussion of this philosophy.

However, it is possible to take the abstraction process one step further, and view the algebra of random variables and their expectations as being the foundational concept, ignoring the original sample space, the algebra of events, and the probability measure altogether.

There are two reasons for wanting to shed (or abstract away) these previously foundational structures. Firstly, it allows one to more easily take certain types of limits, such as the large {n} limit {n \rightarrow \infty} when considering {n \times n} random matrices, because quantities built from the algebra of random variables and their expectations, such as the normalised moments of random matrices, tend to be quite stable in the large {n} limit (as we have seen in previous notes), even as the sample space and event space varies with {n}. (This theme of using abstraction to facilitate the taking of the large {n} limit also shows up in the application of ergodic theory to combinatorics via the correspondence principle; see this previous blog post for further discussion.)

Secondly, this abstract formalism allows one to generalise the classical, commutative theory of probability to the more general theory of non-commutative probability theory, which does not have a classical underlying sample space or event space, but is instead built upon a (possibly) non-commutative algebra of random variables (or “observables”) and their expectations (or “traces”). This more general formalism not only encompasses classical probability, but also spectral theory (with matrices or operators taking the role of random variables, and the trace taking the role of expectation), random matrix theory (which can be viewed as a natural blend of classical probability and spectral theory), and quantum mechanics (with physical observables taking the role of random variables, and their expected value on a given quantum state being the expectation). It is also part of a more general “non-commutative way of thinking” (of which non-commutative geometry is the most prominent example), in which a space is understood primarily in terms of the ring or algebra of functions (or function-like objects, such as sections of bundles) placed on top of that space, and then the space itself is largely abstracted away in order to allow the algebraic structures to become less commutative. In short, the idea is to make algebra the foundation of the theory, as opposed to other possible choices of foundations such as sets, measures, categories, etc..

[Note that this foundational preference is to some extent a metamathematical one rather than a mathematical one; in many cases it is possible to rewrite the theory in a mathematically equivalent form so that some other mathematical structure becomes designated as the foundational one, much as probability theory can be equivalently formulated as the measure theory of probability measures. However, this does not negate the fact that a different choice of foundations can lead to a different way of thinking about the subject, and thus to ask a different set of questions and to discover a different set of proofs and solutions. Thus it is often of value to understand multiple foundational perspectives at once, to get a truly stereoscopic view of the subject.]

It turns out that non-commutative probability can be modeled using operator algebras such as {C^*}-algebras, von Neumann algebras, or algebras of bounded operators on a Hilbert space, with the latter being accomplished via the Gelfand-Naimark-Segal construction. We will discuss some of these models here, but just as probability theory seeks to abstract away its measure-theoretic models, the philosophy of non-commutative probability is also to downplay these operator algebraic models once some foundational issues are settled.

When one generalises the set of structures in one’s theory, for instance from the commutative setting to the non-commutative setting, the notion of what it means for a structure to be “universal”, “free”, or “independent” can change. The most familiar example of this comes from group theory. If one restricts attention to the category of abelian groups, then the “freest” object one can generate from two generators {e,f} is the free abelian group of commutative words {e^n f^m} with {n,m \in {\bf Z}}, which is isomorphic to the group {{\bf Z}^2}. If however one generalises to the non-commutative setting of arbitrary groups, then the “freest” object that can now be generated from two generators {e,f} is the free group {{\Bbb F}_2} of non-commutative words {e^{n_1} f^{m_1} \ldots e^{n_k} f^{m_k}} with {n_1,m_1,\ldots,n_k,m_k \in {\bf Z}}, which is a significantly larger extension of the free abelian group {{\bf Z}^2}.

Similarly, when generalising classical probability theory to non-commutative probability theory, the notion of what it means for two or more random variables to be independent changes. In the classical (commutative) setting, two (bounded, real-valued) random variables {X, Y} are independent if one has

\displaystyle  \mathop{\bf E} f(X) g(Y) = 0

whenever {f, g: {\bf R} \rightarrow {\bf R}} are well-behaved functions (such as polynomials) such that all of {\mathop{\bf E} f(X)}, {\mathop{\bf E} g(Y)} vanish. In the non-commutative setting, one can generalise the above definition to two commuting bounded self-adjoint variables; this concept is useful for instance in quantum probability, which is an abstraction of the theory of observables in quantum mechanics. But for two (bounded, self-adjoint) non-commutative random variables {X, Y}, the notion of classical independence no longer applies. As a substitute, one can instead consider the notion of being freely independent (or free for short), which means that

\displaystyle  \mathop{\bf E} f_1(X) g_1(Y) \ldots f_k(X) g_k(Y) = 0

whenever {f_1,g_1,\ldots,f_k,g_k: {\bf R} \rightarrow {\bf R}} are well-behaved functions such that all of {\mathop{\bf E} f_1(X), \mathop{\bf E} g_1(Y), \ldots, \mathop{\bf E} f_k(X), \mathop{\bf E} g_k(Y)} vanish.

The concept of free independence was introduced by Voiculescu, and its study is now known as the subject of free probability. We will not attempt a systematic survey of this subject here; for this, we refer the reader to the surveys of Speicher and of Biane. Instead, we shall just discuss a small number of topics in this area to give the flavour of the subject.

The significance of free probability to random matrix theory lies in the fundamental observation that random matrices which are independent in the classical sense, also tend to be independent in the free probability sense, in the large {n} limit {n \rightarrow \infty}. (This is only possible because of the highly non-commutative nature of these matrices; as we shall see, it is not possible for non-trivial commuting independent random variables to be freely independent.) Because of this, many tedious computations in random matrix theory, particularly those of an algebraic or enumerative combinatorial nature, can be done more quickly and systematically by using the framework of free probability, which by design is optimised for algebraic tasks rather than analytical ones.

Much as free groups are in some sense “maximally non-commutative”, freely independent random variables are about as far from being commuting as possible. For instance, if {X, Y} are freely independent and of expectation zero, then {\mathop{\bf E} XYXY} vanishes, but {\mathop{\bf E} XXYY} instead factors as {(\mathop{\bf E} X^2) (\mathop{\bf E} Y^2)}. As a consequence, the behaviour of freely independent random variables can be quite different from the behaviour of their classically independent commuting counterparts. Nevertheless there is a remarkably strong analogy between the two types of independence, in that results which are true in the classically independent case often have an interesting analogue in the freely independent setting. For instance, the central limit theorem (Notes 2) for averages of classically independent random variables, which roughly speaking asserts that such averages become gaussian in the large {n} limit, has an analogue for averages of freely independent variables, the free central limit theorem, which roughly speaking asserts that such averages become semicircular in the large {n} limit. One can then use this theorem to provide yet another proof of Wigner’s semicircle law (Notes 4).
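
One can see this contrast numerically by using two classically independent GUE matrices (normalised to have spectrum of size {O(1)}) as approximate models of freely independent semicircular elements, with the normalised trace {\frac{1}{n} \hbox{tr}} playing the role of the expectation. The following is a minimal sketch assuming numpy; the helper function and the matrix size are illustrative choices of mine.

    import numpy as np

    def gue(n, rng):
        # GUE normalised to have spectrum of size O(1): a stand-in for a
        # semicircular element of a non-commutative probability space
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / (2 * np.sqrt(n))

    rng = np.random.default_rng(3)
    n = 800
    X, Y = gue(n, rng), gue(n, rng)        # classically independent samples
    tau = lambda A: np.trace(A).real / n   # normalised trace, playing the role of E

    print("tau(XYXY)         =", tau(X @ Y @ X @ Y))   # close to 0
    print("tau(XXYY)         =", tau(X @ X @ Y @ Y))   # close to tau(X^2) tau(Y^2)
    print("tau(X^2) tau(Y^2) =", tau(X @ X) * tau(Y @ Y))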

Another important (and closely related) analogy is that while the distribution of sums of independent commutative random variables can be quickly computed via the characteristic function (i.e. the Fourier transform of the distribution), the distribution of sums of freely independent non-commutative random variables can be quickly computed using the Stieltjes transform instead (or with closely related objects, such as the {R}-transform of Voiculescu). This is strongly reminiscent of the appearance of the Stieltjes transform in random matrix theory, and indeed we will see many parallels between the use of the Stieltjes transform here and in Notes 4.
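
As a small numerical illustration of this point (a sketch assuming numpy; the matrix size is arbitrary, and the use of two independent GUE matrices means the answer can also be checked directly, which makes this a sanity check rather than a new computation): the {R}-transform of a semicircular element of variance {t} is {tz}, and {R}-transforms add under free convolution, so the sum of two freely independent variance-one semicircular elements is semicircular with variance {2}, with even moments {C_k 2^k} (where {C_k = 1, 2, 5, \ldots} are the Catalan numbers).

    import numpy as np

    def gue(n, rng):
        # normalised GUE, as in the previous sketch
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / (2 * np.sqrt(n))

    rng = np.random.default_rng(4)
    n = 1000
    A, B = gue(n, rng), gue(n, rng)
    eigs = np.linalg.eigvalsh(A + B)

    # free convolution of two variance-one semicircular laws: a variance-two
    # semicircular law, whose even moments are C_k 2^k with C_k = 1, 2, 5, ...
    for k, C in zip((1, 2, 3), (1, 2, 5)):
        print(f"empirical m_{2*k} = {np.mean(eigs**(2*k)):.3f},"
              f"  free-probability prediction = {C * 2**k}")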

As mentioned earlier, free probability is an excellent tool for computing various expressions of interest in random matrix theory, such as asymptotic values of normalised moments in the large {n} limit {n \rightarrow \infty}. Nevertheless, as it only covers the asymptotic regime in which {n} is sent to infinity while holding all other parameters fixed, there are some aspects of random matrix theory to which the tools of free probability are not sufficient by themselves to resolve (although it can be possible to combine free probability theory with other tools to then answer these questions). For instance, questions regarding the rate of convergence of normalised moments as {n \rightarrow \infty} are not directly answered by free probability, though if free probability is combined with tools such as concentration of measure (Notes 1) then such rate information can often be recovered. For similar reasons, free probability lets one understand the behaviour of {k^{th}} moments as {n \rightarrow \infty} for fixed {k}, but has more difficulty dealing with the situation in which {k} is allowed to grow slowly in {n} (e.g. {k = O(\log n)}). Because of this, free probability methods are effective at controlling the bulk of the spectrum of a random matrix, but have more difficulty with the edges of that spectrum (as well as with related concepts such as the operator norm, Notes 3) as well as with fine-scale structure of the spectrum. Finally, free probability methods are most effective when dealing with matrices that are Hermitian with bounded operator norm, largely because the spectral theory of bounded self-adjoint operators in the infinite-dimensional setting of the large {n} limit is non-pathological. (This is ultimately due to the stable nature of eigenvalues in the self-adjoint setting; see this previous blog post for discussion.) For non-self-adjoint operators, free probability needs to be augmented with additional tools, most notably by bounds on least singular values, in order to recover the required stability for the various spectral data of random matrices to behave continuously with respect to the large {n} limit. We will discuss this latter point in a later set of notes.

Read the rest of this entry »

We can now turn attention to one of the centerpiece universality results in random matrix theory, namely the Wigner semi-circle law for Wigner matrices. Recall from previous notes that a Wigner Hermitian matrix ensemble is a random matrix ensemble {M_n = (\xi_{ij})_{1 \leq i,j \leq n}} of Hermitian matrices (thus {\xi_{ij} = \overline{\xi_{ji}}}; this includes real symmetric matrices as an important special case), in which the upper-triangular entries {\xi_{ij}}, {i>j} are iid complex random variables with mean zero and unit variance, and the diagonal entries {\xi_{ii}} are iid real variables, independent of the upper-triangular entries, with bounded mean and variance. Particular special cases of interest include the Gaussian Orthogonal Ensemble (GOE), the symmetric random sign matrices (aka symmetric Bernoulli ensemble), and the Gaussian Unitary Ensemble (GUE).

In previous notes we saw that the operator norm of {M_n} was typically of size {O(\sqrt{n})}, so it is natural to work with the normalised matrix {\frac{1}{\sqrt{n}} M_n}. Accordingly, given any {n \times n} Hermitian matrix {M_n}, we can form the (normalised) empirical spectral distribution (or ESD for short)

\displaystyle  \mu_{\frac{1}{\sqrt{n}} M_n} := \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(M_n) / \sqrt{n}},

of {M_n}, where {\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)} are the (necessarily real) eigenvalues of {M_n}, counting multiplicity. The ESD is a probability measure, which can be viewed as a distribution of the normalised eigenvalues of {M_n}.

When {M_n} is a random matrix ensemble, then the ESD {\mu_{\frac{1}{\sqrt{n}} M_n}} is now a random measure – i.e. a random variable taking values in the space {\hbox{Pr}({\mathbb R})} of probability measures on the real line. (Thus, the distribution of {\mu_{\frac{1}{\sqrt{n}} M_n}} is a probability measure on probability measures!)

Now we consider the behaviour of the ESD of a sequence of Hermitian matrix ensembles {M_n} as {n \rightarrow \infty}. Recall from Notes 0 that for any sequence of random variables in a {\sigma}-compact metrisable space, one can define notions of convergence in probability and convergence almost surely. Specialising these definitions to the case of random probability measures on {{\mathbb R}}, and to deterministic limits, we see that a sequence of random ESDs {\mu_{\frac{1}{\sqrt{n}} M_n}} converge in probability (resp. converge almost surely) to a deterministic limit {\mu \in \hbox{Pr}({\mathbb R})} (which, confusingly enough, is a deterministic probability measure!) if, for every test function {\varphi \in C_c({\mathbb R})}, the quantities {\int_{\mathbb R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n}} converge in probability (resp. converge almost surely) to {\int_{\mathbb R} \varphi\ d\mu}.

Remark 1 As usual, convergence almost surely implies convergence in probability, but not vice versa. In the special case of random probability measures, there is an even weaker notion of convergence, namely convergence in expectation, defined as follows. Given a random ESD {\mu_{\frac{1}{\sqrt{n}} M_n}}, one can form its expectation {{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} \in \hbox{Pr}({\mathbb R})}, defined via duality (the Riesz representation theorem) as

\displaystyle  \int_{\mathbb R} \varphi\ d{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} := {\bf E} \int_{\mathbb R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n};

this probability measure can be viewed as the law of a random eigenvalue {\frac{1}{\sqrt{n}}\lambda_i(M_n)} drawn from a random matrix {M_n} from the ensemble. We then say that the ESDs converge in expectation to a limit {\mu \in \hbox{Pr}({\mathbb R})} if {{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}} converges in the vague topology to {\mu}, thus

\displaystyle  {\bf E} \int_{\mathbb R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n} \rightarrow \int_{\mathbb R} \varphi\ d\mu

for all {\varphi \in C_c({\mathbb R})}.

In general, these notions of convergence are distinct from each other; but in practice, one often finds in random matrix theory that these notions are effectively equivalent to each other, thanks to the concentration of measure phenomenon.

Exercise 1 Let {M_n} be a sequence of {n \times n} Hermitian matrix ensembles, and let {\mu} be a continuous probability measure on {{\mathbb R}}.

  • Show that {\mu_{\frac{1}{\sqrt{n}} M_n}} converges almost surely to {\mu} if and only if {\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)} converges almost surely to {\mu(-\infty,\lambda)} for all {\lambda \in {\mathbb R}}.
  • Show that {\mu_{\frac{1}{\sqrt{n}} M_n}} converges in probability to {\mu} if and only if {\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)} converges in probability to {\mu(-\infty,\lambda)} for all {\lambda \in {\mathbb R}}.
  • Show that {\mu_{\frac{1}{\sqrt{n}} M_n}} converges in expectation to {\mu} if and only if {\mathop{\mathbb E} \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)} converges to {\mu(-\infty,\lambda)} for all {\lambda \in {\mathbb R}}.

We can now state the Wigner semi-circular law.

Theorem 1 (Semicircular law) Let {M_n} be the top left {n \times n} minors of an infinite Wigner matrix {(\xi_{ij})_{i,j \geq 1}}. Then the ESDs {\mu_{\frac{1}{\sqrt{n}} M_n}} converge almost surely (and hence also in probability and in expectation) to the Wigner semi-circular distribution

\displaystyle  \mu_{sc} := \frac{1}{2\pi} (4-|x|^2)^{1/2}_+\ dx. \ \ \ \ \ (1)

A numerical example of this theorem in action can be seen at the MathWorld entry for this law.
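
In the same spirit, here is a minimal numerical sketch (assuming numpy; the matrix size and the bins are arbitrary choices of mine) that samples a symmetric Bernoulli Wigner matrix and compares its ESD with the density (1).

    import numpy as np

    rng = np.random.default_rng(5)
    n = 2000
    # symmetric Bernoulli (random sign) Wigner matrix
    A = rng.choice([-1.0, 1.0], size=(n, n))
    M = np.triu(A) + np.triu(A, 1).T
    eigs = np.linalg.eigvalsh(M) / np.sqrt(n)

    # compare the ESD with the semicircular density (1) on a few bins
    edges = np.linspace(-2, 2, 9)
    hist, _ = np.histogram(eigs, bins=edges, density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    rho = np.sqrt(np.clip(4 - centres**2, 0, None)) / (2 * np.pi)
    for c, h, r in zip(centres, hist, rho):
        print(f"x = {c:+.2f}:  ESD {h:.3f}   semicircle {r:.3f}")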

The semi-circular law nicely complements the upper Bai-Yin theorem from Notes 3, which asserts that (in the case when the entries have finite fourth moment, at least), the matrices {\frac{1}{\sqrt{n}} M_n} almost surely have operator norm at most {2+o(1)}. Note that the operator norm is the same thing as the largest magnitude of the eigenvalues. Because the semi-circular distribution (1) is supported on the interval {[-2,2]} with positive density on the interior of this interval, Theorem 1 easily supplies the lower Bai-Yin theorem, that the operator norm of {\frac{1}{\sqrt{n}} M_n} is almost surely at least {2-o(1)}, and thus (in the finite fourth moment case) the norm is in fact equal to {2+o(1)}. Indeed, we have just shown that the semi-circular law provides an alternate proof of the lower Bai-Yin bound (Proposition 11 of Notes 3).

As will hopefully become clearer in the next set of notes, the semi-circular law is the noncommutative (or free probability) analogue of the central limit theorem, with the semi-circular distribution (1) taking on the role of the normal distribution. Of course, there is a striking difference between the two distributions, in that the former is compactly supported while the latter is merely subgaussian. One reason for this is that the concentration of measure phenomenon is more powerful in the case of ESDs of Wigner matrices than it is for averages of iid variables; compare the concentration of measure results in Notes 3 with those in Notes 1.

There are several ways to prove (or at least to heuristically justify) the semi-circular law. In this set of notes we shall focus on the two most popular methods, the moment method and the Stieltjes transform method, together with a third (heuristic) method based on Dyson Brownian motion (Notes 3b). In the next set of notes we shall also study the free probability method, and in the set of notes after that we use the determinantal processes method (although this method is initially only restricted to highly symmetric ensembles, such as GUE).
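
To give a concrete feel for the Stieltjes transform method before developing it, here is a small numerical sketch (assuming numpy; the test ensemble, the matrix size, and the test points are arbitrary choices of mine) comparing the empirical Stieltjes transform {s_n(z) = \frac{1}{n} \hbox{tr}(\frac{1}{\sqrt{n}} M_n - z)^{-1}} with the Stieltjes transform of the semi-circular law, namely the root of {s^2 + zs + 1 = 0} with positive imaginary part in the upper half-plane.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 1500
    # GUE as a convenient test Wigner ensemble
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    M = (G + G.conj().T) / 2
    eigs = np.linalg.eigvalsh(M) / np.sqrt(n)

    def s_semicircle(z):
        # Stieltjes transform of the semicircular law: the root of s^2 + z s + 1 = 0
        # with positive imaginary part when Im z > 0
        w = np.sqrt(z * z - 4)
        if w.imag < 0:
            w = -w
        return (-z + w) / 2

    for z in (2j, 1 + 1j, -0.5 + 0.3j):
        s_n = np.mean(1 / (eigs - z))      # (1/n) tr (M/sqrt(n) - z)^{-1}
        print("z =", z, "  s_n(z) =", np.round(s_n, 4),
              "  s_sc(z) =", np.round(s_semicircle(z), 4))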

Read the rest of this entry »

One theme in this course will be the central role played by the gaussian random variables {X \equiv N(\mu,\sigma^2)}. Gaussians have an incredibly rich algebraic structure, and many results about general random variables can be established by first using this structure to verify the result for gaussians, and then using universality techniques (such as the Lindeberg exchange strategy) to extend the results to more general variables.

One way to exploit this algebraic structure is to continuously deform the variance {t := \sigma^2} from an initial variance of zero (so that the random variable is deterministic) to some final level {T}. We would like to use this to give a continuous family {t \mapsto X_t} of random variables {X_t \equiv N(\mu, t)} as {t} (viewed as a “time” parameter) runs from {0} to {T}.

At present, we have not completely specified what {X_t} should be, because we have only described the individual distribution {X_t \equiv N(\mu,t)} of each {X_t}, and not the joint distribution. However, there is a very natural way to specify a joint distribution of this type, known as Brownian motion. In these notes we lay the necessary probability theory foundations to set up this motion, and indicate its connection with the heat equation, the central limit theorem, and the Ornstein-Uhlenbeck process. This is the beginning of stochastic calculus, which we will not develop fully here.

We will begin with one-dimensional Brownian motion, but it is a simple matter to extend the process to higher dimensions. In particular, we can define Brownian motion on vector spaces of matrices, such as the space of {n \times n} Hermitian matrices. This process is equivariant with respect to conjugation by unitary matrices, and so we can quotient out by this conjugation and obtain a new process on the quotient space, or in other words on the spectrum of {n \times n} Hermitian matrices. This process is called Dyson Brownian motion, and turns out to have a simple description in terms of ordinary Brownian motion; it will play a key role in several of the subsequent notes in this course.
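
As a rough illustration of this description (a discretised sketch assuming numpy; the dimension, step size, and number of steps are arbitrary choices of mine), one can run Brownian motion on the space of Hermitian matrices by adding independent gaussian Hermitian increments and track the eigenvalues, which then approximate Dyson Brownian motion; the repulsion between the eigenvalue paths is already visible in the smallest gap.

    import numpy as np

    rng = np.random.default_rng(7)
    n, steps, dt = 20, 200, 1e-3

    def hermitian_increment(n, rng):
        # Hermitian matrix of independent gaussian entries, driving the matrix
        # Brownian motion (off-diagonal N(0,1)_C, diagonal N(0,1)_R)
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / 2

    M = np.zeros((n, n), dtype=complex)
    eigenvalue_paths = []
    for _ in range(steps):
        M = M + np.sqrt(dt) * hermitian_increment(n, rng)
        eigenvalue_paths.append(np.linalg.eigvalsh(M))   # discretised Dyson Brownian motion

    eigenvalue_paths = np.array(eigenvalue_paths)
    # the eigenvalue paths repel each other: the smallest gap stays positive
    print("smallest gap at the final time:", np.diff(eigenvalue_paths[-1]).min())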

Read the rest of this entry »

Let {A} be a Hermitian {n \times n} matrix. By the spectral theorem for Hermitian matrices (which, for sake of completeness, we prove below), one can diagonalise {A} using a sequence

\displaystyle  \lambda_1(A) \geq \ldots \geq \lambda_n(A)

of {n} real eigenvalues, together with an orthonormal basis of eigenvectors {u_1(A),\ldots,u_n(A) \in {\mathbb C}^n}. (The eigenvalues are uniquely determined by {A}, but the eigenvectors have a little ambiguity to them, particularly if there are repeated eigenvalues; for instance, one could multiply each eigenvector by a complex phase {e^{i\theta}}. In these notes we are arranging eigenvalues in descending order; of course, one can also arrange eigenvalues in increasing order, which causes some slight notational changes in the results below.) The set {\{\lambda_1(A),\ldots,\lambda_n(A)\}} is known as the spectrum of {A}.

A basic question in linear algebra asks the extent to which the eigenvalues {\lambda_1(A),\ldots,\lambda_n(A)} and {\lambda_1(B),\ldots,\lambda_n(B)} of two Hermitian matrices {A, B} constrain the eigenvalues {\lambda_1(A+B),\ldots,\lambda_n(A+B)} of the sum. For instance, the linearity of trace

\displaystyle  \hbox{tr}(A+B) = \hbox{tr}(A)+\hbox{tr}(B),

when expressed in terms of eigenvalues, gives the trace constraint

\displaystyle  \lambda_1(A+B)+\ldots+\lambda_n(A+B) = \lambda_1(A)+\ldots+\lambda_n(A) \ \ \ \ \ (1)

\displaystyle  +\lambda_1(B)+\ldots+\lambda_n(B);

the identity

\displaystyle  \lambda_1(A) = \sup_{|v|=1} v^* Av \ \ \ \ \ (2)

(together with the counterparts for {B} and {A+B}) gives the inequality

\displaystyle  \lambda_1(A+B) \leq \lambda_1(A) + \lambda_1(B); \ \ \ \ \ (3)

and so forth.

The complete answer to this problem is a fascinating one, requiring a strangely recursive description (once known as Horn’s conjecture, which is now solved), and connected to a large number of other fields of mathematics, such as geometric invariant theory, intersection theory, and the combinatorics of a certain gadget known as a “honeycomb”. See for instance my survey with Allen Knutson on this topic some years ago.

In typical applications to random matrices, one of the matrices (say, {B}) is “small” in some sense, so that {A+B} is a perturbation of {A}. In this case, one does not need the full strength of the above theory, and can instead rely on a simple aspect of it pointed out by Helmke and Rosenthal and by Totaro, which generates several of the eigenvalue inequalities relating {A}, {B}, and {A+B}, of which (1) and (3) are examples. (Actually, this method eventually generates all of the eigenvalue inequalities, but this is a non-trivial fact to prove.) These eigenvalue inequalities can mostly be deduced from a number of minimax characterisations of eigenvalues (of which (2) is a typical example), together with some basic facts about intersections of subspaces. Examples include the Weyl inequalities

\displaystyle  \lambda_{i+j-1}(A+B) \leq \lambda_i(A) + \lambda_j(B), \ \ \ \ \ (4)

valid whenever {i,j \geq 1} and {i+j-1 \leq n}, and the Ky Fan inequality

\displaystyle  \lambda_1(A+B)+\ldots+\lambda_k(A+B) \leq

\displaystyle  \lambda_1(A)+\ldots+\lambda_k(A) + \lambda_1(B)+\ldots+\lambda_k(B). \ \ \ \ \ (5)

One consequence of these inequalities is that the spectrum of a Hermitian matrix is stable with respect to small perturbations.
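
For a quick numerical check of the Weyl inequalities (4) and of this stability (a sketch assuming numpy; the matrix size and the size of the perturbation are arbitrary choices of mine): applying (4) with {j=1} to the pairs {(A,B)} and {(A+B,-B)} gives {|\lambda_i(A+B)-\lambda_i(A)| \leq \|B\|_{op}} for all {i}, and the code below verifies both (4) and this consequence for a random Hermitian {A} and a small random Hermitian {B}.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 50

    def random_hermitian(n, rng):
        G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        return (G + G.conj().T) / 2

    A = random_hermitian(n, rng)
    B = 0.1 * random_hermitian(n, rng)                 # a "small" perturbation

    def eig_desc(H):
        return np.linalg.eigvalsh(H)[::-1]             # descending order, as in the text

    lam_A, lam_B, lam_AB = eig_desc(A), eig_desc(B), eig_desc(A + B)

    # Weyl inequalities (4): lambda_{i+j-1}(A+B) <= lambda_i(A) + lambda_j(B)
    for i in range(n):
        for j in range(n - i):
            assert lam_AB[i + j] <= lam_A[i] + lam_B[j] + 1e-9

    # consequence: each eigenvalue moves by at most the operator norm of B
    print("max_i |lambda_i(A+B) - lambda_i(A)| =", np.abs(lam_AB - lam_A).max())
    print("||B||_op                            =", np.abs(lam_B).max())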

We will also establish some closely related inequalities concerning the relationships between the eigenvalues of a matrix, and the eigenvalues of its minors.

Many of the inequalities here have analogues for the singular values of non-Hermitian matrices (which is consistent with the discussion near Exercise 16 of Notes 3). However, the situation is markedly different when dealing with eigenvalues of non-Hermitian matrices; here, the spectrum can be far more unstable, if pseudospectrum is present. Because of this, the theory of the eigenvalues of a random non-Hermitian matrix requires an additional ingredient, namely upper bounds on the prevalence of pseudospectrum, which after recentering the matrix is basically equivalent to establishing lower bounds on least singular values. We will discuss this point in more detail in later notes.

We will work primarily here with Hermitian matrices, which can be viewed as self-adjoint transformations on complex vector spaces such as {{\mathbb C}^n}. One can of course specialise the discussion to real symmetric matrices, in which case one can restrict these complex vector spaces to their real counterparts {{\mathbb R}^n}. The specialisation of the complex theory below to the real case is straightforward and is left to the interested reader.

Read the rest of this entry »

Now that we have developed the basic probabilistic tools that we will need, we now turn to the main subject of this course, namely the study of random matrices. There are many random matrix models (aka matrix ensembles) of interest – far too many to all be discussed in a single course. We will thus focus on just a few simple models. First of all, we shall restrict attention to square matrices {M = (\xi_{ij})_{1 \leq i,j \leq n}}, where {n} is a (large) integer and the {\xi_{ij}} are real or complex random variables. (One can certainly study rectangular matrices as well, but for simplicity we will only look at the square case.) Then, we shall restrict to three main models:

  • Iid matrix ensembles, in which the coefficients {\xi_{ij}} are iid random variables with a single distribution {\xi_{ij} \equiv \xi}. We will often normalise {\xi} to have mean zero and unit variance. Examples of iid models include the Bernoulli ensemble (aka random sign matrices) in which the {\xi_{ij}} are signed Bernoulli variables, the real gaussian matrix ensemble in which {\xi_{ij} \equiv N(0,1)_{\bf R}}, and the complex gaussian matrix ensemble in which {\xi_{ij} \equiv N(0,1)_{\bf C}}.
  • Symmetric Wigner matrix ensembles, in which the upper triangular coefficients {\xi_{ij}}, {j \geq i} are jointly independent and real, but the lower triangular coefficients {\xi_{ij}}, {j<i} are constrained to equal their transposes: {\xi_{ij}=\xi_{ji}}. Thus {M} by construction is always a real symmetric matrix. Typically, the strictly upper triangular coefficients will be iid, as will the diagonal coefficients, but the two classes of coefficients may have a different distribution. One example here is the symmetric Bernoulli ensemble, in which both the strictly upper triangular and the diagonal entries are signed Bernoulli variables; another important example is the Gaussian Orthogonal Ensemble (GOE), in which the upper triangular entries have distribution {N(0,1)_{\bf R}} and the diagonal entries have distribution {N(0,2)_{\bf R}}. (We will explain the reason for this discrepancy later.)
  • Hermitian Wigner matrix ensembles, in which the upper triangular coefficients are jointly independent, with the diagonal entries being real and the strictly upper triangular entries complex, and the lower triangular coefficients {\xi_{ij}}, {j<i} are constrained to equal their adjoints: {\xi_{ij} = \overline{\xi_{ji}}}. Thus {M} by construction is always a Hermitian matrix. This class of ensembles contains the symmetric Wigner ensembles as a subclass. Another very important example is the Gaussian Unitary Ensemble (GUE), in which all off-diagonal entries have distribution {N(0,1)_{\bf C}}, but the diagonal entries have distribution {N(0,1)_{\bf R}}. (A minimal numerical sampler for these three models is sketched after this list.)
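
Here is the promised minimal sampler (assuming numpy; the matrix size is arbitrary, and the constructions below are one possible way of realising the distributions just described).

    import numpy as np

    rng = np.random.default_rng(9)
    n = 500

    # iid Bernoulli ensemble (random sign matrix)
    M_iid = rng.choice([-1.0, 1.0], size=(n, n))

    # GOE: strictly upper triangular entries N(0,1)_R, diagonal entries N(0,2)_R
    G = rng.standard_normal((n, n))
    M_goe = np.triu(G, 1) + np.triu(G, 1).T + np.sqrt(2.0) * np.diag(rng.standard_normal(n))

    # GUE: strictly upper triangular entries N(0,1)_C, diagonal entries N(0,1)_R
    H = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    M_gue = np.triu(H, 1) + np.triu(H, 1).conj().T + np.diag(rng.standard_normal(n))

    # sanity checks on the symmetry constraints
    assert np.allclose(M_goe, M_goe.T)
    assert np.allclose(M_gue, M_gue.conj().T)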

Given a matrix ensemble {M}, there are many statistics of {M} that one may wish to consider, e.g. the eigenvalues or singular values of {M}, the trace and determinant, etc. In these notes we will focus on a basic statistic, namely the operator norm

\displaystyle \| M \|_{op} := \sup_{x \in {\bf C}^n: |x|=1} |Mx| \ \ \ \ \ (1)

of the matrix {M}. This is an interesting quantity in its own right, but also serves as a basic upper bound on many other quantities. (For instance, {\|M\|_{op}} is also the largest singular value {\sigma_1(M)} of {M} and thus dominates the other singular values; similarly, all eigenvalues {\lambda_i(M)} of {M} clearly have magnitude at most {\|M\|_{op}}.) Because of this, it is particularly important to get good upper tail bounds

\displaystyle {\bf P}( \|M\|_{op} \geq \lambda ) \leq \ldots

on this quantity, for various thresholds {\lambda}. (Lower tail bounds are also of interest, of course; for instance, they give us confidence that the upper tail bounds are sharp.) Also, as we shall see, the problem of upper bounding {\|M\|_{op}} can be viewed as a non-commutative analogue of upper bounding the quantity {|S_n|} studied in Notes 1. (The analogue of the central limit theorem in Notes 2 is the Wigner semi-circular law, which will be studied in the next set of notes.)

An {n \times n} matrix consisting entirely of {1}s has an operator norm of exactly {n}, as can for instance be seen from the Cauchy-Schwarz inequality. More generally, any matrix whose entries are all uniformly {O(1)} will have an operator norm of {O(n)} (which can again be seen from Cauchy-Schwarz, or alternatively from Schur’s test, or from a computation of the Frobenius norm). However, this argument does not take advantage of possible cancellations in {M}. Indeed, from analogy with concentration of measure, when the entries of the matrix {M} are independent, bounded and have mean zero, we expect the operator norm to be of size {O(\sqrt{n})} rather than {O(n)}. We shall see shortly that this intuition is indeed correct. (One can see, though, that the mean zero hypothesis is important; from the triangle inequality we see that if we add the all-ones matrix (for instance) to a random matrix with mean zero, to obtain a random matrix whose coefficients all have mean {1}, then at least one of the two random matrices necessarily has operator norm at least {n/2}.)
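
A quick numerical comparison of these two regimes (a sketch assuming numpy; the matrix size is arbitrary) contrasts the all-ones matrix, whose operator norm is exactly {n}, with a mean zero random sign matrix, whose operator norm turns out to be of size comparable to {2\sqrt{n}}.

    import numpy as np

    rng = np.random.default_rng(10)
    n = 1000

    ones = np.ones((n, n))
    signs = rng.choice([-1.0, 1.0], size=(n, n))

    # spectral (operator) norms, computed as largest singular values
    print("||all-ones||_op / n           =", np.linalg.norm(ones, 2) / n)            # exactly 1
    print("||random signs||_op / sqrt(n) =", np.linalg.norm(signs, 2) / np.sqrt(n))  # about 2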

As mentioned before, there is an analogy here with the concentration of measure phenomenon, and many of the tools used in the latter (e.g. the moment method) will also appear here. (Indeed, we will be able to use some of the concentration inequalities from Notes 1 directly to help control {\|M\|_{op}} and related quantities.) Similarly, just as many of the tools from concentration of measure could be adapted to help prove the central limit theorem, several of the tools seen here will be of use in deriving the semi-circular law.

The most advanced knowledge we have on the operator norm is given by the Tracy-Widom law, which not only tells us where the operator norm is concentrated (it turns out, for instance, that for a Wigner matrix (with some additional technical assumptions), it is concentrated in the range {[2\sqrt{n} - O(n^{-1/6}), 2\sqrt{n} + O(n^{-1/6})]}), but also what its distribution in that range is. While the methods in this set of notes can eventually be pushed to establish this result, this is far from trivial, and will only be briefly discussed here. (We may return to the Tracy-Widom law later in this course, though.)

Read the rest of this entry »

Let {X_1,X_2,\dots} be iid copies of an absolutely integrable real scalar random variable {X}, and form the partial sums {S_n := X_1 + \dots + X_n}. As we saw in the last set of notes, the law of large numbers ensures that the empirical averages {S_n/n} converge (both in probability and almost surely) to a deterministic limit, namely the mean {\mu= {\bf E} X} of the reference variable {X}. Furthermore, under some additional moment hypotheses on the underlying variable {X}, we can obtain square root cancellation for the fluctuation {\frac{S_n}{n} - \mu} of the empirical average from the mean. To simplify the calculations, let us first restrict to the case {\mu=0, \sigma^2=1} of mean zero and variance one, thus

\displaystyle  {\bf E} X = 0

and

\displaystyle  {\bf Var}(X) = {\bf E} X^2 = 1.

Then, as computed in previous notes, the normalised fluctuation {S_n/\sqrt{n}} also has mean zero and variance one:

\displaystyle  {\bf E} \frac{S_n}{\sqrt{n}} = 0

\displaystyle  {\bf Var}(\frac{S_n}{\sqrt{n}}) = {\bf E} (\frac{S_n}{\sqrt{n}})^2 = 1.

This and Chebyshev’s inequality already indicate that the “typical” size of {S_n} is {O(\sqrt{n})}, thus for instance {\frac{S_n}{\sqrt{n} \omega(n)}} goes to zero in probability for any {\omega(n)} that goes to infinity as {n \rightarrow \infty}. If we also have a finite fourth moment {{\bf E} |X|^4 < \infty}, then the calculations of the previous notes also give a fourth moment estimate

\displaystyle  {\bf E} (\frac{S_n}{\sqrt{n}})^4 = 3 + O( \frac{{\bf E} |X|^4}{n} ).

From this and the Paley-Zygmund inequality (Exercise 42 of Notes 1) we also get some lower bound for {\frac{S_n}{\sqrt{n}}} of the form

\displaystyle  {\bf P}( |\frac{S_n}{\sqrt{n}}| \geq \varepsilon ) \geq \varepsilon

for some absolute constant {\varepsilon>0} and for {n} sufficiently large; this indicates in particular that {\frac{S_n \omega(n)}{\sqrt{n}}} does not converge in any reasonable sense to something finite for any {\omega(n)} that goes to infinity.
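
These moment computations are easy to check numerically in a model case (a sketch assuming numpy; it uses random signs {X_i = \pm 1}, for which {S_n} can be sampled exactly via a binomial distribution, and the sample sizes are arbitrary choices of mine).

    import numpy as np

    rng = np.random.default_rng(11)
    n, trials = 1000, 20000
    # for random signs X_i = +-1 one has S_n = 2*Binomial(n, 1/2) - n
    S = (2.0 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)

    print("mean of S_n/sqrt(n):", S.mean())        # near 0
    print("second moment      :", np.mean(S**2))   # near 1
    print("fourth moment      :", np.mean(S**4))   # near 3, as in the estimate above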

The question remains as to what happens to the ratio {S_n/\sqrt{n}} itself, without multiplying or dividing by any factor {\omega(n)}. A first guess would be that these ratios converge in probability or almost surely, but this is unfortunately not the case:

Proposition 1 Let {X_1,X_2,\dots} be iid copies of an absolutely integrable real scalar random variable {X} with mean zero, variance one, and finite fourth moment, and write {S_n := X_1 + \dots + X_n}. Then the random variables {S_n/\sqrt{n}} do not converge in probability or almost surely to any limit, and neither does any subsequence of these random variables.

Proof: Suppose for contradiction that some sequence {S_{n_j}/\sqrt{n_j}} converged in probability or almost surely to a limit {Y}. By passing to a further subsequence we may assume that the convergence is in the almost sure sense. Since all of the {S_{n_j}/\sqrt{n_j}} have mean zero, variance one, and bounded fourth moment, Theorem 24 of Notes 1 implies that the limit {Y} also has mean zero and variance one. On the other hand, {Y} is a tail random variable and is thus almost surely constant by the Kolmogorov zero-one law from Notes 3. Since constants have variance zero, we obtain the required contradiction. \Box

Nevertheless there is an important limit for the ratio {S_n/\sqrt{n}}, which requires one to replace the notions of convergence in probability or almost sure convergence by the weaker concept of convergence in distribution.

Definition 2 (Vague convergence and convergence in distribution) Let {R} be a locally compact Hausdorff topological space with the Borel {\sigma}-algebra. A sequence of finite measures {\mu_n} on {R} is said to converge vaguely to another finite measure {\mu} if one has

\displaystyle  \int_R G(x)\ d\mu_n(x) \rightarrow \int_R G(x)\ d\mu(x)

as {n \rightarrow \infty} for all continuous compactly supported functions {G: R \rightarrow {\bf R}}. (Vague convergence is also known as weak convergence, although strictly speaking the terminology weak-* convergence would be more accurate.) A sequence of random variables {X_n} taking values in {R} is said to converge in distribution (or converge weakly or converge in law) to another random variable {X} if the distributions {\mu_{X_n}} converge vaguely to the distribution {\mu_X}, or equivalently if

\displaystyle  {\bf E}G(X_n) \rightarrow {\bf E} G(X)

as {n \rightarrow \infty} for all continuous compactly supported functions {G: R \rightarrow {\bf R}}.

One could in principle try to extend this definition beyond the locally compact Hausdorff setting, but certain pathologies can occur when doing so (e.g. failure of the Riesz representation theorem), and we will never need to consider vague convergence in spaces that are not locally compact Hausdorff, so we restrict to this setting for simplicity.

Note that the notion of convergence in distribution depends only on the distribution of the random variables involved. One consequence of this is that convergence in distribution does not produce unique limits: if {X_n} converges in distribution to {X}, and {Y} has the same distribution as {X}, then {X_n} also converges in distribution to {Y}. However, limits are unique up to equivalence in distribution (this is a consequence of the Riesz representation theorem, discussed for instance in this blog post). As a consequence of the insensitivity of convergence in distribution to equivalence in distribution, we may also legitimately talk about convergence of distribution of a sequence of random variables {X_n} to another random variable {X} even when all the random variables {X_1,X_2,\dots} and {X} involved are being modeled by different probability spaces (e.g. each {X_n} is modeled by {\Omega_n}, and {X} is modeled by {\Omega}, with no coupling presumed between these spaces). This is in contrast to the stronger notions of convergence in probability or almost sure convergence, which require all the random variables to be modeled by a common probability space. Also, by an abuse of notation, we can say that a sequence {X_n} of random variables converges in distribution to a probability measure {\mu}, when {\mu_{X_n}} converges vaguely to {\mu}. Thus we can talk about a sequence of random variables converging in distribution to a uniform distribution, a gaussian distribution, etc..

From the dominated convergence theorem (available for both convergence in probability and almost sure convergence) we see that convergence in probability or almost sure convergence implies convergence in distribution. The converse is not true, due to the insensitivity of convergence in distribution to equivalence in distribution; for instance, if {X_1,X_2,\dots} are iid copies of a non-deterministic scalar random variable {X}, then the {X_n} trivially converge in distribution to {X}, but will not converge in probability or almost surely (as one can see from the zero-one law). However, there are some partial converses that relate convergence in distribution to convergence in probability; see Exercise 10 below.

Remark 3 The notion of convergence in distribution is somewhat similar to the notion of convergence in the sense of distributions that arises in distribution theory (discussed for instance in this previous blog post), however strictly speaking the two notions of convergence are distinct and should not be confused with each other, despite the very similar names.

The notion of convergence in distribution simplifies in the case of real scalar random variables:

Proposition 4 Let {X_1,X_2,\dots} be a sequence of scalar random variables, and let {X} be another scalar random variable. Then the following are equivalent:

  • (i) {X_n} converges in distribution to {X}.
  • (ii) {F_{X_n}(t)} converges to {F_X(t)} for each continuity point {t} of {F_X} (i.e. for all real numbers {t \in {\bf R}} at which {F_X} is continuous). Here {F_X(t) := {\bf P}(X \leq t)} is the cumulative distribution function of {X}.

Proof: First suppose that {X_n} converges in distribution to {X}, and {F_X} is continuous at {t}. For any {\varepsilon > 0}, one can find a {\delta} such that

\displaystyle  F_X(t) - \varepsilon \leq F_X(t') \leq F_X(t) + \varepsilon

for every {t' \in [t-\delta,t+\delta]}. One can also find an {N} larger than {|t|+\delta} such that {F_X(-N) \leq \varepsilon} and {F_X(N) \geq 1-\varepsilon}. Thus

\displaystyle  {\bf P} (|X| \geq N ) = O(\varepsilon)

and

\displaystyle  {\bf P} (|X - t| \leq \delta ) = O(\varepsilon).

Let {G: {\bf R} \rightarrow [0,1]} be a continuous function supported on {[-2N, t]} that equals {1} on {[-N, t-\delta]}. Then by the above discussion we have

\displaystyle  {\bf E} G(X) = F_X(t) + O(\varepsilon)

and hence

\displaystyle  {\bf E} G(X_n) = F_X(t) + O(\varepsilon)

for large enough {n}. In particular

\displaystyle  {\bf P}( X_n \leq t ) \geq F_X(t) - O(\varepsilon).

A similar argument, replacing {G} with a continuous function supported on {[t,2N]} that equals {1} on {[t+\delta,N]} gives

\displaystyle  {\bf P}( X_n > t ) \geq 1 - F_X(t) - O(\varepsilon)

for {n} large enough. Putting the two estimates together gives

\displaystyle  F_{X_n}(t) = F_X(t) + O(\varepsilon)

for {n} large enough; sending {\varepsilon \rightarrow 0}, we obtain the claim.

Conversely, suppose that {F_{X_n}(t)} converges to {F_X(t)} at every continuity point {t} of {F_X}. Let {G: {\bf R} \rightarrow {\bf R}} be a continuous compactly supported function; then it is uniformly continuous. As {F_X} is monotone increasing, it can only have countably many points of discontinuity. From these two facts one can find, for any {\varepsilon>0}, a simple function {G_\varepsilon(t) = \sum_{i=1}^m c_i 1_{(t_i,t_{i+1}]}} for some {t_1 < \dots < t_{m+1}} that are points of continuity of {F_X}, and real numbers {c_i}, such that {|G(t) - G_\varepsilon(t)| \leq \varepsilon} for all {t}. Thus

\displaystyle  {\bf E} G(X_n) = {\bf E} G_\varepsilon(X_n) + O(\varepsilon)

\displaystyle  = \sum_{i=1}^m c_i(F_{X_n}(t_{i+1}) - F_{X_n}(t_i)) + O(\varepsilon).

Similarly for {X_n} replaced by {X}. Subtracting and taking limit superior, we conclude that

\displaystyle  \limsup_{n \rightarrow \infty} |{\bf E} G(X_n) - {\bf E} G(X)| = O(\varepsilon),

and on sending {\varepsilon \rightarrow 0}, we obtain that {X_n} converges in distribution to {X} as claimed. \Box

The restriction to continuity points {t} of {F_X} is necessary. Consider for instance the deterministic random variables {X_n = 1/n}; then {X_n} converges almost surely (and hence in distribution) to {0}, but {F_{X_n}(0) = 0} does not converge to {F_X(0)=1}.

Example 5 For any natural number {n}, let {X_n} be a discrete random variable drawn uniformly from the finite set {\{0/n, 1/n, \dots, (n-1)/n\}}, and let {X} be the continuous random variable drawn uniformly from {[0,1]}. Then {X_n} converges in distribution to {X}. Thus we see that a continuous random variable can emerge as the limit of discrete random variables.

Example 6 For any natural number {n}, let {X_n} be a continuous random variable drawn uniformly from {[0,1/n]}, then {X_n} converges in distribution to the deterministic real number {0}. Thus we see that discrete (or even deterministic) random variables can emerge as the limit of continuous random variables.

Exercise 7 (Portmanteau theorem) Show that the properties (i) and (ii) in Proposition 4 are also equivalent to the following three statements:

  • (iii) One has {\limsup_{n \rightarrow \infty} {\bf P}( X_n \in K ) \leq {\bf P}(X \in K)} for all closed sets {K \subset {\bf R}}.
  • (iv) One has {\liminf_{n \rightarrow \infty} {\bf P}( X_n \in U ) \geq {\bf P}(X \in U)} for all open sets {U \subset {\bf R}}.
  • (v) For any Borel set {E \subset {\bf R}} whose topological boundary {\partial E} is such that {{\bf P}(X \in \partial E) = 0}, one has {\lim_{n \rightarrow \infty} {\bf P}(X_n \in E) = {\bf P}(X \in E)}.

(Note: to prove this theorem, you may wish to invoke Urysohn’s lemma. To deduce (iii) from (i), you may wish to start with the case of compact {K}.)

We can now state the famous central limit theorem:

Theorem 8 (Central limit theorem) Let {X_1,X_2,\dots} be iid copies of a scalar random variable {X} of finite mean {\mu := {\bf E} X} and finite non-zero variance {\sigma^2 := {\bf Var}(X)}. Let {S_n := X_1 + \dots + X_n}. Then the random variables {\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)} converge in distribution to a random variable with the standard normal distribution {N(0,1)} (that is to say, a random variable with probability density function {x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}}). Thus, by abuse of notation

\displaystyle  \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \rightarrow N(0,1).

In the normalised case {\mu=0, \sigma^2=1} when {X} has mean zero and unit variance, this simplifies to

\displaystyle  \frac{S_n}{\sqrt{n}} \rightarrow N(0,1).

Using Proposition 4 (and the fact that the cumulative distribution function associated to {N(0,1)} is continuous), the central limit theorem is equivalent to asserting that

\displaystyle  {\bf P}( \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq t ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx

as {n \rightarrow \infty} for any {t \in {\bf R}}, or equivalently that

\displaystyle  {\bf P}( a \leq \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx

as {n \rightarrow \infty} for any fixed real numbers {a < b}.

Informally, one can think of the central limit theorem as asserting that {S_n} approximately behaves like it has distribution {N( n \mu, n \sigma^2 )} for large {n}, where {N(\mu,\sigma^2)} is the normal distribution with mean {\mu} and variance {\sigma^2}, that is to say the distribution with probability density function {x \mapsto \frac{1}{\sqrt{2\pi} \sigma} e^{-(x-\mu)^2/2\sigma^2}}. The integrals {\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx} can be written in terms of the error function {\hbox{erf}} as {\frac{1}{2} + \frac{1}{2} \hbox{erf}(t/\sqrt{2})}.
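
As a quick numerical illustration (not needed anywhere in the rigorous arguments below), one can sample many independent copies of {S_n} for a specific non-gaussian choice of {X} and compare the empirical distribution of the normalised sums against {N(0,1)}. The following is a minimal sketch in Python (assuming the numpy and scipy libraries), taking {X} to be uniform on {[0,1]}, so that {\mu = 1/2} and {\sigma^2 = 1/12}:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    # X uniform on [0,1]: mean 1/2, variance 1/12
    mu, sigma = 0.5, (1.0 / 12.0) ** 0.5
    n, trials = 100, 20000

    # draw `trials` independent copies of S_n and normalise them
    S = rng.random((trials, n)).sum(axis=1)
    Z = np.sqrt(n) / sigma * (S / n - mu)

    # the empirical CDF of Z should be close to the standard normal CDF
    for t in [-2.0, -1.0, 0.0, 1.0, 2.0]:
        print(t, np.mean(Z <= t), norm.cdf(t))

For each threshold {t} the two printed columns should agree to roughly two decimal places with these parameters, the discrepancy coming from the Monte Carlo error (of size about {1/\sqrt{\hbox{trials}}}) and from the finiteness of {n}.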

The central limit theorem is a basic example of the universality phenomenon in probability: many statistics involving a large number of independent (or weakly dependent) variables (such as the normalised sums {\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-\mu)}) end up having a universal asymptotic limit (in this case, the normal distribution), regardless of the precise makeup of the underlying random variable {X} that comprises the system. Indeed, the universality of the normal distribution is such that it arises in many contexts other than the fluctuations of iid random variables; the central limit theorem is merely the first place in probability theory where it makes a prominent appearance.

We will give several proofs of the central limit theorem in these notes; each of these proofs has its advantages and disadvantages, and each can be extended to prove many further results beyond the central limit theorem. We first give Lindeberg’s proof of the central limit theorem, based on exchanging (or swapping) each component {X_1,\dots,X_n} of the sum {S_n} in turn. This proof gives an accessible explanation as to why there should be a universal limit in the central limit theorem; one then computes directly with gaussians to verify that it is the normal distribution which is this universal limit. Our second proof is the most popular one taught in probability texts, namely the Fourier-analytic proof based around the concept of the characteristic function {t \mapsto {\bf E} e^{itX}} of a real random variable {X}. Thanks to the powerful identities and other results of Fourier analysis, this gives a quite short and direct proof of the central limit theorem, although the arguments may seem rather magical to readers who are not already familiar with Fourier methods. Finally, we give a proof based on the moment method, in the spirit of the arguments in the previous notes; this argument is more combinatorial, but is straightforward and particularly robust, being well equipped to handle some dependencies between components; we will illustrate this by proving the Erdos-Kac law in number theory by this method. Some further discussion of the central limit theorem (including some further proofs, such as one based on Stein’s method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and large deviation inequalities, will be discussed in the next (and final) set of notes.
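
For instance, the characteristic function of a standard normal variable {G \equiv N(0,1)} is given by the classical identity

\displaystyle  {\bf E} e^{itG} = \frac{1}{\sqrt{2\pi}} \int_{\bf R} e^{itx} e^{-x^2/2}\ dx = e^{-t^2/2},

and in the normalised case the Fourier-analytic proof will essentially amount to showing that the characteristic functions {{\bf E} e^{itS_n/\sqrt{n}} = ({\bf E} e^{itX/\sqrt{n}})^n} converge pointwise to this function.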

The following exercise illustrates the power of the central limit theorem: it yields combinatorial estimates that would otherwise require Stirling’s formula to establish.

Exercise 9 (De Moivre-Laplace theorem) Let {X} be a Bernoulli random variable, taking values in {\{0,1\}} with {{\bf P}(X=0)={\bf P}(X=1)=1/2}, thus {X} has mean {1/2} and variance {1/4}. Let {X_1,X_2,\dots} be iid copies of {X}, and write {S_n := X_1+\dots+X_n}.

  • (i) Show that {S_n} takes values in {\{0,\dots,n\}} with {{\bf P}(S_n=i) = \frac{1}{2^n} \binom{n}{i}}. (This is an example of a binomial distribution.)
  • (ii) Assume Stirling’s formula

    \displaystyle  n! = (1+o(1)) \sqrt{2\pi n} n^n e^{-n} \ \ \ \ \ (1)

    where {o(1)} is a function of {n} that goes to zero as {n \rightarrow \infty}. (A proof of this formula may be found in this previous blog post.) Using this formula, and without using the central limit theorem, show that

    \displaystyle  {\bf P}( a \leq 2\sqrt{n} (\frac{S_n}{n} - \frac{1}{2}) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx

    as {n \rightarrow \infty} for any fixed real numbers {a<b}.

The above special case of the central limit theorem was first established by de Moivre and Laplace.
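
The probability on the left-hand side of the display in (ii) can be computed exactly from binomial coefficients, so the de Moivre-Laplace approximation is also easy to check numerically. Here is a minimal sketch in Python (assuming the numpy and scipy libraries), for the window {a=-1}, {b=1}:

    import numpy as np
    from scipy.stats import binom, norm

    # the event a <= 2*sqrt(n)*(S_n/n - 1/2) <= b is the same as
    # (n + a*sqrt(n))/2 <= S_n <= (n + b*sqrt(n))/2
    a, b = -1.0, 1.0
    for n in [10, 100, 1000, 10000]:
        lo = np.ceil((n + a * np.sqrt(n)) / 2)
        hi = np.floor((n + b * np.sqrt(n)) / 2)
        exact = binom.cdf(hi, n, 0.5) - binom.cdf(lo - 1, n, 0.5)
        gauss = norm.cdf(b) - norm.cdf(a)
        print(n, exact, gauss)

The exact probabilities approach the gaussian integral as {n} increases, though somewhat slowly, reflecting the discreteness of {S_n} (a continuity correction would improve the agreement).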

We close this section with some basic facts about convergence in distribution that will be useful in the sequel.

Exercise 10 Let {X_1,X_2,\dots}, {Y_1,Y_2,\dots} be sequences of real random variables, and let {X,Y} be further real random variables.

  • (i) If {X} is deterministic, show that {X_n} converges in distribution to {X} if and only if {X_n} converges in probability to {X}.
  • (ii) Suppose that {X_n} is independent of {Y_n} for each {n}, and {X} independent of {Y}. Show that {X_n+iY_n} converges in distribution to {X+iY} if and only if {X_n} converges in distribution to {X} and {Y_n} converges in distribution to {Y}. (The shortest way to prove this is by invoking the Stone-Weierstrass theorem, but one can also proceed by proving some version of Proposition 4.) What happens if the independence hypothesis is dropped?
  • (iii) If {X_n} converges in distribution to {X}, show that for every {\varepsilon>0} there exists {K>0} such that {{\bf P}( |X_n| \geq K ) < \varepsilon} for all sufficiently large {n}. (That is to say, {X_n} is a tight sequence of random variables.)
  • (iv) Show that {X_n} converges in distribution to {X} if and only if, after extending the probability space model if necessary, one can find copies {Z_1,Z_2,\dots} and {Z} of {X_1,X_2,\dots} and {X} respectively such that {Z_n} converges almost surely to {Z}. (Hint: use the Skorohod representation, Exercise 29 of Notes 0.)
  • (v) If {X_1,X_2,\dots} converges in distribution to {X}, and {F: {\bf R} \rightarrow {\bf R}} is continuous, show that {F(X_1),F(X_2),\dots} converges in distribution to {F(X)}. Generalise this claim to the case when {X} takes values in an arbitrary locally compact Hausdorff space.
  • (vi) (Slutsky’s theorem) If {X_n} converges in distribution to {X}, and {Y_n} converges in probability to a deterministic limit {Y}, show that {X_n+Y_n} converges in distribution to {X+Y}, and {X_n Y_n} converges in distribution to {XY}. (Hint: either use (iv), or else use (iii) to control some error terms.) This statement combines particularly well with (i); see also the numerical sketch after this exercise. What happens if {Y} is not assumed to be deterministic?
  • (vii) (Fatou lemma) If {G: {\bf R} \rightarrow [0,+\infty)} is continuous, and {X_n} converges in distribution to {X}, show that {\liminf_{n \rightarrow \infty} {\bf E} G(X_n) \geq {\bf E} G(X)}.
  • (viii) (Bounded convergence) If {G: {\bf R} \rightarrow {\bf R}} is continuous and bounded, and {X_n} converges in distribution to {X}, show that {\lim_{n \rightarrow \infty} {\bf E} G(X_n) = {\bf E} G(X)}.
  • (ix) (Dominated convergence) If {X_n} converges in distribution to {X}, and there is an absolutely integrable {Y} such that {|X_n| \leq Y} almost surely for all {n}, show that {\lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X}.
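
As a concrete illustration of Slutsky's theorem from (vi), the following minimal sketch in Python (assuming the numpy and scipy libraries) takes {X_n} to be a normalised sum of random signs, which converges in distribution to {N(0,1)} by the central limit theorem, and {Y_n} to be a small perturbation of {2}, which converges in probability to {2}; the sums {X_n+Y_n} should then be approximately distributed as {N(2,1)}, and the products {X_n Y_n} approximately as {N(0,4)}:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n, trials = 500, 20000

    # X_n: normalised sums of random signs, approximately N(0,1) by the CLT
    signs = rng.choice([-1.0, 1.0], size=(trials, n))
    Xn = signs.sum(axis=1) / np.sqrt(n)

    # Y_n: converges in probability to the deterministic limit 2
    Yn = 2.0 + rng.standard_normal(trials) / n

    # compare empirical CDFs of X_n + Y_n and X_n * Y_n with N(2,1) and N(0,4)
    for t in [-1.0, 0.0, 1.0, 3.0]:
        print(t,
              np.mean(Xn + Yn <= t), norm.cdf(t, loc=2.0, scale=1.0),
              np.mean(Xn * Yn <= t), norm.cdf(t, loc=0.0, scale=2.0))

(The agreement here is only to within a percent or two, since both the central limit theorem error for {X_n} and the Monte Carlo error contribute.)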

For future reference we also mention (but will not prove) Prokhorov’s theorem that gives a partial converse to part (iii) of the above exercise:

Theorem 11 (Prokhorov’s theorem) Let {X_1,X_2,\dots} be a sequence of real random variables which is tight (that is, for every {\varepsilon>0} there exists {K>0} such that {{\bf P}(|X_n| \geq K) < \varepsilon} for all sufficiently large {n}). Then there exists a subsequence {X_{n_j}} which converges in distribution to some random variable {X} (which may have to be modeled by a different probability space than the {X_1,X_2,\dots}).

The proof of this theorem relies on the Riesz representation theorem, and is beyond the scope of this course; but see for instance Exercise 29 of this previous blog post. (See also the closely related Helly selection theorem, covered in Exercise 30 of the same post.)

Read the rest of this entry »
