You are currently browsing the monthly archive for January 2010.

I have blogged a number of times in the past about the relationship between finitary (or “hard”, or “quantitative”) analysis, and infinitary (or “soft”, or “qualitative”) analysis. One way to connect the two types of analysis is via compactness arguments (and more specifically, contradiction and compactness arguments); such arguments can convert qualitative properties (such as continuity) to quantitative properties (such as boundedness), basically because of the fundamental fact that continuous functions on a compact space are bounded (or the closely related fact that sequentially continuous functions on a sequentially compact space are bounded).

A key stage in any such compactness argument is the following: one has a sequence ${X_n}$ of “quantitative” or “finitary” objects or spaces, and one has to somehow end up with a “qualitative” or “infinitary” limit object ${X}$ or limit space. One common way to achieve this is to embed everything inside some universal space and then use some weak compactness property of that space, such as the Banach-Alaoglu theorem (or its sequential counterpart). This is for instance the idea behind the Furstenberg correspondence principle relating ergodic theory to combinatorics; see for instance this post of mine on this topic.

However, there is a slightly different approach, which I will call ultralimit analysis, which proceeds via the machinery of ultrafilters and ultraproducts; typically, the limit objects ${X}$ one constructs are now the ultraproducts (or ultralimits) of the original objects ${X_n}$. There are two main facts that make ultralimit analysis powerful. The first is that one can take ultralimits of arbitrary sequences of objects, as opposed to more traditional tools such as metric completions, which only allow one to take limits of Cauchy sequences of objects. The second fact is Łoś’s theorem, which tells us that ${X}$ is an elementary limit of the ${X_n}$ (i.e. every sentence in first-order logic which is true for the ${X_n}$ for ${n}$ large enough, is true for ${X}$). The existence of elementary limits is a manifestation of the compactness theorem in logic; see this earlier blog post for more discussion. So we see that compactness methods and ultrafilter methods are closely intertwined. (See also my earlier class notes for a related connection between ultrafilters and compactness.)
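To fix notation (this is the standard construction, and not anything specific to this post): choosing a non-principal ultrafilter ${p}$ on the natural numbers, the ultraproduct is the space of sequences modulo agreement ${p}$-almost everywhere,

$\displaystyle \prod_{n \rightarrow p} X_n := \{ (x_n)_{n \in {\bf N}}: x_n \in X_n \} / \sim, \hbox{ where } (x_n) \sim (y_n) \hbox{ iff } \{ n: x_n = y_n \} \in p,$

and the ultralimit of a sequence of elements is the equivalence class of that sequence in this quotient.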

Ultralimit analysis is very closely related to nonstandard analysis. I already discussed some aspects of this relationship in an earlier post, and will expand upon it at the bottom of this post. Roughly speaking, the relationship between ultralimit analysis and nonstandard analysis is analogous to the relationship between measure theory and probability theory.

To illustrate how ultralimit analysis is actually used in practice, I will show later in this post how to take a qualitative infinitary theory – in this case, basic algebraic geometry – and apply ultralimit analysis to then deduce a quantitative version of this theory, in which the complexity of the various algebraic sets and varieties that appear as outputs is controlled uniformly by the complexity of the inputs. The point of this exercise is to show how ultralimit analysis allows for a relatively painless conversion back and forth between the quantitative and qualitative worlds, though in some cases the quantitative translation of a qualitative result (or vice versa) may be somewhat unexpected. In an upcoming paper of Ben Green, Emmanuel Breuillard, and myself (announced in the previous blog post), we will rely on ultralimit analysis to reduce the messiness of various quantitative arguments by replacing them with a qualitative setting in which the theory becomes significantly cleaner.

For the sake of completeness, I also redo some earlier instances of the correspondence principle via ultralimit analysis, namely the deduction of the quantitative Gromov theorem from the qualitative one, and of Szemerédi’s theorem from the Furstenberg recurrence theorem, to illustrate how close the two techniques are to each other.

Emmanuel Breuillard, Ben Green, and I have just uploaded to the arXiv our announcement “Linear approximate groups“, submitted to Electronic Research Announcements.

The main result is a step towards the classification of ${K}$-approximate groups, in the specific setting of simple and semisimple Lie groups (with some partial results for more general Lie groups). For ${K \geq 1}$, define a ${K}$-approximate group to be a finite subset ${A}$ of a group ${G}$ which is a symmetric neighbourhood of the origin (thus ${1 \in A}$ and ${A^{-1} := \{a^{-1}: a \in A \}}$ is equal to ${A}$), and such that the product set ${A \cdot A}$ is covered by ${K}$ left-translates (or equivalently, ${K}$ right-translates) of ${A}$. For ${K=1}$, this is the same concept as a finite subgroup of ${G}$, but for larger values of ${K}$, one also gets some interesting objects which are close to, but not exactly groups, such as geometric progressions ${\{ g^n: -N \leq n \leq N \}}$ for some ${g \in G}$ and ${N \geq 1}$.
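As a toy illustration of the definition (in additive notation, using the hypothetical ambient group ${{\bf Z}}$ rather than a Lie group), one can verify directly that a progression is a ${2}$-approximate group:

```python
# Additive toy model: in the group Z, the progression A = {n*g : -N <= n <= N}
# is symmetric, contains the identity 0, and its sumset A+A (the additive
# analogue of the product set A.A) is covered by 2 translates of A.
g, N = 3, 10
A = {n * g for n in range(-N, N + 1)}

assert 0 in A and A == {-a for a in A}      # symmetric neighbourhood of the identity

sumset = {a + b for a in A for b in A}
cover = [{-N * g + a for a in A}, {N * g + a for a in A}]  # two translates of A
assert sumset <= cover[0] | cover[1]        # so A is a 2-approximate group
```
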

The expectation is that ${K}$-approximate groups are ${C_K}$-controlled by “structured” objects, such as actual groups and progressions, though the precise formulation of this has not yet been finalised. (We say that one finite set ${A}$ ${K}$-controls another ${B}$ if ${A}$ is at most ${K}$ times larger than ${B}$ in cardinality, and ${B}$ can be covered by at most ${K}$ left translates or right translates of ${A}$.) The task of stating and proving this statement is the noncommutative Freiman theorem problem, discussed in these earlier blog posts.

While this problem remains unsolved for general groups, significant progress has been made in special groups, notably abelian, nilpotent, and solvable groups. Furthermore, the work of Chang (over ${{\mathbb C}}$) and Helfgott (over ${{\Bbb F}_p}$) has established the important special cases of the special linear groups ${SL_2(k)}$ and ${SL_3(k)}$:

Theorem 1 (Helfgott’s theorem) Let ${d = 2,3}$ and let ${k}$ be either ${{\Bbb F}_p}$ or ${{\mathbb C}}$ for some prime ${p}$. Let ${A}$ be a ${K}$-approximate subgroup of ${G = SL_d(k)}$.

• If ${A}$ generates the entire group ${SL_d(k)}$ (which is only possible in the finite case ${k={\Bbb F}_p}$), then ${A}$ is controlled either by the trivial group or by the whole group.
• If ${d=2}$, then ${A}$ is ${K^{O(1)}}$-controlled by a solvable ${K^{O(1)}}$-approximate subgroup ${B}$ of ${G}$, or by ${G}$ itself. If ${k={\mathbb C}}$, the latter possibility cannot occur, and ${B}$ must be abelian.

Our main result is an extension of Helfgott’s theorem to ${SL_d(k)}$ for general ${d}$. In fact, we obtain an analogous result for any simple (or almost simple) Chevalley group over an arbitrary finite field (not necessarily of prime order), or over ${{\mathbb C}}$. (Standard embedding arguments then allow us to in fact handle arbitrary fields.) The results from simple groups can also be extended to (almost) semisimple Lie groups by an approximate version of Goursat’s lemma. Given that general Lie groups are known to split as extensions of (almost) semisimple Lie groups by solvable Lie groups, and Freiman-type theorems are known for solvable groups also, this in principle gives a Freiman-type theorem for arbitrary Lie groups; we have already established this in the characteristic zero case ${k={\mathbb C}}$, but there are some technical issues in the finite characteristic case ${k = {\Bbb F}_q}$ that we are currently in the process of resolving.

We remark that a qualitative version of this result (with the polynomial bounds ${K^{O(1)}}$ replaced by an ineffective bound ${O_K(1)}$) was also recently obtained by Hrushovski.

Our arguments are based in part on Helfgott’s; in particular, maximal tori play a major role here for much the same reason they do in Helfgott’s work. Our main new ingredient is a surprisingly simple argument, which we call the pivot argument, which is an analogue of a corresponding argument of Konyagin and Bourgain-Glibichuk-Konyagin that was used to prove a sum-product estimate. Indeed, it seems that Helfgott-type results in these groups can be viewed as a manifestation of a product-conjugation phenomenon analogous to the sum-product phenomenon. Namely, the sum-product phenomenon asserts that it is difficult for a subset of a field to be simultaneously approximately closed under sums and products, without being close to an actual field; similarly, the product-conjugation phenomenon asserts that it is difficult for a union of (subsets of) tori to be simultaneously approximately closed under products and conjugations, unless it is coming from a genuine group. In both cases, the key is to exploit a sizeable gap between the behaviour of two types of “pivots” (which are scaling parameters ${\xi}$ in the sum-product case, and tori in the product-conjugation case): ones which interact strongly with the underlying set ${A}$, and ones which do not interact at all. The point is that there is no middle ground of pivots which only interact weakly with the set. This separation between interacting (or “involved”) and non-interacting (or “non-involved”) pivots can then be exploited to bootstrap approximate algebraic structure into exact algebraic structure. (Curiously, a similar argument is used all the time in PDE, where it goes under the name of the “bootstrap argument”.)

Below the fold we give more details of this crucial pivot argument.

One piece of trivia about the writing of this paper: this was the first time any of us had used modern version control software to collaboratively write a paper; specifically, we used Subversion, with the repository being hosted online by xp-dev. (See this post at the Secret Blogging Seminar for how to get started with this software.) There were a number of technical glitches in getting everything to install and run smoothly, but once it was set up, it was significantly easier to use than our traditional system of emailing draft versions of the paper back and forth, as one could simply download and upload the most recent versions whenever one wished, with all changes merged successfully. I had a positive impression of this software and am likely to try it again in future collaborations, particularly those involving at least three people. (It would also work well for polymath projects, modulo the technical barrier of every participant having to install some software.)

In mathematics, one frequently starts with some space ${X}$ and wishes to extend it to a larger space ${Y}$. Generally speaking, there are two ways in which one can extend a space ${X}$:

• By embedding ${X}$ into a space ${Y}$ that has ${X}$ (or at least an isomorphic copy of ${X}$) as a subspace.
• By covering ${X}$ by a space ${Y}$ that has ${X}$ (or an isomorphic copy thereof) as a quotient.

For many important categories of interest (such as abelian categories), the former type of extension can be represented by the exact sequence,

$\displaystyle 0 \rightarrow X \rightarrow Y$

and the latter type of extension can be represented by the exact sequence

$\displaystyle Y \rightarrow X \rightarrow 0.$

In some cases, ${X}$ can be both embedded in, and covered by, ${Y}$, in a consistent fashion; in such cases we sometimes say that the above exact sequences split.

An analogy would be to that of digital images. When a computer represents an image, it is limited both by the scope of the image (what it is picturing), and by the resolution of the image (how much physical space is represented by a given pixel). To make the image “larger”, one could either embed the image in an image of larger scope but equal resolution (e.g. embedding a ${200 \times 200}$ pixel image of a person’s face into an ${800 \times 800}$ pixel image that covers a region of space that is four times larger in both dimensions, e.g. the person’s upper body) or cover the image with an image of higher resolution but of equal scope (e.g. enhancing a ${200 \times 200}$ pixel picture of a face to an ${800 \times 800}$ pixel image of the same face). In the former case, the original image is a sub-image (or cropped image) of the extension, but in the latter case the original image is a quotient (or a pixelation) of the extension. In the former case, each pixel in the original image can be identified with a pixel in the extension, but not every pixel in the extension is covered. In the latter case, every pixel in the original image is covered by several pixels in the extension, but the pixel in the original image is not canonically identified with any particular pixel in the extension that covers it; it “loses its identity” by dispersing into higher resolution pixels.
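This analogy can be rendered concretely in a few lines (a toy sketch with a hypothetical ${2 \times 2}$ “image” of numbers, not any actual image-processing API):

```python
# A toy rendering of the analogy, with a hypothetical 2x2 "image" of numeric
# pixel values (no actual image-processing library involved).
low = [[1.0, 2.0],
       [3.0, 4.0]]

# Embedding: larger scope, same resolution; the original is a crop of it.
wide = [[0.0] * 4 for _ in range(4)]
for i in range(2):
    for j in range(2):
        wide[i][j] = low[i][j]
assert [row[:2] for row in wide[:2]] == low       # cropping recovers the original

# Covering: same scope, higher resolution; the original is a pixelation of it.
high = [[low[i // 2][j // 2] for j in range(4)] for i in range(4)]
pixelated = [[sum(high[2 * i + a][2 * j + b] for a in range(2) for b in range(2)) / 4
              for j in range(2)] for i in range(2)]
assert pixelated == low                           # block-averaging recovers the original
```
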

(Note that “zooming in” the visual representation of an image by making each pixel occupy a larger region of the screen increases neither the scope nor the resolution; in this language, a zoomed-in version of an image is merely an isomorphic copy of the original image; it carries the same amount of information as the original image, but has been represented in a new coordinate system which may make it easier to view, especially to the visually impaired.)

In the study of a given category of spaces (e.g. topological spaces, manifolds, groups, fields, etc.), embeddings and coverings are both important; this is particularly true in the more topological areas of mathematics, such as manifold theory. But typically, the term extension is reserved for just one of these two operations. For instance, in the category of fields, coverings are quite trivial; if one covers a field ${k}$ by a field ${l}$, the kernel of the covering map ${\pi: l \rightarrow k}$ is necessarily trivial and so ${k, l}$ are in fact isomorphic. So in field theory, a field extension refers to an embedding of a field, rather than a covering of a field. Similarly, in the theory of metric spaces, there are no non-trivial isometric coverings of a metric space, and so the only useful notion of an extension of a metric space is the one given by embedding the original space in the extension.

On the other hand, in group theory (and in group-like theories, such as the theory of dynamical systems, which studies group actions), the term “extension” is reserved for coverings, rather than for embeddings. I think one of the main reasons for this is that coverings of groups automatically generate a special type of embedding (a normal embedding), whereas most embeddings don’t generate coverings. More precisely, given a group extension ${G}$ of a base group ${H}$,

$\displaystyle G \rightarrow H \rightarrow 0,$

one can form the kernel ${K = \hbox{ker}(\pi)}$ of the covering map ${\pi: G \rightarrow H}$, which is a normal subgroup of ${G}$, and we thus can extend the above sequence canonically to a short exact sequence

$\displaystyle 0 \rightarrow K \rightarrow G \rightarrow H \rightarrow 0.$

On the other hand, an embedding of ${K}$ into ${G}$,

$\displaystyle 0 \rightarrow K \rightarrow G$

does not similarly extend to a short exact sequence unless the embedding is normal.
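For a minimal concrete instance of the covering-to-kernel construction (a hypothetical toy example, in additive notation):

```python
# Toy instance of the covering-to-kernel construction, with G = Z/4 covering
# H = Z/2 (additive notation; a hypothetical minimal example).
G = list(range(4))                       # the group Z/4
pi = lambda g: g % 2                     # the covering map onto H = Z/2
K = [g for g in G if pi(g) == 0]         # its kernel
assert K == [0, 2]
# K is automatically normal: conjugates g + k - g stay in K (trivially here,
# since the group is abelian).
assert all((g + k - g) % 4 in K for g in G for k in K)
```
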

Another reason for the notion of extension varying between embeddings and coverings from subject to subject is that there are various natural duality operations (and more generally, contravariant functors) which turn embeddings into coverings and vice versa. For instance, an embedding of one vector space ${V}$ into another ${W}$ induces a covering of the dual space ${V^*}$ by the dual space ${W^*}$, and conversely; similarly, an embedding of a locally compact abelian group ${H}$ in another ${G}$ induces a covering of the Pontryagin dual ${\hat H}$ by the Pontryagin dual ${\hat G}$. In the language of images, embedding an image in an image of larger scope is largely equivalent to covering the Fourier transform of that image by a transform of higher resolution, and conversely; this is ultimately a manifestation of the basic fact that frequency is inversely proportional to wavelength.

Similarly, a common duality operation arises in many areas of mathematics by starting with a space ${X}$ and then considering a space ${C(X)}$ of functions on that space (e.g. continuous real-valued functions, if ${X}$ was a topological space, or in more algebraic settings one could consider homomorphisms from ${X}$ to some fixed space). Embedding ${X}$ into ${Y}$ then induces a covering of ${C(X)}$ by ${C(Y)}$, and conversely, a covering of ${X}$ by ${Y}$ induces an embedding of ${C(X)}$ into ${C(Y)}$. Returning again to the analogy with images, if one looks at the collection of all images of a fixed scope and resolution, rather than just a single image, then increasing the available resolution causes an embedding of the space of low-resolution images into the space of high-resolution images (since of course every low-resolution image is an example of a high-resolution image), whereas increasing the available scope causes a covering of the space of narrow-scope images by the space of wide-scope images (since every wide-scope image can be cropped into a narrow-scope image). Note in the case of images, that these extensions can be split: not only can a low-resolution image be viewed as a special case of a high-resolution image, but any high-resolution image can be pixelated into a low-resolution one. Similarly, not only can any wide-scope image be cropped into a narrow-scope one, a narrow-scope image can be extended to a wide-scope one simply by filling in all the new areas of scope with black (or by using more advanced image processing tools to create a more visually pleasing extension). (In the category of sets, the statement that every covering can be split is precisely the axiom of choice.)
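The first half of this duality is easy to render concretely: a covering of ${X}$ by ${Y}$ induces an embedding of ${C(X)}$ into ${C(Y)}$ by precomposition (a sketch with hypothetical finite sets standing in for the spaces):

```python
# Sketch with hypothetical finite sets standing in for the spaces: a covering
# (surjection) pi: Y -> X induces the pullback embedding f |-> f o pi of the
# function space C(X) into C(Y).
X = ["a", "b"]
Y = [0, 1, 2, 3]
pi = {0: "a", 1: "a", 2: "b", 3: "b"}    # a covering of X by Y

def pullback(f):
    return {y: f[pi[y]] for y in Y}      # f o pi, a function on Y

f = {"a": 1.0, "b": 2.0}
g = {"a": 1.0, "b": 3.0}
assert pullback(f) == {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0}
assert pullback(f) != pullback(g)        # injective, since pi is onto
```
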

I’ve recently found myself having to deal quite a bit with group extensions in my research, so I have decided to make some notes on the basic theory of such extensions here. This is utterly elementary material for a group theorist, but I found this task useful for organising my own thoughts on this topic, and also in pinning down some of the jargon in this field.

One theme in this course will be the central role played by the gaussian random variables ${X \equiv N(\mu,\sigma^2)}$. Gaussians have an incredibly rich algebraic structure, and many results about general random variables can be established by first using this structure to verify the result for gaussians, and then using universality techniques (such as the Lindeberg exchange strategy) to extend the results to more general variables.

One way to exploit this algebraic structure is to continuously deform the variance ${t := \sigma^2}$ from an initial variance of zero (so that the random variable is deterministic) to some final level ${T}$. We would like to use this to give a continuous family ${t \mapsto X_t}$ of random variables ${X_t \equiv N(\mu, t)}$ as ${t}$ (viewed as a “time” parameter) runs from ${0}$ to ${T}$.

At present, we have not completely specified what ${X_t}$ should be, because we have only described the individual distribution ${X_t \equiv N(\mu,t)}$ of each ${X_t}$, and not the joint distribution. However, there is a very natural way to specify a joint distribution of this type, known as Brownian motion. In these notes we lay the necessary probability theory foundations to set up this motion, and indicate its connection with the heat equation, the central limit theorem, and the Ornstein-Uhlenbeck process. This is the beginning of stochastic calculus, which we will not develop fully here.
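A discretised sketch of the desired family (a gaussian random walk, standing in for, but not replacing, the measure-theoretic construction carried out in these notes): independent increments of variance ${dt}$, so that the marginal distribution of ${X_t}$ is ${N(\mu, t)}$ at each grid time.

```python
import random

# A discretised sketch of the family t -> X_t: independent gaussian increments
# of variance dt, so that X_t is (approximately, at grid times) N(mu, t).
random.seed(0)

def brownian_path(mu, T, steps):
    dt = T / steps
    path = [mu]
    for _ in range(steps):
        path.append(path[-1] + random.gauss(0.0, dt ** 0.5))  # increment ~ N(0, dt)
    return path

path = brownian_path(mu=1.0, T=2.0, steps=1000)
assert path[0] == 1.0 and len(path) == 1001

# Empirical check that X_T - mu has variance close to T = 2.
samples = [brownian_path(0.0, 2.0, 100)[-1] for _ in range(500)]
var = sum(x * x for x in samples) / len(samples)
assert 1.5 < var < 2.5
```
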

We will begin with one-dimensional Brownian motion, but it is a simple matter to extend the process to higher dimensions. In particular, we can define Brownian motion on vector spaces of matrices, such as the space of ${n \times n}$ Hermitian matrices. This process is equivariant with respect to conjugation by unitary matrices, and so we can quotient out by this conjugation and obtain a new process on the quotient space, or in other words on the spectrum of ${n \times n}$ Hermitian matrices. This process is called Dyson Brownian motion, and turns out to have a simple description in terms of ordinary Brownian motion; it will play a key role in several of the subsequent notes in this course.

After a hiatus of several months, I’ve made an effort to advance the writing of the second Polymath1 paper, entitled “Density Hales-Jewett and Moser numbers“.  This is in part due to a request from the Szemerédi 70th birthday conference proceedings (which solicited the paper) to move the submission date up from April to February.  (Also, the recent launch of Polymath5 on Tim Gowers’ blog reminds me that I should get this older project out of the way.)

The current draft of the paper is here, with source files here.  I have been trimming the paper, in particular replacing some of the auxiliary or incomplete material in the paper with references to pages on the polymath wiki instead.  Nevertheless this is still a large paper, at 51 pages.  It is now focused primarily on the computation of the Density Hales-Jewett numbers $c_{n,3}$ and the Moser numbers $c'_{n,3}$ for all n up to 6, with the latter requiring a significant amount of computer assistance.

There are a number of minor issues remaining with the paper:

1. A picture of a Fujimura set for the introduction would be nice.
2. In the proof of Theorem 1.3 (asymptotic lower bound for DHJ numbers), it is asserted without proof that the circulant matrix with first row 1,2,…,k-1 is nonsingular.  One can prove this claim by computing the Fourier coefficients $\sum_{j=1}^{k-1} j e^{2\pi i j t / (k-1)}$ for all t, but is there a slicker way to see this (e.g. by citing a reference)?
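As a quick numerical sanity check of the nonsingularity claim (not a slick proof, just a brute-force determinant computation for small k):

```python
from itertools import permutations

# Check numerically, for small k, that the (k-1)x(k-1) circulant matrix with
# first row 1, 2, ..., k-1 is nonsingular.
def det(M):
    # Leibniz formula; adequate for the tiny matrices tested here.
    n = len(M)
    total = 0
    for p in permutations(range(n)):
        sign = (-1) ** sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        prod = 1
        for i in range(n):
            prod *= M[i][p[i]]
        total += sign * prod
    return total

for k in range(3, 8):
    row = list(range(1, k))                                       # 1, 2, ..., k-1
    m = len(row)
    C = [[row[(j - i) % m] for j in range(m)] for i in range(m)]  # circulant shifts
    assert det(C) != 0, k
```
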
3. Reference [15] (which is Komlos’s lower bound on the Moser numbers) is missing a volume number.  The reference is currently given as
J. Komlos, solution to problem P.170 by Leo Moser, Canad. Math.. Bull. vol  ???  (1972), 312-313, 1970.

Finally, the text probably needs to be proofread one or two more times before it is ready to go, hopefully by early February.  There is still also one last opportunity to propose non-trivial restructuring of the paper (in particular, if there are other ways to trim the size of the paper, this may be worth looking into).

Let ${A}$ be a Hermitian ${n \times n}$ matrix. By the spectral theorem for Hermitian matrices (which, for the sake of completeness, we prove below), one can diagonalise ${A}$ using a sequence

$\displaystyle \lambda_1(A) \geq \ldots \geq \lambda_n(A)$

of ${n}$ real eigenvalues, together with an orthonormal basis of eigenvectors ${u_1(A),\ldots,u_n(A) \in {\mathbb C}^n}$. (The eigenvalues are uniquely determined by ${A}$, but the eigenvectors have a little ambiguity to them, particularly if there are repeated eigenvalues; for instance, one could multiply each eigenvector by a complex phase ${e^{i\theta}}$. In these notes we are arranging eigenvalues in descending order; of course, one can also arrange eigenvalues in increasing order, which causes some slight notational changes in the results below.) The set ${\{\lambda_1(A),\ldots,\lambda_n(A)\}}$ is known as the spectrum of ${A}$.

A basic question in linear algebra asks the extent to which the eigenvalues ${\lambda_1(A),\ldots,\lambda_n(A)}$ and ${\lambda_1(B),\ldots,\lambda_n(B)}$ of two Hermitian matrices ${A, B}$ constrain the eigenvalues ${\lambda_1(A+B),\ldots,\lambda_n(A+B)}$ of the sum. For instance, the linearity of trace

$\displaystyle \hbox{tr}(A+B) = \hbox{tr}(A)+\hbox{tr}(B),$

when expressed in terms of eigenvalues, gives the trace constraint

$\displaystyle \lambda_1(A+B)+\ldots+\lambda_n(A+B) = \lambda_1(A)+\ldots+\lambda_n(A) + \lambda_1(B)+\ldots+\lambda_n(B); \ \ \ \ \ (1)$

the identity

$\displaystyle \lambda_1(A) = \sup_{|v|=1} v^* Av \ \ \ \ \ (2)$

(together with the counterparts for ${B}$ and ${A+B}$) gives the inequality

$\displaystyle \lambda_1(A+B) \leq \lambda_1(A) + \lambda_1(B); \ \ \ \ \ (3)$

and so forth.
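For ${2 \times 2}$ real symmetric matrices, whose eigenvalues have a closed form, the constraint (1) and the inequality (3) are easy to check numerically; the following sketch (with hypothetical random test matrices) does so:

```python
import random

# A numerical illustration, for 2x2 real symmetric matrices (where the
# eigenvalues of [[a, b], [b, c]] are (a+c)/2 +- sqrt(((a-c)/2)^2 + b^2)),
# of the trace constraint (1) and the inequality (3).
random.seed(1)

def eigs(a, b, c):
    mean = (a + c) / 2
    disc = (((a - c) / 2) ** 2 + b ** 2) ** 0.5
    return mean + disc, mean - disc          # descending order

for _ in range(1000):
    A = [random.uniform(-1, 1) for _ in range(3)]
    B = [random.uniform(-1, 1) for _ in range(3)]
    S = [x + y for x, y in zip(A, B)]
    lA, lB, lS = eigs(*A), eigs(*B), eigs(*S)
    assert abs(sum(lS) - sum(lA) - sum(lB)) < 1e-9   # the trace constraint (1)
    assert lS[0] <= lA[0] + lB[0] + 1e-9             # the inequality (3)
```
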

The complete answer to this problem is a fascinating one, requiring a strangely recursive description (once known as Horn’s conjecture, which is now solved), and connected to a large number of other fields of mathematics, such as geometric invariant theory, intersection theory, and the combinatorics of a certain gadget known as a “honeycomb”. See for instance my survey with Allen Knutson on this topic some years ago.

In typical applications to random matrices, one of the matrices (say, ${B}$) is “small” in some sense, so that ${A+B}$ is a perturbation of ${A}$. In this case, one does not need the full strength of the above theory, and can instead rely on a simple aspect of it pointed out by Helmke and Rosenthal and by Totaro, which generates several of the eigenvalue inequalities relating ${A}$, ${B}$, and ${A+B}$, of which (1) and (3) are examples. (Actually, this method eventually generates all of the eigenvalue inequalities, but this is a non-trivial fact to prove.) These eigenvalue inequalities can mostly be deduced from a number of minimax characterisations of eigenvalues (of which (2) is a typical example), together with some basic facts about intersections of subspaces. Examples include the Weyl inequalities

$\displaystyle \lambda_{i+j-1}(A+B) \leq \lambda_i(A) + \lambda_j(B), \ \ \ \ \ (4)$

valid whenever ${i,j \geq 1}$ and ${i+j-1 \leq n}$, and the Ky Fan inequality

$\displaystyle \lambda_1(A+B)+\ldots+\lambda_k(A+B) \leq \lambda_1(A)+\ldots+\lambda_k(A) + \lambda_1(B)+\ldots+\lambda_k(B). \ \ \ \ \ (5)$

One consequence of these inequalities is that the spectrum of a Hermitian matrix is stable with respect to small perturbations.
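To spell out the stability claim in the simplest case: applying (4) with ${j=1}$ to both ${A+B}$ and ${A = (A+B) + (-B)}$ gives ${\lambda_i(A) + \lambda_n(B) \leq \lambda_i(A+B) \leq \lambda_i(A) + \lambda_1(B)}$, so each eigenvalue of ${A+B}$ lies within ${\|B\|_{op}}$ of the corresponding eigenvalue of ${A}$. A numerical sketch for ${2 \times 2}$ real symmetric matrices:

```python
import random

# Numerical sketch of the stability assertion for 2x2 real symmetric
# matrices: each eigenvalue of A+B lies within ||B||_op of the
# corresponding eigenvalue of A.
random.seed(2)

def eigs(a, b, c):
    # eigenvalues of [[a, b], [b, c]], in descending order
    mean = (a + c) / 2
    disc = (((a - c) / 2) ** 2 + b ** 2) ** 0.5
    return mean + disc, mean - disc

for _ in range(1000):
    A = [random.uniform(-1, 1) for _ in range(3)]
    B = [random.uniform(-0.01, 0.01) for _ in range(3)]   # a small perturbation
    lA = eigs(*A)
    lS = eigs(*[x + y for x, y in zip(A, B)])
    opB = max(abs(e) for e in eigs(*B))                   # the operator norm of B
    for i in range(2):
        assert abs(lS[i] - lA[i]) <= opB + 1e-9
```
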

We will also establish some closely related inequalities concerning the relationships between the eigenvalues of a matrix, and the eigenvalues of its minors.

Many of the inequalities here have analogues for the singular values of non-Hermitian matrices (which is consistent with the discussion near Exercise 16 of Notes 3). However, the situation is markedly different when dealing with eigenvalues of non-Hermitian matrices; here, the spectrum can be far more unstable, if pseudospectrum is present. Because of this, the theory of the eigenvalues of a random non-Hermitian matrix requires an additional ingredient, namely upper bounds on the prevalence of pseudospectrum, which after recentering the matrix is basically equivalent to establishing lower bounds on least singular values. We will discuss this point in more detail in later notes.

We will work primarily here with Hermitian matrices, which can be viewed as self-adjoint transformations on complex vector spaces such as ${{\mathbb C}^n}$. One can of course specialise the discussion to real symmetric matrices, in which case one can restrict these complex vector spaces to their real counterparts ${{\mathbb R}^n}$. The specialisation of the complex theory below to the real case is straightforward and is left to the interested reader.

Now that we have developed the basic probabilistic tools that we will need, we now turn to the main subject of this course, namely the study of random matrices. There are many random matrix models (aka matrix ensembles) of interest – far too many to all be discussed in a single course. We will thus focus on just a few simple models. First of all, we shall restrict attention to square matrices ${M = (\xi_{ij})_{1 \leq i,j \leq n}}$, where ${n}$ is a (large) integer and the ${\xi_{ij}}$ are real or complex random variables. (One can certainly study rectangular matrices as well, but for simplicity we will only look at the square case.) Then, we shall restrict to three main models:

• Iid matrix ensembles, in which the coefficients ${\xi_{ij}}$ are iid random variables with a single distribution ${\xi_{ij} \equiv \xi}$. We will often normalise ${\xi}$ to have mean zero and unit variance. Examples of iid models include the Bernoulli ensemble (aka random sign matrices) in which the ${\xi_{ij}}$ are signed Bernoulli variables, the real gaussian matrix ensemble in which ${\xi_{ij} \equiv N(0,1)_{\bf R}}$, and the complex gaussian matrix ensemble in which ${\xi_{ij} \equiv N(0,1)_{\bf C}}$.
• Symmetric Wigner matrix ensembles, in which the upper triangular coefficients ${\xi_{ij}}$, ${j \geq i}$ are jointly independent and real, but the lower triangular coefficients ${\xi_{ij}}$, ${j < i}$ are constrained to equal their transposes: ${\xi_{ij}=\xi_{ji}}$. Thus ${M}$ by construction is always a real symmetric matrix. Typically, the strictly upper triangular coefficients will be iid, as will the diagonal coefficients, but the two classes of coefficients may have a different distribution. One example here is the symmetric Bernoulli ensemble, in which both the strictly upper triangular and the diagonal entries are signed Bernoulli variables; another important example is the Gaussian Orthogonal Ensemble (GOE), in which the upper triangular entries have distribution ${N(0,1)_{\bf R}}$ and the diagonal entries have distribution ${N(0,2)_{\bf R}}$. (We will explain the reason for this discrepancy later.)
• Hermitian Wigner matrix ensembles, in which the upper triangular coefficients are jointly independent, with the diagonal entries being real and the strictly upper triangular entries complex, and the lower triangular coefficients ${\xi_{ij}}$, ${j < i}$ are constrained to equal their adjoints: ${\xi_{ij} = \overline{\xi_{ji}}}$. Thus ${M}$ by construction is always a Hermitian matrix. This class of ensembles contains the symmetric Wigner ensembles as a subclass. Another very important example is the Gaussian Unitary Ensemble (GUE), in which all off-diagonal entries have distribution ${N(0,1)_{\bf C}}$, but the diagonal entries have distribution ${N(0,1)_{\bf R}}$.
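For concreteness, here is a sampler sketch for one of these models, the GOE, following the description above (standard library only; the helper name is my own):

```python
import random

# A sampler sketch for the Gaussian Orthogonal Ensemble as described above:
# independent N(0,1) entries strictly above the diagonal, N(0,2) on the
# diagonal, and the lower triangle filled in by symmetry.
random.seed(0)

def goe(n):
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = random.gauss(0.0, 2 ** 0.5)           # diagonal: N(0,2)
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = random.gauss(0.0, 1.0)  # off-diagonal: N(0,1)
    return M

M = goe(4)
assert all(M[i][j] == M[j][i] for i in range(4) for j in range(4))  # symmetric
```
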

Given a matrix ensemble ${M}$, there are many statistics of ${M}$ that one may wish to consider, e.g. the eigenvalues or singular values of ${M}$, the trace and determinant, etc. In these notes we will focus on a basic statistic, namely the operator norm

$\displaystyle \| M \|_{op} := \sup_{x \in {\bf C}^n: |x|=1} |Mx| \ \ \ \ \ (1)$

of the matrix ${M}$. This is an interesting quantity in its own right, but also serves as a basic upper bound on many other quantities. (For instance, ${\|M\|_{op}}$ is also the largest singular value ${\sigma_1(M)}$ of ${M}$ and thus dominates the other singular values; similarly, all eigenvalues ${\lambda_i(M)}$ of ${M}$ clearly have magnitude at most ${\|M\|_{op}}$.) Because of this, it is particularly important to get good upper tail bounds

$\displaystyle {\bf P}( \|M\|_{op} \geq \lambda ) \leq \ldots$

on this quantity, for various thresholds ${\lambda}$. (Lower tail bounds are also of interest, of course; for instance, they give us confidence that the upper tail bounds are sharp.) Also, as we shall see, the problem of upper bounding ${\|M\|_{op}}$ can be viewed as a non-commutative analogue of upper bounding the quantity ${|S_n|}$ studied in Notes 1. (The analogue of the central limit theorem in Notes 2 is the Wigner semi-circular law, which will be studied in the next set of notes.)

An ${n \times n}$ matrix consisting entirely of ${1}$s has an operator norm of exactly ${n}$, as can for instance be seen from the Cauchy-Schwarz inequality. More generally, any matrix whose entries are all uniformly ${O(1)}$ will have an operator norm of ${O(n)}$ (which can again be seen from Cauchy-Schwarz, or alternatively from Schur’s test, or from a computation of the Frobenius norm). However, this argument does not take advantage of possible cancellations in ${M}$. Indeed, from analogy with concentration of measure, when the entries of the matrix ${M}$ are independent, bounded and have mean zero, we expect the operator norm to be of size ${O(\sqrt{n})}$ rather than ${O(n)}$. We shall see shortly that this intuition is indeed correct. (One can see, though, that the mean zero hypothesis is important; from the triangle inequality we see that if we add the all-ones matrix (for instance) to a random matrix with mean zero, to obtain a random matrix whose coefficients all have mean ${1}$, then at least one of the two random matrices necessarily has operator norm at least ${n/2}$.)
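Both assertions are easy to see numerically. The sketch below estimates ${\|M\|_{op}}$ by power iteration (a from-scratch helper, not anything from these notes) and contrasts the all-ones matrix with a random sign matrix:

```python
import random

# A power-iteration estimate of the operator norm, contrasting the all-ones
# matrix, whose norm is exactly n, with a random sign matrix, whose norm is
# only O(sqrt(n)).
random.seed(0)

def op_norm(M, iters=100):
    n = len(M)
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]  # y = Mx
        x = [sum(M[i][j] * y[i] for i in range(n)) for j in range(n)]  # x = M^T y
        norm = sum(v * v for v in x) ** 0.5
        x = [v / norm for v in x]
    y = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(v * v for v in y) ** 0.5   # |Mx|, approximately the top singular value

n = 50
ones = [[1.0] * n for _ in range(n)]
signs = [[random.choice([-1.0, 1.0]) for _ in range(n)] for _ in range(n)]
assert abs(op_norm(ones) - n) < 1e-6      # the all-ones matrix has norm exactly n
assert op_norm(signs) < n / 2             # far smaller than n; about 2*sqrt(n)
```
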

As mentioned before, there is an analogy here with the concentration of measure phenomenon, and many of the tools used in the latter (e.g. the moment method) will also appear here. (Indeed, we will be able to use some of the concentration inequalities from Notes 1 directly to help control ${\|M\|_{op}}$ and related quantities.) Similarly, just as many of the tools from concentration of measure could be adapted to help prove the central limit theorem, several of the tools seen here will be of use in deriving the semi-circular law.

The most advanced knowledge we have on the operator norm is given by the Tracy-Widom law, which not only tells us where the operator norm is concentrated (it turns out, for instance, that for a Wigner matrix (with some additional technical assumptions), it is concentrated in the range ${[2\sqrt{n} - O(n^{-1/6}), 2\sqrt{n} + O(n^{-1/6})]}$), but also what its distribution in that range is. While the methods in this set of notes can eventually be pushed to establish this result, this is far from trivial, and will only be briefly discussed here. (We may return to the Tracy-Widom law later in this course, though.)

This week at UCLA, Pierre-Louis Lions gave one of this year’s Distinguished Lecture Series, on the topic of mean field games. These are a relatively novel class of systems of partial differential equations that are used to understand the behaviour of multiple agents each individually trying to optimise their position in space and time, but with their preferences being partly determined by the choices of all the other agents, in the asymptotic limit when the number of agents goes to infinity. A good example here is that of traffic congestion: as a first approximation, each agent wishes to get from A to B in the shortest path possible, but the speed at which one can travel depends on the density of other agents in the area. A more light-hearted example is that of a Mexican wave (or audience wave), which can be modeled by a system of this type, in which each agent chooses to stand, sit, or be in an intermediate position based on his or her comfort level, and also on the position of nearby agents.

Under some assumptions, mean field games can be expressed as a coupled system of two equations, a Fokker-Planck type equation evolving forward in time that governs the evolution of the density function ${m}$ of the agents, and a Hamilton-Jacobi (or Hamilton-Jacobi-Bellman) type equation evolving backward in time that governs the computation of the optimal path for each agent. The combination of both forward propagation and backward propagation in time creates some unusual “elliptic” phenomena in the time variable that are not seen in more conventional evolution equations. For instance, for Mexican waves, this model predicts that such waves only form for stadiums exceeding a certain minimum size (and this phenomenon has apparently been confirmed experimentally!).

Due to lack of time and preparation, I was not able to transcribe Lions’ lectures in full detail; but I thought I would describe here a heuristic derivation of the mean field game equations, and mention some of the results that Lions and his co-authors have been working on. (Video of a related series of lectures (in French) by Lions on this topic at the Collège de France is available here.)

To avoid (rather important) technical issues, I will work at a heuristic level only, ignoring issues of smoothness, convergence, existence and uniqueness, etc.

Let ${X_1,X_2,\dots}$ be iid copies of an absolutely integrable real scalar random variable ${X}$, and form the partial sums ${S_n := X_1 + \dots + X_n}$. As we saw in the last set of notes, the law of large numbers ensures that the empirical averages ${S_n/n}$ converge (both in probability and almost surely) to a deterministic limit, namely the mean ${\mu= {\bf E} X}$ of the reference variable ${X}$. Furthermore, under some additional moment hypotheses on the underlying variable ${X}$, we can obtain square root cancellation for the fluctuation ${\frac{S_n}{n} - \mu}$ of the empirical average from the mean. To simplify the calculations, let us first restrict to the case ${\mu=0, \sigma^2=1}$ of mean zero and variance one, thus

$\displaystyle {\bf E} X = 0$

and

$\displaystyle {\bf Var}(X) = {\bf E} X^2 = 1.$

Then, as computed in previous notes, the normalised fluctuation ${S_n/\sqrt{n}}$ also has mean zero and variance one:

$\displaystyle {\bf E} \frac{S_n}{\sqrt{n}} = 0$

$\displaystyle {\bf Var}(\frac{S_n}{\sqrt{n}}) = {\bf E} (\frac{S_n}{\sqrt{n}})^2 = 1.$

This and Chebyshev’s inequality already indicate that the “typical” size of ${S_n}$ is ${O(\sqrt{n})}$, thus for instance ${\frac{S_n}{\sqrt{n} \omega(n)}}$ goes to zero in probability for any ${\omega(n)}$ that goes to infinity as ${n \rightarrow \infty}$. If we also have a finite fourth moment ${{\bf E} |X|^4 < \infty}$, then the calculations of the previous notes also give a fourth moment estimate

$\displaystyle {\bf E} (\frac{S_n}{\sqrt{n}})^4 = 3 + O( \frac{{\bf E} |X|^4}{n} ).$

From this and the Paley-Zygmund inequality (Exercise 42 of Notes 1) we also get some lower bound for ${\frac{S_n}{\sqrt{n}}}$ of the form

$\displaystyle {\bf P}( |\frac{S_n}{\sqrt{n}}| \geq \varepsilon ) \geq \varepsilon$

for some absolute constant ${\varepsilon>0}$ and for ${n}$ sufficiently large; this indicates in particular that ${\frac{S_n \omega(n)}{\sqrt{n}}}$ does not converge in any reasonable sense to something finite for any ${\omega(n)}$ that goes to infinity.
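This ${O(\sqrt{n})}$ scaling is easy to see in simulation. The following sketch (an illustration of my own, using Rademacher signs as the mean zero, variance one variable ${X}$) checks that the normalised fluctuation ${S_n/\sqrt{n}}$ keeps mean near ${0}$ and variance near ${1}$ across widely different values of ${n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 1000
stats = {}
for n in (100, 5000):
    # iid Rademacher signs: mean zero, variance one
    X = rng.choice([-1.0, 1.0], size=(trials, n))
    Z = X.sum(axis=1) / np.sqrt(n)   # the normalised fluctuation S_n / sqrt(n)
    stats[n] = (Z.mean(), Z.var())
```

The empirical mean and variance of ${S_n/\sqrt{n}}$ are essentially independent of ${n}$, which is exactly the statement that ${S_n}$ itself is typically of size ${O(\sqrt{n})}$.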

The question remains as to what happens to the ratio ${S_n/\sqrt{n}}$ itself, without multiplying or dividing by any factor ${\omega(n)}$. A first guess would be that these ratios converge in probability or almost surely, but this is unfortunately not the case:

Proposition 1 Let ${X_1,X_2,\dots}$ be iid copies of an absolutely integrable real scalar random variable ${X}$ with mean zero, variance one, and finite fourth moment, and write ${S_n := X_1 + \dots + X_n}$. Then the random variables ${S_n/\sqrt{n}}$ do not converge in probability or almost surely to any limit, and neither does any subsequence of these random variables.

Proof: Suppose for contradiction that some sequence ${S_{n_j}/\sqrt{n_j}}$ converged in probability or almost surely to a limit ${Y}$. By passing to a further subsequence we may assume that the convergence is in the almost sure sense. Since all of the ${S_{n_j}/\sqrt{n_j}}$ have mean zero, variance one, and bounded fourth moment, Theorem 24 of Notes 1 implies that the limit ${Y}$ also has mean zero and variance one. On the other hand, ${Y}$ is a tail random variable and is thus almost surely constant by the Kolmogorov zero-one law from Notes 3. Since constants have variance zero, we obtain the required contradiction. $\Box$

Nevertheless there is an important limit for the ratio ${S_n/\sqrt{n}}$, which requires one to replace the notions of convergence in probability or almost sure convergence by the weaker concept of convergence in distribution.

Definition 2 (Vague convergence and convergence in distribution) Let ${R}$ be a locally compact Hausdorff topological space with the Borel ${\sigma}$-algebra. A sequence of finite measures ${\mu_n}$ on ${R}$ is said to converge vaguely to another finite measure ${\mu}$ if one has

$\displaystyle \int_R G(x)\ d\mu_n(x) \rightarrow \int_R G(x)\ d\mu(x)$

as ${n \rightarrow \infty}$ for all continuous compactly supported functions ${G: R \rightarrow {\bf R}}$. (Vague convergence is also known as weak convergence, although strictly speaking the terminology weak-* convergence would be more accurate.) A sequence of random variables ${X_n}$ taking values in ${R}$ is said to converge in distribution (or converge weakly or converge in law) to another random variable ${X}$ if the distributions ${\mu_{X_n}}$ converge vaguely to the distribution ${\mu_X}$, or equivalently if

$\displaystyle {\bf E}G(X_n) \rightarrow {\bf E} G(X)$

as ${n \rightarrow \infty}$ for all continuous compactly supported functions ${G: R \rightarrow {\bf R}}$.

One could in principle try to extend this definition beyond the locally compact Hausdorff setting, but certain pathologies can occur when doing so (e.g. failure of the Riesz representation theorem), and we will never need to consider vague convergence in spaces that are not locally compact Hausdorff, so we restrict to this setting for simplicity.

Note that the notion of convergence in distribution depends only on the distribution of the random variables involved. One consequence of this is that convergence in distribution does not produce unique limits: if ${X_n}$ converges in distribution to ${X}$, and ${Y}$ has the same distribution as ${X}$, then ${X_n}$ also converges in distribution to ${Y}$. However, limits are unique up to equivalence in distribution (this is a consequence of the Riesz representation theorem, discussed for instance in this blog post). As a consequence of the insensitivity of convergence in distribution to equivalence in distribution, we may also legitimately talk about convergence in distribution of a sequence of random variables ${X_n}$ to another random variable ${X}$ even when all the random variables ${X_1,X_2,\dots}$ and ${X}$ involved are being modeled by different probability spaces (e.g. each ${X_n}$ is modeled by ${\Omega_n}$, and ${X}$ is modeled by ${\Omega}$, with no coupling presumed between these spaces). This is in contrast to the stronger notions of convergence in probability or almost sure convergence, which require all the random variables to be modeled by a common probability space. Also, by an abuse of notation, we can say that a sequence ${X_n}$ of random variables converges in distribution to a probability measure ${\mu}$, when ${\mu_{X_n}}$ converges vaguely to ${\mu}$. Thus we can talk about a sequence of random variables converging in distribution to a uniform distribution, a gaussian distribution, etc.

From the dominated convergence theorem (available for both convergence in probability and almost sure convergence) we see that convergence in probability or almost sure convergence implies convergence in distribution. The converse is not true, due to the insensitivity of convergence in distribution to equivalence in distribution; for instance, if ${X_1,X_2,\dots}$ are iid copies of a non-deterministic scalar random variable ${X}$, then the ${X_n}$ trivially converge in distribution to ${X}$, but will not converge in probability or almost surely (as one can see from the zero-one law). However, there are some partial converses that relate convergence in distribution to convergence in probability; see Exercise 10 below.

Remark 3 The notion of convergence in distribution is somewhat similar to the notion of convergence in the sense of distributions that arises in distribution theory (discussed for instance in this previous blog post), however strictly speaking the two notions of convergence are distinct and should not be confused with each other, despite the very similar names.

The notion of convergence in distribution simplifies in the case of real scalar random variables:

Proposition 4 Let ${X_1,X_2,\dots}$ be a sequence of scalar random variables, and let ${X}$ be another scalar random variable. Then the following are equivalent:

• (i) ${X_n}$ converges in distribution to ${X}$.
• (ii) ${F_{X_n}(t)}$ converges to ${F_X(t)}$ for each continuity point ${t}$ of ${F_X}$ (i.e. for all real numbers ${t \in {\bf R}}$ at which ${F_X}$ is continuous). Here ${F_X(t) := {\bf P}(X \leq t)}$ is the cumulative distribution function of ${X}$.

Proof: First suppose that ${X_n}$ converges in distribution to ${X}$, and ${F_X}$ is continuous at ${t}$. For any ${\varepsilon > 0}$, one can find a ${\delta}$ such that

$\displaystyle F_X(t) - \varepsilon \leq F_X(t') \leq F_X(t) + \varepsilon$

for every ${t' \in [t-\delta,t+\delta]}$. One can also find an ${N}$ larger than ${|t|+\delta}$ such that ${F_X(-N) \leq \varepsilon}$ and ${F_X(N) \geq 1-\varepsilon}$. Thus

$\displaystyle {\bf P} (|X| \geq N ) = O(\varepsilon)$

and

$\displaystyle {\bf P} (|X - t| \leq \delta ) = O(\varepsilon).$

Let ${G: {\bf R} \rightarrow [0,1]}$ be a continuous function supported on ${[-2N, t]}$ that equals ${1}$ on ${[-N, t-\delta]}$. Then by the above discussion we have

$\displaystyle {\bf E} G(X) = F_X(t) + O(\varepsilon)$

and hence

$\displaystyle {\bf E} G(X_n) = F_X(t) + O(\varepsilon)$

for large enough ${n}$. In particular

$\displaystyle {\bf P}( X_n \leq t ) \geq F_X(t) - O(\varepsilon).$

A similar argument, replacing ${G}$ with a continuous function supported on ${[t,2N]}$ that equals ${1}$ on ${[t+\delta,N]}$ gives

$\displaystyle {\bf P}( X_n > t ) \geq 1 - F_X(t) - O(\varepsilon)$

for ${n}$ large enough. Putting the two estimates together gives

$\displaystyle F_{X_n}(t) = F_X(t) + O(\varepsilon)$

for ${n}$ large enough; sending ${\varepsilon \rightarrow 0}$, we obtain the claim.

Conversely, suppose that ${F_{X_n}(t)}$ converges to ${F_X(t)}$ at every continuity point ${t}$ of ${F_X}$. Let ${G: {\bf R} \rightarrow {\bf R}}$ be a continuous compactly supported function; then ${G}$ is uniformly continuous. As ${F_X}$ is monotone increasing, it can only have countably many points of discontinuity. From these two facts one can find, for any ${\varepsilon>0}$, a simple function ${G_\varepsilon(t) = \sum_{i=1}^n c_i 1_{(t_i,t_{i+1}]}}$ for some ${t_1 < \dots < t_{n+1}}$ that are points of continuity of ${F_X}$, and real numbers ${c_i}$, such that ${|G(t) - G_\varepsilon(t)| \leq \varepsilon}$ for all ${t}$. Thus

$\displaystyle {\bf E} G(X_n) = {\bf E} G_\varepsilon(X_n) + O(\varepsilon)$

$\displaystyle = \sum_{i=1}^n c_i(F_{X_n}(t_{i+1}) - F_{X_n}(t_i)) + O(\varepsilon).$

Similarly for ${X_n}$ replaced by ${X}$. Subtracting and taking limit superior, we conclude that

$\displaystyle \limsup_{n \rightarrow \infty} |{\bf E} G(X_n) - {\bf E} G(X)| = O(\varepsilon),$

and on sending ${\varepsilon \rightarrow 0}$, we obtain that ${X_n}$ converges in distribution to ${X}$ as claimed. $\Box$

The restriction to continuity points ${t}$ is necessary. Consider for instance the deterministic random variables ${X_n = 1/n}$; then ${X_n}$ converges almost surely (and hence in distribution) to ${0}$, but ${F_{X_n}(0) = 0}$ does not converge to ${F_X(0)=1}$.

Example 5 For any natural number ${n}$, let ${X_n}$ be a discrete random variable drawn uniformly from the finite set ${\{0/n, 1/n, \dots, (n-1)/n\}}$, and let ${X}$ be the continuous random variable drawn uniformly from ${[0,1]}$. Then ${X_n}$ converges in distribution to ${X}$. Thus we see that a continuous random variable can emerge as the limit of discrete random variables.

Example 6 For any natural number ${n}$, let ${X_n}$ be a continuous random variable drawn uniformly from ${[0,1/n]}$, then ${X_n}$ converges in distribution to the deterministic real number ${0}$. Thus we see that discrete (or even deterministic) random variables can emerge as the limit of continuous random variables.
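Example 5 can be verified by a direct computation, since ${{\bf E} G(X_n) = \frac{1}{n} \sum_{i=0}^{n-1} G(i/n)}$ is just a Riemann sum for ${\int_0^1 G}$. The following sketch (my own illustration; on the bounded interval ${[0,1]}$ the compact support requirement is not an issue, so I simply take the test function ${G(x)=x^2}$) checks that the error decays as ${n}$ grows:

```python
import numpy as np

def G(x):
    return x ** 2  # a bounded continuous test function on [0, 1]

errs = {}
for n in (10, 100, 10000):
    pts = np.arange(n) / n            # the support {0/n, ..., (n-1)/n} of X_n
    EGXn = G(pts).mean()              # E G(X_n): a Riemann sum
    errs[n] = abs(EGXn - 1.0 / 3.0)   # E G(X) = int_0^1 x^2 dx = 1/3
```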

Exercise 7 (Portmanteau theorem) Show that the properties (i) and (ii) in Proposition 4 are also equivalent to the following three statements:

• (iii) One has ${\limsup_{n \rightarrow \infty} {\bf P}( X_n \in K ) \leq {\bf P}(X \in K)}$ for all closed sets ${K \subset {\bf R}}$.
• (iv) One has ${\liminf_{n \rightarrow \infty} {\bf P}( X_n \in U ) \geq {\bf P}(X \in U)}$ for all open sets ${U \subset {\bf R}}$.
• (v) For any Borel set ${E \subset {\bf R}}$ whose topological boundary ${\partial E}$ is such that ${{\bf P}(X \in \partial E) = 0}$, one has ${\lim_{n \rightarrow \infty} {\bf P}(X_n \in E) = {\bf P}(X \in E)}$.

(Note: to prove this theorem, you may wish to invoke Urysohn’s lemma. To deduce (iii) from (i), you may wish to start with the case of compact ${K}$.)

We can now state the famous central limit theorem:

Theorem 8 (Central limit theorem) Let ${X_1,X_2,\dots}$ be iid copies of a scalar random variable ${X}$ of finite mean ${\mu := {\bf E} X}$ and finite non-zero variance ${\sigma^2 := {\bf Var}(X)}$. Let ${S_n := X_1 + \dots + X_n}$. Then the random variables ${\frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu)}$ converge in distribution to a random variable with the standard normal distribution ${N(0,1)}$ (that is to say, a random variable with probability density function ${x \mapsto \frac{1}{\sqrt{2\pi}} e^{-x^2/2}}$). Thus, by abuse of notation

$\displaystyle \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \rightarrow N(0,1).$

In the normalised case ${\mu=0, \sigma^2=1}$ when ${X}$ has mean zero and unit variance, this simplifies to

$\displaystyle \frac{S_n}{\sqrt{n}} \rightarrow N(0,1).$

Using Proposition 4 (and the fact that the cumulative distribution function associated to ${N(0,1)}$ is continuous), the central limit theorem is equivalent to asserting that

$\displaystyle {\bf P}( \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq t ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx$

as ${n \rightarrow \infty}$ for any ${t \in {\bf R}}$, or equivalently that

$\displaystyle {\bf P}( a \leq \frac{\sqrt{n}}{\sigma} (\frac{S_n}{n} - \mu) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx.$

Informally, one can think of the central limit theorem as asserting that ${S_n}$ approximately behaves like it has distribution ${N( n \mu, n \sigma^2 )}$ for large ${n}$, where ${N(\mu,\sigma^2)}$ is the normal distribution with mean ${\mu}$ and variance ${\sigma^2}$, that is to say the distribution with probability density function ${x \mapsto \frac{1}{\sqrt{2\pi} \sigma} e^{-(x-\mu)^2/2\sigma^2}}$. The integrals ${\frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\ dx}$ can be written in terms of the error function ${\hbox{erf}}$ as ${\frac{1}{2} + \frac{1}{2} \hbox{erf}(t/\sqrt{2})}$.
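The formulation via cumulative distribution functions lends itself to a quick numerical check. The sketch below (my own illustration, not from the notes) takes a deliberately non-gaussian ${X}$, namely the uniform distribution on ${[-\sqrt{3},\sqrt{3}]}$ (which has mean zero and unit variance), and compares the empirical distribution of ${S_n/\sqrt{n}}$ against the normal CDF written via ${\hbox{erf}}$ as above:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
trials, n = 5000, 1000

# Uniform on [-sqrt(3), sqrt(3)]: mean 0, variance 1, and not gaussian
X = rng.uniform(-math.sqrt(3.0), math.sqrt(3.0), size=(trials, n))
Z = X.sum(axis=1) / math.sqrt(n)     # samples of S_n / sqrt(n)

def Phi(t):
    # standard normal CDF, written via erf as in the text
    return 0.5 + 0.5 * math.erf(t / math.sqrt(2.0))

# largest discrepancy between the empirical CDF of Z and Phi on [-3, 3]
max_gap = max(abs((Z <= t).mean() - Phi(t)) for t in np.linspace(-3, 3, 61))
```

Despite ${X}$ being uniform rather than gaussian, the gap is small, which is the universality phenomenon in action.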

The central limit theorem is a basic example of the universality phenomenon in probability – many statistics involving a large system of many independent (or weakly dependent) variables (such as the normalised sums ${\frac{\sqrt{n}}{\sigma}(\frac{S_n}{n}-\mu)}$) end up having a universal asymptotic limit (in this case, the normal distribution), regardless of the precise makeup of the underlying random variable ${X}$ that comprised that system. Indeed, the universality of the normal distribution is such that it arises in many other contexts than the fluctuation of iid random variables; the central limit theorem is merely the first place in probability theory where it makes a prominent appearance.

We will give several proofs of the central limit theorem in these notes; each of these proofs has its advantages and disadvantages, and each can be extended to prove many further results beyond the central limit theorem. We first give Lindeberg’s proof of the central limit theorem, based on exchanging (or swapping) each component ${X_1,\dots,X_n}$ of the sum ${S_n}$ in turn. This proof gives an accessible explanation as to why there should be a universal limit for the central limit theorem; one then computes directly with gaussians to verify that it is the normal distribution which is the universal limit. Our second proof is the most popular one taught in probability texts, namely the Fourier-analytic proof based around the concept of the characteristic function ${t \mapsto {\bf E} e^{itX}}$ of a real random variable ${X}$. Thanks to the powerful identities and other results of Fourier analysis, this gives a quite short and direct proof of the central limit theorem, although the arguments may seem rather magical to readers who are not already familiar with Fourier methods. Finally, we give a proof based on the moment method, in the spirit of the arguments in the previous notes; this argument is more combinatorial, but is straightforward and is particularly robust, in particular being well equipped to handle some dependencies between components; we will illustrate this by proving the Erdos-Kac law in number theory by this method. Some further discussion of the central limit theorem (including some further proofs, such as one based on Stein’s method) can be found in this blog post. Some further variants of the central limit theorem, such as local limit theorems, stable laws, and large deviation inequalities, will be discussed in the next (and final) set of notes.

The following exercise illustrates the power of the central limit theorem, by establishing combinatorial estimates which would otherwise require the use of Stirling’s formula to establish.

Exercise 9 (De Moivre-Laplace theorem) Let ${X}$ be a Bernoulli random variable, taking values in ${\{0,1\}}$ with ${{\bf P}(X=0)={\bf P}(X=1)=1/2}$, thus ${X}$ has mean ${1/2}$ and variance ${1/4}$. Let ${X_1,X_2,\dots}$ be iid copies of ${X}$, and write ${S_n := X_1+\dots+X_n}$.

• (i) Show that ${S_n}$ takes values in ${\{0,\dots,n\}}$ with ${{\bf P}(S_n=i) = \frac{1}{2^n} \binom{n}{i}}$. (This is an example of a binomial distribution.)
• (ii) Assume Stirling’s formula

$\displaystyle n! = (1+o(1)) \sqrt{2\pi n} n^n e^{-n} \ \ \ \ \ (1)$

where ${o(1)}$ is a function of ${n}$ that goes to zero as ${n \rightarrow \infty}$. (A proof of this formula may be found in this previous blog post.) Using this formula, and without using the central limit theorem, show that

$\displaystyle {\bf P}( a \leq 2\sqrt{n} (\frac{S_n}{n} - \frac{1}{2}) \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_{a}^b e^{-x^2/2}\ dx$

as ${n \rightarrow \infty}$ for any fixed real numbers ${a < b}$.

The above special case of the central limit theorem was first established by de Moivre and Laplace.
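Since the binomial probabilities in Exercise 9(i) are exactly computable, the de Moivre-Laplace approximation can be checked without any sampling. The sketch below (my own illustration) sums ${\frac{1}{2^n}\binom{n}{i}}$ over the appropriate range of ${i}$ and compares against the gaussian integral, for the window ${-1 \leq z \leq 1}$:

```python
import math

def Phi(t):
    # standard normal CDF via the error function
    return 0.5 + 0.5 * math.erf(t / math.sqrt(2.0))

def binom_prob(n, a, b):
    # exact P( a <= 2*sqrt(n)*(S_n/n - 1/2) <= b ) for S_n ~ Binomial(n, 1/2)
    total = 0
    for i in range(n + 1):
        z = 2.0 * math.sqrt(n) * (i / n - 0.5)
        if a <= z <= b:
            total += math.comb(n, i)
    return total / 2 ** n

p_exact = binom_prob(2000, -1.0, 1.0)
p_gauss = Phi(1.0) - Phi(-1.0)
```

The remaining discrepancy is a discretisation effect of size ${O(1/\sqrt{n})}$, reflecting the spacing ${2/\sqrt{n}}$ of the normalised binomial values.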

We close this section with some basic facts about convergence of distribution that will be useful in the sequel.

Exercise 10 Let ${X_1,X_2,\dots}$, ${Y_1,Y_2,\dots}$ be sequences of real random variables, and let ${X,Y}$ be further real random variables.

• (i) If ${X}$ is deterministic, show that ${X_n}$ converges in distribution to ${X}$ if and only if ${X_n}$ converges in probability to ${X}$.
• (ii) Suppose that ${X_n}$ is independent of ${Y_n}$ for each ${n}$, and ${X}$ independent of ${Y}$. Show that ${X_n+iY_n}$ converges in distribution to ${X+iY}$ if and only if ${X_n}$ converges in distribution to ${X}$ and ${Y_n}$ converges in distribution to ${Y}$. (The shortest way to prove this is by invoking the Stone-Weierstrass theorem, but one can also proceed by proving some version of Proposition 4.) What happens if the independence hypothesis is dropped?
• (iii) If ${X_n}$ converges in distribution to ${X}$, show that for every ${\varepsilon>0}$ there exists ${K>0}$ such that ${{\bf P}( |X_n| \geq K ) < \varepsilon}$ for all sufficiently large ${n}$. (That is to say, ${X_n}$ is a tight sequence of random variables.)
• (iv) Show that ${X_n}$ converges in distribution to ${X}$ if and only if, after extending the probability space model if necessary, one can find copies ${Z_1,Z_2,\dots}$ and ${Z}$ of ${X_1,X_2,\dots}$ and ${X}$ respectively such that ${Z_n}$ converges almost surely to ${Z}$. (Hint: use the Skorohod representation, Exercise 29 of Notes 0.)
• (v) If ${X_1,X_2,\dots}$ converges in distribution to ${X}$, and ${F: {\bf R} \rightarrow {\bf R}}$ is continuous, show that ${F(X_1),F(X_2),\dots}$ converges in distribution to ${F(X)}$. Generalise this claim to the case when ${X}$ takes values in an arbitrary locally compact Hausdorff space.
• (vi) (Slutsky’s theorem) If ${X_n}$ converges in distribution to ${X}$, and ${Y_n}$ converges in probability to a deterministic limit ${Y}$, show that ${X_n+Y_n}$ converges in distribution to ${X+Y}$, and ${X_n Y_n}$ converges in distribution to ${XY}$. (Hint: either use (iv), or else use (iii) to control some error terms.) This statement combines particularly well with (i). What happens if ${Y}$ is not assumed to be deterministic?
• (vii) (Fatou lemma) If ${G: {\bf R} \rightarrow [0,+\infty)}$ is continuous, and ${X_n}$ converges in distribution to ${X}$, show that ${\liminf_{n \rightarrow \infty} {\bf E} G(X_n) \geq {\bf E} G(X)}$.
• (viii) (Bounded convergence) If ${G: {\bf R} \rightarrow {\bf R}}$ is continuous and bounded, and ${X_n}$ converges in distribution to ${X}$, show that ${\lim_{n \rightarrow \infty} {\bf E} G(X_n) = {\bf E} G(X)}$.
• (ix) (Dominated convergence) If ${X_n}$ converges in distribution to ${X}$, and there is an absolutely integrable ${Y}$ such that ${|X_n| \leq Y}$ almost surely for all ${n}$, show that ${\lim_{n \rightarrow \infty} {\bf E} X_n = {\bf E} X}$.
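A classical application of Slutsky’s theorem (part (vi) above, combined with the central limit theorem) is studentisation: ${\sqrt{n}(\overline{X}_n - \mu)/\sigma}$ converges in distribution to ${N(0,1)}$, and the sample standard deviation converges in probability to ${\sigma}$, so the ratio with the sample standard deviation converges in distribution to ${N(0,1)}$ as well. A numerical sketch (my own illustration, using uniform variables):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
trials, n = 4000, 500

# Uniform [0,1] samples: mean 1/2, unknown-to-the-statistic variance
X = rng.uniform(0.0, 1.0, size=(trials, n))
# studentised mean: divide by the *sample* standard deviation, not sigma
T = math.sqrt(n) * (X.mean(axis=1) - 0.5) / X.std(axis=1, ddof=1)

Phi = lambda t: 0.5 + 0.5 * math.erf(t / math.sqrt(2.0))
gap = max(abs((T <= t).mean() - Phi(t)) for t in np.linspace(-2.5, 2.5, 51))
```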

For future reference we also mention (but will not prove) Prokhorov’s theorem that gives a partial converse to part (iii) of the above exercise:

Theorem 11 (Prokhorov’s theorem) Let ${X_1,X_2,\dots}$ be a sequence of real random variables which is tight (that is, for every ${\varepsilon>0}$ there exists ${K>0}$ such that ${{\bf P}(|X_n| \geq K) < \varepsilon}$ for all sufficiently large ${n}$). Then there exists a subsequence ${X_{n_j}}$ which converges in distribution to some random variable ${X}$ (which may possibly be modeled by a different probability space model than the ${X_1,X_2,\dots}$).

The proof of this theorem relies on the Riesz representation theorem, and is beyond the scope of this course; but see for instance Exercise 29 of this previous blog post. (See also the closely related Helly selection theorem, covered in Exercise 30 of the same post.)

Suppose we have a large number of scalar random variables ${X_1,\ldots,X_n}$, which each have bounded size on average (e.g. their mean and variance could be ${O(1)}$). What can one then say about their sum ${S_n := X_1+\ldots+X_n}$? If each individual summand ${X_i}$ varies in an interval of size ${O(1)}$, then their sum of course varies in an interval of size ${O(n)}$. However, a remarkable phenomenon, known as concentration of measure, asserts that assuming a sufficient amount of independence between the component variables ${X_1,\ldots,X_n}$, this sum sharply concentrates in a much narrower range, typically in an interval of size ${O(\sqrt{n})}$. This phenomenon is quantified by a variety of large deviation inequalities that give upper bounds (often exponential in nature) on the probability that such a combined random variable deviates significantly from its mean. The same phenomenon applies not only to linear expressions such as ${S_n = X_1+\ldots+X_n}$, but more generally to nonlinear combinations ${F(X_1,\ldots,X_n)}$ of such variables, provided that the nonlinear function ${F}$ is sufficiently regular (in particular, if it is Lipschitz, either separately in each variable, or jointly in all variables).

The basic intuition here is that it is difficult for a large number of independent variables ${X_1,\ldots,X_n}$ to “work together” to simultaneously pull a sum ${X_1+\ldots+X_n}$ or a more general combination ${F(X_1,\ldots,X_n)}$ too far away from its mean. Independence here is the key; concentration of measure results typically fail if the ${X_i}$ are too highly correlated with each other.

There are many applications of the concentration of measure phenomenon, but we will focus on a specific application which is useful in the random matrix theory topics we will be studying, namely on controlling the behaviour of random ${n}$-dimensional vectors with independent components, and in particular on the distance between such random vectors and a given subspace.

Once one has a sufficient amount of independence, the concentration of measure tends to be sub-gaussian in nature; thus the probability that one is at least ${\lambda}$ standard deviations from the mean tends to drop off like ${C \exp(-c\lambda^2)}$ for some ${C,c > 0}$. In particular, one is ${O( \log^{1/2} n )}$ standard deviations from the mean with high probability, and ${O( \log^{1/2+\epsilon} n)}$ standard deviations from the mean with overwhelming probability. Indeed, concentration of measure is our primary tool for ensuring that various events hold with overwhelming probability (other moment methods can give high probability, but have difficulty ensuring overwhelming probability).
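For bounded summands, a concrete instance of this sub-gaussian behaviour is Hoeffding’s inequality: for iid ${\pm 1}$ signs, ${{\bf P}(|S_n/\sqrt{n}| \geq \lambda) \leq 2 e^{-\lambda^2/2}}$. The sketch below (my own illustration) compares empirical tail probabilities against this bound:

```python
import math
import numpy as np

rng = np.random.default_rng(4)
trials, n = 20000, 400
# normalised sums of iid +-1 signs
Z = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1) / math.sqrt(n)

# Hoeffding: P(|S_n/sqrt(n)| >= lam) <= 2*exp(-lam^2/2)
lams = (1.0, 2.0, 3.0)
emp = {lam: float((np.abs(Z) >= lam).mean()) for lam in lams}
bound = {lam: 2.0 * math.exp(-lam ** 2 / 2.0) for lam in lams}
```

The empirical tails sit below the bound and visibly decay at a gaussian-type rate as ${\lambda}$ increases.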

This is only a brief introduction to the concentration of measure phenomenon. A systematic study of this topic can be found in this book by Ledoux.
