We can now turn attention to one of the centerpiece universality results in random matrix theory, namely the Wigner semi-circle law for Wigner matrices. Recall from previous notes that a Wigner Hermitian matrix ensemble is a random matrix ensemble $M_n = (\xi_{ij})_{1 \leq i,j \leq n}$ of Hermitian matrices (thus $\xi_{ij} = \overline{\xi_{ji}}$; this includes real symmetric matrices as an important special case), in which the upper-triangular entries $\xi_{ij}$, $i > j$ are iid complex random variables with mean zero and unit variance, and the diagonal entries $\xi_{ii}$ are iid real variables, independent of the upper-triangular entries, with bounded mean and variance. Particular special cases of interest include the Gaussian Orthogonal Ensemble (GOE), the symmetric random sign matrices (aka symmetric Bernoulli ensemble), and the Gaussian Unitary Ensemble (GUE).
In previous notes we saw that the operator norm of $M_n$ was typically of size $O(\sqrt{n})$, so it is natural to work with the normalised matrix $\frac{1}{\sqrt{n}} M_n$. Accordingly, given any $n \times n$ Hermitian matrix $M_n$, we can form the (normalised) empirical spectral distribution (or ESD for short)

$\displaystyle \mu_{\frac{1}{\sqrt{n}} M_n} := \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(M_n)/\sqrt{n}}$

of $M_n$, where $\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)$ are the (necessarily real) eigenvalues of $M_n$, counting multiplicity. The ESD is a probability measure, which can be viewed as a distribution of the normalised eigenvalues of $M_n$.
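To make the definition concrete, here is a minimal numerical sketch (my own addition, not part of the notes) of how one might sample a GUE-type Wigner matrix and form its normalised ESD; the helper names are ad hoc.

```python
# Hedged sketch: sample a GUE-type Wigner matrix and form its normalised ESD.
import numpy as np

def wigner_hermitian(n, rng):
    # iid complex entries with mean zero and unit variance, then Hermitise
    A = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    return (A + A.conj().T) / np.sqrt(2)

rng = np.random.default_rng(0)
n = 1000
M = wigner_hermitian(n, rng)
normalised_evals = np.linalg.eigvalsh(M) / np.sqrt(n)  # support of mu_{M_n/sqrt(n)}
esd_cdf = lambda t: np.mean(normalised_evals < t)      # each eigenvalue carries mass 1/n
print(esd_cdf(0.0))  # close to 1/2, by the symmetry of the ensemble
```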
When $M_n$ is a random matrix ensemble, then the ESD $\mu_{\frac{1}{\sqrt{n}} M_n}$ is now a random measure – i.e. a random variable taking values in the space $\Pr({\bf R})$ of probability measures on the real line. (Thus, the distribution of $\mu_{\frac{1}{\sqrt{n}} M_n}$ is a probability measure on probability measures!)
Now we consider the behaviour of the ESD of a sequence of Hermitian matrix ensembles as $n \rightarrow \infty$. Recall from Notes 0 that for any sequence of random variables in a $\sigma$-compact metrisable space, one can define notions of convergence in probability and convergence almost surely. Specialising these definitions to the case of random probability measures on ${\bf R}$, and to deterministic limits, we see that a sequence of random ESDs $\mu_{\frac{1}{\sqrt{n}} M_n}$ converge in probability (resp. converge almost surely) to a deterministic limit $\mu \in \Pr({\bf R})$ (which, confusingly enough, is a deterministic probability measure!) if, for every test function $\varphi \in C_c({\bf R})$, the quantities $\int_{\bf R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n}$ converge in probability (resp. converge almost surely) to $\int_{\bf R} \varphi\ d\mu$.
Remark 1 As usual, convergence almost surely implies convergence in probability, but not vice versa. In the special case of random probability measures, there is an even weaker notion of convergence, namely convergence in expectation, defined as follows. Given a random ESD $\mu_{\frac{1}{\sqrt{n}} M_n}$, one can form its expectation ${\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} \in \Pr({\bf R})$, defined via duality (the Riesz representation theorem) as

$\displaystyle \int_{\bf R} \varphi\ d{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} := {\bf E} \int_{\bf R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n};$

this probability measure can be viewed as the law of a random eigenvalue $\lambda_i(M_n)/\sqrt{n}$ drawn from a random matrix $M_n$ from the ensemble. We then say that the ESDs converge in expectation to a limit $\mu \in \Pr({\bf R})$ if ${\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}$ converges in the vague topology to $\mu$, thus

$\displaystyle {\bf E} \int_{\bf R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n} \rightarrow \int_{\bf R} \varphi\ d\mu$

for all $\varphi \in C_c({\bf R})$.
In general, these notions of convergence are distinct from each other; but in practice, one often finds in random matrix theory that these notions are effectively equivalent to each other, thanks to the concentration of measure phenomenon.
Exercise 1 Let $M_n$ be a sequence of $n \times n$ Hermitian matrix ensembles, and let $\mu$ be a continuous probability measure on ${\bf R}$.
- Show that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges almost surely to $\mu$ if and only if $\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)$ converges almost surely to $\mu(-\infty,\lambda)$ for all $\lambda \in {\bf R}$.
- Show that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges in probability to $\mu$ if and only if $\mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)$ converges in probability to $\mu(-\infty,\lambda)$ for all $\lambda \in {\bf R}$.
- Show that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges in expectation to $\mu$ if and only if ${\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda)$ converges to $\mu(-\infty,\lambda)$ for all $\lambda \in {\bf R}$.
We can now state the Wigner semi-circular law.
Theorem 1 (Semicircular law) Let $M_n$ be the top left $n \times n$ minors of an infinite Wigner matrix $(\xi_{ij})_{i,j \geq 1}$. Then the ESDs $\mu_{\frac{1}{\sqrt{n}} M_n}$ converge almost surely (and hence also in probability and in expectation) to the Wigner semi-circular distribution

$\displaystyle \mu_{sc} := \frac{1}{2\pi} (4 - |x|^2)_+^{1/2}\ dx. \ \ \ \ \ (1)$

A numerical example of this theorem in action can be seen at the MathWorld entry for this law.
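For readers who want to reproduce such a numerical example themselves, here is a hedged sketch (my addition; plotting is left out) comparing the ESD of a symmetric Bernoulli ensemble against the density in (1):

```python
# Hedged illustration of Theorem 1: the ESD of a symmetric Bernoulli
# (random sign) Wigner matrix against the semi-circular density (1).
import numpy as np

rng = np.random.default_rng(1)
n = 2000
S = np.sign(rng.standard_normal((n, n)))  # iid signs
M = np.triu(S, 1); M = M + M.T            # symmetric Bernoulli, zero diagonal
evals = np.linalg.eigvalsh(M) / np.sqrt(n)

hist, edges = np.histogram(evals, bins=50, range=(-2.5, 2.5), density=True)
mid = (edges[:-1] + edges[1:]) / 2
semicircle = np.where(np.abs(mid) < 2, np.sqrt(np.maximum(4 - mid**2, 0)) / (2 * np.pi), 0.0)
print(np.max(np.abs(hist - semicircle)))  # small, and shrinking as n grows
```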
The semi-circular law nicely complements the upper Bai-Yin theorem from Notes 3, which asserts that (in the case when the entries have finite fourth moment, at least) the matrices $\frac{1}{\sqrt{n}} M_n$ almost surely have operator norm at most $2+o(1)$. Note that the operator norm is the same thing as the largest magnitude of the eigenvalues. Because the semi-circular distribution (1) is supported on the interval $[-2,2]$ with positive density on the interior of this interval, Theorem 1 easily supplies the lower Bai-Yin theorem, that the operator norm of $\frac{1}{\sqrt{n}} M_n$ is almost surely at least $2-o(1)$, and thus (in the finite fourth moment case) the norm is in fact equal to $2+o(1)$. Indeed, we have just shown that the semi-circular law provides an alternate proof of the lower Bai-Yin bound (Proposition 11 of Notes 3).
As will hopefully become clearer in the next set of notes, the semi-circular law is the noncommutative (or free probability) analogue of the central limit theorem, with the semi-circular distribution (1) taking on the role of the normal distribution. Of course, there is a striking difference between the two distributions, in that the former is compactly supported while the latter is merely subgaussian. One reason for this is that the concentration of measure phenomenon is more powerful in the case of ESDs of Wigner matrices than it is for averages of iid variables; compare the concentration of measure results in Notes 3 with those in Notes 1.
There are several ways to prove (or at least to heuristically justify) the semi-circular law. In this set of notes we shall focus on the two most popular methods, the moment method and the Stieltjes transform method, together with a third (heuristic) method based on Dyson Brownian motion (Notes 3b). In the next set of notes we shall also study the free probability method, and in the set of notes after that we use the determinantal processes method (although this method is initially only restricted to highly symmetric ensembles, such as GUE).
— 1. Preliminary reductions —
Before we begin any of the proofs of the semi-circular law, we make some simple observations which will reduce the difficulty of the arguments in the sequel.
The first observation is that the Cauchy interlacing law (Exercise 14 from Notes 3a) shows that the ESD of $\frac{1}{\sqrt{n}} M_n$ is very stable in $n$. Indeed, we see from the interlacing law that

$\displaystyle n \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda/\sqrt{n}) - (n-m) \leq m \mu_{\frac{1}{\sqrt{m}} M_m}(-\infty, \lambda/\sqrt{m}) \leq n \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty, \lambda/\sqrt{n})$

for any threshold $\lambda$ and any $n \geq m \geq 1$.
Exercise 2 Using this observation, show that to establish the semi-circular law (in any of the three senses of convergence), it suffices to do so for an arbitrary lacunary sequence $n_1 < n_2 < n_3 < \ldots$ of $n$ (thus $n_{j+1}/n_j \geq c$ for some $c > 1$ and all $j$).
The above lacunary reduction does not help one establish convergence in probability or expectation, but will be useful when establishing almost sure convergence, as it significantly reduces the inefficiency of the union bound. (Note that a similar lacunary reduction was also used to prove the strong law of large numbers in Notes 1.)
Next, we exploit the stability of the ESD with respect to perturbations, by taking advantage of the Wielandt-Hoffman inequality

$\displaystyle \sum_{j=1}^n |\lambda_j(A+B) - \lambda_j(A)|^2 \leq \|B\|_F^2 \ \ \ \ \ (2)$

for Hermitian matrices $A, B$, where $\|B\|_F := (\hbox{tr}(B^2))^{1/2}$ is the Frobenius norm of $B$. (This inequality was established in Exercise 6 or Exercise 11 of Notes 3a.) We convert this inequality into an inequality about ESDs:
Lemma 2 For any $n \times n$ Hermitian matrices $A, B$, any $\lambda \in {\bf R}$, and any $\varepsilon > 0$, we have

$\displaystyle \mu_{\frac{1}{\sqrt{n}}(A+B)}(-\infty, \lambda) \leq \mu_{\frac{1}{\sqrt{n}} A}(-\infty, \lambda+\varepsilon) + \frac{1}{\varepsilon^2 n^2} \|B\|_F^2$

and similarly

$\displaystyle \mu_{\frac{1}{\sqrt{n}}(A+B)}(-\infty, \lambda) \geq \mu_{\frac{1}{\sqrt{n}} A}(-\infty, \lambda-\varepsilon) - \frac{1}{\varepsilon^2 n^2} \|B\|_F^2.$
Proof: We just prove the first inequality, as the second is similar (and also follows from the first, by reversing the signs of $A$ and $B$).

Let $i$ be the number of eigenvalues of $\frac{1}{\sqrt{n}}(A+B)$ less than $\lambda$, and let $j$ be the number of eigenvalues of $\frac{1}{\sqrt{n}} A$ less than $\lambda+\varepsilon$. Our task is to show that

$\displaystyle \frac{i}{n} \leq \frac{j}{n} + \frac{1}{\varepsilon^2 n^2} \|B\|_F^2.$

If $i \leq j$ then we are clearly done, so suppose that $i > j$. Then we have $|\lambda_k(\frac{1}{\sqrt{n}}(A+B)) - \lambda_k(\frac{1}{\sqrt{n}} A)| \geq \varepsilon$ for all $j < k \leq i$ (since $\lambda_k(\frac{1}{\sqrt{n}}(A+B)) < \lambda$ while $\lambda_k(\frac{1}{\sqrt{n}} A) \geq \lambda+\varepsilon$ for such $k$), and hence

$\displaystyle \sum_{k=1}^n |\lambda_k(\frac{1}{\sqrt{n}}(A+B)) - \lambda_k(\frac{1}{\sqrt{n}} A)|^2 \geq (i-j) \varepsilon^2.$

The claim now follows from (2). $\Box$
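As a quick sanity check of the inequality (2) that drives this lemma, one can test it numerically on random Hermitian matrices; the following sketch is my addition.

```python
# Hedged numerical check of the Wielandt-Hoffman inequality (2).
import numpy as np

rng = np.random.default_rng(2)
n = 200
def random_hermitian(n):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (A + A.conj().T) / 2

A, B = random_hermitian(n), random_hermitian(n)
# eigvalsh returns eigenvalues in increasing order, as (2) requires
lhs = np.sum((np.linalg.eigvalsh(A + B) - np.linalg.eigvalsh(A))**2)
rhs = np.linalg.norm(B, 'fro')**2      # equals tr(B^2) for Hermitian B
print(lhs <= rhs)                      # True
```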
This has the following corollary:
Exercise 3 (Stability of ESD laws wrt small perturbations) Let $M_n$ be a sequence of random Hermitian matrix ensembles such that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges almost surely to a limit $\mu$. Let $N_n$ be another sequence of Hermitian random matrix ensembles such that $\frac{1}{n^2} \|N_n\|_F^2$ converges almost surely to zero. Show that $\mu_{\frac{1}{\sqrt{n}}(M_n+N_n)}$ converges almost surely to $\mu$.

Show that the same claim holds if “almost surely” is replaced by “in probability” or “in expectation” throughout.
Informally, this exercise allows us to discard any portion of the matrix which is $o(n)$ in the Frobenius norm. For instance, the diagonal entries of $M_n$ have a Frobenius norm of $O(\sqrt{n})$ almost surely, by the strong law of large numbers. Hence, without loss of generality, we may set the diagonal equal to zero for the purposes of the semi-circular law.
One can also remove any component of $M_n$ that is of rank $o(n)$:
Exercise 4 (Stability of ESD laws wrt small rank perturbations) Let $M_n$ be a sequence of random Hermitian matrix ensembles such that $\mu_{\frac{1}{\sqrt{n}} M_n}$ converges almost surely to a limit $\mu$. Let $N_n$ be another sequence of random Hermitian matrix ensembles such that $\frac{1}{n} \hbox{rank}(N_n)$ converges almost surely to zero. Show that $\mu_{\frac{1}{\sqrt{n}}(M_n+N_n)}$ converges almost surely to $\mu$. (Hint: use the Weyl inequalities instead of the Wielandt-Hoffman inequality.)

Show that the same claim holds if “almost surely” is replaced by “in probability” or “in expectation” throughout.
In a similar vein, we may apply the truncation argument (much as was done for the central limit theorem in Notes 2) to reduce the semi-circular law to the bounded case:
Exercise 5 Show that in order to prove the semi-circular law (in the almost sure sense), it suffices to do so under the additional hypothesis that the random variables $\xi_{ij}$ are bounded. Similarly for the convergence in probability or in expectation senses.
Remark 2 These facts ultimately rely on the stability of eigenvalues with respect to perturbations. This stability is automatic in the Hermitian case, but for non-symmetric matrices, serious instabilities can occur due to the presence of pseudospectrum. We will discuss this phenomenon more in later lectures (but see also this earlier blog post).
— 2. The moment method —
We now prove the semi-circular law via the method of moments, which we have already used several times in the previous notes. In order to use this method, it is convenient to use the preceding reductions to assume that the coefficients $\xi_{ij}$ are bounded, the diagonal vanishes, and that $n$ ranges over a lacunary sequence. We will implicitly assume these hypotheses throughout the rest of the section.
As we have already discussed the moment method extensively, much of the argument here will be delegated to exercises. A full treatment of these computations can be found in the book of Bai and Silverstein.
The basic starting point is the observation that the moments of the ESD can be written as normalised traces of powers of $M_n$:

$\displaystyle \int_{\bf R} x^k\ d\mu_{\frac{1}{\sqrt{n}} M_n}(x) = \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k. \ \ \ \ \ (3)$

In particular, on taking expectations, we have

$\displaystyle \int_{\bf R} x^k\ d{\bf E}\mu_{\frac{1}{\sqrt{n}} M_n}(x) = {\bf E} \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k.$
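The identity (3) is easy to verify numerically; the following sketch (my addition) also previews the Catalan number limit (7) that appears below.

```python
# Hedged numerical check of the trace identity (3); the even moments
# approach the Catalan numbers C_{k/2} = 1, 2, 5, ... of (7) below.
import numpy as np

rng = np.random.default_rng(3)
n = 1500
S = np.triu(np.sign(rng.standard_normal((n, n))), 1)
W = (S + S.T) / np.sqrt(n)                 # normalised Wigner matrix, zero diagonal
evals = np.linalg.eigvalsh(W)
for k in (2, 4, 6):
    lhs = np.mean(evals**k)                             # int x^k d mu
    rhs = np.trace(np.linalg.matrix_power(W, k)) / n    # (1/n) tr W^k
    print(k, round(lhs, 3), round(rhs, 3))              # equal; near 1, 2, 5
```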
From concentration of measure (and the Bai-Yin theorem) for the operator norm of a random matrix (Proposition 7 of Notes 3), we see that the ${\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}$ are uniformly subgaussian, indeed we have

$\displaystyle {\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}( \{ |x| \geq \lambda \} ) \leq C e^{-c \lambda^2 n}$

for $\lambda > C$, where $C, c > 0$ are absolute (so the decay in fact improves quite rapidly with $n$). From this and the moment continuity theorem (Theorem 4 of Notes 2), we can now establish the semi-circular law through computing the mean and variance of moments:
Exercise 6
- Show that to prove convergence in expectation to the semi-circular law, it suffices to show that

$\displaystyle {\bf E} \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k = \int_{\bf R} x^k\ d\mu_{sc}(x) + o_k(1) \ \ \ \ \ (4)$

for $k = 1, 2, \ldots$, where $o_k(1)$ is an expression that goes to zero as $n \rightarrow \infty$ for fixed $k$ (and fixed choice of coefficient distribution $\xi$).
- Show that to prove convergence in probability to the semi-circular law, it suffices to show (4) together with the variance bound

$\displaystyle {\bf Var}( \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k ) = o_k(1) \ \ \ \ \ (5)$

for $k = 1, 2, \ldots$.
- Show that to prove almost sure convergence to the semi-circular law, it suffices to show (4) together with the variance bound

$\displaystyle {\bf Var}( \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k ) = O_k( n^{-c_k} ) \ \ \ \ \ (6)$

for $k = 1, 2, \ldots$ and some $c_k > 0$. (Note here that it is useful to restrict $n$ to a lacunary sequence!)
Ordinarily, computing second-moment quantities such as the left-hand side of (5) is harder than computing first-moment quantities such as (4). But one can obtain the required variance bounds from concentration of measure:
Exercise 7
- When $k$ is a positive even integer, use Talagrand’s inequality and convexity of the Schatten norm $\|M\|_{S^k} := (\hbox{tr}(M^k))^{1/k}$ to establish (6) (and hence (5)).
- For $k$ odd, the formula $\|M\|_{S^k} = (\hbox{tr}(M^k))^{1/k}$ still applies as long as $M$ is positive definite. Applying this observation, the Bai-Yin theorem, and Talagrand’s inequality to the $S^k$ norms of $\frac{1}{\sqrt{n}} M_n + cI$ for a constant $c > 2$, establish (6) (and hence (5)) when $k$ is odd also.
Remark 3 More generally, concentration of measure results (such as Talagrand’s inequality) can often be used to automatically upgrade convergence in expectation to convergence in probability or almost sure convergence. We will not attempt to formalise this principle here.
It is not difficult to establish (6), (5) through the moment method as well. Indeed, recall from Theorem 10 of Notes 3 that we have the expected moment

$\displaystyle {\bf E} \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k = C_{k/2} + o_k(1) \ \ \ \ \ (7)$

for all $k$, where the Catalan number $C_{k/2}$ is zero when $k$ is odd, and is equal to

$\displaystyle C_{k/2} := \frac{k!}{(k/2+1)! (k/2)!}$

when $k$ is even.

Exercise 8 By modifying the proof of that theorem, show that

$\displaystyle {\bf E} ( \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k )^2 = C_{k/2}^2 + o_k(1)$

and deduce (5). By refining the error analysis (e.g. using Theorem 12 of Notes 3), also establish (6).
In view of the above computations, the establishment of the semi-circular law now reduces to computing the moments of the semi-circular distribution:
Exercise 9 Show that for any $k = 0, 1, 2, \ldots$, one has

$\displaystyle \int_{\bf R} x^k\ d\mu_{sc}(x) = C_{k/2}.$

(Hint: use a trigonometric substitution $x = 2\cos\theta$, and then express the integrand in terms of Fourier phases $e^{ik\theta}$.)
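One can sanity-check the identity in Exercise 9 numerically; this sketch is my addition and is of course no substitute for the trigonometric computation:

```python
# Hedged numerical check of Exercise 9: moments of mu_sc versus the
# Catalan numbers C_{k/2} (zero for odd k).
import numpy as np
from math import factorial

x = np.linspace(-2, 2, 400001)
rho = np.sqrt(4 - x**2) / (2 * np.pi)          # the density in (1)
dx = x[1] - x[0]
for k in range(9):
    moment = np.sum(x**k * rho) * dx           # simple Riemann sum
    m = k // 2
    catalan = factorial(2*m) // (factorial(m+1) * factorial(m)) if k % 2 == 0 else 0
    print(k, round(moment, 4), catalan)        # odd moments vanish; even match
```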
This concludes the proof of the semi-circular law (for any of the three modes of convergence).
Remark 4 In the spirit of the Lindeberg exchange method, observe that Exercise 9 is unnecessary if one already knows that the semi-circular law holds for at least one ensemble of Wigner matrices (e.g. the GUE ensemble). Indeed, Exercise 9 can be deduced from such a piece of knowledge. In such a situation, it is not necessary to actually compute the main term $C_{k/2}$ on the right of (4); it would be sufficient to know that that limit is universal, in that it does not depend on the underlying distribution. In fact, it would even suffice to establish the slightly weaker statement

$\displaystyle {\bf E} \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^k = {\bf E} \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M'_n)^k + o_k(1)$

whenever $M_n, M'_n$ are two ensembles of Wigner matrices arising from different underlying distributions (but still normalised to have mean zero, unit variance, and to be bounded (or at worst subgaussian)). We will take advantage of this perspective later in these notes.
— 3. The Stieltjes transform method —
The moment method was computationally intensive, but straightforward. As noted in Remark 4, even without doing much of the algebraic computation, it is clear that the moment method will show that some universal limit for Wigner matrices exists (or, at least, that the differences between the distributions of two different Wigner matrices converge to zero). But it is not easy to see from this method why the limit should be given by the semi-circular law, as opposed to some other distribution (although one could eventually work this out from an inverse moment computation).
When studying the central limit theorem, we were able to use the Fourier method to control the distribution of sums of random variables in a cleaner way than in the moment method. Analogues of this method exist for random matrices, but require non-trivial formulae from noncommutative Fourier analysis, such as the Harish-Chandra integration formula (and also only work for highly symmetric ensembles, such as GUE or GOE), and will not be discussed in this course. (Our later notes on determinantal processes, however, will contain some algebraic identities related in some ways to the noncommutative Fourier-analytic approach.)
We now turn to another method, the Stieltjes transform method, which uses complex-analytic methods rather than Fourier-analytic methods, and has turned out to be one of the most powerful and accurate tools in dealing with the ESD of random Hermitian matrices. Whereas the moment method started from the identity (3), the Stieltjes transform method proceeds from the identity

$\displaystyle m_n(z) := \frac{1}{n} \hbox{tr}( \frac{1}{\sqrt{n}} M_n - z )^{-1} = \int_{\bf R} \frac{1}{x-z}\ d\mu_{\frac{1}{\sqrt{n}} M_n}(x) \ \ \ \ \ (8)$

for any complex $z$ not in the support of $\mu_{\frac{1}{\sqrt{n}} M_n}$. We refer to the expression on the left-hand side as the Stieltjes transform of $M_n$ or of $\mu_{\frac{1}{\sqrt{n}} M_n}$, and denote it by $m_{\mu_{\frac{1}{\sqrt{n}} M_n}}$ or as $m_n$ for short. The expression $(\frac{1}{\sqrt{n}} M_n - z)^{-1}$ is the normalised resolvent of $M_n$, and plays an important role in the spectral theory of that matrix. Indeed, in contrast to general-purpose methods such as the moment method, the Stieltjes transform method draws heavily on the specific linear-algebraic structure of this problem, and in particular on the rich structure of resolvents.
On the other hand, the Stieltjes transform can be viewed as a generating function of the moments via the Taylor series expansion

$\displaystyle m_n(z) = -\frac{1}{z} - \frac{1}{z^2} \frac{1}{n} \hbox{tr} \frac{1}{\sqrt{n}} M_n - \frac{1}{z^3} \frac{1}{n} \hbox{tr} (\frac{1}{\sqrt{n}} M_n)^2 - \ldots,$

valid for $z$ sufficiently large. This is somewhat (though not exactly) analogous to how the characteristic function ${\bf E} e^{itX}$ of a scalar random variable can be viewed as a generating function of the moments ${\bf E} X^k$.
Now let us study the Stieltjes transform more systematically. Given any probability measure $\mu$ on the real line, we can form its Stieltjes transform

$\displaystyle m_\mu(z) := \int_{\bf R} \frac{1}{x-z}\ d\mu(x) \ \ \ \ \ (9)$

for any $z$ outside of the support of $\mu$; in particular, the Stieltjes transform is well-defined on the upper and lower half-planes in the complex plane. Even without any further hypotheses on $\mu$ other than it is a probability measure, we can say a remarkable amount about how this transform behaves in $z$. Applying conjugations we obtain the symmetry

$\displaystyle m_\mu(\overline{z}) = \overline{m_\mu(z)}, \ \ \ \ \ (10)$

so we may as well restrict attention to $z$ in the upper half-plane (say). Next, we have the trivial bound

$\displaystyle |m_\mu(z)| \leq \frac{1}{|\hbox{Im}(z)|}. \ \ \ \ \ (11)$
In a similar spirit, an easy application of dominated convergence gives the asymptotic

$\displaystyle m_\mu(z) = \frac{-1+o(1)}{z} \ \ \ \ \ (12)$

where $o(1)$ is an expression that, for any fixed $C > 0$, goes to zero as $z$ goes to infinity non-tangentially in the sense that $|\hbox{Re}(z)|/|\hbox{Im}(z)| \leq C$ is kept bounded, where the rate of convergence is allowed to depend on $C$. From differentiation under the integral sign (or an application of Morera’s theorem and Fubini’s theorem) we see that $m_\mu$ is complex analytic on the upper and lower half-planes; in particular, it is smooth away from the real axis. From the Cauchy integral formula (or differentiation under the integral sign) we in fact get some bounds for higher derivatives of the Stieltjes transform away from this axis:

$\displaystyle |\frac{d^k}{dz^k} m_\mu(z)| = O_k( |\hbox{Im}(z)|^{-k-1} ). \ \ \ \ \ (13)$

Informally, $m_\mu$ “behaves like a constant” at scales significantly less than the distance $|\hbox{Im}(z)|$ to the real axis; all the really interesting action here is going on near that axis.
The imaginary part of the Stieltjes transform is particularly interesting. Writing $z = a+bi$, we observe that

$\displaystyle \hbox{Im}\ m_\mu(z) = \int_{\bf R} \frac{b}{(x-a)^2 + b^2}\ d\mu(x)$

and so we see that

$\displaystyle \hbox{Im}\ m_\mu(z) > 0$

for $z$ in the upper half-plane; thus $m_\mu$ is a complex-analytic map from the upper half-plane to itself, a type of function known as a Herglotz function. (In fact, all complex-analytic maps from the upper half-plane to itself that obey the asymptotic (12) are of this form; this is a special case of the Herglotz representation theorem, which also gives a slightly more general description in the case when the asymptotic (12) is not assumed. A good reference for this material and its consequences is this book of Garnett.)
One can also express the imaginary part of the Stieltjes transform as a convolution

$\displaystyle \hbox{Im}\ m_\mu(a+bi) = \pi\, \mu * P_b(a) \ \ \ \ \ (14)$

where $P_b$ is the Poisson kernel

$\displaystyle P_b(x) := \frac{1}{\pi} \frac{b}{x^2+b^2}.$

As is well known, these kernels form a family of approximations to the identity, and thus $\mu * P_b$ converges in the vague topology to $\mu$ as $b \rightarrow 0^+$ (see e.g. my notes on distributions). Thus we see that

$\displaystyle \hbox{Im}\ m_\mu(\cdot+bi) \rightarrow \pi \mu$

as $b \rightarrow 0^+$ in the vague topology, or equivalently (by (10)) that

$\displaystyle m_\mu(\cdot+bi) - m_\mu(\cdot-bi) \rightarrow 2\pi i \mu \ \ \ \ \ (15)$

as $b \rightarrow 0^+$ (this is closely related to the Plemelj formula in potential theory). Thus we see that a probability measure $\mu$ can be recovered in terms of the limiting behaviour of the Stieltjes transform on the real axis.
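Here is a hedged numerical illustration (my addition) of this recovery for a measure whose Stieltjes transform is elementary, namely the uniform distribution on $[0,1]$, for which $m_\mu(z) = \log \frac{1-z}{-z}$:

```python
# Hedged illustration of (14)-(15): (1/pi) Im m_mu(x + ib) converges to the
# density of mu (here: uniform on [0,1]) as b -> 0+.
import numpy as np

def m_uniform(z):
    # Stieltjes transform of the uniform distribution on [0,1];
    # the principal branch of log is fine off the support
    return np.log((1 - z) / (-z))

x = np.linspace(-0.5, 1.5, 9)
for b in (0.1, 0.01, 0.001):
    approx_density = np.imag(m_uniform(x + 1j * b)) / np.pi
    print(b, np.round(approx_density, 3))   # tends to 1 on (0, 1) and 0 outside
```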
A variant of the above machinery gives us a criterion for convergence:
Exercise 10 (Stieltjes continuity theorem) Let $\mu_n$ be a sequence of random probability measures on the real line, and let $\mu$ be a deterministic probability measure.
- $\mu_n$ converges almost surely to $\mu$ in the vague topology if and only if $m_{\mu_n}(z)$ converges almost surely to $m_\mu(z)$ for every $z$ in the upper half-plane.
- $\mu_n$ converges in probability to $\mu$ in the vague topology if and only if $m_{\mu_n}(z)$ converges in probability to $m_\mu(z)$ for every $z$ in the upper half-plane.
- $\mu_n$ converges in expectation to $\mu$ in the vague topology if and only if ${\bf E} m_{\mu_n}(z)$ converges to $m_\mu(z)$ for every $z$ in the upper half-plane.

(Hint: The “only if” parts are fairly easy. For the “if” parts, take a test function $\varphi \in C_c({\bf R})$ and approximate $\int_{\bf R} \varphi\ d\mu_n$ by $\int_{\bf R} \varphi * P_b\ d\mu_n$. Then approximate this latter integral in turn by a Riemann sum, using (13).)
Thus, to prove the semi-circular law, it suffices to show that for each $z$ in the upper half-plane, the Stieltjes transform

$\displaystyle m_n(z) = \frac{1}{n} \hbox{tr}( \frac{1}{\sqrt{n}} M_n - z )^{-1}$

converges almost surely (and thus in probability and in expectation) to the Stieltjes transform $m_{\mu_{sc}}(z)$ of the semi-circular law.

It is not difficult to compute the Stieltjes transform of the semi-circular law, but let us hold off on that task for now, because we want to illustrate how the Stieltjes transform method can be used to find the semi-circular law, even if one did not know this law in advance, by directly controlling $m_n(z)$. We will fix $z = a+bi$ to be a complex number not on the real line, and allow all implied constants in the discussion below to depend on $a$ and $b$ (we will focus here only on the behaviour as $n \rightarrow \infty$).
The main idea here is predecessor comparison: to compare the transform $m_n(z)$ of the $n \times n$ matrix $M_n$ with the transform $m_{n-1}(z)$ of the top left $(n-1) \times (n-1)$ minor $M_{n-1}$, or of other minors. For instance, we have the Cauchy interlacing law (Exercise 14 from Notes 3a), which asserts that the eigenvalues $\lambda_1(M_{n-1}) \leq \ldots \leq \lambda_{n-1}(M_{n-1})$ of $M_{n-1}$ intersperse those of $M_n$. This implies that for a complex number $z$ with $\hbox{Im}(z) > 0$, the difference

$\displaystyle \sum_{j=1}^{n-1} \frac{1}{\lambda_j(M_{n-1})/\sqrt{n} - z} - \sum_{j=1}^{n} \frac{1}{\lambda_j(M_n)/\sqrt{n} - z}$

is an alternating sum of evaluations of the function $\lambda \mapsto \frac{1}{\lambda/\sqrt{n} - z}$. The total variation of this function is $O(1)$ (recall that we are suppressing dependence of constants on $a, b$), and so the alternating sum above is $O(1)$. Writing this in terms of the Stieltjes transform, we conclude that

$\displaystyle \frac{n-1}{n} m_{n-1}( \sqrt{\frac{n}{n-1}} z ) - m_n(z) = O(\frac{1}{n}).$
Applying (13) to approximate $m_{n-1}( \sqrt{\frac{n}{n-1}} z )$ by $m_{n-1}(z)$, we conclude that

$\displaystyle m_n(z) = m_{n-1}(z) + O(\frac{1}{n}). \ \ \ \ \ (16)$

So for fixed $z$ away from the real axis, the Stieltjes transform $m_n(z)$ is quite stable in $n$.
This stability has the following important consequence. Observe that while the left-hand side of (16) depends on the full matrix $M_n$, the right-hand side depends only on the top left minor $M_{n-1}$ of that matrix. In particular, it is independent of the $n^{th}$ row and column of $M_n$. This implies that this entire row and column has only a limited amount of influence on the Stieltjes transform $m_n(z)$: no matter what value one assigns to this row and column (including possibly unbounded values, as long as one keeps the matrix Hermitian of course), the transform $m_n(z)$ can only move by $O(\frac{1}{n})$.
By permuting the rows and columns, we obtain that in fact any row or column of $M_n$ can influence $m_n(z)$ by at most $O(\frac{1}{n})$. (This is closely related to the observation in Exercise 4 that low rank perturbations do not significantly affect the ESD.) On the other hand, the rows of (the upper triangular portion of) $M_n$ are jointly independent. When $M_n$ is a Wigner random matrix, we can then apply a standard concentration of measure result, such as McDiarmid’s inequality (Theorem 7 from Notes 1), to conclude concentration of $m_n(z)$ around its mean:

$\displaystyle {\bf P}( |m_n(z) - {\bf E} m_n(z)| \geq \lambda/\sqrt{n} ) \leq C e^{-c\lambda^2} \ \ \ \ \ (17)$

for all $\lambda > 0$ and some absolute constants $C, c > 0$. (This is not necessarily the strongest concentration result one can establish for the Stieltjes transform, but it will certainly suffice for our discussion here.) In particular, we see from the Borel-Cantelli lemma (Exercise 24 of Notes 0a) that for any fixed $z$ away from the real line, $m_n(z) - {\bf E} m_n(z)$ converges almost surely (and thus also in probability) to zero. As a consequence, convergence of $m_n(z)$ in expectation automatically implies convergence in probability or almost sure convergence.
However, while concentration of measure tells us that $m_n(z)$ is close to its mean ${\bf E} m_n(z)$, it does not shed much light as to what this mean is. For this, we have to go beyond the Cauchy interlacing formula and deal with the resolvent $(\frac{1}{\sqrt{n}} M_n - z)^{-1}$ more directly. Firstly, we observe from the linearity of trace that

$\displaystyle {\bf E} m_n(z) = \frac{1}{n} \sum_{j=1}^n {\bf E} [(\frac{1}{\sqrt{n}} M_n - z)^{-1}]_{jj}$

where $[A]_{jk}$ denotes the $jk$ component of a matrix $A$. Because $M_n$ is a Wigner matrix, it is easy to see on permuting the rows and columns that all of the random variables $[(\frac{1}{\sqrt{n}} M_n - z)^{-1}]_{jj}$ have the same distribution. Thus we may simplify the above formula as

$\displaystyle {\bf E} m_n(z) = {\bf E} [(\frac{1}{\sqrt{n}} M_n - z)^{-1}]_{nn}. \ \ \ \ \ (18)$
So now we have to compute the last entry of an inverse of a matrix. There are of course a number of formulae for this, such as Cramer’s rule. But it will be more convenient here to use a formula based instead on the Schur complement:
Exercise 11 Let $A_n$ be an $n \times n$ matrix, let $A_{n-1}$ be the top left $(n-1) \times (n-1)$ minor, let $a_{nn}$ be the bottom right entry of $A_n$, let $X \in {\bf C}^{n-1}$ be the right column of $A_n$ with the bottom right entry removed, and let $Y^* \in ({\bf C}^{n-1})^*$ be the bottom row with the bottom right entry removed. In other words,

$\displaystyle A_n = \begin{pmatrix} A_{n-1} & X \\ Y^* & a_{nn} \end{pmatrix}.$

Assume that $A_n$ and $A_{n-1}$ are both invertible. Show that

$\displaystyle [A_n^{-1}]_{nn} = \frac{1}{a_{nn} - Y^* A_{n-1}^{-1} X}.$

(Hint: Solve the equation $A_n v = e_n$, where $e_n$ is the $n^{th}$ basis vector, using the method of Schur complements (or from first principles).)
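The identity in Exercise 11 is easy to test numerically; the following sketch (my addition) checks it on a random complex matrix.

```python
# Hedged numerical check of the Schur complement identity in Exercise 11.
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A11, X, Ystar, a = A[:-1, :-1], A[:-1, -1], A[-1, :-1], A[-1, -1]
lhs = np.linalg.inv(A)[-1, -1]
rhs = 1 / (a - Ystar @ np.linalg.inv(A11) @ X)
print(np.allclose(lhs, rhs))  # True
```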
The point of this identity is that it describes (part of) the inverse of $A_n$ in terms of the inverse of $A_{n-1}$, which will eventually provide a non-trivial recursive relationship between $m_n(z)$ and $m_{n-1}(z)$, which can then be played off against (16) to solve for $m_n(z)$ in the asymptotic limit $n \rightarrow \infty$.
In our situation, the matrix $\frac{1}{\sqrt{n}} M_n - z$ and its minor $\frac{1}{\sqrt{n}} M_{n-1} - z$ are automatically invertible, since $z$ lies off the real axis. Inserting the above formula into (18) (and recalling that we normalised the diagonal of $M_n$ to vanish), we conclude that

$\displaystyle {\bf E} m_n(z) = {\bf E} \frac{1}{-z - \frac{1}{n} X^* (\frac{1}{\sqrt{n}} M_{n-1} - z)^{-1} X} \ \ \ \ \ (19)$

where $X \in {\bf C}^{n-1}$ is the right column of $M_n$ with the bottom entry $\xi_{nn}$ removed.
One may be concerned that the denominator here could vanish. However, observe that $-z$ has imaginary part $-b$ if $z = a+bi$. Furthermore, from the spectral theorem we see that the imaginary part of $(\frac{1}{\sqrt{n}} M_{n-1} - z)^{-1}$ is positive definite, and so $\frac{1}{n} X^* (\frac{1}{\sqrt{n}} M_{n-1} - z)^{-1} X$ has non-negative imaginary part. As a consequence the magnitude of the denominator here is bounded below by $|b|$, and so its reciprocal is $O(1)$ (compare with (11)). So the reciprocal here is not going to cause any discontinuity, as $b$ is fixed and non-zero.
Now we need to understand the expression $\frac{1}{n} X^* (\frac{1}{\sqrt{n}} M_{n-1} - z)^{-1} X$. We write this as $\frac{1}{n} X^* R X$, where $R$ is the resolvent matrix $R := (\frac{1}{\sqrt{n}} M_{n-1} - z)^{-1}$. The distribution of the random matrix $R$ could conceivably be quite complicated. However, the key point is that the vector $X$ only involves the entries of $M_n$ that do not lie in $M_{n-1}$, and so the random matrix $R$ and the vector $X$ are independent. Because of this, we can use the randomness of $X$ to do most of the work in understanding the expression $\frac{1}{n} X^* R X$, without having to know much about $R$ at all.
To understand this, let us first condition $R$ to be a deterministic matrix $R = R_0$, and see what we can do with the expression $\frac{1}{n} X^* R_0 X$.
Firstly, observe that $R_0$ will not be arbitrary; indeed, from the spectral theorem we see that $R_0$ will have operator norm at most $1/|b|$. Meanwhile, from the Chernoff (or Hoeffding) inequality (Theorem 2 or Exercise 4 of Notes 1) we know that $X$ has magnitude $O(\sqrt{n})$ with overwhelming probability. So we know that $\frac{1}{n} X^* R_0 X$ has magnitude $O(1)$ with overwhelming probability.
Furthermore, we can use concentration of measure as follows. Given any positive semi-definite matrix $R_0$ of operator norm $O(1)$, the expression $(X^* R_0 X)^{1/2} = \|R_0^{1/2} X\|$ is a Lipschitz function of $X$ with Lipschitz constant $O(1)$. Applying Talagrand’s inequality (Theorem 9 of Notes 1) we see that this expression concentrates around its median:

$\displaystyle {\bf P}( |(X^* R_0 X)^{1/2} - {\bf M}( (X^* R_0 X)^{1/2} )| \geq \lambda ) \leq C e^{-c\lambda^2}$

for any $\lambda > 0$. On the other hand, $(X^* R_0 X)^{1/2}$ has magnitude $O(\sqrt{n})$ with overwhelming probability, so the median ${\bf M}( (X^* R_0 X)^{1/2} )$ must be $O(\sqrt{n})$. Squaring, we conclude that

$\displaystyle {\bf P}( |X^* R_0 X - {\bf M}( (X^* R_0 X)^{1/2} )^2| \geq \lambda \sqrt{n} ) \leq C e^{-c\lambda^2}$

(possibly after adjusting the absolute constants $C, c$). As usual, we may replace the median with the expectation:

$\displaystyle {\bf P}( |X^* R_0 X - {\bf E} X^* R_0 X| \geq \lambda \sqrt{n} ) \leq C e^{-c\lambda^2}. \ \ \ \ \ (20)$

This was for positive semi-definite matrices, but one can easily use the triangle inequality to generalise to self-adjoint matrices, and then to arbitrary matrices, of operator norm $O(1)$, and conclude that (20) holds for any deterministic matrix $R_0$ of operator norm $O(1)$, and any $\lambda > 0$.
But what is the expectation ${\bf E} X^* R_0 X$? This can be expressed in components as

$\displaystyle {\bf E} X^* R_0 X = \sum_{j=1}^{n-1} \sum_{k=1}^{n-1} {\bf E}\, \overline{\xi_{jn}} [R_0]_{jk} \xi_{kn}$

where $\xi_{jn}$ are the entries of $X$, and $[R_0]_{jk}$ are the entries of $R_0$. But the $\xi_{jn}$ are iid with mean zero and variance one, so the standard second moment computation shows that this expectation is nothing more than the trace

$\displaystyle \hbox{tr}(R_0) = \sum_{j=1}^{n-1} [R_0]_{jj}$

of $R_0$. We have thus shown the concentration of measure result

$\displaystyle {\bf P}( |X^* R_0 X - \hbox{tr}(R_0)| \geq \lambda \sqrt{n} ) \leq C e^{-c\lambda^2} \ \ \ \ \ (21)$

for any deterministic matrix $R_0$ of operator norm $O(1)$, and any $\lambda > 0$. Informally, $X^* R_0 X$ is typically $\hbox{tr}(R_0) + O(\sqrt{n})$.
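The following sketch (my addition) illustrates (21) empirically: the fluctuation of $X^* R_0 X$ around $\hbox{tr}(R_0)$ is of size $O(\sqrt{n})$ for a fixed matrix $R_0$ of operator norm at most $1$.

```python
# Hedged illustration of (21): X* R0 X fluctuates around tr(R0) at scale
# O(sqrt(n)) when R0 is fixed with operator norm at most 1.
import numpy as np

rng = np.random.default_rng(5)
n = 1000
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
R0 = Q @ np.diag(rng.uniform(-1, 1, n)) @ Q.T     # symmetric, operator norm <= 1
errors = [X @ R0 @ X - np.trace(R0)
          for X in (rng.choice([-1.0, 1.0], size=n) for _ in range(50))]
print(np.std(errors) / np.sqrt(n))                # O(1), as predicted by (21)
```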
The bound (21) was proven for deterministic matrices, but by using conditional expectation it also applies for any random matrix $R_0$, so long as that matrix is independent of $X$. In particular, we may apply it to our specific matrix of interest

$\displaystyle R := (\frac{1}{\sqrt{n}} M_{n-1} - z)^{-1}.$

The trace of this matrix is essentially just the Stieltjes transform $m_{n-1}$ at $z$. Actually, due to the normalisation factor being slightly off, we actually have

$\displaystyle \hbox{tr}(R) = \sqrt{\frac{n}{n-1}} (n-1)\, m_{n-1}( \sqrt{\frac{n}{n-1}} z ),$

but by using the smoothness (13) of the Stieltjes transform, together with the stability property (16), we can simplify this as

$\displaystyle \hbox{tr}(R) = n\, m_n(z) + O(1).$

In particular, from (21) and (17), we see that

$\displaystyle \frac{1}{n} X^* R X = {\bf E} m_n(z) + o(1)$

with overwhelming probability. Putting this back into (19), and recalling that the denominator is bounded away from zero, we have the remarkable equation

$\displaystyle {\bf E} m_n(z) = \frac{1}{-z - {\bf E} m_n(z)} + o(1). \ \ \ \ \ (22)$
Note how this equation came by playing off two ways in which the spectral properties of the matrix $M_n$ interacted with those of its minor $M_{n-1}$: firstly via the Cauchy interlacing inequality, and secondly via the Schur complement formula.
This equation already describes the behaviour of ${\bf E} m_n(z)$ quite well, but we will content ourselves with understanding the limiting behaviour as $n \rightarrow \infty$. From (13) and Fubini’s theorem we know that the functions ${\bf E} m_n$ are locally uniformly equicontinuous and locally uniformly bounded away from the real line. Applying the Arzelá-Ascoli theorem, we thus conclude that on a subsequence at least, ${\bf E} m_n$ converges locally uniformly to a limit $m$. This will be a Herglotz function (i.e. an analytic function mapping the upper half-plane to the upper half-plane), and taking limits in (22) (observing that the imaginary part of the denominator here is bounded away from zero) we end up with the exact equation

$\displaystyle m(z) = \frac{1}{-z - m(z)}. \ \ \ \ \ (23)$
We can of course solve this by the quadratic formula, obtaining

$\displaystyle m(z) = \frac{-z \pm \sqrt{z^2-4}}{2}.$

To figure out what branch of the square root one has to use here, we use (12), which easily implies that

$\displaystyle m(z) = \frac{-1+o(1)}{z}$

as $z$ goes to infinity non-tangentially away from the real line. (To justify this, one has to make the error term in (12) uniform in $n$, but this can be accomplished without difficulty using the Bai-Yin theorem (for instance).) Also, we know that $m$ has to be complex analytic (and in particular, continuous) away from the real line. From this and basic complex analysis, we conclude that

$\displaystyle m(z) = \frac{-z + \sqrt{z^2-4}}{2} \ \ \ \ \ (24)$

where $\sqrt{z^2-4}$ is the branch of the square root with a branch cut at $[-2,2]$ and which is asymptotic to $z$ at infinity.
As there is only one possible subsequence limit of the ${\bf E} m_n$, we conclude that ${\bf E} m_n$ converges locally uniformly (and thus pointwise) to the function (24), and thus (by the concentration of measure of $m_n(z)$) we see that for each $z$ in the upper half-plane, $m_n(z)$ converges almost surely (and in probability) to $m(z)$.
Exercise 12 Find a direct proof (starting from (22), (12), and the smoothness of ${\bf E} m_n(z)$) that ${\bf E} m_n(z) = m(z) + o(1)$ for any fixed $z$ in the upper half-plane, that avoids using the Arzelá-Ascoli theorem. (The basic point here is that one has to solve the approximate equation (22), using some robust version of the quadratic formula. The fact that ${\bf E} m_n$ is a Herglotz function will help eliminate various unwanted possibilities, such as one coming from the wrong branch of the square root.)
To finish computing the limiting ESD of Wigner matrices, we have to figure out what probability measure $m$ comes from. But this is easily read off from (24) and (15):

$\displaystyle \frac{1}{2\pi i} ( m(\cdot+bi) - m(\cdot-bi) ) \rightarrow \frac{1}{2\pi} (4-|x|^2)_+^{1/2}\ dx = \mu_{sc} \ \ \ \ \ (25)$

in the vague topology as $b \rightarrow 0^+$. Thus the semi-circular law is the only possible measure which has Stieltjes transform $m$, and indeed a simple application of the Cauchy integral formula and (25) shows us that $m$ is indeed the Stieltjes transform of $\mu_{sc}$.
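As a final numerical check (my addition), one can compare the empirical Stieltjes transform $m_n(z)$ of a Wigner matrix with the closed form (24); the branch condition is conveniently realised by writing $\sqrt{z^2-4} = \sqrt{z-2}\sqrt{z+2}$ with principal square roots, which is valid in the upper half-plane.

```python
# Hedged check: empirical m_n(z) versus the limiting transform (24).
import numpy as np

def m_sc(z):
    # branch with cut on [-2,2], asymptotic to z at infinity (for Im z > 0)
    return (-z + np.sqrt(z - 2) * np.sqrt(z + 2)) / 2

rng = np.random.default_rng(6)
n = 2000
A = np.triu(rng.standard_normal((n, n)), 1)
evals = np.linalg.eigvalsh(A + A.T) / np.sqrt(n)
for z in (1j, 1 + 0.5j, -0.3 + 0.1j):
    m_n = np.mean(1 / (evals - z))              # the definition (8)
    print(z, np.round(m_n, 3), np.round(m_sc(z), 3))  # agree up to small error
```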
Putting all this together, we have completed the Stieltjes transform proof of the semi-circular law.
Remark 5 In order to simplify the above exposition, we opted for a qualitative analysis of the semi-circular law here, ignoring such questions as the rate of convergence to this law. However, an inspection of the above arguments reveals that it is easy to make all of the above analysis quite quantitative, with quite reasonable control on all terms. (One has to use Exercise 12 instead of the Arzelá-Ascoli theorem if one wants everything to be quantitative.) In particular, it is not hard to use the above analysis to show that for $z$ with $\hbox{Im}(z) > n^{-c}$ for some small absolute constant $c > 0$, one has

$\displaystyle m_n(z) = m(z) + O(n^{-c})$

with overwhelming probability. Combining this with a suitably quantitative version of the Stieltjes continuity theorem, this in turn gives a polynomial rate of convergence of the ESDs $\mu_{\frac{1}{\sqrt{n}} M_n}$ to the semi-circular law $\mu_{sc}$, in that one has

$\displaystyle \mu_{\frac{1}{\sqrt{n}} M_n}(-\infty,\lambda) = \mu_{sc}(-\infty,\lambda) + O(n^{-c})$

with overwhelming probability for all $\lambda \in {\bf R}$.
A variant of this quantitative analysis can in fact get very good control on this ESD down to quite fine scales, namely to scales $O(\log^{O(1)} n / n)$, which is only just a little bit larger than the mean spacing $O(1/n)$ of the normalised eigenvalues (recall that we have $n$ normalised eigenvalues, constrained to lie in the interval $[-2-o(1), 2+o(1)]$ by the Bai-Yin theorem). This was accomplished by Erdős, Schlein, and Yau (under some additional regularity hypotheses on the distribution $\xi$, but these can be easily removed with the assistance of Talagrand’s inequality) by using an additional observation, namely that the eigenvectors of a random matrix are very likely to be delocalised in the sense that their $\ell^2$ energy is dispersed more or less evenly across their coefficients. We will return to this point in later notes.
— 4. Dyson Brownian motion and the Stieltjes transform —
We now explore how the Stieltjes transform interacts with the Dyson Brownian motion introduced in Notes 3b. We let $n$ be a large number, and let $M_n(t)$ be a Wiener process of Hermitian random matrices, with associated eigenvalues $\lambda_1(t), \ldots, \lambda_n(t)$, Stieltjes transforms

$\displaystyle m(t,z) := \frac{1}{n} \sum_{j=1}^n \frac{1}{\lambda_j(t)/\sqrt{n} - z} \ \ \ \ \ (26)$

and spectral measures

$\displaystyle \mu(t) := \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(t)/\sqrt{n}}. \ \ \ \ \ (27)$

We now study how $m$, $\mu$ evolve in time in the asymptotic limit $n \rightarrow \infty$. Our computation will be only heuristic in nature.
Recall from Notes 3b that the eigenvalues $\lambda_i = \lambda_i(t)$ undergo Dyson Brownian motion

$\displaystyle d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i - \lambda_j}. \ \ \ \ \ (28)$

Applying (26) and Taylor expansion (dropping all terms of higher order than $dt$, using the Ito heuristic $dB_i = O(dt^{1/2})$), we conclude that

$\displaystyle dm = -\frac{1}{n^{3/2}} \sum_{i=1}^n \frac{dB_i}{(\lambda_i/\sqrt{n} - z)^2} - \frac{1}{n^{3/2}} \sum_{i=1}^n \sum_{j \neq i} \frac{dt}{(\lambda_i/\sqrt{n} - z)^2 (\lambda_i - \lambda_j)} + \frac{1}{n^2} \sum_{i=1}^n \frac{dt}{(\lambda_i/\sqrt{n} - z)^3}.$

For $z$ away from the real line, the final term $\frac{1}{n^2} \sum_{i=1}^n \frac{dt}{(\lambda_i/\sqrt{n} - z)^3}$ is of size $O(dt/n)$ and can heuristically be ignored in the limit $n \rightarrow \infty$. Dropping this term, and then taking expectations to remove the Brownian motion term $dB_i$, we are led to

$\displaystyle d{\bf E} m = -\frac{1}{n^{3/2}} \sum_{i=1}^n \sum_{j \neq i} {\bf E} \frac{dt}{(\lambda_i/\sqrt{n} - z)^2 (\lambda_i - \lambda_j)}.$
Performing the $j$ summation using (26) we obtain

$\displaystyle d{\bf E} m = \frac{1}{n} \sum_{i=1}^n {\bf E} \frac{m(t, \lambda_i/\sqrt{n})}{(\lambda_i/\sqrt{n} - z)^2}\ dt,$

where we adopt the convention that for real $x$, $m(t,x)$ is the average of $m(t,x+i0)$ and $m(t,x-i0)$. Using (27), this becomes

$\displaystyle d{\bf E} m = {\bf E} ( \int_{\bf R} \frac{m(t,x)}{(x-z)^2}\ d\mu(t)(x) )\ dt. \ \ \ \ \ (29)$

From (15) we heuristically have

$\displaystyle m(t,x+i0) - m(t,x-i0) = 2\pi i\, \mu(t)(x)$

(heuristically treating $\mu(t)$ as a function rather than a measure), and on squaring one obtains

$\displaystyle m(t,x+i0)^2 - m(t,x-i0)^2 = 4\pi i\, m(t,x)\, \mu(t)(x).$

From this and the Cauchy integral formula around a slit in the real axis (using the bound (11) to ignore the contributions near infinity) we thus have

$\displaystyle m(t,z)^2 = \int_{\bf R} \frac{2 m(t,x)\, \mu(t)(x)}{x-z}\ dx,$

and thus on differentiation in $z$

$\displaystyle 2 m m_z(t,z) = \int_{\bf R} \frac{2 m(t,x)\, \mu(t)(x)}{(x-z)^2}\ dx,$

where the subscript $z$ denotes differentiation in $z$. Comparing this with (29), we obtain

$\displaystyle d{\bf E} m = {\bf E}( m m_z )\ dt.$
From concentration of measure, we expect $m$ to concentrate around its mean ${\bf E} m$, and similarly $m m_z$ should concentrate around $({\bf E} m)({\bf E} m)_z$. In the limit $n \rightarrow \infty$, the expected Stieltjes transform $m = m(t,z)$ should thus obey Burgers’ equation

$\displaystyle \partial_t m = m m_z. \ \ \ \ \ (30)$
To illustrate how this equation works in practice, let us give an informal derivation of the semi-circular law. We consider the case when the Wiener process starts from $M_n(0) = 0$, thus $M_n(t) \equiv t^{1/2} G_n$ for a GUE matrix $G_n$. As such, we have the scaling symmetry

$\displaystyle m(t,z) = \frac{1}{\sqrt{t}} s( \frac{z}{\sqrt{t}} )$

where $s$ is the asymptotic Stieltjes transform for GUE (which we secretly know to be given by (24), but let us pretend that we did not yet know this fact). Inserting this self-similar ansatz into (30) and setting $t = 1$, we conclude that

$\displaystyle -\frac{1}{2} s - \frac{1}{2} z s_z = s s_z;$

multiplying by two and integrating, we conclude that

$\displaystyle -z s = s^2 + C$

for some constant $C$. But from the asymptotic (12) we see that $C$ must equal $1$. But then the above equation can be rearranged into (23), and so by repeating the arguments at the end of the previous section we can deduce the formula (24), which then gives the semi-circular law by (15).
As is well known in PDE, one can solve Burgers’ equation more generally by the method of characteristics. For reasons that will become clearer in the next set of notes, I will solve this equation by a slightly different (but ultimately equivalent) method. The idea is that, rather than think of $m = m(t,z)$ as a function of $z$ for fixed $t$, we think of $z$ as a function of $m$ for fixed $t$. (This trick is sometimes known as the hodograph transform, especially if one views $m$ as “velocity” and $z$ as “position”.) Note from (12) that we expect to be able to invert the relationship between $m$ and $z$ as long as $z$ is large (and $m$ is small).
To exploit this change of perspective, we think of $m$, $t$, $z$ as all varying by infinitesimal amounts $dm$, $dt$, $dz$ respectively. Using (30) and the total derivative formula $dm = m_t\, dt + m_z\, dz$, we see that

$\displaystyle dm = m m_z\, dt + m_z\, dz.$

If we hold $m$ fixed (i.e. $dm = 0$), so that $z$ is now just a function of $t$, and cancel off the $m_z$ factor, we conclude that

$\displaystyle \frac{dz}{dt} = -m.$

This, in principle, gives a way to compute $m(t,\cdot)$ from $m(0,\cdot)$. First, we invert the relationship $m = m(0,z)$ to $z = z(0,m)$; then we add $-tm$ to $z(0,m)$; then we invert again to recover $m(t,z)$.
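To spell out this recipe in the simplest case (this worked example is my own addition): if the Wiener process starts from $M_n(0) = 0$, then $m(0,z) = -1/z$. Inverting gives $z(0,m) = -1/m$; adding $-tm$ gives $z(t,m) = -\frac{1}{m} - tm$; and inverting once more means solving $t m^2 + zm + 1 = 0$, so that

$\displaystyle m(t,z) = \frac{-z + \sqrt{z^2 - 4t}}{2t},$

which is precisely the self-similar solution $m(t,z) = \frac{1}{\sqrt{t}} s(\frac{z}{\sqrt{t}})$ with $s$ given by (24), i.e. the semi-circular law rescaled to the interval $[-2\sqrt{t}, 2\sqrt{t}]$.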
Since $M_n(t) \equiv M_n(0) + t^{1/2} G_n$, where $G_n$ is a GUE matrix independent of $M_n(0)$, we have thus given a formula to describe the Stieltjes transform of $M_n(0) + t^{1/2} G_n$ in terms of the Stieltjes transform of $M_n(0)$. This formula is a special case of a more general formula of Voiculescu for free convolution, with the operation of inverting the Stieltjes transform essentially being the famous $R$-transform of Voiculescu; we will discuss this more in the next set of notes.