Now that we have developed the basic probabilistic tools that we will need, we now turn to the main subject of this course, namely the study of random matrices. There are many random matrix models (aka matrix ensembles) of interest – far too many to all be discussed in a single course. We will thus focus on just a few simple models. First of all, we shall restrict attention to square matrices , where is a (large) integer and the are real or complex random variables. (One can certainly study rectangular matrices as well, but for simplicity we will only look at the square case.) Then, we shall restrict to three main models:
- Iid matrix ensembles, in which the coefficients are iid random variables with a single distribution . We will often normalise to have mean zero and unit variance. Examples of iid models include the Bernouli ensemble (aka random sign matrices) in which the are signed Bernoulli variables, the real gaussian matrix ensemble in which , and the complex gaussian matrix ensemble in which .
- Symmetric Wigner matrix ensembles, in which the upper triangular coefficients , are jointly independent and real, but the lower triangular coefficients , are constrained to equal their transposes: . Thus by construction is always a real symmetric matrix. Typically, the strictly upper triangular coefficients will be iid, as will the diagonal coefficients, but the two classes of coefficients may have a different distribution. One example here is the symmetric Bernoulli ensemble, in which both the strictly upper triangular and the diagonal entries are signed Bernoulli variables; another important example is the Gaussian Orthogonal Ensemble (GOE), in which the upper triangular entries have distribution and the diagonal entries have distribution . (We will explain the reason for this discrepancy later.)
- Hermitian Wigner matrix ensembles, in which the upper triangular coefficients are jointly independent, with the diagonal entries being real and the strictly upper triangular entries complex, and the lower triangular coefficients , are constrained to equal their adjoints: . Thus by construction is always a Hermitian matrix. This class of ensembles contains the symmetric Wigner ensembles as a subclass. Another very important example is the Gaussian Unitary Ensemble (GUE), in which all off-diagional entries have distribution , but the diagonal entries have distribution .
Given a matrix ensemble , there are many statistics of that one may wish to consider, e.g. the eigenvalues or singular values of , the trace and determinant, etc. In these notes we will focus on a basic statistic, namely the operator norm
of the matrix . This is an interesting quantity in its own right, but also serves as a basic upper bound on many other quantities. (For instance, is also the largest singular value of and thus dominates the other singular values; similarly, all eigenvalues of clearly have magnitude at most .) Because of this, it is particularly important to get good upper tail bounds
on this quantity, for various thresholds . (Lower tail bounds are also of interest, of course; for instance, they give us confidence that the upper tail bounds are sharp.) Also, as we shall see, the problem of upper bounding can be viewed as a non-commutative analogue of upper bounding the quantity studied in Notes 1. (The analogue of the central limit theorem in Notes 2 is the Wigner semi-circular law, which will be studied in the next set of notes.)
An matrix consisting entirely of s has an operator norm of exactly , as can for instance be seen from the Cauchy-Schwarz inequality. More generally, any matrix whose entries are all uniformly will have an operator norm of (which can again be seen from Cauchy-Schwarz, or alternatively from Schur’s test, or from a computation of the Frobenius norm). However, this argument does not take advantage of possible cancellations in . Indeed, from analogy with concentration of measure, when the entries of the matrix are independent, bounded and have mean zero, we expect the operator norm to be of size rather than . We shall see shortly that this intuition is indeed correct. (One can see, though, that the mean zero hypothesis is important; from the triangle inequality we see that if we add the all-ones matrix (for instance) to a random matrix with mean zero, to obtain a random matrix whose coefficients all have mean , then at least one of the two random matrices necessarily has operator norm at least .)
As mentioned before, there is an analogy here with the concentration of measure phenomenon, and many of the tools used in the latter (e.g. the moment method) will also appear here. (Indeed, we will be able to use some of the concentration inequalities from Notes 1 directly to help control and related quantities.) Similarly, just as many of the tools from concentration of measure could be adapted to help prove the central limit theorem, several the tools seen here will be of use in deriving the semi-circular law.
The most advanced knowledge we have on the operator norm is given by the Tracy-Widom law, which not only tells us where the operator norm is concentrated in (it turns out, for instance, that for a Wigner matrix (with some additional technical assumptions), it is concentrated in the range ), but what its distribution in that range is. While the methods in this set of notes can eventually be pushed to establish this result, this is far from trivial, and will only be briefly discussed here. (We may return to the Tracy-Widom law later in this course, though.)
— 1. The epsilon net argument —
The slickest way to control is via the moment method. But let us defer using this method for the moment, and work with a more “naive” way to control the operator norm, namely by working with the definition (1). From that definition, we see that we can view the upper tail event as a union of many simpler events:
The point of doing this is that the event is easier to control than the event , and can in fact be handled by the concentration of measure estimates we already have. For instance:
Lemma 1 Suppose that the coefficients of are independent, have mean zero, and uniformly bounded in magnitude by . Let be a unit vector in . Then for sufficiently large (larger than some absolute constant), one has
for some absolute constants .
Proof: Let be the rows of , then the column vector has coefficients for . if we let be the coefficients of , so that , then is just . Applying standard concentration of measure results (e.g. Exercise 4, Exercise 5, or Theorem 9 from Notes 1), we see that each is uniformly subgaussian, thus
for some absolute constants . In particular, we have
for some (slightly different) absolute constants . Multiplying these inequalities together for all , we obtain
and the claim then follows from Markov’s inequality.
Thus (with the hypotheses of Proposition 1), we see that for each individual unit vector , we have with overwhelming probability. It is then tempting to apply the union bound and try to conclude that with overwhelming probability also. However, we encounter a difficulty: the unit sphere is uncountable, and so we are taking the union over an uncountable number of events. Even though each event occurs with exponentially small probability, the union could well be everything.
Of course, it is extremely wasteful to apply the union bound to an uncountable union. One can pass to a countable union just by working with a countable dense subset of the unit sphere instead of the sphere itself, since the map is continuous. Of course, this is still an infinite set and so we still cannot usefully apply the union bound. However, the map is not just continuous; it is Lipschitz continuous, with a Lipschitz constant of . Now, of course there is some circularity here because is precisely the quantity we are trying to bound. Nevertheless, we can use this stronger continuity to refine the countable dense subset further, to a finite dense subset of , at the slight cost of modifying the threshold by a constant factor. Namely:
Lemma 2 Let be a maximal -net of the sphere , i.e. a set of points in that are separated from each other by a distance of at least , and which is maximal with respect to set inclusion. Then for any matrix with complex coefficients, and any , we have
Proof: By (1) (and compactness) we can find such that
This point need not lie in . However, as is a maximal -net of , we know that lies within of some point in (since otherwise we could add to and contradict maximality). Since , we have
By the triangle inequality we conclude that
In particular, if , then for some , and the claim follows.
Remark 1 Clearly, if one replaces the maximal -net here with an maximal -net for some other (defined in the obvious manner), then we get the same conclusion, but with replaced by .
Now that we have discretised the range of points to be finite, the union bound becomes viable again. We first make the following basic observation:
Proof: Consider the balls of radius centred around each point in ; by hypothesis, these are disjoint. On the other hand, by the triangle inequality, they are all contained in the ball of radius centred at the origin. The volume of the latter ball is at most the volume of any of the small balls, and the claim follows.
Exercise 1 Conversely, if is a maximal -net, show that has cardinality at least for some absolute constant .
And now we get an upper tail estimate:
Corollary 4 (Upper tail estimate for iid ensembles) Suppose that the coefficients of are independent, have mean zero, and uniformly bounded in magnitude by . Then there exists absolute constants such that
for all . In particular, we have with overwhelming probability.
Proof: From Lemma 2 and the union bound, we have
where is a maximal -net of . By Lemma 1, each of the probabilities is bounded by if is large enough. Meanwhile, from Lemma 3, has cardinality . If is large enough, the entropy loss of can be absorbed into the exponential gain of by modifying slightly, and the claim follows.
Exercise 2 If is a maximal -net instead of a maximal -net, establish the following variant
The above result was for matrices with independent entries, but it easily extends to the Wigner case:
Corollary 5 (Upper tail estimate for Wigner ensembles) Suppose that the coefficients of are independent for , mean zero, and uniformly bounded in magnitude by , and let for . Then there exists absolute constants such that
for all . In particular, we have with overwhelming probability.
Proof: From Corollary 4, the claim already holds for the upper-triangular portion of , as well as for the strict lower-triangular portion of . The claim then follows from the triangle inequality (adjusting the constants appropriately).
Remark 2 What we have just seen is a simple example of an epsilon net argument, which is useful when controlling a supremum of random variables such as (1), where each individual random variable is known to obey a large deviation inequality (in this case, Lemma 1). The idea is to use metric arguments (e.g. the triangle inequality, see Lemma 2) to refine the set of parameters to take the supremum over to an -net for some suitable , and then apply the union bound. One takes a loss based on the cardinality of the -net (which is basically the metric entropy or covering number of the original parameter space at scale ), but one can hope that the bounds from the large deviation inequality are strong enough (and the metric entropy bounds sufficiently accurate) to overcome this entropy loss.
There is of course the question of what scale to use. In this simple example, the scale sufficed. In other contexts, one has to choose the scale more carefully. In more complicated examples with no natural preferred scale, it often makes sense to take a large range of scales (e.g. for ) and chain them together by using telescoping series such as (where is the nearest point in to for , and is by convention) to estimate the supremum, the point being that one can hope to exploit cancellations between adjacent elements of the sequence . This is known as the method of chaining. There is an even more powerful refinement of this method, known as the method of generic chaining, which has a large number of applications; see this book by Talagrand for a beautiful and systematic treatment of the subject. However, we will not use this method in this course.
— 2. A symmetrisation argument (optional) —
We pause here to record an elegant symmetrisation argument that exploits convexity to allow us to reduce without loss of generality to the symmetric case , albeit at the cost of losing a factor of . We will not use this type of argument directly in these notes, but it is often used elsewhere in the literature.
Let be any random matrix with mean zero, and let be an independent copy of . Then, conditioning on , we have
As the operator norm is convex, we can then apply Jensen’s inequality to conclude that
One can modify (3) in a few ways, given some more hypotheses on . Suppose now that is a matrix with independent entries, thus has coefficients where is an independent copy of . Introduce a random sign matrix which is (jointly) independent of . Observe that as the distribution of is symmetric, that
By the distributive law and the triangle inequality we have
Thus, if one does not mind losing a factor of two, one has the freedom to randomise the sign of each entry of independently (assuming that the entries were already independent). Thus, in proving Corollary 4, one could have reduced to the case when the were symmetric, though in this case this would not have made the argument that much simpler.
Sometimes it is preferable to multiply the coefficients by a Gaussian rather than by a random sign. Again, let have independent entries with mean zero. Let be a real gaussian matrix independent of , thus the are iid. We can split , where and . Note that , , are independent, and is a random sign matrix. In particular, (4) holds. We now use
Exercise 4 If , show that .
From this exercise we see that
and hence by Jensen’s inequality again
Undoing the conditional expectation in and applying (4) we conclude the gaussian symmetrisation inequality
Thus, for instance, when proving Corollary 4, one could have inserted a random gaussian in front of each coefficient. This would have made the proof of Lemma 1 marginally simpler (as one could compute directly with gaussians, and reduce the number of appeals to concentration of measure results) but in this case the improvement is negligible. In other situations though it can be quite helpful to have the additional random sign or random gaussian factor present. For instance, we have the following result of Latala:
and the fourth moment bound
for some . Then .
Proof: (Sketch only) Using (5) one can replace by without much penalty. One then runs the epsilon-net argument with an explicit net, and uses concentration of measure results for gaussians (such as Theorem 8 from Notes 1) to obtain the analogue of Lemma 1. The details are rather intricate, and we refer the interested reader to Latala’s paper.
As a corollary of Theorem 6, we see that if we have an iid matrix (or Wigner matrix) of mean zero whose entries have a fourth moment of , then the expected operator norm is . The fourth moment hypothesis is sharp. To see this, we make the trivial observation that the operator norm of a matrix bounds the magnitude of any of its coefficients, thus
or equivalently that
and so the right-hand side of (6) is asymptotically trivial. But with weaker hypotheses than the fourth moment hypothesis, the rate of convergence of to can be slower, and one can easily build examples for which the right-hand side of (6) is for every , which forces to typically be much larger than on the average.
Remark 3 The symmetrisation inequalities remain valid with the operator norm replaced by any other convex norm on the space of matrices. The results are also just as valid for rectangular matrices as for square ones.
— 3. Concentration of measure —
Consider a random matrix of the type considered in Corollary 4 (e.g. a random sign matrix). We now know that the operator norm is of size with overwhelming probability. But there is much more that can be said. For instance, by taking advantage of the convexity and Lipschitz properties of , we have the following quick application of Talagrand’s inequality (Theorem 9 from Notes 1):
Proposition 7 Let be as in Corollary 4. Then for any , one has
for some absolute constants , where is a median value for . The same result also holds with replaced by the expectation .
Proof: We view as a function of the independent complex variables , thus is a function from to . The convexity of the operator norm tells us that is convex. The triangle inequality, together with the elementary bound
Exercise 5 Establish a similar result for the matrices in Corollary 5.
From Corollary 4 we know that the median or expectation of is of size ; we now know that concentrates around this median to width at most . (This turns out to be non-optimal; the Tracy-Widom law actually gives a concentration of , under some additional assumptions on . Nevertheless this level of concentration is already non-trivial.)
However, this argument does not tell us much about what the median or expected value of actually is. For this, we will need to use other methods, such as the moment method which we turn to next.
Remark 4 Talagrand’s inequality, as formulated in these notes, relies heavily on convexity. Because of this, we cannot apply this argument directly to non-convex matrix statistics, such as singular values other than the largest singular value . Nevertheless, one can still use this inequality to obtain good concentration results, by using the convexity of related quantities, such as the partial sums ; see this paper of Meckes. Other approaches include the use of alternate large deviation inequalities, such as those arising from log-Sobolev inequalities (see e.g. this book by Guionnet), or by using more abstract versions of Talagrand’s inequality (see this paper of Alon, Krivelevich, and Vu).
— 4. The moment method —
We now bring the moment method to bear on the problem, starting with the easy moments and working one’s way up to the more sophisticated moments. It turns out that it is easier to work first with the case when is symmetric or Hermitian; we will discuss the non-symmetric case near the end of these notes.
and more generally
To put this another way, knowledge of the moment controls the operator norm up to a multiplicative factor of . Taking larger and larger , we should thus obtain more accurate control on the operator norm. (This is basically the philosophy underlying the power method.)
Remark 5 In most cases, one expects the eigenvalues to be reasonably uniformly distributed, in which case the upper bound in (9) is closer to the truth than the lower bound. One scenario in which this can be rigorously established is if it is known that the eigenvalues of all come with a high multiplicity. This is often the case for matrices associated with group actions (particularly those which are quasirandom in the sense of Gowers). However, this is usually not the case with most random matrix ensembles, and we must instead proceed by increasing as described above.
Let’s see how this method works in practice. The simplest case is that of the second moment , which in the Hermitian case works out to
The expression is easy to compute in practice. For instance, for the symmetric Bernoulli ensemble, this expression is exactly equal to . More generally, if we have a Wigner matrix in which all off-diagonal entries have mean zero and unit variance, and the diagonal entries have mean zero and bounded variance (this is the case for instance for GOE), then the off-diagonal entries have mean , and by the law of large numbers we see that this expression is almost surely asymptotic to . (There is of course a dependence between the upper triangular and lower triangular entries, but this is easy to deal with by folding the sum into twice the upper triangular portion (plus the diagonal portion, which is lower order).)
Exercise 6 If the have uniformly sub-exponential tail, show that we in fact have (10) with overwhelming probability.
Applying (9), we obtain the bounds
asymptotically almost surely. This is already enough to show that the median of is at least , which complements (up to constants) the upper bound of obtained from the epsilon net argument. But the upper bound here is terrible; we will need to move to higher moments to improve it.
Accordingly, we now turn to the fourth moment. For simplicity let us assume that all entries have zero mean and unit variance. To control moments beyond the second moment, we will also assume that all entries are bounded in magnitude by some . We expand
To understand this expression, we take expectations:
Firstly, there is the case when all the four edges are distinct. Then the four factors are independent; since we are assuming them to have mean zero, the term (12) vanishes. Indeed, the same argument shows that the only terms that do not vanish are those in which each edge is repeated at least twice. A short combinatorial case check then shows that, up to cyclic permutations of the indices there are now only a few types of cycles in which the term (12) does not automatically vanish:
- , but are distinct from each other and from .
- and .
- , but is distinct from .
In the first case, the independence and unit variance assumptions tells us that (12) is , and there are such terms, so the total contribution here to is at most . In the second case, the unit variance and bounded by tells us that the term is , and there are such terms, so the contribution here is . Similarly, the contribution of the third type of cycle is , and the fourth type of cycle is , so we can put it all together to get
In particular, if we make the hypothesis , then we have
and thus by Markov’s inequality we see that for any , with probability at least . Applying (9), this leads to the upper bound
with probability at least ; a similar argument shows that for any fixed , one has
with high probability. This is better than the upper bound obtained from the second moment method, but still non-optimal.
which in some sense improves upon (11) by a factor of . In particular, if , conclude that the median of is at least .
Now let us take a quick look at the sixth moment, again with the running assumption of a Wigner matrix in which all entries have mean zero, unit variance, and bounded in magnitude by . We have
a sum over cycles of length in . Again, most of the summands here vanish; the only ones which do not are those cycles in which each edge occurs at least twice (so in particular, there are at most three distinct edges).
Classifying all the types of cycles that could occur here is somewhat tedious, but it is clear that there are going to be different types of cycles. But we can organise things by the multiplicity of each edge, leaving us with four classes of cycles to deal with:
- Cycles in which there are three distinct edges, each occuring two times.
- Cycles in which there are two distinct edges, one occuring twice and one occuring four times.
- Cycles in which there are two distinct edges, each occuring three times. (Actually, this case ends up being impossible, due to a “bridges of Königsberg” type of obstruction, but we will retain it for this discussion.)
- Cycles in which a single edge occurs six times.
It is not hard to see that summands coming from the first type of cycle give a contribution of , and there are of these (because such cycles span at most four vertices). Similarly, the second and third types of cycles give a contribution of per summand, and there are summands; finally, the fourth type of cycle gives a contribution of , with summands. Putting this together we see that
so in particular if we assume as before, we have
and if we then use (9) as before we see that
with probability , for any ; so we are continuing to make progress towards what we suspect (from the epsilon net argument) to be the correct bound of .
In particular, if , conclude that the median of is at least . Thus this is a (slight) improvement over Exercise 7.
We divide the cycles into various classes, depending on which edges are equal to each other. (More formally, a class is an equivalence relation on a set of labels, say in which each equivalence class contains at least two elements, and a cycle of edges lies in the class associated to when we have that iff , where we adopt the cyclic notation .)
How many different classes could there be? We have to assign up to labels to edges, so a crude upper bound here is .
Now consider a given class of cycle. It has edges for some , with multiplicities , where are at least and add up to . The edges span at most vertices; indeed, in addition to the first vertex , one can specify all the other vertices by looking at the first appearance of each of the edges in the path from to , and recording the final vertex of each such edge. From this, we see that the total number of cycles in this particular class is at most . On the other hand, because each has mean zero, unit variance and is bounded by , the moment of this coefficient is at most for any . Thus each summand in (13) coming from a cycle in this class has magnitude at most
Thus the total contribution of this class to (13) is , which we can upper bound by
Summign up over all classes, we obtain the (somewhat crude) bound
and thus by (9)
and so by Markov’s inequality we have
for all . This, for instance, places the median of at . We can optimise this in by choosing to be comparable to , and so we obtain an upper bound of for the median; indeed, a slight tweaking of the constants tells us that with high probability.
The same argument works if the entries have at most unit variance rather than unit variance, thus we have shown
Proposition 8 (Weak upper bound) Let be a random Hermitian matrix, with the upper triangular entries , being independent with mean zero and variance at most , and bounded in magnitude by . Then with high probability.
When , this gives an upper bound of , which is still off by half a logarithm from the expected bound of . We will remove this half-logarithmic loss later in these notes.
— 5. Computing the moment to top order —
Now let us consider the case when , and each entry has variance exactly . We have an upper bound
let us try to get a more precise answer here (as in Exercises 7, 8). Recall that each class of cycle contributed a bound of to this expression. If , we see that such expressions are whenever , where the notation means that the decay rate as can depend on . So the total contribution of all such classes is .
Now we consider the remaining classes with . For such classes, each equivalence class of edges contains exactly two representatives, thus each edge is repeated exactly once. The contribution of each such cycle to (13) is exactly , thanks to the unit variance and independence hypothesis. Thus, the total contribution of these classes to is equal to a purely combinatorial quantity, namely the number of cycles of length on in which each edge is repeated exactly once, yielding unique edges. We are thus faced with the enumerative combinatorics problem of bounding this quantity as precisely as possible.
With edges, there are at most vertices traversed by the cycle. If there are fewer than vertices traversed, then there are at most cycles of this type, since one can specify such cycles by identifying up to vertices in and then matching those coordinates with the vertices of the cycle. So we set aside these cycles, and only consider those cycles which traverse exactly vertices. Let us call such cycles (i.e. cycles of length with each edge repeated exactly once, and traversing exactly vertices) non-crossing cycles of length in . Our remaining task is then to count the number of non-crossing cycles.
Exercise 9 Show that a cycle of length is non-crossing if and only if there exists a tree in of edges and vertices, such that the cycle lies in the tree and traverses each edge in the tree exactly twice.
Exercise 10 Let be a cycle of length . Arrange the integers around a circle, and draw a line segment between two distinct integers whenever . Show that the cycle is non-crossing if and only if the number of line segments is exactly , and the line segments do not cross each other. This may help explain the terminology “non-crossing”.
Now we can complete the count. If is a positive even integer, define a Dyck word of length to be the number of words consisting of left and right parentheses (, ) of length , such that when one reads from left to right, there are always at least as many left parentheses as right parentheses (or in other words, the parentheses define a valid nesting). For instance, the only Dyck word of length is , the two Dyck words of length are and , and the five Dyck words of length are
and so forth. (Dyck words are also closely related to Dyck paths.)
Lemma 9 The number of non-crossing cycles of length in is equal to , where is the number of Dyck words of length . (The number is also known as the Catalan number.)
Proof: We will give a bijective proof. Namely, we will find a way to store a non-crossing cycle as a Dyck word, together with an (ordered) sequence of distinct elements from , in such a way that any such pair of a Dyck word and ordered sequence generates exactly one non-crossing cycle. This will clearly give the claim.
So, let us take a non-crossing cycle . We imagine traversing this cycle from to , then from to , and so forth until we finally return to from . On each leg of this journey, say from to , we either use an edge that we have not seen before, or else we are using an edge for the second time. Let us say that the leg from to is an innovative leg if it is in the first category, and a returning leg otherwise. Thus there are innovative legs and returning legs. Clearly, it is only the innovative legs that can bring us to vertices that we have not seen before. Since we have to visit distinct vertices (including the vertex we start at), we conclude that each innovative leg must take us to a new vertex. We thus record, in order, each of the new vertices we visit, starting at and adding another vertex for each innovative leg; this is an ordered sequence of distinct elements of . Next, traversing the cycle again, we write down a whenever we traverse an innovative leg, and an otherwise. This is clearly a Dyck word. For instance, using the examples in Example 1, the non-crossing cycle gives us the ordered sequence and the Dyck word , while gives us the ordered sequence and the Dyck word .
We have seen that every non-crossing cycle gives rise to an ordered sequence and a Dyck word. A little thought shows that the cycle can be uniquely reconstructed from this ordered sequence and Dyck word (the key point being that whenever one is performing a returning leg from a vertex , one is forced to return along the unique innovative leg that discovered ). A slight variant of this thought also shows that every Dyck word of length and ordered sequence of distinct elements gives rise to a non-crossing cycle. This gives the required bijection, and the claim follows.
Next, we recall the classical formula for the Catalan number:
Exercise 11 Establish the recurrence
Exercise 12 Let be a positive even integer. Given a string of left parentheses and right parentheses which is not a Dyck word, define the reflection of this string by taking the first right parenthesis which does not have a matching left parenthesis, and then reversing all the parentheses after that right parenthesis. Thus, for instance, the reflection of is . Show that there is a bijection between non-Dyck words with left parentheses and right parentheses, and arbitrary words with left parentheses and right parentheses. Use this to give an alternate proof of (14).
Note that . Putting all the above computations together, we conclude
Theorem 10 (Moment computation) Let be a real symmetric random matrix, with the upper triangular elements , jointly independent with mean zero and variance one, and bounded in magnitude by . Let be a positive even integer. Then we have
where is given by (14).
Remark 6 An inspection of the proof also shows that if we allow the to have variance at most one, rather than equal to one, we obtain the upper bound
Exercise 13 Show that Theorem 10 also holds for Hermitian random matrices. (Hint: The main point is that with non-crossing cycles, each non-innovative leg goes in the reverse direction to the corresponding innovative leg – why?)
Remark 7 Theorem 10 can be compared with the formula
derived in Notes 1, where is the sum of iid random variables of mean zero and variance one, and
Exercise 10 shows that can be interpreted as the number of ways to join points on the circle by non-crossing chords. In a similar vein, can be interpreted as the number of ways to join points on the circle by chords which are allowed to cross each other (except at the endpoints). Thus moments of Wigner-type matrices are in some sense the “non-crossing” version of moments of sums of random variables. We will discuss this phenomenon more when we turn to free probability in later notes.
In the bounded case , we can combine this with Exercise 5 to conclude that the median (or mean) of is at least . On the other hand, from Stirling’s formula (Notes 0a) we see that converges to as . Taking to be a slowly growing function of , we conclude
Proposition 11 (Lower bound) Let be a real symmetric random matrix, with the upper triangular elements , jointly independent with mean zero and variance one, and bounded in magnitude by . Then the median (or mean) of is at least .
Remark 8 One can in fact obtain an exact asymptotic expansion of the moments as a polynomial in , known as the genus expansion of the moments. This expansion is however somewhat difficult to work with from a combinatorial perspective (except at top order) and will not be used here.
— 6. Removing the logarithm —
and hence by Markov’s inequality
Applying this with for some fixed , and setting to be a large multiple of , we see that asymptotically almost surely, which on selecting to grow slowly in gives in fact that asymptotically almost surely, thus complementing the lower bound in Proposition 11.
This argument was not rigorous because it did not address the error. Without a more quantitative accounting of this error, one cannot set as large as without losing control of the error terms; and indeed, a crude accounting of this nature will lose factors of which are unacceptable. Nevertheless, by tightening the hypotheses a little bit and arguing more carefully, we can get a good bound, for in the region of interest:
Theorem 12 (Improved moment bound) Let be a real symmetric random matrix, with the upper triangular elements , jointly independent with mean zero and variance one, and bounded in magnitude by (say). Let be a positive even integer of size (say). Then we have
where is given by (14). In particular, from the trivial bound (which is obvious from the Dyck words definition) one has
One can of course adjust the parameters and in the above theorem, but we have tailored these parameters for our application to simplify the exposition slightly.
Proof: We may assume large, as the claim is vacuous for bounded .
We again expand using (13), and discard all the cycles in which there is an edge that only appears once. The contribution of the non-crossing cycles was already computed in the previous section to be
which can easily be computed (e.g. by taking logarithms, or using Stirling’s formula) to be . So the only task is to show that the net contribution of the remaining cycles is .
Consider one of these cycles ; it has distinct edges for some (with each edge repeated at least once).
We order the distinct edges by their first appearance in the cycle. Let be the multiplicities of these edges, thus the are all at least and add up to . Observe from the moment hypotheses that the moment is bounded by for . Since , we conclude that the expression
in (13) has magnitude at most , and so the net contribution of the cycles that are not non-crossing is bounded in magnitude by
where range over integers that are at least and which add up to , and is the number of cycles that are not non-crossing and have distinct edges with multiplicity (in order of appearance). It thus suffices to show that (16) is .
Next, we estimate for a fixed . Given a cycle , we traverse its legs (which each traverse one of the edges ) one at a time and classify them into various categories:
- High-multiplicity legs, which use an edge whose multiplicity is larger than two.
- Fresh legs, which use an edge with for the first time.
- Return legs, which use an edge with that has already been traversed by a previous fresh leg.
We also subdivide fresh legs into innovative legs, which take one to a vertex one has not visited before, and non-innovative legs, which take one to a vertex that one has visited before.
At any given point in time when traversing this cycle, we define an available edge to be an edge of multiplicity that has already been traversed by its fresh leg, but not by its return leg. Thus, at any given point in time, one travels along either a high-multiplicity leg, a fresh leg (thus creating a new available edge), or one returns along an available edge (thus removing that edge from availability).
Call a return leg starting from a vertex forced if, at the time one is performing that leg, there is only one available edge from , and unforced otherwise (i.e. there are two or more available edges to choose from).
We assume that there are non-innovative legs among the fresh legs, leaving innovative legs. As the cycle is not non-crossing, we either have or .
Similarly, we assume that there are unforced return legs among the total return legs. We have an important estimate:
Proof: Let be a vertex visited by the cycle which is not the initial vertex . Then the very first arrival at comes from a fresh leg, which immediately becomes available. Each departure from may create another available edge from , but each subsequent arrival at will delete an available leg from , unless the arrival is along a non-innovative or high-multiplicity edge. (Note that one can loop from to itself and create an available edge, but this is along a non-innovative edge and so is not inconsistent with the previous statements.) Finally, any returning leg that departs from will also delete an available edge from .
This has two consequences. Firstly, if there are no non-innovative or high-multiplicity edges arriving at , then whenever one arrives at , there is at most one available edge from , and so every return leg from is forced. (And there will be only one such return leg.) If instead there are non-innovative or high-multiplicity edges arriving at , then we see that the total number of return legs from is at most one plus the number of such edges. In both cases, we conclude that the number of unforced return legs from is bounded by twice the number of non-innovative or high-multiplicity edges arriving at . Summing over , one obtains the claim.
Now we return to the task of counting , by recording various data associated to any given cycle contributing to this number. First, fix . We record the initial vertex of the cycle, for which there are possibilities. Next, for each high-multiplicity edge (in increasing order of ), we record all the locations in the cycle where this edge is used; the total number of ways this can occur for each such edge can be bounded above by , so the total entropy cost here is . We also record the final endpoint of the first occurrence of the edge for each such ; this list of vertices in has at most possibilities.
For each innovative leg, we record the final endpoint of that leg, leading to an additional list of vertices with at most possibilities.
For each non-innovative leg, we record the position of that leg, leading to a list of numbers from , which has at most possibilities. We also record the position of the first time one already visited the vertex that the non-innovative leg is returning to, leading to a further list of numbers from , and another possibilities.
For each unforced return leg, we record the position of the corresponding fresh leg, leading to a list of numbers from , which has at most possibilities.
Finally, we record a Dyck-like word of length , in which we place a whenever the leg is innovative, and otherwise (the brackets need not match here). The total entropy cost here can be bounded above by .
We now observe that all this data (together with ) can be used to completely reconstruct the original cycle. Indeed, as one traverses the cycle, the data already tells us which edges are high-multiplicity, which ones are innovative, which ones are non-innovative, and which ones are return legs. In all edges in which one could possibly visit a new vertex, the location of that vertex has been recorded. For all unforced returns, the data tells us which fresh leg to backtrack upon to return to. Finally, for forced returns, there is only one available leg to backtrack to, and so one can reconstruct the entire cycle from this data.
As a consequence, for fixed and , there are at most
Summing over the possible values of (recalling that we either have or , and also that ) we obtain
The expression (16) can then be bounded by
When is exactly , then all the must equal , and so the contribution of this case simplifies to . For , the numbers are non-negative and add up to , and so the total number of possible values for these numbers (for fixed ) can be bounded crudely by (for instance). Putting all this together, we can bound (16) by
which simplifies by the geometric series formula (and the hypothesis ) to
Theorem 14 (Weak Bai-Yin theorem, upper bound) Let be a real symmetric matrix whose entries all have the same distribution , with mean zero, variance one, and fourth moment . Then for every independent of , one has asymptotically almost surely. In particular, asymptotically almost surely; as another consequence, the median of is at most . (If is bounded, we see in particular from Proposition 11 that the median is in fact equal to .)
The fourth moment hypothesis is best possible, as seen in the discussion after Theorem 6. We will discuss some generalisations and improvements of this theorem in other directions below.
Proof: To obtain Theorem 14 from Theorem 12 we use the truncation method. We split each as in the usual manner, and split accordingly. We would like to apply Theorem 12 to , but unfortunately the truncation causes some slight adjustment to the mean and variance of the . The variance is not much of a problem; since had variance , it is clear that has variance at most , and it is easy to see that reducing the variance only serves to improve the bound (15). As for the mean, we use the mean zero nature of to write
To control the right-hand side, we use the trivial inequality and the bounded fourth moment hypothesis to conclude that
Thus we can write , where is the random matrix with coefficients
and is a matrix whose entries have magnitude . In particular, by Schur’s test this matrix has operator norm , and so by the triangle inequality
for any . In particular, we see from Markov’s inequality that (19) holds with probability at most
Setting to be a large enough multiple of (depending on ), we thus see that this event (19) indeed holds asymptotically almost surely (indeed, one can ensure it happens with overwhelming probability, by letting grow slowly to infinity).
Now we turn to (20). The idea here is to exploit the sparseness of the matrix . First let us dispose of the event that one of the entries has magnitude larger than (which would certainly cause (20) to fail). By the union bound, the probability of this event is at most
By the fourth moment bound on and dominated convergence, this expression goes to zero as . Thus, asymptotically almost surely, all entries are less than .
Now let us see how many non-zero entries there are in . By Markov’s inequality and the fourth moment hypothesis, each entry has a probability of being non-zero; by the first moment method, we see that the expected number of entries is . As this is much less than , we expect it to be unlikely that any row or column has more than one entry. Indeed, from the union bound and independence, we see that the probability that any given row and column has at least two non-zero entries is at most
and so by the union bound again, we see that with probability at least (and in particular, asymptotically almost surely), none of the rows or columns have more than one non-zero entry. As the entries have magnitude at most , the bound (20) now follows from Schur’s test, and the claim follows.
We can upgrade the asymptotic almost sure bound to almost sure boundedness:
Theorem 15 (Strong Bai-Yin theorem, upper bound) Let be a real random variable with mean zero, variance , and finite fourth moment, and for all , let be an iid sequence with distribution , and set . Let be the random matrix formed by the top left block. Then almost surely one has .
Exercise 14 By combining the above results with Proposition 11 and Exercise 5, show that with the hypotheses of Theorem 15 with bounded, one has almost surely. (The same claim is true without the boundedness hypothesis; we will see this in the next set of notes, after we establish the semicircle law.)
Proof: We first give ourselves an epsilon of room. It suffices to show that for each , one has
Next, we perform dyadic sparsification (as was done in the proof of the strong law of large numbers in Notes 1). Observe that any minor of a matrix has its operator norm bounded by that of the larger matrix; and so is increasing in . Because of this, it will suffice to show (21) almost surely for restricted to a lacunary sequence, such as for , as the general case then follows by rounding upwards to the nearest (and adjusting a little bit as necessary).
Once we sparsified, it is now safe to apply the Borel-Cantelli lemma (Exercise 1 of Notes 0), and it will suffice to show that
To bound the probabilities , we inspect the proof of Theorem 14. Most of the contributions to this probability decay polynomially in (i.e. are of the form for some ) and so are summable. The only contribution which can cause difficulty is the contribution of the event that one of the entries of exceeds in magnitude; this event was bounded by
But if one sums over using Fubini’s theorem and the geometric series formula, we see that this expression is bounded by , which is finite by hypothesis, and the claim follows.
Now we discuss some variants and generalisations of the Bai-Yin result.
Firstly, we note that the results stated above require the diagonal and off-diagonal terms to have the same distribution. This is not the case for important ensembles such as the Gaussian Orthogonal Ensemble (GOE), in which the diagonal entries have twice as much variance as the off-diagonal ones. But this can easily be handled by considering the diagonal separately. For instance, consider a diagonal matrix where the are identically distributed with finite second moment. The operator norm of this matrix is just , and so by the union bound
From the finite second moment and dominated convergence, the right-hand side is , and so we conclude that for for every fixed , asymptotically almost surely; diagonalising, we conclude that asymptotically almost surely. Because of this and the triangle inequality, we can modify the diagonal by any amount with identical distribution and bounded second moment (a similar argument also works for non-identical distributions if one has uniform control of some moment beyond the second, such as the fourth moment) while only affecting all operator norms by .
Exercise 15 Modify this observation to extend the weak and strong Bai-Yin theorems to the case where the diagonal entries are allowed to have different distribution than the off-diagonal terms, and need not be independent of each other or of the off-diagonal terms, but have uniformly bounded fourth moment.
Secondly, it is a routine matter to generalise the Bai-Yin result from real symmetric matrices to Hermitian matrices, basically for the same reasons that Exercise 13 works. We leave the details to the interested reader.
The Bai-Yin results also hold for iid random matrices, where has mean zero, unit variance, and bounded fourth moment; this is a result of Yin, Bai, and Krishnaiah. Because of the lack of symmetry, the eigenvalues need not be real, and the bounds (9) no longer apply. However, there is a substitute, namely the bound
Exercise 16 Prove (22).
It is possible to adapt all of the above moment calculations for in the symmetric or Hermitian cases to give analogous results for in the non-symmetric cases; we do not give the details here, but mention that the cycles now go back and forth along a bipartite graph with vertices in each class, rather than in the complete graph on vertices, although this ends up not to affect the enumerative combinatorics significantly. Another way of viewing this is through the simple observation that the operator norm of a non-symmetric matrix is equal to the operator norm of the augmented matrix
Exercise 17 If has singular values , show that has eigenvalues . This suggests that the theory of the singular values of an iid matrix should resemble to some extent the theory of eigenvalues of a Wigner matrix; we will see several examples of this phenomenon in later notes.
When one assumes more moment conditions on than bounded fourth moment, one can obtain substantially more precise asymptotics on than given by results such as Theorem 12, particularly if one also assumes that the underlying random variable is symmetric (i.e. ). At a practical level, the advantage of symmetry is that it allows one to assume that the high-multiplicity edges in a cycle are traversed an even number of times; see the following exercise.
Exercise 18 Let be a bounded real random variable. Show that is symmetric if and only if for all positive odd integers .
Next, extend the previous result to the case when is subgaussian rather than bounded. (Hint: The slickest way to do this is via the characteristic function and analytic continuation; it is also instructive to find a “real-variable” proof that avoids the use of this function.)
By using these methods, it is in fact possible to show that under various hypotheses, is concentrated in the range , and even to get a universal distribution for the normalised expression , known as the Tracy-Widom law. See this paper of Soshnikov for details. There has also been a number of subsequent variants and refinements of this result (as well as counterexamples when not enough moment hypotheses are assumed); see for instance these papers of Soshnikov, Soshnikov-Fyodorov, Ruzmaikina, Péché, Vu, Soshnikov-Péché, Péché, Khorunzhiy and Tao-Vu for an (incomplete) sample of the work in this area. (Similar results for some non-independent distributions are also available, see e.g. this paper of Deift and Gioev, which (like many of the other references just cited) builds upon the original work of Tracy and Widom that handled special ensembles such as GOE and GUE.)