This week I am at the American Institute of Mathematics, as an organiser on a workshop on the universality phenomenon in random matrices. There have been a number of interesting discussions so far in this workshop. Percy Deift, in a lecture on universality for invariant ensembles, gave some applications of what he only half-jokingly termed “the most important identity in mathematics”, namely the formula

\displaystyle  \hbox{det}( 1 + AB ) = \hbox{det}(1 + BA)

whenever {A, B} are {n \times k} and {k \times n} matrices respectively (or more generally, {A} and {B} could be linear operators with sufficiently good spectral properties that make both sides equal). Note that the left-hand side is an {n \times n} determinant, while the right-hand side is a {k \times k} determinant; this formula is particularly useful when computing determinants of large matrices (or of operators), as one can often use it to transform such determinants into much smaller determinants. In particular, the asymptotic behaviour of {n \times n} determinants as {n \rightarrow \infty} can be converted via this formula to determinants of a fixed size (independent of {n}), which is often a more favourable situation to analyse. Unsurprisingly, this trick is particularly useful for understanding the asymptotic behaviour of determinantal processes.
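As a quick sanity check, the identity is easy to verify numerically for random rectangular matrices; here is a small NumPy sketch (the sizes {n=8}, {k=3} are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
A = rng.standard_normal((n, k))   # an n x k matrix
B = rng.standard_normal((k, n))   # a k x n matrix

# an 8 x 8 determinant on the left, a 3 x 3 determinant on the right
lhs = np.linalg.det(np.eye(n) + A @ B)
rhs = np.linalg.det(np.eye(k) + B @ A)
print(lhs, rhs)   # agree up to floating-point error
```

Note how the left-hand side is a "large" determinant and the right-hand side a "small" one, exactly the reduction exploited in the asymptotic applications mentioned above.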

There are many ways to prove the identity. One is to observe first that when {A, B} are invertible square matrices of the same size, that {1+BA} and {1+AB} are conjugate to each other and thus clearly have the same determinant; a density argument then removes the invertibility hypothesis, and a padding-by-zeroes argument then extends the square case to the rectangular case. Another is to proceed via the spectral theorem, noting that {AB} and {BA} have the same non-zero eigenvalues.

By rescaling, one obtains the variant identity

\displaystyle  \hbox{det}( z + AB ) = z^{n-k} \hbox{det}(z + BA)

which essentially relates the characteristic polynomial of {AB} with that of {BA}. When {n=k}, a comparison of coefficients already gives important basic identities such as {\hbox{tr}(AB) = \hbox{tr}(BA)} and {\hbox{det}(AB) = \hbox{det}(BA)}; when {n} is not equal to {k}, an inspection of the {z^{n-k}} coefficient similarly gives the Cauchy-Binet formula (which, incidentally, is also useful when performing computations on determinantal processes).
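The rescaled identity can be checked numerically in the same way; since {AB} has rank at most {k}, its extra {n-k} eigenvalues are zero, which is where the factor {z^{n-k}} comes from (the value {z = 1.7} below is an arbitrary test point):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
A = rng.standard_normal((n, k))
B = rng.standard_normal((k, n))
z = 1.7   # an arbitrary nonzero scalar

lhs = np.linalg.det(z * np.eye(n) + A @ B)
rhs = z ** (n - k) * np.linalg.det(z * np.eye(k) + B @ A)
print(lhs, rhs)   # equal up to rounding: AB has only k nonzero eigenvalues

# one of the coefficient identities: tr(AB) = tr(BA)
print(np.trace(A @ B), np.trace(B @ A))
```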

Thanks to this formula (and with a crucial insight of Alice Guionnet), I was able to solve a problem (on outliers for the circular law) that I had in the back of my mind for a few months, and initially posed to me by Larry Abbott; I hope to talk more about this in a future post.

Today, though, I wish to talk about another piece of mathematics that emerged from an afternoon of free-form discussion that we managed to schedule within the AIM workshop. Specifically, we hammered out a heuristic model of the mesoscopic structure of the eigenvalues {\lambda_1 \leq \ldots \leq \lambda_n} of the {n \times n} Gaussian Unitary Ensemble (GUE), where {n} is a large integer. As is well known, the probability density of these eigenvalues is given by the Gibbs distribution

\displaystyle  \frac{1}{Z_n} e^{-H(\lambda)}\ d\lambda

where {d\lambda = d\lambda_1 \ldots d\lambda_n} is Lebesgue measure on the Weyl chamber {\{ (\lambda_1,\ldots,\lambda_n) \in {\bf R}^n: \lambda_1 \leq \ldots \leq \lambda_n \}}, {Z_n} is a constant, and the Hamiltonian {H} is given by the formula

\displaystyle  H(\lambda_1,\ldots,\lambda_n) := \sum_{j=1}^n \frac{\lambda_j^2}{2} - 2 \sum_{1 \leq i < j \leq n} \log |\lambda_i-\lambda_j|.

At the macroscopic scale of {\sqrt{n}}, the eigenvalues {\lambda_j} are distributed according to the Wigner semicircle law

\displaystyle  \rho_{sc}(x) := \frac{1}{2\pi} (4-x^2)_+^{1/2}.

Indeed, if one defines the classical location {\gamma_i^{cl}} of the {i^{th}} eigenvalue to be the unique solution in {[-2\sqrt{n}, 2\sqrt{n}]} to the equation

\displaystyle  \int_{-2}^{\gamma_i^{cl}/\sqrt{n}} \rho_{sc}(x)\ dx = \frac{i}{n}

then it is known that the random variable {\lambda_i} is quite close to {\gamma_i^{cl}}. Indeed, a result of Gustavsson shows that, in the bulk region when {\epsilon n < i < (1-\epsilon) n} for some fixed {\epsilon > 0}, {\lambda_i} is distributed asymptotically as a gaussian random variable with mean {\gamma_i^{cl}} and standard deviation {\sqrt{\frac{\log n}{\pi}} \times \frac{1}{\sqrt{n} \rho_{sc}(\gamma_i^{cl}/\sqrt{n})}}. Note that from the semicircular law, the factor {\frac{1}{\sqrt{n} \rho_{sc}(\gamma_i^{cl}/\sqrt{n})}} is the mean eigenvalue spacing.
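These facts are easy to see in a simulation. The sketch below samples one GUE matrix (normalised so that the eigenvalue density is the one above, with spectrum filling {[-2\sqrt{n}, 2\sqrt{n}]}) and compares a bulk eigenvalue against its classical location, obtained by inverting the semicircle CDF with a hand-rolled bisection:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# GUE normalised to match the Hamiltonian above (spectrum fills
# [-2 sqrt(n), 2 sqrt(n)]): diagonal entries N(0,1), off-diagonal
# real/imaginary parts N(0,1/2)
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = (A + A.conj().T) / 2
lam = np.linalg.eigvalsh(H)          # sorted ascending

def sc_cdf(t):
    """Cumulative distribution function of the semicircle law on [-2,2]."""
    return 0.5 + t * np.sqrt(4 - t * t) / (4 * np.pi) + np.arcsin(t / 2) / np.pi

def classical_location(i):
    """gamma_i^{cl}: invert sc_cdf at i/n by bisection, then rescale."""
    lo, hi = -2.0, 2.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if sc_cdf(mid) < i / n:
            lo = mid
        else:
            hi = mid
    return np.sqrt(n) * lo

i = n // 4
print(lam[i - 1], classical_location(i))   # within a few mean spacings
print(lam[-1], 2 * np.sqrt(n))             # edge eigenvalue near 2 sqrt(n)
```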

At the other extreme, at the microscopic scale of the mean eigenvalue spacing (which is comparable to {1/\sqrt{n}} in the bulk, but can be as large as {n^{-1/6}} at the edge), the eigenvalues are asymptotically distributed with respect to a special determinantal point process, namely the Dyson sine process in the bulk (and the Airy process on the edge), as discussed in this previous post.

Here, I wish to discuss the mesoscopic structure of the eigenvalues, which involves scales intermediate between the microscopic scale {1/\sqrt{n}} and the macroscopic scale {\sqrt{n}}, for instance in correlating the eigenvalues {\lambda_i} and {\lambda_j} in the regime {|i-j| \sim n^\theta} for some {0 < \theta < 1}. There is a surprising phenomenon here: there is quite a long-range correlation between such eigenvalues. The result of Gustavsson shows that both {\lambda_i} and {\lambda_j} behave asymptotically like gaussian random variables, but a further result from the same paper shows that the correlation between these two random variables is asymptotic to {1-\theta} (in the bulk, at least); thus, for instance, adjacent eigenvalues {\lambda_{i+1}} and {\lambda_i} are almost perfectly correlated (which makes sense, as their spacing is much less than either of their standard deviations), but even very distant eigenvalues, such as {\lambda_{n/4}} and {\lambda_{3n/4}}, have a correlation comparable to {1/\log n}. One way to get a sense of this is to look at the trace

\displaystyle  \lambda_1 + \ldots + \lambda_n.

This is also the sum of the diagonal entries of a GUE matrix, and is thus normally distributed with a variance of {n}. In contrast, each of the {\lambda_i} (in the bulk, at least) has a variance comparable to {\log n/n}. In order for these two facts to be consistent, the average correlation between pairs of eigenvalues then has to be of the order of {1/\log n}.
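A one-line Monte Carlo confirms this bookkeeping (in this normalisation the diagonal entries are iid standard real gaussians, so the off-diagonal entries never enter the trace):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 4000

# the trace is the sum of the n diagonal entries, which are iid N(0,1)
# real gaussians, so Var(trace) = n, even though each individual
# eigenvalue only has variance comparable to log n / n
traces = rng.standard_normal((trials, n)).sum(axis=1)
print(traces.var())   # close to n = 50
```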

Below the fold, I give a heuristic way to see this correlation, based on Taylor expansion of the convex Hamiltonian {H(\lambda)} around the minimum {\gamma}, which gives a conceptual probabilistic model for the mesoscopic structure of the GUE eigenvalues. While this heuristic is in no way rigorous, it does seem to explain many of the features currently known or conjectured about GUE, and looks likely to extend also to other models.

— 1. Fekete points —

It is easy to see that the Hamiltonian {H(\lambda)} is convex in the Weyl chamber, and goes to infinity on the boundary of this chamber, so it must have a unique minimum, at a set of points {\gamma = (\gamma_1,\ldots,\gamma_n)} known as the Fekete points. At the minimum, we have {\nabla H(\gamma) = 0}, which expands to become the set of conditions

\displaystyle  \gamma_j - 2 \sum_{i \neq j} \frac{1}{\gamma_j - \gamma_i} = 0 \ \ \ \ \ (1)

for all {1 \leq j \leq n}. To solve these conditions, we introduce the monic degree {n} polynomial

\displaystyle  P(x) := \prod_{i=1}^n (x-\gamma_i).

Differentiating this polynomial, we observe that

\displaystyle  P'(x) = P(x) \sum_{i=1}^n \frac{1}{x-\gamma_i} \ \ \ \ \ (2)

and

\displaystyle  P''(x) = P(x) \sum_{1 \leq i,j \leq n: i \neq j} \frac{1}{x-\gamma_i} \frac{1}{x-\gamma_j}.

Using the identity

\displaystyle  \frac{1}{x-\gamma_i} \frac{1}{x-\gamma_j} = \frac{1}{x-\gamma_i} \frac{1}{\gamma_i-\gamma_j} + \frac{1}{x-\gamma_j} \frac{1}{\gamma_j-\gamma_i}

followed by (1), we can rearrange this as

\displaystyle  P''(x) = P(x) \sum_{i=1}^n \frac{\gamma_i}{x-\gamma_i}.

Comparing this with (2), we conclude that

\displaystyle  P''(x) = x P'(x) - n P(x),

or in other words that {P} is the {n^{th}} Hermite polynomial

\displaystyle  P(x) = H_n(x) := (-1)^n e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2}.

Thus the Fekete points {\gamma_i} are nothing more than the zeroes of the {n^{th}} Hermite polynomial.

Heuristically, one can study these zeroes by looking at the function

\displaystyle  \phi(x) := P(x) e^{-x^2/4}

which solves the eigenfunction equation

\displaystyle  \phi''(x) + (n + \frac{1}{2} - \frac{x^2}{4}) \phi(x) = 0.

Comparing this equation with the harmonic oscillator equation {\phi''(x) + k^2 \phi(x) = 0}, which has plane wave solutions {\phi(x) = A \cos(kx+\theta)} for {k^2} positive and exponentially decaying solutions for {k^2} negative, we are led (heuristically, at least) to conclude that {\phi} is concentrated in the region where {n-\frac{x^2}{4}} is positive (i.e. inside the interval {[-2\sqrt{n},2\sqrt{n}]}) and will oscillate at frequency roughly {\sqrt{n-\frac{x^2}{4}}} inside this region. As such, we expect the Fekete points {\gamma_i} to obey the same spacing law as the classical locations {\gamma_i^{cl}}; indeed it is possible to show that {\gamma_i = \gamma_i^{cl} + O(1/\sqrt{n})} in the bulk (with some standard modifications at the edge). In particular, we have the heuristic

\displaystyle  \gamma_i - \gamma_j \approx (i-j)/\sqrt{n} \ \ \ \ \ (3)

for {i,j} in the bulk.
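Both heuristics are easy to test numerically: the zeroes of the (probabilists') Hermite polynomial, computed here with NumPy's `hermite_e` module, satisfy the Fekete conditions (1) to machine precision, and the central spacing is comparable to {1/\sqrt{n}} (indeed close to {\pi/\sqrt{n}}, since {\rho_{sc}(0) = 1/\pi}). A sketch:

```python
import numpy as np
from numpy.polynomial import hermite_e

n = 40
# roots of the probabilists' Hermite polynomial He_n
gamma = np.sort(hermite_e.hermeroots([0] * n + [1]))

# each root should satisfy gamma_j - 2 sum_{i != j} 1/(gamma_j - gamma_i) = 0
diffs = gamma[:, None] - gamma[None, :]
np.fill_diagonal(diffs, np.inf)          # exclude the i = j term
residual = gamma - 2 * (1.0 / diffs).sum(axis=1)
print(np.abs(residual).max())            # tiny (machine precision)

# central spacing, rescaled by sqrt(n): should be of order one
central_gap = gamma[n // 2] - gamma[n // 2 - 1]
print(central_gap * np.sqrt(n))          # close to pi at the centre
```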

Remark 1 If one works with the circular unitary ensemble (CUE) instead of the GUE, the Fekete points become equally spaced around the unit circle, so that this heuristic essentially becomes exact.

— 2. Taylor expansion —

Now we expand around the Fekete points by making the ansatz

\displaystyle  \lambda_i = \gamma_i + x_i,

thus Gustavsson’s result predicts that each {x_i} is normally distributed with standard deviation {O(\sqrt{\log n}/\sqrt{n})} (in the bulk). We Taylor expand

\displaystyle  H(\lambda) = H(\gamma) + \nabla H(\gamma)(x) + \frac{1}{2} \nabla^2 H(\gamma)(x,x) + \ldots.

We heuristically drop the cubic and higher order terms. The constant term {H(\gamma)} can be absorbed into the partition constant {Z_n}, while the linear term vanishes because {\nabla H(\gamma) = 0} at the Fekete points. We are thus led to a quadratic (i.e. gaussian) model

\displaystyle  \frac{1}{Z'_n} e^{-\frac{1}{2} \nabla^2 H(\gamma)(x,x)}\ dx

for the probability distribution of the shifts {x_i}, where {Z'_n} is the appropriate normalisation constant.

Direct computation allows us to expand the quadratic form {\frac{1}{2} \nabla^2 H(\gamma)} as

\displaystyle \frac{1}{2} \nabla^2 H(\gamma)(x,x) = \sum_{j=1}^n \frac{x_j^2}{2} + \sum_{1 \leq i < j \leq n} \frac{(x_i-x_j)^2}{(\gamma_i - \gamma_j)^2}.

The Taylor expansion is not particularly accurate when {j} and {i} are too close, say {j=i+O(\log^{O(1)} n)}, but we will ignore this issue as it should only affect the microscopic behaviour rather than the mesoscopic behaviour. This models the {x_i} as (coupled) gaussian random variables whose covariance matrix can in principle be explicitly computed by inverting the matrix of the quadratic form. Instead of doing this precisely, we shall instead work heuristically (and somewhat inaccurately) by re-expressing the quadratic form in the Haar basis. For simplicity, let us assume that {n} is a power of {2}. Then the Haar basis consists of the basis vector

\displaystyle  \psi_0 := \frac{1}{\sqrt{n}} (1,\ldots,1)

together with the basis vectors

\displaystyle  \psi_I := \frac{1}{\sqrt{|I|}} ( 1_{I_l} - 1_{I_r} )

for every discrete dyadic interval {I \subset \{1,\ldots,n\}} of length between {2} and {n}, where {I_l} and {I_r} are the left and right halves of {I}, and {1_{I_l}}, {1_{I_r} \in {\bf R}^n} are the vectors that are one on {I_l, I_r} respectively and zero elsewhere. These form an orthonormal basis of {{\bf R}^n}, thus we can write

\displaystyle  x = \xi_0 \psi_0 + \sum_I \xi_I \psi_I

for some coefficients {\xi_0, \xi_I}.

From orthonormality we have

\displaystyle  \sum_{j=1}^n x_j^2 = \xi_0^2 + \sum_I \xi_I^2

and we have

\displaystyle  \sum_{1 \leq i < j \leq n} \frac{(x_i-x_j)^2}{(\gamma_i - \gamma_j)^2} = \sum_{I, J} \xi_I \xi_J c_{I,J}

where the matrix coefficients {c_{I,J}} are given by

\displaystyle  c_{I,J} := \sum_{1 \leq i < j \leq n} \frac{(\psi_I(i) - \psi_I(j))(\psi_J(i) - \psi_J(j))}{(\gamma_i - \gamma_j)^2}.

A standard heuristic wavelet computation using (3) suggests that {c_{I,J}} is small unless {I} and {J} are actually equal, in which case one has

\displaystyle  c_{I,I} \sim \frac{n}{|I|}

(in the bulk, at least). Actually, the decay of the {c_{I,J}} away from the diagonal {I=J} is not so large, because the Haar wavelets {\psi_I} have poor moment and regularity properties. But one could in principle use much smoother and much more balanced wavelets, in which case the decay should be much faster.
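One can verify the orthonormality of the Haar basis and get a feel for the size of the diagonal coefficients {c_{I,I}} with a small computation, using the bulk heuristic (3) in the crude form {\gamma_i = i/\sqrt{n}} (a toy assumption which ignores the edge and the density factor):

```python
import numpy as np

n = 64  # a power of two

def haar_basis(n):
    """Orthonormal Haar basis of R^n: the constant vector psi_0 plus
    one wavelet psi_I per dyadic interval I of length >= 2."""
    vecs = [np.ones(n) / np.sqrt(n)]
    lengths = []
    size = n
    while size >= 2:
        for start in range(0, n, size):
            v = np.zeros(n)
            v[start: start + size // 2] = 1.0
            v[start + size // 2: start + size] = -1.0
            vecs.append(v / np.sqrt(size))
            lengths.append(size)
        size //= 2
    return np.array(vecs), lengths

Q, lengths = haar_basis(n)
print(np.allclose(Q @ Q.T, np.eye(n)))   # True: orthonormal basis

# c_{I,I} with the toy bulk spacing gamma_i = i / sqrt(n)
gamma = np.arange(n) / np.sqrt(n)
i, j = np.triu_indices(n, k=1)
inv_d2 = 1.0 / (gamma[i] - gamma[j]) ** 2

def c_diag(psi):
    d = psi[i] - psi[j]
    return (d * d * inv_d2).sum()

# c_{I,I} * |I| / n stays bounded (up to the logarithmic corrections
# caused by the poor regularity of Haar wavelets), consistent with
# the heuristic c_{I,I} ~ n / |I|
for size in (2, 8, 32):
    idx = 1 + lengths.index(size)   # first wavelet of this length
    print(size, c_diag(Q[idx]) * size / n)
```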

This suggests that the GUE distribution could be modeled by the distribution

\displaystyle  \frac{1}{Z'_n} e^{-\xi_0^2/2} e^{- C \sum_I \frac{n}{|I|} \xi_I^2} d\xi \ \ \ \ \ (4)

for some absolute constant {C}; thus we may model {\xi_0 \equiv N(0,1)} and {\xi_I \equiv C' \frac{\sqrt{|I|}}{\sqrt{n}} g_I} for some iid gaussians {g_I \equiv N(0,1)} independent of {\xi_0}. We then have as a model

\displaystyle  x_i = \frac{\xi_0}{\sqrt{n}} + \frac{C'}{\sqrt{n}} \sum_I (1_{I_l}(i) - 1_{I_r}(i)) g_I

for the fluctuations of the eigenvalues (in the bulk, at least), leading of course to the model

\displaystyle  \lambda_i = \gamma_i + \frac{\xi_0}{\sqrt{n}} + \frac{C'}{\sqrt{n}} \sum_I (1_{I_l}(i) - 1_{I_r}(i)) g_I \ \ \ \ \ (5)

for the fluctuations themselves. This model does not capture the microscopic behaviour of the eigenvalues such as the sine kernel (indeed, as noted before, the contribution of the very short {I} (which corresponds to very small values of {|j-i|}) is inaccurate), but appears to be a good model to describe the mesoscopic behaviour. For instance, observe that for each {i} there are {\sim \log n} independent normalised gaussians in the above sum, and so this model is consistent with the result of Gustavsson that each {\lambda_i} is gaussian with standard deviation {\sim \frac{\sqrt{\log n}}{\sqrt{n}}}. Also, if {|i-j| \sim n^\theta}, then the expansions (5) of {\lambda_i, \lambda_j} share about {(1-\theta) \log n} of the {\log n} terms in the sum in common, which is consistent with the further result of Gustavsson that the correlation between such eigenvalues is comparable to {1-\theta}.
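One can check these consistency claims directly on the model (5). With {C' = 1} (an arbitrary normalisation chosen for illustration), each index lies in exactly {\log_2 n} dyadic intervals, one of each length, so {\hbox{Var}(x_i) = (1+\log_2 n)/n} exactly, while the covariance of {x_i} and {x_j} picks up one {\pm 1/n} from each dyadic interval containing both indices (with sign according to whether the two indices fall in the same or opposite halves); for {|i-j| \sim n^\theta} there are about {(1-\theta) \log_2 n} such intervals, all but {O(1)} of them contributing {+1/n}. A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 1024, 20000
L = int(np.log2(n))   # each index lies in exactly L dyadic intervals

def haar_sign(idx, size):
    """+1 if idx falls in the left half of its dyadic interval of this
    length, -1 if it falls in the right half."""
    return 1.0 if (idx % size) < size // 2 else -1.0

def sample_x(indices):
    """Monte Carlo samples of x_i from model (5), with C' = 1 (an
    arbitrary normalisation chosen for illustration)."""
    xs = np.zeros((trials, len(indices)))
    xs += rng.standard_normal(trials)[:, None] / np.sqrt(n)   # xi_0 term
    for k in range(1, L + 1):
        size = 2 ** k
        # draw one gaussian g_I per dyadic interval actually needed;
        # indices in the same interval share the same gaussian
        g = {c: rng.standard_normal(trials)
             for c in {idx // size for idx in indices}}
        for c, idx in enumerate(indices):
            xs[:, c] += haar_sign(idx, size) * g[idx // size] / np.sqrt(n)
    return xs

i, j = 512, 544                   # |i - j| = 32 = n^theta with theta = 1/2
xs = sample_x([i, j])
print(xs[:, 0].var() * n)         # close to 1 + L = 11

# model covariance: one signed unit per dyadic interval containing both
pred = 1 + sum(haar_sign(i, 2 ** k) * haar_sign(j, 2 ** k)
               for k in range(1, L + 1) if i // 2 ** k == j // 2 ** k)
corr = np.corrcoef(xs[:, 0], xs[:, 1])[0, 1]
print(corr, pred / (1 + L))       # Monte Carlo vs exact model value
```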

If one looks at the gap {\lambda_{i+1}-\lambda_i} using (5) (and replacing the Haar cutoff {1_{I_l}(i) - 1_{I_r}(i)} by something smoother for the purposes of computing the gap), one is led to a heuristic of the form

\displaystyle  \lambda_{i+1}-\lambda_i = \frac{1}{\rho_{sc}(\gamma_i/\sqrt{n})} \frac{1}{\sqrt{n}} + \frac{C'}{\sqrt{n}} \sum_I (1_{I_l}(i) - 1_{I_r}(i)) \frac{g_I}{|I|}.

The dominant terms here are the first term and the contribution of the very short intervals {I}. At present, this model cannot be accurate, because it predicts that the gap can sometimes be negative; the contribution of the very short intervals must instead be replaced by some other model that gives sine process behaviour, but we do not know of an easy way to set up a plausible such model.

On the other hand, the model suggests that the gaps are largely decoupled from each other, and have gaussian tails. Standard heuristics then suggest that of the {\sim n} gaps in the bulk, the largest one should be comparable to {\sqrt{\frac{\log n}{n}}}, which was indeed established recently by Ben Arous and Bourgade.

Given any probability measure {\mu = \rho\ dx} on {{\bf R}^n} (or on the Weyl chamber) with a smooth nonzero density, one can create an associated heat flow on other smooth probability measures {f\ dx} by performing gradient flow with respect to the Dirichlet form

\displaystyle  D(f\ dx) := \frac{1}{2} \int_{{\bf R}^n} |\nabla \frac{f}{\rho}|^2\ d\mu.

Using the ansatz (4), this flow decouples into a system of independent Ornstein-Uhlenbeck processes

\displaystyle  d\xi_0 = - \xi_0 dt + dW_0

and

\displaystyle  d g_I = C'' \frac{n}{|I|} ( - g_I dt + dW_I )

where {dW_0, dW_I} are independent Wiener processes (i.e. Brownian motions). This is a toy model for the Dyson Brownian motion. In this model, we see that the mixing time for each {g_I} is {O(|I|/n)}; thus, the large-scale variables ({g_I} for large {I}) evolve very slowly by Dyson Brownian motion, taking as long as {O(1)} to reach equilibrium, while the fine scale modes ({g_I} for small {I}) can achieve equilibrium in as brief a time as {O(1/n)}, with the intermediate modes taking an intermediate amount of time to reach equilibrium. It is precisely this picture that underlies the Erdos-Schlein-Yau approach to universality for Wigner matrices via the local equilibrium flow. In that approach, the measure (4) is given an additional (artificial) weight, roughly of the shape {e^{- n^{1-\epsilon} (\xi_0^2 + \sum_I \xi_I^2)}}, in order to make equilibrium achieved globally in just time {O(n^{-1+\epsilon})}; this leads to a local log-Sobolev type inequality that ensures convergence of the local statistics once one controls a Dirichlet form connected to the local equilibrium measure. One can then use the localisation of eigenvalues provided by a local semicircle law to control that Dirichlet form in turn for measures that have undergone Dyson Brownian motion.
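As a final sanity check on these time scales, one can simulate a single Ornstein-Uhlenbeck mode with relaxation rate {\theta \sim n/|I|} and observe the autocorrelation {e^{-\theta t}}: fine modes decorrelate quickly, coarse modes slowly. In the Euler-Maruyama sketch below the noise is rescaled so that the stationary variance is {1/2}, purely for illustration; the autocorrelation rate does not depend on that choice:

```python
import numpy as np

rng = np.random.default_rng(7)

def ou_autocorr(theta, lag, dt=1e-3, paths=20000):
    """Start in equilibrium (variance 1/2 for dg = -theta g dt + sqrt(theta) dW)
    and estimate corr(g_0, g_lag) by Euler-Maruyama; should be ~ exp(-theta*lag)."""
    g0 = rng.standard_normal(paths) * np.sqrt(0.5)
    g = g0.copy()
    for _ in range(round(lag / dt)):
        g += -theta * g * dt + np.sqrt(theta * dt) * rng.standard_normal(paths)
    return np.corrcoef(g0, g)[0, 1]

# fine-scale mode: theta = n/|I| large  ->  decorrelates quickly
# coarse-scale mode: theta small        ->  still correlated at the same lag
r_fast = ou_autocorr(theta=50.0, lag=0.05)
r_slow = ou_autocorr(theta=1.0, lag=0.05)
print(r_fast, r_slow)   # roughly exp(-2.5) and exp(-0.05)
```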