You are currently browsing the monthly archive for February 2010.

Our study of random matrices, to date, has focused on somewhat general ensembles, such as iid random matrices or Wigner random matrices, in which the distribution of the individual entries of the matrices was essentially arbitrary (as long as certain moments, such as the mean and variance, were normalised). In these notes, we now focus on two much more special, and much more symmetric, ensembles:

<ul> <li> The <a href=””>Gaussian Unitary Ensemble</a> (GUE), which is an ensemble of random {n \times n} Hermitian matrices {M_n} in which the upper-triangular entries are iid with distribution {N(0,1)_{\bf C}}, and the diagonal entries are iid with distribution {N(0,1)_{\bf R}}, and independent of the upper-triangular ones; and <li> The <em>Gaussian random matrix ensemble</em>, which is an ensemble of random {n \times n} (non-Hermitian) matrices {M_n} whose entries are iid with distribution {N(0,1)_{\bf C}}.

The symmetric nature of these ensembles will allow us to compute the spectral distribution by exact algebraic means, revealing a surprising connection with orthogonal polynomials and with determinantal processes. This will, for instance, recover the semi-circular law for GUE, but will also reveal <em>fine</em> spacing information, such as the distribution of the gap between <em>adjacent</em> eigenvalues, which is largely out of reach of tools such as the Stieltjes transform method and the moment method (although the moment method, with some effort, is able to control the extreme edges of the spectrum).
Similarly, we will see for the first time the <em>circular law</em> for eigenvalues of non-Hermitian matrices.
There are a number of other highly symmetric ensembles which can also be treated by the same methods, most notably the Gaussian Orthogonal Ensemble (GOE) and the Gaussian Symplectic Ensemble (GSE). However, for simplicity we shall focus just on the above two ensembles. For a systematic treatment of these ensembles, see the <a href=”″>text by Deift</a>.

<p align=center><b> &mdash; 1. The spectrum of GUE &mdash; </b></p>

We have already shown using Dyson Brownian motion in <a href=””>Notes 3b</a> that we have the <a href=”″>Ginibre formula</a> <a name=”rhod”><p align=center>\displaystyle \rho_n(\lambda) = \frac{1}{(2\pi)^{n/2} 1! \ldots (n-1)!} e^{-|\lambda|^2/2} |\Delta_n(\lambda)|^2 \ \ \ \ \ (1)</p>
</a> for the density function of the eigenvalues {(\lambda_1,\ldots,\lambda_n) \in {\bf R}^n_\geq} of a GUE matrix {M_n}, where <p align=center>\displaystyle \Delta_n(\lambda) = \prod_{1 \leq i < j \leq n} (\lambda_i - \lambda_j)</p>
is the <a href=””>Vandermonde determinant</a>. We now give an alternate proof of this result (omitting the exact value of the normalising constant {\frac{1}{(2\pi)^{n/2} 1! \ldots (n-1)!}}) that exploits unitary invariance and the change of variables formula (the latter of which we shall do from first principles). The one thing to be careful about is that one has to somehow quotient out by the invariances of the problem before being able to apply the change of variables formula.
One approach here would be to artificially “fix a gauge” and work on some slice of the parameter space which is “transverse” to all the symmetries. With such an approach, one can use the classical change of variables formula. While this can certainly be done, we shall adopt a more “gauge-invariant” approach and carry the various invariances with us throughout the computation. (For a comparison of the two approaches, see <a href=””>this previous blog post</a>.)
We turn to the details. Let {V_n} be the space of Hermitian {n \times n} matrices, then the distribution {\mu_{M_n}} of a GUE matrix {M_n} is a absolutely continuous probability measure on {V_n}, which can be written using the definition of GUE as <p align=center>\displaystyle \mu_{M_n} = C_n (\prod_{1 \leq i < j \leq n} e^{-|\xi_{ij}|^2}) (\prod_{1 \leq i \leq n} e^{-|\xi_{ii}|^2/2})\ dM_n</p>
where {dM_n} is Lebesgue measure on {V}, {\xi_{ij}} are the coordinates of {M_n}, and {C_n} is a normalisation constant (the exact value of which depends on how one normalises Lebesgue measure on {V}). We can express this more compactly as <p align=center>\displaystyle \mu_{M_n} = C_n e^{- \hbox{tr}(M_n^2) / 2}\ dM_n.</p>
Expressed this way, it is clear that the GUE ensemble is invariant under conjugations {M_n \mapsto U M_n U^{-1}} by any unitary matrix.
Let {D} be the diagonal matrix whose entries {\lambda_1 \geq \ldots \geq \lambda_n} are the eigenvalues of {M_n} in descending order. Then we have {M_n = U D U^{-1}} for some unitary matrix {U \in U(n)}. The matirx {U} is not uniquely determined; if {R} is diagonal unitary matrix, then {R} commutes with {D}, and so one can freely replace {U} with {UR}. On the other hand, if the eigenvalues of {M} are simple, then the diagonal matrices are the <em>only</em> matrices that commute with {D}, and so this freedom to right-multiply {U} by diagonal unitaries is the only failure of uniqueness here. And in any case, from the unitary invariance of GUE, we see that even after conditioning on {D}, we may assume without loss of generality that {U} is drawn from the invariant Haar measure on {U(n)}. In particular, {U} and {D} can be taken to be independent.
Fix a diagonal matrix {D_0 = \hbox{diag}(\lambda_1^0,\ldots,\lambda_n^0)} for some {\lambda_1^0 > \ldots > \lambda_n^0}, let {\varepsilon > 0} be extremely small, and let us compute the probability <a name=”dprop”><p align=center>\displaystyle \mathop{\bf P}( \| M_n - D_0 \|_F \leq \varepsilon ) \ \ \ \ \ (2)</p>
</a> that {M_n} lies within {\varepsilon} of {D_0} in the Frobenius norm. On the one hand, the probability density of {M_n} is proportional to <p align=center>\displaystyle e^{-\hbox{tr}(D_0^2)/2} = e^{-|\lambda^0|^2/2}</p>
near {D_0} (where we write {\lambda^0 := (\lambda_1^0,\ldots,\lambda_n^0)}) and the volume of a ball of radius {\varepsilon} in the {n^2}-dimensional space {V_n} is proportional to {\varepsilon^{n^2}}, so <a href=”#dprop”>(2)</a> is equal to <a name=”ciprian”><p align=center>\displaystyle (C'_n + o(1)) \varepsilon^{n^2} e^{-\hbox{tr}(D_0^2)/2} \ \ \ \ \ (3)</p>
</a> for some constant {C'_n > 0} depending only on {n}, where {o(1)} goes to zero as {\varepsilon \rightarrow 0} (keeping {n} and {D_0} fixed). On the other hand, if {\|M_n - D_0\|_F \leq \varepsilon}, then by the Weyl inequality (or Hoffman-Weilandt inequality) we have {D = D_0 + O(\varepsilon)} (we allow implied constants here to depend on {n} and on {D_0}). This implies {UDU^{-1} = D + O(\varepsilon)}, thus {UD - DU = O(\varepsilon)}. As a consequence we see that the off-diagonal elements of {U} are of size {O(\varepsilon)}. We can thus use the inverse function theorem in this local region of parameter space and make the ansatz <p align=center>\displaystyle D = D_0 + \varepsilon E; \quad U = \exp( \varepsilon S ) R</p>
where {E} is a bounded diagonal matrix, {R} is a diagonal unitary matrix, and {S} is a bounded skew-adjoint matrix with zero diagonal. (Note here the emergence of the freedom to right-multiply {U} by diagonal unitaries.) Note that the map {(R,S) \mapsto \exp( \varepsilon S ) R} has a non-degenerate Jacobian, so the inverse function theorem applies to uniquely specify {R, S} (and thus {E}) from {U, D} in this local region of parameter space.
Conversely, if {D, U} take the above form, then we can Taylor expand and conclude that <p align=center>\displaystyle M_n = UDU^* = D_0 + \varepsilon E + \varepsilon (SD_0 - D_0S) + O(\varepsilon^2) </p>
and so <p align=center>\displaystyle \| M_n - D_0 \|_F = \varepsilon \| E + (SD_0 - D_0S)\|_F + O(\varepsilon^2).</p>
We can thus bound <a href=”#dprop”>(2)</a> from above and below by expressions of the form <a name=”esd”><p align=center>\displaystyle \mathop{\bf P}( \| E + (SD_0 - D_0S)\|_F \leq 1 + O(\varepsilon) ). \ \ \ \ \ (4)</p>
</a> As {U} is distributed using Haar measure on {U(n)}, {S} is (locally) distributed using {(1+o(1)) \varepsilon^{n^2-n}} times a constant multiple of Lebesgue measure on the space {W} of skew-adjoint matrices with zero diagonal, which has dimension {n^2-n}. Meanwhile, {E} is distributed using {(\rho_n(\lambda^0)+o(1)) \varepsilon^{n}} times Lebesgue measure on the space of diagonal elements. Thus we can rewrite <a href=”#esd”>(4)</a> as <p align=center>\displaystyle C''_n \varepsilon^{n^2} (\rho_n(\lambda^0)+o(1)) \int\int_{\| E + (SD_0 - D_0S)\|_F \leq 1 + O(\varepsilon)}\ dE dS</p>
where {dE} and {dS} denote Lebesgue measure and {C''_n > 0} depends only on {n}.
Observe that the map {S \mapsto SD_0-D_0 S} dilates the (complex-valued) {ij} entry of {S} by {\lambda^0_j-\lambda^0_i}, and so the Jacobian of this map is {\prod_{1 \leq i < j \leq n} |\lambda^0_j-\lambda^0_i|^2 = |\Delta_n(\lambda^0)|^2}. Applying the change of variables, we can express the above as <p align=center>\displaystyle C''_n \varepsilon^{n^2} \frac{\rho_n(\lambda^0)+o(1)}{|\Delta_n(\lambda^0)|^2} \int\int_{\| E + S\|_F \leq 1 + O(\varepsilon)}\ dE dS.</p>
The integral here is of the form {C'''_n + O(\varepsilon)} for some other constant {C'''_n > 0}. Comparing this formula with <a href=”#ciprian”>(3)</a> we see that <p align=center>\displaystyle \rho_n(\lambda^0)+o(1) = C''''_n e^{-|\lambda^0|^2/2} \Delta_n(\lambda^0)|^2 + o(1)</p>
for yet another constant {C''''_n > 0}. Sending {\varepsilon \rightarrow 0} we recover an exact formula <p align=center>\displaystyle \rho_n(\lambda)+o(1) = C''''_n e^{-|\lambda|^2/2} |\Delta_n(\lambda)|^2 </p>
when {\lambda} is simple. Since almost all Hermitian matrices have simple spectrum (see Exercise 10 of <a href=””>Notes 3a</a>), this gives the full spectral distribution of GUE, except for the issue of the unspecified constant.

<blockquote><b>Remark 1</b> In principle, this method should also recover the explicit normalising constant {\frac{1}{(2\pi)^{n/2}}} in <a href=”#rhod”>(1)</a>, but to do this it appears one needs to understand the volume of the fundamental domain of {U(n)} with respect to the logarithm map, or equivalently to understand the volume of the unit ball of Hermitian matrices in the operator norm. I do not know of a simple way to compute this quantity (though it can be inferred from <a href=”#rhod”>(1)</a> and the above analysis). One can also recover the normalising constant through the machinery of determinantal processes, see below. </blockquote>



<blockquote><b>Remark 2</b> <a name=”daft”></a> The above computation can be generalised to other {U(n)}-conjugation-invariant ensembles {M_n} whose probability distribution is of the form <p align=center>\displaystyle \mu_{M_n} = C_n e^{- \hbox{tr} V(M_n)}\ dM_n</p>
for some potential function {V: {\bf R} \rightarrow {\bf R}} (where we use the spectral theorem to define {V(M_n)}), yielding a density function for the spectrum of the form <p align=center>\displaystyle \rho_n(\lambda) = C'_n e^{- \sum_{j=1}^n V(\lambda_j)} |\Delta_n(\lambda)|^2.</p>
Given suitable regularity conditions on {V}, one can then generalise many of the arguments in these notes to such ensembles. See <a href=”″>this book by Deift</a> for details. </blockquote>



<p align=center><b> &mdash; 2. The spectrum of gaussian matrices &mdash; </b></p>

The above method also works for gaussian matrices {G}, as was first observed by Dyson (though the final formula was first obtained by Ginibre, using a different method). Here, the density function is given by <a name=”can”><p align=center>\displaystyle C_n e^{-\hbox{tr}( GG^*)} dG = C_n e^{-\|G\|_F^2} dG \ \ \ \ \ (5)</p>
</a> where {C_n > 0} is a constant and {dG} is Lebesgue measure on the space {M_n({\bf C})} of all complex {n \times n} matrices. This is invariant under both left and right multiplication by unitary matrices, so in particular is invariant under unitary conjugations as before.
This matrix {G} has {n} complex (generalised) eigenvalues {\sigma(G) = \{\lambda_1,\ldots,\lambda_n\}}, which are usually distinct:

<blockquote><b>Exercise 3</b> Let {n \geq 2}. Show that the space of matrices in {M_n({\bf C})} with a repeated eigenvalue has codimension {2}. </blockquote>


Unlike the Hermitian situation, though, there is no natural way to order these {n} complex eigenvalues. We will thus consider all {n!} possible permutations at once, and define the spectral density function {\rho_n(\lambda_1,\ldots,\lambda_n)} of {G} by duality and the formula <p align=center>\displaystyle \int_{{\bf C}^n} F(\lambda) \rho_n(\lambda)\ d\lambda := \mathop{\bf E} \sum_{\{\lambda_1,\ldots,\lambda_n\} = \sigma(G)} F(\lambda_1,\ldots,\lambda_n)</p>
for all test functions {F}. By the Riesz representation theorem, this uniquely defines {\rho_n} (as a distribution, at least), although the total mass of {\rho_n} is {n!} rather than {1} due to the ambiguity in the spectrum.
Now we compute {\rho_n} (up to constants). In the Hermitian case, the key was to use the factorisation {M_n = UDU^{-1}}. This particular factorisation is of course unavailable in the non-Hermitian case. However, if the non-Hermitian matrix {G} has simple spectrum, it can always be factored instead as {G = UTU^{-1}}, where {U} is unitary and {T} is upper triangular. Indeed, if one applies the <a href=””>Gram-Schmidt process</a> to the eigenvectors of {G} and uses the resulting orthonormal basis to form {U}, one easily verifies the desired factorisation. Note that the eigenvalues of {G} are the same as those of {T}, which in turn are just the diagonal entries of {T}.

<blockquote><b>Exercise 4</b> Show that this factorisation is also available when there are repeated eigenvalues. (Hint: use the <a href=””>Jordan normal form</a>.) </blockquote>


To use this factorisation, we first have to understand how unique it is, at least in the generic case when there are no repeated eigenvalues. As noted above, if {G = UTU^{-1}}, then the diagonal entries of {T} form the same set as the eigenvalues of {G}.
Now suppose we fix the diagonal {\lambda_1,\ldots,\lambda_n} of {T}, which amounts to picking an ordering of the {n} eigenvalues of {G}. The eigenvalues of {T} are {\lambda_1,\ldots,\lambda_n}, and furthermore for each {1 \leq j \leq n}, the eigenvector of {T} associated to {\lambda_j} lies in the span of the last {n-j+1} basis vectors {e_j,\ldots,e_n} of {{\bf C}^n}, with a non-zero {e_j} coefficient (as can be seen by Gaussian elimination or Cramer’s rule). As {G=UTU^{-1}} with {U} unitary, we conclude that for each {1 \leq j \leq n}, the {j^{th}} column of {U} lies in the span of the eigenvectors associated to {\lambda_j,\ldots,\lambda_n}. As these columns are orthonormal, they must thus arise from applying the Gram-Schmidt process to these eigenvectors (as discussed earlier). This argument also shows that once the diagonal entries {\lambda_1,\ldots,\lambda_n} of {T} are fixed, each column of {U} is determined up to rotation by a unit phase. In other words, the only remaining freedom is to replace {U} by {UR} for some unit diagonal matrix {R}, and then to replace {T} by {R^{-1}TR} to counterbalance this change of {U}.
To summarise, the factorisation {G = UTU^{-1}} is unique up to specifying an enumeration {\lambda_1,\ldots,\lambda_n} of the eigenvalues of {G} and right-multiplying {U} by diagonal unitary matrices, and then conjugating {T} by the same matrix. Given a matrix {G}, we may apply these symmetries randomly, ending up with a random enumeration {\lambda_1,\ldots,\lambda_n} of the eigenvalues of {G} (whose distribution invariant with respect to permutations) together with a random factorisation {UTU^{-1}} such that {T} has diagonal entries {\lambda_1,\ldots,\lambda_n} in that order, and the distribution of {T} is invariant under conjugation by diagonal unitary matrices. Also, since {G} is itself invariant under unitary conjugations, we may also assume that {U} is distributed uniformly according to the Haar measure of {U(n)}, and independently of {T}.
To summarise, the gaussian matrix ensemble {G}, together with a randomly chosen enumeration {\lambda_1,\ldots,\lambda_n} of the eigenvalues, can almost surely be factorised as {UTU^{-1}}, where {T = ( t_{ij} )_{1 \leq i \leq j \leq n}} is an upper-triangular matrix with diagonal entries {\lambda_1,\ldots,\lambda_n}, distributed according to some distribution <p align=center>\displaystyle \psi( (t_{ij})_{1 \leq i \leq j \leq n} )\ \prod_{1 \leq i \leq j \leq n} dt_{ij}</p>
which is invariant with respect to conjugating {T} by diagonal unitary matrices, and {U} is uniformly distributed according to the Haar measure of {U(n)}, independently of {T}.
Now let {T_0 = ( t_{ij}^0 )_{1 \leq i \leq j \leq n}} be an upper triangular matrix with complex entries whose entries {t_{11}^0,\ldots,t_{nn}^0 \in {\bf C}} are distinct. As in the previous section, we consider the probability <a name=”gatf”><p align=center>\displaystyle \mathop{\bf P}( \| G - T_0 \|_F \leq \varepsilon ) . \ \ \ \ \ (6)</p>
</a> On the one hand, since the space {M_n({\bf C})} of complex {n \times n} matrices has {2n^2} real dimensions, we see from <a href=”#can”>(9)</a> that this expression is equal to <a name=”gatf2″><p align=center>\displaystyle (C'_n+o(1)) e^{-\|T_0\|_F^2} \varepsilon^{2n^2} \ \ \ \ \ (7)</p>
</a> for some constant {C'_n > 0}.
Now we compute <a href=”#gatf”>(6)</a> using the factorisation {G = UTU^{-1}}. Suppose that {\|G-T_0\|_F \leq \varepsilon}, so {G = T_0+O(\varepsilon)} As the eigenvalues of {T_0} are {t_{11}^0,\ldots,t_{nn}^0}, which are assumed to be distinct, we see (from the inverse function theorem) that for {\varepsilon} small enough, {G} has eigenvalues {t_{11}^0 +O(\varepsilon), \ldots, t_{nn}^0 + O(\varepsilon)}. With probability {1/n!}, the diagonal entries of {T} are thus {t_{11}^0 +O(\varepsilon), \ldots, t_{nn}^0 + O(\varepsilon)} (in that order). We now restrict to this event (the {1/n!} factor will eventually be absorbed into one of the unspecified constants).
Let {u^0_1,\ldots,u^0_n} be eigenvector of {T_0} associated to {t_{11}^0,\ldots,t_{nn}^0}, then the Gram-Schmidt process applied to {u_1,\ldots,u_n} (starting at {u^0_n} and working backwards to {u^0_1}) gives the standard basis {e_1,\ldots,e_n} (in reverse order). By the inverse function theorem, we thus see that we have eigenvectors {u_1 = u^0_1+O(\varepsilon),\ldots,u_n=u^0_n+O(\varepsilon)} of {G}, which when the Gram-Schmidt process is applied, gives a perturbation {e_1+O(\varepsilon),\ldots,e_n+O(\varepsilon)} in reverse order. This gives a factorisation {G=UTU^{-1}} in which {U = I + O(\varepsilon)}, and hence {T = T_0+O(\varepsilon)}. This is however not the most general factorisation available, even after fixing the diagonal entries of {T}, due to the freedom to right-multiply {U} by diagonal unitary matrices {R}. We thus see that the correct ansatz here is to have <p align=center>\displaystyle U = R + O(\varepsilon); \quad T = R^{-1} T_0 R + O(\varepsilon)</p>
for some diagonal unitary matrix {R}.
In analogy with the GUE case, we can use the inverse function theorem and make the more precise ansatz <p align=center>\displaystyle U = \exp(\varepsilon S) R; \quad T = R^{-1} (T_0 + \varepsilon E) R</p>
where {S} is skew-Hermitian with zero diagonal and size {O(1)}, {R} is diagonal unitary, and {E} is an upper triangular matrix of size {O(1)}. From the invariance {U \mapsto UR; T \mapsto R^{-1} TR} we see that {R} is distributed uniformly across all diagonal unitaries. Meanwhile, from the unitary conjugation invariance, {S} is distributed according to a constant multiple of {\varepsilon^{n^2-n}} times Lebesgue measure {dS} on the {n^2-n}-dimensional space of skew Hermitian matrices with zero diagonal; and from the definition of {\psi}, {E} is distributed according to a constant multiple of the measure <p align=center>\displaystyle (1+o(1)) \varepsilon^{n^2+n} \psi(T_0)\ dE,</p>
where {dE} is Lebesgue measure on the {n^2+n}-dimensional space of upper-triangular matrices. Furthermore, the invariances ensure that the random variables {S, R, E} are distributed independently. Finally, we have <p align=center>\displaystyle G = UTU^{-1} = \exp(\varepsilon S) (T_0 + \varepsilon E) \exp(-\varepsilon S).</p>
Thus we may rewrite <a href=”#gatf”>(6)</a> as <a name=”gatf-2″><p align=center>\displaystyle (C''_n \psi(T_0)+o(1))\varepsilon^{2n^2} \int\int_{\|\exp(\varepsilon S) (T_0 + \varepsilon E) \exp(-\varepsilon S) - T_0\|_F \leq \varepsilon} dS dE \ \ \ \ \ (8)</p>
</a> for some {C''_n>0} (the {R} integration being absorbable into this constant {C''_n}). We can Taylor expand <p align=center>\displaystyle \exp(\varepsilon S) (T_0 + \varepsilon E) \exp(-\varepsilon S) = T_0 + \varepsilon (E + ST_0 - T_0S) + O(\varepsilon^2)</p>
and so we can bound <a href=”#gatf-2″>(8)</a> above and below by expressions of the form <p align=center>\displaystyle (C''_n \psi(T_0)+o(1))\varepsilon^{2n^2} \int\int_{\|E + ST_0 - T_0 S\|_F \leq 1+O(\varepsilon)} dS dE. </p>
The Lebesgue measure {dE} is invariant under translations by upper triangular matrices, so we may rewrite the above expression as <a name=”can”><p align=center>\displaystyle (C''_n \psi(T_0)+o(1))\varepsilon^{2n^2} \int\int_{\|E + \pi(ST_0 - T_0 S)\|_F \leq 1+O(\varepsilon)} dS dE, \ \ \ \ \ (9)</p>
</a> where {\pi(ST_0 - T_0 S)} is the strictly lower triangular component of {ST_0-T_0S}.
The next step is to make the (linear) change of variables {V := \pi(ST_0-T_0 S)}. We check dimensions: {S} ranges in the space {S} of skew-adjoint Hermitian matrices with zero diagonal, which has dimension {(n^2-n)/2}, as does the space of strictly lower-triangular matrices, which is where {V} ranges. So we can in principle make this change of variables, but we first have to compute the Jacobian of the transformation (and check that it is non-zero). For this, we switch to coordinates. Write {S = (s_{ij})_{1 \leq i, j \leq n}} and {V = (v_{ij})_{1 \leq j < i \leq n}}. In coordinates, the equation {V=\pi(ST_0-T_0S)} becomes <p align=center>\displaystyle v_{ij} = \sum_{k=1}^j s_{ik} t_{kj}^0 - \sum_{k=i}^n t_{ik}^0 s_{kj}</p>
or equivalently <p align=center>\displaystyle v_{ij} = (t_{jj}^0-t_{ii}^0) s_{ij} + \sum_{k=1}^{j-1} t_{kj}^0 s_{ik}- \sum_{k=i+1}^n t_{ik}^0 s_{kj}.</p>
Thus for instance <p align=center>\displaystyle v_{n1} = (t^0_{11}-t^0_{nn}) s_{n1} </p>
<p align=center>\displaystyle v_{n2} = (t^0_{22}-t^0_{nn}) s_{n2} + t_{12}^0 s_{n1} </p>
<p align=center>\displaystyle v_{(n-1)1} = (t^0_{11} - t^0_{(n-1)(n-1)}) s_{(n-1)1} - t_{(n-1)n}^0 s_{n1}</p>
<p align=center>\displaystyle v_{n3} = (t^0_{33}-t^0_{nn}) s_{n3} + t_{13}^0 s_{n1} + t_{23}^0 s_{n2}</p>
<p align=center>\displaystyle v_{(n-1)2} = (t^0_{22}-t^0_{(n-1)(n-1)}) s_{(n-1)2} +t_{12}^0 s_{(n-1)1} - t_{(n-1)n}^0 s_{n2}</p>
<p align=center>\displaystyle v_{(n-2)1} = (t^0_{11}-t^0_{(n-2)(n-2)}) s_{(n-2)1} - t_{(n-2)(n-1)}^0 s_{(n-1)1} - t_{(n-2)n}^0 s_{n1}</p>
etc. We then observe that the transformation matrix from {s_{n1}, s_{n2}, s_{(n-1)1},\ldots} to {v_{n1}, v_{n2}, v_{(n-1)1}, \ldots} is triangular, with diagonal entries given by {t^0_{jj}-t^0_{ii}} for {1 \leq j < i \leq n}. The Jacobian of the (complex-linear) map {S \mapsto V} is thus given by <p align=center>\displaystyle |\prod_{1 \leq j < i \leq n} t^0_{jj}-t^0_{ii}|^2 = |\Delta( t^0_{11},\ldots,t^0_{nn})|^2</p>
which is non-zero by the hypothesis that the {t_0^{11},\ldots,t_0^{nn}} are distinct. We may thus rewrite <a href=”#can”>(9)</a> as <p align=center>\displaystyle \frac{C''_n \psi(T_0)+o(1)}{|\Delta( t^0_{11},\ldots,t^0_{nn})|^2} \varepsilon^{2n^2}\int\int_{\|E + V\|_F \leq 1+O(\varepsilon)} dS dV</p>
where {dV} is Lebesgue measure on strictly lower-triangular matrices. The integral here is equal to {C'''_n + O(\varepsilon)} for some constant {C'''_n}. Comparing this with <a href=”#gatf”>(6)</a>, cancelling the factor of {\varepsilon^{2n^2}}, and sending {\varepsilon \rightarrow 0}, we obtain the formula <p align=center>\displaystyle \psi( (t^0_{ij})_{1 \leq i \leq j \leq n} ) = C''''_n |\Delta( t^0_{11},\ldots,t^0_{nn})|^2 e^{-\|T_0\|_F^2} </p>
for some constant {C''''_n > 0}. We can expand <p align=center>\displaystyle e^{-\|T_0\|_F^2} = \prod_{1 \leq i \leq j \leq n} e^{-|t_{ij}^0|^2}.</p>
If we integrate out the off-diagonal variables {t^0_{ij}} for {1 \leq i < j \leq n}, we see that the density function for the diagonal entries {(\lambda_1,\ldots,\lambda_n)} of {T} is proportional to <p align=center>\displaystyle |\Delta(\lambda_1,\ldots,\lambda_n)|^2 e^{-\sum_{j=1}^n |\lambda_j|^2}.</p>
Since these entries are a random permutation of the eigenvalues of {G}, we conclude the <em>Ginibre formula</em> <a name=”ginibre-gaussian”><p align=center>\displaystyle \rho_n(\lambda_1,\ldots,\lambda_n) = c_n |\Delta(\lambda_1,\ldots,\lambda_n)|^2 e^{-\sum_{j=1}^n |\lambda_j|^2} \ \ \ \ \ (10)</p>
</a> for the joint density of the eigenvalues of a gaussian random matrix, where {c_n > 0} is a constant.

<blockquote><b>Remark 5</b> Given that <a href=”#rhod”>(1)</a> can be derived using Dyson Brownian motion, it is natural to ask whether <a href=”#ginibre-gaussian”>(10)</a> can be derived by a similar method. It seems that in order to do this, one needs to consider a Dyson-like process not just on the eigenvalues {\lambda_1,\ldots,\lambda_n}, but on the entire triangular matrix {T} (or more precisely, on the moduli space formed by quotienting out the action of conjugation by unitary diagonal matrices). Unfortunately the computations seem to get somewhat complicated, and we do not present them here. </blockquote>



<p align=center><b> &mdash; 3. Mean field approximation &mdash; </b></p>

We can use the formula <a href=”#rhod”>(1)</a> for the joint distribution to heuristically derive the semicircular law, as follows.
It is intuitively plausible that the spectrum {(\lambda_1,\ldots,\lambda_n)} should concentrate in regions in which {\rho_n(\lambda_1,\ldots,\lambda_n)} is as large as possible. So it is now natural to ask how to optimise this function. Note that the expression in <a href=”#rhod”>(1)</a> is non-negative, and vanishes whenever two of the {\lambda_i} collide, or when one or more of the {\lambda_i} go off to infinity, so a maximum should exist away from these degenerate situations.
We may take logarithms and write <a name=”rhond”><p align=center>\displaystyle - \log \rho_n(\lambda_1,\ldots,\lambda_n) = \sum_{j=1}^n \frac{1}{2} |\lambda_j|^2 + \sum\sum_{i \neq j} \log \frac{1}{|\lambda_i-\lambda_j|} + C \ \ \ \ \ (11)</p>
</a> where {C = C_n} is a constant whose exact value is not of importance to us. From a mathematical physics perspective, one can interpret <a href=”#rhond”>(11)</a> as a Hamiltonian for {n} particles at positions {\lambda_1,\ldots,\lambda_n}, subject to a confining harmonic potential (these are the {\frac{1}{2} |\lambda_j|^2} terms) and a repulsive logarithmic potential between particles (these are the {\log \frac{1}{|\lambda_i-\lambda_j|}} terms).
Our objective is now to find a distribution of {\lambda_1,\ldots,\lambda_n} that minimises this expression.
We know from previous notes that the {\lambda_i} should have magnitude {O(\sqrt{n})}. Let us then heuristically make a <a href=””>mean field approximation</a>, in that we approximate the discrete spectral measure {\frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j/\sqrt{n}}} by a continuous probability measure {\rho(x)\ dx}. (Secretly, we know from the semi-circular law that we should be able to take {\rho = \frac{1}{2\pi} (4-x^2)_+^{1/2}}, but pretend that we do not know this fact yet.) Then we can heuristically approximate <a href=”#rhond”>(11)</a> as <p align=center>\displaystyle n^2 ( \int_{\bf R} \frac{1}{2} x^2 \rho(x)\ dx + \int_{\bf R} \int_{\bf R} \log \frac{1}{|x-y|} \rho(x) \rho(y)\ dx dy ) + C'_n</p>
and so we expect the distribution {\rho} to minimise the functional <a name=”rhos”><p align=center>\displaystyle \int_{\bf R} \frac{1}{2} x^2 \rho(x)\ dx + \int_{\bf R} \int_{\bf R} \log \frac{1}{|x-y|} \rho(x) \rho(y)\ dx dy. \ \ \ \ \ (12)</p>
One can compute the Euler-Lagrange equations of this functional:

<blockquote><b>Exercise 6</b> Working formally, and assuming that {\rho} is a probability measure that minimises <a href=”#rhos”>(12)</a>, argue that <p align=center>\displaystyle \frac{1}{2} x^2 + 2 \int_{\bf R} \log \frac{1}{|x-y|} \rho(y)\ dy = C</p>
for some constant {C} and all {x} in the support of {\rho}. For all {x} outside of the support, establish the inequality <p align=center>\displaystyle \frac{1}{2} x^2 + 2 \int_{\bf R} \log \frac{1}{|x-y|} \rho(y)\ dy \geq C.</p>


There are various ways we can solve this equation for {\rho}; we sketch here a complex-analytic method. Differentiating in {x}, we formally obtain <p align=center>\displaystyle x - 2 p.v. \int_{\bf R} \frac{1}{x-y} \rho(y)\ dy = 0</p>
on the support of {\rho}. But recall that if we let <p align=center>\displaystyle s(z) := \int_{\bf R} \frac{1}{y-z} \rho(y)\ dy</p>
be the Stieltjes transform of the probability measure {\rho(x)\ dx}, then we have <p align=center>\displaystyle \hbox{Im}(s(x+i0^+)) = \pi \rho(x)</p>
and <p align=center>\displaystyle \hbox{Re}(s(x+i0^+)) = - p.v. \int_{\bf R} \frac{1}{x-y} \rho(y)\ dy.</p>
We conclude that <p align=center>\displaystyle (x + 2 \hbox{Re}(s(x+i0^+)) \hbox{Im}(s(x+i0^+))) = 0</p>
for all {x}, which we rearrange as <p align=center>\displaystyle \hbox{Im}( s^2(x+i0^+) + x s(x+i0^+) ) = 0.</p>
This makes the function {f(z) = s^2(z) + z s(z)} entire (it is analytic in the upper half-plane, obeys the symmetry {f(\overline{z}) = \overline{f(z)}}, and has no jump across the real line). On the other hand, as {s(z) = \frac{-1+o(1)}{z}} as {z \rightarrow \infty}, {f} goes to {-1} at infinity. Applying <a href=””>Liouville’s theorem</a>, we conclude that {f} is constant, thus we have the familiar equation <p align=center>\displaystyle s^2 + zs = -1</p>
which can then be solved to obtain the semi-circular law as in <a href=””>previous notes</a>.

<blockquote><b>Remark 7</b> Recall from <a href=””>Notes 3b</a> that Dyson Brownian motion can be used to derive the formula <a href=”#rhod”>(1)</a>. One can then interpret the Dyson Brownian motion proof of the semi-circular law for GUE in <a href=””>Notes 4</a> as a rigorous formalisation of the above mean field approximation heuristic argument. </blockquote>


One can perform a similar heuristic analysis for the spectral measure {\mu_G} of a random gaussian matrix, giving a description of the limiting density:

<blockquote><b>Exercise 8</b> Using heuristic arguments similar to those above, argue that {\mu_G} should be close to a continuous probability distribution {\rho(z)\ dz} obeying the equation <p align=center>\displaystyle |z|^2 + 2\int_{\bf C} \log \frac{1}{|z-w|} \rho(w)\ dw = C </p>
on the support of {\rho}, for some constant {C}, with the inequality <a name=”zaw”><p align=center>\displaystyle |z|^2 + 2\int_{\bf C} \log \frac{1}{|z-w|} \rho(w)\ dw \geq C \ \ \ \ \ (13)</p>
</a> outside of this support. Using the <a href=””>Newton potential</a> {\frac{1}{2\pi} \log |z|} for the fundamental solution of the two-dimensional Laplacian {-\partial_x^2 - \partial_y^2}, conclude (non-rigorously) that {\rho} is equal to {\frac{1}{\pi}} on its support.
Also argue that {\rho} should be rotationally symmetric. Use <a href=”#zaw”>(13)</a> and Green’s formula to argue why the support of {\rho} should be simply connected, and then conclude (again non-rigorously) the <em>circular law</em> <a name=”circular”><p align=center>\displaystyle \mu_G \approx \frac{1}{\pi} 1_{|z| \leq 1}\ dz. \ \ \ \ \ (14)</p>
</a> </blockquote>


We will see more rigorous derivations of the circular law later in these notes, and also in subsequent notes.

<p align=center><b> &mdash; 4. Determinantal form of the GUE spectral distribution &mdash; </b></p>

In a previous section, we showed (up to constants) that the density function {\rho_n(\lambda_1,\ldots,\lambda_n)} for the eigenvalues {\lambda_1 \geq \ldots \geq \lambda_n} of GUE was given by the formula <a href=”#rhod”>(1)</a>.
As is well known, the Vandermonde determinant {\Delta(\lambda_1,\ldots,\lambda_n)} that appears in <a href=”#rhod”>(1)</a> can be expressed up to sign as a determinant of an {n \times n} matrix, namely the matrix {(\lambda_i^{j-1})_{1 \leq i,j \leq n}}. Indeed, this determinant is clearly a polynomial of degree {n(n-1)/2} in {\lambda_1,\ldots,\lambda_n} which vanishes whenever two of the {\lambda_i} agree, and the claim then follows from the factor theorem (and inspecting a single coefficient of the Vandermonde determinant, e.g. the {\prod_{j=1}^n \lambda_j^{j-1}} coefficient, to get the sign).
We can square the above fact (or more precisely, multiply the above matrix matrix by its adjoint) and conclude that {|\Delta(\lambda_1,\ldots,\lambda_n)|^2} is the determinant of the matrix <p align=center>\displaystyle ( \sum_{k=0}^{n-1} \lambda_i^k \lambda_j^k )_{1 \leq i,j \leq n}.</p>
More generally, if {P_0(x),\ldots,P_{n-1}(x)} are any sequence of polynomials, in which {P_i(x)} has degree {i}, then we see from row operations that the determinant of <p align=center>\displaystyle (P_{j-1}(\lambda_i))_{1 \leq i,j \leq n}</p>
is a non-zero constant multiple of {\Delta(\lambda_1,\ldots,\lambda_n)} (with the constant depending on the leading coefficients of the {P_i}), and so the determinant of <p align=center>\displaystyle ( \sum_{k=0}^{n-1} P_k(\lambda_i) P_k(\lambda_j) )_{1 \leq i,j \leq n}</p>
is a non-zero constant multiple of {|\Delta(\lambda_1,\ldots,\lambda_n)|^2}. Comparing this with <a href=”#rhod”>(1)</a>, we obtain the formula <p align=center>\displaystyle \rho_n(\lambda) = C \det( \sum_{k=0}^{n-1} P_k(\lambda_i) e^{-\lambda_i^2/4} P_k(\lambda_j) e^{-\lambda_j^2/4})_{1 \leq i,j \leq n}</p>
for some non-zero constant {C}.
This formula is valid for any choice of polynomials {P_i} of degree {i}. But the formula is particularly useful when we set {P_i} equal to the (normalised) <a href=””>Hermite polynomials</a>, defined by applying the Gram-Schmidt process in {L^2({\bf R})} to the polynomials {x^i e^{-x^2/4}} for {i=0,\ldots,n-1} to yield {P_i(x) e^{-x^2/4}}. (Equivalently, the {P_i} are the <a href=””>orthogonal polynomials</a> associated to the measure {e^{-x^2/2}\ dx}.) In that case, the expression <a name=”knxy”><p align=center>\displaystyle K_n(x,y) := \sum_{k=0}^{n-1} P_k(x) e^{-x^2/4} P_k(y) e^{-y^2/4} \ \ \ \ \ (15)</p>
</a> becomes the integral kernel of the orthogonal projection {\pi_{V_n}} operator in {L^2({\bf R})} to the span of the {x^i e^{-x^2/4}}, thus <p align=center>\displaystyle \pi_{V_n} f(x) = \int_{\bf R} K_n(x,y) f(y)\ dy</p>
for all {f \in L^2({\bf R})}, and so {\rho_n(\lambda)} is now a constant multiple of <p align=center>\displaystyle \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq n}.</p>

The reason for working with orthogonal polynomials is that we have the trace identity <a name=”knxx”><p align=center>\displaystyle \int_{\bf R} K_n(x,x)\ dx = \hbox{tr}(\pi_{V_n}) = n \ \ \ \ \ (16)</p>
</a> and the reproducing formula <a name=”knxy2″><p align=center>\displaystyle K_n(x,y) = \int_{\bf R} K_n(x,z) K_n(z,y)\ dz \ \ \ \ \ (17)</p>
</a> which reflects the identity {\pi_{V_n} = \pi_{V_n}^2}. These two formulae have an important consequence:

<blockquote><b>Lemma 9 (Determinantal integration formula)</b> Let {K_n: {\bf R} \times {\bf R} \rightarrow {\bf R}} be any symmetric rapidly decreasing function obeying <a href=”#knxx”>(16)</a>, <a href=”#knxy2″>(17)</a>. Then for any {k \geq 0}, one has <a name=”city”><p align=center>\displaystyle \int_{\bf R} \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k+1}\ d\lambda_{k+1} = (n-k) \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k}. \ \ \ \ \ (18)</p>
</a> </blockquote>



<blockquote><b>Remark 10</b> This remarkable identity is part of the beautiful algebraic theory of <em>determinantal processes</em>, which I discuss further in <a href=””>this blog post</a>. </blockquote>


<em>Proof:</em> We induct on {k}. When {k=0} this is just <a href=”#knxx”>(16)</a>. Now assume that {k \geq 1} and that the claim has already been proven for {k-1}. We apply <a href=””>cofactor expansion</a> to the bottom row of the determinant {\det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k+1}}. This gives a principal term <a name=”princ”><p align=center>\displaystyle \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k} K_n(\lambda_{k+1},\lambda_{k+1}) \ \ \ \ \ (19)</p>
</a> plus a sum of {k} additional terms, the {l^{th}} term of which is of the form <a name=”nonprinc”><p align=center>\displaystyle (-1)^{k+1-l} K_n(\lambda_l,\lambda_{k+1}) \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i \leq k; 1 \leq j \leq k+1; j \neq l}. \ \ \ \ \ (20)</p>
</a> Using <a href=”#knxx”>(16)</a>, the principal term <a href=”#princ”>(19)</a> gives a contribution of {n \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k}} to <a href=”#city”>(18)</a>. For each nonprincipal term <a href=”#nonprinc”>(20)</a>, we use the multilinearity of the determinant to absorb the {K_n(\lambda_l,\lambda_{k+1})} term into the {j=k+1} column of the matrix. Using <a href=”#knxy2″>(17)</a>, we thus see that the contribution of <a href=”#nonprinc”>(20)</a> to <a href=”#city”>(18)</a> can be simplified as <p align=center>\displaystyle (-1)^{k+1-l} \det( ( K_n(\lambda_i, \lambda_j) )_{1 \leq i \leq k; 1 \leq j \leq k; j \neq l}, (K_n(\lambda_i,\lambda_l))_{1 \leq i\leq k} )</p>
which after row exchange, simplifies to {-\det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k}}. The claim follows. \Box

In particular, if we iterate the above lemma using the Fubini-Tonelli theorem, we see that <p align=center>\displaystyle \int_{{\bf R}^n} \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq n}\ d\lambda_1 \ldots d\lambda_n = n!.</p>
On the other hand, if we extend the probability density function {\rho_n(\lambda_1,\ldots,\lambda_n)} symmetrically from the Weyl chamber {{\bf R}^n_{\geq}} to all of {{\bf R}^n}, its integral is also {n!}. Since {\det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq n}} is clearly symmetric in the {\lambda_1,\ldots,\lambda_n}, we can thus compare constants and conclude the <a href=”″>Gaudin-Mehta formula</a> <p align=center>\displaystyle \rho_n(\lambda_1,\ldots,\lambda_n) = \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq n}.</p>
More generally, if we define {\rho_k: {\bf R}^k \rightarrow {\bf R}^+} to be the function <a name=”rhok-1″><p align=center>\displaystyle \rho_k(\lambda_1,\ldots,\lambda_k) = \det( K_n(\lambda_i, \lambda_j) )_{1 \leq i,j \leq k}, \ \ \ \ \ (21)</p>
</a> then the above formula shows that {\rho_k} is the <em>{k}-point correlation function</em> for the spectrum, in the sense that <a name=”rhok-2″><p align=center>\displaystyle \int_{{\bf R}^k} \rho_k(\lambda_1,\ldots,\lambda_k) F(\lambda_1,\ldots,\lambda_k)\ d\lambda_1 \ldots d\lambda_k \ \ \ \ \ (22)</p>
</a> <p align=center>\displaystyle = \mathop{\bf E} \sum_{1 \leq i_1, \ldots, i_k \leq n, \hbox{ distinct}} F( \lambda_{i_1}(M_n),\ldots,\lambda_{i_k}(M_n) ) </p>
for any test function {F: {\bf R}^k \rightarrow {\bf C}} supported in the region {\{ (x_1,\ldots,x_k): x_1 \leq \ldots \leq x_k \}}.
In particular, if we set {k=1}, we obtain the explicit formula <p align=center>\displaystyle \mathop{\bf E} \mu_{M_n} = \frac{1}{n} K_n(x,x)\ dx</p>
for the expected empirical spectral measure of {M_n}. Equivalently after renormalising by {\sqrt{n}}, we have <a name=”mun”><p align=center>\displaystyle \mathop{\bf E} \mu_{M_n/\sqrt{n}} = \frac{1}{n^{1/2}} K_n(\sqrt{n}x,\sqrt{n}x)\ dx. \ \ \ \ \ (23)</p>
It is thus of interest to understand the kernel {K_n} better.
To do this, we begin by recalling that the functions {P_i(x) e^{-x^2/4}} were obtained from {x^i e^{-x^2/4}} by the Gram-Schmidt process. In particular, each {P_i(x) e^{-x^2/4}} is orthogonal to the {x^j e^{-x^2/4}} for all {0 \leq j<i}. This implies that {x P_i(x) e^{-x^2/4}} is orthogonal to {x^j e^{-x^2/4}} for {0 \leq j < i-1}. On the other hand, {xP_i(x)} is a polynomial of degree {i+1}, so {x P_i(x) e^{-x^2/4}} must lie in the span of {x^j e^{-x^2/4}} for {0 \leq j \leq i+1}. Combining the two facts, we see that {x P_i} must be a linear combination of {P_{i-1}, P_i, P_{i+1}}, with the {P_{i+1}} coefficient being non-trivial. We rewrite this fact in the form <a name=”pii”><p align=center>\displaystyle P_{i+1}(x) = (a_i x + b_i) P_i(x) - c_i P_{i-1}(x) \ \ \ \ \ (24)</p>
</a> for some real numbers {a_i,b_i,c_i} (with {c_0=0}). Taking inner products with {P_{i+1}} and {P_{i-1}} we see that <a name=”low”><p align=center>\displaystyle \int_{\bf R} x P_i(x) P_{i+1}(x) e^{-x^2/2}\ dx = \frac{1}{a_i} \ \ \ \ \ (25)</p>
</a> and <p align=center>\displaystyle \int_{\bf R} x P_i(x) P_{i-1}(x) e^{-x^2/2}\ dx = \frac{c_i}{a_i}</p>
and so <a name=”aii”><p align=center>\displaystyle c_i := \frac{a_i}{a_{i-1}} \ \ \ \ \ (26)</p>
</a> (with the convention {a_{-1}=\infty}).
We will continue the computation of {a_i,b_i,c_i} later. For now, we we pick two distinct real numbers {x,y} and consider the <a href=””>Wronskian</a>-type expression <p align=center>\displaystyle P_{i+1}(x) P_i(y) - P_i(x) P_{i+1}(y).</p>
Using <a href=”#pii”>(24)</a>, <a href=”#aii”>(26)</a>, we can write this as <p align=center>\displaystyle a_i(x-y) P_i(x) P_i(y) + \frac{a_i}{a_{i-1}} (P_{i-1}(x) P_i(y) - P_i(x) P_{i-1}(y))</p>
or in other words <p align=center>\displaystyle P_i(x) P_i(y) = \frac{P_{i+1}(x) P_i(y) - P_i(x) P_{i+1}(y)}{a_i(x-y)} </p>
<p align=center>\displaystyle - \frac{P_i(x) P_{i-1}(y) - P_{i-1}(x) P_i(y)}{a_{i-1}(x-y)}.</p>
We telescope this and obtain the <em>Christoffel-Darboux formula</em> for the kernel <a href=”#knxy”>(15)</a>: <a name=”darbo”><p align=center>\displaystyle K_n(x,y) = \frac{P_{n}(x)P_{n-1}(y) - P_{n-1}(x) P_{n}(y)}{a_{n-1}(x-y)} e^{-(x^2+y^2)/4}. \ \ \ \ \ (27)</p>
</a> Sending {y \rightarrow x} using <a href=””>L’Hopital’s rule</a>, we obtain in particular that <a name=”darbo-2″><p align=center>\displaystyle K_n(x,x) = \frac{1}{a_{n-1}}( P'_{n}(x)P_{n-1}(x) - P'_{n-1}(x) P_{n}(x) ) e^{-x^2/2}. \ \ \ \ \ (28)</p>
Inserting this into <a href=”#mun”>(23)</a>, we see that if we want to understand the expected spectral measure of GUE, we should understand the asymptotic behaviour of {P_n} and the associated constants {a_n}. For this, we need to exploit the specific properties of the gaussian weight {e^{-x^2/2}}. In particular, we have the identity <a name=”gauss”><p align=center>\displaystyle x e^{-x^2/2} = - \frac{d}{dx} e^{-x^2/2} \ \ \ \ \ (29)</p>
</a> so upon integrating <a href=”#low”>(25)</a> by parts, we have <p align=center>\displaystyle \int_{\bf R} (P'_i(x) P_{i+1}(x) + P_i(x) P'_{i+1}(x)) e^{-x^2/2}\ dx = \frac{1}{a_i}. </p>
As {P'_i} has degree at most {i-1}, the first term vanishes by the orthonormal nature of the {P_i(x) e^{-x^2/4}}, thus <a name=”ip”><p align=center>\displaystyle \int_{\bf R} P_i(x) P'_{i+1}(x) e^{-x^2/2}\ dx = \frac{1}{a_i}. \ \ \ \ \ (30)</p>
</a> To compute this, let us denote the leading coefficient of {P_i} as {k_i}. Then {P'_{i+1}} is equal to {\frac{(i+1)k_{i+1}}{k_i} P_i} plus lower-order terms, and so we have <p align=center>\displaystyle \frac{(i+1)k_{i+1}}{k_i} = \frac{1}{a_i}.</p>
On the other hand, by inspecting the {x^{i+1}} coefficient of <a href=”#pii”>(24)</a> we have <p align=center>\displaystyle k_{i+1} = a_i k_i.</p>
Combining the two formulae (and making the sign convention that the {k_i} are always positive), we see that <p align=center>\displaystyle a_i = \frac{1}{\sqrt{i+1}}</p>
and <p align=center>\displaystyle k_{i+1} = \frac{k_i}{\sqrt{i+1}}.</p>
Meanwhile, a direct computation shows that {P_0(x) = k_0 = \frac{1}{(2\pi)^{1/4}}}, and thus by induction <p align=center>\displaystyle k_i := \frac{1}{(2\pi)^{1/4} \sqrt{i!}}.</p>
A similar method lets us compute the {b_i}. Indeed, taking inner products of <a href=”#pii”>(24)</a> with {P_i(x) e^{-x^2/2}} and using orthonormality we have <p align=center>\displaystyle b_i = - a_i \int_{\bf R} x P_i(x)^2 e^{-x^2/2}\ dx</p>
which upon integrating by parts using <a href=”#gauss”>(29)</a> gives <p align=center>\displaystyle b_i = - 2 a_i \int_{\bf R} P_i(x) P'_i(x) e^{-x^2/2}\ dx.</p>
As {P'_i} is of degree strictly less than {i}, the integral vanishes by orthonormality, thus {b_i=0}. The identity <a href=”#pii”>(24)</a> thus becomes <em>Hermite recurrence relation</em> <a name=”i2″><p align=center>\displaystyle P_{i+1}(x) = \frac{1}{\sqrt{i+1}} x P_i(x) - \frac{\sqrt{i}}{\sqrt{i+1}} P_{i-1}(x). \ \ \ \ \ (31)</p>
</a> Another recurrence relation arises by considering the integral <p align=center>\displaystyle \int_{\bf R} P_j(x) P'_{i+1}(x) e^{-x^2/2}\ dx.</p>
On the one hand, as {P'_{i+1}} has degree at most {i}, this integral vanishes if {j>i} by orthonormality. On the other hand, integrating by parts using <a href=”#gauss”>(29)</a>, we can write the integral as <p align=center>\displaystyle \int_{\bf R} (xP_j - P'_j)(x) P_{i+1}(x) e^{-x^2/2}\ dx.</p>
If {j < i}, then {xP_j-P'_j} has degree less than {i+1}, so the integral again vanishes. Thus the integral is non-vanishing only when {j=i}. Using <a href=”#ip”>(30)</a>, we conclude that <a name=”i1″><p align=center>\displaystyle P'_{i+1} = \frac{1}{a_i} P_i = \sqrt{i+1} P_i. \ \ \ \ \ (32)</p>
</a> We can combine <a href=”#i1″>(32)</a> with <a href=”#i2″>(31)</a> to obtain the formula <p align=center>\displaystyle \frac{d}{dx}( e^{-x^2/2} P_i(x) ) = - \sqrt{i+1} e^{-x^2/2} P_{i+1}(x),</p>
which together with the initial condition {P_0 = \frac{1}{(2\pi)^{1/4}}} gives the explicit representation <a name=”pn-form”><p align=center>\displaystyle P_n(x) := \frac{(-1)^n}{(2\pi)^{1/4} \sqrt{n!}} e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2} \ \ \ \ \ (33)</p>
</a> for the Hermite polynomials. Thus, for instance, at {x=0} one sees from Taylor expansion that <a name=”piano”><p align=center>\displaystyle P_n(0) = \frac{(-1)^{n/2}\sqrt{n!}}{(2\pi)^{1/4} 2^{n/2} (n/2)!} ;\quad P'_n(0) = 0 \ \ \ \ \ (34)</p>
</a> when {n} is even, and <a name=”piano-2″><p align=center>\displaystyle P_n(0) = 0;\quad P'_n(0) = \frac{(-1)^{(n+1)/2} (n+1) \sqrt{n!}}{(2\pi)^{1/4} 2^{(n+1)/2} ((n+1)/2)!} \ \ \ \ \ (35)</p>
</a> when {n} is odd.
In principle, the formula <a href=”#pn-form”>(33)</a>, together with <a href=”#darbo-2″>(28)</a>, gives us an explicit description of the kernel {K_n(x,x)} (and thus of {\mathop{\bf E} \mu_{M_n/\sqrt{n}}}, by <a href=”#mun”>(23)</a>). However, to understand the asymptotic behaviour as {n \rightarrow \infty}, we would have to understand the asymptotic behaviour of {\frac{d^n}{dx^n} e^{-x^2/2}} as {n \rightarrow \infty}, which is not immediately discernable by inspection. However, one can obtain such asymptotics by a variety of means. We give two such methods here: a method based on ODE analysis, and a complex-analytic method, based on the <a href=””>method of steepest descent</a>.
We begin with the ODE method. Combining <a href=”#i2″>(31)</a> with <a href=”#i1″>(32)</a> we see that each polynomial {P_m} obeys the <em>Hermite differential equation</em> <p align=center>\displaystyle P''_m(x) - x P'_m(x) + m P_m(x) = 0.</p>
If we look instead at the Hermite functions {\phi_m(x) := P_m(x) e^{-x^2/4}}, we obtain the differential equation <p align=center>\displaystyle L \phi_m(x) = (m+\frac{1}{2}) \phi_m</p>
where {L} is the <em>harmonic oscillator operator</em> <p align=center>\displaystyle L \phi := - \phi'' + \frac{x^2}{4} \phi.</p>
Note that the self-adjointness of {L} here is consistent with the orthogonal nature of the {\phi_m}.

<blockquote><b>Exercise 11</b> Use <a href=”#knxy”>(15)</a>, <a href=”#darbo-2″>(28)</a>, <a href=”#pn-form”>(33)</a>, <a href=”#i2″>(31)</a>, <a href=”#i1″>(32)</a> to establish the identities <p align=center>\displaystyle K_n(x,x) = \sum_{j=0}^{n-1} \phi_j(x)^2 </p>
<p align=center>\displaystyle = \phi'_n(x)^2 + (n - \frac{x^2}{4}) \phi_n(x)^2</p>
and thus by <a href=”#mun”>(23)</a> <p align=center>\displaystyle \mathop{\bf E} \mu_{M_n/\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{j=0}^{n-1} \phi_j(\sqrt{n} x)^2\ dx = </p>
<p align=center>\displaystyle = [\frac{1}{\sqrt{n}} \phi'_n(\sqrt{n} x)^2 + \sqrt{n} (1 - \frac{x^2}{4}) \phi_n(\sqrt{n} x)^2]\ dx.</p>


It is thus natural to look at the rescaled functions <p align=center>\displaystyle \tilde \phi_m(x) := \sqrt{n} \phi_m(\sqrt{n} x)</p>
which are orthonormal in {L^2({\bf R})} and solve the equation <p align=center>\displaystyle L_{1/n} \tilde \phi_m(x) = \frac{m+1/2}{n} \tilde \phi_m</p>
where {L_h} is the <em>semiclassical harmonic oscillator operator</em> <p align=center>\displaystyle L_h \phi := - h^2 \phi'' + \frac{x^2}{4} \phi,</p>
thus <p align=center>\displaystyle \mathop{\bf E} \mu_{M_n/\sqrt{n}} = \frac{1}{n} \sum_{j=0}^{n-1} \tilde \phi_j(x)^2\ dx = </p>
<a name=”wings”><p align=center>\displaystyle = [\frac{1}{n} \tilde \phi'_n(x)^2 + (1 - \frac{x^2}{4}) \tilde \phi_n(x)^2]\ dx. \ \ \ \ \ (36)</p>
The projection {\pi_{V_n}} is then the spectral projection operator of {L_{1/\sqrt{n}}} to {[0,1]}. According to <a href=””>semi-classical analysis</a>, with {h} being interpreted as analogous to Planck’s constant, the operator {L_h} has symbol {p^2 + \frac{x^2}{4}}, where {p := -i h \frac{d}{dx}} is the momentum operator, so the projection {\pi_{V_n}} is a projection to the region {\{ (x,p): p^2 + \frac{x^2}{4} \leq 1 \}} of phase space, or equivalently to the region {\{ (x,p): |p| < (4-x^2)_+^{1/2}\}}. In the semi-classical limit {h \rightarrow 0}, we thus expect the diagonal {K_n(x,x)} of the normalised projection {h^2 \pi_{V_n}} to be proportional to the projection of this region to the {x} variable, i.e. proportional to {(4-x^2)_+^{1/2}}. We are thus led to the semi-circular law via semi-classical analysis.
It is possible to make the above argument rigorous, but this would require developing the theory of <a href=””>microlocal analysis</a>, which would be overkill given that we are just dealing with an ODE rather than a PDE here (and an extremely classical ODE at that). We instead use a more basic semiclassical approximation, the <a href=””>WKB approximation</a>, which we will make rigorous using the classical <a href=””>method of variation of parameters</a> (one could also proceed using the closely related <em>Pr&uuml;fer transformation</em>, which we will not detail here). We study the eigenfunction equation <p align=center>\displaystyle L_h \phi = \lambda \phi</p>
where we think of {h > 0} as being small, and {\lambda} as being close to {1}. We rewrite this as <a name=”phik”><p align=center>\displaystyle \phi'' = - \frac{1}{h^2} k(x)^2 \phi \ \ \ \ \ (37)</p>
</a> where {k(x) := \sqrt{\lambda - x^2/4}}, where we will only work in the “classical” region {x^2/4 < \lambda} (so {k(x) > 0}) for now.
Recall that the general solution to the constant coefficient ODE {\phi'' = -\frac{1}{h^2} k^2 \phi} is given by {\phi(x) = A e^{ikx/h} + B e^{-ikx/h}}. Inspired by this, we make the ansatz <p align=center>\displaystyle \phi(x) = A(x) e^{i \Psi(x)/h} + B(x) e^{-i \Psi(x)/h}</p>
where {\Psi(x) := \int_0^x k(y)\ dy} is the antiderivative of {k}. Differentiating this, we have <p align=center>\displaystyle \phi'(x) = \frac{i k(x)}{h}( A(x) e^{i\Psi(x)/h} - B(x) e^{-i\Psi(x)/h} ) </p>
<p align=center>\displaystyle + A'(x) e^{i\Psi(x)/h} + B'(x) e^{-i\Psi(x)/h}.</p>
Because we are representing a single function {\phi} by two functions {A, B}, we have the freedom to place an additional constraint on {A,B}. Following the usual variation of parameters strategy, we will use this freedom to eliminate the last two terms in the expansion of {\phi}, thus <a name=”aprime”><p align=center>\displaystyle A'(x) e^{i\Psi(x)/h} + B'(x) e^{-i\Psi(x)/h} = 0. \ \ \ \ \ (38)</p>
</a> We can now differentiate again and obtain <p align=center>\displaystyle \phi''(x) = -\frac{k(x)^2}{h^2} \phi(x) + \frac{ik'(x)}{h} ( A(x) e^{i\Psi(x)/h} - B(x) e^{-i\Psi(x)/h} )</p>
<p align=center>\displaystyle + \frac{ik(x)}{h} ( A'(x) e^{i\Psi(x)/h} - B'(x) e^{-i\Psi(x)/h} ).</p>
Comparing this with <a href=”#phik”>(37)</a> we see that <p align=center>\displaystyle A'(x) e^{i\Psi(x)/h} - B'(x) e^{-i\Psi(x)/h} = - \frac{k'(x)}{k(x)} ( A(x) e^{i\Psi(x)/h} - B(x) e^{-i\Psi(x)/h} ).</p>
Combining this with <a href=”#aprime”>(38)</a>, we obtain equations of motion for {A} and {B}: <p align=center>\displaystyle A'(x) = - \frac{k'(x)}{2 k(x)} A(x) + \frac{k'(x)}{2 k(x)} B(x) e^{-2i\Psi(x)/h} </p>
<p align=center>\displaystyle B'(x) = - \frac{k'(x)}{2 k(x)} B(x) + \frac{k'(x)}{2 k(x)} A(x) e^{2i\Psi(x)/h}.</p>
We can simplify this using the <a href=””>integrating factor</a> substitution <p align=center>\displaystyle A(x) = k(x)^{-1/2} a(x); \quad B(x) = k(x)^{-1/2} b(x)</p>
to obtain <a name=”a1″><p align=center>\displaystyle a'(x) = \frac{k'(x)}{2 k(x)} b(x) e^{-2i\Psi(x)/h}; \ \ \ \ \ (39)</p>
</a> <a name=”a2″><p align=center>\displaystyle b'(x) = \frac{k'(x)}{2 k(x)} a(x) e^{2i\Psi(x)/h}. \ \ \ \ \ (40)</p>
</a> The point of doing all these transformations is that the role of the {h} parameter no longer manifests itself through amplitude factors, and instead only is present in a phase factor. In particular, we have <p align=center>\displaystyle a', b' = O( |a| + |b| )</p>
on any compact interval {I} in the interior of the classical region {x^2/4 < \lambda} (where we allow implied constants to depend on {I}), which by <a href=”’s_inequality”>Gronwall’s inequality</a> gives the bounds <p align=center>\displaystyle a'(x), b'(x), a(x), b(x) = O( |a(0)| + |b(0)| )</p>
on this interval {I}. We can then insert these bounds into <a href=”#a1″>(39)</a>, <a href=”#a2″>(40)</a> again and integrate by parts (taking advantage of the non-stationary nature of {\Psi}) to obtain the improved bounds <a name=”axo”><p align=center>\displaystyle a(x) = a(0) + O( h (|a(0)|+|b(0)|) ); \quad b(x) = b(0) + O(h (|a(0)|+|b(0)|) ) \ \ \ \ \ (41)</p>
</a> on this interval. (More precise asymptotic expansions can be obtained by iterating this procedure, but we will not need them here.) This is already enough to get the asymptotics that we need:

<blockquote><b>Exercise 12</b> Use <a href=”#wings”>(36)</a> to Show that on any compact interval {I} in {(-2,2)}, the density of {\mathop{\bf E} \mu_{M_n/\sqrt{n}}} is given by <p align=center>\displaystyle (|a|^2(x)+|b|^2(x)) ( \sqrt{1-x^2/4} + o(1) ) + O( |a(x)| |b(x)| )</p>
where {a, b} are as above with {\lambda=1+\frac{1}{2n}} and {h=\frac{1}{n}}. Combining this with <a href=”#axo”>(41)</a>, <a href=”#piano”>(34)</a>, <a href=”#piano-2″>(35)</a>, and Stirling’s formula, conclude that {\mathop{\bf E} \mu_{M_n/\sqrt{n}}} converges in the vague topology to the semicircular law {\frac{1}{2\pi} (4-x^2)_+^{1/2}\ dx}. (Note that once one gets convergence inside {(-2,2)}, the convergence outside of {[-2,2]} can be obtained for free since {\mu_{M_n/\sqrt{n}}} and {\frac{1}{2\pi} (4-x^2)_+^{1/2}\ dx} are both probability measures. </blockquote>


We now sketch out the approach using the <a href=””>method of steepest descent</a>. The starting point is the Fourier inversion formula <p align=center>\displaystyle e^{-x^2/2} = \frac{1}{\sqrt{2\pi}} \int_{\bf R} e^{itx} e^{-t^2/2}\ dt</p>
which upon repeated differentiation gives <p align=center>\displaystyle \frac{d^n}{dx^n} e^{-x^2/2} = \frac{i^n}{\sqrt{2\pi}} \int_{\bf R} t^n e^{itx} e^{-t^2/2}\ dt</p>
and thus by <a href=”#pn-form”>(33)</a> <p align=center>\displaystyle P_n(x) = \frac{(-i)^n}{(2\pi)^{3/4} \sqrt{n!}} \int_{\bf R} t^n e^{-(t-ix)^2/2}\ dt</p>
and thus <p align=center>\displaystyle \tilde \phi_n(x) = \frac{(-i)^n}{(2\pi)^{3/4} \sqrt{n!}} n^{(n+1)/2} \int_{\bf R} e^{n \phi(t)} \ dt</p>
where <p align=center>\displaystyle \phi(t) := \log t - (t-ix)^2/2 - x^2/4</p>
where we use a suitable branch of the complex logarithm to handle the case of negative {t}.
The idea of the principle of steepest descent is to shift the contour of integration to where the real part of {\phi(z)} is as small as possible. For this, it turns out that the stationary points of {\phi(z)} play a crucial role. A brief calculation using the quadratic formula shows that there are two such stationary points, at <p align=center>\displaystyle z = \frac{ix \pm \sqrt{4-x^2}}{2}.</p>
When {|x| < 2}, {\phi} is purely imaginary at these stationary points, while for {|x| > 2} the real part of {\phi} is negative at both points. One then draws a contour through these two stationary points in such a way that near each such point, the imaginary part of {\phi(z)} is kept fixed, which keeps oscillation to a minimum and allows the real part to decay as steeply as possible (which explains the name of the method). After a certain tedious amount of computation, one obtains the same type of asymptotics for {\tilde \phi_n} that were obtained by the ODE method when {|x| < 2} (and exponentially decaying estimates for {|x|>2}).

<blockquote><b>Exercise 13</b> Let {f: {\bf C} \rightarrow {\bf C}}, {g:{\bf C} \rightarrow {\bf C}} be functions which are analytic near a complex number {z_0}, with {f'(z_0) = 0} and {f''(z_0) \neq 0}. Let {\varepsilon > 0} be a small number, and let {\gamma} be the line segment {\{ z_0 + t v: -\varepsilon < t < \varepsilon \}}, where {v} is a complex phase such that {f''(z_0) v^2} is a negative real. Show that for {\varepsilon} sufficiently small, one has <p align=center>\displaystyle \int_\gamma e^{\lambda f(z)} g(z)\ dz = (1+o(1)) \frac{\sqrt{2\pi} v}{\sqrt{f''(z_0) \lambda}} e^{\lambda f(z_0)} g(z_0)</p>
as {\lambda \rightarrow +\infty}. This is the basic estimate behind the method of steepest descent; readers who are also familiar with the <a href=””>method of stationary phase</a> may see a close parallel. </blockquote>



<blockquote><b>Remark 14</b> The method of steepest descent requires an explicit representation of the orthogonal polynomials as contour integrals, and as such is largely restricted to the classical orthogonal polynomials (such as the Hermite polynomials). However, there is a non-linear generalisation of the method of steepest descent developed by Deift and Zhou, in which one solves a matrix Riemann-Hilbert problem rather than a contour integral; see <a href=”″>this book by Deift</a> for details. Using these sorts of tools, one can generalise much of the above theory to the spectral distribution of {U(n)}-conjugation-invariant discussed in Remark <a href=”#daft”>2</a>, with the theory of Hermite polynomials being replaced by the more general theory of orthogonal polynomials; this is discussed in the above book of Deift, as well as the more recent <a href=”″>book of Deift and Gioev</a>. </blockquote>


The computations performed above for the diagonal kernel {K_n(x,x)} can be summarised by the asymptotic <p align=center>\displaystyle K_n( \sqrt{n} x, \sqrt{n} x ) = \sqrt{n} ( \rho_{sc}(x) + o(1) )</p>
whenever {x \in {\bf R}} is fixed and {n \rightarrow \infty}, and {\rho_{sc}(x) := \frac{1}{2\pi} (4-x^2)_+^{1/2}} is the semi-circular law distribution. It is reasonably straightforward to generalise these asymptotics to the off-diagonal case as well, obtaining the more general result <a name=”kn”><p align=center>\displaystyle K_n( \sqrt{n} x + \frac{y_1}{\sqrt{n} \rho_{sc}(x)}, \sqrt{n} x + \frac{y_2}{\sqrt{n} \rho_{sc}(x)} ) = \sqrt{n} ( \rho_{sc}(x) K(y_1,y_2) + o(1) ) \ \ \ \ \ (42)</p>
</a> for fixed {x \in (-2,2)} and {y_1,y_2 \in {\bf R}}, where {K} is the <em>Dyson sine kernel</em> <p align=center>\displaystyle K(y_1,y_2) := \frac{\sin(\pi(y_1-y_2)}{\pi(y_1-y_2)}.</p>
In the language of semi-classical analysis, what is going on here is that the rescaling in the left-hand side of <a href=”#kn”>(42)</a> is transforming the phase space region {\{ (x,p): p^2 + \frac{x^2}{4} \leq 1 \}} to the region {\{ (x,p): |p| \leq 1 \}} in the limit {n \rightarrow \infty}, and the projection to the latter region is given by the Dyson sine kernel. A formal proof of <a href=”#kn”>(42)</a> can be given by using either the ODE method or the steepest descent method to obtain asymptotics for Hermite polynomials, and thence (via the Christoffel-Darboux formula) to asymptotics for {K_n}; we do not give the details here, but see for instance the recent book of Anderson, Guionnet, and Zeitouni.
From <a href=”#kn”>(42)</a> and <a href=”#rhok-1″>(21)</a>, <a href=”#rhok-2″>(22)</a> we obtain the asymptotic formula <p align=center>\displaystyle \mathop{\bf E} \sum_{1 \leq i_1 < \ldots < i_k \leq n} F( \sqrt{n} \rho_{sc}(x) (\lambda_{i_1}(M_n) - \sqrt{n} x),\ldots,</p>
<p align=center>\displaystyle \sqrt{n} \rho_{sc}(x) (\lambda_{i_k}(M_n) - \sqrt{n} x) )</p>
<p align=center>\displaystyle \rightarrow \int_{{\bf R}^k} F(y_1,\ldots,y_k) \det(K(y_i,y_j))_{1 \leq i,j \leq k}\ dy_1 \ldots dy_k</p>
for the local statistics of eigenvalues. By means of further algebraic manipulations (using the general theory of determinantal processes), this allows one to control such quantities as the distribution of eigenvalue gaps near {\sqrt{n} x}, normalised at the scale {\frac{1}{\sqrt{n} \rho_{sc}(x)}}, which is the average size of these gaps as predicted by the semicircular law. For instance, for any {s_0>0}, one can show (basically by the above formulae combined with the inclusion-exclusion principle) that the proportion of eigenvalues {\lambda_i} with normalised gap {\sqrt{n} \frac{\lambda_{i+1}-\lambda_i}{\rho_{sc}(t_{i/n})}} less than {s_0} converges as {n \rightarrow \infty} to {\int_0^{s_0} \frac{d^2}{ds^2} \det(1-K)_{L^2[0,s]}\ ds}, where {t_c \in [-2,2]} is defined by the formula {\int_{-2}^{t_c} \rho_{sc}(x)\ dx = c}, and {K} is the integral operator with kernel {K(x,y)} (this operator can be verified to be <a href=””>trace class</a>, so the determinant can be defined in a <a href=””>Fredholm sense</a>). See for instance this <a href=”″>book of Mehta</a> (and my <a href=””>blog post on determinantal processes</a> describe a finitary version of the inclusion-exclusion argument used to obtain such a result).

<blockquote><b>Remark 15</b> One can also analyse the distribution of the eigenvalues at the edge of the spectrum, i.e. close to {\pm 2 \sqrt{n}}. This ultimately hinges on understanding the behaviour of the projection {\pi_{V_n}} near the corners {(0, \pm 2)} of the phase space region {\Omega = \{ (p,x): p^2 + \frac{x^2}{4} \leq 1 \}}, or of the Hermite polynomials {P_n(x)} for {x} close to {\pm 2 \sqrt{n}}. For instance, by using steepest descent methods, one can show that <p align=center>\displaystyle n^{1/12} \phi_n(2\sqrt{n} + n^{1/6} x) \rightarrow Ai(x)</p>
as {n \rightarrow \infty} for any fixed {x,y}, where {Ai} is the <a href=””>Airy function</a> <p align=center>\displaystyle Ai(x) := \frac{1}{\pi} \int_0^\infty \cos( \frac{t^3}{3} + tx )\ dt.</p>
This asymptotic and the Christoffel-Darboux formula then gives the asymptotic <a name=”nkn”><p align=center>\displaystyle n^{1/6} K_n(2\sqrt{n} + n^{1/6} x, 2\sqrt{n} + n^{1/6} y) \rightarrow K_{Ai}(x,y) \ \ \ \ \ (43)</p>
</a> for any fixed {x,y}, where {K_{Ai}} is the <em>Airy kernel</em> <p align=center>\displaystyle K_{Ai}(x,y) := \frac{Ai(x) Ai'(y) - Ai'(x) Ai(y)}{x-y}.</p>
(<em>Aside</em>: Semiclassical heuristics suggest that the rescaled kernel <a href=”#nkn”>(43)</a> should correspond to projection to the parabolic region of phase space {\{ (p,x): p^2 + x \leq 1\}}, but I do not know of a connection between this region and the Airy kernel; I am not sure whether semiclassical heuristics are in fact valid at this scaling regime. On the other hand, these heuristics do explain the emergence of the length scale {n^{1/6}} that emerges in <a href=”#nkn”>(43)</a>, as this is the smallest scale at the edge which occupies a region in {\Omega} consistent with the Heisenberg uncertainty principle.) This then gives an asymptotic description of the largest eigenvalues of a GUE matrix, which cluster in the region {2\sqrt{n} + O(n^{1/6})}. For instance, one can use the above asymptotics to show that the largest eigenvalue {\lambda_1} of a GUE matrix obeys the <a href=””>Tracy-Widom law</a> <p align=center>\displaystyle \mathop{\bf P}( \frac{\lambda_1 - 2\sqrt{n}}{n^{1/6}} < t ) \rightarrow \det(1 - A)_{L^2[0,t]}</p>
for any fixed {t}, where {A} is the integral operator with kernel {K_{Ai}}. See for instance the recent book of Anderson, Guionnet, and Zeitouni. </blockquote>



<p align=center><b> &mdash; 5. Determinantal form of the gaussian matrix distribution &mdash; </b></p>

One can perform an analogous analysis of the joint distribution function <a href=”#ginibre-gaussian”>(10)</a> of gaussian random matrices. Indeed, given any family {P_0,\ldots,P_{n-1}(z)} of polynomials, with each {P_i} of degree {i}, much the same arguments as before show that <a href=”#ginibre-gaussian”>(10)</a> is equal to a constant multiple of <p align=center>\displaystyle \det( \sum_{k=0}^{n-1} P_k(\lambda_i) e^{-|\lambda_i|^2/2} \overline{P_k(\lambda_j)} e^{-|\lambda_j|^2/2} )_{1 \leq i,j \leq n}.</p>
One can then select {P_k(z) e^{-|z|^2/2}} to be orthonormal in {L^2({\bf C})}. Actually in this case, the polynomials are very simple, being given explicitly by the formula <p align=center>\displaystyle P_k(z) := \frac{1}{\sqrt{\pi k!}} z^k.</p>


<blockquote><b>Exercise 16</b> Verify that the {P_k(z) e^{-|z|^2/2}} are indeed orthonormal, and then conclude that <a href=”#ginibre-gaussian”>(10)</a> is equal to {\det( K_n(\lambda_i,\lambda_j) )_{1 \leq i,j \leq n}}, where <p align=center>\displaystyle K_n(z,w) := \frac{1}{\pi} e^{-(|z|^2+|w|^2)/2} \sum_{k=0}^{n-1} \frac{(z\overline{w})^k}{k!}.</p>
Conclude further that the {m}-point correlation functions {\rho_m(z_1,\ldots,z_m)} are given as <p align=center>\displaystyle \rho_m(z_1,\ldots,z_m) = \det( K_n(z_i,z_j) )_{1 \leq i,j \leq m}.</p>



<blockquote><b>Exercise 17</b> Show that as {n \rightarrow \infty}, one has <p align=center>\displaystyle K_n( \sqrt{n} z, \sqrt{n} z ) = \frac{1}{\pi} 1_{|z| \leq 1} + o(1)</p>
and deduce that the expected spectral measure {{\bf E} \mu_{G/\sqrt{n}}} converges vaguely to the circular measure {\mu_c := \frac{1}{\pi} 1_{|z| \leq 1}\ dz}; this is a special case of the <em>circular law</em>. </blockquote>



<blockquote><b>Exercise 18</b> For any {|z| < 1} and {w_1, w_2 \in {\bf C}}, show that <p align=center>\displaystyle K_n( \sqrt{n} (z+w_1), \sqrt{n} (z+w_2) ) = \frac{1}{\pi} \exp( - |w_1-w_2|^2 / 2 ) + o(1)</p>
as {n \rightarrow \infty}. This formula (in principle, at least) describes the asymptotic local {m}-point correlation functions of the spectrum of gaussian matrices. </blockquote>



<blockquote><b>Remark 19</b> One can use the above formulae as the starting point for many other computations on the spectrum of random gaussian matrices; to give just one example, one can show that expected number of eigenvalues which are real is of the order of {\sqrt{n}} (see <a href=”″>this paper of Edelman</a> for more precise results of this nature). It remains a challenge to extend these results to more general ensembles than the gaussian ensemble. </blockquote>



When solving the initial value problem to an ordinary differential equation, such as

\displaystyle  \partial_t u = F(u); \quad u(0) = u_0, \ \ \ \ \ (1)

where {u: {\bf R} \rightarrow V} is the unknown solution (taking values in some finite-dimensional vector space {V}), {u_0 \in V} is the initial datum, and {F: V \rightarrow V} is some nonlinear function (which we will take to be smooth for sake of argument), then one can construct a solution locally in time via the Picard iteration method. There are two basic ideas. The first is to use the fundamental theorem of calculus to rewrite the initial value problem (1) as the problem of solving an integral equation,

\displaystyle  u(t) = u_0 + \int_0^t F(u(s))\ ds. \ \ \ \ \ (2)

The second idea is to solve this integral equation by the contraction mapping theorem, showing that the integral operator {{\mathcal N}} defined by

\displaystyle  {\mathcal N}(u) (t) := u_0 + \int_0^t F(u(s))\ ds

is a contraction on a suitable complete metric space (e.g. a closed ball in the function space {C^0([0,T]; V)}), and thus has a unique fixed point in this space. This method works as long as one only seeks to construct local solutions (for time {t} in {[0,T]} for sufficiently small {T>0}), but the solutions constructed have a number of very good properties, including

  • Existence: A solution {u} exists in the space {C^0([0,T];V)} (and even in {C^\infty([0,T];V)}) for {T} sufficiently small.
  • Uniqueness: There is at most one solution {u} to the initial value problem in the space {C^0([0,T];V)} (or in smoother spaces, such as {C^\infty([0,T];V)}). (For solutions in the weaker space {C^0([0,T];V)} we use the integral formulation (2) to define the solution concept.)
  • Lipschitz continuous dependence on the data: If {u_0^{(n)}} is a sequence of initial data converging to {u_0}, then the associated solutions {u^{(n)}} converge uniformly to {u} on {[0,T]} (possibly after shrinking {T} slightly). In fact we have the Lipschitz bound {\| u^{(n)}(t) - u(t) \|_V \leq C \| u^{(n)}_0 - u_0 \|_V} for {n} large enough and {t \in [0,T]}, where {C} is an absolute constant.

This package of properties is referred to as (Lipschitz) wellposedness.

This method extends to certain partial differential equations, particularly those of a semilinear nature (linear except for lower order nonlinear terms). For instance, if trying to solve an initial value problem of the form

\displaystyle  \partial_t u + Lu = F(u); \quad u(0,x) = u_0(x),

where now {u: {\bf R} \rightarrow V} takes values in a function space {V} (e.g. a Sobolev space {H^k({\bf R}^d)}), {u_0 \in V} is an initial datum, {L} is some (differential) operator (independent of {u}) that is (densely) defined on {V}, and {F} is a nonlinearity which is also (densely) defined on {V}, then (formally, at least) one can solve this problem by using Duhamel’s formula to convert the problem to that of solving an integral equation

\displaystyle  u(t) = e^{-tL} u_0 + \int_0^t e^{-(t-s)L} F(u(s))\ ds

and one can then hope to show that the associated nonlinear integral operator

\displaystyle  u \mapsto e^{-tL} u_0 + \int_0^t e^{-(t-s)L} F(u(s))\ ds

is a contraction in a subset of a suitably chosen function space.

This method turns out to work surprisingly well for many semilinear partial differential equations, and in particular for semilinear parabolic, semilinear dispersive, and semilinear wave equations. As in the ODE case, when the method works, it usually gives the entire package of Lipschitz well-posedness: existence, uniqueness, and Lipschitz continuous dependence on the initial data, for short times at least.

However, when one moves from semilinear initial value problems to quasilinear initial value problems such as

\displaystyle  \partial_t u + L_u u = F(u); \quad u(0,x) = u_0(x)

in which the top order operator {L_u} now depends on the solution {u} itself, then the nature of well-posedness changes; one can still hope to obtain (local) existence and uniqueness, and even continuous dependence on the data, but one usually is forced to give up Lipschitz continuous dependence at the highest available regularity (though one can often recover it at lower regularities). As a consequence, the Picard iteration method is not directly suitable for constructing solutions to such equations.

One can already see this phenomenon with a very simple equation, namely the one-dimensional constant-velocity transport equation

\displaystyle  \partial_t u + c \partial_x u = 0; \quad u(0,x) = u_0(x) \ \ \ \ \ (3)

where we consider {c = c_0} as part of the initial data. (If one wishes, one could view this equation as a rather trivial example of a system.

\displaystyle  \partial_t u + c \partial_x u = 0; \quad \partial_t c = 0

\displaystyle  u(0,x) = u_0(x); \quad c(0) = c_0,

to emphasis this viewpoint, but this would be somewhat idiosyncratic.) One can solve this equation explicitly of course to get the solution

\displaystyle  u(t,x) = u_0(x-ct).

In particular, if we look at the solution just at time {t=1} for simplicity, we have

\displaystyle  u(1,x) = u_0(x-c).

Now let us see how this solution {u(1,x)} depends on the parameter {c}. One can ask whether this dependence is Lipschitz in {c}, in some function space {V}:

\displaystyle  \| u_0(\cdot - c) - u_0(\cdot - c') \|_V \leq A |c-c'|

for some finite {A}. But using the Newton approximation

\displaystyle  u_0(\cdot - c) - u_0(\cdot - c') \approx (c-c') \partial_x u_0(\cdot - c)

we see that we should only expect such a bound when {\partial_x u_0} (and its translates) lie in {V}. Thus, we see a loss of derivatives phenomenon with regard to Lipschitz well-posedness; if the initial data {u_0} is in some regularity space, say {C^3}, then one only obtains Lipschitz dependence on {c} in a lower regularity space such as {C^2}.

We have just seen that if all one knows about the initial data {u_0} is that it is bounded in a function space {V}, then one usually cannot hope to make the dependence of {u} on the velocity parameter {c} Lipschitz continuous. Indeed, one cannot even make it continuous uniformly in {V}. Given two values of {c} that are close together, e.g. {c = 0} and {c=\epsilon}, and a reasonable function space {V} (e.g. a Sobolev space {H^k}, or a classical regularity space {C^k}) one can easily cook up a function {u_0} that is bounded in {V} but whose two solutions {u_0(\cdot)} and {u_0(\cdot-\epsilon)} separate in the {V} norm at time {1}, simply by choosing {u_0} to be supported on an interval of width {\epsilon}.

(Part of the problem here is that using a subtractive method {\|u-v\|_V} to determine the distance between two solutions {u, v} is not a physically natural operation when transport mechanisms are present that could cause the key features of {u, v} (such as singularities) to be situated in slightly different locations. In such cases, the correct notion of distance may need to take transport into account, e.g. by using metrics of Wasserstein type.)

On the other hand, one still has non-uniform continuous dependence on the initial parameters: if {u_0} lies in some reasonable function space {V}, then the map {c \mapsto u_0(\cdot-c)} is continuous in the {V} topology, even if it is not uniformly continuous with respect to {v_0}. (More succinctly: translation is a continuous but not uniformly continuous operation in most function spaces.) The reason for this is that we already have established this continuity in the case when {u_0} is so smooth that an additional derivative of {u_0} lies in {V}; and such smooth functions tend to be dense in the original space {V}, so the general case can then be established by a limiting argument, approximating a general function in {V} by a smoother function. We then see that the non-uniformity ultimately comes from the fact that a given function in {V} may be arbitrarily rough (or concentrated at an arbitrarily fine scale), and so the ability to approximate such a function by a smooth one can be arbitrarily poor.

In many quasilinear PDE, one often encounters qualitatively similar phenomena. Namely, one often has local well-posedness in sufficiently smooth function spaces {V} (so that if the initial data lies in {V}, then for short times one has existence, uniqueness, and continuous dependence on the data in the {V} topology), but Lipschitz or uniform continuity in the {V} topology is usually false. However, if the data (and solution) is known to be in a high-regularity function space {V}, one can often recover Lipschitz or uniform continuity in a lower-regularity topology.

Because the continuous dependence on the data in quasilinear equations is necessarily non-uniform, the arguments needed to establish this dependence can be remarkably delicate. As with the simple example of the transport equation, the key is to approximate a rough solution by a smooth solution first, by smoothing out the data (this is the non-uniform step, as it depends on the physical scale (or wavelength) that the data features are located). But for quasilinear equations, keeping the rough and smooth solution together can require a little juggling of function space norms, in particular playing the low-frequency nature of the smooth solution against the high-frequency nature of the residual between the rough and smooth solutions.

Below the fold I will illustrate this phenomenon with one of the simplest quasilinear equations, namely the initial value problem for the inviscid Burgers’ equation

\displaystyle  \partial_t u + u u_x = 0; \quad u(0,x) = u_0(x) \ \ \ \ \ (4)

which is a modification of the transport equation (3) in which the velocity {c} is no longer a parameter, but now depends (and is, in this case, actually equal to) the solution. To avoid technicalities we will work only with the classical function spaces {C^k} of {k} times continuously differentiable functions, though one can certainly work with other spaces (such as Sobolev spaces) by exploiting the Sobolev embedding theorem. To avoid having to distinguish continuity from uniform continuity, we shall work in a compact domain by assuming periodicity in space, thus for instance restricting {x} to the unit circle {{\bf R}/{\bf Z}}.

This discussion is inspired by this survey article of Nikolay Tzvetkov, which further explores the distinction between well-posedness and ill-posedness in both semilinear and quasilinear settings.

Read the rest of this entry »

A celebrated theorem of Gromov reads:

Theorem 1 Every finitely generated group of polynomial growth is virtually nilpotent.

The original proof of Gromov’s theorem was quite non-elementary, using an infinitary limit and exploiting the work surrounding the solution to Hilbert’s fifth problem. More recently, Kleiner provided a proof which was more elementary (based in large part on an earlier paper of Colding and Minicozzi), though still not entirely so, relying in part on (a weak form of the) Tits alternative and also on an ultrafilter argument of Korevaar-Schoen and Mok. I discuss Kleiner’s argument more in this previous blog post.

Recently, Yehuda Shalom and I established a quantitative version of Gromov’s theorem by making every component of Kleiner’s argument finitary. Technically, this provides a fully elementary proof of Gromov’s theorem (we do use one infinitary limit to simplify the argument a little bit, but this is not truly necessary); however, because we were trying to quantify as much of the result as possible, the argument became quite lengthy.

In this note I want to record a short version of the argument of Yehuda and myself which is not quantitative, but gives a self-contained and largely elementary proof of Gromov’s theorem. The argument is not too far from the Kleiner argument, but has a number of simplifications at various places. In a number of places, there was a choice to take between a short argument that was “inefficient” in the sense that it did not lead to a good quantitative bound, and a lengthier argument which led to better quantitative bounds. I have opted for the former in all such cases.

Yehuda and I plan to write a short paper containing this argument as well as some additional material, but due to some interest in this particular proof, we are detailing it here on this blog in advance of our paper.

Note: this post will assume familiarity with the basic terminology of group theory, and will move somewhat quickly through the technical details.

Read the rest of this entry »

Ben Green, and I have just uploaded to the arXiv a short (six-page) paper “Yet another proof of Szemeredi’s theorem“, submitted to the 70th birthday conference proceedings for Endre Szemerédi. In this paper we put in print a folklore observation, namely that the inverse conjecture for the Gowers norm, together with the density increment argument, easily implies Szemerédi’s famous theorem on arithmetic progressions. This is unsurprising, given that Gowers’ proof of Szemerédi’s theorem proceeds through a weaker version of the inverse conjecture and a density increment argument, and also given that it is possible to derive Szemerédi’s theorem from knowledge of the characteristic factor for multiple recurrence (the ergodic theory analogue of the inverse conjecture, first established by Host and Kra), as was done by Bergelson, Leibman, and Lesigne (and also implicitly in the earlier paper of Bergelson, Host, and Kra); but to our knowledge the exact derivation of Szemerédi’s theorem from the inverse conjecture was not in the literature. Ordinarily this type of folklore might be considered too trifling (and too well known among experts in the field) to publish; but we felt that the venue of the Szemerédi birthday conference provided a natural venue for this particular observation.

The key point is that one can show (by an elementary argument relying primarily an induction on dimension argument and the Weyl recurrence theorem, i.e. that given any real {\alpha} and any integer {s \geq 1}, that the expression {\alpha n^s} gets arbitrarily close to an integer) that given a (polynomial) nilsequence {n \mapsto F(g(n)\Gamma)}, one can subdivide any long arithmetic progression (such as {[N] = \{1,\ldots,N\}}) into a number of medium-sized progressions, where the nilsequence is nearly constant on each progression. As a consequence of this and the inverse conjecture for the Gowers norm, if a set has no arithmetic progressions, then it must have an elevated density on a subprogression; iterating this observation as per the usual density-increment argument as introduced long ago by Roth, one obtains the claim. (This is very close to the scheme of Gowers’ proof.)

Technically, one might call this the shortest proof of Szemerédi’s theorem in the literature (and would be something like the sixteenth such genuinely distinct proof, by our count), but that would be cheating quite a bit, primarily due to the fact that it assumes the inverse conjecture for the Gowers norm, our current proof of which is checking in at about 100 pages…

Ben Green, and I have just uploaded to the arXiv our paper “An arithmetic regularity lemma, an associated counting lemma, and applications“, submitted (a little behind schedule) to the 70th birthday conference proceedings for Endre Szemerédi. In this paper we describe the general-degree version of the arithmetic regularity lemma, which can be viewed as the counterpart of the Szemerédi regularity lemma, in which the object being regularised is a function {f: [N] \rightarrow [0,1]} on a discrete interval {[N] = \{1,\ldots,N\}} rather than a graph, and the type of patterns one wishes to count are additive patterns (such as arithmetic progressions {n,n+d,\ldots,n+(k-1)d}) rather than subgraphs. Very roughly speaking, this regularity lemma asserts that all such functions can be decomposed as a degree {\leq s} nilsequence (or more precisely, a variant of a nilsequence that we call an virtual irrational nilsequence), plus a small error, plus a third error which is extremely tiny in the Gowers uniformity norm {U^{s+1}[N]}. In principle, at least, the latter two errors can be readily discarded in applications, so that the regularity lemma reduces many questions in additive combinatorics to questions concerning (virtual irrational) nilsequences. To work with these nilsequences, we also establish a arithmetic counting lemma that gives an integral formula for counting additive patterns weighted by such nilsequences.

The regularity lemma is a manifestation of the “dichotomy between structure and randomness”, as discussed for instance in my ICM article or FOCS article. In the degree {1} case {s=1}, this result is essentially due to Green. It is powered by the inverse conjecture for the Gowers norms, which we and Tamar Ziegler have recently established (paper to be forthcoming shortly; the {k=4} case of our argument is discussed here). The counting lemma is established through the quantitative equidistribution theory of nilmanifolds, which Ben and I set out in this paper.

The regularity and counting lemmas are designed to be used together, and in the paper we give three applications of this combination. Firstly, we give a new proof of Szemerédi’s theorem, which proceeds via an energy increment argument rather than a density increment one. Secondly, we establish a conjecture of Bergelson, Host, and Kra, namely that if {A \subset [N]} has density {\alpha}, and {\epsilon > 0}, then there exist {\gg_{\alpha,\epsilon} N} shifts {h} for which {A} contains at least {(\alpha^4 - \epsilon)N} arithmetic progressions of length {k=4} of spacing {h}. (The {k=3} case of this conjecture was established earlier by Green; the {k=5} case is false, as was shown by Ruzsa in an appendix to the Bergelson-Host-Kra paper.) Thirdly, we establish a variant of a recent result of Gowers-Wolf, showing that the true complexity of a system of linear forms over {[N]} indeed matches the conjectured value predicted in their first paper.

In all three applications, the scheme of proof can be described as follows:

  • Apply the arithmetic regularity lemma, and decompose a relevant function {f} into three pieces, {f_{nil}, f_{sml}, f_{unf}}.
  • The uniform part {f_{unf}} is so tiny in the Gowers uniformity norm that its contribution can be easily dealt with by an appropriate “generalised von Neumann theorem”.
  • The contribution of the (virtual, irrational) nilsequence {f_{nil}} can be controlled using the arithmetic counting lemma.
  • Finally, one needs to check that the contribution of the small error {f_{sml}} does not overwhelm the main term {f_{nil}}. This is the trickiest bit; one often needs to use the counting lemma again to show that one can find a set of arithmetic patterns for {f_{nil}} that is so sufficiently “equidistributed” that it is not impacted by the small error.

To illustrate the last point, let us give the following example. Suppose we have a set {A \subset [N]} of some positive density (say {|A| = 10^{-1} N}) and we have managed to prove that {A} contains a reasonable number of arithmetic progressions of length {5} (say), e.g. it contains at least {10^{-10} N^2} such progressions. Now we perturb {A} by deleting a small number, say {10^{-2} N}, elements from {A} to create a new set {A'}. Can we still conclude that the new set {A'} contains any arithmetic progressions of length {5}?

Unfortunately, the answer could be no; conceivably, all of the {10^{-10} N^2} arithmetic progressions in {A} could be wiped out by the {10^{-2} N} elements removed from {A}, since each such element of {A} could be associated with up to {|A|} (or even {5|A|}) arithmetic progressions in {A}.

But suppose we knew that the {10^{-10} N^2} arithmetic progressions in {A} were equidistributed, in the sense that each element in {A} belonged to the same number of such arithmetic progressions, namely {5 \times 10^{-9} N}. Then each element deleted from {A} only removes at most {5 \times 10^{-9} N} progressions, and so one can safely remove {10^{-2} N} elements from {A} and still retain some arithmetic progressions. The same argument works if the arithmetic progressions are only approximately equidistributed, in the sense that the number of progressions that a given element {a \in A} belongs to concentrates sharply around its mean (for instance, by having a small variance), provided that the equidistribution is sufficiently strong. Fortunately, the arithmetic regularity and counting lemmas are designed to give precisely such a strong equidistribution result.

A succinct (but slightly inaccurate) summation of the regularity+counting lemma strategy would be that in order to solve a problem in additive combinatorics, it “suffices to check it for nilsequences”. But this should come with a caveat, due to the issue of the small error above; in addition to checking it for nilsequences, the answer in the nilsequence case must be sufficiently “dispersed” in a suitable sense, so that it can survive the addition of a small (but not completely negligible) perturbation.

One last “production note”. Like our previous paper with Emmanuel Breuillard, we used Subversion to write this paper, which turned out to be a significant efficiency boost as we could work on different parts of the paper simultaneously (this was particularly important this time round as the paper was somewhat lengthy and complicated, and there was a submission deadline). When doing so, we found it convenient to split the paper into a dozen or so pieces (one for each section of the paper, basically) in order to avoid conflicts, and to help coordinate the writing process. I’m also looking into git (a more advanced version control system), and am planning to use it for another of my joint projects; I hope to be able to comment on the relative strengths of these systems (and with plain old email) in the future.

As an experiment, I’ve recently started using Google Buzz as an outlet for various things I wanted to say or share, but which were too insubstantial to merit a mention on this blog. (In turn, one of the reasons of starting this blog was to share various bits of mathematics which were too insubstantial for a published paper.  Presumably the process becomes degenerate if iterated any further…)    I don’t know how frequently I will be updating, though.

In the foundations of modern probability, as laid out by Kolmogorov, the basic objects of study are constructed in the following order:

  1. Firstly, one selects a sample space {\Omega}, whose elements {\omega} represent all the possible states that one’s stochastic system could be in.
  2. Then, one selects a {\sigma}-algebra {{\mathcal B}} of events {E} (modeled by subsets of {\Omega}), and assigns each of these events a probability {{\bf P}(E) \in [0,1]} in a countably additive manner, so that the entire sample space has probability {1}.
  3. Finally, one builds (commutative) algebras of random variables {X} (such as complex-valued random variables, modeled by measurable functions from {\Omega} to {{\bf C}}), and (assuming suitable integrability or moment conditions) one can assign expectations {\mathop{\bf E} X} to each such random variable.

In measure theory, the underlying measure space {\Omega} plays a prominent foundational role, with the measurable sets and measurable functions (the analogues of the events and the random variables) always being viewed as somehow being attached to that space. In probability theory, in contrast, it is the events and their probabilities that are viewed as being fundamental, with the sample space {\Omega} being abstracted away as much as possible, and with the random variables and expectations being viewed as derived concepts. See Notes 0 for further discussion of this philosophy.

However, it is possible to take the abstraction process one step further, and view the algebra of random variables and their expectations as being the foundational concept, and ignoring both the presence of the original sample space, the algebra of events, or the probability measure.

There are two reasons for wanting to shed (or abstract away) these previously foundational structures. Firstly, it allows one to more easily take certain types of limits, such as the large {n} limit {n \rightarrow \infty} when considering {n \times n} random matrices, because quantities built from the algebra of random variables and their expectations, such as the normalised moments of random matrices tend to be quite stable in the large {n} limit (as we have seen in previous notes), even as the sample space and event space varies with {n}. (This theme of using abstraction to facilitate the taking of the large {n} limit also shows up in the application of ergodic theory to combinatorics via the correspondence principle; see this previous blog post for further discussion.)

Secondly, this abstract formalism allows one to generalise the classical, commutative theory of probability to the more general theory of non-commutative probability theory, which does not have a classical underlying sample space or event space, but is instead built upon a (possibly) non-commutative algebra of random variables (or “observables”) and their expectations (or “traces”). This more general formalism not only encompasses classical probability, but also spectral theory (with matrices or operators taking the role of random variables, and the trace taking the role of expectation), random matrix theory (which can be viewed as a natural blend of classical probability and spectral theory), and quantum mechanics (with physical observables taking the role of random variables, and their expected value on a given quantum state being the expectation). It is also part of a more general “non-commutative way of thinking” (of which non-commutative geometry is the most prominent example), in which a space is understood primarily in terms of the ring or algebra of functions (or function-like objects, such as sections of bundles) placed on top of that space, and then the space itself is largely abstracted away in order to allow the algebraic structures to become less commutative. In short, the idea is to make algebra the foundation of the theory, as opposed to other possible choices of foundations such as sets, measures, categories, etc..

[Note that this foundational preference is to some extent a metamathematical one rather than a mathematical one; in many cases it is possible to rewrite the theory in a mathematically equivalent form so that some other mathematical structure becomes designated as the foundational one, much as probability theory can be equivalently formulated as the measure theory of probability measures. However, this does not negate the fact that a different choice of foundations can lead to a different way of thinking about the subject, and thus to ask a different set of questions and to discover a different set of proofs and solutions. Thus it is often of value to understand multiple foundational perspectives at once, to get a truly stereoscopic view of the subject.]

It turns out that non-commutative probability can be modeled using operator algebras such as {C^*}-algebras, von Neumann algebras, or algebras of bounded operators on a Hilbert space, with the latter being accomplished via the Gelfand-Naimark-Segal construction. We will discuss some of these models here, but just as probability theory seeks to abstract away its measure-theoretic models, the philosophy of non-commutative probability is also to downplay these operator algebraic models once some foundational issues are settled.

When one generalises the set of structures in one’s theory, for instance from the commutative setting to the non-commutative setting, the notion of what it means for a structure to be “universal”, “free”, or “independent” can change. The most familiar example of this comes from group theory. If one restricts attention to the category of abelian groups, then the “freest” object one can generate from two generators {e,f} is the free abelian group of commutative words {e^n f^m} with {n,m \in {\bf Z}}, which is isomorphic to the group {{\bf Z}^2}. If however one generalises to the non-commutative setting of arbitrary groups, then the “freest” object that can now be generated from two generators {e,f} is the free group {{\Bbb F}_2} of non-commutative words {e^{n_1} f^{m_1} \ldots e^{n_k} f^{m_k}} with {n_1,m_1,\ldots,n_k,m_k \in {\bf Z}}, which is a significantly larger extension of the free abelian group {{\bf Z}^2}.

Similarly, when generalising classical probability theory to non-commutative probability theory, the notion of what it means for two or more random variables to be independent changes. In the classical (commutative) setting, two (bounded, real-valued) random variables {X, Y} are independent if one has

\displaystyle \mathop{\bf E} f(X) g(Y) = 0

whenever {f, g: {\bf R} \rightarrow {\bf R}} are well-behaved functions (such as polynomials) such that {\mathop{\bf E} f(X)}, {\mathop{\bf E} g(Y)} both vanish. In the non-commutative setting, one can generalise the above definition to two commuting bounded self-adjoint variables; this concept is useful for instance in quantum probability, which is an abstraction of the theory of observables in quantum mechanics. But for two (bounded, self-adjoint) non-commutative random variables {X, Y}, the notion of classical independence no longer applies. As a substitute, one can instead consider the notion of being freely independent (or free for short), which means that

\displaystyle \mathop{\bf E} f_1(X) g_1(Y) \ldots f_k(X) g_k(Y) = 0

whenever {f_1,g_1,\ldots,f_k,g_k: {\bf R} \rightarrow {\bf R}} are well-behaved functions such that all of {\mathop{\bf E} f_1(X), \mathop{\bf E} g_1(Y), \ldots, \mathop{\bf E} f_k(X), \mathop{\bf E} g_k(Y)} vanish.

The concept of free independence was introduced by Voiculescu, and its study is now known as the subject of free probability. We will not attempt a systematic survey of this subject here; for this, we refer the reader to the surveys of Speicher and of Biane. Instead, we shall just discuss a small number of topics in this area to give the flavour of the subject only.

The significance of free probability to random matrix theory lies in the fundamental observation that random matrices which are independent in the classical sense, also tend to be independent in the free probability sense, in the large {n} limit {n \rightarrow \infty}. (This is only possible because of the highly non-commutative nature of these matrices; as we shall see, it is not possible for non-trivial commuting independent random variables to be freely independent.) Because of this, many tedious computations in random matrix theory, particularly those of an algebraic or enumerative combinatorial nature, can be done more quickly and systematically by using the framework of free probability, which by design is optimised for algebraic tasks rather than analytical ones.

Much as free groups are in some sense “maximally non-commutative”, freely independent random variables are about as far from being commuting as possible. For instance, if {X, Y} are freely independent and of expectation zero, then {\mathop{\bf E} XYXY} vanishes, but {\mathop{\bf E} XXYY} instead factors as {(\mathop{\bf E} X^2) (\mathop{\bf E} Y^2)}. As a consequence, the behaviour of freely independent random variables can be quite different from the behaviour of their classically independent commuting counterparts. Nevertheless there is a remarkably strong analogy between the two types of independence, in that results which are true in the classically independent case often have an interesting analogue in the freely independent setting. For instance, the central limit theorem (Notes 2) for averages of classically independent random variables, which roughly speaking asserts that such averages become gaussian in the large {n} limit, has an analogue for averages of freely independent variables, the free central limit theorem, which roughly speaking asserts that such averages become semicircular in the large {n} limit. One can then use this theorem to provide yet another proof of Wigner’s semicircle law (Notes 4).

Another important (and closely related) analogy is that while the distribution of sums of independent commutative random variables can be quickly computed via the characteristic function (i.e. the Fourier transform of the distribution), the distribution of sums of freely independent non-commutative random variables can be quickly computed using the Stieltjes transform instead (or with closely related objects, such as the {R}-transform of Voiculescu). This is strongly reminiscent of the appearance of the Stieltjes transform in random matrix theory, and indeed we will see many parallels between the use of the Stieltjes transform here and in Notes 4.

As mentioned earlier, free probability is an excellent tool for computing various expressions of interest in random matrix theory, such as asymptotic values of normalised moments in the large {n} limit {n \rightarrow \infty}. Nevertheless, as it only covers the asymptotic regime in which {n} is sent to infinity while holding all other parameters fixed, there are some aspects of random matrix theory to which the tools of free probability are not sufficient by themselves to resolve (although it can be possible to combine free probability theory with other tools to then answer these questions). For instance, questions regarding the rate of convergence of normalised moments as {n \rightarrow \infty} are not directly answered by free probability, though if free probability is combined with tools such as concentration of measure (Notes 1) then such rate information can often be recovered. For similar reasons, free probability lets one understand the behaviour of {k^{th}} moments as {n \rightarrow \infty} for fixed {k}, but has more difficulty dealing with the situation in which {k} is allowed to grow slowly in {n} (e.g. {k = O(\log n)}). Because of this, free probability methods are effective at controlling the bulk of the spectrum of a random matrix, but have more difficulty with the edges of that spectrum (as well as with related concepts such as the operator norm, Notes 3) as well as with fine-scale structure of the spectrum. Finally, free probability methods are most effective when dealing with matrices that are Hermitian with bounded operator norm, largely because the spectral theory of bounded self-adjoint operators in the infinite-dimensional setting of the large {n} limit is non-pathological. (This is ultimately due to the stable nature of eigenvalues in the self-adjoint setting; see this previous blog post for discussion.) For non-self-adjoint operators, free probability needs to be augmented with additional tools, most notably by bounds on least singular values, in order to recover the required stability for the various spectral data of random matrices to behave continuously with respect to the large {n} limit. We will discuss this latter point in a later set of notes.

Read the rest of this entry »

I have just finished the first draft of my blog book for 2009, under the title of “An epsilon of room: pages from year three of a mathematical blog“.  It largely follows the format of my previous two blog books, “Structure and Randomness” and “Poincaré’s legacies“.

There is still some amount of work to be done on the texts; for instance, I need to create an index (which I had neglected to do in the previous two books in the series), and will probably end up splitting the book into two volumes (as was done for “Poincaré’s legacies”).

As always, any feedback or comments are very welcome.

I’ve just uploaded the D.H.J. Polymath article “Density Hales-Jewett and Moser numbers” to the arXiv, submitted to the Szemeredi birthday conference proceedings.

This article investigates the Density Hales-Jewett numbers c_{n,3}, defined as the largest subset of the n-dimensional cube [3]^n containing no combinatorial line, as well as the related Moser numbers c'_{n,3}, defined as the largest subset of the n-dimensional cube containing no geometric line.  We compute the first six numbers in each sequence in this paper.  For the DHJ numbers, they are 1,2,6,18,52,150,450 and for the Moser numbers they are 1,2,6,16,43,124,353.  The last two elements of both sequences are new; the computation c'_{6,3}=353 was the hardest and required a non-trivial amount of computer assistance.

We also establish the asymptotic lower bounds

c_{n,3} \geq 3^n \exp( - O(\sqrt{\log n}) )


c'_{n,3} \geq (2 \sqrt{\frac{9}{4\pi}} + o(1)) \frac{3^n}{\sqrt{n}}.

In contrast, the best known upper bound to these quantities is O(3^n / \log^{1/3}_* n), obtained by the sister project to this Polymath project.

We also show a counterexample to a certain “hyper-optimistic conjecture” which would have generalised the Lubell-Yamamoto–Meshalkin (LYM) inequality to this setting.

Thanks to all the participants for this interesting experiment in collaborative mathematics.  This is certainly a different type of project from the ones I normally am involved in, but I found it to be an enjoyable and educational experience.

There is still some time to make further (minor) corrections; I think the deadline for the final submission should be in April.

We can now turn attention to one of the centerpiece universality results in random matrix theory, namely the Wigner semi-circle law for Wigner matrices. Recall from previous notes that a Wigner Hermitian matrix ensemble is a random matrix ensemble {M_n = (\xi_{ij})_{1 \leq i,j \leq n}} of Hermitian matrices (thus {\xi_{ij} = \overline{\xi_{ji}}}; this includes real symmetric matrices as an important special case), in which the upper-triangular entries {\xi_{ij}}, {i>j} are iid complex random variables with mean zero and unit variance, and the diagonal entries {\xi_{ii}} are iid real variables, independent of the upper-triangular entries, with bounded mean and variance. Particular special cases of interest include the Gaussian Orthogonal Ensemble (GOE), the symmetric random sign matrices (aka symmetric Bernoulli ensemble), and the Gaussian Unitary Ensemble (GUE).

In previous notes we saw that the operator norm of {M_n} was typically of size {O(\sqrt{n})}, so it is natural to work with the normalised matrix {\frac{1}{\sqrt{n}} M_n}. Accordingly, given any {n \times n} Hermitian matrix {M_n}, we can form the (normalised) empirical spectral distribution (or ESD for short)

\displaystyle  \mu_{\frac{1}{\sqrt{n}} M_n} := \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(M_n) / \sqrt{n}},

of {M_n}, where {\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)} are the (necessarily real) eigenvalues of {M_n}, counting multiplicity. The ESD is a probability measure, which can be viewed as a distribution of the normalised eigenvalues of {M_n}.

When {M_n} is a random matrix ensemble, then the ESD {\mu_{\frac{1}{\sqrt{n}} M_n}} is now a random measure – i.e. a random variable taking values in the space {\hbox{Pr}({\mathbb R})} of probability measures on the real line. (Thus, the distribution of {\mu_{\frac{1}{\sqrt{n}} M_n}} is a probability measure on probability measures!)

Now we consider the behaviour of the ESD of a sequence of Hermitian matrix ensembles {M_n} as {n \rightarrow \infty}. Recall from Notes 0 that for any sequence of random variables in a {\sigma}-compact metrisable space, one can define notions of convergence in probability and convergence almost surely. Specialising these definitions to the case of random probability measures on {{\mathbb R}}, and to deterministic limits, we see that a sequence of random ESDs {\mu_{\frac{1}{\sqrt{n}} M_n}} converge in probability (resp. converge almost surely) to a deterministic limit {\mu \in \hbox{Pr}({\mathbb R})} (which, confusingly enough, is a deterministic probability measure!) if, for every test function {\varphi \in C_c({\mathbb R})}, the quantities {\int_{\mathbb R} \varphi\ d\mu_{\frac{1}{\sqrt{n}} M_n}} converge in probability (resp. converge almost surely) to {\int_{\mathbb R} \varphi\ d\mu}.

Remark 1 As usual, convergence almost surely implies convergence in probability, but not vice versa. In the special case of random probability measures, there is an even weaker notion of convergence, namely convergence in expectation, defined as follows. Given a random ESD {\mu_{\frac{1}{\sqrt{n}} M_n}}, one can form its expectation {{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} \in \hbox{Pr}({\mathbb R})}, defined via duality (the Riesz representation theorem) as

\displaystyle  \int_{\mathbb R} \varphi\ d{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n} := {\bf E} \int_{\mathbb R} \varphi\ d	 \mu_{\frac{1}{\sqrt{n}} M_n};

this probability measure can be viewed as the law of a random eigenvalue {\frac{1}{\sqrt{n}}\lambda_i(M_n)} drawn from a random matrix {M_n} from the ensemble. We then say that the ESDs converge in expectation to a limit {\mu \in \hbox{Pr}({\mathbb R})} if {{\bf E} \mu_{\frac{1}{\sqrt{n}} M_n}} converges the vague topology to {\mu}, thus

\displaystyle  {\bf E} \int_{\mathbb R} \varphi\ d	 \mu_{\frac{1}{\sqrt{n}} M_n} \rightarrow \int_{\mathbb R} \varphi\ d\mu

for all {\phi \in C_c({\mathbb R})}.

In general, these notions of convergence are distinct from each other; but in practice, one often finds in random matrix theory that these notions are effectively equivalent to each other, thanks to the concentration of measure phenomenon.

Exercise 1 Let {M_n} be a sequence of {n \times n} Hermitian matrix ensembles, and let {\mu} be a continuous probability measure on {{\mathbb R}}.

  • Show that {\mu_{\frac{1}{\sqrt{n}} M_n}} converges almost surely to {\mu} if and only if {\mu_{\frac{1}{\sqrt{n}}}(-\infty,\lambda)} converges almost surely to {\mu(-\infty,\lambda)} for all {\lambda \in {\mathbb R}}.
  • Show that {\mu_{\frac{1}{\sqrt{n}} M_n}} converges in probability to {\mu} if and only if {\mu_{\frac{1}{\sqrt{n}}}(-\infty,\lambda)} converges in probability to {\mu(-\infty,\lambda)} for all {\lambda \in {\mathbb R}}.
  • Show that {\mu_{\frac{1}{\sqrt{n}} M_n}} converges in expectation to {\mu} if and only if {\mathop{\mathbb E} \mu_{\frac{1}{\sqrt{n}}}(-\infty,\lambda)} converges to {\mu(-\infty,\lambda)} for all {\lambda \in {\mathbb R}}.

We can now state the Wigner semi-circular law.

Theorem 1 (Semicircular law) Let {M_n} be the top left {n \times n} minors of an infinite Wigner matrix {(\xi_{ij})_{i,j \geq 1}}. Then the ESDs {\mu_{\frac{1}{\sqrt{n}} M_n}} converge almost surely (and hence also in probability and in expectation) to the Wigner semi-circular distribution

\displaystyle  \mu_{sc} := \frac{1}{2\pi} (4-|x|^2)^{1/2}_+\ dx. \ \ \ \ \ (1)

A numerical example of this theorem in action can be seen at the MathWorld entry for this law.

The semi-circular law nicely complements the upper Bai-Yin theorem from Notes 3, which asserts that (in the case when the entries have finite fourth moment, at least), the matrices {\frac{1}{\sqrt{n}} M_n} almost surely has operator norm at most {2+o(1)}. Note that the operator norm is the same thing as the largest magnitude of the eigenvalues. Because the semi-circular distribution (1) is supported on the interval {[-2,2]} with positive density on the interior of this interval, Theorem 1 easily supplies the lower Bai-Yin theorem, that the operator norm of {\frac{1}{\sqrt{n}} M_n} is almost surely at least {2-o(1)}, and thus (in the finite fourth moment case) the norm is in fact equal to {2+o(1)}. Indeed, we have just shown that the circular law provides an alternate proof of the lower Bai-Yin bound (Proposition 11 of Notes 3).

As will hopefully become clearer in the next set of notes, the semi-circular law is the noncommutative (or free probability) analogue of the central limit theorem, with the semi-circular distribution (1) taking on the role of the normal distribution. Of course, there is a striking difference between the two distributions, in that the former is compactly supported while the latter is merely subgaussian. One reason for this is that the concentration of measure phenomenon is more powerful in the case of ESDs of Wigner matrices than it is for averages of iid variables; compare the concentration of measure results in Notes 3 with those in Notes 1.

There are several ways to prove (or at least to heuristically justify) the circular law. In this set of notes we shall focus on the two most popular methods, the moment method and the Stieltjes transform method, together with a third (heuristic) method based on Dyson Brownian motion (Notes 3b). In the next set of notes we shall also study the free probability method, and in the set of notes after that we use the determinantal processes method (although this method is initially only restricted to highly symmetric ensembles, such as GUE).

Read the rest of this entry »