You are currently browsing the category archive for the ‘math.PR’ category.

The classical foundations of probability theory (discussed for instance in this previous blog post) is founded on the notion of a probability space ${(\Omega, {\cal E}, {\bf P})}$ – a space ${\Omega}$ (the sample space) equipped with a ${\sigma}$-algebra ${{\cal E}}$ (the event space), together with a countably additive probability measure ${{\bf P}: {\cal E} \rightarrow [0,1]}$ that assigns a real number in the interval ${[0,1]}$ to each event.

One can generalise the concept of a probability space to a finitely additive probability space, in which the event space ${{\cal E}}$ is now only a Boolean algebra rather than a ${\sigma}$-algebra, and the measure ${\mu}$ is now only finitely additive instead of countably additive, thus ${{\bf P}( E \vee F ) = {\bf P}(E) + {\bf P}(F)}$ when ${E,F}$ are disjoint events. By giving up countable additivity, one loses a fair amount of measure and integration theory, and in particular the notion of the expectation of a random variable becomes problematic (unless the random variable takes only finitely many values). Nevertheless, one can still perform a fair amount of probability theory in this weaker setting.

In this post I would like to describe a further weakening of probability theory, which I will call qualitative probability theory, in which one does not assign a precise numerical probability value ${{\bf P}(E)}$ to each event, but instead merely records whether this probability is zero, one, or something in between. Thus ${{\bf P}}$ is now a function from ${{\cal E}}$ to the set ${\{0, I, 1\}}$, where ${I}$ is a new symbol that replaces all the elements of the open interval ${(0,1)}$. In this setting, one can no longer compute quantitative expressions, such as the mean or variance of a random variable; but one can still talk about whether an event holds almost surely, with positive probability, or with zero probability, and there are still usable notions of independence. (I will refer to classical probability theory as quantitative probability theory, to distinguish it from its qualitative counterpart.)

The main reason I want to introduce this weak notion of probability theory is that it becomes suited to talk about random variables living inside algebraic varieties, even if these varieties are defined over fields other than ${{\bf R}}$ or ${{\bf C}}$. In algebraic geometry one often talks about a “generic” element of a variety ${V}$ defined over a field ${k}$, which does not lie in any specified variety of lower dimension defined over ${k}$. Once ${V}$ has positive dimension, such generic elements do not exist as classical, deterministic ${k}$-points ${x}$ in ${V}$, since of course any such point lies in the ${0}$-dimensional subvariety ${\{x\}}$ of ${V}$. There are of course several established ways to deal with this problem. One way (which one might call the “Weil” approach to generic points) is to extend the field ${k}$ to a sufficiently transcendental extension ${\tilde k}$, in order to locate a sufficient number of generic points in ${V(\tilde k)}$. Another approach (which one might dub the “Zariski” approach to generic points) is to work scheme-theoretically, and interpret a generic point in ${V}$ as being associated to the zero ideal in the function ring of ${V}$. However I want to discuss a third perspective, in which one interprets a generic point not as a deterministic object, but rather as a random variable ${{\bf x}}$ taking values in ${V}$, but which lies in any given lower-dimensional subvariety of ${V}$ with probability zero. This interpretation is intuitive, but difficult to implement in classical probability theory (except perhaps when considering varieties over ${{\bf R}}$ or ${{\bf C}}$) due to the lack of a natural probability measure to place on algebraic varieties; however it works just fine in qualitative probability theory. In particular, the algebraic geometry notion of being “generically true” can now be interpreted probabilistically as an assertion that something is “almost surely true”.

It turns out that just as qualitative random variables may be used to interpret the concept of a generic point, they can also be used to interpret the concept of a type in model theory; the type of a random variable ${x}$ is the set of all predicates ${\phi(x)}$ that are almost surely obeyed by ${x}$. In contrast, model theorists often adopt a Weil-type approach to types, in which one works with deterministic representatives of a type, which often do not occur in the original structure of interest, but only in a sufficiently saturated extension of that structure (this is the analogue of working in a sufficiently transcendental extension of the base field). However, it seems that (in some cases at least) one can equivalently view types in terms of (qualitative) random variables on the original structure, avoiding the need to extend that structure. (Instead, one reserves the right to extend the sample space of one’s probability theory whenever necessary, as part of the “probabilistic way of thinking” discussed in this previous blog post.) We illustrate this below the fold with two related theorems that I will interpret through the probabilistic lens: the “group chunk theorem” of Weil (and later developed by Hrushovski), and the “group configuration theorem” of Zilber (and again later developed by Hrushovski). For sake of concreteness we will only consider these theorems in the theory of algebraically closed fields, although the results are quite general and can be applied to many other theories studied in model theory.

Van Vu and I have just uploaded to the arXiv our joint paper “Local universality of zeroes of random polynomials“. This paper is a sequel to our previous work on local universality of eigenvalues of (non-Hermitian) random matrices ${M_n}$ with independent entries. One can re-interpret these previous results as a universality result for a certain type of random polynomial ${f: {\bf C} \rightarrow {\bf C}}$, namely the characteristic polynomial ${f(z) = \hbox{det}(M_n-z)}$ of the random matrix ${M_n}$. In this paper, we consider the analogous problem for a different model of random polynomial, namely polynomials ${f}$ with independent random coefficients. More precisely, we consider random polynomials ${f = f_n}$ of the form

$\displaystyle f(z) = \sum_{i=0}^n c_i \xi_i z^n$

where ${c_0,\ldots,c_n \in {\bf C}}$ are deterministic complex coefficients, and ${\xi_0,\ldots,\xi_n \equiv \xi}$ are independent identically distributed copies of a complex random variable ${\xi}$, which we normalise to have mean zero and variance one. For simplicity we will ignore the technical issue that the leading coefficient ${c_n \xi_n}$ is allowed to vanish; then ${f}$ has ${n}$ zeroes ${\zeta_1,\ldots,\zeta_n \in {\bf C}}$ (counting multiplicity), which can be viewed as a random point process ${\Sigma = \{\zeta_1,\ldots,\zeta_n\}}$ in the complex plane. In analogy with other models (such as random matrix models), we expect the (suitably normalised) asymptotic statistics of this point process in the limit ${n \rightarrow \infty}$ to be universal, in the sense that it is largely independent of the precise distribution of the atom variable ${\xi}$.

Our results are fairly general with regard to the choice of coefficients ${c_i}$, but we isolate three particular choice of coefficients that are particularly natural and well-studied in the literature:

• Flat polynomials (or Weyl polynomials) in which ${c_i := \frac{1}{\sqrt{i!}}}$.
• Elliptic polynomials (or binomial polynomials) in which ${c_i := \sqrt{\binom{n}{i}}}$.
• Kac polynomials in which ${c_i := 1}$.

The flat and elliptic polynomials enjoy special symmetries in the model case when the atom distribution ${\xi}$ is a complex Gaussian ${N(0,1)_{\bf C}}$. Indeed, the zeroes ${\Sigma}$ of elliptic polynomials with complex Gaussian coefficients have a distribution which is invariant with respect to isometries ${T: {\bf C} \cup \{\infty\} \rightarrow {\bf C} \cup \{\infty\}}$ of the Riemann sphere ${{\bf C} \cup \{\infty\}}$ (thus ${T\Sigma}$ has the same distribution as ${\Sigma}$), while the zeroes of the limiting case ${\sum_{i=0}^\infty \frac{1}{\sqrt{i!}} \xi_i z^i}$ of the flat polynomials with complex Gaussian coefficients are similarly invariant with respect to isometries ${T: {\bf C} \rightarrow {\bf C}}$ of the complex plane ${{\bf C}}$. (For a nice geometric proof of this facts, I recommend the nice book of Hough, Krishnapur, Peres, and Virag.)

The global (i.e. coarse-scale) distribution of zeroes of these polynomials is well understood, first in the case of gaussian distributions using the fundamental tool of the Kac-Rice formula, and then for more general choices of atom distribution in the recent work of Kabluchko and Zaporozhets. The zeroes of the flat polynomials are asymptotically distributed according to the circular law, normalised to be uniformly distributed on the disk ${B(0,\sqrt{n})}$ of radius ${\sqrt{n}}$ centred at the origin. To put it a bit informally, the zeroes are asymptotically distributed according to the measure ${\frac{1}{\pi} 1_{|z| \leq \sqrt{n}} dz}$, where ${dz}$ denotes Lebesgue measure on the complex plane. One can non-rigorously see the scale ${\sqrt{n}}$ appear by observing that when ${|z|}$ is much larger than ${\sqrt{n}}$, we expect the leading term ${\frac{1}{\sqrt{n!}} \xi_n z^n}$ of the flat polynomial ${\sum_{i=0}^n \frac{1}{\sqrt{i!}} \xi_i z^i}$ to dominate, so that the polynomial should not have any zeroes in this region.

Similarly, the distribution of the elliptic polynomials is known to be asymptotically distributed according to a Cauchy-type distribution ${\frac{1}{\pi} \frac{1}{1+|z|^2/n} dz}$. The Kac polynomials ${\sum_{i=0}^n \xi_i z^i}$ behave differently; the zeroes concentrate uniformly on the unit circle ${|z|=1}$ (which is reasonable, given that one would expect the top order term ${\xi_i z^i}$ to dominate for ${|z| > 1}$ and the bottom order term ${\xi_0}$ to dominate for ${|z| < 1}$). In particular, whereas the typical spacing between zeroes in the flat and elliptic cases would be expected to be comparable to ${1}$, the typical spacing between zeroes for a Kac polynomial would be expected instead to be comparable to ${1/n}$.

In our paper we studied the local distribution of zeroes at the scale of the typical spacing. In the case of polynomials with continuous complex atom disribution ${\xi}$, the natural statistic to measure here is the ${k}$-point correlation function ${\rho^{(k)}(z_1,\ldots,z_k)}$, which for distinct complex numbers ${z_1,\ldots,z_k}$ can be defined as the probability that there is a zero in each of the balls ${B(z_1,\varepsilon),\ldots,B(z_k,\varepsilon)}$ for some infinitesimal ${\epsilon >0}$, divided by the normalising factor ${(\pi \epsilon^2)^k}$. (One can also define a ${k}$-point correlation function in the case of a discrete distribution, but it will be a measure rather than a function in that case.) Our first main theorem is a general replacement principle which asserts, very roughly speaking, that the asymptotic ${k}$-point correlation functions of two random polynomials ${f, \tilde f}$ will agree if the log-magnitudes ${\log |f(z)|, \log |\tilde f(z)|}$ have asymptotically the same distribution (actually we have to consider the joint distribution of ${\log |f(z_1)|, \ldots \log |f(z_k)|}$ for several points ${z_1,\ldots,z_k}$, but let us ignore this detail for now), and if the polynomials ${f, \tilde f}$ obeys a “non-clustering property” which asserts, roughly speaking, that not too many of the zeroes of ${f}$ can typically concentrate in a small ball. This replacement principle was implicit in our previous paper (and can be viewed as a local-scale version of the global-scale replacement principle in this earlier paper of ours). Specialising the replacement principle to the elliptic or flat polynomials, and using the Lindeberg swapping argument, we obtain a Two Moment Theorem that asserts, roughly speaking, that the asymptotic behaviour of the ${k}$-point correlation functions depends only on the first two moments of the real and imaginary components of ${\xi}$, as long as one avoids some regions of space where universality is expected to break down. (In particular, because ${f(0) = c_0 \xi_0}$ does not have a universal distribution, one does not expect universality to hold near the origin; there is a similar problem near infinity.) Closely related results, by a slightly different method, have also been obtained recently by Ledoan, Merkli, and Starr. A similar result holds for the Kac polynomials after rescaling to account for the narrower spacing between zeroes.

We also have analogous results in the case of polynomials with real coefficients (so that the coefficients ${c_i}$ and the atom distribution ${\xi}$ are both real). In this case one expects to see a certain number of real zeroes, together with conjugate pairs of complex zeroes. Instead of the ${k}$-point correlation function ${\rho^{(k)}(z_1,\ldots,z_k)}$, the natural object of study is now the mixed ${(k,l)}$-point correlation function ${\rho^{(k,l)}(x_1,\ldots,x_k,z_1,\ldots,z_l)}$ that (roughly speaking) controls how often one can find a real zero near the real numbers ${x_1,\ldots,x_k}$, and a complex zero near the points ${z_1,\ldots,z_l}$. It turns out that one can disentangle the real and strictly complex zeroes and obtain separate universality results for both zeroes, provided that at least one of the polynomials involved obeys a “weak repulsion estimate” that shows that the real zeroes do not cluster very close to each other (and that the complex zeroes do not cluster very close to their complex conjugates). Such an estimate is needed to avoid the situation of two nearby real zeroes “colliding” to create a (barely) complex zero and its complex conjugate, or the time reversal of such a collision. Fortunately, in the case of real gaussian polynomials one can use the Kac-Rice formula to establish such a weak repulsion estimate, allowing analogues of the above universality results for complex random polynomials in the real case. Among other things, this gives universality results for the number ${N_{\bf R}}$ of real zeroes of a random flat or elliptic polynomial; it turns out this number is typically equal to ${\frac{2}{\pi} \sqrt{n} + O(n^{1/2-c})}$ and ${\frac{n} + O(n^{1/2-c})}$ respectively. (For Kac polynomials, the situation is somewhat different; it was already known that ${N_{\bf R} = \frac{2}{\pi} \log n + o(\log n)}$ thanks to a long series of papers, starting with the foundational work of Kac and culminating in the work of Ibragimov and Maslova.)

While our methods are based on our earlier work on eigenvalues of random matrices, the situation with random polynomials is actually somewhat easier to deal with. This is because the log-magnitude ${\log |f(z)|}$ of a random polynomial with independent coefficients is significantly easier to control than the log-determinant ${\log |\hbox{det}(M-z)|}$ of a random matrix, as the former can be controlled by the central limit theorem, while the latter requires significantly more difficult arguments (in particular, bounds on the least singular value combined with Girko’s Hermitization trick). As such, one could view the current paper as an introduction to our more complicated previous paper, and with this in mind we have written the current paper to be self-contained (though this did make the paper somewhat lengthy, checking in at 68 pages).

The fundamental notions of calculus, namely differentiation and integration, are often viewed as being the quintessential concepts in mathematical analysis, as their standard definitions involve the concept of a limit. However, it is possible to capture most of the essence of these notions by purely algebraic means (almost completely avoiding the use of limits, Riemann sums, and similar devices), which turns out to be useful when trying to generalise these concepts to more abstract situations in which it becomes convenient to permit the underlying number systems involved to be something other than the real or complex numbers, even if this makes many standard analysis constructions unavailable. For instance, the algebraic notion of a derivation often serves as a substitute for the analytic notion of a derivative in such cases, by abstracting out the key algebraic properties of differentiation, namely linearity and the Leibniz rule (also known as the product rule).

Abstract algebraic analogues of integration are less well known, but can still be developed. To motivate such an abstraction, consider the integration functional ${I: {\mathcal S}({\bf R} \rightarrow {\bf C}) \rightarrow {\bf C}}$ from the space ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$ of complex-valued Schwarz functions ${f: {\bf R} \rightarrow {\bf C}}$ to the complex numbers, defined by

$\displaystyle I(f) := \int_{\bf R} f(x)\ dx$

where the integration on the right is the usual Lebesgue integral (or improper Riemann integral) from analysis. This functional obeys two obvious algebraic properties. Firstly, it is linear over ${{\bf C}}$, thus

$\displaystyle I(cf) = c I(f) \ \ \ \ \ (1)$

and

$\displaystyle I(f+g) = I(f) + I(g) \ \ \ \ \ (2)$

for all ${f,g \in {\mathcal S}({\bf R} \rightarrow {\bf C})}$ and ${c \in {\bf C}}$. Secondly, it is translation invariant, thus

$\displaystyle I(\tau_h f) = I(f) \ \ \ \ \ (3)$

for all ${h \in {\bf C}}$, where ${\tau_h f(x) := f(x-h)}$ is the translation of ${f}$ by ${h}$. Motivated by the uniqueness theory of Haar measure, one might expect that these two axioms already uniquely determine ${I}$ after one sets a normalisation, for instance by requiring that

$\displaystyle I( x \mapsto e^{-\pi x^2} ) = 1. \ \ \ \ \ (4)$

This is not quite true as stated (one can modify the proof of the Hahn-Banach theorem, after first applying a Fourier transform, to create pathological translation-invariant linear functionals on ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$ that are not multiples of the standard Fourier transform), but if one adds a mild analytical axiom, such as continuity of ${I}$ (using the usual Schwartz topology on ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$), then the above axioms are enough to uniquely pin down the notion of integration. Indeed, if ${I: {\mathcal S}({\bf R} \rightarrow {\bf C}) \rightarrow {\bf C}}$ is a continuous linear functional that is translation invariant, then from the linearity and translation invariance axioms one has

$\displaystyle I( \frac{\tau_h f - f}{h} ) = 0$

for all ${f \in {\mathcal S}({\bf R} \rightarrow {\bf C})}$ and non-zero reals ${h}$. If ${f}$ is Schwartz, then as ${h \rightarrow 0}$, one can verify that the Newton quotients ${\frac{\tau_h f - f}{h}}$ converge in the Schwartz topology to the derivative ${f'}$ of ${f}$, so by the continuity axiom one has

$\displaystyle I(f') = 0.$

Next, note that any Schwartz function of integral zero has an antiderivative which is also Schwartz, and so ${I}$ annihilates all zero-integral Schwartz functions, and thus must be a scalar multiple of the usual integration functional. Using the normalisation (4), we see that ${I}$ must therefore be the usual integration functional, giving the claimed uniqueness.

Motivated by the above discussion, we can define the notion of an abstract integration functional ${I: X \rightarrow R}$ taking values in some vector space ${R}$, and applied to inputs ${f}$ in some other vector space ${X}$ that enjoys a linear action ${h \mapsto \tau_h}$ (the “translation action”) of some group ${V}$, as being a functional which is both linear and translation invariant, thus one has the axioms (1), (2), (3) for all ${f,g \in X}$, scalars ${c}$, and ${h \in V}$. The previous discussion then considered the special case when ${R = {\bf C}}$, ${X = {\mathcal S}({\bf R} \rightarrow {\bf C})}$, ${V = {\bf R}}$, and ${\tau}$ was the usual translation action.

Once we have performed this abstraction, we can now present analogues of classical integration which bear very little analytic resemblance to the classical concept, but which still have much of the algebraic structure of integration. Consider for instance the situation in which we keep the complex range ${R = {\bf C}}$, the translation group ${V = {\bf R}}$, and the usual translation action ${h \mapsto \tau_h}$, but we replace the space ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$ of Schwartz functions by the space ${Poly_{\leq d}({\bf R} \rightarrow {\bf C})}$ of polynomials ${x \mapsto a_0 + a_1 x + \ldots + a_d x^d}$ of degree at most ${d}$ with complex coefficients, where ${d}$ is a fixed natural number; note that this space is translation invariant, so it makes sense to talk about an abstract integration functional ${I: Poly_{\leq d}({\bf R} \rightarrow {\bf C}) \rightarrow {\bf C}}$. Of course, one cannot apply traditional integration concepts to non-zero polynomials, as they are not absolutely integrable. But one can repeat the previous arguments to show that any abstract integration functional must annihilate derivatives of polynomials of degree at most ${d}$:

$\displaystyle I(f') = 0 \hbox{ for all } f \in Poly_{\leq d}({\bf R} \rightarrow {\bf C}). \ \ \ \ \ (5)$

Clearly, every polynomial of degree at most ${d-1}$ is thus annihilated by ${I}$, which makes ${I}$ a scalar multiple of the functional that extracts the top coefficient ${a_d}$ of a polynomial, thus if one sets a normalisation

$\displaystyle I( x \mapsto x^d ) = c$

for some constant ${c}$, then one has

$\displaystyle I( x \mapsto a_0 + a_1 x + \ldots + a_d x^d ) = c a_d \ \ \ \ \ (6)$

for any polynomial ${x \mapsto a_0 + a_1 x + \ldots + a_d x^d}$. So we see that up to a normalising constant, the operation of extracting the top order coefficient of a polynomial of fixed degree serves as the analogue of integration. In particular, despite the fact that integration is supposed to be the “opposite” of differentiation (as indicated for instance by (5)), we see in this case that integration is basically (${d}$-fold) differentiation; indeed, compare (6) with the identity

$\displaystyle (\frac{d}{dx})^d ( a_0 + a_1 x + \ldots + a_d x^d ) = d! a_d.$

In particular, we see, in contrast to the usual Lebesgue integral, the integration functional (6) can be localised to an arbitrary location: one only needs to know the germ of the polynomial ${x \mapsto a_0 + a_1 x + \ldots + a_d x^d}$ at a single point ${x_0}$ in order to determine the value of the functional (6). This localisation property may initially seem at odds with the translation invariance, but the two can be reconciled thanks to the extremely rigid nature of the class ${Poly_{\leq d}({\bf R} \rightarrow {\bf C})}$, in contrast to the Schwartz class ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$ which admits bump functions and so can generate local phenomena that can only be detected in small regions of the underlying spatial domain, and which therefore forces any translation-invariant integration functional on such function classes to measure the function at every single point in space.

The reversal of the relationship between integration and differentiation is also reflected in the fact that the abstract integration operation on polynomials interacts with the scaling operation ${\delta_\lambda f(x) := f(x/\lambda)}$ in essentially the opposite way from the classical integration operation. Indeed, for classical integration on ${{\bf R}^d}$, one has

$\displaystyle \int_{{\bf R}^d} f(x/\lambda)\ dx = \lambda^d \int f(x)\ dx$

for Schwartz functions ${f \in {\mathcal S}({\bf R}^d \rightarrow {\bf C})}$, and so in this case the integration functional ${I(f) := \int_{{\bf R}^d} f(x)\ dx}$ obeys the scaling law

$\displaystyle I( \delta_\lambda f ) = \lambda^d I(f).$

In contrast, the abstract integration operation defined in (6) obeys the opposite scaling law

$\displaystyle I( \delta_\lambda f ) = \lambda^{-d} I(f). \ \ \ \ \ (7)$

Remark 1 One way to interpret what is going on is to view the integration operation (6) as a renormalised version of integration. A polynomial ${x \mapsto a_0 + a_1 + \ldots + a_d x^d}$ is, in general, not absolutely integrable, and the partial integrals

$\displaystyle \int_0^N a_0 + a_1 + \ldots + a_d x^d\ dx$

diverge as ${N \rightarrow \infty}$. But if one renormalises these integrals by the factor ${\frac{1}{N^{d+1}}}$, then one recovers convergence,

$\displaystyle \lim_{N \rightarrow \infty} \frac{1}{N^{d+1}} \int_0^N a_0 + a_1 + \ldots + a_d x^d\ dx = \frac{1}{d+1} a_d$

thus giving an interpretation of (6) as a renormalised classical integral, with the renormalisation being responsible for the unusual scaling relationship in (7). However, this interpretation is a little artificial, and it seems that it is best to view functionals such as (6) from an abstract algebraic perspective, rather than to try to force an analytic interpretation on them.

$\displaystyle I(f) := \int_{\bf R} f(x)\ dx. \ \ \ \ \ (8)$

As noted earlier, this integration functional has a translation invariance associated to translations along the real line ${{\bf R}}$, as well as a dilation invariance by real dilation parameters ${\lambda>0}$. However, if we refine the class ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$ of functions somewhat, we can obtain a stronger family of invariances, in which we allow complex translations and dilations. More precisely, let ${\mathcal{SE}({\bf C} \rightarrow {\bf C})}$ denote the space of all functions ${f: {\bf C} \rightarrow {\bf C}}$ which are entire (or equivalently, are given by a Taylor series with an infinite radius of convergence around the origin) and also admit rapid decay in a sectorial neighbourhood of the real line, or more precisely there exists an ${\epsilon>0}$ such that for every ${A > 0}$ there exists ${C_A > 0}$ such that one has the bound

$\displaystyle |f(z)| \leq C_A (1+|z|)^{-A}$

whenever ${|\hbox{Im}(z)| \leq A + \epsilon |\hbox{Re}(z)|}$. For want of a better name, we shall call elements of this space Schwartz entire functions. This is clearly a complex vector space. A typical example of a Schwartz entire function are the complex gaussians

$\displaystyle f(z) := e^{-\pi (az^2 + 2bz + c)}$

where ${a,b,c}$ are complex numbers with ${\hbox{Re}(a) > 0}$. From the Cauchy integral formula (and its derivatives) we see that if ${f}$ lies in ${\mathcal{SE}({\bf C} \rightarrow {\bf C})}$, then the restriction of ${f}$ to the real line lies in ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$; conversely, from analytic continuation we see that every function in ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$ has at most one extension in ${\mathcal{SE}({\bf C} \rightarrow {\bf C})}$. Thus one can identify ${\mathcal{SE}({\bf C} \rightarrow {\bf C})}$ with a subspace of ${{\mathcal S}({\bf R} \rightarrow {\bf C})}$, and in particular the integration functional (8) is inherited by ${\mathcal{SE}({\bf C} \rightarrow {\bf C})}$, and by abuse of notation we denote the resulting functional ${I: \mathcal{SE}({\bf C} \rightarrow {\bf C}) \rightarrow {\bf C}}$ as ${I}$ also. Note, in analogy with the situation with polynomials, that this abstract integration functional is somewhat localised; one only needs to evaluate the function ${f}$ on the real line, rather than the entire complex plane, in order to compute ${I(f)}$. This is consistent with the rigid nature of Schwartz entire functions, as one can uniquely recover the entire function from its values on the real line by analytic continuation.

Of course, the functional ${I: \mathcal{SE}({\bf C} \rightarrow {\bf C}) \rightarrow {\bf C}}$ remains translation invariant with respect to real translation:

$\displaystyle I(\tau_h f) = I(f) \hbox{ for all } h \in {\bf R}.$

However, thanks to contour shifting, we now also have translation invariance with respect to complex translation:

$\displaystyle I(\tau_h f) = I(f) \hbox{ for all } h \in {\bf C},$

where of course we continue to define the translation operator ${\tau_h}$ for complex ${h}$ by the usual formula ${\tau_h f(x) := f(x-h)}$. In a similar vein, we also have the scaling law

$\displaystyle I(\delta_\lambda f) = \lambda I(f)$

for any ${f \in \mathcal{SE}({\bf C} \rightarrow {\bf C})}$, if ${\lambda}$ is a complex number sufficiently close to ${1}$ (where “sufficiently close” depends on ${f}$, and more precisely depends on the sectoral aperture parameter ${\epsilon}$ associated to ${f}$); again, one can verify that ${\delta_\lambda f}$ lies in ${\mathcal{SE}({\bf C} \rightarrow {\bf C})}$ for ${\lambda}$ sufficiently close to ${1}$. These invariances (which relocalise the integration functional ${I}$ onto other contours than the real line ${{\bf R}}$) are very useful for computing integrals, and in particular for computing gaussian integrals. For instance, the complex translation invariance tells us (after shifting by ${b/a}$) that

$\displaystyle I( z \mapsto e^{-\pi (az^2 + 2bz + c) } ) = e^{-\pi (c-b^2/a)} I( z \mapsto e^{-\pi a z^2} )$

when ${a,b,c \in {\bf C}}$ with ${\hbox{Re}(a) > 0}$, and then an application of the complex scaling law (and a continuity argument, observing that there is a compact path connecting ${a}$ to ${1}$ in the right half plane) gives

$\displaystyle I( z \mapsto e^{-\pi (az^2 + 2bz + c) } ) = a^{-1/2} e^{-\pi (c-b^2/a)} I( z \mapsto e^{-\pi z^2} )$

using the branch of ${a^{-1/2}}$ on the right half-plane for which ${1^{-1/2} = 1}$. Using the normalisation (4) we thus have

$\displaystyle I( z \mapsto e^{-\pi (az^2 + 2bz + c) } ) = a^{-1/2} e^{-\pi (c-b^2/a)}$

giving the usual gaussian integral formula

$\displaystyle \int_{\bf R} e^{-\pi (ax^2 + 2bx + c)}\ dx = a^{-1/2} e^{-\pi (c-b^2/a)}. \ \ \ \ \ (9)$

This is a basic illustration of the power that a large symmetry group (in this case, the complex homothety group) can bring to bear on the task of computing integrals.

One can extend this sort of analysis to higher dimensions. For any natural number ${n \geq 1}$, let ${\mathcal{SE}({\bf C}^n \rightarrow {\bf C})}$ denote the space of all functions ${f: {\bf C}^n \rightarrow {\bf C}}$ which is jointly entire in the sense that ${f(z_1,\ldots,z_n)}$ can be expressed as a Taylor series in ${z_1,\ldots,z_n}$ which is absolutely convergent for all choices of ${z_1,\ldots,z_n}$, and such that there exists an ${\epsilon > 0}$ such that for any ${A>0}$ there is ${C_A>0}$ for which one has the bound

$\displaystyle |f(z)| \leq C_A (1+|z|)^{-A}$

whenever ${|\hbox{Im}(z_j)| \leq A + \epsilon |\hbox{Re}(z_j)|}$ for all ${1 \leq j \leq n}$, where ${z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}}$ and ${|z| := (|z_1|^2+\ldots+|z_n|^2)^{1/2}}$. Again, we call such functions Schwartz entire functions; a typical example is the function

$\displaystyle f(z) := e^{-\pi (z^T A z + 2b^T z + c)}$

where ${A}$ is an ${n \times n}$ complex symmetric matrix with positive definite real part, ${b}$ is a vector in ${{\bf C}^n}$, and ${c}$ is a complex number. We can then define an abstract integration functional ${I: \mathcal{SE}({\bf C}^n \rightarrow {\bf C}) \rightarrow {\bf C}}$ by integration on the real slice ${{\bf R}^n}$:

$\displaystyle I(f) := \int_{{\bf R}^n} f(x)\ dx$

where ${dx}$ is the usual Lebesgue measure on ${{\bf R}^n}$. By contour shifting in each of the ${n}$ variables ${z_1,\ldots,z_n}$ separately, we see that ${I}$ is invariant with respect to complex translations of each of the ${z_j}$ variables, and is thus invariant under translating the joint variable ${z}$ by ${{\bf C}^n}$. One can also verify the scaling law

$\displaystyle I(\delta_A f) = \hbox{det}(A) I(f)$

for ${n \times n}$ complex matrices ${A}$ sufficiently close to the origin, where ${\delta_A f(z) := f(A^{-1} z)}$. This can be seen for shear transformations ${A}$ by Fubini’s theorem and the aforementioned translation invariance, while for diagonal transformations near the origin this can be seen from ${n}$ applications of one-dimensional scaling law, and the general case then follows by composition. Among other things, these laws then easily lead to the higher-dimensional generalisation

$\displaystyle \int_{{\bf R}^n} e^{-\pi (x^T A x + 2 b^T x + c)}\ dx = \hbox{det}(A)^{-1/2} e^{-\pi (c-b^T A^{-1} b)} \ \ \ \ \ (10)$

whenever ${A}$ is a complex symmetric matrix with positive definite real part, ${b}$ is a vector in ${{\bf C}^n}$, and ${c}$ is a complex number, basically by repeating the one-dimensional argument sketched earlier. Here, we choose the branch of ${\hbox{det}(A)^{-1/2}}$ for all matrices ${A}$ in the indicated class for which ${\hbox{det}(1)^{-1/2} = 1}$.

Now we turn to an integration functional suitable for computing complex gaussian integrals such as

$\displaystyle \int_{{\bf C}^n} e^{-2\pi (z^\dagger A z + b^\dagger z + z^\dagger \tilde b + c)}\ dz d\overline{z}, \ \ \ \ \ (11)$

where ${z}$ is now a complex variable

$\displaystyle z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix},$

${z^\dagger}$ is the adjoint

$\displaystyle z^\dagger := (\overline{z_1},\ldots, \overline{z_n}),$

${A}$ is a complex ${n \times n}$ matrix with positive definite Hermitian part, ${b, \tilde b}$ are column vectors in ${{\bf C}^n}$, ${c}$ is a complex number, and ${dz d\overline{z} = \prod_{j=1}^n 2 d\hbox{Re}(z_j) d\hbox{Im}(z_j)}$ is ${2^n}$ times Lebesgue measure on ${{\bf C}^n}$. (The factors of two here turn out to be a natural normalisation, but they can be ignored on a first reading.) As we shall see later, such integrals are relevant when performing computations on the Gaussian Unitary Ensemble (GUE) in random matrix theory. Note that the integrand here is not complex analytic due to the presence of the complex conjugates. However, this can be dealt with by the trick of replacing the complex conjugate ${\overline{z}}$ by a variable ${z^*}$ which is formally conjugate to ${z}$, but which is allowed to vary independently of ${z}$. More precisely, let ${\mathcal{SA}({\bf C}^n \times {\bf C}^n \rightarrow {\bf C})}$ be the space of all functions ${f: (z,z^*) \mapsto f(z,z^*)}$ of two independent ${n}$-tuples

$\displaystyle z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}, z^* = \begin{pmatrix} z_1^* \\ \vdots \\ z_n^* \end{pmatrix}$

of complex variables, which is jointly entire in all ${2n}$ variables (in the sense defined previously, i.e. there is a joint Taylor series that is absolutely convergent for all independent choices of ${z, z^* \in {\bf C}^n}$), and such that there is an ${\epsilon>0}$ such that for every ${A>0}$ there is ${C_A>0}$ such that one has the bound

$\displaystyle |f(z,z^*)| \leq C_A (1 + |z|)^{-A}$

whenever ${|z^* - \overline{z}| \leq A + \epsilon |z|}$. We will call such functions Schwartz analytic. Note that the integrand in (11) is Schwartz analytic when ${A}$ has positive definite Hermitian part, if we reinterpret ${z^\dagger}$ as the transpose of ${z^*}$ rather than as the adjoint of ${z}$ in order to make the integrand entire in ${z}$ and ${z^*}$. We can then define an abstract integration functional ${I: \mathcal{SA}({\bf C}^n \times {\bf C}^n \rightarrow {\bf C}) \rightarrow {\bf C}}$ by the formula

$\displaystyle I(f) := \int_{{\bf C}^n} f(z,\overline{z})\ dz d\overline{z}, \ \ \ \ \ (12)$

thus ${I}$ can be localised to the slice ${\{ (z,\overline{z}): z \in {\bf C}^n\}}$ of ${{\bf C}^n \times {\bf C}^n}$ (though, as with previous functionals, one can use contour shifting to relocalise ${I}$ to other slices also.) One can also write this integral as

$\displaystyle I(f) = 2^n \int_{{\bf R}^n \times {\bf R}^n} f(x+iy, x-iy)\ dx dy$

and note that the integrand here is a Schwartz entire function on ${{\bf C}^n \times {\bf C}^n}$, thus linking the Schwartz analytic integral with the Schwartz entire integral. Using this connection, one can verify that this functional ${I}$ is invariant with respect to translating ${z}$ and ${z^*}$ by independent shifts in ${{\bf C}^n}$ (thus giving a ${{\bf C}^n \times {\bf C}^n}$ translation symmetry), and one also has the independent dilation symmetry

$\displaystyle I(\delta_{A,B} f) = \hbox{det}(A) \hbox{det}(B) I(f)$

for ${n \times n}$ complex matrices ${A,B}$ that are sufficiently close to the identity, where ${\delta_{A,B} f(z,z^*) := f(A^{-1} z, B^{-1} z^*)}$. Arguing as before, we can then compute (11) as

$\displaystyle \int_{{\bf C}^n} e^{-2\pi (z^\dagger A z + b^\dagger z + z^\dagger \tilde b + c)}\ dz d\overline{z} = \hbox{det}(A)^{-1} e^{-2\pi (c - b^\dagger A^{-1} \tilde b)}. \ \ \ \ \ (13)$

In particular, this gives an integral representation for the determinant-reciprocal ${\hbox{det}(A)^{-1}}$ of a complex ${n \times n}$ matrix with positive definite Hermitian part, in terms of gaussian expressions in which ${A}$ only appears linearly in the exponential:

$\displaystyle \hbox{det}(A)^{-1} = \int_{{\bf C}^n} e^{-2\pi z^\dagger A z}\ dz d\overline{z}.$

This formula is then convenient for computing statistics such as

$\displaystyle \mathop{\bf E} \hbox{det}(W_n-E-i\eta)^{-1}$

for random matrices ${W_n}$ drawn from the Gaussian Unitary Ensemble (GUE), and some choice of spectral parameter ${E+i\eta}$ with ${\eta>0}$; we review this computation later in this post. By the trick of matrix differentiation of the determinant (as reviewed in this recent blog post), one can also use this method to compute matrix-valued statistics such as

$\displaystyle \mathop{\bf E} \hbox{det}(W_n-E-i\eta)^{-1} (W_n-E-i\eta)^{-1}.$

However, if one restricts attention to classical integrals over real or complex (and in particular, commuting or bosonic) variables, it does not seem possible to easily eradicate the negative determinant factors in such calculations, which is unfortunate because many statistics of interest in random matrix theory, such as the expected Stieltjes transform

$\displaystyle \mathop{\bf E} \frac{1}{n} \hbox{tr} (W_n-E-i\eta)^{-1},$

which is the Stieltjes transform of the density of states. However, it turns out (as I learned recently from Peter Sarnak and Tom Spencer) that it is possible to cancel out these negative determinant factors by balancing the bosonic gaussian integrals with an equal number of fermionic gaussian integrals, in which one integrates over a family of anticommuting variables. These fermionic integrals are closer in spirit to the polynomial integral (6) than to Lebesgue type integrals, and in particular obey a scaling law which is inverse to the Lebesgue scaling (in particular, a linear change of fermionic variables ${\zeta \mapsto A \zeta}$ ends up transforming a fermionic integral by ${\hbox{det}(A)}$ rather than ${\hbox{det}(A)^{-1}}$), which conveniently cancels out the reciprocal determinants in the previous calculations. Furthermore, one can combine the bosonic and fermionic integrals into a unified integration concept, known as the Berezin integral (or Grassmann integral), in which one integrates functions of supervectors (vectors with both bosonic and fermionic components), and is of particular importance in the theory of supersymmetry in physics. (The prefix “super” in physics means, roughly speaking, that the object or concept that the prefix is attached to contains both bosonic and fermionic aspects.) When one applies this unified integration concept to gaussians, this can lead to quite compact and efficient calculations (provided that one is willing to work with “super”-analogues of various concepts in classical linear algebra, such as the supertrace or superdeterminant).

Abstract integrals of the flavour of (6) arose in quantum field theory, when physicists sought to formally compute integrals of the form

$\displaystyle \int F( x_1, \ldots, x_n, \xi_1, \ldots, \xi_m )\ dx_1 \ldots dx_n d\xi_1 \ldots d\xi_m \ \ \ \ \ (14)$

where ${x_1,\ldots,x_n}$ are familiar commuting (or bosonic) variables (which, in particular, can often be localised to be scalar variables taking values in ${{\bf R}}$ or ${{\bf C}}$), while ${\xi_1,\ldots,\xi_m}$ were more exotic anticommuting (or fermionic) variables, taking values in some vector space of fermions. (As we shall see shortly, one can formalise these concepts by working in a supercommutative algebra.) The integrand ${F(x_1,\ldots,x_n,\xi_1,\ldots,\xi_m)}$ was a formally analytic function of ${x_1,\ldots,x_n,\xi_1,\ldots,\xi_m}$, in that it could be expanded as a (formal, noncommutative) power series in the variables ${x_1,\ldots,x_n,\xi_1,\ldots,\xi_m}$. For functions ${F(x_1,\ldots,x_n)}$ that depend only on bosonic variables, it is certainly possible for such analytic functions to be in the Schwartz class and thus fall under the scope of the classical integral, as discussed previously. However, functions ${F(\xi_1,\ldots,\xi_m)}$ that depend on fermionic variables ${\xi_1,\ldots,\xi_m}$ behave rather differently. Indeed, a fermonic variable ${\xi}$ must anticommute with itself, so that ${\xi^2 = 0}$. In particular, any power series in ${\xi}$ terminates after the linear term in ${\xi}$, so that a function ${F(\xi)}$ can only be analytic in ${\xi}$ if it is a polynomial of degree at most ${1}$ in ${\xi}$; more generally, an analytic function ${F(\xi_1,\ldots,\xi_m)}$ of ${m}$ fermionic variables ${\xi_1,\ldots,\xi_m}$ must be a polynomial of degree at most ${m}$, and an analytic function ${F(x_1,\ldots,x_n,\xi_1,\ldots,\xi_m)}$ of ${n}$ bosonic and ${m}$ fermionic variables can be Schwartz in the bosonic variables but will be polynomial in the fermonic variables. As such, to interpret the integral (14), one can use classical (Lebesgue) integration (or the variants discussed above for integrating Schwartz entire or Schwartz analytic functions) for the bosonic variables, but must use abstract integrals such as (6) for the fermonic variables, leading to the concept of Berezin integration mentioned earlier.

In this post I would like to set out some of the basic algebraic formalism of Berezin integration, particularly with regards to integration of gaussian-type expressions, and then show how this formalism can be used to perform computations involving GUE (for instance, one can compute the density of states of GUE by this machinery without recourse to the theory of orthogonal polynomials). The use of supersymmetric gaussian integrals to analyse ensembles such as GUE appears in the work of Efetov (and was also proposed in the slightly earlier works of Parisi-Sourlas and McKane, with a related approach also appearing in the work of Wegner); the material here is adapted from this survey of Mirlin, as well as the later papers of Disertori-Pinson-Spencer and of Disertori.

[These are notes intended mostly for myself, as these topics are useful in random matrix theory, but may be of interest to some readers also. -T.]

One of the most fundamental partial differential equations in mathematics is the heat equation

$\displaystyle \partial_t f = L f \ \ \ \ \ (1)$

where ${f: [0,+\infty) \times {\bf R}^n \rightarrow {\bf R}}$ is a scalar function ${(t,x) \mapsto f(t,x)}$ of both time and space, and ${L}$ is the Laplacian ${L := \frac{1}{2} \Delta = \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2}}$. For the purposes of this post, we will ignore all technical issues of regularity and decay, and always assume that the solutions to equations such as (1) have all the regularity and decay in order to justify all formal operations such as the chain rule, integration by parts, or differentiation under the integral sign. The factor of ${\frac{1}{2}}$ in the definition of the heat propagator ${L}$ is of course an arbitrary normalisation, chosen for some minor technical reasons; one can certainly continue the discussion below with other choices of normalisations if desired.

In probability theory, this equation takes on particular significance when ${f}$ is restricted to be non-negative, and furthermore to be a probability measure at each time, in the sense that

$\displaystyle \int_{{\bf R}^n} f(t,x)\ dx = 1$

for all ${t}$. (Actually, it suffices to verify this constraint at time ${t=0}$, as the heat equation (1) will then preserve this constraint.) Indeed, in this case, one can interpret ${f(t,x)\ dx}$ as the probability distribution of a Brownian motion

$\displaystyle dx = dB(t) \ \ \ \ \ (2)$

where ${x = x(t) \in {\bf R}^n}$ is a stochastic process with initial probability distribution ${f(0,x)\ dx}$; see for instance this previous blog post for more discussion.

A model example of a solution to the heat equation to keep in mind is that of the fundamental solution

$\displaystyle G(t,x) = \frac{1}{(2\pi t)^{n/2}} e^{-|x|^2/2t} \ \ \ \ \ (3)$

defined for any ${t>0}$, which represents the distribution of Brownian motion of a particle starting at the origin ${x=0}$ at time ${t=0}$. At time ${t}$, ${G(t,x)}$ represents an ${{\bf R}^n}$-valued random variable, each coefficient of which is an independent random variable of mean zero and variance ${t}$. (As ${t \rightarrow 0^+}$, ${G(t)}$ converges in the sense of distributions to a Dirac mass at the origin.)

The heat equation can also be viewed as the gradient flow for the Dirichlet form

$\displaystyle D(f,g) := \frac{1}{2} \int_{{\bf R}^n} \nabla f \cdot \nabla g\ dx \ \ \ \ \ (4)$

since one has the integration by parts identity

$\displaystyle \int_{{\bf R}^n} Lf(x) g(x)\ dx = \int_{{\bf R}^n} f(x) Lg(x)\ dx = - D(f,g) \ \ \ \ \ (5)$

for all smooth, rapidly decreasing ${f,g}$, which formally implies that ${L f}$ is (half of) the negative gradient of the Dirichlet energy ${D(f,f) = \frac{1}{2} \int_{{\bf R}^n} |\nabla f|^2\ dx}$ with respect to the ${L^2({\bf R}^n,dx)}$ inner product. Among other things, this implies that the Dirichlet energy decreases in time:

$\displaystyle \partial_t D(f,f) = - 2 \int_{{\bf R}^n} |Lf|^2\ dx. \ \ \ \ \ (6)$

For instance, for the fundamental solution (3), one can verify for any time ${t>0}$ that

$\displaystyle D(G,G) = \frac{n}{2^{n+2} \pi^{n/2}} \frac{1}{t^{(n+2)/2}} \ \ \ \ \ (7)$

(assuming I have not made a mistake in the calculation). In a similar spirit we have

$\displaystyle \partial_t \int_{{\bf R}^n} |f|^2\ dx = - 2 D(f,f). \ \ \ \ \ (8)$

Since ${D(f,f)}$ is non-negative, the formula (6) implies that ${\int_{{\bf R}^n} |Lf|^2\ dx}$ is integrable in time, and in particular we see that ${Lf}$ converges to zero as ${t \rightarrow \infty}$, in some averaged ${L^2}$ sense at least; similarly, (8) suggests that ${D(f,f)}$ also converges to zero. This suggests that ${f}$ converges to a constant function; but as ${f}$ is also supposed to decay to zero at spatial infinity, we thus expect solutions to the heat equation in ${{\bf R}^n}$ to decay to zero in some sense as ${t \rightarrow \infty}$. However, the decay is only expected to be polynomial in nature rather than exponential; for instance, the solution (3) decays in the ${L^\infty}$ norm like ${O(t^{-n/2})}$.

Since ${L1=0}$, we also observe the basic cancellation property

$\displaystyle \int_{{\bf R}^n} Lf(x) \ dx = 0 \ \ \ \ \ (9)$

for any function ${f}$.

There are other quantities relating to ${f}$ that also decrease in time under heat flow, particularly in the important case when ${f}$ is a probability measure. In this case, it is natural to introduce the entropy

$\displaystyle S(f) := \int_{{\bf R}^n} f(x) \log f(x)\ dx.$

Thus, for instance, if ${f(x)\ dx}$ is the uniform distribution on some measurable subset ${E}$ of ${{\bf R}^n}$ of finite measure ${|E|}$, the entropy would be ${-\log |E|}$. Intuitively, as the entropy decreases, the probability distribution gets wider and flatter. For instance, in the case of the fundamental solution (3), one has ${S(G) = -\frac{n}{2} \log( 2 \pi e t )}$ for any ${t>0}$, reflecting the fact that ${G(t)}$ is approximately uniformly distributed on a ball of radius ${O(\sqrt{t})}$ (and thus of measure ${O(t^{n/2})}$).

A short formal computation shows (if one assumes for simplicity that ${f}$ is strictly positive, which is not an unreasonable hypothesis, particularly in view of the strong maximum principle) using (9), (5) that

$\displaystyle \partial_t S(f) = \int_{{\bf R}^n} (Lf) \log f + f \frac{Lf}{f}\ dx$

$\displaystyle = \int_{{\bf R}^n} (Lf) \log f\ dx$

$\displaystyle = - D( f, \log f )$

$\displaystyle = - \frac{1}{2} \int_{{\bf R}^n} \frac{|\nabla f|^2}{f}\ dx$

$\displaystyle = - 4D( g, g )$

where ${g := \sqrt{f}}$ is the square root of ${f}$. For instance, if ${f}$ is the fundamental solution (3), one can check that ${D(g,g) = \frac{n}{8t}}$ (note that this is a significantly cleaner formula than (7)!).

In particular, the entropy is decreasing, which corresponds well to one’s intuition that the heat equation (or Brownian motion) should serve to spread out a probability distribution over time.

Actually, one can say more: the rate of decrease ${4D(g,g)}$ of the entropy is itself decreasing, or in other words the entropy is convex. I do not have a satisfactorily intuitive reason for this phenomenon, but it can be proved by straightforward application of basic several variable calculus tools (such as the chain rule, product rule, quotient rule, and integration by parts), and completing the square. Namely, by using the chain rule we have

$\displaystyle L \phi(f) = \phi'(f) Lf + \frac{1}{2} \phi''(f) |\nabla f|^2, \ \ \ \ \ (10)$

valid for for any smooth function ${\phi: {\bf R} \rightarrow {\bf R}}$, we see from (1) that

$\displaystyle 2 g \partial_t g = 2 g L g + |\nabla g|^2$

and thus (again assuming that ${f}$, and hence ${g}$, is strictly positive to avoid technicalities)

$\displaystyle \partial_t g = Lg + \frac{|\nabla g|^2}{2g}.$

We thus have

$\displaystyle \partial_t D(g,g) = 2 D(g,Lg) + D(g, \frac{|\nabla g|^2}{g} ).$

It is now convenient to compute using the Einstein summation convention to hide the summation over indices ${i,j = 1,\ldots,n}$. We have

$\displaystyle 2 D(g,Lg) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) (\partial_i \partial_j \partial_j g)\ dx$

and

$\displaystyle D(g, \frac{|\nabla g|^2}{g} ) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) \partial_i \frac{\partial_j g \partial_j g}{g}\ dx.$

By integration by parts and interchanging partial derivatives, we may write the first integral as

$\displaystyle 2 D(g,Lg) = - \frac{1}{2} \int_{{\bf R}^n} (\partial_i \partial_j g) (\partial_i \partial_j g)\ dx,$

and from the quotient and product rules, we may write the second integral as

$\displaystyle D(g, \frac{|\nabla g|^2}{g} ) = \int_{{\bf R}^n} \frac{(\partial_i g) (\partial_j g) (\partial_i \partial_j g)}{g} - \frac{(\partial_i g) (\partial_j g) (\partial_i g) (\partial_j g)}{2g^2}\ dx.$

Gathering terms, completing the square, and making the summations explicit again, we see that

$\displaystyle \partial_t D(g,g) =- \frac{1}{2} \int_{{\bf R}^n} \frac{\sum_{i=1}^n \sum_{j=1}^n |g \partial_i \partial_j g - (\partial_i g) (\partial_j g)|^2}{g^2}\ dx$

and so in particular ${D(g,g)}$ is always decreasing.

The above identity can also be written as

$\displaystyle \partial_t D(g,g) = - \frac{1}{2} \int_{{\bf R}^n} |\nabla^2 \log g|^2 g^2\ dx.$

Exercise 1 Give an alternate proof of the above identity by writing ${f = e^{2u}}$, ${g = e^u}$ and deriving the equation ${\partial_t u = Lu + |\nabla u|^2}$ for ${u}$.

It was observed in a well known paper of Bakry and Emery that the above monotonicity properties hold for a much larger class of heat flow-type equations, and lead to a number of important relations between energy and entropy, such as the log-Sobolev inequality of Gross and of Federbush, and the hypercontractivity inequality of Nelson; we will discuss one such family of generalisations (or more precisely, variants) below the fold.

Let ${n}$ be a large natural number, and let ${M_n}$ be a matrix drawn from the Gaussian Unitary Ensemble (GUE), by which we mean that ${M_n}$ is a Hermitian matrix whose upper triangular entries are iid complex gaussians with mean zero and variance one, and whose diagonal entries are iid real gaussians with mean zero and variance one (and independent of the upper triangular entries). The eigenvalues ${\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)}$ are then real and almost surely distinct, and can be viewed as a random point process ${\Sigma^{(n)} := \{\lambda_1(M_n),\ldots,\lambda_n(M_n)\}}$ on the real line. One can then form the ${k}$-point correlation functions ${\rho_k^{(n)}: {\bf R}^k \rightarrow {\bf R}^+}$ for every ${k \geq 0}$, which can be defined by duality by requiring

$\displaystyle \mathop{\bf E} \sum_{i_1,\ldots,i_k \hbox{ distinct}} F( \lambda_{i_1}(M_n),\ldots,\lambda_{i_k}(M_n))$

$\displaystyle = \int_{{\bf R}^k} \rho_k^{(n)}(x_1,\ldots,x_k) F(x_1,\ldots,x_k)\ dx_1 \ldots dx_k$

for any test function ${F: {\bf R}^k \rightarrow {\bf R}^+}$. For GUE, which is a continuous matrix ensemble, one can also define ${\rho_k^{(n)}(x_1,\ldots,x_k)}$ for distinct ${x_1<\ldots as the unique quantity such that the probability that there is an eigenvalue in each of the intervals ${[x_1,x_1+\epsilon],\ldots,[x_k,x_k+\epsilon]}$ is ${(\rho_k^{(n)}(x_1,\ldots,x_k)+o(1))\epsilon^k}$ in the limit ${\epsilon\rightarrow 0}$.

As is well known, the GUE process is a determinantal point process, which means that ${k}$-point correlation functions can be explicitly computed as

$\displaystyle \rho^{(n)}_k(x_1,\ldots,x_k) = \det( K^{(n)}(x_i,x_j) )_{1 \leq i,j \leq k}$

for some kernel ${K^{(n)}: {\bf R} \times {\bf R} \rightarrow {\bf C}}$; explicitly, one has

$\displaystyle K^{(n)}(x,y) := \sum_{k=0}^{n-1} P_k(x) e^{-x^2/4}P_k(y) e^{-y^2/4}$

where ${P_0, P_1,\ldots}$ are the (normalised) Hermite polynomials; see this previous blog post for details.

Using the asymptotics of Hermite polynomials (which then give asymptotics for the kernel ${K^{(n)}}$), one can take a limit of a (suitably rescaled) sequence of GUE processes to obtain the Dyson sine process, which is a determinantal point process ${\Sigma}$ on the real line with correlation functions

$\displaystyle \rho_k(x_1,\ldots,x_k) = \det( K(x_i,x_j) )_{1 \leq i,j \leq k} \ \ \ \ \ (1)$

where ${K}$ is the Dyson sine kernel

$\displaystyle K(x,y) := \frac{\sin(\pi(x-y))}{\pi(x-y)}. \ \ \ \ \ (2)$

A bit more precisely, for any fixed bulk energy ${-2 < u < 2}$, the renormalised point processes ${\rho_{sc}(u) \sqrt{n} ( \Sigma^{(n)} - \sqrt{n} u )}$ converge in distribution in the vague topology to ${\Sigma}$ as ${n \rightarrow \infty}$, where ${\rho_{sc}(u) := \frac{1}{2\pi} (4-u^2)^{1/2}_+}$ is the semi-circular law density.

On the other hand, an important feature of the GUE process ${\Sigma^{(n)} = \{\lambda_1,\ldots,\lambda_n\}}$ is its stationarity (modulo rescaling) under Dyson Brownian motion

$\displaystyle d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i-\lambda_j}$

which describes the stochastic evolution of eigenvalues of a Hermitian matrix under independent Brownian motion of its entries, and is discussed in this previous blog post. To cut a long story short, this stationarity tells us that the self-similar ${n}$-point correlation function

$\displaystyle \rho^{(n)}_n(t,x) := t^{-n/2} \rho^{(n)}_n(x/\sqrt{t})$

obeys the Dyson heat equation

$\displaystyle \partial_t \rho^{(n)}_n = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \leq i,j \leq n;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i-x_j}$

(see Exercise 11 of the previously mentioned blog post). Note that ${\rho^{(n)}_n}$ vanishes to second order whenever two of the ${x_i}$ coincide, so there is no singularity on the right-hand side. Setting ${t=1}$ and using self-similarity, we can rewrite this equation in time-independent form as

$\displaystyle -\frac{1}{2} \sum_{i=1}^n \partial_i (x_i \rho^{(n)}_n) = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \leq i,j \leq n;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i-x_j}.$

One can then integrate out all but ${k}$ of these variables (after carefully justifying convergence) to obtain a system of equations for the ${k}$-point correlation functions ${\rho^{(n)}_k}$:

$\displaystyle -\frac{1}{2} \sum_{i=1}^k \partial_i (x_i \rho^{(n)}_k) = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho^{(n)}_k - \sum_{1 \leq i,j \leq k;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_k}{x_i-x_j}$

$\displaystyle - \sum_{i=1}^k \partial_{x_i} \int_{\bf R} \frac{\rho^{(n)}_{k+1}(x_1,\ldots,x_{k+1})}{x_i-x_{k+1}}\ dx_{k+1},$

where the integral is interpreted in the principal value case. This system is an example of a BBGKY hierarchy.

If one carefully rescales and takes limits (say at the energy level ${u=0}$, for simplicity), the left-hand side turns out to rescale to be a lower order term, and one ends up with a hierarchy for the Dyson sine process:

$\displaystyle 0 = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho_k - \sum_{1 \leq i,j \leq k;i\neq j} \partial_{x_i} \frac{\rho_k}{x_i-x_j} \ \ \ \ \ (3)$

$\displaystyle - \sum_{i=1}^k \partial_{x_i} \int_{\bf R} \frac{\rho_{k+1}(x_1,\ldots,x_{k+1})}{x_i-x_{k+1}}\ dx_{k+1}.$

Informally, these equations show that the Dyson sine process ${\Sigma = \{ \lambda_i: i \in {\bf Z} \}}$ is stationary with respect to the infinite Dyson Brownian motion

$\displaystyle d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i-\lambda_j}$

where ${dB_i}$ are independent Brownian increments, and the sum is interpreted in a suitable principal value sense.

I recently set myself the exercise of deriving the identity (3) directly from the definition (1) of the Dyson sine process, without reference to GUE. This turns out to not be too difficult when done the right way (namely, by modifying the proof of Gaudin’s lemma), although it did take me an entire day of work before I realised this, and I could not find it in the literature (though I suspect that many people in the field have privately performed this exercise in the past). In any case, I am recording the computation here, largely because I really don’t want to have to do it again, but perhaps it will also be of interest to some readers.

One of the basic general problems in analytic number theory is to understand as much as possible the fluctuations of the Möbius function ${\mu(n)}$, defined as ${(-1)^k}$ when ${n}$ is the product of ${k}$ distinct primes, and zero otherwise. For instance, as ${\mu}$ takes values in ${\{-1,0,1\}}$, we have the trivial bound

$\displaystyle |\sum_{n \leq x} \mu(n)| \leq x$

and the seemingly slight improvement

$\displaystyle \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)$

is already equivalent to the prime number theorem, as observed by Landau (see e.g. this previous blog post for a proof), while the much stronger (and still open) improvement

$\displaystyle \sum_{n \leq x} \mu(n) = O(x^{1/2+o(1)})$

is equivalent to the notorious Riemann hypothesis.

There is a general Möbius pseudorandomness heuristic that suggests that the sign pattern ${\mu}$ behaves so randomly (or pseudorandomly) that one should expect a substantial amount of cancellation in sums that involve the sign fluctuation of the Möbius function in a nontrivial fashion, with the amount of cancellation present comparable to the amount that an analogous random sum would provide; cf. the probabilistic heuristic discussed in this recent blog post. There are a number of ways to make this heuristic precise. As already mentioned, the Riemann hypothesis can be considered one such manifestation of the heuristic. Another manifestation is the following old conjecture of Chowla:

Conjecture 1 (Chowla’s conjecture) For any fixed integer ${m}$ and exponents ${a_1,a_2,\ldots,a_m \geq 0}$, with at least one of the ${a_i}$ odd (so as not to completely destroy the sign cancellation), we have

$\displaystyle \sum_{n \leq x} \mu(n+1)^{a_1} \ldots \mu(n+m)^{a_m} = o_{x \rightarrow \infty;m}(x).$

Note that as ${\mu^a = \mu^{a+2}}$ for any ${a \geq 1}$, we can reduce to the case when the ${a_i}$ take values in ${0,1,2}$ here. When only one of the ${a_i}$ are odd, this is essentially the prime number theorem in arithmetic progressions (after some elementary sieving), but with two or more of the ${a_i}$ are odd, the problem becomes completely open. For instance, the estimate

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+2) = o(x)$

is morally very close to the conjectured asymptotic

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) = 2\Pi_2 x + o(x)$

for the von Mangoldt function ${\Lambda}$, where ${\Pi_2 := \prod_{p > 2} (1 - \frac{1}{(p-1)^2}) = 0.66016\ldots}$ is the twin prime constant; this asymptotic in turn implies the twin prime conjecture. (To formally deduce estimates for von Mangoldt from estimates for Möbius, though, typically requires some better control on the error terms than ${o()}$, in particular gains of some power of ${\log x}$ are usually needed. See this previous blog post for more discussion.)

Remark 1 The Chowla conjecture resembles an assertion that, for ${n}$ chosen randomly and uniformly from ${1}$ to ${x}$, the random variables ${\mu(n+1),\ldots,\mu(n+k)}$ become asymptotically independent of each other (in the probabilistic sense) as ${x \rightarrow \infty}$. However, this is not quite accurate, because some moments (namely those with all exponents ${a_i}$ even) have the “wrong” asymptotic value, leading to some unwanted correlation between the two variables. For instance, the events ${\mu(n)=0}$ and ${\mu(n+4)=0}$ have a strong correlation with each other, basically because they are both strongly correlated with the event of ${n}$ being divisible by ${4}$. A more accurate interpretation of the Chowla conjecture is that the random variables ${\mu(n+1),\ldots,\mu(n+k)}$ are asymptotically conditionally independent of each other, after conditioning on the zero pattern ${\mu(n+1)^2,\ldots,\mu(n+k)^2}$; thus, it is the sign of the Möbius function that fluctuates like random noise, rather than the zero pattern. (The situation is a bit cleaner if one works instead with the Liouville function ${\lambda}$ instead of the Möbius function ${\mu}$, as this function never vanishes, but we will stick to the traditional Möbius function formalism here.)

A more recent formulation of the Möbius randomness heuristic is the following conjecture of Sarnak. Given a bounded sequence ${f: {\bf N} \rightarrow {\bf C}}$, define the topological entropy of the sequence to be the least exponent ${\sigma}$ with the property that for any fixed ${\epsilon > 0}$, and for ${m}$ going to infinity the set ${\{ (f(n+1),\ldots,f(n+m)): n \in {\bf N} \} \subset {\bf C}^m}$ of ${f}$ can be covered by ${O( \exp( \sigma m + o(m) ) )}$ balls of radius ${\epsilon}$. (If ${f}$ arises from a minimal topological dynamical system ${(X,T)}$ by ${f(n) := f(T^n x)}$, the above notion is equivalent to the usual notion of the topological entropy of a dynamical system.) For instance, if the sequence is a bit sequence (i.e. it takes values in ${\{0,1\}}$), then there are only ${\exp(\sigma m + o(m))}$ ${m}$-bit patterns that can appear as blocks of ${m}$ consecutive bits in this sequence. As a special case, a Turing machine with bounded memory that had access to a random number generator at the rate of one random bit produced every ${T}$ units of time, but otherwise evolved deterministically, would have an output sequence that had a topological entropy of at most ${\frac{1}{T} \log 2}$. A bounded sequence is said to be deterministic if its topological entropy is zero. A typical example is a polynomial sequence such as ${f(n) := e^{2\pi i \alpha n^2}}$ for some fixed ${\sigma}$; the ${m}$-blocks of such polynomials sequence have covering numbers that only grow polynomially in ${m}$, rather than exponentially, thus yielding the zero entropy. Unipotent flows, such as the horocycle flow on a compact hyperbolic surface, are another good source of deterministic sequences.

Conjecture 2 (Sarnak’s conjecture) Let ${f: {\bf N} \rightarrow {\bf C}}$ be a deterministic bounded sequence. Then

$\displaystyle \sum_{n \leq x} \mu(n) f(n) = o_{x \rightarrow \infty;f}(x).$

This conjecture in general is still quite far from being solved. However, special cases are known:

• For constant sequences, this is essentially the prime number theorem (1).
• For periodic sequences, this is essentially the prime number theorem in arithmetic progressions.
• For quasiperiodic sequences such as ${f(n) = F(\alpha n \hbox{ mod } 1)}$ for some continuous ${F}$, this follows from the work of Davenport.
• For nilsequences, this is a result of Ben Green and myself.
• For horocycle flows, this is a result of Bourgain, Sarnak, and Ziegler.
• For the Thue-Morse sequence, this is a result of Dartyge-Tenenbaum (with a stronger error term obtained by Maduit-Rivat). A subsequent result of Bourgain handles all bounded rank one sequences (though the Thue-Morse sequence is actually of rank two), and a related result of Green establishes asymptotic orthogonality of the Möbius function to bounded depth circuits, although such functions are not necessarily deterministic in nature.
• For the Rudin-Shapiro sequence, I sketched out an argument at this MathOverflow post.
• The Möbius function is known to itself be non-deterministic, because its square ${\mu^2(n)}$ (i.e. the indicator of the square-free functions) is known to be non-deterministic (indeed, its topological entropy is ${\frac{6}{\pi^2}\log 2}$). (The corresponding question for the Liouville function ${\lambda(n)}$, however, remains open, as the square ${\lambda^2(n)=1}$ has zero entropy.)
• In the converse direction, it is easy to construct sequences of arbitrarily small positive entropy that correlate with the Möbius function (a rather silly example is ${\mu(n) 1_{k|n}}$ for some fixed large (squarefree) ${k}$, which has topological entropy at most ${\log 2/k}$ but clearly correlates with ${\mu}$).

See this survey of Sarnak for further discussion of this and related topics.

In this post I wanted to give a very nice argument of Sarnak that links the above two conjectures:

Proposition 3 The Chowla conjecture implies the Sarnak conjecture.

The argument does not use any number-theoretic properties of the Möbius function; one could replace ${\mu}$ in both conjectures by any other function from the natural numbers to ${\{-1,0,+1\}}$ and obtain the same implication. The argument consists of the following ingredients:

1. To show that ${\sum_{n, it suffices to show that the expectation of the random variable ${\frac{1}{m} (\mu(n+1)f(n+1)+\ldots+\mu(n+m)f(n+m))}$, where ${n}$ is drawn uniformly at random from ${1}$ to ${x}$, can be made arbitrary small by making ${m}$ large (and ${n}$ even larger).
2. By the union bound and the zero topological entropy of ${f}$, it suffices to show that for any bounded deterministic coefficients ${c_1,\ldots,c_m}$, the random variable ${\frac{1}{m}(c_1 \mu(n+1) + \ldots + c_m \mu(n+m))}$ concentrates with exponentially high probability.
3. Finally, this exponentially high concentration can be achieved by the moment method, using a slight variant of the moment method proof of the large deviation estimates such as the Chernoff inequality or Hoeffding inequality (as discussed in this blog post).

As is often the case, though, while the “top-down” order of steps presented above is perhaps the clearest way to think conceptually about the argument, in order to present the argument formally it is more convenient to present the arguments in the reverse (or “bottom-up”) order. This is the approach taken below the fold.

There has been a lot of recent interest in the abc conjecture, since the release a few weeks ago of the last of a series of papers by Shinichi Mochizuki which, as one of its major applications, claims to establish this conjecture. It’s still far too early to judge whether this proof is likely to be correct or not (the entire argument encompasses several hundred pages of argument, mostly in the area of anabelian geometry, which very few mathematicians are expert in, to the extent that we still do not even have a full outline of the proof strategy yet), and I don’t have anything substantial to add to the existing discussion around that conjecture. (But, for those that are interested, the Polymath wiki page on the ABC conjecture has collected most of the links to that discussion, and to various background materials.)

In the meantime, though, I thought I might give the standard probabilistic heuristic argument that explains why we expect the ABC conjecture to be true. The underlying heuristic is a common one, used throughout number theory, and it can be summarised as follows:

Heuristic 1 (Probabilistic heuristic) Even though number theory is a deterministic subject (one does not need to roll any dice to factorise a number, or figure out if a number is prime), one expects to get a good asymptotic prediction for the answers to many number-theoretic questions by pretending that various number-theoretic assertions ${E}$ (e.g. that a given number ${n}$ is prime) are probabilistic events (with a probability ${\mathop{\bf P}(E)}$ that can vary between ${0}$ and ${1}$) rather than deterministic events (that are either always true or always false). Furthermore:

• (Basic heuristic) If two or more of these heuristically probabilistic events have no obvious reason to be strongly correlated to each other, then we should expect them to behave as if they were (jointly) independent.
• (Advanced heuristic) If two or more of these heuristically probabilistic events have some obvious correlation between them, but no further correlations are suspected, then we should expect them to behave as if they were conditionally independent, relative to whatever data is causing the correlation.

This is, of course, an extremely vague and completely non-rigorous heuristic, requiring (among other things) a subjective and ad hoc determination of what an “obvious reason” is, but in practice it tends to give remarkably plausible predictions, some fraction of which can in fact be backed up by rigorous argument (although in many cases, the actual argument has almost nothing in common with the probabilistic heuristic). A famous special case of this heuristic is the Cramér random model for the primes, but this is not the only such instance for that heuristic.

To give the most precise predictions, one should use the advanced heuristic in Heuristic 1, but this can be somewhat complicated to execute, and so we shall focus instead on the predictions given by the basic heuristic (thus ignoring the presence of some number-theoretic correlations), which tends to give predictions that are quantitatively inaccurate but still reasonably good at the qualitative level.

Here is a basic “corollary” of Heuristic 1:

Heuristic 2 (Heuristic Borel-Cantelli) Suppose one has a sequence ${E_1, E_2, \ldots}$ of number-theoretic statements, which we heuristically interpet as probabilistic events with probabilities ${\mathop{\bf P}(E_1), \mathop{\bf P}(E_2), \ldots}$. Suppose also that we know of no obvious reason for these events to have much of a correlation with each other. Then:

• If ${\sum_{i=1}^\infty \mathop{\bf P}(E_i) < \infty}$, we expect only finitely many of the statements ${E_n}$ to be true. (And if ${\sum_{i=1}^\infty \mathop{\bf P}(E_i)}$ is much smaller than ${1}$, we in fact expect none of the ${E_n}$ to be true.)
• If ${\sum_{i=1}^\infty \mathop{\bf P}(E_i) = \infty}$, we expect infinitely many of the statements ${E_n}$ to be true.

This heuristic is motivated both by the Borel-Cantelli lemma, and by the standard probabilistic computation that if one is given jointly independent, and genuinely probabilistic, events ${E_1, E_2, \ldots}$ with ${\sum_{i=1}^\infty \mathop{\bf P}(E_i) = \infty}$, then one almost surely has an infinite number of the ${E_i}$ occuring.

Before we get to the ABC conjecture, let us give two simpler (and well known) demonstrations of these heuristics in action:

Example 1 (Twin prime conjecture) One can heuristically justify the twin prime conjecture as follows. Using the prime number theorem, one can heuristically assign a probability of ${1/\log n}$ to the event that any given large integer ${n}$ is prime. In particular, the probability that ${n+2}$ is prime will then be ${1/\log(n+2)}$. Making the assumption that there are no strong correlations between these events, we are led to the prediction that the probability that ${n}$ and ${n+2}$ are simultaneously prime is ${\frac{1}{(\log n)(\log n+2)}}$. Since ${\sum_{n=1}^\infty \frac{1}{(\log n) (\log n+2)} = \infty}$, the Borel-Cantelli heuristic then predicts that there should be infinitely many twin primes.

Note that the above argument is a bit too naive, because there are some non-trivial correlations between the primality of ${n}$ and the primality of ${n+2}$. Most obviously, if ${n}$ is prime, this greatly increases the probability that ${n}$ is odd, which implies that ${n+2}$ is odd, which then elevates the probability that ${n+2}$ is prime. A bit more subtly, if ${n}$ is prime, then ${n}$ is likely to avoid the residue class ${0 \hbox{ mod } 3}$, which means that ${n+2}$ avoids the residue class ${2 \hbox{ mod } 3}$, which ends up decreasing the probability that ${n+2}$ is prime. However, there is a standard way to correct for these local correlations; see for instance in this previous blog post. As it turns out, these local correlations ultimately alter the prediction for the asymptotic density of twin primes by a constant factor (the twin prime constant), but do not affect the qualitative prediction of there being infinitely many twin primes.

Example 2 (Fermat’s last theorem) Let us now heuristically count the number of solutions to ${x^n+y^n=z^n}$ for various ${n}$ and natural numbers ${x,y,z}$ (which we can reduce to be coprime if desired). We recast this (in the spirit of the ABC conjecture) as ${a+b=c}$, where ${a,b,c}$ are ${n^{th}}$ powers. The number of ${n^{th}}$ powers up to any given number ${N}$ is about ${N^{1/n}}$, so heuristically any given natural number ${a}$ has a probability about ${a^{1/n - 1}}$ of being an ${n^{th}}$ power. If we make the naive assumption that (in the coprime case at least) there is no strong correlation between the events that ${a}$ is an ${n^{th}}$ power, ${b}$ is an ${n^{th}}$ power, and ${a+b}$ being an ${n^{th}}$ power, then for typical ${a,b}$, the probability that ${a,b,a+b}$ are all simultaneously ${n^{th}}$ powers would then be ${a^{1/n-1} b^{1/n-1} (a+b)^{1/n-1}}$. For fixed ${n}$, the total number of solutions to the Fermat equation would then be predicted to be

$\displaystyle \sum_{a=1}^\infty \sum_{b=1}^\infty a^{1/n-1} b^{1/n-1} (a+b)^{1/n-1}.$

(Strictly speaking, we need to restrict to the coprime case, but given that a positive density of pairs of integers are coprime, it should not affect the qualitative conclusion significantly if we now omit this restriction.) It might not be immediately obvious as to whether this sum converges or diverges, but (as is often the case with these sorts of unsigned sums) one can clarify the situation by dyadic decomposition. Suppose for instance that we consider the portion of the sum where ${c=a+b}$ lies between ${2^k}$ and ${2^{k+1}}$. Then this portion of the sum can be controlled by

$\displaystyle \sum_{a \leq 2^{k+1}} \sum_{b \leq 2^{k+1}} a^{1/n-1} b^{1/n-1} O( ( 2^k )^{1/n - 1} )$

which simplifies to

$\displaystyle O( 2^{(3/n - 1)k} ).$

Summing in ${k}$, one thus expects infinitely many solutions for ${n=2}$, only finitely many solutions for ${n>3}$ (indeed, a refinement of this argument shows that one expects only finitely many solutions even if one considers all ${n>3}$ at once), and a borderline prediction of there being a barely infinite number of solutions when ${n=3}$. Here is of course a place where a naive application of the probabilistic heuristic breaks down; there is enough arithmetic structure in the equation ${x^3+y^3=z^3}$ that the naive probabilistic prediction ends up being an inaccurate model. Indeed, while this heuristic suggests that a typical homogeneous cubic should have a logarithmic number of integer solutions of a given height ${N}$, it turns out that some homogeneous cubics (namely, those associated to elliptic curves of positive rank) end up with the bulk of these solutions, while other homogeneous cubics (including those associated to elliptic curves of zero rank, including the Fermat curve ${x^3+y^3=z^3}$) only get finitely many solutions. The reasons for this are subtle, but certainly the high degree of arithmetic structure present in an elliptic curve (starting with the elliptic curve group law which allows one to generate new solutions from old ones, and which also can be used to exclude solutions to ${x^3+y^3=z^3}$ via the method of descent) is a major contributing factor.

Below the fold, we apply similar heuristics to suggest the truth of the ABC conjecture.

Van Vu and I have just uploaded to the arXiv our paper “Random matrices: Universality of local spectral statistics of non-Hermitian matrices“. The main result of this paper is a “Four Moment Theorem” that establishes universality for local spectral statistics of non-Hermitian matrices with independent entries, under the additional hypotheses that the entries of the matrix decay exponentially, and match moments with either the real or complex gaussian ensemble to fourth order. This is the non-Hermitian analogue of a long string of recent results establishing universality of local statistics in the Hermitian case (as discussed for instance in this recent survey of Van and myself, and also in several other places).

The complex case is somewhat easier to describe. Given a (non-Hermitian) random matrix ensemble ${M_n}$ of ${n \times n}$ matrices, one can arbitrarily enumerate the (geometric) eigenvalues as ${\lambda_1(M_n),\ldots,\lambda_n(M_n) \in {\bf C}}$, and one can then define the ${k}$-point correlation functions ${\rho^{(k)}_n: {\bf C}^k \rightarrow {\bf R}^+}$ to be the symmetric functions such that

$\displaystyle \int_{{\bf C}^k} F(z_1,\ldots,z_k) \rho^{(k)}_n(z_1,\ldots,z_k)\ dz_1 \ldots dz_k$

$\displaystyle = {\bf E} \sum_{1 \leq i_1 < \ldots < i_k \leq n} F(\lambda_1(M_n),\ldots,\lambda_k(M_n)).$

In the case when ${M_n}$ is drawn from the complex gaussian ensemble, so that all the entries are independent complex gaussians of mean zero and variance one, it is a classical result of Ginibre that the asymptotics of ${\rho^{(k)}_n}$ near some point ${z \sqrt{n}}$ as ${n \rightarrow \infty}$ and ${z \in {\bf C}}$ is fixed are given by the determinantal rule

$\displaystyle \rho^{(k)}_n(z\sqrt{n} + w_1,\ldots,z\sqrt{n}+w_k) \rightarrow \hbox{det}( K(w_i,w_j) )_{1 \leq i,j \leq k} \ \ \ \ \ (1)$

for ${|z| < 1}$ and

$\displaystyle \rho^{(k)}_n(z\sqrt{n} + w_1,\ldots,z\sqrt{n}+w_k) \rightarrow 0$

for ${|z| > 1}$, where ${K}$ is the reproducing kernel

$\displaystyle K(z,w) := \frac{1}{\pi} e^{-|z|^2/2 - |w|^2/2 + z \overline{w}}.$

(There is also an asymptotic for the boundary case ${|z|=1}$, but it is more complicated to state.) In particular, we see that ${\rho^{(k)}_n(z \sqrt{n}) \rightarrow \frac{1}{\pi} 1_{|z| \leq 1}}$ for almost every ${z}$, which is a manifestation of the well-known circular law for these matrices; but the circular law only captures the macroscopic structure of the spectrum, whereas the asymptotic (1) describes the microscopic structure.

Our first main result is that the asymptotic (1) for ${|z|<1}$ also holds (in the sense of vague convergence) when ${M_n}$ is a matrix whose entries are independent with mean zero, variance one, exponentially decaying tails, and which all match moments with the complex gaussian to fourth order. (Actually we prove a stronger result than this which is valid for all bounded ${z}$ and has more uniform bounds, but is a bit more technical to state.) An analogous result is also established for real gaussians (but now one has to separate the correlation function into components depending on how many eigenvalues are real and how many are strictly complex; also, the limiting distribution is more complicated, being described by Pfaffians rather than determinants). Among other things, this allows us to partially extend some known results on complex or real gaussian ensembles to more general ensembles. For instance, there is a central limit theorem of Rider which establishes a central limit theorem for the number of eigenvalues of a complex gaussian matrix in a mesoscopic disk; from our results, we can extend this central limit theorem to matrices that match the complex gaussian ensemble to fourth order, provided that the disk is small enough (for technical reasons, our error bounds are not strong enough to handle large disks). Similarly, extending some results of Edelman-Kostlan-Shub and of Forrester-Nagao, we can show that for a matrix matching the real gaussian ensemble to fourth order, the number of real eigenvalues is ${\sqrt{\frac{2n}{\pi}} + O(n^{1/2-c})}$ with probability ${1-O(n^{-c})}$ for some absolute constant ${c>0}$.

There are several steps involved in the proof. The first step is to apply the Girko Hermitisation trick to replace the problem of understanding the spectrum of a non-Hermitian matrix, with that of understanding the spectrum of various Hermitian matrices. The two identities that realise this trick are, firstly, Jensen’s formula

$\displaystyle \log |\det(M_n-z_0)| = - \sum_{1 \leq i \leq n: \lambda_i(M_n) \in B(z_0,r)} \log \frac{r}{|\lambda_i(M_n)-z_0|}$

$\displaystyle + \frac{1}{2\pi} \int_0^{2\pi} \log |\det(M_n-z_0-re^{i\theta})|\ d\theta$

that relates the local distribution of eigenvalues to the log-determinants ${\log |\det(M_n-z_0)|}$, and secondly the elementary identity

$\displaystyle \log |\det(M_n - z)| = \frac{1}{2} \log|\det W_{n,z}| + \frac{1}{2} n \log n$

that relates the log-determinants of ${M_n-z}$ to the log-determinants of the Hermitian matrices

$\displaystyle W_{n,z} := \frac{1}{\sqrt{n}} \begin{pmatrix} 0 & M_n -z \\ (M_n-z)^* & 0 \end{pmatrix}.$

The main difficulty is then to obtain concentration and universality results for the Hermitian log-determinants ${\log|\det W_{n,z}|}$. This turns out to be a task that is analogous to the task of obtaining concentration for Wigner matrices (as we did in this recent paper), as well as central limit theorems for log-determinants of Wigner matrices (as we did in this other recent paper). In both of these papers, the main idea was to use the Four Moment Theorem for Wigner matrices (which can now be proven relatively easily by a combination of the local semi-circular law and resolvent swapping methods), combined with (in the latter paper) a central limit theorem for the gaussian unitary ensemble (GUE). This latter task was achieved by using the convenient Trotter normal form to tridiagonalise a GUE matrix, which has the effect of revealing the determinant of that matrix as the solution to a certain linear stochastic difference equation, and one can analyse the distribution of that solution via such tools as the martingale central limit theorem.

The matrices ${W_{n,z}}$ are somewhat more complicated than Wigner matrices (for instance, the semi-circular law must be replaced by a distorted Marchenko-Pastur law), but the same general strategy works to obtain concentration and universality for their log-determinants. The main new difficulty that arises is that the analogue of the Trotter norm for gaussian random matrices is not tridiagonal, but rather Hessenberg (i.e. upper-triangular except for the lower diagonal). This ultimately has the effect of expressing the relevant determinant as the solution to a nonlinear stochastic difference equation, which is a bit trickier to solve for. Fortunately, it turns out that one only needs good lower bounds on the solution, as one can use the second moment method to upper bound the determinant and hence the log-determinant (following a classical computation of Turan). This simplifies the analysis on the equation somewhat.

While this result is the first local universality result in the category of random matrices with independent entries, there are still two limitations to the result which one would like to remove. The first is the moment matching hypotheses on the matrix. Very recently, one of the ingredients of our paper, namely the local circular law, was proved without moment matching hypotheses by Bourgade, Yau, and Yin (provided one stays away from the edge of the spectrum); however, as of this time of writing the other main ingredient – the universality of the log-determinant – still requires moment matching. (The standard tool for obtaining universality without moment matching hypotheses is the heat flow method (and more specifically, the local relaxation flow method), but the analogue of Dyson Brownian motion in the non-Hermitian setting appears to be somewhat intractible, being a coupled flow on both the eigenvalues and eigenvectors rather than just on the eigenvalues alone.)

I’ve just uploaded to the arXiv my paper The asymptotic distribution of a single eigenvalue gap of a Wigner matrix, submitted to Probability Theory and Related Fields. This paper (like several of my previous papers) is concerned with the asymptotic distribution of the eigenvalues ${\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)}$ of a random Wigner matrix ${M_n}$ in the limit ${n \rightarrow \infty}$, with a particular focus on matrices drawn from the Gaussian Unitary Ensemble (GUE). This paper is focused on the bulk of the spectrum, i.e. to eigenvalues ${\lambda_i(M_n)}$ with ${\delta n \leq i \leq (1-\delta) n}$ for some fixed ${\delta>0}$.

The location of an individual eigenvalue ${\lambda_i(M_n)}$ is by now quite well understood. If we normalise the entries of the matrix ${M_n}$ to have mean zero and variance ${1}$, then in the asymptotic limit ${n \rightarrow \infty}$, the Wigner semicircle law tells us that with probability ${1-o(1)}$ one has

$\displaystyle \lambda_i(M_n) =\sqrt{n} u + o(\sqrt{n})$

where the classical location ${u = u_{i/n} \in [-2,2]}$ of the eigenvalue is given by the formula

$\displaystyle \int_{-2}^{u} \rho_{sc}(x)\ dx = \frac{i}{n}$

and the semicircular distribution ${\rho_{sc}(x)\ dx}$ is given by the formula

$\displaystyle \rho_{sc}(x) := \frac{1}{2\pi} (4-x^2)_+^{1/2}.$

Actually, one can improve the error term here from ${o(\sqrt{n})}$ to ${O( \log^{1/2+\epsilon} n)}$ for any ${\epsilon>0}$ (see this previous recent paper of Van and myself for more discussion of these sorts of estimates, sometimes known as eigenvalue rigidity estimates).

From the semicircle law (and the fundamental theorem of calculus), one expects the ${i^{th}}$ eigenvalue spacing ${\lambda_{i+1}(M_n)-\lambda_i(M_n)}$ to have an average size of ${\frac{1}{\sqrt{n} \rho_{sc}(u)}}$. It is thus natural to introduce the normalised eigenvalue spacing

$\displaystyle X_i := \frac{\lambda_{i+1}(M_n) - \lambda_i(M_n)}{1/\sqrt{n} \rho_{sc}(u)}$

and ask what the distribution of ${X_i}$ is.

As mentioned previously, we will focus on the bulk case ${\delta n \leq i\leq (1-\delta)n}$, and begin with the model case when ${M_n}$ is drawn from GUE. (In the edge case when ${i}$ is close to ${1}$ or to ${n}$, the distribution is given by the famous Tracy-Widom law.) Here, the distribution was almost (but as we shall see, not quite) worked out by Gaudin and Mehta. By using the theory of determinantal processes, they were able to compute a quantity closely related to ${X_i}$, namely the probability

$\displaystyle {\bf P}( N_{[\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)}, \sqrt{n} u + \frac{y}{\sqrt{n} \rho_{sc}(u)}]} = 0) \ \ \ \ \ (1)$

that an interval ${[\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)}, \sqrt{n} u + \frac{y}{\sqrt{n} \rho_{sc}(u)}]}$ near ${\sqrt{n} u}$ of length comparable to the expected eigenvalue spacing ${1/\sqrt{n} \rho_{sc}(u)}$ is devoid of eigenvalues. For ${u}$ in the bulk and fixed ${x,y}$, they showed that this probability is equal to

$\displaystyle \det( 1 - 1_{[x,y]} P 1_{[x,y]} ) + o(1),$

where ${P}$ is the Dyson projection

$\displaystyle P f(x) = \int_{\bf R} \frac{\sin(\pi(x-y))}{\pi(x-y)} f(y)\ dy$

to Fourier modes in ${[-1/2,1/2]}$, and ${\det}$ is the Fredholm determinant. As shown by Jimbo, Miwa, Tetsuji, Mori, and Sato, this determinant can also be expressed in terms of a solution to a Painleve V ODE, though we will not need this fact here. In view of this asymptotic and some standard integration by parts manipulations, it becomes plausible to propose that ${X_i}$ will be asymptotically distributed according to the Gaudin-Mehta distribution ${p(x)\ dx}$, where

$\displaystyle p(x) := \frac{d^2}{dx^2} \det( 1 - 1_{[0,x]} P 1_{[0,x]} ).$

A reasonably accurate approximation for ${p}$ is given by the Wigner surmise ${p(x) \approx \frac{1}{2} \pi x e^{-\pi x^2/4}}$, which was presciently proposed by Wigner as early as 1957; it is exact for ${n=2}$ but not in the asymptotic limit ${n \rightarrow \infty}$.

Unfortunately, when one tries to make this argument rigorous, one finds that the asymptotic for (1) does not control a single gap ${X_i}$, but rather an ensemble of gaps ${X_i}$, where ${i}$ is drawn from an interval ${[i_0 - L, i_0 + L]}$ of some moderate size ${L}$ (e.g. ${L = \log n}$); see for instance this paper of Deift, Kriecherbauer, McLaughlin, Venakides, and Zhou for a more precise formalisation of this statement (which is phrased slightly differently, in which one samples all gaps inside a fixed window of spectrum, rather than inside a fixed range of eigenvalue indices ${i}$). (This result is stated for GUE, but can be extended to other Wigner ensembles by the Four Moment Theorem, at least if one assumes a moment matching condition; see this previous paper with Van Vu for details. The moment condition can in fact be removed, as was done in this subsequent paper with Erdos, Ramirez, Schlein, Vu, and Yau.)

The problem is that when one specifies a given window of spectrum such as ${[\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)}, \sqrt{n} u + \frac{y}{\sqrt{n} \rho_{sc}(u)}]}$, one cannot quite pin down in advance which eigenvalues ${\lambda_i(M_n)}$ are going to lie to the left or right of this window; even with the strongest eigenvalue rigidity results available, there is a natural uncertainty of ${\sqrt{\log n}}$ or so in the ${i}$ index (as can be quantified quite precisely by this central limit theorem of Gustavsson).

The main difficulty here is that there could potentially be some strange coupling between the event (1) of an interval being devoid of eigenvalues, and the number ${N_{(-\infty,\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)})}(M_n)}$ of eigenvalues to the left of that interval. For instance, one could conceive of a possible scenario in which the interval in (1) tends to have many eigenvalues when ${N_{(-\infty,\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)})}(M_n)}$ is even, but very few when ${N_{(-\infty,\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)})}(M_n)}$ is odd. In this sort of situation, the gaps ${X_i}$ may have different behaviour for even ${i}$ than for odd ${i}$, and such anomalies would not be picked up in the averaged statistics in which ${i}$ is allowed to range over some moderately large interval.

The main result of the current paper is that these anomalies do not actually occur, and that all of the eigenvalue gaps ${X_i}$ in the bulk are asymptotically governed by the Gaudin-Mehta law without the need for averaging in the ${i}$ parameter. Again, this is shown first for GUE, and then extended to other Wigner matrices obeying a matching moment condition using the Four Moment Theorem. (It is likely that the moment matching condition can be removed here, but I was unable to achieve this, despite all the recent advances in establishing universality of local spectral statistics for Wigner matrices, mainly because the universality results in the literature are more focused on specific energy levels ${u}$ than on specific eigenvalue indices ${i}$. To make matters worse, in some cases universality is currently known only after an additional averaging in the energy parameter.)

The main task in the proof is to show that the random variable ${N_{(-\infty,\sqrt{n} u + \frac{x}{\sqrt{n} \rho_{sc}(u)})}(M_n)}$ is largely decoupled from the event in (1) when ${M_n}$ is drawn from GUE. To do this we use some of the theory of determinantal processes, and in particular the nice fact that when one conditions a determinantal process to the event that a certain spatial region (such as an interval) contains no points of the process, then one obtains a new determinantal process (with a kernel that is closely related to the original kernel). The main task is then to obtain a sufficiently good control on the distance between the new determinantal kernel and the old one, which we do by some functional-analytic considerations involving the manipulation of norms of operators (and specifically, the operator norm, Hilbert-Schmidt norm, and nuclear norm). Amusingly, the Fredholm alternative makes a key appearance, as I end up having to invert a compact perturbation of the identity at one point (specifically, I need to invert ${1 - 1_{[x,y]}P1_{[x,y]}}$, where ${P}$ is the Dyson projection and ${[x,y]}$ is an interval). As such, the bounds in my paper become ineffective, though I am sure that with more work one can invert this particular perturbation of the identity by hand, without the need to invoke the Fredholm alternative.

In the last three notes, we discussed the Bourgain-Gamburd expansion machine and two of its three ingredients, namely quasirandomness and product theorems, leaving only the non-concentration ingredient to discuss. We can summarise the results of the last three notes, in the case of fields of prime order, as the following theorem.

Theorem 1 (Non-concentration implies expansion in ${SL_d}$) Let ${p}$ be a prime, let ${d \geq 1}$, and let ${S}$ be a symmetric set of elements in ${G := SL_d(F_p)}$ of cardinality ${|S|=k}$ not containing the identity. Write ${\mu := \frac{1}{|S|} \sum_{s\in S}\delta_s}$, and suppose that one has the non-concentration property

$\displaystyle \sup_{H < G}\mu^{(n)}(H) < |G|^{-\kappa} \ \ \ \ \ (1)$

for some ${\kappa>0}$ and some even integer ${n \leq \Lambda \log |G|}$. Then ${Cay(G,S)}$ is a two-sided ${\epsilon}$-expander for some ${\epsilon>0}$ depending only on ${k, d, \kappa,\Lambda}$.

Proof: From (1) we see that ${\mu^{(n)}}$ is not supported in any proper subgroup ${H}$ of ${G}$, which implies that ${S}$ generates ${G}$. The claim now follows from the Bourgain-Gamburd expansion machine (Theorem 2 of Notes 4), the product theorem (Theorem 1 of Notes 5), and quasirandomness (Exercise 8 of Notes 3). $\Box$

Remark 1 The same argument also works if we replace ${F_p}$ by the field ${F_{p^j}}$ of order ${p^j}$ for some bounded ${j}$. However, there is a difficulty in the regime when ${j}$ is unbounded, because the quasirandomness property becomes too weak for the Bourgain-Gamburd expansion machine to be directly applicable. On theother hand, the above type of theorem was generalised to the setting of cyclic groups ${{\bf Z}/q{\bf Z}}$ with ${q}$ square-free by Varju, to arbitrary ${q}$ by Bourgain and Varju, and to more general algebraic groups than ${SL_d}$ and square-free ${q}$ by Salehi Golsefidy and Varju. It may be that some modification of the proof techniques in these papers may also be able to handle the field case ${F_{p^j}}$ with unbounded ${j}$.

It thus remains to construct tools that can establish the non-concentration property (1). The situation is particularly simple in ${SL_2(F_p)}$, as we have a good understanding of the subgroups of that group. Indeed, from Theorem 14 from Notes 5, we obtain the following corollary to Theorem 1:

Corollary 2 (Non-concentration implies expansion in ${SL_2}$) Let ${p}$ be a prime, and let ${S}$ be a symmetric set of elements in ${G := SL_2(F_p)}$ of cardinality ${|S|=k}$ not containing the identity. Write ${\mu := \frac{1}{|S|} \sum_{s\in S}\delta_s}$, and suppose that one has the non-concentration property

$\displaystyle \sup_{B}\mu^{(n)}(B) < |G|^{-\kappa} \ \ \ \ \ (2)$

for some ${\kappa>0}$ and some even integer ${n \leq \Lambda \log |G|}$, where ${B}$ ranges over all Borel subgroups of ${SL_2(\overline{F})}$. Then, if ${|G|}$ is sufficiently large depending on ${k,\kappa,\Lambda}$, ${Cay(G,S)}$ is a two-sided ${\epsilon}$-expander for some ${\epsilon>0}$ depending only on ${k, \kappa,\Lambda}$.

It turns out (2) can be verified in many cases by exploiting the solvable nature of the Borel subgroups ${B}$. We give two examples of this in these notes. The first result, due to Bourgain and Gamburd (with earlier partial results by Gamburd and by Shalom) generalises Selberg’s expander construction to the case when ${S}$ generates a thin subgroup of ${SL_2({\bf Z})}$:

Theorem 3 (Expansion in thin subgroups) Let ${S}$ be a symmetric subset of ${SL_2({\bf Z})}$ not containing the identity, and suppose that the group ${\langle S \rangle}$ generated by ${S}$ is not virtually solvable. Then as ${p}$ ranges over all sufficiently large primes, the Cayley graphs ${Cay(SL_2(F_p), \pi_p(S))}$ form a two-sided expander family, where ${\pi_p: SL_2({\bf Z}) \rightarrow SL_2(F_p)}$ is the usual projection.

Remark 2 One corollary of Theorem 3 (or of the non-concentration estimate (3) below) is that ${\pi_p(S)}$ generates ${SL_2(F_p)}$ for all sufficiently large ${p}$, if ${\langle S \rangle}$ is not virtually solvable. This is a special case of a much more general result, known as the strong approximation theorem, although this is certainly not the most direct way to prove such a theorem. Conversely, the strong approximation property is used in generalisations of this result to higher rank groups than ${SL_2}$.

Exercise 1 In the converse direction, if ${\langle S\rangle}$ is virtually solvable, show that for sufficiently large ${p}$, ${\pi_p(S)}$ fails to generate ${SL_2(F_p)}$. (Hint: use Theorem 14 from Notes 5 to prevent ${SL_2(F_p)}$ from having bounded index solvable subgroups.)

Exercise 2 (Lubotzsky’s 1-2-3 problem) Let ${S := \{ \begin{pmatrix}1 & \pm 3 \\ 0 & 1 \end{pmatrix}, \begin{pmatrix}1 & 0 \\ \pm 3 & 1 \end{pmatrix}}$.

• (i) Show that ${S}$ generates a free subgroup of ${SL_2({\bf Z})}$. (Hint: use a ping-pong argument, as in Exercise 23 of Notes 2.)
• (ii) Show that if ${v, w}$ are two distinct elements of the sector ${\{ (x,y) \in {\bf R}^2_+: x/2 < y < 2x \}}$, then there os no element ${g \in \langle S \rangle}$ for which ${gv = w}$. (Hint: this is another ping-pong argument.) Conclude that ${\langle S \rangle}$ has infinite index in ${SL_2({\bf Z})}$. (Contrast this with the situation in which the ${3}$ coefficients in ${S}$ are replaced by ${1}$ or ${2}$, in which case ${\langle S \rangle}$ is either all of ${SL_2({\bf Z})}$, or a finite index subgroup, as demonstrated in Exercise 23 of Notes 2).
• (iii) Show that ${Cay(SL_2(F_p), \pi_p(S))}$ for sufficiently large primes ${p}$ form a two-sided expander family.

Remark 3 Theorem 3 has been generalised to arbitrary linear groups, and with ${F_p}$ replaced by ${{\bf Z}/q{\bf Z}}$ for square-free ${q}$; see this paper of Salehi Golsefidy and Varju. In this more general setting, the condition of virtual solvability must be replaced by the condition that the connected component of the Zariski closure of ${\langle S \rangle}$ is perfect. An effective version of Theorem 3 (with completely explicit constants) was recently obtained by Kowalski.

The second example concerns Cayley graphs constructed using random elements of ${SL_2(F_p)}$.

Theorem 4 (Random generators expand) Let ${p}$ be a prime, and let ${x,y}$ be two elements of ${SL_2(F_p)}$ chosen uniformly at random. Then with probability ${1-o_{p \rightarrow \infty}(1)}$, ${Cay(SL_2(F_p), \{x,x^{-1},y,y^{-1}\})}$ is a two-sided ${\epsilon}$-expander for some absolute constant ${\epsilon}$.

Remark 4 As with Theorem 3, Theorem 4 has also been extended to a number of other groups, such as the Suzuki groups (in this paper of Breuillard, Green, and Tao), and more generally to finite simple groups of Lie type of bounded rank (in forthcoming work of Breuillard, Green, Guralnick, and Tao). There are a number of other constructions of expanding Cayley graphs in such groups (and in other interesting groups, such as the alternating groups) beyond those discussed in these notes; see this recent survey of Lubotzky for further discussion. It has been conjectured by Lubotzky and Weiss that any pair ${x,y}$ of (say) ${SL_2(F_p)}$ that generates the group, is a two-sided ${\epsilon}$-expander for an absolute constant ${\epsilon}$: in the case of ${SL_2(F_p)}$, this has been established for a density one set of primes by Breuillard and Gamburd.

— 1. Expansion in thin subgroups —

We now prove Theorem 3. The first observation is that the expansion property is monotone in the group ${\langle S \rangle}$:

Exercise 3 Let ${S, S'}$ be symmetric subsets of ${SL_2({\bf Z})}$ not containing the identity, such that ${\langle S \rangle \subset \langle S' \rangle}$. Suppose that ${Cay(SL_2(F_p), \pi_p(S))}$ is a two-sided expander family for sufficiently large primes ${p}$. Show that ${Cay(SL_2(F_p), \pi_p(S'))}$ is also a two-sided expander family.

As a consequence, Theorem 3 follows from the following two statments:

Theorem 5 (Tits alternative) Let ${\Gamma \subset SL_2({\bf Z})}$ be a group. Then exactly one of the following statements holds:

• (i) ${\Gamma}$ is virtually solvable.
• (ii) ${\Gamma}$ contains a copy of the free group ${F_2}$ of two generators as a subgroup.

Theorem 6 (Expansion in free groups) Let ${x,y \in SL_2({\bf Z})}$ be generators of a free subgroup of ${SL_2({\bf Z})}$. Then as ${p}$ ranges over all sufficiently large primes, the Cayley graphs ${Cay(SL_2(F_p), \pi_p(\{x,y,x^{-1},y^{-1}\}))}$ form a two-sided expander family.

Theorem 5 is a special case of the famous Tits alternative, which among other things allows one to replace ${SL_2({\bf Z})}$ by ${GL_d(k)}$ for any ${d \geq 1}$ and any field ${k}$ of characteristic zero (and fields of positive characteristic are also allowed, if one adds the requirement that ${\Gamma}$ be finitely generated). We will not prove the full Tits alternative here, but instead just give an ad hoc proof of the special case in Theorem 5 in the following exercise.

Exercise 4 Given any matrix ${g \in SL_2({\bf Z})}$, the singular values are ${\|g\|_{op}}$ and ${\|g\|_{op}^{-1}}$, and we can apply the singular value decomposition to decompose

$\displaystyle g = u_1(g) \|g\|_{op} v_1^*(g) + u_2(g) \|g\|_{op}^{-1} v_2(g)^*$

where ${u_1(g),u_2(g)\in {\bf C}^2}$ and ${v_1(g), v_2(g) \in {\bf C}^2}$ are orthonormal bases. (When ${\|g\|_{op}>1}$, these bases are uniquely determined up to phase rotation.) We let ${\tilde u_1(g) \in {\bf CP}^1}$ be the projection of ${u_1(g)}$ to the projective complex plane, and similarly define ${\tilde v_2(g)}$.

Let ${\Gamma}$ be a subgroup of ${SL_2({\bf Z})}$. Call a pair ${(u,v) \in {\bf CP}^1 \times {\bf CP}^1}$ a limit point of ${\Gamma}$ if there exists a sequence ${g_n \in \Gamma}$ with ${\|g_n\|_{op} \rightarrow \infty}$ and ${(\tilde u_1(g_n), \tilde v_2(g_n)) \rightarrow (u,v)}$.

• (i) Show that if ${\Gamma}$ is infinite, then there is at least one limit point.
• (ii) Show that if ${(u,v)}$ is a limit point, then so is ${(v,u)}$.
• (iii) Show that if there are two limit points ${(u,v), (u',v')}$ with ${\{u,v\} \cap \{u',v'\} = \emptyset}$, then there exist ${g,h \in \Gamma}$ that generate a free group. (Hint: Choose ${(\tilde u_1(g), \tilde v_2(g))}$ close to ${(u,v)}$ and ${(\tilde u_1(h),\tilde v_2(h))}$ close to ${(u',v')}$, and consider the action of ${g}$ and ${h}$ on ${{\bf CP}^1}$, and specifically on small neighbourhoods of ${u,v,u',v'}$, and set up a ping-pong type situation.)
• (iv) Show that if ${g \in SL_2({\bf Z})}$ is hyperbolic (i.e. it has an eigenvalue greater than 1), with eigenvectors ${u,v}$, then the projectivisations ${(\tilde u,\tilde v)}$ of ${u,v}$ form a limit point. Similarly, if ${g}$ is regular parabolic (i.e. it has an eigenvalue at 1, but is not the identity) with eigenvector ${u}$, show that ${(\tilde u,\tilde bu)}$ is a limit point.
• (v) Show that if ${\Gamma}$ has no free subgroup of two generators, then all hyperbolic and regular parabolic elements of ${\Gamma}$ have a common eigenvector. Conclude that all such elements lie in a solvable subgroup of ${\Gamma}$.
• (vi) Show that if an element ${g \in SL_2({\bf Z})}$ is neither hyperbolic nor regular parabolic, and is not a multiple of the identity, then ${g}$ is conjugate to a rotation by ${\pi/2}$ (in particular, ${g^2=-1}$).
• (vii) Establish Theorem 5. (Hint: show that two square roots of ${-1}$ in ${SL_2({\bf Z})}$ cannot multiply to another square root of ${-1}$.)

Now we prove Theorem 6. Let ${\Gamma}$ be a free subgroup of ${SL_2({\bf Z})}$ generated by two generators ${x,y}$. Let ${\mu := \frac{1}{4} (\delta_x +\delta_{x^{-1}} + \delta_y + \delta_{y^{-1}})}$ be the probability measure generating a random walk on ${SL_2({\bf Z})}$, thus ${(\pi_p)_* \mu}$ is the corresponding generator on ${SL_2(F_p)}$. By Corollary 2, it thus suffices to show that

$\displaystyle \sup_{B}((\pi_p)_* \mu)^{(n)}(B) < p^{-\kappa} \ \ \ \ \ (3)$

for all sufficiently large ${p}$, some absolute constant ${\kappa>0}$, and some even ${n = O(\log p)}$ (depending on ${p}$, of course), where ${B}$ ranges over Borel subgroups.

As ${\pi_p}$ is a homomorphism, one has ${((\pi_p)_* \mu)^{(n)}(B) = (\pi_p)_* (\mu^{(n)})(B) = \mu^{(n)}(\pi_p^{-1}(B))}$ and so it suffices to show that

$\displaystyle \sup_{B} \mu^{(n)}(\pi_p^{-1}(B)) < p^{-\kappa}.$

To deal with the supremum here, we will use an argument of Bourgain and Gamburd, taking advantage of the fact that all Borel groups of ${SL_2}$ obey a common group law, the point being that free groups such as ${\Gamma}$ obey such laws only very rarely. More precisely, we use the fact that the Borel groups are solvable of derived length two; in particular we have

$\displaystyle [[a,b],[c,d]] = 1 \ \ \ \ \ (4)$

for all ${a,b,c,d \in B}$. Now, ${\mu^{(n)}}$ is supported on matrices in ${SL_2({\bf Z})}$ whose coefficients have size ${O(\exp(O(n)))}$ (where we allow the implied constants to depend on the choice of generators ${x,y}$), and so ${(\pi_p)_*( \mu^{(n)} )}$ is supported on matrices in ${SL_2(F_p)}$ whose coefficients also have size ${O(\exp(O(n)))}$. If ${n}$ is less than a sufficiently small multiple of ${\log p}$, these coefficients are then less than ${p^{1/10}}$ (say). As such, if ${\tilde a,\tilde b,\tilde c,\tilde d \in SL_2({\bf Z})}$ lie in the support of ${\mu^{(n)}}$ and their projections ${a = \pi_p(\tilde a), \ldots, d = \pi_p(\tilde d)}$ obey the word law (4) in ${SL_2(F_p)}$, then the original matrices ${\tilde a, \tilde b, \tilde c, \tilde d}$ obey the word law (4) in ${SL_2({\bf Z})}$. (This lifting of identities from the characteristic ${p}$ setting of ${SL_2(F_p)}$ to the characteristic ${0}$ setting of ${SL_2({\bf Z})}$ is a simple example of the “Lefschetz principle”.)

To summarise, if we let ${E_{n,p,B}}$ be the set of all elements of ${\pi_p^{-1}(B)}$ that lie in the support of ${\mu^{(n)}}$, then (4) holds for all ${a,b,c,d \in E_{n,p,B}}$. This severely limits the size of ${E_{n,p,B}}$ to only be of polynomial size, rather than exponential size:

Proposition 7 Let ${E}$ be a subset of the support of ${\mu^{(n)}}$ (thus, ${E}$ consists of words in ${x,y,x^{-1},y^{-1}}$ of length ${n}$) such that the law (4) holds for all ${a,b,c,d \in E}$. Then ${|E| \ll n^2}$.

The proof of this proposition is laid out in the exercise below.

Exercise 5 Let ${\Gamma}$ be a free group generated by two generators ${x,y}$. Let ${B}$ be the set of all words of length at most ${n}$ in ${x,y,x^{-1},y^{-1}}$.

• (i) Show that if ${a,b \in \Gamma}$ commute, then ${a, b}$ lie in the same cyclic group, thus ${a = c^i, b = c^j}$ for some ${c \in \Gamma}$ and ${i,j \in {\bf Z}}$.
• (ii) Show that if ${a \in \Gamma}$, there are at most ${O(n)}$ elements of ${B}$ that commute with ${a}$.
• (iii) Show that if ${a,c \in \Gamma}$, there are at most ${O(n)}$ elements ${b}$ of ${B}$ with ${[a,b] = c}$.
• (iv) Prove Proposition 7.

Now we can conclude the proof of Theorem 3:

Exercise 6 Let ${\Gamma}$ be a free group generated by two generators ${x,y}$.

• (i) Show that ${\| \mu^{(n)} \|_{\ell^\infty(\Gamma)} \ll c^n}$ for some absolute constant ${0 < c<1}$. (For much more precise information on ${\mu^{(n)}}$, see this paper of Kesten.)
• (ii) Conclude the proof of Theorem 3.

— 2. Random generators expand —

We now prove Theorem 4. Let ${{\bf F}_2}$ be the free group on two formal generators ${a,b}$, and let ${\mu := \frac{1}{4}(\delta_a + \delta_b + \delta_{a^{-1}}+ \delta_{b^{-1}}}$ be the generator of the random walk. For any word ${w \in {\bf F}_2}$ and any ${x,y}$ in a group ${G}$, let ${w(x,y) \in G}$ be the element of ${G}$ formed by substituting ${x,y}$ for ${a,b}$ respectively in the word ${w}$; thus ${w}$ can be viewed as a map ${w: G \times G \rightarrow G}$ for any group ${G}$. Observe that if ${w}$ is drawn randomly using the distribution ${\mu^{(n)}}$, and ${x,y \in SL_2(F_p)}$, then ${w(x,y)}$ is distributed according to the law ${\tilde \mu^{(n)}}$, where ${\tilde \mu := \frac{1}{4}(\delta_x + \delta_y + \delta_{x^{-1}}+ \delta_{y^{-1}})}$. Applying Corollary 2, it suffices to show that whenever ${p}$ is a large prime and ${x,y}$ are chosen uniformly and independently at random from ${SL_2(F_p)}$, that with probability ${1-o_{p \rightarrow \infty}(1)}$, one has

$\displaystyle \sup_B {\bf P}_w ( w(x,y) \in B ) \leq p^{-\kappa} \ \ \ \ \ (5)$

for some absolute constant ${\kappa}$, where ${B}$ ranges over all Borel subgroups of ${SL_2(\overline{F_p})}$ and ${w}$ is drawn from the law ${\mu^{(n)}}$ for some even natural number ${n = O(\log p)}$.

Let ${B_n}$ denote the words in ${{\bf F}_2}$ of length at most ${n}$. We may use the law (4) to obtain good bound on the supremum in (5) assuming a certain non-degeneracy property of the word evaluations ${w(x,y)}$:

Exercise 7 Let ${n}$ be a natural number, and suppose that ${x,y \in SL_2(F_p)}$ is such that ${w(x,y) \neq 1}$ for ${w \in B_{100n} \backslash \{1\}}$. Show that

$\displaystyle \sup_B {\bf P}_w ( w(x,y) \in B ) \ll \exp(-cn)$

for some absolute constant ${c>0}$, where ${w}$ is drawn from the law ${\mu^{(n)}}$. (Hint: use (4) and the hypothesis to lift the problem up to ${{\bf F}_2}$, at which point one can use Proposition 7 and Exercise 6.)

In view of this exercise, it suffices to show that with probability ${1-o_{p \rightarrow\infty}(1)}$, one has ${w(x,y) \neq 1}$ for all ${w \in B_{100n} \backslash \{1\}}$ for some ${n}$ comparable to a small multiple of ${\log p}$. As ${B_{100n}}$ has ${\exp(O(n))}$ elements, it thus suffices by the union bound to show that

$\displaystyle {\bf P}_{x,y}(w(x,y)=1) \leq p^{-\gamma} \ \ \ \ \ (6)$

for some absolute constant ${\gamma > 0}$, and any ${w \in {\bf F}_2 \backslash \{1\}}$ of length less than ${c\log p}$ for some sufficiently small absolute constant ${c>0}$.

Let us now fix a non-identity word ${w}$ of length ${|w|}$ less than ${c\log p}$, and consider ${w}$ as a function from ${SL_2(k) \times SL_2(k)}$ to ${SL_2(k)}$ for an arbitrary field ${k}$. We can identify ${SL_2(k)}$ with the set ${\{ (a,b,c,d)\in k^4: ad-bc=1\}}$. A routine induction then shows that the expression ${w((a,b,c,d),(a',b',c',d'))}$ is then a polynomial in the eight variables ${a,b,c,d,a',b',c',d'}$ of degree ${O(|w|)}$ and coefficients which are integers of size ${O( \exp( O(|w|) ) )}$. Let us then make the additional restriction to the case ${a,a' \neq 0}$, in which case we can write ${d = \frac{bc+1}{a}}$ and ${d' =\frac{b'c'+1}{a'}}$. Then ${w((a,b,c,d),(a',b',c',d'))}$ is now a rational function of ${a,b,c,a',b',c'}$ whose numerator is a polynomial of degree ${O(|w|)}$ and coefficients of size ${O( \exp( O(|w|) ) )}$, and the denominator is a monomial of ${a,a'}$ of degree ${O(|w|)}$.

We then specialise this rational function to the field ${k=F_p}$. It is conceivable that when one does so, the rational function collapses to the constant polynomial ${(1,0,0,1)}$, thus ${w((a,b,c,d),(a',b',c',d'))=1}$ for all ${(a,b,c,d),(a',b',c',d') \in SL_2(F_p)}$ with ${a,a' \neq 0}$. (For instance, this would be the case if ${w(x,y) = x^{|SL_2(F_p)|}}$, by Lagrange’s theorem, if it were not for the fact that ${|w|}$ is far too large here.) But suppose that this rational function does not collapse to the constant rational function. Applying the Schwarz-Zippel lemma (Exercise 23 from Notes 5), we then see that the set of pairs ${(a,b,c,d),(a',b',c',d') \in SL_2(F_p)}$ with ${a,a' \neq 0}$ and ${w((a,b,c,d),(a',b',c',d'))=1}$ is at most ${O( |w| p^5 )}$; adding in the ${a=0}$ and ${a'=0}$ cases, one still obtains a bound of ${O(|w|p^5)}$, which is acceptable since ${|SL_2(F_p)|^2 \sim p^6}$ and ${|w| = O( \log p )}$. Thus, the only remaining case to consider is when the rational function ${w((a,b,c,d),(a',b',c',d'))}$ is identically ${1}$ on ${SL_2(F_p)}$ with ${a,a' \neq 0}$.

Now we perform another “Lefschetz principle” maneuvre to change the underlying field. Recall that the denominator of rational function ${w((a,b,c,d),(a',b',c',d'))}$ is monomial in ${a,a'}$, and the numerator has coefficients of size ${O(\exp(O(|w|)))}$. If ${|w|}$ is less than ${c\log p}$ for a sufficiently small ${p}$, we conclude in particular (for ${p}$ large enough) that the coefficients all have magnitude less than ${p}$. As such, the only way that this function can be identically ${1}$ on ${SL_2(F_p)}$ is if it is identically ${1}$ on ${SL_2(k)}$ for all ${k}$ with ${a,a' \neq 0}$, and hence for ${a=0}$ or ${a'=0}$ also by taking Zariski closures.

On the other hand, we know that for some choices of ${k}$, e.g. ${k={\bf R}}$, ${SL_2(k)}$ contains a copy ${\Gamma}$ of the free group on two generators (see e.g. Exercise 23 of Notes 2). As such, it is not possible for any non-identity word ${w}$ to be identically trivial on ${SL_2(k) \times SL_2(k)}$. Thus this case cannot actually occur, completing the proof of (6) and hence of Theorem 4.

Remark 5 We see from the above argument that the existence of subgroups ${\Gamma}$ of an algebraic group with good “independence” properties – such as that of generating a free group – can be useful in studying the expansion properties of that algebraic group, even if the field of interest in the latter is distinct from that of the former. For more complicated algebraic groups than ${SL_2}$, in which laws such as (4) are not always available, it turns out to be useful to place further properties on the subgroup ${\Gamma}$, for instance by requiring that all non-abelian subgroups of that group be Zariski dense (a property which has been called strong density), as this turns out to be useful for preventing random walks from concentrating in proper algebraic subgroups. See this paper of Breuillard, Guralnick, Green and Tao for constructions of strongly dense free subgroups of algebraic groups and further discussion.