The fundamental notions of calculus, namely differentiation and integration, are often viewed as being the quintessential concepts in mathematical analysis, as their standard definitions involve the concept of a limit. However, it is possible to capture most of the essence of these notions by purely algebraic means (almost completely avoiding the use of limits, Riemann sums, and similar devices), which turns out to be useful when trying to generalise these concepts to more abstract situations in which it becomes convenient to permit the underlying number systems involved to be something other than the real or complex numbers, even if this makes many standard analysis constructions unavailable. For instance, the algebraic notion of a derivation often serves as a substitute for the analytic notion of a derivative in such cases, by abstracting out the key algebraic properties of differentiation, namely linearity and the Leibniz rule (also known as the product rule).
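Abstract algebraic analogues of integration are less well known, but can still be developed. To motivate such an abstraction, consider the integration functional $I: \mathcal{S}(\mathbf{R} \to \mathbf{C}) \to \mathbf{C}$ from the space $\mathcal{S}(\mathbf{R} \to \mathbf{C})$ of complex-valued Schwartz functions $f: \mathbf{R} \to \mathbf{C}$ to the complex numbers, defined by
$$ I(f) := \int_{\mathbf{R}} f(x)\, dx, $$
where the integration on the right is the usual Lebesgue integral (or improper Riemann integral) from analysis. This functional obeys two obvious algebraic properties. Firstly, it is linear over $\mathbf{C}$, thus
$$ I(cf) = c I(f) \qquad (1) $$
and
$$ I(f+g) = I(f) + I(g) \qquad (2) $$
for all $f, g \in \mathcal{S}(\mathbf{R} \to \mathbf{C})$ and $c \in \mathbf{C}$. Secondly, it is translation invariant, thus
$$ I(\tau_h f) = I(f) \qquad (3) $$
for all $h \in \mathbf{R}$, where $\tau_h f(x) := f(x-h)$ is the translation of $f$ by $h$. Motivated by the uniqueness theory of Haar measure, one might expect that these two axioms already uniquely determine $I$ after one sets a normalisation, for instance by requiring that
$$ I(x \mapsto e^{-\pi x^2}) = 1. \qquad (4) $$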
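This is not quite true as stated (one can construct pathological translation-invariant linear functionals on $\mathcal{S}(\mathbf{R} \to \mathbf{C})$ that are not multiples of the standard integration functional), but if one adds a mild analytical axiom, such as continuity of $I$ (using the usual Schwartz topology on $\mathcal{S}(\mathbf{R} \to \mathbf{C})$), then the above axioms are enough to uniquely pin down the notion of integration. Indeed, if $I: \mathcal{S}(\mathbf{R} \to \mathbf{C}) \to \mathbf{C}$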
is a continuous linear functional that is translation invariant, then from the linearity and translation invariance axioms one has
$$ I\left( \frac{f - \tau_h f}{h} \right) = 0 $$
for all $f \in \mathcal{S}(\mathbf{R} \to \mathbf{C})$ and non-zero reals $h$. If $f$ is Schwartz, then as $h \to 0$, one can verify that the Newton quotients $\frac{f - \tau_h f}{h}: x \mapsto \frac{f(x) - f(x-h)}{h}$ converge in the Schwartz topology to the derivative $f'$ of $f$, so by the continuity axiom one has
$$ I(f') = 0. \qquad (5) $$
Next, note that any Schwartz function of integral zero has an antiderivative which is also Schwartz, and so $I$ annihilates all zero-integral Schwartz functions, and thus must be a scalar multiple of the usual integration functional. Using the normalisation (4), we see that $I$ must therefore be the usual integration functional, giving the claimed uniqueness.
Motivated by the above discussion, we can define the notion of an abstract integration functional $I: T \to X$ taking values in some vector space $X$, and applied to inputs $f$ in some other vector space $T$ that enjoys a linear action $h \mapsto \tau_h$ (the “translation action”) of some group $H$, as being a functional which is both linear and translation invariant, thus one has the axioms (1), (2), (3) for all $f, g \in T$, scalars $c$, and $h \in H$. The previous discussion then considered the special case when $X = \mathbf{C}$, $T = \mathcal{S}(\mathbf{R} \to \mathbf{C})$, $H = \mathbf{R}$, and $\tau$ was the usual translation action.
Once we have performed this abstraction, we can now present analogues of classical integration which bear very little analytic resemblance to the classical concept, but which still have much of the algebraic structure of integration. Consider for instance the situation in which we keep the complex range $X = \mathbf{C}$, the translation group $H = \mathbf{R}$, and the usual translation action $h \mapsto \tau_h$, but we replace the space $\mathcal{S}(\mathbf{R} \to \mathbf{C})$ of Schwartz functions by the space $\mathrm{Poly}_{\le d}(\mathbf{R} \to \mathbf{C})$ of polynomials $x \mapsto a_0 + a_1 x + \dots + a_d x^d$ of degree at most $d$ with complex coefficients, where $d$ is a fixed natural number; note that this space is translation invariant, so it makes sense to talk about an abstract integration functional $I: \mathrm{Poly}_{\le d}(\mathbf{R} \to \mathbf{C}) \to \mathbf{C}$. Of course, one cannot apply traditional integration concepts to non-zero polynomials, as they are not absolutely integrable. But one can repeat the previous arguments to show that any abstract integration functional must annihilate derivatives of polynomials of degree at most $d$:
$$ I(f') = 0 \text{ for all } f \in \mathrm{Poly}_{\le d}(\mathbf{R} \to \mathbf{C}). $$
Every polynomial of degree at most $d-1$ is the derivative of a polynomial of degree at most $d$, and is thus annihilated by $I$, which makes $I$ a scalar multiple of the functional that extracts the top coefficient $a_d$ of a polynomial, thus if one sets a normalisation
$$ I(x \mapsto x^d) = c $$
for some constant $c$, then one has
$$ I(x \mapsto a_0 + a_1 x + \dots + a_d x^d) = c\, a_d. \qquad (6) $$
So we see that up to a normalising constant, the operation of extracting the top order coefficient of a polynomial of fixed degree serves as the analogue of integration. In particular, despite the fact that integration is supposed to be the “opposite” of differentiation (as indicated for instance by (5)), we see in this case that integration is basically ($d$-fold) differentiation; indeed, compare (6) with the identity
$$ \left( \frac{d}{dx} \right)^d (a_0 + a_1 x + \dots + a_d x^d) = d!\, a_d. $$
In particular, we see that, in contrast to the usual Lebesgue integral, the integration functional (6) can be localised to an arbitrary location: one only needs to know the germ of the polynomial at a single point $x_0$ in order to determine the value of the functional (6). This localisation property may initially seem at odds with the translation invariance, but the two can be reconciled thanks to the extremely rigid nature of the class $\mathrm{Poly}_{\le d}(\mathbf{R} \to \mathbf{C})$, in contrast to the Schwartz class $\mathcal{S}(\mathbf{R} \to \mathbf{C})$ which admits bump functions and so can generate local phenomena that can only be detected in small regions of the underlying spatial domain, and which therefore forces any translation-invariant integration functional on such function classes to measure the function at every single point in space.
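The reversal of the relationship between integration and differentiation is also reflected in the fact that the abstract integration operation on polynomials interacts with the scaling operation in essentially the opposite way from the classical integration operation. Indeed, for classical integration on $\mathbf{R}$, one has
$$ \int_{\mathbf{R}} f(\lambda x)\, dx = \frac{1}{\lambda} \int_{\mathbf{R}} f(x)\, dx $$
for Schwartz functions $f \in \mathcal{S}(\mathbf{R} \to \mathbf{C})$ and $\lambda > 0$, and so in this case the integration functional $I(f) := \int_{\mathbf{R}} f(x)\, dx$ obeys the scaling law
$$ I(\delta_\lambda f) = \lambda^{-1} I(f), $$
where $\delta_\lambda f(x) := f(\lambda x)$ is the dilation of $f$ by $\lambda$. In contrast, the abstract integration operation defined in (6) obeys the opposite scaling law
$$ I(\delta_\lambda f) = \lambda^{d} I(f). \qquad (7) $$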
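As a sanity check on the algebra, here is a minimal Python sketch (not part of the original discussion) that represents a polynomial of degree at most $d$ by its list of coefficients and verifies, in exact rational arithmetic, that the top-coefficient functional (6) is translation invariant, annihilates derivatives, and scales like $\lambda^d$ under dilation. The normalisation $c = 1$ and the helper names `translate`, `dilate`, `derivative` and `integrate` are assumptions made purely for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def translate(coeffs, h):
    """Coefficients of x -> P(x - h); coeffs[k] is the coefficient of x^k."""
    out = [Fraction(0)] * len(coeffs)
    for k, a in enumerate(coeffs):
        for j in range(k + 1):
            # binomial expansion of a * (x - h)^k
            out[j] += a * comb(k, j) * (-h) ** (k - j)
    return out

def dilate(coeffs, lam):
    """Coefficients of x -> P(lam * x)."""
    return [a * lam ** k for k, a in enumerate(coeffs)]

def derivative(coeffs):
    """Coefficients of P', padded with a zero so the degree bound d is kept."""
    return [k * a for k, a in enumerate(coeffs)][1:] + [Fraction(0)]

def integrate(coeffs, c=1):
    """Abstract integral (assumed normalisation c): c times the x^d coefficient."""
    return c * coeffs[-1]

d = 3
P = [Fraction(5), Fraction(-2), Fraction(7, 3), Fraction(4)]  # 5 - 2x + (7/3)x^2 + 4x^3
h, lam = Fraction(3, 2), Fraction(-2)

translation_invariant = integrate(translate(P, h)) == integrate(P)
kills_derivatives = integrate(derivative(P)) == 0
scaling_law = integrate(dilate(P, lam)) == lam ** d * integrate(P)

# d-fold differentiation recovers d! times the top coefficient
top = P
for _ in range(d):
    top = derivative(top)
d_fold_derivative = top[0]
```

Exact `Fraction` arithmetic is used so that the identities can be tested with `==` rather than up to rounding error.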
Remark 1 One way to interpret what is going on is to view the integration operation (6) as a renormalised version of integration. A polynomial $x \mapsto a_0 + a_1 x + \dots + a_d x^d$ is, in general, not absolutely integrable, and the partial integrals
$$ \int_0^N (a_0 + a_1 x + \dots + a_d x^d)\, dx $$
diverge as $N \to \infty$. But if one renormalises these integrals by the factor $\frac{1}{N^{d+1}}$, then one recovers convergence,
$$ \lim_{N \to \infty} \frac{1}{N^{d+1}} \int_0^N (a_0 + a_1 x + \dots + a_d x^d)\, dx = \frac{1}{d+1} a_d, $$
thus giving an interpretation of (6) as a renormalised classical integral, with the renormalisation being responsible for the unusual scaling relationship in (7). However, this interpretation is a little artificial, and it seems that it is best to view functionals such as (6) from an abstract algebraic perspective, rather than to try to force an analytic interpretation on them.
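Now we return to the classical Lebesgue integral
$$ I(f) := \int_{\mathbf{R}} f(x)\, dx. \qquad (8) $$
As noted earlier, this integration functional has a translation invariance associated to translations along the real line $\mathbf{R}$, as well as a dilation invariance by real dilation parameters $\lambda > 0$. However, if we refine the class $\mathcal{S}(\mathbf{R} \to \mathbf{C})$ of functions somewhat, we can obtain a stronger family of invariances, in which we allow complex translations and dilations. More precisely, let $\mathcal{SE}(\mathbf{C} \to \mathbf{C})$ denote the space of all functions $f: \mathbf{C} \to \mathbf{C}$ which are entire (or equivalently, are given by a Taylor series with an infinite radius of convergence around the origin) and also admit rapid decay in a sectorial neighbourhood of the real line, or more precisely there exists an $\epsilon > 0$ such that for every $A > 0$ there exists $C_A > 0$ such that one has the bound
$$ |f(z)| \le C_A \langle z \rangle^{-A}, \qquad \langle z \rangle := (1+|z|^2)^{1/2}, $$
whenever $|\mathrm{Im}(z)| \le A + \epsilon |\mathrm{Re}(z)|$. For want of a better name, we shall call elements of this space Schwartz entire functions. This is clearly a complex vector space. Typical examples of Schwartz entire functions are the complex gaussians
$$ z \mapsto e^{-\pi (a z^2 + 2 b z + c)} $$
where $a, b, c$ are complex numbers with $\mathrm{Re}(a) > 0$. From the Cauchy integral formula (and its derivatives) we see that if $f$ lies in $\mathcal{SE}(\mathbf{C} \to \mathbf{C})$, then the restriction of $f$ to the real line lies in $\mathcal{S}(\mathbf{R} \to \mathbf{C})$; conversely, from analytic continuation we see that every function in $\mathcal{S}(\mathbf{R} \to \mathbf{C})$ has at most one extension in $\mathcal{SE}(\mathbf{C} \to \mathbf{C})$. Thus one can identify $\mathcal{SE}(\mathbf{C} \to \mathbf{C})$ with a subspace of $\mathcal{S}(\mathbf{R} \to \mathbf{C})$, and in particular the integration functional (8) is inherited by $\mathcal{SE}(\mathbf{C} \to \mathbf{C})$, and by abuse of notation we denote the resulting functional $I: \mathcal{SE}(\mathbf{C} \to \mathbf{C}) \to \mathbf{C}$ as $I$ also. Note, in analogy with the situation with polynomials, that this abstract integration functional is somewhat localised; one only needs to evaluate the function $f$ on the real line, rather than the entire complex plane, in order to compute $I(f)$. This is consistent with the rigid nature of Schwartz entire functions, as one can uniquely recover the entire function from its values on the real line by analytic continuation.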
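Of course, the functional $I: \mathcal{SE}(\mathbf{C} \to \mathbf{C}) \to \mathbf{C}$ remains translation invariant with respect to real translation:
$$ I(\tau_h f) = I(f) \text{ for all } h \in \mathbf{R}. $$
However, thanks to contour shifting, we now also have translation invariance with respect to complex translation:
$$ I(\tau_h f) = I(f) \text{ for all } h \in \mathbf{C}, $$
where of course we continue to define the translation operator $\tau_h$ for complex $h$ by the usual formula $\tau_h f(x) := f(x-h)$. In a similar vein, we also have the scaling law
$$ I(\delta_\lambda f) = \lambda^{-1} I(f), \qquad \delta_\lambda f(z) := f(\lambda z), $$
for any $f \in \mathcal{SE}(\mathbf{C} \to \mathbf{C})$, if $\lambda$ is a complex number sufficiently close to $1$ (where “sufficiently close” depends on $f$, and more precisely depends on the sectoral aperture parameter $\epsilon$ associated to $f$); again, one can verify that $\delta_\lambda f$ lies in $\mathcal{SE}(\mathbf{C} \to \mathbf{C})$ for $\lambda$ sufficiently close to $1$. These invariances (which relocalise the integration functional $I$ onto other contours than the real line $\mathbf{R}$) are very useful for computing integrals, and in particular for computing gaussian integrals. For instance, the complex translation invariance tells us (after shifting by $b/a$) that
$$ I( z \mapsto e^{-\pi (a z^2 + 2 b z + c)} ) = e^{-\pi (c - b^2/a)} I( z \mapsto e^{-\pi a z^2} ) $$
when $a, b, c \in \mathbf{C}$ with $\mathrm{Re}(a) > 0$, and then an application of the complex scaling law (and a continuity argument, observing that there is a compact path connecting $a$ to $1$ in the right half plane) gives
$$ I( z \mapsto e^{-\pi (a z^2 + 2 b z + c)} ) = a^{-1/2} e^{-\pi (c - b^2/a)} I( z \mapsto e^{-\pi z^2} ) $$
using the branch of $a^{-1/2}$ on the right half-plane for which $1^{-1/2} = 1$. Using the normalisation (4) we thus have
$$ I( z \mapsto e^{-\pi (a z^2 + 2 b z + c)} ) = a^{-1/2} e^{-\pi (c - b^2/a)}, $$
giving the usual gaussian integral formula
$$ \int_{\mathbf{R}} e^{-\pi (a x^2 + 2 b x + c)}\, dx = a^{-1/2} e^{-\pi (c - b^2/a)}. $$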
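The gaussian integral formula can be checked numerically for complex parameters. The following is a minimal sketch using only the Python standard library; it assumes the integrand has the form $e^{-\pi(ax^2 + 2bx + c)}$ with $\mathrm{Re}(a) > 0$, and compares a plain trapezoid rule against $a^{-1/2} e^{-\pi(c - b^2/a)}$. The function names and the sample values of $a, b, c$ are illustrative choices, not taken from the text.

```python
import cmath

def gaussian_integral_numeric(a, b, c, half_width=8.0, steps=4000):
    """Trapezoid rule for exp(-pi (a x^2 + 2 b x + c)) on [-half_width, half_width]."""
    h = 2 * half_width / steps
    total = 0j
    for k in range(steps + 1):
        x = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * cmath.exp(-cmath.pi * (a * x * x + 2 * b * x + c))
    return total * h

def gaussian_integral_formula(a, b, c):
    """Closed form a^{-1/2} exp(-pi (c - b^2 / a)).

    cmath.sqrt is the principal branch, which equals 1 at a = 1 and is
    continuous on the right half-plane Re(a) > 0.
    """
    return cmath.exp(-cmath.pi * (c - b * b / a)) / cmath.sqrt(a)

# Illustrative complex parameters with Re(a) > 0
a, b, c = 1 + 0.5j, 0.3 - 0.2j, 0.1j
numeric = gaussian_integral_numeric(a, b, c)
exact = gaussian_integral_formula(a, b, c)
error = abs(numeric - exact)
```

The trapezoid rule is a deliberately simple choice here: for rapidly decaying analytic integrands it converges very quickly, so no special quadrature is needed.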
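One can extend this sort of analysis to higher dimensions. For any natural number $n \ge 1$, let $\mathcal{SE}(\mathbf{C}^n \to \mathbf{C})$ denote the space of all functions $f: \mathbf{C}^n \to \mathbf{C}$ which are jointly entire in the sense that $f(z_1,\dots,z_n)$ can be expressed as a Taylor series in $z_1,\dots,z_n$ which is absolutely convergent for all choices of $z_1,\dots,z_n \in \mathbf{C}$, and such that there exists an $\epsilon > 0$ such that for any $A > 0$ there is $C_A > 0$ for which one has the bound
$$ |f(z)| \le C_A \langle z \rangle^{-A} $$
whenever $|\mathrm{Im}(z_j)| \le A + \epsilon |\mathrm{Re}(z_j)|$ for all $1 \le j \le n$, where $z = (z_1,\dots,z_n)$ and $\langle z \rangle := (1+|z|^2)^{1/2}$. Again, we call such functions Schwartz entire functions; a typical example is the function
$$ z \mapsto e^{-\pi (z^T A z + 2 b^T z + c)} $$
where $A$ is an $n \times n$ complex symmetric matrix with positive definite real part, $b$ is a vector in $\mathbf{C}^n$, and $c$ is a complex number. We can then define an abstract integration functional $I: \mathcal{SE}(\mathbf{C}^n \to \mathbf{C}) \to \mathbf{C}$ by integration on the real slice $\mathbf{R}^n$:
$$ I(f) := \int_{\mathbf{R}^n} f(x)\, dx $$
where $dx$ is the usual Lebesgue measure on $\mathbf{R}^n$. By contour shifting in each of the $n$ variables $z_1,\dots,z_n$ separately, we see that $I$ is invariant with respect to complex translations of each of the $z_j$ variables, and is thus invariant under translating the joint variable $z$ by $\mathbf{C}^n$. One can also verify the scaling law
$$ I(\delta_A f) = \det(A)^{-1} I(f) $$
for $n \times n$ complex matrices $A$ sufficiently close to the identity, where $\delta_A f(z) := f(Az)$. This can be seen for shear transformations $A$ by Fubini’s theorem and the aforementioned translation invariance, while for diagonal transformations near the identity this can be seen from $n$ applications of the one-dimensional scaling law, and the general case then follows by composition. Among other things, these laws then easily lead to the higher-dimensional generalisation
$$ \int_{\mathbf{R}^n} e^{-\pi (x^T A x + 2 b^T x + c)}\, dx = \det(A)^{-1/2} e^{-\pi (c - b^T A^{-1} b)} $$
whenever $A$ is a complex symmetric matrix with positive definite real part, $b$ is a vector in $\mathbf{C}^n$, and $c$ is a complex number, basically by repeating the one-dimensional argument sketched earlier. Here, we choose the branch of $\det(A)^{-1/2}$ for all matrices $A$ in the indicated class for which $\det(1)^{-1/2} = 1$.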
Now we turn to an integration functional suitable for computing complex gaussian integrals such as
$$ \int_{\mathbf{C}^n} e^{-2\pi (z^\dagger A z + b^\dagger z + z^\dagger \tilde b + c)}\, dz\, d\overline{z}, \qquad (11) $$
where $z$ is now a complex variable in $\mathbf{C}^n$, $z^\dagger$ is the adjoint of $z$, $A$ is a complex $n \times n$ matrix with positive definite Hermitian part, $b, \tilde b$ are column vectors in $\mathbf{C}^n$, $c$ is a complex number, and $dz\, d\overline{z}$ is $2^n$ times Lebesgue measure on $\mathbf{C}^n$. (The factors of two here turn out to be a natural normalisation, but they can be ignored on a first reading.) As we shall see later, such integrals are relevant when performing computations on the Gaussian Unitary Ensemble (GUE) in random matrix theory. Note that the integrand here is not complex analytic due to the presence of the complex conjugates. However, this can be dealt with by the trick of replacing the complex conjugate $\overline{z}$
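by a variable $z^*$ which is formally conjugate to $z$, but which is allowed to vary independently of $z$. More precisely, let $\mathcal{SA}(\mathbf{C}^n \to \mathbf{C})$ be the space of all functions $f: (z, z^*) \mapsto f(z, z^*)$ of two independent $n$-tuples $z = (z_1,\dots,z_n)$, $z^* = (z_1^*,\dots,z_n^*)$ of complex variables, which are jointly entire in all $2n$ variables (in the sense defined previously, i.e. there is a joint Taylor series that is absolutely convergent for all independent choices of $z, z^* \in \mathbf{C}^n$), and such that there is an $\epsilon > 0$ such that for every $A > 0$ there is $C_A > 0$ such that one has the bound
$$ |f(z, z^*)| \le C_A \langle z \rangle^{-A}, \qquad \langle z \rangle := (1+|z|^2)^{1/2}, $$
whenever $|z^* - \overline{z}| \le A + \epsilon |z|$. We will call such functions Schwartz analytic. Note that the integrand in (11) is Schwartz analytic when $A$ has positive definite Hermitian part, if we reinterpret $z^\dagger$ as the transpose of $z^*$ rather than as the adjoint of $z$ in order to make the integrand entire in $z$ and $z^*$. We can then define an abstract integration functional $I: \mathcal{SA}(\mathbf{C}^n \to \mathbf{C}) \to \mathbf{C}$ by the formula
$$ I(f) := \int_{\mathbf{C}^n} f(z, \overline{z})\, dz\, d\overline{z}, $$
thus $I$ can be localised to the slice $\{ (z, \overline{z}): z \in \mathbf{C}^n \}$ of $\mathbf{C}^n \times \mathbf{C}^n$ (though, as with previous functionals, one can use contour shifting to relocalise $I$ to other slices also). One can also write this integral as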
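$$ I(f) = 2^n \int_{\mathbf{R}^n \times \mathbf{R}^n} f(x+iy, x-iy)\, dx\, dy $$
and note that the integrand here is a Schwartz entire function on $\mathbf{C}^{2n}$, thus linking the Schwartz analytic integral with the Schwartz entire integral. Using this connection, one can verify that this functional $I$ is invariant with respect to translating $z$ and $z^*$ by independent shifts in $\mathbf{C}^n$ (thus giving a $\mathbf{C}^n \times \mathbf{C}^n$ translation symmetry), and one also has the independent dilation symmetry
$$ I(\delta_{A,B} f) = \det(A)^{-1} \det(B)^{-1} I(f) $$
for $n \times n$ complex matrices $A, B$ that are sufficiently close to the identity, where $\delta_{A,B} f(z, z^*) := f(Az, Bz^*)$. Arguing as before, we can then compute (11) as
$$ \int_{\mathbf{C}^n} e^{-2\pi (z^\dagger A z + b^\dagger z + z^\dagger \tilde b + c)}\, dz\, d\overline{z} = \det(A)^{-1} e^{-2\pi (c - b^\dagger A^{-1} \tilde b)}. $$
In particular, this gives an integral representation for the determinant-reciprocal $\det(A)^{-1}$ of a complex $n \times n$ matrix with positive definite Hermitian part, in terms of gaussian expressions in which $A$ only appears linearly in the exponential:
$$ \det(A)^{-1} = \int_{\mathbf{C}^n} e^{-2\pi z^\dagger A z}\, dz\, d\overline{z}. $$
This formula is then convenient for computing statistics such as
$$ \mathbf{E} \det( M - E - i\eta )^{-1} $$
for random matrices $M$ drawn from the Gaussian Unitary Ensemble (GUE), and some choice of spectral parameter $E + i\eta$ with $\eta > 0$; we review this computation later in this post. By the trick of matrix differentiation of the determinant (as reviewed in this recent blog post), one can also use this method to compute matrix-valued statistics such as
$$ \mathbf{E} \det( M - E - i\eta )^{-1} ( M - E - i\eta )^{-1}. $$
However, if one restricts attention to classical integrals over real or complex (and in particular, commuting or bosonic) variables, it does not seem possible to easily eradicate the negative determinant factors in such calculations, which is unfortunate because such factors are absent from many statistics of interest in random matrix theory, such as the expected Stieltjes transform
$$ \mathbf{E} \frac{1}{n} \mathrm{tr} ( M - E - i\eta )^{-1}, $$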
which is the Stieltjes transform of the density of states. However, it turns out (as I learned recently from Peter Sarnak and Tom Spencer) that it is possible to cancel out these negative determinant factors by balancing the bosonic gaussian integrals with an equal number of fermionic gaussian integrals, in which one integrates over a family of anticommuting variables. These fermionic integrals are closer in spirit to the polynomial integral (6) than to Lebesgue type integrals, and in particular obey a scaling law which is inverse to the Lebesgue scaling (in particular, a linear change of fermionic variables ends up transforming a fermionic integral by
the determinant $\det(A)$ of the change of variables matrix $A$ rather than by its reciprocal $\det(A)^{-1}$
), which conveniently cancels out the reciprocal determinants in the previous calculations. Furthermore, one can combine the bosonic and fermionic integrals into a unified integration concept, known as the Berezin integral (or Grassmann integral), in which one integrates functions of supervectors (vectors with both bosonic and fermionic components), and is of particular importance in the theory of supersymmetry in physics. (The prefix “super” in physics means, roughly speaking, that the object or concept that the prefix is attached to contains both bosonic and fermionic aspects.) When one applies this unified integration concept to gaussians, this can lead to quite compact and efficient calculations (provided that one is willing to work with “super”-analogues of various concepts in classical linear algebra, such as the supertrace or superdeterminant).
Abstract integrals of the flavour of (6) arose in quantum field theory, when physicists sought to formally compute integrals of the form
$$ \int F(x_1,\dots,x_n,\xi_1,\dots,\xi_m)\, dx_1 \dots dx_n\, d\xi_1 \dots d\xi_m \qquad (14) $$
where $x_1,\dots,x_n$ are familiar commuting (or bosonic) variables (which, in particular, can often be localised to be scalar variables taking values in $\mathbf{R}$ or $\mathbf{C}$), while $\xi_1,\dots,\xi_m$ were more exotic anticommuting (or fermionic) variables, taking values in some vector space of fermions. (As we shall see shortly, one can formalise these concepts by working in a supercommutative algebra.) The integrand $F(x_1,\dots,x_n,\xi_1,\dots,\xi_m)$ was a formally analytic function of $x_1,\dots,x_n,\xi_1,\dots,\xi_m$, in that it could be expanded as a (formal, noncommutative) power series in the variables $x_1,\dots,x_n,\xi_1,\dots,\xi_m$. For functions $F(x_1,\dots,x_n)$ that depend only on bosonic variables, it is certainly possible for such analytic functions to be in the Schwartz class and thus fall under the scope of the classical integral, as discussed previously. However, functions $F(\xi_1,\dots,\xi_m)$ that depend on fermionic variables $\xi_1,\dots,\xi_m$ behave rather differently. Indeed, a fermionic variable $\xi$ must anticommute with itself, so that $\xi^2 = 0$. In particular, any power series in $\xi$ terminates after the linear term in $\xi$, so that a function $F(\xi)$ can only be analytic in $\xi$ if it is a polynomial of degree at most $1$ in $\xi$; more generally, an analytic function $F(\xi_1,\dots,\xi_m)$ of $m$ fermionic variables $\xi_1,\dots,\xi_m$ must be a polynomial of degree at most $m$, and an analytic function $F(x_1,\dots,x_n,\xi_1,\dots,\xi_m)$ of $n$ bosonic and $m$ fermionic variables can be Schwartz in the bosonic variables but will be polynomial in the fermionic variables. As such, to interpret the integral (14), one can use classical (Lebesgue) integration (or the variants discussed above for integrating Schwartz entire or Schwartz analytic functions) for the bosonic variables, but must use abstract integrals such as (6) for the fermionic variables, leading to the concept of Berezin integration mentioned earlier.
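To make the anticommutation rules concrete, here is a small Python sketch (an illustration, not the formalism of the post itself) of a Grassmann algebra in which a polynomial in $\xi_0, \dots, \xi_{m-1}$ is stored as a dictionary from sorted index tuples to coefficients. It assumes, by analogy with (6), that the fermionic integral is normalised to extract the coefficient of the top monomial $\xi_0 \xi_1 \cdots \xi_{m-1}$; under that assumption the linear substitution $\xi_i \mapsto \sum_j A_{ij} \xi_j$ multiplies the integral of the top monomial by $\det(A)$ rather than $\det(A)^{-1}$. All function names are hypothetical.

```python
from itertools import permutations

def gmul(p, q):
    """Product of two Grassmann polynomials (dict: sorted index tuple -> coefficient)."""
    out = {}
    for s, a in p.items():
        for t, b in q.items():
            if set(s) & set(t):
                continue  # a repeated generator gives xi_i * xi_i = 0
            merged = list(s + t)
            # the sign of the sorting permutation is (-1)^(number of inversions)
            sign = 1
            for i in range(len(merged)):
                for j in range(i + 1, len(merged)):
                    if merged[i] > merged[j]:
                        sign = -sign
            key = tuple(sorted(merged))
            out[key] = out.get(key, 0) + sign * a * b
    return {k: v for k, v in out.items() if v != 0}

def linear_form(row):
    """The element sum_j row[j] * xi_j."""
    return {(j,): a for j, a in enumerate(row) if a != 0}

def top_coefficient(p, m):
    """Assumed normalisation of the fermionic integral: coefficient of xi_0 ... xi_{m-1}."""
    return p.get(tuple(range(m)), 0)

def det(A):
    """Leibniz formula for the determinant (fine for small matrices)."""
    n, total = len(A), 0
    for perm in permutations(range(n)):
        term = 1
        for i in range(n):
            for j in range(i + 1, n):
                if perm[i] > perm[j]:
                    term = -term
        for i in range(n):
            term *= A[i][perm[i]]
        total += term
    return total

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
xi0, xi1 = {(0,): 1}, {(1,): 1}

# Substitute xi_i -> (A xi)_i into the top monomial xi_0 xi_1 xi_2
substituted = {(): 1}
for row in A:
    substituted = gmul(substituted, linear_form(row))
integral_after_substitution = top_coefficient(substituted, 3)
```

Since the top monomial itself has integral $1$ under the assumed normalisation, `integral_after_substitution` can be compared directly with `det(A)`.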
In this post I would like to set out some of the basic algebraic formalism of Berezin integration, particularly with regards to integration of gaussian-type expressions, and then show how this formalism can be used to perform computations involving GUE (for instance, one can compute the density of states of GUE by this machinery without recourse to the theory of orthogonal polynomials). The use of supersymmetric gaussian integrals to analyse ensembles such as GUE appears in the work of Efetov (and was also proposed in the slightly earlier works of Parisi-Sourlas and McKane, with a related approach also appearing in the work of Wegner); the material here is adapted from this survey of Mirlin, as well as the later papers of Disertori-Pinson-Spencer and of Disertori.
[These are notes intended mostly for myself, as these topics are useful in random matrix theory, but may be of interest to some readers also. -T.]
One of the most fundamental partial differential equations in mathematics is the heat equation
$$ \partial_t u = \frac{1}{2} \Delta u \qquad (1) $$
where $u: [0,+\infty) \times \mathbf{R}^n \to \mathbf{R}$ is a scalar function $(t,x) \mapsto u(t,x)$ of both time and space, and $\Delta$ is the Laplacian $\Delta := \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2}$. For the purposes of this post, we will ignore all technical issues of regularity and decay, and always assume that the solutions to equations such as (1) have all the regularity and decay needed to justify all formal operations such as the chain rule, integration by parts, or differentiation under the integral sign. The factor of $\frac{1}{2}$ in the definition of the heat propagator $e^{t\Delta/2}$ is of course an arbitrary normalisation, chosen for some minor technical reasons; one can certainly continue the discussion below with other choices of normalisations if desired.
In probability theory, this equation takes on particular significance when $u$ is restricted to be non-negative, and furthermore to be a probability measure at each time, in the sense that
$$ \int_{\mathbf{R}^n} u(t,x)\, dx = 1 $$
for all $t$. (Actually, it suffices to verify this constraint at time $t = 0$, as the heat equation (1) will then preserve this constraint.) Indeed, in this case, one can interpret $u(t,x)\, dx$ as the probability distribution of a Brownian motion
$$ dx = dB(t) $$
where $x = x(t) \in \mathbf{R}^n$ is a stochastic process with initial probability distribution $u(0,x)\, dx$; see for instance this previous blog post for more discussion.
A model example of a solution to the heat equation to keep in mind is that of the fundamental solution
$$ G(t,x) := \frac{1}{(2\pi t)^{n/2}} e^{-|x|^2/2t}, \qquad (3) $$
which represents the distribution of Brownian motion of a particle starting at the origin $x = 0$ at time $t = 0$. At time $t$, $G(t,x)\, dx$ represents an $\mathbf{R}^n$-valued random variable, each coefficient of which is an independent random variable of mean zero and variance $t$. (As $t \to 0^+$, $G(t)$ converges in the sense of distributions to a Dirac mass at the origin.)
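The heat equation can also be viewed as the gradient flow for the Dirichlet form
$$ D(u,v) := \frac{1}{2} \int_{\mathbf{R}^n} \nabla u \cdot \nabla v\, dx, \qquad (4) $$
since one has the integration by parts identity
$$ \int_{\mathbf{R}^n} u\, \Delta v\, dx = -2 D(u,v), \qquad (5) $$
which formally implies that $\partial_t u$ is (half of) the negative gradient of the Dirichlet energy $E(u) := D(u,u)$ with respect to the $L^2(\mathbf{R}^n)$ inner product. Among other things, this implies that the Dirichlet energy decreases in time:
$$ \partial_t E(u) = -\frac{1}{2} \int_{\mathbf{R}^n} |\Delta u|^2\, dx. \qquad (6) $$
For instance, for the fundamental solution (3), one can verify for any time $t > 0$ that
$$ E(u) = \frac{n}{2^{n+2} \pi^{n/2}} \frac{1}{t^{(n+2)/2}}. \qquad (7) $$
In a similar spirit, one has
$$ \partial_t \int_{\mathbf{R}^n} |u|^2\, dx = -2 E(u). \qquad (8) $$
Since $E(u)$ is non-negative, the formula (6) implies that $\int_{\mathbf{R}^n} |\Delta u|^2\, dx$ is integrable in time, and in particular we see that $\Delta u$ converges to zero as $t \to \infty$, in some averaged $L^2$ sense at least; similarly, (8) suggests that $\nabla u$ also converges to zero. This suggests that $u$ converges to a constant function; but as $u$ is also supposed to decay to zero at spatial infinity, we thus expect solutions to the heat equation in $\mathbf{R}^n$ to decay to zero in some sense as $t \to \infty$. However, the decay is only expected to be polynomial in nature rather than exponential; for instance, the solution (3) decays in the $L^\infty$ norm like $O(t^{-n/2})$.
Since $D(u,1) = 0$, we also observe the basic cancellation property
$$ \int_{\mathbf{R}^n} \Delta u\, dx = 0. \qquad (9) $$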
There are other quantities relating to $u$ that also decrease in time under heat flow, particularly in the important case when $u$ is a probability measure. In this case, it is natural to introduce the entropy
$$ S(u) := \int_{\mathbf{R}^n} u \log u\, dx. $$
Thus, for instance, if $u$ is the uniform distribution on some measurable subset $A$ of $\mathbf{R}^n$ of finite measure $|A|$, the entropy would be $-\log |A|$. Intuitively, as the entropy decreases, the probability distribution gets wider and flatter. For instance, in the case of the fundamental solution (3), one has $S(u) = -\frac{n}{2} \log( 2\pi e t )$ for any $t > 0$, reflecting the fact that $u(t)$ is approximately uniformly distributed on a ball of radius $O(\sqrt{t})$ (and thus of measure $O(t^{n/2})$).
A short formal computation shows (if one assumes for simplicity that $u$ is strictly positive, which is not an unreasonable hypothesis, particularly in view of the strong maximum principle) using (9), (5) that
$$ \partial_t S(u) = -\frac{1}{2} \int_{\mathbf{R}^n} \frac{|\nabla u|^2}{u}\, dx = -4 E(v) $$
where $v := u^{1/2}$ is the square root of $u$ and $E(v) = \frac{1}{2} \int_{\mathbf{R}^n} |\nabla v|^2\, dx$ is its Dirichlet energy. For instance, if $u$ is the fundamental solution (3), one can check that $E(v) = \frac{n}{8t}$ (note that this is a significantly cleaner formula than (7)!).
In particular, the entropy is decreasing, which corresponds well to one’s intuition that the heat equation (or Brownian motion) should serve to spread out a probability distribution over time.
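These formulas are easy to test numerically in one dimension. The sketch below is a plain standard-library computation, under the stated assumptions that the entropy is $S(u) = \int u \log u\, dx$, that the Dirichlet energy is $E(v) = \frac{1}{2} \int |v'|^2\, dx$, and that $u$ is the one-dimensional heat kernel of variance $t$. It compares the numerical entropy with $-\frac{1}{2} \log(2\pi e t)$ and the numerical energy of $v = u^{1/2}$ with $\frac{1}{8t}$. The function names and quadrature parameters are illustrative choices.

```python
import math

def heat_kernel(t, x):
    """One-dimensional fundamental solution of u_t = (1/2) u_xx."""
    return math.exp(-x * x / (2 * t)) / math.sqrt(2 * math.pi * t)

def trapezoid(f, lo, hi, steps):
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

def entropy(t, half_width=30.0, steps=6000):
    """S(u) = integral of u log u (sign convention assumed as in the lead-in)."""
    def integrand(x):
        g = heat_kernel(t, x)
        return g * math.log(g) if g > 0 else 0.0
    return trapezoid(integrand, -half_width, half_width, steps)

def dirichlet_energy_of_sqrt(t, half_width=30.0, steps=6000, delta=1e-5):
    """E(v) = (1/2) integral of |v'|^2 for v = sqrt(u), via central differences."""
    def v(x):
        return math.sqrt(heat_kernel(t, x))
    def integrand(x):
        dv = (v(x + delta) - v(x - delta)) / (2 * delta)
        return 0.5 * dv * dv
    return trapezoid(integrand, -half_width, half_width, steps)

times = [0.5, 1.0, 4.0]
entropy_errors = [abs(entropy(t) + 0.5 * math.log(2 * math.pi * math.e * t)) for t in times]
energy_errors = [abs(dirichlet_energy_of_sqrt(t) - 1 / (8 * t)) for t in times]
```

Both quantities are consistent with the relation $\partial_t S(u) = -4 E(v)$, since differentiating $-\frac{1}{2} \log(2\pi e t)$ in $t$ gives $-\frac{1}{2t}$.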
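Actually, one can say more: the rate of decrease of the entropy is itself decreasing, or in other words the entropy is convex. I do not have a satisfactorily intuitive reason for this phenomenon, but it can be proved by straightforward application of basic several variable calculus tools (such as the chain rule, product rule, quotient rule, and integration by parts), and completing the square. Namely, by using the chain rule we have
$$ \Delta (v^2) = 2 v \Delta v + 2 |\nabla v|^2. $$
Since $u = v^2$, we see from (1) that
$$ 2 v\, \partial_t v = v \Delta v + |\nabla v|^2 $$
and thus (again assuming that $v$, and hence $u$, is strictly positive to avoid technicalities)
$$ \partial_t v = \frac{1}{2} \Delta v + \frac{|\nabla v|^2}{2v}. $$
We thus have
$$ \partial_t E(v) = -\int_{\mathbf{R}^n} \Delta v\, \partial_t v\, dx = -\frac{1}{2} \int_{\mathbf{R}^n} |\Delta v|^2\, dx - \frac{1}{2} \int_{\mathbf{R}^n} \frac{\Delta v\, |\nabla v|^2}{v}\, dx. $$
It is now convenient to compute using the Einstein summation convention to hide the summation over indices $i, j = 1, \dots, n$. We have
$$ \int_{\mathbf{R}^n} |\Delta v|^2\, dx = \int_{\mathbf{R}^n} (\partial_{ii} v)(\partial_{jj} v)\, dx $$
and
$$ \int_{\mathbf{R}^n} \frac{\Delta v\, |\nabla v|^2}{v}\, dx = \int_{\mathbf{R}^n} \frac{(\partial_{ii} v)(\partial_j v)(\partial_j v)}{v}\, dx. $$
By integration by parts and interchanging partial derivatives, we may write the first integral as
$$ \int_{\mathbf{R}^n} (\partial_{ij} v)(\partial_{ij} v)\, dx, $$
and from the quotient and product rules (after a further integration by parts), we may write the second integral as
$$ -2 \int_{\mathbf{R}^n} \frac{(\partial_{ij} v)(\partial_i v)(\partial_j v)}{v}\, dx + \int_{\mathbf{R}^n} \frac{(\partial_i v)(\partial_i v)(\partial_j v)(\partial_j v)}{v^2}\, dx. $$
Gathering terms, completing the square, and making the summations explicit again, we see that
$$ \partial_t E(v) = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \int_{\mathbf{R}^n} \left( \partial_{ij} v - \frac{(\partial_i v)(\partial_j v)}{v} \right)^2 dx $$
and so in particular $E(v)$ is always decreasing.
The above identity can also be written as
$$ \partial_t E(v) = -\frac{1}{2} \int_{\mathbf{R}^n} v^2\, |\nabla^2 \log v|^2\, dx, \qquad |\nabla^2 \log v|^2 := \sum_{i=1}^n \sum_{j=1}^n (\partial_{ij} \log v)^2. $$
Exercise 1 Give an alternate proof of the above identity by writing $u = e^{f}$, $v = e^{f/2}$ and deriving the equation $\partial_t f = \frac{1}{2} \Delta f + \frac{1}{2} |\nabla f|^2$ for $f$.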
It was observed in a well known paper of Bakry and Emery that the above monotonicity properties hold for a much larger class of heat flow-type equations, and lead to a number of important relations between energy and entropy, such as the log-Sobolev inequality of Gross and of Federbush, and the hypercontractivity inequality of Nelson; we will discuss one such family of generalisations (or more precisely, variants) below the fold.
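Let $n$ be a large natural number, and let $M_n$ be a matrix drawn from the Gaussian Unitary Ensemble (GUE), by which we mean that $M_n$ is a Hermitian matrix whose upper triangular entries are iid complex gaussians with mean zero and variance one, and whose diagonal entries are iid real gaussians with mean zero and variance one (and independent of the upper triangular entries). The eigenvalues $\lambda_1(M_n) \le \dots \le \lambda_n(M_n)$ are then real and almost surely distinct, and can be viewed as a random point process $\Sigma^{(n)} := \{ \lambda_1(M_n), \dots, \lambda_n(M_n) \}$ on the real line. One can then form the $k$-point correlation functions $\rho^{(n)}_k: \mathbf{R}^k \to \mathbf{R}^+$ for every $k \ge 0$, which can be defined by duality by requiring
$$ \mathbf{E} \sum_{i_1,\dots,i_k \text{ distinct}} F( \lambda_{i_1}(M_n), \dots, \lambda_{i_k}(M_n) ) = \int_{\mathbf{R}^k} \rho^{(n)}_k(x_1,\dots,x_k) F(x_1,\dots,x_k)\, dx_1 \dots dx_k $$
for any test function $F: \mathbf{R}^k \to \mathbf{R}^+$. For GUE, which is a continuous matrix ensemble, one can also define $\rho^{(n)}_k(x_1,\dots,x_k)$ for distinct $x_1,\dots,x_k$ as the unique quantity such that the probability that there is an eigenvalue in each of the intervals $[x_1, x_1+\epsilon], \dots, [x_k, x_k+\epsilon]$ is $(\rho^{(n)}_k(x_1,\dots,x_k) + o(1)) \epsilon^k$ in the limit $\epsilon \to 0$.
As is well known, the GUE process is a determinantal point process, which means that $k$-point correlation functions can be explicitly computed as
$$ \rho^{(n)}_k(x_1,\dots,x_k) = \det( K^{(n)}(x_i,x_j) )_{1 \le i,j \le k} $$
for some kernel $K^{(n)}: \mathbf{R} \times \mathbf{R} \to \mathbf{C}$; explicitly, one has
$$ K^{(n)}(x,y) := \sum_{k=0}^{n-1} P_k(x) P_k(y) e^{-(x^2+y^2)/4} $$
where $P_0, P_1, \dots$ are the (normalised) Hermite polynomials; see this previous blog post for details.
Using the asymptotics of Hermite polynomials (which then give asymptotics for the kernel $K^{(n)}$), one can take a limit of a (suitably rescaled) sequence of GUE processes to obtain the Dyson sine process, which is a determinantal point process $\Sigma$ on the real line with correlation functions
$$ \rho_k(x_1,\dots,x_k) = \det( K(x_i,x_j) )_{1 \le i,j \le k} \qquad (1) $$
where $K$ is the Dyson sine kernel
$$ K(x,y) := \frac{\sin(\pi(x-y))}{\pi(x-y)}. \qquad (2) $$
More precisely, for any fixed bulk energy $-2 < u < 2$, the renormalised point processes $\rho_{sc}(u) \sqrt{n} ( \Sigma^{(n)} - \sqrt{n} u )$ converge in distribution in the vague topology to $\Sigma$ as $n \to \infty$, where $\rho_{sc}(u) := \frac{1}{2\pi} (4-u^2)_+^{1/2}$ is the semi-circular law density.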
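On the other hand, an important feature of the GUE process $\Sigma^{(n)}$ is its stationarity (modulo rescaling) under Dyson Brownian motion
$$ d\lambda_i = dB_i + \sum_{j \ne i} \frac{dt}{\lambda_i - \lambda_j} $$
which describes the stochastic evolution of eigenvalues of a Hermitian matrix under independent Brownian motion of its entries, and is discussed in this previous blog post. To cut a long story short, this stationarity tells us that the self-similar $n$-point correlation function
$$ \rho^{(n)}_n(t,x) := t^{-n/2} \rho^{(n)}_n( x / \sqrt{t} ) $$
obeys the Dyson heat equation
$$ \partial_t \rho^{(n)}_n = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \le i,j \le n; i \ne j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i - x_j} $$
(see Exercise 11 of the previously mentioned blog post). Note that $\rho^{(n)}_n$ vanishes to second order whenever two of the $x_i$ coincide, so there is no singularity on the right-hand side. Setting $t = 1$ and using self-similarity, we can rewrite this equation in time-independent form as
$$ -\frac{1}{2} \sum_{i=1}^n \partial_{x_i} ( x_i \rho^{(n)}_n ) = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \le i,j \le n; i \ne j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i - x_j}. $$
One can then integrate out all but $k$ of these variables (after carefully justifying convergence) to obtain a system of equations for the $k$-point correlation functions $\rho^{(n)}_k$:
$$ -\frac{1}{2} \sum_{i=1}^k \partial_{x_i} ( x_i \rho^{(n)}_k ) = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho^{(n)}_k - \sum_{1 \le i,j \le k; i \ne j} \partial_{x_i} \frac{\rho^{(n)}_k}{x_i - x_j} - \sum_{i=1}^k \partial_{x_i} \int_{\mathbf{R}} \frac{\rho^{(n)}_{k+1}(x_1,\dots,x_{k+1})}{x_i - x_{k+1}}\, dx_{k+1}, $$
where the integral is interpreted in the principal value sense. This system is an example of a BBGKY hierarchy.
If one carefully rescales and takes limits (say at the energy level $u = 0$, for simplicity), the left-hand side turns out to rescale to be a lower order term, and one ends up with a hierarchy for the Dyson sine process:
$$ 0 = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho_k - \sum_{1 \le i,j \le k; i \ne j} \partial_{x_i} \frac{\rho_k}{x_i - x_j} - \sum_{i=1}^k \partial_{x_i} \int_{\mathbf{R}} \frac{\rho_{k+1}(x_1,\dots,x_{k+1})}{x_i - x_{k+1}}\, dx_{k+1}. \qquad (3) $$
Informally, these equations show that the Dyson sine process $\Sigma = \{ x_i: i \in \mathbf{Z} \}$ is stationary with respect to the infinite Dyson Brownian motion
$$ dx_i = dB_i + \sum_{j \ne i} \frac{dt}{x_i - x_j} $$
where $dB_i$ are independent Brownian increments, and the sum is interpreted in a suitable principal value sense.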
I recently set myself the exercise of deriving the identity (3) directly from the definition (1) of the Dyson sine process, without reference to GUE. This turns out to not be too difficult when done the right way (namely, by modifying the proof of Gaudin’s lemma), although it did take me an entire day of work before I realised this, and I could not find it in the literature (though I suspect that many people in the field have privately performed this exercise in the past). In any case, I am recording the computation here, largely because I really don’t want to have to do it again, but perhaps it will also be of interest to some readers.
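One of the basic general problems in analytic number theory is to understand as much as possible the fluctuations of the Möbius function $\mu(n)$, defined as $(-1)^k$ when $n$ is the product of $k$ distinct primes, and zero otherwise. For instance, as $\mu$ takes values in $\{-1,0,1\}$, we have the trivial bound
$$ \left| \sum_{n \le x} \mu(n) \right| \le x $$
and the seemingly slight improvement
$$ \sum_{n \le x} \mu(n) = o(x) \qquad (1) $$
is already equivalent to the prime number theorem, while the much stronger (and still open) improvement
$$ \sum_{n \le x} \mu(n) = O( x^{1/2 + o(1)} ) $$
is equivalent to the notorious Riemann hypothesis.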
There is a general Möbius pseudorandomness heuristic that suggests that the sign pattern of $\mu$ behaves so randomly (or pseudorandomly) that one should expect a substantial amount of cancellation in sums that involve the sign fluctuation of the Möbius function in a nontrivial fashion, with the amount of cancellation present comparable to the amount that an analogous random sum would provide; cf. the probabilistic heuristic discussed in this recent blog post. There are a number of ways to make this heuristic precise. As already mentioned, the Riemann hypothesis can be considered one such manifestation of the heuristic. Another manifestation is the following old conjecture of Chowla:
Conjecture 1 (Chowla’s conjecture) For any fixed integer $m$ and exponents $a_1, a_2, \dots, a_m \ge 0$, with at least one of the $a_i$ odd (so as not to completely destroy the sign cancellation), we have
$$ \sum_{n \le x} \mu(n+1)^{a_1} \dots \mu(n+m)^{a_m} = o_{x \to \infty; m}(x). $$
Note that as $\mu^{a+2} = \mu^a$ for any $a \ge 1$, we can reduce to the case when the $a_i$ take values in $0, 1, 2$ here. When only one of the $a_i$ is odd, this is essentially the prime number theorem in arithmetic progressions (after some elementary sieving), but when two or more of the $a_i$ are odd, the problem becomes completely open. For instance, the estimate
$$ \sum_{n \le x} \mu(n) \mu(n+2) = o(x) $$
is morally very close to the conjectured asymptotic
$$ \sum_{n \le x} \Lambda(n) \Lambda(n+2) = 2 \Pi_2 x + o(x) $$
for the von Mangoldt function $\Lambda$, where $\Pi_2 = 0.66016\dots$ is the twin prime constant; this asymptotic in turn implies the twin prime conjecture. (To formally deduce estimates for von Mangoldt from estimates for Möbius, though, typically requires some better control on the error terms than $o(x)$, in particular gains of some power of $\log x$ are usually needed. See this previous blog post for more discussion.)
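For readers who want to experiment, the following Python sketch computes the Möbius function with a simple sieve and evaluates the normalised sums $\frac{1}{x} \sum_{n \le x} \mu(n)$ and $\frac{1}{x} \sum_{n \le x} \mu(n) \mu(n+2)$ discussed above. It is only a numerical illustration: small values of these averages at one cutoff prove nothing about the asymptotic statements, and the function names and the cutoff `x` are arbitrary choices.

```python
def mobius_sieve(limit):
    """Return a list mu with mu[n] equal to the Möbius function for 1 <= n <= limit."""
    mu = [1] * (limit + 1)
    mu[0] = 0
    is_prime = [True] * (limit + 1)
    for p in range(2, limit + 1):
        if not is_prime[p]:
            continue
        for m in range(2 * p, limit + 1, p):
            is_prime[m] = False
        for m in range(p, limit + 1, p):
            mu[m] = -mu[m]          # one more distinct prime factor
        for m in range(p * p, limit + 1, p * p):
            mu[m] = 0               # divisible by a square
    return mu

def mertens(mu, x):
    """Partial sum of mu(n) over 1 <= n <= x."""
    return sum(mu[1:x + 1])

def two_point_sum(mu, x, shift=2):
    """Partial sum of mu(n) * mu(n + shift) over 1 <= n <= x."""
    return sum(mu[n] * mu[n + shift] for n in range(1, x + 1))

x = 100_000
mu = mobius_sieve(x + 2)
print("mean of mu(n):", mertens(mu, x) / x)
print("mean of mu(n) mu(n+2):", two_point_sum(mu, x) / x)
```

The sieve flips the sign of `mu[m]` once per distinct prime factor and then zeroes out multiples of $p^2$, which matches the definition of $\mu$ given above.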
Remark 1 The Chowla conjecture resembles an assertion that, for $n$ chosen randomly and uniformly from $1$ to $x$, the random variables $\mu(n+1), \dots, \mu(n+m)$ become asymptotically independent of each other (in the probabilistic sense) as $x \to \infty$. However, this is not quite accurate, because some moments (namely those with all exponents $a_i$ even) have the “wrong” asymptotic value, leading to some unwanted correlation between the variables. For instance, the events $\mu(n) = 0$ and $\mu(n+4) = 0$ have a strong correlation with each other, basically because they are both strongly correlated with the event of $n$ being divisible by $4$. A more accurate interpretation of the Chowla conjecture is that the random variables $\mu(n+1), \dots, \mu(n+m)$ are asymptotically conditionally independent of each other, after conditioning on the zero pattern $\mu(n+1)^2, \dots, \mu(n+m)^2$; thus, it is the sign of the Möbius function that fluctuates like random noise, rather than the zero pattern. (The situation is a bit cleaner if one works with the Liouville function $\lambda$ instead of the Möbius function $\mu$, as this function never vanishes, but we will stick to the traditional Möbius function formalism here.)
A more recent formulation of the Möbius randomness heuristic is the following conjecture of Sarnak. Given a bounded sequence $f: \mathbf{N} \to \mathbf{C}$, define the topological entropy of the sequence to be the least exponent $\sigma$ with the property that for any fixed $\epsilon > 0$, and for $m$ going to infinity, the set $\{ (f(n+1), \dots, f(n+m)): n \in \mathbf{N} \} \subset \mathbf{C}^m$ of $m$-blocks of $f$ can be covered by $O( \exp( \sigma m + o(m) ) )$ balls of radius $\epsilon$. (If $f$ arises from a minimal topological dynamical system $(X,T)$ by $f(n) := F(T^n x)$ for some continuous $F: X \to \mathbf{C}$ and $x \in X$, the above notion is equivalent to the usual notion of the topological entropy of a dynamical system.) For instance, if the sequence is a bit sequence (i.e. it takes values in $\{0,1\}$), then there are only $\exp( \sigma m + o(m) )$ $m$-bit patterns that can appear as blocks of $m$ consecutive bits in this sequence. As a special case, a Turing machine with bounded memory that had access to a random number generator at the rate of one random bit produced every $r$ units of time, but otherwise evolved deterministically, would have an output sequence that had a topological entropy of at most $\frac{1}{r} \log 2$. A bounded sequence is said to be deterministic if its topological entropy is zero. A typical example is a polynomial sequence such as $f(n) := e^{2\pi i \alpha n^2}$ for some fixed $\alpha$; the $m$-blocks of such polynomial sequences have covering numbers that only grow polynomially in $m$, rather than exponentially, thus yielding the zero entropy. Unipotent flows, such as the horocycle flow on a compact hyperbolic surface, are another good source of deterministic sequences.
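For a bit sequence, the definition above amounts to counting how many distinct $m$-blocks occur. The Python sketch below does this for a finite prefix of the Thue-Morse sequence, which is one of the deterministic sequences that appears among the known cases of Sarnak's conjecture; the block counts grow only linearly in $m$, consistent with zero topological entropy. Counting blocks in a finite prefix only gives a lower bound on the true number of blocks, and the prefix length and function names are illustrative assumptions.

```python
import math

def thue_morse(length):
    """First `length` terms of the Thue-Morse sequence: parity of the binary digit sum of n."""
    return [bin(n).count("1") % 2 for n in range(length)]

def block_count(seq, m):
    """Number of distinct blocks of m consecutive terms occurring in seq."""
    return len({tuple(seq[i:i + m]) for i in range(len(seq) - m + 1)})

def entropy_estimate(seq, m):
    """log(number of m-blocks) / m, a finite-m proxy for the topological entropy."""
    return math.log(block_count(seq, m)) / m

tm = thue_morse(4096)
counts = {m: block_count(tm, m) for m in range(1, 17)}
estimates = {m: entropy_estimate(tm, m) for m in (4, 8, 16)}
```

For comparison, a sequence of positive entropy $\sigma$ would have roughly $\exp(\sigma m)$ distinct $m$-blocks, so that `entropy_estimate` would stay bounded away from zero as `m` grows.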
Conjecture 2 (Sarnak’s conjecture) Let $f: \mathbf{N} \to \mathbf{C}$ be a deterministic bounded sequence. Then
$$ \sum_{n \le x} \mu(n) f(n) = o_{x \to \infty; f}(x). $$
This conjecture in general is still quite far from being solved. However, special cases are known:
- For constant sequences, this is essentially the prime number theorem (1).
- For periodic sequences, this is essentially the prime number theorem in arithmetic progressions.
- For quasiperiodic sequences such as
for some continuous
, this follows from the work of Davenport.
- For nilsequences, this is a result of Ben Green and myself.
- For horocycle flows, this is a result of Bourgain, Sarnak, and Ziegler.
- For the Thue-Morse sequence, this is a result of Dartyge-Tenenbaum (with a stronger error term obtained by Mauduit-Rivat). A subsequent result of Bourgain handles all bounded rank one sequences (though the Thue-Morse sequence is actually of rank two), and a related result of Green establishes asymptotic orthogonality of the Möbius function to bounded depth circuits, although such functions are not necessarily deterministic in nature.
- For the Rudin-Shapiro sequence, I sketched out an argument at this MathOverflow post.
- The Möbius function is known to itself be non-deterministic, because its square
(i.e. the indicator of the square-free numbers) is known to be non-deterministic (indeed, its topological entropy is
). (The corresponding question for the Liouville function
, however, remains open, as the square
has zero entropy.)
- In the converse direction, it is easy to construct sequences of arbitrarily small positive entropy that correlate with the Möbius function (a rather silly example is
for some fixed large (squarefree)
, which has topological entropy at most
but clearly correlates with
).
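The special cases above can be probed numerically. The following sketch sieves the Möbius function and computes its averaged correlation against a few test sequences. The names are hypothetical, the cutoff is small, and the "silly" correlating sequence is implemented under the assumption that it is the Möbius function restricted to multiples of a fixed squarefree modulus (the formula itself is missing from this copy). Numerics at this scale are only suggestive.

```python
def mobius_sieve(n):
    """Return a list mu[0..n] with mu[k] the Möbius function of k (mu[0] = 0)."""
    mu = [1] * (n + 1)
    mu[0] = 0
    is_prime = [True] * (n + 1)
    for p in range(2, n + 1):
        if is_prime[p]:
            for k in range(p, n + 1, p):
                if k > p:
                    is_prime[k] = False
                mu[k] = -mu[k]          # one sign flip per prime factor
            for k in range(p * p, n + 1, p * p):
                mu[k] = 0               # not squarefree
    return mu

def correlation(mu, f):
    """Average of mu(n) * f(n) over 1 <= n <= N, where N = len(mu) - 1."""
    big_n = len(mu) - 1
    return sum(mu[n] * f(n) for n in range(1, big_n + 1)) / big_n

def thue_morse_sign(n):
    """+1 or -1 according to the parity of the binary digit sum of n."""
    return -1 if bin(n).count("1") % 2 else 1

if __name__ == "__main__":
    mu = mobius_sieve(10 ** 5)
    q = 6  # assumed modulus for the correlating example
    print("constant  :", correlation(mu, lambda n: 1))
    print("period 2  :", correlation(mu, lambda n: (-1) ** n))
    print("Thue-Morse:", correlation(mu, thue_morse_sign))
    print("silly     :", correlation(mu, lambda n: mu[n] if n % q == 0 else 0))
```

The first three averages should be small, while the last stays bounded away from zero, since it is just the density of squarefree multiples of the modulus.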
See this survey of Sarnak for further discussion of this and related topics.
In this post I wanted to give a very nice argument of Sarnak that links the above two conjectures:
Proposition 3 The Chowla conjecture implies the Sarnak conjecture.
The argument does not use any number-theoretic properties of the Möbius function; one could replace the Möbius function in both conjectures by any other function from the natural numbers to
and obtain the same implication. The argument consists of the following ingredients:
- To show that
, it suffices to show that the expectation of the random variable
, where
is drawn uniformly at random from
to
, can be made arbitrarily small by making
large (and
even larger).
- By the union bound and the zero topological entropy of
, it suffices to show that for any bounded deterministic coefficients
, the random variable
concentrates with exponentially high probability.
- Finally, this exponentially high concentration can be achieved by the moment method, using a slight variant of the moment method proof of the large deviation estimates such as the Chernoff inequality or Hoeffding inequality (as discussed in this blog post).
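As a sanity check on the concentration step, one can look at the model case in which the Möbius values are replaced by genuinely independent random signs. This is a simplification: the actual argument has to extract enough independence from the Chowla conjecture via moments. The sketch below (hypothetical names, arbitrary bounded coefficients) compares an empirical tail probability with the Hoeffding bound for a weighted sum of random signs.

```python
import math
import random

def tail_probability(coeffs, threshold, trials, rng):
    """Empirical P(|sum_i c_i * eps_i| >= threshold) for independent random signs."""
    hits = 0
    for _ in range(trials):
        s = sum(c if rng.random() < 0.5 else -c for c in coeffs)
        if abs(s) >= threshold:
            hits += 1
    return hits / trials

def hoeffding_bound(coeffs, threshold):
    """Hoeffding: P(|sum_i c_i eps_i| >= t) <= 2 exp(-t^2 / (2 sum_i c_i^2))."""
    return 2.0 * math.exp(-threshold ** 2 / (2.0 * sum(c * c for c in coeffs)))

if __name__ == "__main__":
    rng = random.Random(1)
    coeffs = [math.cos(i) for i in range(200)]  # arbitrary bounded coefficients
    for t in (10.0, 20.0, 30.0):
        print(t, tail_probability(coeffs, t, 2000, rng), hoeffding_bound(coeffs, t))
```

At thresholds proportional to the number of terms the bound decays exponentially in that number, which is what allows the union bound over the (subexponentially many) blocks of a zero entropy sequence to close.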
As is often the case, though, while the “top-down” order of steps presented above is perhaps the clearest way to think conceptually about the argument, in order to present the argument formally it is more convenient to present the arguments in the reverse (or “bottom-up”) order. This is the approach taken below the fold.
There has been a lot of recent interest in the abc conjecture, since the release a few weeks ago of the last of a series of papers by Shinichi Mochizuki which, as one of its major applications, claims to establish this conjecture. It’s still far too early to judge whether this proof is likely to be correct or not (the entire argument runs to several hundred pages, mostly in the area of anabelian geometry, which very few mathematicians are expert in, to the extent that we still do not even have a full outline of the proof strategy yet), and I don’t have anything substantial to add to the existing discussion around that conjecture. (But, for those that are interested, the Polymath wiki page on the ABC conjecture has collected most of the links to that discussion, and to various background materials.)
In the meantime, though, I thought I might give the standard probabilistic heuristic argument that explains why we expect the ABC conjecture to be true. The underlying heuristic is a common one, used throughout number theory, and it can be summarised as follows:
Heuristic 1 (Probabilistic heuristic) Even though number theory is a deterministic subject (one does not need to roll any dice to factorise a number, or figure out if a number is prime), one expects to get a good asymptotic prediction for the answers to many number-theoretic questions by pretending that various number-theoretic assertions
(e.g. that a given number
is prime) are probabilistic events (with a probability
that can vary between
and
) rather than deterministic events (that are either always true or always false). Furthermore:
- (Basic heuristic) If two or more of these heuristically probabilistic events have no obvious reason to be strongly correlated to each other, then we should expect them to behave as if they were (jointly) independent.
- (Advanced heuristic) If two or more of these heuristically probabilistic events have some obvious correlation between them, but no further correlations are suspected, then we should expect them to behave as if they were conditionally independent, relative to whatever data is causing the correlation.
This is, of course, an extremely vague and completely non-rigorous heuristic, requiring (among other things) a subjective and ad hoc determination of what an “obvious reason” is, but in practice it tends to give remarkably plausible predictions, some fraction of which can in fact be backed up by rigorous argument (although in many cases, the actual argument has almost nothing in common with the probabilistic heuristic). A famous special case of this heuristic is the Cramér random model for the primes, but this is not the only such instance for that heuristic.
To give the most precise predictions, one should use the advanced heuristic in Heuristic 1, but this can be somewhat complicated to execute, and so we shall focus instead on the predictions given by the basic heuristic (thus ignoring the presence of some number-theoretic correlations), which tends to give predictions that are quantitatively inaccurate but still reasonably good at the qualitative level.
Here is a basic “corollary” of Heuristic 1:
Heuristic 2 (Heuristic Borel-Cantelli) Suppose one has a sequence
of number-theoretic statements, which we heuristically interpret as probabilistic events with probabilities
. Suppose also that we know of no obvious reason for these events to have much of a correlation with each other. Then:
- If
, we expect only finitely many of the statements
to be true. (And if
is much smaller than
, we in fact expect none of the
to be true.)
- If
, we expect infinitely many of the statements
to be true.
This heuristic is motivated both by the Borel-Cantelli lemma, and by the standard probabilistic computation that if one is given jointly independent, and genuinely probabilistic, events with
, then one almost surely has an infinite number of the
occurring.
Before we get to the ABC conjecture, let us give two simpler (and well known) demonstrations of these heuristics in action:
Example 1 (Twin prime conjecture) One can heuristically justify the twin prime conjecture as follows. Using the prime number theorem, one can heuristically assign a probability of
to the event that any given large integer
is prime. In particular, the probability that
is prime will then be
. Making the assumption that there are no strong correlations between these events, we are led to the prediction that the probability that
and
are simultaneously prime is
. Since
, the Borel-Cantelli heuristic then predicts that there should be infinitely many twin primes.
Note that the above argument is a bit too naive, because there are some non-trivial correlations between the primality of
and the primality of
. Most obviously, if
is prime, this greatly increases the probability that
is odd, which implies that
is odd, which then elevates the probability that
is prime. A bit more subtly, if
is prime, then
is likely to avoid the residue class
, which means that
avoids the residue class
, which ends up decreasing the probability that
is prime. However, there is a standard way to correct for these local correlations; see for instance this previous blog post. As it turns out, these local correlations ultimately alter the prediction for the asymptotic density of twin primes by a constant factor (the twin prime constant), but do not affect the qualitative prediction of there being infinitely many twin primes.
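The twin prime heuristic is easy to test against actual counts. The sketch below uses hypothetical function names. The naive prediction sums the squared reciprocal logarithm, and the corrected prediction multiplies by twice the twin prime constant, which is the usual Hardy-Littlewood normalisation and is an assumption about which constant factor is meant here.

```python
import math

def prime_sieve(n):
    """Bytearray s with s[k] == 1 exactly when k is prime, for 0 <= k <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return sieve

def twin_prime_count(n):
    """Number of p <= n such that p and p + 2 are both prime."""
    sieve = prime_sieve(n + 2)
    return sum(1 for p in range(2, n + 1) if sieve[p] and sieve[p + 2])

def naive_prediction(n):
    """Sum of 1 / log(m)^2 over 3 <= m <= n: independence with no local correction."""
    return sum(1.0 / math.log(m) ** 2 for m in range(3, n + 1))

# Product over odd primes p of 1 - 1/(p-1)^2, truncated to ten decimal places.
TWIN_PRIME_CONSTANT = 0.6601618158

def corrected_prediction(n):
    """Naive prediction rescaled by twice the twin prime constant."""
    return 2.0 * TWIN_PRIME_CONSTANT * naive_prediction(n)

if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 4, 10 ** 5):
        print(n, twin_prime_count(n), naive_prediction(n), corrected_prediction(n))
```

The naive sum undercounts by a roughly constant factor, and the corrected sum tracks the true count much more closely, in line with the discussion of local correlations.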
Example 2 (Fermat’s last theorem) Let us now heuristically count the number of solutions to
for various
and natural numbers
(which we can reduce to be coprime if desired). We recast this (in the spirit of the ABC conjecture) as
, where
are
powers. The number of
powers up to any given number
is about
, so heuristically any given natural number
has a probability about
of being an
power. If we make the naive assumption that (in the coprime case at least) there is no strong correlation between the events that
is an
power,
is an
power, and
being an
power, then for typical
, the probability that
are all simultaneously
powers would then be
. For fixed
, the total number of solutions to the Fermat equation would then be predicted to be
(Strictly speaking, we need to restrict to the coprime case, but given that a positive density of pairs of integers are coprime, it should not affect the qualitative conclusion significantly if we now omit this restriction.) It might not be immediately obvious as to whether this sum converges or diverges, but (as is often the case with these sorts of unsigned sums) one can clarify the situation by dyadic decomposition. Suppose for instance that we consider the portion of the sum where
lies between
and
. Then this portion of the sum can be controlled by
which simplifies to
Summing in
, one thus expects infinitely many solutions for
, only finitely many solutions for
(indeed, a refinement of this argument shows that one expects only finitely many solutions even if one considers all
at once), and a borderline prediction of there being a barely infinite number of solutions when
. Here is of course a place where a naive application of the probabilistic heuristic breaks down; there is enough arithmetic structure in the equation
that the naive probabilistic prediction ends up being an inaccurate model. Indeed, while this heuristic suggests that a typical homogeneous cubic should have a logarithmic number of integer solutions of a given height
, it turns out that some homogeneous cubics (namely, those associated to elliptic curves of positive rank) end up with the bulk of these solutions, while other homogeneous cubics (such as those associated to elliptic curves of zero rank, including the Fermat curve
) only get finitely many solutions. The reasons for this are subtle, but certainly the high degree of arithmetic structure present in an elliptic curve (starting with the elliptic curve group law which allows one to generate new solutions from old ones, and which also can be used to exclude solutions to
via the method of descent) is a major contributing factor.
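Since the displayed formulas of Example 2 did not survive in this copy, here is a reconstruction of the dyadic computation. The notation is assumed rather than taken from the source: $a, b, c$ for the three powers with $a + b = c$, $k$ for the exponent, and $j$ for the dyadic scale. This is a sketch of the standard calculation, with implied constants allowed to depend on $k$.

```latex
% Heuristic probability that n is a k-th power: comparable to n^{1/k - 1}.
% Contribution of the dyadic block 2^j <= c < 2^{j+1}, with c = a + b:
\[
  \sum_{\substack{a, b \ge 1 \\ 2^j \le a + b < 2^{j+1}}}
    a^{\frac{1}{k} - 1} \, b^{\frac{1}{k} - 1} \, (a+b)^{\frac{1}{k} - 1}
  \;\approx\;
  \Bigl( \sum_{a \le 2^{j+1}} a^{\frac{1}{k} - 1} \Bigr)^{2} \, 2^{j \left( \frac{1}{k} - 1 \right)}
  \;\approx\;
  2^{\frac{2j}{k}} \cdot 2^{j \left( \frac{1}{k} - 1 \right)}
  \;=\;
  2^{\left( \frac{3}{k} - 1 \right) j} .
\]
% Summing over the dyadic scales j = 0, 1, 2, ...:
\[
  \sum_{j \ge 0} 2^{\left( \frac{3}{k} - 1 \right) j}
  \quad
  \begin{cases}
    \text{diverges geometrically} & \text{if } k < 3, \\
    \text{diverges, with each block contributing } O(1) & \text{if } k = 3, \\
    \text{converges} & \text{if } k > 3 .
  \end{cases}
\]
```

The middle case is the "barely infinite" borderline: a bounded contribution from each dyadic block gives a logarithmic number of solutions up to a given height.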
Below the fold, we apply similar heuristics to suggest the truth of the ABC conjecture.
Van Vu and I have just uploaded to the arXiv our paper “Random matrices: Universality of local spectral statistics of non-Hermitian matrices”. The main result of this paper is a “Four Moment Theorem” that establishes universality for local spectral statistics of non-Hermitian matrices with independent entries, under the additional hypotheses that the entries of the matrix decay exponentially, and match moments with either the real or complex gaussian ensemble to fourth order. This is the non-Hermitian analogue of a long string of recent results establishing universality of local statistics in the Hermitian case (as discussed for instance in this recent survey of Van and myself, and also in several other places).
The complex case is somewhat easier to describe. Given a (non-Hermitian) random matrix ensemble of
matrices, one can arbitrarily enumerate the (geometric) eigenvalues as
, and one can then define the
-point correlation functions
to be the symmetric functions such that
In the case when is drawn from the complex gaussian ensemble, so that all the entries are independent complex gaussians of mean zero and variance one, it is a classical result of Ginibre that the asymptotics of
near some point
as
and
is fixed are given by the determinantal rule
and
for , where
is the reproducing kernel
(There is also an asymptotic for the boundary case , but it is more complicated to state.) In particular, we see that
for almost every
, which is a manifestation of the well-known circular law for these matrices; but the circular law only captures the macroscopic structure of the spectrum, whereas the asymptotic (1) describes the microscopic structure.
Our first main result is that the asymptotic (1) for also holds (in the sense of vague convergence) when
is a matrix whose entries are independent with mean zero, variance one, exponentially decaying tails, and which all match moments with the complex gaussian to fourth order. (Actually we prove a stronger result than this which is valid for all bounded
and has more uniform bounds, but is a bit more technical to state.) An analogous result is also established for real gaussians (but now one has to separate the correlation function into components depending on how many eigenvalues are real and how many are strictly complex; also, the limiting distribution is more complicated, being described by Pfaffians rather than determinants). Among other things, this allows us to partially extend some known results on complex or real gaussian ensembles to more general ensembles. For instance, a result of Rider establishes a central limit theorem for the number of eigenvalues of a complex gaussian matrix in a mesoscopic disk; from our results, we can extend this central limit theorem to matrices that match the complex gaussian ensemble to fourth order, provided that the disk is small enough (for technical reasons, our error bounds are not strong enough to handle large disks). Similarly, extending some results of Edelman-Kostlan-Shub and of Forrester-Nagao, we can show that for a matrix matching the real gaussian ensemble to fourth order, the number of real eigenvalues is
with probability
for some absolute constant
.
There are several steps involved in the proof. The first step is to apply the Girko Hermitisation trick to replace the problem of understanding the spectrum of a non-Hermitian matrix, with that of understanding the spectrum of various Hermitian matrices. The two identities that realise this trick are, firstly, Jensen’s formula
that relates the local distribution of eigenvalues to the log-determinants , and secondly the elementary identity
that relates the log-determinants of to the log-determinants of the Hermitian matrices
The main difficulty is then to obtain concentration and universality results for the Hermitian log-determinants . This turns out to be a task that is analogous to the task of obtaining concentration for Wigner matrices (as we did in this recent paper), as well as central limit theorems for log-determinants of Wigner matrices (as we did in this other recent paper). In both of these papers, the main idea was to use the Four Moment Theorem for Wigner matrices (which can now be proven relatively easily by a combination of the local semi-circular law and resolvent swapping methods), combined with (in the latter paper) a central limit theorem for the gaussian unitary ensemble (GUE). This latter task was achieved by using the convenient Trotter normal form to tridiagonalise a GUE matrix, which has the effect of revealing the determinant of that matrix as the solution to a certain linear stochastic difference equation, and one can analyse the distribution of that solution via such tools as the martingale central limit theorem.
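The elementary identity behind the Hermitisation trick can be checked numerically on a small example. The displayed formulas are missing from this copy, so the sketch below assumes the standard block form: for a square matrix and a complex shift, the Hermitian matrix of twice the size has zero diagonal blocks and the shifted matrix and its adjoint as off-diagonal blocks (any normalising factor used in the paper is omitted). All names are hypothetical, and the determinant routine is plain Gaussian elimination, adequate only for small matrices.

```python
import math
import random

def det(matrix):
    """Determinant of a square complex matrix by Gaussian elimination with pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0 + 0.0j
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) == 0.0:
            return 0.0j
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result            # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def hermitisation(m, z):
    """Return (M - z, W) with W the block matrix [[0, M - z], [(M - z)^*, 0]]."""
    n = len(m)
    shifted = [[m[i][j] - (z if i == j else 0) for j in range(n)] for i in range(n)]
    w = [[0j] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            w[i][n + j] = shifted[i][j]
            w[n + j][i] = shifted[i][j].conjugate()
    return shifted, w

def log_det_pair(m, z):
    """(log |det(M - z)|, (1/2) log |det W|); the two values should agree."""
    shifted, w = hermitisation(m, z)
    return math.log(abs(det(shifted))), 0.5 * math.log(abs(det(w)))

if __name__ == "__main__":
    rng = random.Random(2)
    m = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(3)] for _ in range(3)]
    print(log_det_pair(m, 0.3 + 0.1j))
```

The agreement reflects the fact that the determinant of the block matrix is, up to sign, the squared modulus of the determinant of the shifted matrix.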
The matrices are somewhat more complicated than Wigner matrices (for instance, the semi-circular law must be replaced by a distorted Marchenko-Pastur law), but the same general strategy works to obtain concentration and universality for their log-determinants. The main new difficulty that arises is that the analogue of the Trotter normal form for gaussian random matrices is not tridiagonal, but rather Hessenberg (i.e. upper-triangular except for the lower diagonal). This ultimately has the effect of expressing the relevant determinant as the solution to a nonlinear stochastic difference equation, which is a bit trickier to solve for. Fortunately, it turns out that one only needs good lower bounds on the solution, as one can use the second moment method to upper bound the determinant and hence the log-determinant (following a classical computation of Turán). This simplifies the analysis of the equation somewhat.
While this result is the first local universality result in the category of random matrices with independent entries, there are still two limitations to the result which one would like to remove. The first is the moment matching hypotheses on the matrix. Very recently, one of the ingredients of our paper, namely the local circular law, was proved without moment matching hypotheses by Bourgade, Yau, and Yin (provided one stays away from the edge of the spectrum); however, as of this time of writing the other main ingredient – the universality of the log-determinant – still requires moment matching. (The standard tool for obtaining universality without moment matching hypotheses is the heat flow method (and more specifically, the local relaxation flow method), but the analogue of Dyson Brownian motion in the non-Hermitian setting appears to be somewhat intractable, being a coupled flow on both the eigenvalues and eigenvectors rather than just on the eigenvalues alone.)
I’ve just uploaded to the arXiv my paper The asymptotic distribution of a single eigenvalue gap of a Wigner matrix, submitted to Probability Theory and Related Fields. This paper (like several of my previous papers) is concerned with the asymptotic distribution of the eigenvalues of a random Wigner matrix
in the limit
, with a particular focus on matrices drawn from the Gaussian Unitary Ensemble (GUE). This paper is focused on the bulk of the spectrum, i.e. on eigenvalues
with
for some fixed
.
The location of an individual eigenvalue is by now quite well understood. If we normalise the entries of the matrix
to have mean zero and variance
, then in the asymptotic limit
, the Wigner semicircle law tells us that with probability
one has
where the classical location of the eigenvalue is given by the formula
and the semicircular distribution is given by the formula
Actually, one can improve the error term here from to
for any
(see this previous recent paper of Van and myself for more discussion of these sorts of estimates, sometimes known as eigenvalue rigidity estimates).
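The classical location is straightforward to compute. The sketch below works on the scale where the semicircular distribution is supported on the interval from -2 to 2, and solves for the point at which the semicircle law has accumulated a given fraction of its mass. The names are hypothetical, and the rescaling to the actual eigenvalue depends on the variance normalisation, whose value is missing from this copy.

```python
import math

def semicircle_density(x):
    """sqrt(4 - x^2) / (2 pi) on [-2, 2], and zero elsewhere."""
    return math.sqrt(max(4.0 - x * x, 0.0)) / (2.0 * math.pi)

def semicircle_cdf(x):
    """Closed-form integral of the semicircle density from -2 to x."""
    x = min(max(x, -2.0), 2.0)
    return 0.5 + x * math.sqrt(4.0 - x * x) / (4.0 * math.pi) + math.asin(x / 2.0) / math.pi

def classical_location(i, n, tol=1e-12):
    """Solve semicircle_cdf(u) = i / n for u in [-2, 2] by bisection."""
    target = i / n
    lo, hi = -2.0, 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if semicircle_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    n = 1000
    for i in (100, 500, 900):
        u = classical_location(i, n)
        # On this scale the expected gap is about 1 / (n * density at u).
        print(i, u, 1.0 / (n * semicircle_density(u)))
```

Consecutive classical locations differ by about the reciprocal of the matrix size times the density, which is the fundamental theorem of calculus step referred to in the text.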
From the semicircle law (and the fundamental theorem of calculus), one expects the eigenvalue spacing
to have an average size of
. It is thus natural to introduce the normalised eigenvalue spacing
and ask what the distribution of is.
As mentioned previously, we will focus on the bulk case , and begin with the model case when
is drawn from GUE. (In the edge case when
is close to
or to
, the distribution is given by the famous Tracy-Widom law.) Here, the distribution was almost (but as we shall see, not quite) worked out by Gaudin and Mehta. By using the theory of determinantal processes, they were able to compute a quantity closely related to
, namely the probability
near
of length comparable to the expected eigenvalue spacing
is devoid of eigenvalues. For
in the bulk and fixed
, they showed that this probability is equal to
where is the Dyson projection
to Fourier modes in , and
is the Fredholm determinant. As shown by Jimbo, Miwa, Mori, and Sato, this determinant can also be expressed in terms of a solution to a Painlevé V ODE, though we will not need this fact here. In view of this asymptotic and some standard integration by parts manipulations, it becomes plausible to propose that
will be asymptotically distributed according to the Gaudin-Mehta distribution
, where
A reasonably accurate approximation for is given by the Wigner surmise
, which was presciently proposed by Wigner as early as 1957; it is exact for
but not in the asymptotic limit
.
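The formula for the Wigner surmise is missing from this copy, and which of the standard forms was displayed is not recoverable here. The sketch below therefore records both of the usual surmise densities from the literature: the linear-repulsion form associated with real symmetric matrices, and the quadratic-repulsion form associated with complex Hermitian matrices such as GUE. It checks numerically that each is a probability density with mean spacing one. The function names and the hand-rolled Simpson rule are illustrative choices.

```python
import math

def surmise_beta1(s):
    """Linear repulsion: (pi / 2) * s * exp(-pi * s^2 / 4)."""
    return 0.5 * math.pi * s * math.exp(-math.pi * s * s / 4.0)

def surmise_beta2(s):
    """Quadratic repulsion: (32 / pi^2) * s^2 * exp(-4 * s^2 / pi)."""
    return (32.0 / math.pi ** 2) * s * s * math.exp(-4.0 * s * s / math.pi)

def simpson(f, a, b, steps=20000):
    """Composite Simpson rule on [a, b] with an even number of steps."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

if __name__ == "__main__":
    for name, p in (("beta=1", surmise_beta1), ("beta=2", surmise_beta2)):
        mass = simpson(p, 0.0, 10.0)
        mean = simpson(lambda s: s * p(s), 0.0, 10.0)
        print(name, mass, mean)
```

Both densities vanish at zero spacing, reflecting eigenvalue repulsion, and both have Gaussian tails, unlike the exponential tail one would see for independent points.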
Unfortunately, when one tries to make this argument rigorous, one finds that the asymptotic for (1) does not control a single gap , but rather an ensemble of gaps
, where
is drawn from an interval
of some moderate size
(e.g.
); see for instance this paper of Deift, Kriecherbauer, McLaughlin, Venakides, and Zhou for a more precise formalisation of this statement (which is phrased slightly differently, in which one samples all gaps inside a fixed window of spectrum, rather than inside a fixed range of eigenvalue indices
). (This result is stated for GUE, but can be extended to other Wigner ensembles by the Four Moment Theorem, at least if one assumes a moment matching condition; see this previous paper with Van Vu for details. The moment condition can in fact be removed, as was done in this subsequent paper with Erdős, Ramirez, Schlein, Vu, and Yau.)
The problem is that when one specifies a given window of spectrum such as , one cannot quite pin down in advance which eigenvalues
are going to lie to the left or right of this window; even with the strongest eigenvalue rigidity results available, there is a natural uncertainty of
or so in the
index (as can be quantified quite precisely by this central limit theorem of Gustavsson).
The main difficulty here is that there could potentially be some strange coupling between the event (1) of an interval being devoid of eigenvalues, and the number of eigenvalues to the left of that interval. For instance, one could conceive of a possible scenario in which the interval in (1) tends to have many eigenvalues when
is even, but very few when
is odd. In this sort of situation, the gaps
may have different behaviour for even
than for odd
, and such anomalies would not be picked up in the averaged statistics in which
is allowed to range over some moderately large interval.
The main result of the current paper is that these anomalies do not actually occur, and that all of the eigenvalue gaps in the bulk are asymptotically governed by the Gaudin-Mehta law without the need for averaging in the
parameter. Again, this is shown first for GUE, and then extended to other Wigner matrices obeying a matching moment condition using the Four Moment Theorem. (It is likely that the moment matching condition can be removed here, but I was unable to achieve this, despite all the recent advances in establishing universality of local spectral statistics for Wigner matrices, mainly because the universality results in the literature are more focused on specific energy levels
than on specific eigenvalue indices
. To make matters worse, in some cases universality is currently known only after an additional averaging in the energy parameter.)
The main task in the proof is to show that the random variable is largely decoupled from the event in (1) when
is drawn from GUE. To do this we use some of the theory of determinantal processes, and in particular the nice fact that when one conditions a determinantal process to the event that a certain spatial region (such as an interval) contains no points of the process, then one obtains a new determinantal process (with a kernel that is closely related to the original kernel). The main task is then to obtain a sufficiently good control on the distance between the new determinantal kernel and the old one, which we do by some functional-analytic considerations involving the manipulation of norms of operators (and specifically, the operator norm, Hilbert-Schmidt norm, and nuclear norm). Amusingly, the Fredholm alternative makes a key appearance, as I end up having to invert a compact perturbation of the identity at one point (specifically, I need to invert
, where
is the Dyson projection and
is an interval). As such, the bounds in my paper become ineffective, though I am sure that with more work one can invert this particular perturbation of the identity by hand, without the need to invoke the Fredholm alternative.
In the last three notes, we discussed the Bourgain-Gamburd expansion machine and two of its three ingredients, namely quasirandomness and product theorems, leaving only the non-concentration ingredient to discuss. We can summarise the results of the last three notes, in the case of fields of prime order, as the following theorem.
Theorem 1 (Non-concentration implies expansion in
) Let
be a prime, let
, and let
be a symmetric set of elements in
of cardinality
not containing the identity. Write
, and suppose that one has the non-concentration property
for some
and some even integer
. Then
is a two-sided
-expander for some
depending only on
.
Proof: From (1) we see that is not supported in any proper subgroup
of
, which implies that
generates
. The claim now follows from the Bourgain-Gamburd expansion machine (Theorem 2 of Notes 4), the product theorem (Theorem 1 of Notes 5), and quasirandomness (Exercise 8 of Notes 3).
Remark 1 The same argument also works if we replace
by the field
of order
for some bounded
. However, there is a difficulty in the regime when
is unbounded, because the quasirandomness property becomes too weak for the Bourgain-Gamburd expansion machine to be directly applicable. On the other hand, the above type of theorem was generalised to the setting of cyclic groups
with
square-free by Varju, to arbitrary
by Bourgain and Varju, and to more general algebraic groups than
and square-free
by Salehi Golsefidy and Varju. It may be that some modification of the proof techniques in these papers may also be able to handle the field case
with unbounded
.
It thus remains to construct tools that can establish the non-concentration property (1). The situation is particularly simple in , as we have a good understanding of the subgroups of that group. Indeed, from Theorem 14 from Notes 5, we obtain the following corollary to Theorem 1:
Corollary 2 (Non-concentration implies expansion in
) Let
be a prime, and let
be a symmetric set of elements in
of cardinality
not containing the identity. Write
, and suppose that one has the non-concentration property
for some
and some even integer
, where
ranges over all Borel subgroups of
. Then, if
is sufficiently large depending on
,
is a two-sided
-expander for some
depending only on
.
It turns out (2) can be verified in many cases by exploiting the solvable nature of the Borel subgroups . We give two examples of this in these notes. The first result, due to Bourgain and Gamburd (with earlier partial results by Gamburd and by Shalom) generalises Selberg’s expander construction to the case when
generates a thin subgroup of
:
Theorem 3 (Expansion in thin subgroups) Let
be a symmetric subset of
not containing the identity, and suppose that the group
generated by
is not virtually solvable. Then as
ranges over all sufficiently large primes, the Cayley graphs
form a two-sided expander family, where
is the usual projection.
Remark 2 One corollary of Theorem 3 (or of the non-concentration estimate (3) below) is that
generates
for all sufficiently large
, if
is not virtually solvable. This is a special case of a much more general result, known as the strong approximation theorem, although this is certainly not the most direct way to prove such a theorem. Conversely, the strong approximation property is used in generalisations of this result to higher rank groups than
.
Exercise 1 In the converse direction, if
is virtually solvable, show that for sufficiently large
,
fails to generate
. (Hint: use Theorem 14 from Notes 5 to prevent
from having bounded index solvable subgroups.)
Exercise 2 (Lubotzky’s 1-2-3 problem) Let
.
- (i) Show that
generates a free subgroup of
. (Hint: use a ping-pong argument, as in Exercise 23 of Notes 2.)
- (ii) Show that if
are two distinct elements of the sector
, then there is no element
for which
. (Hint: this is another ping-pong argument.) Conclude that
has infinite index in
. (Contrast this with the situation in which the
coefficients in
are replaced by
or
, in which case
is either all of
, or a finite index subgroup, as demonstrated in Exercise 23 of Notes 2).
- (iii) Show that
for sufficiently large primes
form a two-sided expander family.
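The generation statement in Remark 2 can be checked by brute force for small primes. The generating set of Exercise 2 is missing from this copy, so the sketch below assumes the customary choice for this problem: the two elementary unipotent matrices with off-diagonal entry 3 (inverses are not needed when generating inside a finite group). It only tests generation of the group of 2x2 matrices of determinant one modulo a prime, not expansion, and the function names are hypothetical.

```python
from collections import deque

def mat_mul(x, y, p):
    """Multiply 2x2 matrices stored as tuples (a, b, c, d), entries modulo p."""
    a, b, c, d = x
    e, f, g, h = y
    return ((a * e + b * g) % p, (a * f + b * h) % p,
            (c * e + d * g) % p, (c * f + d * h) % p)

def generated_subgroup_size(generators, p):
    """Size of the subgroup generated by the given matrices modulo p.

    In a finite group, the products of generators alone already form a
    subgroup, so a breadth-first search over right multiplications suffices.
    """
    identity = (1, 0, 0, 1)
    gens = [tuple(v % p for v in g) for g in generators]
    seen = {identity}
    queue = deque([identity])
    while queue:
        x = queue.popleft()
        for g in gens:
            y = mat_mul(x, g, p)
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen)

# Assumed generators: upper and lower unipotent matrices with entry 3.
GENERATORS = [(1, 3, 0, 1), (1, 0, 3, 1)]

if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        # The full group of determinant-one matrices has order p * (p^2 - 1).
        print(p, generated_subgroup_size(GENERATORS, p), p * (p * p - 1))
```

For the prime 3 the generators reduce to the identity, so the generated group is trivial. For every other prime they reduce to generators of the upper and lower unipotent subgroups, which together generate the whole group.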
Remark 3 Theorem 3 has been generalised to arbitrary linear groups, and with
replaced by
for square-free
; see this paper of Salehi Golsefidy and Varju. In this more general setting, the condition of virtual solvability must be replaced by the condition that the connected component of the Zariski closure of
is perfect. An effective version of Theorem 3 (with completely explicit constants) was recently obtained by Kowalski.
The second example concerns Cayley graphs constructed using random elements of .
Theorem 4 (Random generators expand) Let
be a prime, and let
be two elements of
chosen uniformly at random. Then with probability
,
is a two-sided
-expander for some absolute constant
.
Remark 4 As with Theorem 3, Theorem 4 has also been extended to a number of other groups, such as the Suzuki groups (in this paper of Breuillard, Green, and Tao), and more generally to finite simple groups of Lie type of bounded rank (in forthcoming work of Breuillard, Green, Guralnick, and Tao). There are a number of other constructions of expanding Cayley graphs in such groups (and in other interesting groups, such as the alternating groups) beyond those discussed in these notes; see this recent survey of Lubotzky for further discussion. It has been conjectured by Lubotzky and Weiss that any pair
of (say)
that generates the group, is a two-sided
-expander for an absolute constant
: in the case of
, this has been established for a density one set of primes by Breuillard and Gamburd.
— 1. Expansion in thin subgroups —
We now prove Theorem 3. The first observation is that the expansion property is monotone in the group :
Exercise 3 Let
be symmetric subsets of
not containing the identity, such that
. Suppose that
is a two-sided expander family for sufficiently large primes
. Show that
is also a two-sided expander family.
As a consequence, Theorem 3 follows from the following two statements:
Theorem 5 (Tits alternative) Let
be a group. Then exactly one of the following statements holds:
- (i)
is virtually solvable.
- (ii)
contains a copy of the free group
of two generators as a subgroup.
Theorem 6 (Expansion in free groups) Let
be generators of a free subgroup of
. Then as
ranges over all sufficiently large primes, the Cayley graphs
form a two-sided expander family.
Theorem 5 is a special case of the famous Tits alternative, which among other things allows one to replace by
for any
and any field
of characteristic zero (and fields of positive characteristic are also allowed, if one adds the requirement that
be finitely generated). We will not prove the full Tits alternative here, but instead just give an ad hoc proof of the special case in Theorem 5 in the following exercise.
Exercise 4 Given any matrix
, the singular values are
and
, and we can apply the singular value decomposition to decompose
where
and
are orthonormal bases. (When
, these bases are uniquely determined up to phase rotation.) We let
be the projection of
to the projective complex plane, and similarly define
.
Let
be a subgroup of
. Call a pair
a limit point of
if there exists a sequence
with
and
.
- (i) Show that if
is infinite, then there is at least one limit point.
- (ii) Show that if
is a limit point, then so is
.
- (iii) Show that if there are two limit points
with
, then there exist
that generate a free group. (Hint: Choose
close to
and
close to
, and consider the action of
and
on
, and specifically on small neighbourhoods of
, and set up a ping-pong type situation.)
- (iv) Show that if
is hyperbolic (i.e. it has an eigenvalue greater than 1), with eigenvectors
, then the projectivisations
of
form a limit point. Similarly, if
is regular parabolic (i.e. it has an eigenvalue at 1, but is not the identity) with eigenvector
, show that
is a limit point.
- (v) Show that if
has no free subgroup of two generators, then all hyperbolic and regular parabolic elements of
have a common eigenvector. Conclude that all such elements lie in a solvable subgroup of
.
- (vi) Show that if an element
is neither hyperbolic nor regular parabolic, and is not a multiple of the identity, then
is conjugate to a rotation by
(in particular,
).
- (vii) Establish Theorem 5. (Hint: show that two square roots of
in
cannot multiply to another square root of
.)
Now we prove Theorem 6. Let be a free subgroup of
generated by two generators
. Let
be the probability measure generating a random walk on
, thus
is the corresponding generator on
. By Corollary 2, it thus suffices to show that
, some absolute constant
, and some even
(depending on
, of course), where
ranges over Borel subgroups.
As is a homomorphism, one has
and so it suffices to show that
To deal with the supremum here, we will use an argument of Bourgain and Gamburd, taking advantage of the fact that all Borel groups of obey a common group law, the point being that free groups such as
obey such laws only very rarely. More precisely, we use the fact that the Borel groups are solvable of derived length two; in particular we have
. Now,
is supported on matrices in
whose coefficients have size
(where we allow the implied constants to depend on the choice of generators
), and so
is supported on matrices in
whose coefficients also have size
. If
is less than a sufficiently small multiple of
, these coefficients are then less than
(say). As such, if
lie in the support of
and their projections
obey the word law (4) in
, then the original matrices
obey the word law (4) in
. (This lifting of identities from the characteristic
setting of
to the characteristic
setting of
is a simple example of the “Lefschetz principle”.)
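The lifting step can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: the generators are taken to be the matrices (1 2; 0 1) and (1 0; 2 1), which are well known to generate a free group but need not be the generators intended above, and the names reduced_words, evaluate and lifting_check are hypothetical. Each multiplication by a generator at most triples the largest entry, so words of length at most n have entries of size at most 3^n; once p exceeds 2·3^n, reduction mod p cannot identify two such matrices.

```python
# Assumed generators (an illustrative choice, not taken from the text).
A = ((1, 2), (0, 1))
B = ((1, 0), (2, 1))
A_INV = ((1, -2), (0, 1))
B_INV = ((1, 0), (-2, 1))
# Letters: 1 = a, -1 = a^{-1}, 2 = b, -2 = b^{-1}.
GENS = {1: A, -1: A_INV, 2: B, -2: B_INV}
IDENTITY = ((1, 0), (0, 1))


def mat_mul(x, y):
    """Multiply two 2x2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def reduced_words(max_len):
    """Return all reduced words of length <= max_len as tuples of letters."""
    words = [()]
    frontier = [()]
    for _ in range(max_len):
        # Extend each word by any letter that does not cancel its last letter.
        frontier = [w + (g,) for w in frontier for g in GENS
                    if not w or w[-1] != -g]
        words.extend(frontier)
    return words


def evaluate(word):
    """Substitute the assumed generators into a word, working over the integers."""
    m = IDENTITY
    for g in word:
        m = mat_mul(m, GENS[g])
    return m


def lifting_check(n, p):
    """Return (#words, largest entry, #distinct matrices after reduction mod p)."""
    mats = [evaluate(w) for w in reduced_words(n)]
    max_entry = max(abs(e) for m in mats for row in m for e in row)
    mod_p = {tuple(tuple(e % p for e in row) for row in m) for m in mats}
    return len(mats), max_entry, len(mod_p)
```

For instance, lifting_check(4, 163) compares the number of reduced words of length at most 4 with the number of distinct matrices they produce mod 163; the two counts agree because 163 exceeds twice the largest entry, so any identity between these words mod p is already an identity over the integers.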
To summarise, if we let be the set of all elements of
that lie in the support of
, then (4) holds for all
. This severely limits the size of
to only be of polynomial size, rather than exponential size:
Proposition 7 Let
be a subset of the support of
(thus,
consists of words in
of length
) such that the law (4) holds for all
. Then
.
The proof of this proposition is laid out in the exercise below.
Exercise 5 Let
be a free group generated by two generators
. Let
be the set of all words of length at most
in
.
- (i) Show that if
commute, then
lie in the same cyclic group, thus
for some
and
.
- (ii) Show that if
, there are at most
elements of
that commute with
.
- (iii) Show that if
, there are at most
elements
of
with
.
- (iv) Prove Proposition 7.
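The commutation phenomenon in parts (i) and (ii) can be explored by brute force for small word lengths. The following is a minimal sketch, not a proof: words in the free group are stored as tuples of letters (1 for the first generator, 2 for the second, negatives for inverses), and the helper names multiply, ball and centralizer_in_ball are hypothetical.

```python
LETTERS = (1, -1, 2, -2)  # 1 = a, -1 = a^{-1}, 2 = b, -2 = b^{-1}


def multiply(u, v):
    """Concatenate two reduced words, cancelling inverse pairs at the junction."""
    out = list(u)
    for g in v:
        if out and out[-1] == -g:
            out.pop()       # g cancels the current last letter
        else:
            out.append(g)
    return tuple(out)


def ball(n):
    """Return all reduced words of length at most n."""
    words = [()]
    frontier = [()]
    for _ in range(n):
        frontier = [w + (g,) for w in frontier for g in LETTERS
                    if not w or w[-1] != -g]
        words.extend(frontier)
    return words


def centralizer_in_ball(w, n):
    """Return the words of length at most n that commute with w."""
    return [v for v in ball(n) if multiply(v, w) == multiply(w, v)]
```

In these small cases the only words commuting with a or with ab are powers of that word, so the count grows linearly in n while the ball itself grows exponentially.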
Now we can conclude the proof of Theorem 3:
Exercise 6 Let
be a free group generated by two generators
.
- (i) Show that
for some absolute constant
. (For much more precise information on
, see this paper of Kesten.)
- (ii) Conclude the proof of Theorem 3.
— 2. Random generators expand —
We now prove Theorem 4. Let be the free group on two formal generators
, and let
be the generator of the random walk. For any word
and any
in a group
, let
be the element of
formed by substituting
for
respectively in the word
; thus
can be viewed as a map
for any group
. Observe that if
is drawn randomly using the distribution
, and
, then
is distributed according to the law
, where
. Applying Corollary 2, it suffices to show that whenever
is a large prime and
are chosen uniformly and independently at random from
, that with probability
, one has
, where
ranges over all Borel subgroups of
and
is drawn from the law
for some even natural number
.
Let denote the words in
of length at most
. We may use the law (4) to obtain a good bound on the supremum in (5) assuming a certain non-degeneracy property of the word evaluations
:
Exercise 7 Let
be a natural number, and suppose that
is such that
for
. Show that
for some absolute constant
, where
is drawn from the law
. (Hint: use (4) and the hypothesis to lift the problem up to
, at which point one can use Proposition 7 and Exercise 6.)
In view of this exercise, it suffices to show that with probability , one has
for all
for some
comparable to a small multiple of
. As
has
elements, it thus suffices by the union bound to show that
, and any
of length less than
for some sufficiently small absolute constant
.
Let us now fix a non-identity word of length
less than
, and consider
as a function from
to
for an arbitrary field
. We can identify
with the set
. A routine induction then shows that the expression
is then a polynomial in the eight variables
of degree
and coefficients which are integers of size
. Let us then make the additional restriction to the case
, in which case we can write
and
. Then
is now a rational function of
whose numerator is a polynomial of degree
and coefficients of size
, and the denominator is a monomial of
of degree
.
We then specialise this rational function to the field . It is conceivable that when one does so, the rational function collapses to the constant polynomial
, thus
for all
with
. (For instance, this would be the case if
, by Lagrange’s theorem, if it were not for the fact that
is far too large here.) But suppose that this rational function does not collapse to the constant rational function. Applying the Schwartz-Zippel lemma (Exercise 23 from Notes 5), we then see that the set of pairs
with
and
is at most
; adding in the
and
cases, one still obtains a bound of
, which is acceptable since
and
. Thus, the only remaining case to consider is when the rational function
is identically
on
with
.
Now we perform another “Lefschetz principle” manoeuvre to change the underlying field. Recall that the denominator of the rational function is a monomial in
, and the numerator has coefficients of size
. If
is less than
for a sufficiently small
, we conclude in particular (for
large enough) that the coefficients all have magnitude less than
. As such, the only way that this function can be identically
on
is if it is identically
on
for all
with
, and hence for
or
also by taking Zariski closures.
On the other hand, we know that for some choices of , e.g.
,
contains a copy
of the free group on two generators (see e.g. Exercise 23 of Notes 2). As such, it is not possible for any non-identity word
to be identically trivial on
. Thus this case cannot actually occur, completing the proof of (6) and hence of Theorem 4.
Remark 5 We see from the above argument that the existence of subgroups
of an algebraic group with good “independence” properties – such as that of generating a free group – can be useful in studying the expansion properties of that algebraic group, even if the field of interest in the latter is distinct from that of the former. For more complicated algebraic groups than
, in which laws such as (4) are not always available, it turns out to be useful to place further properties on the subgroup
, for instance by requiring that all non-abelian subgroups of that group be Zariski dense (a property which has been called strong density), as this turns out to be useful for preventing random walks from concentrating in proper algebraic subgroups. See this paper of Breuillard, Guralnick, Green and Tao for constructions of strongly dense free subgroups of algebraic groups and further discussion.
Van Vu and I have just uploaded to the arXiv our paper “Random matrices: The Universality phenomenon for Wigner ensembles”. This survey is a longer version (58 pages) of a previous short survey we wrote up a few months ago. The survey focuses on recent progress in understanding the universality phenomenon for Hermitian Wigner ensembles, of which the Gaussian Unitary Ensemble (GUE) is the most well known. The one-sentence summary of this progress is that many of the asymptotic spectral statistics (e.g. correlation functions, eigenvalue gaps, determinants, etc.) that were previously known for GUE matrices, are now known for very large classes of Wigner ensembles as well. There are however a wide variety of results of this type, due to the large number of interesting spectral statistics, the varying hypotheses placed on the ensemble, and the different modes of convergence studied, and it is difficult to isolate a single such result currently as the definitive universality result. (In particular, there is at present a tradeoff between generality of ensemble and strength of convergence; the universality results that are available for the most general classes of ensemble are only presently able to demonstrate a rather weak sense of convergence to the universal distribution (involving an additional averaging in the energy parameter), which limits the applicability of such results to a number of interesting questions in which energy averaging is not permissible, such as the study of the least singular value of a Wigner matrix, or of related quantities such as the condition number or determinant. But it is conceivable that this tradeoff is a temporary phenomenon and may be eliminated by future work in this area; in the case of Hermitian matrices whose entries have the same second moments as that of the GUE ensemble, for instance, the need for energy averaging has already been removed.)
Nevertheless, throughout the family of results that have been obtained recently, there are two main methods which have been fundamental to almost all of the recent progress in extending from special ensembles such as GUE to general ensembles. The first method, developed extensively by Erdos, Schlein, Yau, Yin, and others (and building on an initial breakthrough by Johansson), is the heat flow method, which exploits the rapid convergence to equilibrium of the spectral statistics of matrices undergoing Dyson-type flows towards GUE. (An important aspect to this method is the ability to accelerate the convergence to equilibrium by localising the Hamiltonian, in order to eliminate the slowest modes of the flow; this refinement of the method is known as the “local relaxation flow” method. Unfortunately, the translation mode is not accelerated by this process, which is the principal reason why results obtained by pure heat flow methods still require an energy averaging in the final conclusion; it would be of interest to find a way around this difficulty.) The other method, which goes all the way back to Lindeberg in his classical proof of the central limit theorem, and which was introduced to random matrix theory by Chatterjee and then developed for the universality problem by Van Vu and myself, is the swapping method, which is based on the observation that spectral statistics of Wigner matrices tend to be stable if one replaces just one or two entries of the matrix with another distribution, with the stability of the swapping process becoming stronger if one assumes that the old and new entries have many matching moments. The main formalisations of this observation are known as four moment theorems, because they require four matching moments between the entries, although there are some variant three moment theorems and two moment theorems in the literature as well.
Our initial four moment theorems were focused on individual eigenvalues (and later also on eigenvectors), but it was later observed by Erdos, Yau, and Yin that simpler four moment theorems could also be established for aggregate spectral statistics, such as the coefficients of the Green's function, and Knowles and Yin also subsequently observed that these latter theorems could be used to recover a four moment theorem for eigenvalues and eigenvectors, giving an alternate approach to proving such theorems.
Interestingly, it seems that the heat flow and swapping methods are complementary to each other; the heat flow methods are good at removing moment hypotheses on the coefficients, while the swapping methods are good at removing regularity hypotheses. To handle general ensembles with minimal moment or regularity hypotheses, it is thus necessary to combine the two methods (though perhaps in the future a third method, or a unification of the two existing methods, might emerge).
Besides the heat flow and swapping methods, there are also a number of other basic tools that are needed in these results, such as local semicircle laws and eigenvalue rigidity, which are also discussed in the survey. We also survey how universality has been established for a wide variety of spectral statistics; the k-point correlation functions are the most well known of these statistics, but they do not tell the whole story (particularly if one can only control these functions after an averaging in the energy), and there are a number of other statistics, such as eigenvalue counting functions, determinants, or spectral gaps, for which the above methods can be applied.
In order to prevent the survey from becoming too enormous, we decided to restrict attention to Hermitian matrix ensembles, whose entries off the diagonal are identically distributed, as this is the case in which the strongest results are available. There are several results that are applicable to more general ensembles than these which are briefly mentioned in the survey, but they are not covered in detail.
We plan to submit this survey eventually to the proceedings of a workshop on random matrix theory, and will continue to update the references on the arXiv version until the time comes to actually submit the paper.
Finally, in the survey we issue some errata for previous papers of Van and myself in this area, mostly centering around the three moment theorem (a variant of the more widely used four moment theorem), for which the original proof of Van and myself was incomplete. (Fortunately, as the three moment theorem had many fewer applications than the four moment theorem, and most of the applications that it did have ended up being superseded by subsequent papers, the actual impact of this issue was limited, but still an erratum is in order.)
Van Vu and I have just uploaded to the arXiv our paper Random matrices: Sharp concentration of eigenvalues, submitted to the Electronic Journal of Probability. As with many of our previous papers, this paper is concerned with the distribution of the eigenvalues of a random Wigner matrix
(such as a matrix drawn from the Gaussian Unitary Ensemble (GUE) or Gaussian Orthogonal Ensemble (GOE)). To simplify the discussion we shall mostly restrict attention to the bulk of the spectrum, i.e. to eigenvalues
with
for some fixed
, although analogues of most of the results below have also been obtained at the edge of the spectrum.
If we normalise the entries of the matrix to have mean zero and variance
, then in the asymptotic limit
, we have the Wigner semicircle law, which asserts that the eigenvalues are asymptotically distributed according to the semicircular distribution
, where
An essentially equivalent way of saying this is that for large , we expect the
eigenvalue
of
to stay close to the classical location
, defined by the formula
In particular, from the Wigner semicircle law it can be shown that asymptotically almost surely, one has
.
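For concreteness, classical locations can be computed numerically. The sketch below assumes the standard normalisation, in which the semicircular density is (1/2π)(4 − x²)^{1/2} on [−2, 2] and the i-th classical location is the point where the semicircle distribution function equals i/n; the displayed formulas are not reproduced in this text, so both conventions are assumptions, as are the function names.

```python
import math


def semicircle_cdf(x):
    """Distribution function of the density sqrt(4 - x^2) / (2 pi) on [-2, 2]."""
    if x <= -2.0:
        return 0.0
    if x >= 2.0:
        return 1.0
    return 0.5 + x * math.sqrt(4.0 - x * x) / (4.0 * math.pi) + math.asin(x / 2.0) / math.pi


def classical_location(i, n, tol=1e-12):
    """Solve semicircle_cdf(gamma) = i / n for gamma by bisection."""
    target = i / n
    lo, hi = -2.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if semicircle_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Under these conventions classical_location(n // 2, n) sits at the origin for even n, and consecutive classical locations in the bulk are spaced at distance comparable to 1/n, which is the scale against which the error terms discussed here should be measured.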
In the modern study of the spectrum of Wigner matrices (and in particular as a key tool in establishing universality results), it has become of interest to improve the error term in (1) as much as possible. A typical early result in this direction was by Bai, who used the Stieltjes transform method to obtain polynomial convergence rates of the shape for some absolute constant
; see also the subsequent papers of Alon-Krivelevich-Vu and of Meckes, who were able to obtain such convergence rates (with exponentially high probability) by using concentration of measure tools, such as Talagrand’s inequality. On the other hand, in the case of the GUE ensemble it is known (by this paper of Gustavsson) that
has variance comparable to
in the bulk, so that the optimal error term in (1) should be about
. (One may think that if one wanted bounds on (1) that were uniform in
, one would need to enlarge the error term further, but this does not appear to be the case, due to strong correlations between the
; note for instance this recent result of Ben Arous and Bourgade that the largest gap between eigenvalues in the bulk is typically of order
.)
A significant advance in this direction was achieved by Erdos, Schlein, and Yau in a series of papers where they used a combination of Stieltjes transform and concentration of measure methods to obtain local semicircle laws which showed, among other things, that one had asymptotics of the form
with exponentially high probability for intervals in the bulk that were as short as
for some
, where
is the number of eigenvalues. These asymptotics are consistent with a good error term in (1), and are already sufficient for many applications, but do not quite imply a strong concentration result for individual eigenvalues
(basically because they do not preclude long-range or “secular” shifts in the spectrum that involve large blocks of eigenvalues at mesoscopic scales). Nevertheless, this was rectified in a subsequent paper of Erdos, Yau, and Yin, which roughly speaking obtained a bound of the form
in the bulk with exponentially high probability, for Wigner matrices obeying some exponential decay conditions on the entries. This was achieved by a rather delicate high moment calculation, in which the contribution of the diagonal entries of the resolvent (whose average forms the Stieltjes transform) was shown to mostly cancel each other out.
As the GUE computations show, this concentration result is sharp up to the quasilogarithmic factor . The main result of this paper is to improve the concentration result to one more in line with the GUE case, namely
with exponentially high probability (see the paper for a more precise statement of results). The one catch is that an additional hypothesis is required, namely that the entries of the Wigner matrix have vanishing third moment. We also obtain similar results for the edge of the spectrum (but with a different scaling).
Our arguments are rather different from those of Erdos, Yau, and Yin, and thus provide an alternate approach to establishing eigenvalue concentration. The main tool is the Lindeberg exchange strategy, which is also used to prove the Four Moment Theorem (although we do not directly invoke the Four Moment Theorem in our analysis). The main novelty is that this exchange strategy is now used to establish large deviation estimates (i.e. exponentially small tail probabilities) rather than universality of the limiting distribution. Roughly speaking, the basic point is as follows. The Lindeberg exchange strategy seeks to compare a function of many independent random variables
with the same function
of a different set of random variables (which match moments with the original set of variables to some order, such as to second or fourth order) by exchanging the random variables one at a time. Typically, one tries to upper bound expressions such as
for various smooth test functions , by performing a Taylor expansion in the variable being swapped and taking advantage of the matching moment hypotheses. In previous implementations of this strategy,
was a bounded test function, which allowed one to get control of the bulk of the distribution of
, and in particular in controlling probabilities such as
for various thresholds and
, but did not give good control on the tail as the error terms tended to be polynomially decaying in
rather than exponentially decaying. However, it turns out that one can modify the exchange strategy to deal with moments such as
for various moderately large (e.g. of size comparable to
), obtaining results such as
after performing all the relevant exchanges. As such, one can then use large deviation estimates on to deduce large deviation estimates on
.
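The one-at-a-time exchange can be seen in a toy model far simpler than the matrix setting. The sketch below is an illustration only: it takes a normalised sum of n independent variables, swaps Rademacher signs for a three-point law that matches their first three moments (an assumed choice, not one from the paper), and computes the change in an expectation after each swap exactly by convolution. All names are hypothetical.

```python
import math
from collections import defaultdict

SQRT3 = math.sqrt(3.0)
# Atoms are stored as (u, v), meaning the real number u + v * sqrt(3).
X_LAW = {(1, 0): 0.5, (-1, 0): 0.5}                          # moments 0, 1, 0, 1
Y_LAW = {(0, 1): 1 / 6, (0, -1): 1 / 6, (0, 0): 2 / 3}       # moments 0, 1, 0, 3


def convolve(law1, law2):
    """Law of the sum of two independent variables with the given laws."""
    out = defaultdict(float)
    for (u1, v1), p1 in law1.items():
        for (u2, v2), p2 in law2.items():
            out[(u1 + u2, v1 + v2)] += p1 * p2
    return dict(out)


def hybrid_expectation(G, n, k):
    """E G(S / sqrt(n)), where S has k summands from Y_LAW and n - k from X_LAW."""
    law = {(0, 0): 1.0}
    for i in range(n):
        law = convolve(law, Y_LAW if i < k else X_LAW)
    return sum(p * G((u + v * SQRT3) / math.sqrt(n)) for (u, v), p in law.items())


def swap_differences(G, n):
    """Change in the expectation caused by each successive single exchange."""
    values = [hybrid_expectation(G, n, k) for k in range(n + 1)]
    return [values[k + 1] - values[k] for k in range(n)]
```

With a cubic test function every entry of swap_differences vanishes up to rounding, since only the first three moments enter; with G(x) = x^4 each exchange costs exactly 2/n^2, so the n exchanges together cost 2/n. This is the basic bookkeeping that the Taylor expansion argument carries out for general smooth test functions.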
In this paper we also take advantage of a simplification, first noted by Erdos, Yau, and Yin, that Four Moment Theorems become somewhat easier to prove if one works with resolvents (and the closely related Stieltjes transform
) rather than with individual eigenvalues, as the Taylor expansions of resolvents are very simple (essentially being a Neumann series). The relationship between the Stieltjes transform and the location of individual eigenvalues can be seen by taking advantage of the identity
for any energy level , which can be verified from elementary calculus. (In practice, we would truncate
near zero and near infinity to avoid some divergences, but this is a minor technicality.) As such, a concentration result for the Stieltjes transform can be used to establish an analogous concentration result for the eigenvalue counting functions
, which in turn can be used to deduce concentration results for individual eigenvalues
by some basic combinatorial manipulations.
