
The 2014 Fields medallists have just been announced as (in alphabetical order of surname) Artur Avila, Manjul Bhargava, Martin Hairer, and Maryam Mirzakhani (see also these nice video profiles for the winners, which is a new initiative of the IMU and the Simons foundation). This time four years ago, I wrote a blog post discussing one result from each of the 2010 medallists; I thought I would try to repeat the exercise here, although the work of the medallists this time around is a little bit further away from my own direct area of expertise than last time, and so my discussion will unfortunately be a bit superficial (and possibly not completely accurate) in places. As before, I am picking these results based on my own idiosyncratic tastes, and they should not be viewed as necessarily being the “best” work of these medallists. (See also the press releases for Avila, Bhargava, Hairer, and Mirzakhani.)

Artur Avila works in dynamical systems and in the study of Schrödinger operators. The work of Avila that I am most familiar with is his solution with Svetlana Jitomirskaya of the ten martini problem of Kac, so named because (according to Barry Simon) Kac offered ten martinis for its solution. The problem involves perhaps the simplest example of a Schrödinger operator with non-trivial spectral properties, namely the almost Mathieu operator {H^{\lambda,\alpha}_\theta: \ell^2({\bf Z}) \rightarrow \ell^2({\bf Z})}, defined for parameters {\alpha,\theta \in {\bf R}/{\bf Z}} and coupling constant {\lambda>0} as the discrete one-dimensional Schrödinger operator with cosine potential:

\displaystyle (H^{\lambda,\alpha}_\theta u)_n := u_{n+1} + u_{n-1} + 2\lambda \cos(2\pi(\theta+n\alpha)) u_n.

This is a bounded self-adjoint operator and thus has a spectrum {\sigma( H^{\lambda,\alpha}_\theta )} that is a compact subset of the real line; it arises in a number of physical contexts, most notably in the theory of the integer quantum Hall effect, though I will not discuss these applications here. Remarkably, the structure of this spectrum depends crucially on the Diophantine properties of the frequency {\alpha}. For instance, if {\alpha = p/q} is a rational number, then the operator is periodic with period {q}, and then basic (discrete) Floquet theory tells us that the spectrum is simply the union of {q} (possibly touching) intervals. But for irrational {\alpha} (in which case the spectrum is independent of the phase {\theta}), the situation is much more fractal in nature; for instance, in the critical case {\lambda=1} the spectrum (as a function of {\alpha}) gives rise to the Hofstadter butterfly. The “ten martini problem” asserts that for every irrational {\alpha} and every choice of coupling constant {\lambda > 0}, the spectrum is homeomorphic to a Cantor set. Prior to the work of Avila and Jitomirskaya, there were a number of partial results on this problem, notably the result of Puig establishing Cantor spectrum for a full measure set of parameters {(\lambda,\alpha)}, as well as results requiring a perturbative hypothesis, such as {\lambda} being very small or very large. The result was also already known for {\alpha} being either very close to rational (i.e. a Liouville number) or very far from rational (a Diophantine number), although the analyses for these two cases failed to meet in the middle, leaving some cases untreated. The argument of Avila and Jitomirskaya uses a wide variety of existing techniques, both perturbative and non-perturbative, to attack this problem, as well as an amusing argument by contradiction: they assume (in certain regimes) that the spectrum fails to be a Cantor set, and use this hypothesis to obtain additional Lipschitz control on the spectrum (as a function of the frequency {\alpha}), which they can then use (after much effort) to improve existing arguments and conclude that the spectrum was in fact Cantor after all!
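
For the rational case the Floquet theory is concrete enough to compute with. The following Python sketch (my own illustration, not taken from any of the papers mentioned, with arbitrary discretisation parameters) assembles the {q \times q} Bloch matrices for {\alpha = p/q} and sweeps the Bloch phase to trace out the {q} bands; stacking the output over many rationals {p/q} reproduces the familiar Hofstadter butterfly picture.

```python
import numpy as np

def bands(p, q, lam=1.0, theta=0.0, n_k=50):
    """Band edges of the almost Mathieu operator at alpha = p/q.

    For rational alpha the operator is q-periodic, and its spectrum is
    the union, over the Bloch phase k, of the eigenvalues of a q x q
    Hermitian matrix with cosine diagonal and wrap-around hoppings.
    """
    alpha = p / q
    diag = 2 * lam * np.cos(2 * np.pi * (theta + np.arange(q) * alpha))
    evals = []
    for k in np.linspace(0.0, np.pi / q, n_k):
        H = np.diag(diag).astype(complex)
        H += np.diag(np.ones(q - 1), 1) + np.diag(np.ones(q - 1), -1)
        H[0, q - 1] += np.exp(-1j * k * q)   # hopping that wraps around the period
        H[q - 1, 0] += np.exp(+1j * k * q)
        evals.append(np.linalg.eigvalsh(H))
    evals = np.array(evals)
    return evals.min(axis=0), evals.max(axis=0)   # bottom and top of each band

lo, hi = bands(1, 3)
for a, b in zip(lo, hi):
    print(f"band [{a:+.3f}, {b:+.3f}]")   # three (possibly touching) intervals
```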

Manjul Bhargava produces amazingly beautiful mathematics, though most of it is outside of my own area of expertise. One part of his work that touches on an area of my own interest (namely, random matrix theory) is his ongoing work with many co-authors on modeling (both conjecturally and rigorously) the statistics of various key number-theoretic features of elliptic curves (such as their rank, their Selmer group, or their Tate-Shafarevich group). For instance, with Kane, Lenstra, Poonen, and Rains, Manjul has proposed a very general random matrix model that predicts all of these statistics (for instance, predicting that the {p}-component of the Tate-Shafarevich group is distributed like the cokernel of a certain random {p}-adic matrix, very much in the spirit of the Cohen-Lenstra heuristics discussed in this previous post). But what is even more impressive is that Manjul and his coauthors have been able to verify several non-trivial fragments of this model (e.g. showing that certain moments have the predicted asymptotics), giving for the first time non-trivial upper and lower bounds for various statistics, for instance obtaining lower bounds on how often an elliptic curve has rank {0} or rank {1}, leading most recently (in combination with existing work of Gross-Zagier and of Kolyvagin, among others) to his amazing result with Skinner and Zhang that at least {66\%} of all elliptic curves over {{\bf Q}} (ordered by height) obey the Birch and Swinnerton-Dyer conjecture. Previously it was not even known that a positive proportion of curves obeyed the conjecture. This is still a fair ways from resolving the conjecture fully (in particular, the situation with the presumably small number of curves of rank {2} and higher is still very poorly understood, and the theory of Gross-Zagier and Kolyvagin that this work relies on is not available for number fields other than {{\bf Q}}), but it certainly does provide hope that the conjecture could be within reach in a statistical sense at least.
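
To give a flavour of the random matrix philosophy in a toy form only (this is my own illustration, far simpler than the alternating {p}-adic matrices actually used in that work), here is a quick simulation of the most basic Cohen-Lenstra-type prediction: the cokernel {{\bf Z}_p^n / M {\bf Z}_p^n} of a random matrix {M} is trivial precisely when {M} is invertible mod {p}, which for large {n} occurs with probability {\prod_{k \geq 1} (1-p^{-k})}.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 3, 8, 20_000

def invertible_mod_p(M):
    """Rank computation over F_p by Gaussian elimination."""
    M = M % p
    rank = 0
    for col in range(n):
        piv = next((r for r in range(rank, n) if M[r, col]), None)
        if piv is None:
            continue
        M[[rank, piv]] = M[[piv, rank]]
        M[rank] = (M[rank] * pow(int(M[rank, col]), -1, p)) % p
        for r in range(n):
            if r != rank:
                M[r] = (M[r] - M[r, col] * M[rank]) % p
        rank += 1
    return rank == n

hits = sum(invertible_mod_p(rng.integers(0, p, (n, n))) for _ in range(trials))
pred = np.prod([1.0 - p ** (-k) for k in range(1, 30)])
print(hits / trials, pred)   # both approximately 0.560 for p = 3
```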

Martin Hairer works at the interface between probability and partial differential equations, and in particular in the theory of stochastic partial differential equations (SPDEs). The result of his that is closest to my own interests is his remarkable demonstration with Jonathan Mattingly of a unique invariant measure for the two-dimensional stochastically forced Navier-Stokes equation

\displaystyle \partial_t u + (u \cdot \nabla) u = \nu \Delta u - \nabla p + \xi

\displaystyle \nabla \cdot u = 0

on the two-torus {({\bf R}/{\bf Z})^2}, where {\xi} is a Gaussian field that forces a fixed set of frequencies. It is expected that for any reasonable choice of initial data, the solution to this equation should asymptotically be distributed according to Kolmogorov’s power law, as discussed in this previous post. This is still far from established rigorously (although there are some results in this direction for dyadic models, see e.g. this paper of Cheskidov, Shvydkoy, and Friedlander). However, Hairer and Mattingly were able to show that there is a unique probability distribution to which the solution for almost every choice of initial data converges asymptotically; by the ergodic theorem, this is equivalent to demonstrating the existence and uniqueness of an invariant measure for the flow. Existence can be established using standard methods, but uniqueness is much more difficult. One of the standard routes to uniqueness is to establish a “strong Feller property” that enforces some continuity on the transition operators; among other things, this would mean that two ergodic probability measures with intersecting supports would in fact have a non-trivial common component, contradicting the ergodic theorem (which forces different ergodic measures to be mutually singular). Since all ergodic measures for Navier-Stokes can be seen to contain the origin in their support, this would give uniqueness. Unfortunately, the strong Feller property is unlikely to hold in the infinite-dimensional phase space for Navier-Stokes; but Hairer and Mattingly develop a clean abstract substitute for this property, which they call the asymptotic strong Feller property, which is again a regularity property on the transition operator; this in turn is then demonstrated by a careful application of Malliavin calculus.
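
The flavour of this uniqueness statement can be seen in a vastly simplified toy model (my own, with arbitrary parameters): a single damped Fourier mode driven by white noise, i.e. an Ornstein-Uhlenbeck process {dX = -\nu X\ dt + \sigma\ dW}. Whatever the initial data, the law of the solution converges to the same invariant Gaussian measure {N(0, \sigma^2/2\nu)}; Hairer and Mattingly establish this kind of uniqueness in the infinitely harder Navier-Stokes setting.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, sigma, dt, steps, paths = 1.0, 0.8, 1e-2, 5_000, 2_000

def terminal_sample(x0):
    """Euler-Maruyama for dX = -nu X dt + sigma dW, from constant initial data x0."""
    X = np.full(paths, x0)
    for _ in range(steps):
        X += -nu * X * dt + sigma * np.sqrt(dt) * rng.standard_normal(paths)
    return X

for x0 in (-5.0, 5.0):
    X = terminal_sample(x0)
    print(f"x0 = {x0:+.0f}: mean {X.mean():+.3f}, variance {X.var():.3f}")
    # both runs give mean ~0 and variance ~ sigma^2/(2 nu) = 0.32
```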

Maryam Mirzakhani has mostly focused on the geometry and dynamics of Teichmüller-type moduli spaces, such as the moduli space of Riemann surfaces with a fixed genus and a fixed number of cusps (or with a fixed number of boundaries that are geodesics of a prescribed length). These spaces have an incredibly rich structure, ranging from geometric structure (such as the Kähler geometry given by the Weil-Petersson metric), to dynamical structure (through the action of the mapping class group on this and related spaces), to algebraic structure (viewing these spaces as algebraic varieties), and are thus connected to many other objects of interest in geometry and dynamics. For instance, by developing a new recursive formula for the Weil-Petersson volume of this space, Mirzakhani was able to asymptotically count the number of simple prime geodesics of length up to some threshold {L} in a hyperbolic surface (or more precisely, she obtained asymptotics for the number of such geodesics in a given orbit of the mapping class group); the answer turns out to be polynomial in {L}, in contrast to the much larger class of non-simple prime geodesics, whose asymptotics are exponential in {L} (the “prime number theorem for geodesics”, developed in a classic series of works by Delsarte, Huber, Selberg, and Margulis); she also used this formula to establish a new proof of a conjecture of Witten on intersection numbers that was first proven by Kontsevich. More recently, in two lengthy papers with Eskin and with Eskin-Mohammadi, Mirzakhani established rigidity theorems for the action of {SL_2({\bf R})} on such moduli spaces that are close analogues of Ratner’s celebrated rigidity theorems for unipotently generated groups (discussed in this previous blog post). Ratner’s theorems are already notoriously difficult to prove, and rely very much on the polynomial stability properties of unipotent flows; in this even more complicated setting, the unipotent flows are no longer tractable, and Mirzakhani instead uses a recent “exponential drift” method of Benoist and Quint as a substitute. Ratner’s theorems are incredibly useful for all sorts of problems connected to homogeneous dynamics, and the analogous theorems established by Mirzakhani, Eskin, and Mohammadi have a similarly broad range of applications, for instance in counting periodic billiard trajectories in rational polygons.

In the traditional foundations of probability theory, one selects a probability space {(\Omega, {\mathcal B}, {\mathbf P})}, and makes a distinction between deterministic mathematical objects, which do not depend on the sampled state {\omega \in \Omega}, and stochastic (or random) mathematical objects, which do depend (but in a measurable fashion) on the sampled state {\omega \in \Omega}. For instance, a deterministic real number would just be an element {x \in {\bf R}}, whereas a stochastic real number (or real random variable) would be a measurable function {x: \Omega \rightarrow {\bf R}}, where in this post {{\bf R}} will always be endowed with the Borel {\sigma}-algebra. (For readers familiar with nonstandard analysis, the adjectives “deterministic” and “stochastic” will be used here in a manner analogous to the uses of the adjectives “standard” and “nonstandard” in nonstandard analysis. The analogy is particularly close when comparing with the “cheap nonstandard analysis” discussed in this previous blog post. We will also use “relative to {\Omega}” as a synonym for “stochastic”.)

Actually, for our purposes we will adopt the philosophy of identifying stochastic objects that agree almost surely, so if one was to be completely precise, we should define a stochastic real number to be an equivalence class {[x]} of measurable functions {x: \Omega \rightarrow {\bf R}}, up to almost sure equivalence. However, we shall often abuse notation and write {[x]} simply as {x}.

More generally, given any measurable space {X = (X, {\mathcal X})}, we can talk either about deterministic elements {x \in X}, or about stochastic elements of {X}, that is to say equivalence classes {[x]} of measurable maps {x: \Omega \rightarrow X} up to almost sure equivalence. We will use {\Gamma(X|\Omega)} to denote the set of all stochastic elements of {X}. (For readers familiar with sheaves, it may be helpful for the purposes of this post to think of {\Gamma(X|\Omega)} as the space of measurable global sections of the trivial {X}-bundle over {\Omega}.) Of course every deterministic element {x} of {X} can also be viewed as a stochastic element {x|\Omega \in \Gamma(X|\Omega)} given by (the equivalence class of) the constant function {\omega \mapsto x}, thus giving an embedding of {X} into {\Gamma(X|\Omega)}. We do not attempt here to give an interpretation of {\Gamma(X|\Omega)} for sets {X} that are not equipped with a {\sigma}-algebra {{\mathcal X}}.

Remark 1 In my previous post on the foundations of probability theory, I emphasised the freedom to extend the sample space {(\Omega, {\mathcal B}, {\mathbf P})} to a larger sample space whenever one wished to inject additional sources of randomness. This is of course an important freedom to possess (and in the current formalism, is the analogue of the important operation of base change in algebraic geometry), but in this post we will focus on a single fixed sample space {(\Omega, {\mathcal B}, {\mathbf P})}, and not consider extensions of this space, so that one only has to consider two types of mathematical objects (deterministic and stochastic), as opposed to having many more such types, one for each potential choice of sample space (with the deterministic objects corresponding to the case when the sample space collapses to a point).

Any (measurable) {k}-ary operation on deterministic mathematical objects then extends to their stochastic counterparts by applying the operation pointwise. For instance, the addition operation {+: {\bf R} \times {\bf R} \rightarrow {\bf R}} on deterministic real numbers extends to an addition operation {+: \Gamma({\bf R}|\Omega) \times \Gamma({\bf R}|\Omega) \rightarrow \Gamma({\bf R}|\Omega)}, by defining the class {[x]+[y]} for {x,y: \Omega \rightarrow {\bf R}} to be the equivalence class of the function {\omega \mapsto x(\omega) + y(\omega)}; this operation is easily seen to be well-defined. More generally, any measurable {k}-ary deterministic operation {O: X_1 \times \dots \times X_k \rightarrow Y} between measurable spaces {X_1,\dots,X_k,Y} extends to a stochastic operation {O: \Gamma(X_1|\Omega) \times \dots \times \Gamma(X_k|\Omega) \rightarrow \Gamma(Y|\Omega)} in the obvious manner.

There is a similar story for {k}-ary relations {R: X_1 \times \dots \times X_k \rightarrow \{\hbox{true},\hbox{false}\}}, although here one has to make a distinction between a deterministic reading of the relation and a stochastic one. Namely, if we are given stochastic objects {x_i \in \Gamma(X_i|\Omega)} for {i=1,\dots,k}, the relation {R(x_1,\dots,x_k)} does not necessarily take values in the deterministic Boolean algebra {\{ \hbox{true}, \hbox{false}\}}, but only in the stochastic Boolean algebra {\Gamma(\{ \hbox{true}, \hbox{false}\}|\Omega)} – thus {R(x_1,\dots,x_k)} may be true with some positive probability and also false with some positive probability (with the event that {R(x_1,\dots,x_k)} is stochastically true being determined up to null events). Of course, the deterministic Boolean algebra embeds in the stochastic one, so we can talk about a relation {R(x_1,\dots,x_k)} being deterministically true or deterministically false, which (due to our identification of stochastic objects that agree almost surely) means that {R(x_1(\omega),\dots,x_k(\omega))} is almost surely true or almost surely false respectively. For instance given two stochastic objects {x,y}, one can view their equality relation {x=y} as having a stochastic truth value. This is distinct from the way the equality symbol {=} is used in mathematical logic, which we will now call “equality in the deterministic sense” to reduce confusion. Thus, {x=y} in the deterministic sense if and only if the stochastic truth value of {x=y} is equal to {\hbox{true}}, that is to say that {x(\omega)=y(\omega)} for almost all {\omega}.
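
Both conventions are easy to see in a toy model with a finite sample space, where measurability and almost-sure equivalence become trivial. The following Python sketch (entirely illustrative; all names are mine) lifts a deterministic operation and a deterministic relation pointwise, and computes the probability that a stochastic truth value is true.

```python
from fractions import Fraction

Omega = ["w1", "w2", "w3"]
mu = {"w1": Fraction(1, 2), "w2": Fraction(1, 4), "w3": Fraction(1, 4)}

def deterministic(c):
    """Embed a deterministic object as the constant function on Omega."""
    return {w: c for w in Omega}

def lift(op):
    """Lift a k-ary deterministic operation to stochastic elements, pointwise in w."""
    def stochastic_op(*xs):
        return {w: op(*(x[w] for x in xs)) for w in Omega}
    return stochastic_op

add = lift(lambda a, b: a + b)   # stochastic addition
eq = lift(lambda a, b: a == b)   # stochastic truth value of the relation "a = b"

x = {"w1": 1.0, "w2": -2.0, "w3": 1.0}          # a stochastic real
print(add(x, deterministic(3.0)))               # {'w1': 4.0, 'w2': 1.0, 'w3': 4.0}
truth = eq(x, deterministic(1.0))               # true on w1, w3; false on w2
print(sum(mu[w] for w in Omega if truth[w]))    # 3/4: neither det. true nor det. false
```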

Any universal identity for deterministic operations (or universal implication between identities) extends to their stochastic counterparts: for instance, addition is commutative, associative, and cancellative on the space of deterministic reals {{\bf R}}, and is therefore commutative, associative, and cancellative on stochastic reals {\Gamma({\bf R}|\Omega)} as well. However, one has to be more careful when working with mathematical laws that are not expressible as universal identities, or implications between identities. For instance, {{\bf R}} is an integral domain: if {x_1,x_2 \in {\bf R}} are deterministic reals such that {x_1 x_2=0}, then one must have {x_1=0} or {x_2=0}. However, if {x_1, x_2 \in \Gamma({\bf R}|\Omega)} are stochastic reals such that {x_1 x_2 = 0} (in the deterministic sense), then it is no longer necessarily the case that {x_1=0} (in the deterministic sense) or that {x_2=0} (in the deterministic sense); however, it is still true that “{x_1=0} or {x_2=0}” is true in the deterministic sense if one interprets the boolean operator “or” stochastically, thus “{x_1(\omega)=0} or {x_2(\omega)=0}” is true for almost all {\omega}. Another way to properly obtain a stochastic interpretation of the integral domain property of {{\bf R}} is to rewrite it as

\displaystyle  x_1,x_2 \in {\bf R}, x_1 x_2 = 0 \implies x_i=0 \hbox{ for some } i \in \{1,2\}

and then make all sets stochastic to obtain the true statement

\displaystyle  x_1,x_2 \in \Gamma({\bf R}|\Omega), x_1 x_2 = 0 \implies x_i=0 \hbox{ for some } i \in \Gamma(\{1,2\}|\Omega),

thus we have to allow the index {i} for which vanishing {x_i=0} occurs to also be stochastic, rather than deterministic. (A technical note: when one proves this statement, one has to select {i} in a measurable fashion; for instance, one can choose {i(\omega)} to equal {1} when {x_1(\omega)=0}, and {2} otherwise (so that in the “tie-breaking” case when {x_1(\omega)} and {x_2(\omega)} both vanish, one always selects {i(\omega)} to equal {1}).)
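
In the same toy finite model as above, the measurable tie-breaking rule of the technical note takes one line: the index {i(\omega)} is chosen state by state, defaulting to {1} whenever {x_1(\omega)} vanishes. (A sketch only; in general one must of course check that such recipes are measurable.)

```python
x1 = {"w1": 0.0, "w2": 3.0, "w3": 0.0}
x2 = {"w1": 5.0, "w2": 0.0, "w3": 0.0}
assert all(x1[w] * x2[w] == 0 for w in x1)      # x1 * x2 = 0 in every state
i = {w: 1 if x1[w] == 0.0 else 2 for w in x1}   # a stochastic element of {1, 2}
print(i)   # {'w1': 1, 'w2': 2, 'w3': 1} -- the tie at w3 is broken towards 1
```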

Similarly, the law of the excluded middle fails when interpreted deterministically, but remains true when interpreted stochastically: if {S} is a stochastic statement, then it is not necessarily the case that {S} is either deterministically true or deterministically false; however the sentence “{S} or not-{S}” is still deterministically true if the boolean operator “or” is interpreted stochastically rather than deterministically.

To avoid having to keep pointing out which operations are interpreted stochastically and which ones are interpreted deterministically, we will use the following convention: if we assert that a mathematical sentence {S} involving stochastic objects is true, then (unless otherwise specified) we mean that {S} is deterministically true, assuming that all relations used inside {S} are interpreted stochastically. For instance, if {x,y} are stochastic reals, when we assert that “Exactly one of {x < y}, {x=y}, or {x>y} is true”, then by default it is understood that the relations {<}, {=}, {>} and the boolean operator “exactly one of” are interpreted stochastically, and the assertion is that the sentence is deterministically true.

In the above discussion, the stochastic objects {x} being considered were elements of a deterministic space {X}, such as the reals {{\bf R}}. However, it can often be convenient to generalise this situation by allowing the ambient space {X} to also be stochastic. For instance, one might wish to consider a stochastic vector {v(\omega)} inside a stochastic vector space {V(\omega)}, or a stochastic edge {e} of a stochastic graph {G(\omega)}. In order to formally describe this situation within the classical framework of measure theory, one needs to place all the ambient spaces {X(\omega)} inside a measurable space. This can certainly be done in many contexts (e.g. when considering random graphs on a deterministic set of vertices, or if one is willing to work up to equivalence and place the ambient spaces inside a suitable moduli space), but is not completely natural in other contexts. For instance, if one wishes to consider stochastic vector spaces of potentially unbounded dimension (in particular, potentially larger than any given cardinal that one might specify in advance), then the class of all possible vector spaces is so large that it becomes a proper class rather than a set (even if one works up to equivalence), making it problematic to give this class the structure of a measurable space; furthermore, even once one does so, one needs to take additional care to pin down what it would mean for a random vector {\omega \mapsto v_\omega} lying in a random vector space {\omega \mapsto V_\omega} to depend “measurably” on {\omega}.

Of course, in any reasonable application one can at least avoid the set-theoretic issues by various ad hoc means, for instance by restricting the dimension of all spaces involved to some fixed cardinal such as {2^{\aleph_0}}. However, the measure-theoretic issues can require some additional effort to resolve properly.

In this post I would like to describe a different way to formalise stochastic spaces, and stochastic elements of these spaces, by viewing the spaces as measure-theoretic analogues of sheaves, but over the probability space {\Omega} rather than over a topological space; stochastic objects are then sections of such sheaves. Actually, for minor technical reasons it is convenient to work in the slightly more general setting in which the base space {\Omega} is a finite measure space {(\Omega, {\mathcal B}, \mu)} rather than a probability space, thus {\mu(\Omega)} can take any value in {[0,+\infty)} rather than being normalised to equal {1}. This will allow us to easily localise to subevents {\Omega'} of {\Omega} without the need for normalisation, even when {\Omega'} is a null event (though we caution that the map {x \mapsto x|\Omega'} from deterministic objects {x} ceases to be injective in this latter case). We will however still continue to use probabilistic terminology despite the lack of normalisation; thus for instance, sets {E} in {{\mathcal B}} will be referred to as events, the measure {\mu(E)} of such a set will be referred to as the probability (which is now permitted to exceed {1} in some cases), and an event whose complement is a null event shall be said to hold almost surely. It is in fact likely that almost all of the theory below extends to base spaces which are {\sigma}-finite rather than finite (for instance, by damping the measure to become finite, without introducing any further null events), although we will not pursue this further generalisation here.

The approach taken in this post is “topos-theoretic” in nature (although we will not use the language of topoi explicitly here), and is well suited to a “pointless” or “point-free” approach to probability theory, in which the role of the stochastic state {\omega \in \Omega} is suppressed as much as possible; instead, one strives to always adopt a “relative point of view”, with all objects under consideration being viewed as stochastic objects relative to the underlying base space {\Omega}. In this perspective, the stochastic version of a set is as follows.

Definition 1 (Stochastic set) Unless otherwise specified, we assume that we are given a fixed finite measure space {\Omega = (\Omega, {\mathcal B}, \mu)} (which we refer to as the base space). A stochastic set (relative to {\Omega}) is a tuple {X|\Omega = (\Gamma(X|E)_{E \in {\mathcal B}}, ((\cdot|E))_{E \subset F; E,F \in {\mathcal B}})} consisting of the following objects:

  • A set {\Gamma(X|E)} assigned to each event {E \in {\mathcal B}}; and
  • A restriction map {x \mapsto x|E} from {\Gamma(X|F)} to {\Gamma(X|E)} for each pair {E \subset F} of nested events {E,F \in {\mathcal B}}. (Strictly speaking, one should indicate the dependence on {F} in the notation for the restriction map, e.g. using {x \mapsto x|(E \leftarrow F)} instead of {x \mapsto x|E}, but we will abuse notation by omitting the {F} dependence.)

We refer to elements of {\Gamma(X|E)} as local stochastic elements of the stochastic set {X|\Omega}, localised to the event {E}, and elements of {\Gamma(X|\Omega)} as global stochastic elements (or simply elements) of the stochastic set. (In the language of sheaves, one would use “sections” instead of “elements” here, but I prefer to use the latter terminology here, for compatibility with conventional probabilistic notation, where for instance measurable maps from {\Omega} to {{\bf R}} are referred to as real random variables, rather than sections of the reals.)

Furthermore, we impose the following axioms:

  • (Category) The map {x \mapsto x|E} from {\Gamma(X|E)} to {\Gamma(X|E)} is the identity map, and if {E \subset F \subset G} are events in {{\mathcal B}}, then {((x|F)|E) = (x|E)} for all {x \in \Gamma(X|G)}.
  • (Null events trivial) If {E \in {\mathcal B}} is a null event, then the set {\Gamma(X|E)} is a singleton set. (In particular, {\Gamma(X|\emptyset)} is always a singleton set; this is analogous to the convention that {x^0=1} for any number {x}.)
  • (Countable gluing) Suppose that for each natural number {n}, one has an event {E_n \in {\mathcal B}} and an element {x_n \in \Gamma(X|E_n)} such that {x_n|(E_n \cap E_m) = x_m|(E_n \cap E_m)} for all {n,m}. Then there exists a unique {x\in \Gamma(X|\bigcup_{n=1}^\infty E_n)} such that {x_n = x|E_n} for all {n}.

If {\Omega'} is an event in {\Omega}, we define the localisation {X|\Omega'} of the stochastic set {X|\Omega} to {\Omega'} to be the stochastic set

\displaystyle X|\Omega' := (\Gamma(X|E)_{E \in {\mathcal B}; E \subset \Omega'}, ((\cdot|E))_{E \subset F \subset \Omega'; E,F \in {\mathcal B}})

relative to {\Omega'}. (Note that there is no need to renormalise the measure on {\Omega'}, as we are not demanding that our base space have total measure {1}.)

The following fact is useful for actually verifying that a given object indeed has the structure of a stochastic set:

Exercise 1 Show that to verify the countable gluing axiom of a stochastic set, it suffices to do so under the additional hypothesis that the events {E_n} are disjoint. (Note that this is quite different from the situation with sheaves over a topological space, in which the analogous gluing axiom is often trivial in the disjoint case but has non-trivial content in the overlapping case. This is ultimately because a {\sigma}-algebra is closed under all Boolean operations, whereas a topology is only closed under union and intersection.)

Let us illustrate the concept of a stochastic set with some examples.

Example 1 (Discrete case) A simple case arises when {\Omega} is a discrete space which is at most countable. Suppose we assign a set {X_\omega} to each {\omega \in \Omega}, with {X_\omega} a singleton whenever {\mu(\{\omega\})=0}. One then sets {\Gamma(X|E) := \prod_{\omega \in E} X_\omega}, with the obvious restriction maps, giving rise to a stochastic set {X|\Omega}. (Thus, a local element {x} of {\Gamma(X|E)} can be viewed as a map {\omega \mapsto x(\omega)} on {E} that takes values in {X_\omega} for each {\omega \in E}.) Conversely, it is not difficult to see that any stochastic set over an at most countable discrete probability space {\Omega} is of this form up to isomorphism. In this case, one can think of {X|\Omega} as a bundle of sets {X_\omega} over each point {\omega} (of positive probability) in the base space {\Omega}. One can extend this bundle interpretation of stochastic sets to reasonably nice sample spaces {\Omega} (such as standard Borel spaces) and similarly reasonable {X}; however, I would like to avoid this interpretation in the formalism below in order to be able to easily work in settings in which {\Omega} and {X} are very “large” (e.g. not separable in any reasonable sense). Note that we permit some of the {X_\omega} to be empty, thus it is possible for {\Gamma(X|\Omega)} to be empty whilst {\Gamma(X|E)} is non-empty for some strict subevents {E} of {\Omega}. (This is analogous to how it is possible for a sheaf to have local sections but no global sections.) As such, the space {\Gamma(X|\Omega)} of global elements does not completely determine the stochastic set {X|\Omega}; one sometimes needs to localise to an event {E} in order to see the full structure of such a set. Thus it is important to distinguish between a stochastic set {X|\Omega} and its space {\Gamma(X|\Omega)} of global elements. (As such, it is a slight abuse of the axiom of extensionality to refer to global elements of {X|\Omega} simply as “elements”, but hopefully this should not cause too much confusion.)
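
Here is a minimal computational rendering of this discrete case (a sketch under the assumption that every point of {\Omega} has positive measure, so that no quotienting by null events is needed; the encoding is mine). Note how an empty fibre produces local sections but no global ones.

```python
fibres = {                       # the bundle omega -> X_omega
    "w1": {"a", "b"},
    "w2": {"c"},
    "w3": set(),                 # an empty fibre is permitted
}

def sections(E):
    """Gamma(X|E): all choice functions picking a point of each fibre over E."""
    out = [dict()]
    for w in sorted(E):
        out = [dict(s, **{w: v}) for s in out for v in fibres[w]]
    return out

def restrict(x, E):
    """The restriction map from Gamma(X|F) to Gamma(X|E), for E a subset of F."""
    return {w: x[w] for w in E}

def glue(pieces):
    """Gluing along disjoint events: the union of compatible local sections."""
    x = {}
    for piece in pieces:
        x.update(piece)
    return x

print(sections({"w1", "w2"}))    # two local sections over {w1, w2}
print(sections({"w1", "w3"}))    # [] -- no sections over events meeting w3
print(glue([{"w1": "a"}, {"w2": "c"}]))   # a glued section over {w1, w2}
```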

Example 2 (Measurable spaces as stochastic sets) Returning now to a general base space {\Omega}, any (deterministic) measurable space {X} gives rise to a stochastic set {X|\Omega}, with {\Gamma(X|E)} defined as in the previous discussion as the measurable functions from {E} to {X} modulo almost everywhere equivalence (in particular, {\Gamma(X|E)} is a singleton set when {E} is null), with the usual restriction maps. The constraint of measurability on the maps {x: E \rightarrow X}, together with the quotienting by almost sure equivalence, means that {\Gamma(X|E)} is now more complicated than a plain Cartesian product {\prod_{\omega \in E} X_\omega} of fibres, but this still serves as a useful first approximation to what {\Gamma(X|E)} is for the purposes of developing intuition. Indeed, the measurability constraint is so weak (as compared for instance to topological or smooth constraints in other contexts, such as sheaves of continuous or smooth sections of bundles) that the intuition of essentially independent fibres is quite an accurate one, at least if one avoids consideration of an uncountable number of objects simultaneously.

Example 3 (Extended Hilbert modules) This example is the one that motivated this post for me. Suppose that one has an extension {(\tilde \Omega, \tilde {\mathcal B}, \tilde \mu)} of the base space {(\Omega, {\mathcal B},\mu)}, thus we have a measurable factor map {\pi: \tilde \Omega \rightarrow \Omega} such that the pushforward of the measure {\tilde \mu} by {\pi} is equal to {\mu}. Then we have a conditional expectation operator {\pi_*: L^2(\tilde \Omega,\tilde {\mathcal B},\tilde \mu) \rightarrow L^2(\Omega,{\mathcal B},\mu)}, defined as the adjoint of the pullback map {\pi^*: L^2(\Omega,{\mathcal B},\mu) \rightarrow L^2(\tilde \Omega,\tilde {\mathcal B},\tilde \mu)}. As is well known, the conditional expectation operator also extends to a contraction {\pi_*: L^1(\tilde \Omega,\tilde {\mathcal B},\tilde \mu) \rightarrow L^1(\Omega,{\mathcal B}, \mu)}; by monotone convergence we may also extend {\pi_*} to a map from measurable functions from {\tilde \Omega} to the extended non-negative reals {[0,+\infty]}, to measurable functions from {\Omega} to {[0,+\infty]}. We then define the “extended Hilbert module” {L^2(\tilde \Omega|\Omega)} to be the space of functions {f \in L^2(\tilde \Omega,\tilde {\mathcal B},\tilde \mu)} with {\pi_*(|f|^2)} finite almost everywhere. This is an extended version of the Hilbert module {L^\infty_{\Omega} L^2(\tilde \Omega|\Omega)}, which is defined similarly except that {\pi_*(|f|^2)} is required to lie in {L^\infty(\Omega,{\mathcal B},\mu)}; this is a Hilbert module over {L^\infty(\Omega, {\mathcal B}, \mu)} which is of particular importance in the Furstenberg-Zimmer structure theory of measure-preserving systems. We can then define the stochastic set {L^2_\pi(\tilde \Omega)|\Omega} by setting

\displaystyle  \Gamma(L^2_\pi(\tilde \Omega)|E) := L^2( \pi^{-1}(E) | E )

with the obvious restriction maps. In the case that {\Omega,\tilde \Omega} are standard Borel spaces, one can disintegrate {\tilde \mu} as an integral {\tilde \mu = \int_\Omega \nu_\omega\ d\mu(\omega)} of probability measures {\nu_\omega} (supported in the fibre {\pi^{-1}(\{\omega\})}), in which case this stochastic set can be viewed as having fibres {L^2( \tilde \Omega, \tilde {\mathcal B}, \nu_\omega )} (though if {\Omega} is not discrete, there are still some measurability conditions in {\omega} on the local and global elements that need to be imposed). However, I am interested in the case when {\Omega,\tilde \Omega} are not standard Borel spaces (in fact, I will take them to be algebraic probability spaces, as defined in this previous post), in which case disintegrations are not available. However, it appears that the stochastic analysis developed in this blog post can serve as a substitute for the tool of disintegration in this context.
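
For intuition, the conditional expectation {\pi_*} can be written down completely explicitly when everything is finite. The sketch below (all data invented for illustration) averages over the fibres of {\pi} and then checks numerically that {\pi_*} is indeed the adjoint of the pullback {\pi^*} with respect to the weighted inner products.

```python
import numpy as np

tilde_mu = np.array([0.2, 0.3, 0.1, 0.4])    # measure on tilde_Omega = {0,1,2,3}
pi = np.array([0, 0, 1, 1])                  # factor map onto Omega = {0,1}
mu = np.array([tilde_mu[pi == w].sum() for w in range(2)])   # pushforward measure

def pushforward(f):
    """pi_*: average f over each fibre of pi, weighted by tilde_mu."""
    return np.array([np.dot(tilde_mu[pi == w], f[pi == w]) / mu[w] for w in range(2)])

def pullback(g):
    """pi^*: compose g with pi."""
    return g[pi]

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([5.0, -1.0])
lhs = np.dot(tilde_mu, pullback(g) * f)      # <pi^* g, f> upstairs
rhs = np.dot(mu, g * pushforward(f))         # <g, pi_* f> downstairs
print(lhs, rhs)                              # equal (2.1): pi_* is the adjoint
```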

We make the remark that if {X|\Omega} is a stochastic set and {E, F} are events that are equivalent up to null events, then one can identify {\Gamma(X|E)} with {\Gamma(X|F)} (through their common restriction to {\Gamma(X|(E \cap F))}, with the restriction maps now being bijections). As such, the notion of a stochastic set does not require the full structure of a concrete probability space {(\Omega, {\mathcal B}, {\mathbf P})}; one could also have defined the notion using only the abstract {\sigma}-algebra consisting of {{\mathcal B}} modulo null events as the base space, or equivalently one could define stochastic sets over the algebraic probability spaces defined in this previous post. However, we will stick with the classical formalism of concrete probability spaces here so as to keep the notation reasonably familiar.

As a corollary of the above observation, we see that if the base space {\Omega} has total measure {0}, then all stochastic sets are trivial (they are just points).

Exercise 2 If {X|\Omega} is a stochastic set, show that there exists an event {\Omega'} with the property that for any event {E}, {\Gamma(X|E)} is non-empty if and only if {E} is contained in {\Omega'} modulo null events. (In particular, {\Omega'} is unique up to null events.) Hint: consider the numbers {\mu( E )} for {E} ranging over all events with {\Gamma(X|E)} non-empty, and form a maximising sequence for these numbers. Then use all three axioms of a stochastic set.

One can now take many of the fundamental objects, operations, and results in set theory (and, hence, in most other categories of mathematics) and establish analogues relative to a finite measure space. Implicitly, what we will be doing in the next few paragraphs is endowing the category of stochastic sets with the structure of an elementary topos. However, to keep things reasonably concrete, we will not explicitly emphasise the topos-theoretic formalism here, although it is certainly lurking in the background.

Firstly, we define a stochastic function {f: X|\Omega \rightarrow Y|\Omega} between two stochastic sets {X|\Omega, Y|\Omega} to be a collection of maps {f: \Gamma(X|E) \rightarrow \Gamma(Y|E)} for each {E \in {\mathcal B}} which form a natural transformation in the sense that {f(x|E) = f(x)|E} for all {x \in \Gamma(X|F)} and nested events {E \subset F}. In the case when {\Omega} is discrete and at most countable (and after deleting all null points), a stochastic function is nothing more than a collection of functions {f_\omega: X_\omega \rightarrow Y_\omega} for each {\omega \in \Omega}, with the function {f: \Gamma(X|E) \rightarrow \Gamma(Y|E)} then being a direct sum of the factor functions {f_\omega}:

\displaystyle  f( (x_\omega)_{\omega \in E} ) = ( f_\omega(x_\omega) )_{\omega \in E}.

Thus (in the discrete, at most countable setting, at least) stochastic functions do not mix together information from different states {\omega} in a sample space; the value of {f(x)} at {\omega} depends only on the value of {x} at {\omega}. The situation is a bit more subtle for continuous probability spaces, due to the identification of stochastic objects that agree almost surely; nevertheless it is still good intuition to think of stochastic functions as essentially being “pointwise” or “local” in nature.
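
In the discrete toy model, then, a stochastic function is literally a dictionary of fibre maps acting on local sections state by state; a small sketch (names mine):

```python
f_fibre = {                      # the fibre maps f_w : X_w -> Y_w
    "w1": {"a": 0, "b": 1},
    "w2": {"c": 0},
}

def apply_f(x):
    """Act on a local section x over an event E, separately in each state."""
    return {w: f_fibre[w][x[w]] for w in x}

print(apply_f({"w1": "b", "w2": "c"}))   # {'w1': 1, 'w2': 0}
print(apply_f({"w1": "a"}))              # compatible with restriction: {'w1': 0}
```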

One can now form the stochastic set {\hbox{Hom}(X \rightarrow Y)|\Omega} of functions from {X|\Omega} to {Y|\Omega}, by setting {\Gamma(\hbox{Hom}(X \rightarrow Y)|E)} for any event {E} to be the set of local stochastic functions {f: X|E \rightarrow Y|E} of the localisations of {X|\Omega, Y|\Omega} to {E}; this is a stochastic set if we use the obvious restriction maps. In the case when {\Omega} is discrete and at most countable, the fibre {\hbox{Hom}(X \rightarrow Y)_\omega} at a point {\omega} of positive measure is simply the set {Y_\omega^{X_\omega}} of functions from {X_\omega} to {Y_\omega}.

In a similar spirit, we say that one stochastic set {Y|\Omega} is a (stochastic) subset of another {X|\Omega}, and write {Y|\Omega \subset X|\Omega}, if we have a stochastic inclusion map, thus {\Gamma(Y|E) \subset \Gamma(X|E)} for all events {E}, with the restriction maps being compatible. We can then define the power set {2^X|\Omega} of a stochastic set {X|\Omega} by setting {\Gamma(2^X|E)} for any event {E} to be the set of all stochastic subsets {Y|E} of {X|E} relative to {E}; it is easy to see that {2^X|\Omega} is a stochastic set with the obvious restriction maps (one can also identify {2^X|\Omega} with {\hbox{Hom}(X \rightarrow \{\hbox{true},\hbox{false}\})|\Omega} in the obvious fashion). Again, when {\Omega} is discrete and at most countable, the fibre of {2^X|\Omega} at a point {\omega} of positive measure is simply the deterministic power set {2^{X_\omega}}.

Note that if {f: X|\Omega \rightarrow Y|\Omega} is a stochastic function and {Y'|\Omega} is a stochastic subset of {Y|\Omega}, then the inverse image {f^{-1}(Y')|\Omega}, defined by setting {\Gamma(f^{-1}(Y')|E)} for any event {E} to be the set of those {x \in \Gamma(X|E)} with {f(x) \in \Gamma(Y'|E)}, is a stochastic subset of {X|\Omega}. In particular, given a {k}-ary relation {R: X_1 \times \dots \times X_k|\Omega \rightarrow \{\hbox{true}, \hbox{false}\}|\Omega}, the inverse image {R^{-1}( \{ \hbox{true} \}|\Omega )} is a stochastic subset of {X_1 \times \dots \times X_k|\Omega}, which by abuse of notation we denote as

\displaystyle  \{ (x_1,\dots,x_k) \in X_1 \times \dots \times X_k: R(x_1,\dots,x_k) \hbox{ is true} \}|\Omega.

In a similar spirit, if {X'|\Omega} is a stochastic subset of {X|\Omega} and {f: X|\Omega \rightarrow Y|\Omega} is a stochastic function, we can define the image {f(X')|\Omega} by setting {\Gamma(f(X')|E)} to be the set of those {f(x)} with {x \in \Gamma(X'|E)}; one easily verifies that this is a stochastic subset of {Y|\Omega}.

Remark 2 One should caution that in the definition of the subset relation {Y|\Omega \subset X|\Omega}, it is important that {\Gamma(Y|E) \subset \Gamma(X|E)} for all events {E}, not just the global event {\Omega}; in particular, just because a stochastic set {X|\Omega} has no global sections, does not mean that it is contained in the stochastic empty set {\emptyset|\Omega}.

Now we discuss Boolean operations on stochastic subsets of a given stochastic set {X|\Omega}. Given two stochastic subsets {X_1|\Omega, X_2|\Omega} of {X|\Omega}, the stochastic intersection {(X_1 \cap X_2)|\Omega} is defined by setting {\Gamma((X_1 \cap X_2)|E)} to be the set of {x \in \Gamma(X|E)} that lie in both {\Gamma(X_1|E)} and {\Gamma(X_2|E)}:

\displaystyle  \Gamma((X_1 \cap X_2)|E) := \Gamma(X_1|E) \cap \Gamma(X_2|E).

This is easily verified to again be a stochastic subset of {X|\Omega}. More generally one may define stochastic countable intersections {(\bigcap_{n=1}^\infty X_n)|\Omega} for any sequence {X_n|\Omega} of stochastic subsets of {X|\Omega}. One could extend this definition to uncountable families if one wished, but I would advise against it, because some of the usual laws of Boolean algebra (e.g. the de Morgan laws) may break down in this setting.

Stochastic unions are a bit more subtle. The set {\Gamma((X_1 \cup X_2)|E)} should not be defined to simply be the union of {\Gamma(X_1|E)} and {\Gamma(X_2|E)}, as this would not respect the gluing axiom. Instead, we define {\Gamma((X_1 \cup X_2)|E)} to be the set of all {x \in \Gamma(X|E)} such that one can cover {E} by measurable subevents {E_1,E_2} with {x|E_i \in \Gamma(X_i|E_i)} for {i=1,2}; then {(X_1 \cup X_2)|\Omega} may be verified to be a stochastic subset of {X|\Omega}. Thus for instance {\{0,1\}|\Omega} is the stochastic union of {\{0\}|\Omega} and {\{1\}|\Omega}. Similarly for countable unions {(\bigcup_{n=1}^\infty X_n)|\Omega} of stochastic subsets {X_n|\Omega} of {X|\Omega}, although uncountable unions are extremely problematic (they are disliked by both the measure theory and the countable gluing axiom) and will not be defined here. Finally, the stochastic difference set {(X_1 \backslash X_2)|\Omega} is defined by taking {\Gamma((X_1 \backslash X_2)|E)} to be the set of all {x \in \Gamma(X_1|E)} such that {x|F \not \in \Gamma(X_2|F)} for any subevent {F} of {E} of positive probability. One may verify that in the case when {\Omega} is discrete and at most countable, these Boolean operations correspond to the classical Boolean operations applied separately to each fibre {X_{i,\omega}} of the relevant sets {X_i}. We also leave as an exercise to the reader to verify the usual laws of Boolean arithmetic, e.g. the de Morgan laws, provided that one works with at most countable unions and intersections.

One can also consider a stochastic finite union {(\bigcup_{n=1}^N X_n)|\Omega} in which the number {N} of sets in the union is itself stochastic. More precisely, let {X|\Omega} be a stochastic set, let {N \in {\bf N}|\Omega} be a stochastic natural number, and let {n \mapsto X_n|\Omega} be a stochastic function from the stochastic set {\{ n \in {\bf N}: n \leq N\}|\Omega} (defined by setting {\Gamma(\{n \in {\bf N}: n\leq N\}|E) := \{ n \in {\bf N}|E: n \leq N|E\}}) to the stochastic power set {2^X|\Omega}. Here we are considering {0} to be a natural number, to allow for unions that are possibly empty, with {{\bf N}_+ := {\bf N} \backslash \{0\}} used for the positive natural numbers. We also write {(X_n)_{n=1}^N|\Omega} for the stochastic function {n \mapsto X_n|\Omega}. Then we can define the stochastic union {\bigcup_{n=1}^N X_n|\Omega} by setting {\Gamma(\bigcup_{n=1}^N X_n|E)} for an event {E} to be the set of local elements {x \in \Gamma(X|E)} with the property that there exists a covering of {E} by measurable subevents {E_{n_0}} for {n_0 \in {\bf N}_+}, such that one has {n_0 \leq N|E_{n_0}} and {x|E_{n_0} \in \Gamma(X_{n_0}|E_{n_0})}. One can verify that {\bigcup_{n=1}^N X_n|\Omega} is a stochastic set (with the obvious restriction maps). Again, in the model case when {\Omega} is discrete and at most countable, the fibre {(\bigcup_{n=1}^N X_n)_\omega} is what one would expect it to be, namely {\bigcup_{n=1}^{N(\omega)} (X_n)_\omega}.

The Cartesian product {(X \times Y)|\Omega} of two stochastic sets may be defined by setting {\Gamma((X \times Y)|E) := \Gamma(X|E) \times \Gamma(Y|E)} for all events {E}, with the obvious restriction maps; this is easily seen to be another stochastic set. This lets one define the concept of a {k}-ary operation {f: (X_1 \times \dots \times X_k)|\Omega \rightarrow Y|\Omega} from {k} stochastic sets {X_1,\dots,X_k} to another stochastic set {Y}, or a {k}-ary relation {R: (X_1 \times \dots \times X_k)|\Omega \rightarrow \{\hbox{true}, \hbox{false}\}|\Omega}. In particular, given {x_i \in X_i|\Omega} for {i=1,\dots,k}, the relation {R(x_1,\dots,x_k)} may be deterministically true, deterministically false, or have some other stochastic truth value.

Remark 3 In the degenerate case when {\Omega} is null, stochastic logic becomes a bit weird: all stochastic statements are deterministically true, as are their stochastic negations, since every event in {\Omega} (even the empty set) now holds with full probability. Among other pathologies, the empty set now has a global element over {\Omega} (this is analogous to the notorious convention {0^0=1}), and any two deterministic objects {x,y} become equal over {\Omega}: {x|\Omega=y|\Omega}.

The following simple observation is crucial to subsequent discussion. If {(x_n)_{n \in {\bf N}_+}} is a sequence taking values in the global elements {\Gamma(X|\Omega)} of a stochastic space {X|\Omega}, then we may also define global elements {x_n \in \Gamma(X|\Omega)} for stochastic indices {n \in {\bf N}_+|\Omega} as well, by appealing to the countable gluing axiom to glue together {x_{n_0}} restricted to the set {\{ \omega \in \Omega: n(\omega) = n_0\}} for each deterministic natural number {n_0} to form {x_n}. With this definition, the map {n \mapsto x_n} is a stochastic function from {{\bf N}_+|\Omega} to {X|\Omega}; indeed, this creates a one-to-one correspondence between external sequences (maps {n \mapsto x_n} from {{\bf N}_+} to {\Gamma(X|\Omega)}) and stochastic sequences (stochastic functions {n \mapsto x_n} from {{\bf N}_+|\Omega} to {X|\Omega}). Similarly with {{\bf N}_+} replaced by any other at most countable set. This observation will be important in allowing many deterministic arguments involving sequences to be carried over to the stochastic setting.
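
In the finite toy model the gluing is transparent: the global element {x_n} attached to a stochastic index {n} simply reads off {x_{n(\omega)}(\omega)} in each state. A two-line sketch:

```python
xs = {1: {"w1": 10, "w2": 11}, 2: {"w1": 20, "w2": 21}}   # external sequence x_1, x_2
n = {"w1": 2, "w2": 1}                                    # a stochastic index
x_n = {w: xs[n[w]][w] for w in n}                         # glue fibre by fibre
print(x_n)   # {'w1': 20, 'w2': 11}
```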

We now specialise from the extremely broad discipline of set theory to the more focused discipline of real analysis. There are two fundamental axioms that underlie real analysis (and in particular distinguish it from real algebra). The first is the Archimedean property, which we phrase in the “no infinitesimal” formulation as follows:

Proposition 2 (Archimedean property) Let {x \in {\bf R}} be such that {x \leq 1/n} for all positive natural numbers {n}. Then {x \leq 0}.

The other is the least upper bound axiom:

Proposition 3 (Least upper bound axiom) Let {S} be a non-empty subset of {{\bf R}} which has an upper bound {M \in {\bf R}}, thus {x \leq M} for all {x \in S}. Then there exists a unique real number {\sup S \in {\bf R}} with the following properties:

  • {x \leq \sup S} for all {x \in S}.
  • For any real {L < \sup S}, there exists {x \in S} such that {L < x \leq \sup S}.
  • {\sup S \leq M}.

Furthermore, {\sup S} does not depend on the choice of {M}.

The Archimedean property extends easily to the stochastic setting:

Proposition 4 (Stochastic Archimedean property) Let {x \in \Gamma({\bf R}|\Omega)} be such that {x \leq 1/n} for all deterministic natural numbers {n}. Then {x \leq 0}.

Remark 4 Here, incidentally, is one place in which this stochastic formalism deviates from the nonstandard analysis formalism, as the latter certainly permits the existence of infinitesimal elements. On the other hand, we caution that stochastic real numbers are permitted to be unbounded, so that the alternative formulation of the Archimedean property (that every real number is bounded in magnitude by some natural number) is not valid in the stochastic setting.

The proof is easy and is left to the reader. The least upper bound axiom also extends nicely to the stochastic setting, but the proof requires more work (in particular, our argument uses the monotone convergence theorem):

Theorem 5 (Stochastic least upper bound axiom) Let {S|\Omega} be a stochastic subset of {{\bf R}|\Omega} which has a global upper bound {M \in {\bf R}|\Omega}, thus {x \leq M} for all {x \in \Gamma(S|\Omega)}, and is globally non-empty in the sense that there is at least one global element {x \in \Gamma(S|\Omega)}. Then there exists a unique stochastic real number {\sup S \in \Gamma({\bf R}|\Omega)} with the following properties:

  • {x \leq \sup S} for all {x \in \Gamma(S|\Omega)}.
  • For any stochastic real {L < \sup S}, there exists {x \in \Gamma(S|\Omega)} such that {L < x \leq \sup S}.
  • {\sup S \leq M}.

Furthermore, {\sup S} does not depend on the choice of {M}.

For future reference, we note that the same result holds with {{\bf R}} replaced by {{\bf N} \cup \{+\infty\}} throughout, since the latter may be embedded in the former, for instance by mapping {n} to {1 - \frac{1}{n+1}} and {+\infty} to {1}. In applications, the above theorem serves as a reasonable substitute for the countable axiom of choice, which does not appear to hold in unrestricted generality relative to a measure space; in particular, it can be used to generate various extremising sequences for stochastic functionals on various stochastic function spaces.

Proof: Uniqueness is clear (using the Archimedean property), as well as the independence on {M}, so we turn to existence. By using an order-preserving map from {{\bf R}} to {(-1,1)} (e.g. {x \mapsto \frac{2}{\pi} \hbox{arctan}(x)}) we may assume that {S|\Omega} is a subset of {(-1,1)|\Omega}, and that {M < 1}.

We observe that {\Gamma(S|\Omega)} is a lattice: if {x, y \in \Gamma(S|\Omega)}, then {\max(x,y)} and {\min(x,y)} also lie in {\Gamma(S|\Omega)}. Indeed, {\max(x,y)} may be formed by appealing to the countable gluing axiom to glue {y} (restricted to the set {\{ \omega \in \Omega: x(\omega) < y(\omega) \}}) with {x} (restricted to the set {\{ \omega \in \Omega: x(\omega) \geq y(\omega) \}}), and similarly for {\min(x,y)}. (Here we use the fact that relations such as {<} are Borel measurable on {{\bf R}}.)

Let {A \in {\bf R}} denote the deterministic quantity

\displaystyle  A := \sup \{ \int_\Omega x(\omega)\ d\mu(\omega): x \in \Gamma(S|\Omega) \}

then (by Proposition 3!) {A} is well-defined; here we use the hypothesis that {\mu(\Omega)} is finite. Thus we may find a sequence {(x_n)_{n \in {\bf N}}} of elements {x_n} of {\Gamma(S|\Omega)} such that

\displaystyle  \int_\Omega x_n(\omega)\ d\mu(\omega) \rightarrow A \hbox{ as } n \rightarrow \infty. \ \ \ \ \ (1)

Using the lattice property, we may assume that the {x_n} are non-decreasing: {x_n \leq x_m} whenever {n \leq m}. If we then define {\sup S(\omega) := \sup_n x_n(\omega)} (after choosing measurable representatives of each equivalence class {x_n}), then {\sup S} is a stochastic real with {\sup S \leq M}.

If {x \in \Gamma(S|\Omega)}, then {\max(x,x_n) \in \Gamma(S|\Omega)}, and so

\displaystyle  \int_\Omega \max(x,x_n)\ d\mu(\omega) \leq A.

From this and (1) we conclude that

\displaystyle  \int_\Omega \max(x-x_n,0)\ d\mu(\omega) \rightarrow 0 \hbox{ as } n \rightarrow \infty.

From monotone convergence, we conclude that

\displaystyle  \int_\Omega \max(x-\sup S,0)\ d\mu(\omega) = 0

and so {x \leq \sup S}, as required.

Now let {L < \sup S} be a stochastic real. After choosing measurable representatives of each relevant equivalence class, we see that for almost every {\omega \in \Omega}, we can find a natural number {n(\omega)} with {x_{n(\omega)}(\omega) > L(\omega)}. If we choose {n(\omega)} to be the first such positive natural number when it exists, and (say) {1} otherwise, then {n} is a stochastic positive natural number and {L < x_n}. The claim follows. \Box
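
Over a finite base space the proof collapses to something concrete: pointwise maxima sweep the integrals up towards the supremum {A}, and the limiting function is the stochastic supremum. A throwaway sketch (with the family {S} below standing in for the global elements {\Gamma(S|\Omega)}, which for this toy example is taken to be closed under pointwise maxima):

```python
mu = {"w1": 0.5, "w2": 0.5}
S = [{"w1": 0.1, "w2": 0.9}, {"w1": 0.8, "w2": 0.2}, {"w1": 0.4, "w2": 0.4}]

def integral(x):
    return sum(mu[w] * x[w] for w in mu)

x = S[0]
for y in S[1:]:
    x = {w: max(x[w], y[w]) for w in mu}   # the lattice operation from the proof
    print(integral(x))                     # 0.85, 0.85 -- climbing towards A
print(x)   # {'w1': 0.8, 'w2': 0.9}: the stochastic supremum
```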

Remark 5 One can abstract away the role of the measure {\mu} here, leaving only the ideal of null sets. The property that the measure is finite is then replaced by the more general property that given any non-empty family of measurable sets, there is an at most countable union of sets in that family that is an upper bound modulo null sets for all elements in that family.

Using Proposition 4 and Theorem 5, one can then revisit many of the other foundational results of deterministic real analysis, and develop stochastic analogues; we give some examples of this below the fold (focusing on the Heine-Borel theorem and a case of the spectral theorem). As an application of this formalism, we revisit some of the Furstenberg-Zimmer structural theory of measure-preserving systems, particularly that of relatively compact and relatively weakly mixing systems, and interpret them in this framework, basically as stochastic versions of compact and weakly mixing systems (though with the caveat that the shift map is allowed to act non-trivially on the underlying probability space). As this formalism is “point-free”, in that it avoids explicit use of fibres and disintegrations, it will be well suited for generalising this structure theory to settings in which the underlying probability spaces are not standard Borel, and the underlying groups are uncountable; I hope to discuss such generalisations in future blog posts.

Remark 6 Roughly speaking, stochastic real analysis can be viewed as a restricted subset of classical real analysis in which all operations have to be “measurable” with respect to the base space. In particular, indiscriminate application of the axiom of choice is not permitted, and one should largely restrict oneself to performing countable unions and intersections rather than arbitrary unions or intersections. Presumably one can formalise this intuition with a suitable “countable transfer principle”, but I was not able to formulate a clean and general principle of this sort, instead verifying various assertions about stochastic objects by hand rather than by direct transfer from the deterministic setting. However, it would be desirable to have such a principle, since otherwise one is faced with the tedious task of redoing all the foundations of real analysis (or whatever other base theory of mathematics one is going to be working in) in the stochastic setting by carefully repeating all the arguments.

More generally, topos theory is a good formalism for capturing precisely the informal idea of performing mathematics with certain operations, such as the axiom of choice, the law of the excluded middle, or arbitrary unions and intersections, being somehow “prohibited” or otherwise “restricted”.


As laid out in the foundational work of Kolmogorov, a classical probability space (or probability space for short) is a triplet {(X, {\mathcal X}, \mu)}, where {X} is a set, {{\mathcal X}} is a {\sigma}-algebra of subsets of {X}, and {\mu: {\mathcal X} \rightarrow [0,1]} is a countably additive probability measure on {{\mathcal X}}. Given such a space, one can form a number of interesting function spaces, including

  • the (real) Hilbert space {L^2(X, {\mathcal X}, \mu)} of square-integrable functions {f: X \rightarrow {\bf R}}, modulo {\mu}-almost everywhere equivalence, and with the positive definite inner product {\langle f, g\rangle_{L^2(X, {\mathcal X}, \mu)} := \int_X f g\ d\mu}; and
  • the unital commutative Banach algebra {L^\infty(X, {\mathcal X}, \mu)} of essentially bounded functions {f: X \rightarrow {\bf R}}, modulo {\mu}-almost everywhere equivalence, with {\|f\|_{L^\infty(X, {\mathcal X}, \mu)}} defined as the essential supremum of {|f|}.

There is also a trace {\tau = \tau_\mu: L^\infty(X, {\mathcal X}, \mu) \rightarrow {\bf R}} on {L^\infty} defined by integration: {\tau(f) := \int_X f\ d\mu}.

One can form the category {\mathbf{Prb}} of classical probability spaces, by defining a morphism {\phi: (X, {\mathcal X}, \mu) \rightarrow (Y, {\mathcal Y}, \nu)} between probability spaces to be a function {\phi: X \rightarrow Y} which is measurable (thus {\phi^{-1}(E) \in {\mathcal X}} for all {E \in {\mathcal Y}}) and measure-preserving (thus {\mu(\phi^{-1}(E)) = \nu(E)} for all {E \in {\mathcal Y}}).

Let us now abstract the algebraic features of these spaces as follows; for want of a better name, I will refer to this abstraction as an algebraic probability space. It is very similar to the non-commutative probability spaces studied in this previous post, except that these spaces are now commutative (and real).

Definition 1 An algebraic probability space is a pair {({\mathcal A}, \tau)} where

  • {{\mathcal A}} is a unital commutative real algebra;
  • {\tau: {\mathcal A} \rightarrow {\bf R}} is a linear functional such that {\tau(1)=1} and {\tau( f^2 ) \geq 0} for all {f \in {\mathcal A}};
  • Every element {f} of {{\mathcal A}} is bounded in the sense that {\sup_{k \geq 1} \tau( f^{2k} )^{1/2k} < \infty}. (Technically, this isn’t an algebraic property, but I need it for technical reasons.)

A morphism {\phi: ({\mathcal A}_1, \tau_1) \rightarrow ({\mathcal A}_2, \tau_2)} is a homomorphism {\phi^*: {\mathcal A}_2 \rightarrow {\mathcal A}_1} which is trace-preserving, in the sense that {\tau_1(\phi^*(f)) = \tau_2(f)} for all {f \in {\mathcal A}_2}.

For want of a better name, I’ll denote the category of algebraic probability spaces as {\mathbf{AlgPrb}}. One can view this category as the opposite of (a subcategory of) the category of tracial commutative real algebras. One could emphasise this opposite nature by denoting the algebraic probability space as {({\mathcal A}, \tau)^{op}} rather than {({\mathcal A},\tau)}; another suggestive (but slightly inaccurate) notation, inspired by the language of schemes, would be {\hbox{Spec}({\mathcal A},\tau)} rather than {({\mathcal A},\tau)}. However, we will not adopt these conventions here, and refer to algebraic probability spaces just by the pair {({\mathcal A},\tau)}.

By the previous discussion, we have a covariant functor {F: \textbf{Prb} \rightarrow \textbf{AlgPrb}} that takes a classical probability space {(X, {\mathcal X}, \mu)} to its algebraic counterpart {(L^\infty(X, {\mathcal X},\mu), \tau_\mu)}, with a morphism {\phi: (X, {\mathcal X}, \mu) \rightarrow (Y, {\mathcal Y}, \nu)} of classical probability spaces mapping to a morphism {F(\phi): (L^\infty(X, {\mathcal X},\mu), \tau_\mu) \rightarrow (L^\infty(Y, {\mathcal Y},\nu), \tau_\nu)} of the corresponding algebraic probability spaces by the formula

\displaystyle  F(\phi)^* f := f \circ \phi

for {f \in L^\infty(Y, {\mathcal Y}, \nu)}. One easily verifies that this is a functor.

In this post I would like to describe a functor {G: \textbf{AlgPrb} \rightarrow \textbf{Prb}} which partially inverts {F} (up to natural isomorphism), that is to say a recipe for starting with an algebraic probability space {({\mathcal A}, \tau)} and producing a classical probability space {(X, {\mathcal X}, \mu)}. This recipe is not new – it is basically the (commutative) Gelfand-Naimark-Segal construction (discussed in this previous post) combined with the Loomis-Sikorski theorem (discussed in this previous post). However, I wanted to put the construction in a single location for sake of reference. I also wanted to make the point that {F} and {G} are not complete inverses; there is a bit of information in the algebraic probability space (e.g. topological information) which is lost when passing back to the classical probability space. In some future posts, I would like to develop some ergodic theory using the algebraic foundations of probability theory rather than the classical foundations; this turns out to be convenient in the ergodic theory arising from nonstandard analysis (such as that described in this previous post), in which the groups involved are uncountable and the underlying spaces are not standard Borel spaces.

Let us describe how to construct the functor {G}, with details postponed to below the fold; a toy numerical illustration of the first few steps appears just after the following list.

  1. Starting with an algebraic probability space {({\mathcal A}, \tau)}, form an inner product on {{\mathcal A}} by the formula {\langle f, g \rangle := \tau(fg)}, and also form the spectral radius {\rho(f) :=\lim_{k \rightarrow \infty} \tau(f^{2^k})^{1/2^k}}.
  2. The inner product is clearly positive semi-definite. Quotienting out the null vectors and taking completions, we arrive at a real Hilbert space {L^2 = L^2({\mathcal A},\tau)}, to which the trace {\tau} may be extended.
  3. Somewhat less obviously, the spectral radius is well-defined and gives a norm on {{\mathcal A}}. Taking {L^2} limits of sequences in {{\mathcal A}} of bounded spectral radius gives us a subspace {L^\infty = L^\infty({\mathcal A},\tau)} of {L^2} that has the structure of a real commutative Banach algebra.
  4. The idempotents {1_E} of the Banach algebra {L^\infty} may be indexed by elements {E} of an abstract {\sigma}-algebra {{\mathcal B}}.
  5. The Boolean algebra homomorphisms {\delta_x: {\mathcal B} \rightarrow \{0,1\}} (or equivalently, the real algebra homomorphisms {\iota_x: L^\infty \rightarrow {\bf R}}) may be indexed by elements {x} of a space {X}.
  6. Let {{\mathcal X}} denote the {\sigma}-algebra on {X} generated by the basic sets {\overline{E} := \{ x \in X: \delta_x(E) = 1 \}} for every {E \in {\mathcal B}}.
  7. Let {{\mathcal N}} be the {\sigma}-ideal of {{\mathcal X}} generated by the sets {\bigcap_n \overline{E_n}}, where {E_n \in {\mathcal B}} is a sequence with {\bigcap_n E_n = \emptyset}.
  8. One verifies that {{\mathcal B}} is isomorphic to {{\mathcal X}/{\mathcal N}}. Using this isomorphism, the trace {\tau} on {L^\infty} can be used to construct a countably additive measure {\mu} on {{\mathcal X}}. The classical probability space {(X, {\mathcal X}, \mu)} is then {G( {\mathcal A}, \tau )}, and the abstract spaces {L^2, L^\infty} may now be identified with their concrete counterparts {L^2(X, {\mathcal X}, \mu)}, {L^\infty(X, {\mathcal X}, \mu)}.
  9. Every algebraic probability space morphism {\phi: ({\mathcal A}_1,\tau_1) \rightarrow ({\mathcal A}_2,\tau_2)} generates a classical probability morphism {G(\phi): (X_1, {\mathcal X}_1, \mu_1) \rightarrow (X_2, {\mathcal X}_2, \mu_2)} via the formula

    \displaystyle  \delta_{G(\phi)(x_1)}( E_2 ) = \delta_{x_1}( \phi^*(E_2) )

    using a pullback operation {\phi^*} on the abstract {\sigma}-algebras {{\mathcal B}_1, {\mathcal B}_2} that can be defined by density.
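
To make the first few steps concrete, here is a quick numerical sketch (purely illustrative, and not part of the construction above) in the toy case when {{\mathcal A}} is the algebra of functions on a finite set and {\tau} is integration against a probability vector; one can watch the moments {\tau(f^{2^k})^{1/2^k}} from step 1 converge to the essential supremum of {|f|}, which is the {L^\infty} norm appearing in step 3.

```python
import numpy as np

# Toy model of Definition 1 (illustrative only): A = functions on
# {0,...,N-1}, with trace tau(f) = sum_i p_i f(i) for a probability
# vector p.
rng = np.random.default_rng(0)
N = 6
p = rng.random(N); p /= p.sum()       # weights defining the trace tau
f = rng.standard_normal(N)            # an element of A

tau = lambda g: float(np.dot(p, g))
print("tau(f) =", tau(f), "; <f,f> =", tau(f * f))   # step 1: <g,h> = tau(gh)

# step 1 (continued): the spectral radius rho(f) = lim_k tau(f^{2^k})^{1/2^k}
for k in range(1, 7):
    print(k, tau(f ** (2 ** k)) ** (2.0 ** -k))
print("ess sup |f| =", np.abs(f[p > 0]).max())       # the L^infty norm of step 3
```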

Remark 1 The classical probability space {X} constructed by the functor {G} has some additional structure; namely {X} is a {\sigma}-Stone space (a Stone space with the property that the closure of any countable union of clopen sets is clopen), {{\mathcal X}} is the Baire {\sigma}-algebra (generated by the clopen sets), and the null sets are the meager sets. However, we will not use this additional structure here.

The partial inversion relationship between the functors {F: \textbf{Prb} \rightarrow \textbf{AlgPrb}} and {G: \textbf{AlgPrb} \rightarrow \textbf{Prb}} is given by the following assertion:

  1. There is a natural transformation from {F \circ G: \textbf{AlgPrb} \rightarrow \textbf{AlgPrb}} to the identity functor {I: \textbf{AlgPrb} \rightarrow \textbf{AlgPrb}}.

More informally: if one starts with an algebraic probability space {({\mathcal A},\tau)} and converts it back into a classical probability space {(X, {\mathcal X}, \mu)}, then there is a trace-preserving algebra homomorphism of {{\mathcal A}} to {L^\infty( X, {\mathcal X}, \mu )}, which respects morphisms of the algebraic probability space. While this relationship is far weaker than an equivalence of categories (which would require that {F \circ G} and {G \circ F} are both natural isomorphisms), it is still good enough to allow many ergodic theory problems formulated using classical probability spaces to be reformulated instead as an equivalent problem in algebraic probability spaces.

Remark 2 The opposite composition {G \circ F: \textbf{Prb} \rightarrow \textbf{Prb}} is a little odd: it takes an arbitrary probability space {(X, {\mathcal X}, \mu)} and returns a more complicated probability space {(X', {\mathcal X}', \mu')}, with {X'} being the space of homomorphisms {\iota_x: L^\infty(X, {\mathcal X}, \mu) \rightarrow {\bf R}}. While there is "morally" an embedding of {X} into {X'} using the evaluation map, this map does not exist in general because points in {X} may well have zero measure. However, if one takes a "pointless" approach and focuses just on the measure algebras {({\mathcal X}, \mu)}, {({\mathcal X}', \mu')}, then these algebras become naturally isomorphic after quotienting out by null sets.

Remark 3 An algebraic probability space captures a bit more structure than a classical probability space, because {{\mathcal A}} may be identified with a proper subset of {L^\infty} that describes the “regular” functions (or random variables) of the space. For instance, starting with the unit circle {{\bf R}/{\bf Z}} (with the usual Haar measure and the usual trace {\tau(f) = \int_{{\bf R}/{\bf Z}} f}), any unital subalgebra {{\mathcal A}} of {L^\infty({\bf R}/{\bf Z})} that is dense in {L^2({\bf R}/{\bf Z})} will generate the same classical probability space {G( {\mathcal A}, \tau )} on applying the functor {G}, namely one will get the space {({\bf R}/{\bf Z})'} of homomorphisms from {L^\infty({\bf R}/{\bf Z})} to {{\bf R}} (with the measure induced from {\tau}). Thus for instance {{\mathcal A}} could be the continuous functions {C( {\bf R}/{\bf Z} )}, the Wiener algebra {A({\bf R}/{\bf Z})} or the full space {L^\infty({\bf R}/{\bf Z})}, but the classical space {G( {\mathcal A}, \tau )} will be unable to distinguish these spaces from each other. In particular, the functor {F \circ G} loses information (roughly speaking, this functor takes an algebraic probability space and completes it to a von Neumann algebra, but then forgets exactly what algebra was initially used to create this completion). In ergodic theory, this sort of “extra structure” is traditionally encoded in topological terms, by assuming that the underlying probability space {X} has a nice topological structure (e.g. a standard Borel space); however, with the algebraic perspective one has the freedom to have non-topological notions of extra structure, by choosing {{\mathcal A}} to be something other than an algebra {C(X)} of continuous functions on a topological space. I hope to discuss one such example of extra structure (coming from the Gowers-Host-Kra theory of uniformity seminorms) in a later blog post (this generalises the example of the Wiener algebra given previously, which is encoding “Fourier structure”).

A small example of how one could use the functors {F, G} is as follows. Suppose one has a classical probability space {(X, {\mathcal X}, \mu)} with a measure-preserving action of an uncountable group {\Gamma}, which is only defined (and an action) up to almost everywhere equivalence; thus for instance for any set {E} and any {g, h \in \Gamma}, {T^{gh} E} and {T^g T^h E} might not be exactly equal, but only equal up to a null set. For similar reasons, an element {E} of the invariant factor {{\mathcal X}^\Gamma} might not be exactly invariant with respect to {\Gamma}, but instead one only has {T^g E} and {E} equal up to null sets for each {g \in \Gamma}. One might like to “clean up” the action of {\Gamma} to make it defined everywhere, and a genuine action everywhere, but this is not immediately achievable if {\Gamma} is uncountable, since the union of all the null sets where something bad occurs may cease to be a null set. However, by applying the functor {F}, each shift {T^g: X \rightarrow X} defines a morphism {T^g: L^\infty(X, {\mathcal X}, \mu) \rightarrow L^\infty(X, {\mathcal X}, \mu)} on the associated algebraic probability space (i.e. the Koopman operator), and then applying {G}, we obtain a shift {T^g: X' \rightarrow X'} on a new classical probability space {(X', {\mathcal X}', \mu')} which now gives a genuine measure-preserving action of {\Gamma}, and which is equivalent to the original action from a measure algebra standpoint. The invariant factor {{\mathcal X}^\Gamma} now consists of those sets in {{\mathcal X}'} which are genuinely {\Gamma}-invariant, not just up to null sets. (Basically, the classical probability space {(X', {\mathcal X}', \mu')} contains a Boolean algebra {\overline{\mathcal B}} with the property that every measurable set {A \in {\mathcal X}'} is equivalent up to null sets to precisely one set in {\overline{\mathcal B}}, allowing for a canonical “retraction” onto {\overline{\mathcal B}} that eliminates all null set issues.)

More indirectly, the functors {F, G} suggest that one should be able to develop a “pointless” form of ergodic theory, in which the underlying probability spaces are given algebraically rather than classically. I hope to give some more specific examples of this in later posts.


The classical foundations of probability theory (discussed for instance in this previous blog post) rest on the notion of a probability space {(\Omega, {\cal E}, {\bf P})} – a space {\Omega} (the sample space) equipped with a {\sigma}-algebra {{\cal E}} (the event space), together with a countably additive probability measure {{\bf P}: {\cal E} \rightarrow [0,1]} that assigns a real number in the interval {[0,1]} to each event.

One can generalise the concept of a probability space to a finitely additive probability space, in which the event space {{\cal E}} is now only a Boolean algebra rather than a {\sigma}-algebra, and the measure {{\bf P}} is now only finitely additive instead of countably additive, thus {{\bf P}( E \vee F ) = {\bf P}(E) + {\bf P}(F)} when {E,F} are disjoint events. By giving up countable additivity, one loses a fair amount of measure and integration theory, and in particular the notion of the expectation of a random variable becomes problematic (unless the random variable takes only finitely many values). Nevertheless, one can still perform a fair amount of probability theory in this weaker setting.

In this post I would like to describe a further weakening of probability theory, which I will call qualitative probability theory, in which one does not assign a precise numerical probability value {{\bf P}(E)} to each event, but instead merely records whether this probability is zero, one, or something in between. Thus {{\bf P}} is now a function from {{\cal E}} to the set {\{0, I, 1\}}, where {I} is a new symbol that replaces all the elements of the open interval {(0,1)}. In this setting, one can no longer compute quantitative expressions, such as the mean or variance of a random variable; but one can still talk about whether an event holds almost surely, with positive probability, or with zero probability, and there are still usable notions of independence. (I will refer to classical probability theory as quantitative probability theory, to distinguish it from its qualitative counterpart.)

The main reason I want to introduce this weak notion of probability theory is that it becomes suited to talk about random variables living inside algebraic varieties, even if these varieties are defined over fields other than {{\bf R}} or {{\bf C}}. In algebraic geometry one often talks about a “generic” element of a variety {V} defined over a field {k}, which does not lie in any specified variety of lower dimension defined over {k}. Once {V} has positive dimension, such generic elements do not exist as classical, deterministic {k}-points {x} in {V}, since of course any such point lies in the {0}-dimensional subvariety {\{x\}} of {V}. There are of course several established ways to deal with this problem. One way (which one might call the “Weil” approach to generic points) is to extend the field {k} to a sufficiently transcendental extension {\tilde k}, in order to locate a sufficient number of generic points in {V(\tilde k)}. Another approach (which one might dub the “Zariski” approach to generic points) is to work scheme-theoretically, and interpret a generic point in {V} as being associated to the zero ideal in the function ring of {V}. However I want to discuss a third perspective, in which one interprets a generic point not as a deterministic object, but rather as a random variable {{\bf x}} taking values in {V}, but which lies in any given lower-dimensional subvariety of {V} with probability zero. This interpretation is intuitive, but difficult to implement in classical probability theory (except perhaps when considering varieties over {{\bf R}} or {{\bf C}}) due to the lack of a natural probability measure to place on algebraic varieties; however it works just fine in qualitative probability theory. In particular, the algebraic geometry notion of being “generically true” can now be interpreted probabilistically as an assertion that something is “almost surely true”.

It turns out that just as qualitative random variables may be used to interpret the concept of a generic point, they can also be used to interpret the concept of a type in model theory; the type of a random variable {x} is the set of all predicates {\phi(x)} that are almost surely obeyed by {x}. In contrast, model theorists often adopt a Weil-type approach to types, in which one works with deterministic representatives of a type, which often do not occur in the original structure of interest, but only in a sufficiently saturated extension of that structure (this is the analogue of working in a sufficiently transcendental extension of the base field). However, it seems that (in some cases at least) one can equivalently view types in terms of (qualitative) random variables on the original structure, avoiding the need to extend that structure. (Instead, one reserves the right to extend the sample space of one’s probability theory whenever necessary, as part of the “probabilistic way of thinking” discussed in this previous blog post.) We illustrate this below the fold with two related theorems that I will interpret through the probabilistic lens: the “group chunk theorem” of Weil (and later developed by Hrushovski), and the “group configuration theorem” of Zilber (and again later developed by Hrushovski). For sake of concreteness we will only consider these theorems in the theory of algebraically closed fields, although the results are quite general and can be applied to many other theories studied in model theory.


Van Vu and I have just uploaded to the arXiv our joint paper “Local universality of zeroes of random polynomials“. This paper is a sequel to our previous work on local universality of eigenvalues of (non-Hermitian) random matrices {M_n} with independent entries. One can re-interpret these previous results as a universality result for a certain type of random polynomial {f: {\bf C} \rightarrow {\bf C}}, namely the characteristic polynomial {f(z) = \hbox{det}(M_n-z)} of the random matrix {M_n}. In this paper, we consider the analogous problem for a different model of random polynomial, namely polynomials {f} with independent random coefficients. More precisely, we consider random polynomials {f = f_n} of the form

\displaystyle  f(z) = \sum_{i=0}^n c_i \xi_i z^i

where {c_0,\ldots,c_n \in {\bf C}} are deterministic complex coefficients, and {\xi_0,\ldots,\xi_n \equiv \xi} are independent identically distributed copies of a complex random variable {\xi}, which we normalise to have mean zero and variance one. For simplicity we will ignore the technical issue that the leading coefficient {c_n \xi_n} is allowed to vanish; then {f} has {n} zeroes {\zeta_1,\ldots,\zeta_n \in {\bf C}} (counting multiplicity), which can be viewed as a random point process {\Sigma = \{\zeta_1,\ldots,\zeta_n\}} in the complex plane. In analogy with other models (such as random matrix models), we expect the (suitably normalised) asymptotic statistics of this point process in the limit {n \rightarrow \infty} to be universal, in the sense that it is largely independent of the precise distribution of the atom variable {\xi}.

Our results are fairly general with regard to the choice of coefficients {c_i}, but we isolate three choices of coefficients that are particularly natural and well studied in the literature:

  • Flat polynomials (or Weyl polynomials) in which {c_i := \frac{1}{\sqrt{i!}}}.
  • Elliptic polynomials (or binomial polynomials) in which {c_i := \sqrt{\binom{n}{i}}}.
  • Kac polynomials in which {c_i := 1}.

The flat and elliptic polynomials enjoy special symmetries in the model case when the atom distribution {\xi} is a complex Gaussian {N(0,1)_{\bf C}}. Indeed, the zeroes {\Sigma} of elliptic polynomials with complex Gaussian coefficients have a distribution which is invariant with respect to isometries {T: {\bf C} \cup \{\infty\} \rightarrow {\bf C} \cup \{\infty\}} of the Riemann sphere {{\bf C} \cup \{\infty\}} (thus {T\Sigma} has the same distribution as {\Sigma}), while the zeroes of the limiting case {\sum_{i=0}^\infty \frac{1}{\sqrt{i!}} \xi_i z^i} of the flat polynomials with complex Gaussian coefficients are similarly invariant with respect to isometries {T: {\bf C} \rightarrow {\bf C}} of the complex plane {{\bf C}}. (For a nice geometric proof of these facts, I recommend the book of Hough, Krishnapur, Peres, and Virag.)

The global (i.e. coarse-scale) distribution of zeroes of these polynomials is well understood, first in the case of gaussian distributions using the fundamental tool of the Kac-Rice formula, and then for more general choices of atom distribution in the recent work of Kabluchko and Zaporozhets. The zeroes of the flat polynomials are asymptotically distributed according to the circular law, normalised to be uniformly distributed on the disk {B(0,\sqrt{n})} of radius {\sqrt{n}} centred at the origin. To put it a bit informally, the zeroes are asymptotically distributed according to the measure {\frac{1}{\pi} 1_{|z| \leq \sqrt{n}} dz}, where {dz} denotes Lebesgue measure on the complex plane. One can non-rigorously see the scale {\sqrt{n}} appear by observing that when {|z|} is much larger than {\sqrt{n}}, we expect the leading term {\frac{1}{\sqrt{n!}} \xi_n z^n} of the flat polynomial {\sum_{i=0}^n \frac{1}{\sqrt{i!}} \xi_i z^i} to dominate, so that the polynomial should not have any zeroes in this region.

Similarly, the zeroes of the elliptic polynomials are known to be asymptotically distributed according to a Cauchy-type distribution {\frac{1}{\pi} \frac{1}{(1+|z|^2/n)^2} dz}. The Kac polynomials {\sum_{i=0}^n \xi_i z^i} behave differently; the zeroes concentrate uniformly on the unit circle {|z|=1} (which is reasonable, given that one would expect the top order term {\xi_n z^n} to dominate for {|z| > 1} and the bottom order term {\xi_0} to dominate for {|z| < 1}). In particular, whereas the typical spacing between zeroes in the flat and elliptic cases would be expected to be comparable to {1}, the typical spacing between zeroes for a Kac polynomial would be expected instead to be comparable to {1/n}.
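
As a cheap illustration of both the global law and the universality phenomenon, here is a rough numerical sketch (mine, not from the paper): the zeroes of a degree {200} Kac polynomial huddle at distance {O(1/n)} from the unit circle, and the picture is visibly insensitive to replacing gaussian atoms by Bernoulli signs.

```python
import numpy as np

# Rough numerical sketch: Kac zeroes concentrate near the unit circle,
# for gaussian and for Bernoulli (+-1) atom distributions alike.
rng = np.random.default_rng(1)
n = 200
for name, xi in [("gaussian ", rng.standard_normal(n + 1)),
                 ("bernoulli", rng.integers(0, 2, n + 1) * 2.0 - 1.0)]:
    zeta = np.roots(xi[::-1])            # np.roots wants the leading coefficient first
    dist = np.abs(np.abs(zeta) - 1.0)    # radial distance to the unit circle
    print(name, "median distance to |z| = 1:", np.median(dist))
    # both medians come out O(1/n), consistent with the typical spacing 1/n
```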

In our paper we studied the local distribution of zeroes at the scale of the typical spacing. In the case of polynomials with continuous complex atom distribution {\xi}, the natural statistic to measure here is the {k}-point correlation function {\rho^{(k)}(z_1,\ldots,z_k)}, which for distinct complex numbers {z_1,\ldots,z_k} can be defined as the probability that there is a zero in each of the balls {B(z_1,\varepsilon),\ldots,B(z_k,\varepsilon)} for some infinitesimal {\varepsilon >0}, divided by the normalising factor {(\pi \varepsilon^2)^k}. (One can also define a {k}-point correlation function in the case of a discrete distribution, but it will be a measure rather than a function in that case.) Our first main theorem is a general replacement principle which asserts, very roughly speaking, that the asymptotic {k}-point correlation functions of two random polynomials {f, \tilde f} will agree if the log-magnitudes {\log |f(z)|, \log |\tilde f(z)|} have asymptotically the same distribution (actually we have to consider the joint distribution of {\log |f(z_1)|, \ldots, \log |f(z_k)|} for several points {z_1,\ldots,z_k}, but let us ignore this detail for now), and if the polynomials {f, \tilde f} obey a "non-clustering property" which asserts, roughly speaking, that not too many of the zeroes of {f} can typically concentrate in a small ball. This replacement principle was implicit in our previous paper (and can be viewed as a local-scale version of the global-scale replacement principle in this earlier paper of ours). Specialising the replacement principle to the elliptic or flat polynomials, and using the Lindeberg swapping argument, we obtain a Two Moment Theorem that asserts, roughly speaking, that the asymptotic behaviour of the {k}-point correlation functions depends only on the first two moments of the real and imaginary components of {\xi}, as long as one avoids some regions of space where universality is expected to break down. (In particular, because {f(0) = c_0 \xi_0} does not have a universal distribution, one does not expect universality to hold near the origin; there is a similar problem near infinity.) Closely related results, by a slightly different method, have also been obtained recently by Ledoan, Merkli, and Starr. A similar result holds for the Kac polynomials after rescaling to account for the narrower spacing between zeroes.

We also have analogous results in the case of polynomials with real coefficients (so that the coefficients {c_i} and the atom distribution {\xi} are both real). In this case one expects to see a certain number of real zeroes, together with conjugate pairs of complex zeroes. Instead of the {k}-point correlation function {\rho^{(k)}(z_1,\ldots,z_k)}, the natural object of study is now the mixed {(k,l)}-point correlation function {\rho^{(k,l)}(x_1,\ldots,x_k,z_1,\ldots,z_l)} that (roughly speaking) controls how often one can find a real zero near the real numbers {x_1,\ldots,x_k}, and a complex zero near the points {z_1,\ldots,z_l}. It turns out that one can disentangle the real and strictly complex zeroes and obtain separate universality results for both types of zeroes, provided that at least one of the polynomials involved obeys a "weak repulsion estimate" that shows that the real zeroes do not cluster very close to each other (and that the complex zeroes do not cluster very close to their complex conjugates). Such an estimate is needed to avoid the situation of two nearby real zeroes "colliding" to create a (barely) complex zero and its complex conjugate, or the time reversal of such a collision. Fortunately, in the case of real gaussian polynomials one can use the Kac-Rice formula to establish such a weak repulsion estimate, allowing analogues of the above universality results for complex random polynomials in the real case. Among other things, this gives universality results for the number {N_{\bf R}} of real zeroes of a random flat or elliptic polynomial; it turns out that this number is typically equal to {\frac{2}{\pi} \sqrt{n} + O(n^{1/2-c})} and {\sqrt{n} + O(n^{1/2-c})} respectively. (For Kac polynomials, the situation is somewhat different; it was already known that {N_{\bf R} = \frac{2}{\pi} \log n + o(\log n)} thanks to a long series of papers, starting with the foundational work of Kac and culminating in the work of Ibragimov and Maslova.)

While our methods are based on our earlier work on eigenvalues of random matrices, the situation with random polynomials is actually somewhat easier to deal with. This is because the log-magnitude {\log |f(z)|} of a random polynomial with independent coefficients is significantly easier to control than the log-determinant {\log |\hbox{det}(M-z)|} of a random matrix, as the former can be controlled by the central limit theorem, while the latter requires significantly more difficult arguments (in particular, bounds on the least singular value combined with Girko’s Hermitization trick). As such, one could view the current paper as an introduction to our more complicated previous paper, and with this in mind we have written the current paper to be self-contained (though this did make the paper somewhat lengthy, checking in at 68 pages).

The fundamental notions of calculus, namely differentiation and integration, are often viewed as being the quintessential concepts in mathematical analysis, as their standard definitions involve the concept of a limit. However, it is possible to capture most of the essence of these notions by purely algebraic means (almost completely avoiding the use of limits, Riemann sums, and similar devices), which turns out to be useful when trying to generalise these concepts to more abstract situations in which it becomes convenient to permit the underlying number systems involved to be something other than the real or complex numbers, even if this makes many standard analysis constructions unavailable. For instance, the algebraic notion of a derivation often serves as a substitute for the analytic notion of a derivative in such cases, by abstracting out the key algebraic properties of differentiation, namely linearity and the Leibniz rule (also known as the product rule).

Abstract algebraic analogues of integration are less well known, but can still be developed. To motivate such an abstraction, consider the integration functional {I: {\mathcal S}({\bf R} \rightarrow {\bf C}) \rightarrow {\bf C}} from the space {{\mathcal S}({\bf R} \rightarrow {\bf C})} of complex-valued Schwartz functions {f: {\bf R} \rightarrow {\bf C}} to the complex numbers, defined by

\displaystyle  I(f) := \int_{\bf R} f(x)\ dx

where the integration on the right is the usual Lebesgue integral (or improper Riemann integral) from analysis. This functional obeys two obvious algebraic properties. Firstly, it is linear over {{\bf C}}, thus

\displaystyle  I(cf) = c I(f) \ \ \ \ \ (1)

and

\displaystyle  I(f+g) = I(f) + I(g) \ \ \ \ \ (2)

for all {f,g \in {\mathcal S}({\bf R} \rightarrow {\bf C})} and {c \in {\bf C}}. Secondly, it is translation invariant, thus

\displaystyle  I(\tau_h f) = I(f) \ \ \ \ \ (3)

for all {h \in {\bf R}}, where {\tau_h f(x) := f(x-h)} is the translation of {f} by {h}. Motivated by the uniqueness theory of Haar measure, one might expect that these two axioms already uniquely determine {I} after one sets a normalisation, for instance by requiring that

\displaystyle  I( x \mapsto e^{-\pi x^2} ) = 1. \ \ \ \ \ (4)

This is not quite true as stated (one can modify the proof of the Hahn-Banach theorem, after first applying a Fourier transform, to create pathological translation-invariant linear functionals on {{\mathcal S}({\bf R} \rightarrow {\bf C})} that are not multiples of the standard integration functional), but if one adds a mild analytical axiom, such as continuity of {I} (using the usual Schwartz topology on {{\mathcal S}({\bf R} \rightarrow {\bf C})}), then the above axioms are enough to uniquely pin down the notion of integration. Indeed, if {I: {\mathcal S}({\bf R} \rightarrow {\bf C}) \rightarrow {\bf C}} is a continuous linear functional that is translation invariant, then from the linearity and translation invariance axioms one has

\displaystyle  I( \frac{\tau_h f - f}{h} ) = 0

for all {f \in {\mathcal S}({\bf R} \rightarrow {\bf C})} and non-zero reals {h}. If {f} is Schwartz, then as {h \rightarrow 0}, one can verify that the Newton quotients {\frac{\tau_h f - f}{h}} converge in the Schwartz topology to the negative derivative {-f'} of {f}, so by the continuity and linearity axioms one has

\displaystyle  I(f') = 0.

Next, note that any Schwartz function of integral zero has an antiderivative which is also Schwartz, and so {I} annihilates all zero-integral Schwartz functions, and thus must be a scalar multiple of the usual integration functional. Using the normalisation (4), we see that {I} must therefore be the usual integration functional, giving the claimed uniqueness.

Motivated by the above discussion, we can define the notion of an abstract integration functional {I: X \rightarrow R} taking values in some vector space {R}, and applied to inputs {f} in some other vector space {X} that enjoys a linear action {h \mapsto \tau_h} (the “translation action”) of some group {V}, as being a functional which is both linear and translation invariant, thus one has the axioms (1), (2), (3) for all {f,g \in X}, scalars {c}, and {h \in V}. The previous discussion then considered the special case when {R = {\bf C}}, {X = {\mathcal S}({\bf R} \rightarrow {\bf C})}, {V = {\bf R}}, and {\tau} was the usual translation action.

Once we have performed this abstraction, we can now present analogues of classical integration which bear very little analytic resemblance to the classical concept, but which still have much of the algebraic structure of integration. Consider for instance the situation in which we keep the complex range {R = {\bf C}}, the translation group {V = {\bf R}}, and the usual translation action {h \mapsto \tau_h}, but we replace the space {{\mathcal S}({\bf R} \rightarrow {\bf C})} of Schwartz functions by the space {Poly_{\leq d}({\bf R} \rightarrow {\bf C})} of polynomials {x \mapsto a_0 + a_1 x + \ldots + a_d x^d} of degree at most {d} with complex coefficients, where {d} is a fixed natural number; note that this space is translation invariant, so it makes sense to talk about an abstract integration functional {I: Poly_{\leq d}({\bf R} \rightarrow {\bf C}) \rightarrow {\bf C}}. Of course, one cannot apply traditional integration concepts to non-zero polynomials, as they are not absolutely integrable. But one can repeat the previous arguments to show that any abstract integration functional must annihilate derivatives of polynomials of degree at most {d}:

\displaystyle  I(f') = 0 \hbox{ for all } f \in Poly_{\leq d}({\bf R} \rightarrow {\bf C}). \ \ \ \ \ (5)

Clearly, every polynomial of degree at most {d-1} is thus annihilated by {I}, which makes {I} a scalar multiple of the functional that extracts the top coefficient {a_d} of a polynomial, thus if one sets a normalisation

\displaystyle  I( x \mapsto x^d ) = c

for some constant {c}, then one has

\displaystyle  I( x \mapsto a_0 + a_1 x + \ldots + a_d x^d ) = c a_d \ \ \ \ \ (6)

for any polynomial {x \mapsto a_0 + a_1 x + \ldots + a_d x^d}. So we see that up to a normalising constant, the operation of extracting the top order coefficient of a polynomial of fixed degree serves as the analogue of integration. In particular, despite the fact that integration is supposed to be the “opposite” of differentiation (as indicated for instance by (5)), we see in this case that integration is basically ({d}-fold) differentiation; indeed, compare (6) with the identity

\displaystyle  (\frac{d}{dx})^d ( a_0 + a_1 x + \ldots + a_d x^d ) = d! a_d.

In particular, we see that, in contrast to the usual Lebesgue integral, the integration functional (6) can be localised to an arbitrary location: one only needs to know the germ of the polynomial {x \mapsto a_0 + a_1 x + \ldots + a_d x^d} at a single point {x_0} in order to determine the value of the functional (6). This localisation property may initially seem at odds with the translation invariance, but the two can be reconciled thanks to the extremely rigid nature of the class {Poly_{\leq d}({\bf R} \rightarrow {\bf C})}. In contrast, the Schwartz class {{\mathcal S}({\bf R} \rightarrow {\bf C})} admits bump functions, and so can generate local phenomena that can only be detected in small regions of the underlying spatial domain; this forces any translation-invariant integration functional on such function classes to measure the function at every single point in space.

The reversal of the relationship between integration and differentiation is also reflected in the fact that the abstract integration operation on polynomials interacts with the scaling operation {\delta_\lambda f(x) := f(x/\lambda)} in essentially the opposite way from the classical integration operation. Indeed, for classical integration on {{\bf R}^d}, one has

\displaystyle  \int_{{\bf R}^d} f(x/\lambda)\ dx = \lambda^d \int_{{\bf R}^d} f(x)\ dx

for Schwartz functions {f \in {\mathcal S}({\bf R}^d \rightarrow {\bf C})}, and so in this case the integration functional {I(f) := \int_{{\bf R}^d} f(x)\ dx} obeys the scaling law

\displaystyle  I( \delta_\lambda f ) = \lambda^d I(f).

In contrast, the abstract integration operation defined in (6) obeys the opposite scaling law

\displaystyle  I( \delta_\lambda f ) = \lambda^{-d} I(f). \ \ \ \ \ (7)
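
Both the translation invariance of (6) and the inverted scaling law (7) can be confirmed by a one-screen symbolic computation; the following sympy sketch (illustrative, with {c=1} and {d=4}) does exactly this.

```python
import sympy as sp

# Symbolic check for d = 4 (and c = 1 in (6)): the top-coefficient
# functional is translation invariant and obeys the inverted scaling
# law (7): I(delta_lambda f) = lambda^{-d} I(f).
x, h, lam = sp.symbols('x h lam')
d = 4
a = sp.symbols('a0:5')                                 # a0, ..., a4
f = sum(a[i] * x**i for i in range(d + 1))

top = lambda g: sp.Poly(sp.expand(g), x).coeffs()[0]   # extract the leading coefficient

print(sp.simplify(top(f.subs(x, x - h)) - top(f)))     # 0: invariance under tau_h
print(sp.simplify(top(f.subs(x, x / lam)) / top(f)))   # lam**(-4): the law (7)
```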

Remark 1 One way to interpret what is going on is to view the integration operation (6) as a renormalised version of integration. A polynomial {x \mapsto a_0 + a_1 x + \ldots + a_d x^d} is, in general, not absolutely integrable, and the partial integrals

\displaystyle  \int_0^N a_0 + a_1 x + \ldots + a_d x^d\ dx

diverge as {N \rightarrow \infty}. But if one renormalises these integrals by the factor {\frac{1}{N^{d+1}}}, then one recovers convergence,

\displaystyle  \lim_{N \rightarrow \infty} \frac{1}{N^{d+1}} \int_0^N a_0 + a_1 x + \ldots + a_d x^d\ dx = \frac{1}{d+1} a_d

thus giving an interpretation of (6) as a renormalised classical integral, with the renormalisation being responsible for the unusual scaling relationship in (7). However, this interpretation is a little artificial, and it seems that it is best to view functionals such as (6) from an abstract algebraic perspective, rather than to try to force an analytic interpretation on them.
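
The renormalised limit in Remark 1 is equally easy to confirm symbolically (again just a sketch, for {d=4}):

```python
import sympy as sp

# Check of the renormalised integral of Remark 1 for d = 4:
# N^{-(d+1)} * int_0^N f(x) dx  ->  a_d / (d+1).
x, N = sp.symbols('x N', positive=True)
a = sp.symbols('a0:5')
f = sum(a[i] * x**i for i in range(5))
print(sp.limit(sp.integrate(f, (x, 0, N)) / N**5, N, sp.oo))   # a4/5
```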

Now we return to the classical Lebesgue integral

\displaystyle  I(f) := \int_{\bf R} f(x)\ dx. \ \ \ \ \ (8)

As noted earlier, this integration functional has a translation invariance associated to translations along the real line {{\bf R}}, as well as a dilation invariance by real dilation parameters {\lambda>0}. However, if we refine the class {{\mathcal S}({\bf R} \rightarrow {\bf C})} of functions somewhat, we can obtain a stronger family of invariances, in which we allow complex translations and dilations. More precisely, let {\mathcal{SE}({\bf C} \rightarrow {\bf C})} denote the space of all functions {f: {\bf C} \rightarrow {\bf C}} which are entire (or equivalently, are given by a Taylor series with an infinite radius of convergence around the origin) and also admit rapid decay in a sectorial neighbourhood of the real line, or more precisely there exists an {\epsilon>0} such that for every {A > 0} there exists {C_A > 0} such that one has the bound

\displaystyle  |f(z)| \leq C_A (1+|z|)^{-A}

whenever {|\hbox{Im}(z)| \leq A + \epsilon |\hbox{Re}(z)|}. For want of a better name, we shall call elements of this space Schwartz entire functions. This is clearly a complex vector space. A typical example of a Schwartz entire function is the complex gaussian

\displaystyle  f(z) := e^{-\pi (az^2 + 2bz + c)}

where {a,b,c} are complex numbers with {\hbox{Re}(a) > 0}. From the Cauchy integral formula (and its derivatives) we see that if {f} lies in {\mathcal{SE}({\bf C} \rightarrow {\bf C})}, then the restriction of {f} to the real line lies in {{\mathcal S}({\bf R} \rightarrow {\bf C})}; conversely, from analytic continuation we see that every function in {{\mathcal S}({\bf R} \rightarrow {\bf C})} has at most one extension in {\mathcal{SE}({\bf C} \rightarrow {\bf C})}. Thus one can identify {\mathcal{SE}({\bf C} \rightarrow {\bf C})} with a subspace of {{\mathcal S}({\bf R} \rightarrow {\bf C})}, and in particular the integration functional (8) is inherited by {\mathcal{SE}({\bf C} \rightarrow {\bf C})}, and by abuse of notation we denote the resulting functional {I: \mathcal{SE}({\bf C} \rightarrow {\bf C}) \rightarrow {\bf C}} as {I} also. Note, in analogy with the situation with polynomials, that this abstract integration functional is somewhat localised; one only needs to evaluate the function {f} on the real line, rather than the entire complex plane, in order to compute {I(f)}. This is consistent with the rigid nature of Schwartz entire functions, as one can uniquely recover the entire function from its values on the real line by analytic continuation.

Of course, the functional {I: \mathcal{SE}({\bf C} \rightarrow {\bf C}) \rightarrow {\bf C}} remains translation invariant with respect to real translation:

\displaystyle  I(\tau_h f) = I(f) \hbox{ for all } h \in {\bf R}.

However, thanks to contour shifting, we now also have translation invariance with respect to complex translation:

\displaystyle  I(\tau_h f) = I(f) \hbox{ for all } h \in {\bf C},

where of course we continue to define the translation operator {\tau_h} for complex {h} by the usual formula {\tau_h f(x) := f(x-h)}. In a similar vein, we also have the scaling law

\displaystyle  I(\delta_\lambda f) = \lambda I(f)

for any {f \in \mathcal{SE}({\bf C} \rightarrow {\bf C})}, if {\lambda} is a complex number sufficiently close to {1} (where “sufficiently close” depends on {f}, and more precisely depends on the sectoral aperture parameter {\epsilon} associated to {f}); again, one can verify that {\delta_\lambda f} lies in {\mathcal{SE}({\bf C} \rightarrow {\bf C})} for {\lambda} sufficiently close to {1}. These invariances (which relocalise the integration functional {I} onto other contours than the real line {{\bf R}}) are very useful for computing integrals, and in particular for computing gaussian integrals. For instance, the complex translation invariance tells us (after shifting by {b/a}) that

\displaystyle  I( z \mapsto e^{-\pi (az^2 + 2bz + c) } ) = e^{-\pi (c-b^2/a)} I( z \mapsto e^{-\pi a z^2} )

when {a,b,c \in {\bf C}} with {\hbox{Re}(a) > 0}, and then an application of the complex scaling law (and a continuity argument, observing that there is a compact path connecting {a} to {1} in the right half plane) gives

\displaystyle  I( z \mapsto e^{-\pi (az^2 + 2bz + c) } ) = a^{-1/2} e^{-\pi (c-b^2/a)} I( z \mapsto e^{-\pi z^2} )

using the branch of {a^{-1/2}} on the right half-plane for which {1^{-1/2} = 1}. Using the normalisation (4) we thus have

\displaystyle  I( z \mapsto e^{-\pi (az^2 + 2bz + c) } ) = a^{-1/2} e^{-\pi (c-b^2/a)}

giving the usual gaussian integral formula

\displaystyle  \int_{\bf R} e^{-\pi (ax^2 + 2bx + c)}\ dx = a^{-1/2} e^{-\pi (c-b^2/a)}. \ \ \ \ \ (9)

This is a basic illustration of the power that a large symmetry group (in this case, the complex homothety group) can bring to bear on the task of computing integrals.
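
As a sanity check on (9), one can compare both sides numerically for a test triple {a,b,c} with {\hbox{Re}(a)>0}; the sketch below (the specific values are of course arbitrary) uses the principal branch of {a^{-1/2}}, which agrees on the right half-plane with the branch described above.

```python
import numpy as np
from scipy.integrate import quad

# Numerical spot-check of the gaussian identity (9); the values of
# a, b, c are arbitrary test parameters with Re(a) > 0.
a, b, c = 1.5 + 0.7j, 0.3 - 0.2j, 0.1 + 0.4j
f = lambda x: np.exp(-np.pi * (a * x**2 + 2 * b * x + c))
lhs = quad(lambda x: f(x).real, -np.inf, np.inf)[0] \
    + 1j * quad(lambda x: f(x).imag, -np.inf, np.inf)[0]
rhs = a ** -0.5 * np.exp(-np.pi * (c - b**2 / a))    # principal branch of a^{-1/2}
print(abs(lhs - rhs))                                # ~ 1e-10
```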

One can extend this sort of analysis to higher dimensions. For any natural number {n \geq 1}, let {\mathcal{SE}({\bf C}^n \rightarrow {\bf C})} denote the space of all functions {f: {\bf C}^n \rightarrow {\bf C}} which are jointly entire in the sense that {f(z_1,\ldots,z_n)} can be expressed as a Taylor series in {z_1,\ldots,z_n} which is absolutely convergent for all choices of {z_1,\ldots,z_n}, and such that there exists an {\epsilon > 0} such that for any {A>0} there is {C_A>0} for which one has the bound

\displaystyle  |f(z)| \leq C_A (1+|z|)^{-A}

whenever {|\hbox{Im}(z_j)| \leq A + \epsilon |\hbox{Re}(z_j)|} for all {1 \leq j \leq n}, where {z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}} and {|z| := (|z_1|^2+\ldots+|z_n|^2)^{1/2}}. Again, we call such functions Schwartz entire functions; a typical example is the function

\displaystyle  f(z) := e^{-\pi (z^T A z + 2b^T z + c)}

where {A} is an {n \times n} complex symmetric matrix with positive definite real part, {b} is a vector in {{\bf C}^n}, and {c} is a complex number. We can then define an abstract integration functional {I: \mathcal{SE}({\bf C}^n \rightarrow {\bf C}) \rightarrow {\bf C}} by integration on the real slice {{\bf R}^n}:

\displaystyle  I(f) := \int_{{\bf R}^n} f(x)\ dx

where {dx} is the usual Lebesgue measure on {{\bf R}^n}. By contour shifting in each of the {n} variables {z_1,\ldots,z_n} separately, we see that {I} is invariant with respect to complex translations of each of the {z_j} variables, and is thus invariant under translating the joint variable {z} by {{\bf C}^n}. One can also verify the scaling law

\displaystyle  I(\delta_A f) = \hbox{det}(A) I(f)

for {n \times n} complex matrices {A} sufficiently close to the identity, where {\delta_A f(z) := f(A^{-1} z)}. This can be seen for shear transformations {A} by Fubini's theorem and the aforementioned translation invariance, while for diagonal transformations near the identity this can be seen from {n} applications of the one-dimensional scaling law, and the general case then follows by composition. Among other things, these laws then easily lead to the higher-dimensional generalisation

\displaystyle  \int_{{\bf R}^n} e^{-\pi (x^T A x + 2 b^T x + c)}\ dx = \hbox{det}(A)^{-1/2} e^{-\pi (c-b^T A^{-1} b)} \ \ \ \ \ (10)

whenever {A} is a complex symmetric matrix with positive definite real part, {b} is a vector in {{\bf C}^n}, and {c} is a complex number, basically by repeating the one-dimensional argument sketched earlier. Here, we choose the branch of {\hbox{det}(A)^{-1/2}}, defined continuously on the indicated class of matrices {A}, for which {\hbox{det}(1)^{-1/2} = 1}.
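
The same kind of spot-check works in higher dimensions; here is a sketch of (10) for {n=2} (with an arbitrary test matrix {A} whose real part is positive definite, and using the principal branch of {\hbox{det}(A)^{-1/2}}, which coincides with the branch just described for this particular {A}).

```python
import numpy as np
from scipy.integrate import nquad

# Numerical spot-check of (10) for n = 2; A is complex symmetric with
# positive definite real part, and b, c are arbitrary test values.
A = np.array([[1.2 + 0.3j, 0.2 - 0.1j],
              [0.2 - 0.1j, 0.9 + 0.2j]])
b = np.array([0.1 + 0.2j, -0.3j])
c = 0.05 - 0.1j

def integrand(x1, x2, part):
    x = np.array([x1, x2])
    v = np.exp(-np.pi * (x @ A @ x + 2 * b @ x + c))
    return v.real if part == 0 else v.imag

box = [[-6, 6], [-6, 6]]         # the gaussian decay makes this truncation harmless
lhs = nquad(integrand, box, args=(0,))[0] + 1j * nquad(integrand, box, args=(1,))[0]
rhs = np.linalg.det(A) ** -0.5 * np.exp(-np.pi * (c - b @ np.linalg.solve(A, b)))
print(abs(lhs - rhs))            # small
```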

Now we turn to an integration functional suitable for computing complex gaussian integrals such as

\displaystyle  \int_{{\bf C}^n} e^{-2\pi (z^\dagger A z + b^\dagger z + z^\dagger \tilde b + c)}\ dz d\overline{z}, \ \ \ \ \ (11)

where {z} is now a complex variable

\displaystyle  z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix},

{z^\dagger} is the adjoint

\displaystyle  z^\dagger := (\overline{z_1},\ldots, \overline{z_n}),

{A} is a complex {n \times n} matrix with positive definite Hermitian part, {b, \tilde b} are column vectors in {{\bf C}^n}, {c} is a complex number, and {dz d\overline{z} = \prod_{j=1}^n 2 d\hbox{Re}(z_j) d\hbox{Im}(z_j)} is {2^n} times Lebesgue measure on {{\bf C}^n}. (The factors of two here turn out to be a natural normalisation, but they can be ignored on a first reading.) As we shall see later, such integrals are relevant when performing computations on the Gaussian Unitary Ensemble (GUE) in random matrix theory. Note that the integrand here is not complex analytic due to the presence of the complex conjugates. However, this can be dealt with by the trick of replacing the complex conjugate {\overline{z}} by a variable {z^*} which is formally conjugate to {z}, but which is allowed to vary independently of {z}. More precisely, let {\mathcal{SA}({\bf C}^n \times {\bf C}^n \rightarrow {\bf C})} be the space of all functions {f: (z,z^*) \mapsto f(z,z^*)} of two independent {n}-tuples

\displaystyle  z = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}, z^* = \begin{pmatrix} z_1^* \\ \vdots \\ z_n^* \end{pmatrix}

of complex variables, which are jointly entire in all {2n} variables (in the sense defined previously, i.e. there is a joint Taylor series that is absolutely convergent for all independent choices of {z, z^* \in {\bf C}^n}), and such that there is an {\epsilon>0} such that for every {A>0} there is {C_A>0} such that one has the bound

\displaystyle  |f(z,z^*)| \leq C_A (1 + |z|)^{-A}

whenever {|z^* - \overline{z}| \leq A + \epsilon |z|}. We will call such functions Schwartz analytic. Note that the integrand in (11) is Schwartz analytic when {A} has positive definite Hermitian part, if we reinterpret {z^\dagger} as the transpose of {z^*} rather than as the adjoint of {z} in order to make the integrand entire in {z} and {z^*}. We can then define an abstract integration functional {I: \mathcal{SA}({\bf C}^n \times {\bf C}^n \rightarrow {\bf C}) \rightarrow {\bf C}} by the formula

\displaystyle  I(f) := \int_{{\bf C}^n} f(z,\overline{z})\ dz d\overline{z}, \ \ \ \ \ (12)

thus {I} can be localised to the slice {\{ (z,\overline{z}): z \in {\bf C}^n\}} of {{\bf C}^n \times {\bf C}^n} (though, as with previous functionals, one can use contour shifting to relocalise {I} to other slices also.) One can also write this integral as

\displaystyle  I(f) = 2^n \int_{{\bf R}^n \times {\bf R}^n} f(x+iy, x-iy)\ dx dy

and note that the integrand here is a Schwartz entire function on {{\bf C}^n \times {\bf C}^n}, thus linking the Schwartz analytic integral with the Schwartz entire integral. Using this connection, one can verify that this functional {I} is invariant with respect to translating {z} and {z^*} by independent shifts in {{\bf C}^n} (thus giving a {{\bf C}^n \times {\bf C}^n} translation symmetry), and one also has the independent dilation symmetry

\displaystyle  I(\delta_{A,B} f) = \hbox{det}(A) \hbox{det}(B) I(f)

for {n \times n} complex matrices {A,B} that are sufficiently close to the identity, where {\delta_{A,B} f(z,z^*) := f(A^{-1} z, B^{-1} z^*)}. Arguing as before, we can then compute (11) as

\displaystyle  \int_{{\bf C}^n} e^{-2\pi (z^\dagger A z + b^\dagger z + z^\dagger \tilde b + c)}\ dz d\overline{z} = \hbox{det}(A)^{-1} e^{-2\pi (c - b^\dagger A^{-1} \tilde b)}. \ \ \ \ \ (13)

In particular, this gives an integral representation for the determinant-reciprocal {\hbox{det}(A)^{-1}} of a complex {n \times n} matrix with positive definite Hermitian part, in terms of gaussian expressions in which {A} only appears linearly in the exponential:

\displaystyle  \hbox{det}(A)^{-1} = \int_{{\bf C}^n} e^{-2\pi z^\dagger A z}\ dz d\overline{z}.

This formula is then convenient for computing statistics such as

\displaystyle  \mathop{\bf E} \hbox{det}(W_n-E-i\eta)^{-1}

for random matrices {W_n} drawn from the Gaussian Unitary Ensemble (GUE), and some choice of spectral parameter {E+i\eta} with {\eta>0}; we review this computation later in this post. By the trick of matrix differentiation of the determinant (as reviewed in this recent blog post), one can also use this method to compute matrix-valued statistics such as

\displaystyle  \mathop{\bf E} \hbox{det}(W_n-E-i\eta)^{-1} (W_n-E-i\eta)^{-1}.
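
(Before continuing, here is a quick numerical confirmation of (13) in the simplest case {n=1}, {b = \tilde b = 0}, {c=0}, where the identity reduces to {\int_{\bf C} e^{-2\pi a|z|^2}\ dz d\overline{z} = 1/a}; the value of {a} below is an arbitrary test parameter with positive real part.)

```python
import numpy as np
from scipy.integrate import dblquad

# Spot-check of (13) with n = 1 and b = b~ = c = 0: the integral of
# exp(-2 pi a |z|^2) against dz dzbar = 2 dx dy equals 1/a.
a = 1.3 + 0.6j                    # positive (here: real) Hermitian part
g = lambda y, x: 2 * np.exp(-2 * np.pi * a * (x * x + y * y))
lhs = dblquad(lambda y, x: g(y, x).real, -5, 5, -5, 5)[0] \
    + 1j * dblquad(lambda y, x: g(y, x).imag, -5, 5, -5, 5)[0]
print(abs(lhs - 1 / a))           # small
```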

However, if one restricts attention to classical integrals over real or complex (and in particular, commuting or bosonic) variables, it does not seem possible to easily eradicate the inverse determinant factors from such calculations, which is unfortunate because many statistics of interest in random matrix theory, such as the expected Stieltjes transform

\displaystyle  \mathop{\bf E} \frac{1}{n} \hbox{tr} (W_n-E-i\eta)^{-1},

which is the Stieltjes transform of the density of states, are then contaminated by these unwanted factors. However, it turns out (as I learned recently from Peter Sarnak and Tom Spencer) that it is possible to cancel out these inverse determinant factors by balancing the bosonic gaussian integrals with an equal number of fermionic gaussian integrals, in which one integrates over a family of anticommuting variables. These fermionic integrals are closer in spirit to the polynomial integral (6) than to Lebesgue type integrals, and in particular obey a scaling law which is inverse to the Lebesgue scaling (for instance, a linear change of fermionic variables {\zeta \mapsto A \zeta} ends up transforming a fermionic integral by {\hbox{det}(A)} rather than {\hbox{det}(A)^{-1}}), which conveniently cancels out the reciprocal determinants in the previous calculations. Furthermore, one can combine the bosonic and fermionic integrals into a unified integration concept, known as the Berezin integral (or Grassmann integral), in which one integrates functions of supervectors (vectors with both bosonic and fermionic components), and which is of particular importance in the theory of supersymmetry in physics. (The prefix "super" in physics means, roughly speaking, that the object or concept that the prefix is attached to contains both bosonic and fermionic aspects.) When one applies this unified integration concept to gaussians, this can lead to quite compact and efficient calculations (provided that one is willing to work with "super"-analogues of various concepts in classical linear algebra, such as the supertrace or superdeterminant).

Abstract integrals of the flavour of (6) arose in quantum field theory, when physicists sought to formally compute integrals of the form

\displaystyle  \int F( x_1, \ldots, x_n, \xi_1, \ldots, \xi_m )\ dx_1 \ldots dx_n d\xi_1 \ldots d\xi_m \ \ \ \ \ (14)

where {x_1,\ldots,x_n} are familiar commuting (or bosonic) variables (which, in particular, can often be localised to be scalar variables taking values in {{\bf R}} or {{\bf C}}), while {\xi_1,\ldots,\xi_m} were more exotic anticommuting (or fermionic) variables, taking values in some vector space of fermions. (As we shall see shortly, one can formalise these concepts by working in a supercommutative algebra.) The integrand {F(x_1,\ldots,x_n,\xi_1,\ldots,\xi_m)} was a formally analytic function of {x_1,\ldots,x_n,\xi_1,\ldots,\xi_m}, in that it could be expanded as a (formal, noncommutative) power series in the variables {x_1,\ldots,x_n,\xi_1,\ldots,\xi_m}. For functions {F(x_1,\ldots,x_n)} that depend only on bosonic variables, it is certainly possible for such analytic functions to be in the Schwartz class and thus fall under the scope of the classical integral, as discussed previously. However, functions {F(\xi_1,\ldots,\xi_m)} that depend on fermionic variables {\xi_1,\ldots,\xi_m} behave rather differently. Indeed, a fermionic variable {\xi} must anticommute with itself, so that {\xi^2 = 0}. In particular, any power series in {\xi} terminates after the linear term in {\xi}, so that a function {F(\xi)} can only be analytic in {\xi} if it is a polynomial of degree at most {1} in {\xi}; more generally, an analytic function {F(\xi_1,\ldots,\xi_m)} of {m} fermionic variables {\xi_1,\ldots,\xi_m} must be a polynomial of degree at most {m}, and an analytic function {F(x_1,\ldots,x_n,\xi_1,\ldots,\xi_m)} of {n} bosonic and {m} fermionic variables can be Schwartz in the bosonic variables but will be polynomial in the fermionic variables. As such, to interpret the integral (14), one can use classical (Lebesgue) integration (or the variants discussed above for integrating Schwartz entire or Schwartz analytic functions) for the bosonic variables, but must use abstract integrals such as (6) for the fermionic variables, leading to the concept of Berezin integration mentioned earlier.
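
To make the fermionic side of (14) concrete, here is a self-contained sketch (illustrative only, with a sign convention fixed by hand) which performs Grassmann arithmetic in {2m} generators, ordered {\bar\theta_1, \theta_1, \ldots, \bar\theta_m, \theta_m}, and confirms the fermionic gaussian identity {\int \prod_i d\bar\theta_i d\theta_i\ e^{-\bar\theta^T A \theta} = \hbox{det}(A)}: the determinant now appears with a positive rather than a negative power, in contrast to (13). With the convention that each pair integrates {\bar\theta_i \theta_i} to {-1}, the Berezin integral is {(-1)^m} times the coefficient of the top monomial in the expansion of the exponential.

```python
import numpy as np
from itertools import product
from math import factorial

# A Grassmann element is a dict {sorted tuple of generator indices: coeff};
# the generators anticommute, so repeated generators vanish and every
# reordering costs a sign.

def gmul(a, b):
    out = {}
    for ma, ca in a.items():
        for mb, cb in b.items():
            if set(ma) & set(mb):
                continue                               # xi^2 = 0
            sign = (-1) ** sum(1 for i in ma for j in mb if i > j)
            key = tuple(sorted(ma + mb))
            out[key] = out.get(key, 0) + sign * ca * cb
    return out

def berezin_gaussian(A):
    m = A.shape[0]
    # generator 2i is thetabar_{i+1}; generator 2j+1 is theta_{j+1}
    S = {}
    for i, j in product(range(m), repeat=2):
        key = tuple(sorted((2 * i, 2 * j + 1)))
        sign = 1 if 2 * i < 2 * j + 1 else -1          # store in sorted order
        S[key] = S.get(key, 0) + sign * A[i, j]
    # exp(-S) terminates at order m: S^{m+1} would need more than 2m generators
    expS, power = {(): 1.0}, {(): 1.0}
    for k in range(1, m + 1):
        power = gmul(power, {key: -c for key, c in S.items()})
        for key, c in power.items():
            expS[key] = expS.get(key, 0) + c / factorial(k)
    top = expS.get(tuple(range(2 * m)), 0)             # coefficient of the top monomial
    return (-1) ** m * top                             # the Berezin integral, in our convention

A = np.random.default_rng(2).standard_normal((3, 3)) \
    + 1j * np.random.default_rng(3).standard_normal((3, 3))
print(berezin_gaussian(A), np.linalg.det(A))           # the two values agree
```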

In this post I would like to set out some of the basic algebraic formalism of Berezin integration, particularly with regards to integration of gaussian-type expressions, and then show how this formalism can be used to perform computations involving GUE (for instance, one can compute the density of states of GUE by this machinery without recourse to the theory of orthogonal polynomials). The use of supersymmetric gaussian integrals to analyse ensembles such as GUE appears in the work of Efetov (and was also proposed in the slightly earlier works of Parisi-Sourlas and McKane, with a related approach also appearing in the work of Wegner); the material here is adapted from this survey of Mirlin, as well as the later papers of Disertori-Pinson-Spencer and of Disertori.


[These are notes intended mostly for myself, as these topics are useful in random matrix theory, but may be of interest to some readers also. -T.]

One of the most fundamental partial differential equations in mathematics is the heat equation

\displaystyle  \partial_t f = L f \ \ \ \ \ (1)

where {f: [0,+\infty) \times {\bf R}^n \rightarrow {\bf R}} is a scalar function {(t,x) \mapsto f(t,x)} of both time and space, and {L} is the Laplacian {L := \frac{1}{2} \Delta = \frac{1}{2} \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2}}. For the purposes of this post, we will ignore all technical issues of regularity and decay, and always assume that the solutions to equations such as (1) have all the regularity and decay needed to justify all formal operations such as the chain rule, integration by parts, or differentiation under the integral sign. The factor of {\frac{1}{2}} in the definition of the heat propagator {L} is of course an arbitrary normalisation, chosen for some minor technical reasons; one can certainly continue the discussion below with other choices of normalisations if desired.

In probability theory, this equation takes on particular significance when {f} is restricted to be non-negative, and furthermore to be a probability measure at each time, in the sense that

\displaystyle  \int_{{\bf R}^n} f(t,x)\ dx = 1

for all {t}. (Actually, it suffices to verify this constraint at time {t=0}, as the heat equation (1) will then preserve this constraint.) Indeed, in this case, one can interpret {f(t,x)\ dx} as the probability distribution of a Brownian motion

\displaystyle  dx = dB(t) \ \ \ \ \ (2)

where {x = x(t) \in {\bf R}^n} is a stochastic process with initial probability distribution {f(0,x)\ dx}; see for instance this previous blog post for more discussion.

A model example of a solution to the heat equation to keep in mind is that of the fundamental solution

\displaystyle  G(t,x) = \frac{1}{(2\pi t)^{n/2}} e^{-|x|^2/2t} \ \ \ \ \ (3)

defined for any {t>0}, which represents the distribution of Brownian motion of a particle starting at the origin {x=0} at time {t=0}. At time {t}, {G(t,\cdot)} is the probability density of an {{\bf R}^n}-valued random variable, each coordinate of which is an independent gaussian random variable of mean zero and variance {t}. (As {t \rightarrow 0^+}, {G(t)} converges in the sense of distributions to a Dirac mass at the origin.)

The heat equation can also be viewed as the gradient flow for the Dirichlet form

\displaystyle  D(f,g) := \frac{1}{2} \int_{{\bf R}^n} \nabla f \cdot \nabla g\ dx \ \ \ \ \ (4)

since one has the integration by parts identity

\displaystyle  \int_{{\bf R}^n} Lf(x) g(x)\ dx = \int_{{\bf R}^n} f(x) Lg(x)\ dx = - D(f,g) \ \ \ \ \ (5)

for all smooth, rapidly decreasing {f,g}, which formally implies that {L f} is (half of) the negative gradient of the Dirichlet energy {D(f,f) = \frac{1}{2} \int_{{\bf R}^n} |\nabla f|^2\ dx} with respect to the {L^2({\bf R}^n,dx)} inner product. Among other things, this implies that the Dirichlet energy decreases in time:

\displaystyle  \partial_t D(f,f) = - 2 \int_{{\bf R}^n} |Lf|^2\ dx. \ \ \ \ \ (6)

For instance, for the fundamental solution (3), one can verify for any time {t>0} that

\displaystyle  D(G,G) = \frac{n}{2^{n+2} \pi^{n/2}} \frac{1}{t^{(n+2)/2}} \ \ \ \ \ (7)

(assuming I have not made a mistake in the calculation). In a similar spirit we have

\displaystyle  \partial_t \int_{{\bf R}^n} |f|^2\ dx = - 2 D(f,f). \ \ \ \ \ (8)

Since {D(f,f)} is non-negative, the formula (6) implies that {\int_{{\bf R}^n} |Lf|^2\ dx} is integrable in time, and in particular we see that {Lf} converges to zero as {t \rightarrow \infty}, in some averaged {L^2} sense at least; similarly, (8) suggests that {D(f,f)} also converges to zero. This suggests that {f} converges to a constant function; but as {f} is also supposed to decay to zero at spatial infinity, we thus expect solutions to the heat equation in {{\bf R}^n} to decay to zero in some sense as {t \rightarrow \infty}. However, the decay is only expected to be polynomial in nature rather than exponential; for instance, the solution (3) decays in the {L^\infty} norm like {O(t^{-n/2})}.
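
(A quick numerical sketch, for {n=1}, that doubles as a check that no mistake was made in (7): the computed Dirichlet energy of the fundamental solution matches {\frac{1}{8\sqrt{\pi}} t^{-3/2}}, and its time derivative matches {-2\int_{\bf R} |LG|^2\ dx} as predicted by (6).)

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of (6) and (7) for the fundamental solution G in n = 1,
# where (7) reads D(G,G) = (1/(8 sqrt(pi))) t^{-3/2}.
t = 0.9
G  = lambda x: np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)
Gx = lambda x: -x / t * G(x)                          # dG/dx
LG = lambda x: 0.5 * (x**2 / t**2 - 1 / t) * G(x)     # (1/2) d^2 G / dx^2
D  = 0.5 * quad(lambda x: Gx(x)**2, -np.inf, np.inf)[0]
print(D, t**-1.5 / (8 * np.sqrt(np.pi)))              # the formula (7)
dDdt = -2 * quad(lambda x: LG(x)**2, -np.inf, np.inf)[0]
print(dDdt, -1.5 * t**-2.5 / (8 * np.sqrt(np.pi)))    # d/dt of (7), via (6)
```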

Since {L1=0}, we also observe the basic cancellation property

\displaystyle  \int_{{\bf R}^n} Lf(x) \ dx = 0 \ \ \ \ \ (9)

for any function {f}.

There are other quantities relating to {f} that also decrease in time under heat flow, particularly in the important case when {f} is a probability measure. In this case, it is natural to introduce the entropy

\displaystyle  S(f) := \int_{{\bf R}^n} f(x) \log f(x)\ dx.

Thus, for instance, if {f(x)\ dx} is the uniform distribution on some measurable subset {E} of {{\bf R}^n} of finite measure {|E|}, the entropy would be {-\log |E|}. Intuitively, as the entropy decreases, the probability distribution gets wider and flatter. For instance, in the case of the fundamental solution (3), one has {S(G) = -\frac{n}{2} \log( 2 \pi e t )} for any {t>0}, reflecting the fact that {G(t)} is approximately uniformly distributed on a ball of radius {O(\sqrt{t})} (and thus of measure {O(t^{n/2})}).

A short formal computation using (9) and (5) (and assuming for simplicity that {f} is strictly positive, which is not an unreasonable hypothesis, particularly in view of the strong maximum principle) shows that

\displaystyle  \partial_t S(f) = \int_{{\bf R}^n} (Lf) \log f + f \frac{Lf}{f}\ dx

\displaystyle  = \int_{{\bf R}^n} (Lf) \log f\ dx

\displaystyle  = - D( f, \log f )

\displaystyle  = - \frac{1}{2} \int_{{\bf R}^n} \frac{|\nabla f|^2}{f}\ dx

\displaystyle  = - 4D( g, g )

where {g := \sqrt{f}} is the square root of {f}. For instance, if {f} is the fundamental solution (3), one can check that {D(g,g) = \frac{n}{8t}} (note that this is a significantly cleaner formula than (7)!).
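
As another sanity check (again my own, not in the original post), one can confirm the identity {\partial_t S(f) = -4D(g,g)} numerically on the fundamental solution in dimension {n=1}, where both sides should equal {-\frac{1}{2t}}; this assumes the scipy library:

```python
import numpy as np
from scipy.integrate import quad

# Check dS/dt = -4 D(g,g) on the fundamental solution with n = 1:
# here S(G(t)) = -(1/2) log(2 pi e t) and D(g,g) = 1/(8t), so both sides are -1/(2t).
def S(t):
    G = lambda x: (2*np.pi*t)**-0.5 * np.exp(-x**2 / (2*t))
    val, _ = quad(lambda x: G(x) * np.log(G(x)), -30, 30)
    return val

t, h = 1.0, 1e-4
print((S(t + h) - S(t - h)) / (2*h))  # centred difference: ~ -0.5
print(-4 * (1 / (8*t)))               # -4 D(g,g) = -0.5 exactly
```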

In particular, the entropy is decreasing, which corresponds well to one’s intuition that the heat equation (or Brownian motion) should serve to spread out a probability distribution over time.

Actually, one can say more: the rate of decrease {4D(g,g)} of the entropy is itself decreasing, or in other words the entropy is convex in time. I do not have a satisfactorily intuitive reason for this phenomenon, but it can be proved by a straightforward application of basic several-variable calculus tools (the chain rule, product rule, quotient rule, and integration by parts), together with completing the square. Namely, by the chain rule we have

\displaystyle  L \phi(f) = \phi'(f) Lf + \frac{1}{2} \phi''(f) |\nabla f|^2, \ \ \ \ \ (10)

valid for any smooth function {\phi: {\bf R} \rightarrow {\bf R}}; applying this identity with {f} replaced by {g} and {\phi(y) := y^2}, and using (1), we see that

\displaystyle  2 g \partial_t g = 2 g L g + |\nabla g|^2

and thus (again assuming that {f}, and hence {g}, is strictly positive to avoid technicalities)

\displaystyle  \partial_t g = Lg + \frac{|\nabla g|^2}{2g}.

We thus have

\displaystyle  \partial_t D(g,g) = 2 D(g,Lg) + D(g, \frac{|\nabla g|^2}{g} ).

It is now convenient to compute using the Einstein summation convention to hide the summation over indices {i,j = 1,\ldots,n}. We have

\displaystyle  2 D(g,Lg) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) (\partial_i \partial_j \partial_j g)\ dx

and

\displaystyle  D(g, \frac{|\nabla g|^2}{g} ) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) \partial_i \frac{\partial_j g \partial_j g}{g}\ dx.

By integration by parts and interchanging partial derivatives, we may write the first integral as

\displaystyle  2 D(g,Lg) = - \frac{1}{2} \int_{{\bf R}^n} (\partial_i \partial_j g) (\partial_i \partial_j g)\ dx,

and from the quotient and product rules, we may write the second integral as

\displaystyle  D(g, \frac{|\nabla g|^2}{g} ) = \int_{{\bf R}^n} \frac{(\partial_i g) (\partial_j g) (\partial_i \partial_j g)}{g} - \frac{(\partial_i g) (\partial_j g) (\partial_i g) (\partial_j g)}{2g^2}\ dx.

Gathering terms, completing the square, and making the summations explicit again, we see that

\displaystyle  \partial_t D(g,g) =- \frac{1}{2} \int_{{\bf R}^n} \frac{\sum_{i=1}^n \sum_{j=1}^n |g \partial_i \partial_j g - (\partial_i g) (\partial_j g)|^2}{g^2}\ dx

and so in particular {D(g,g)} is always decreasing.

The above identity can also be written as

\displaystyle  \partial_t D(g,g) = - \frac{1}{2} \int_{{\bf R}^n} |\nabla^2 \log g|^2 g^2\ dx.
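
As a quick consistency check of this identity (a verification of my own), consider the fundamental solution (3): there {\log g = -\frac{n}{4} \log(2\pi t) - \frac{|x|^2}{4t}}, so that {\nabla^2 \log g = -\frac{1}{2t} I} and hence {|\nabla^2 \log g|^2 = \frac{n}{4t^2}}, while {\int_{{\bf R}^n} g^2\ dx = \int_{{\bf R}^n} G\ dx = 1}; the right-hand side thus evaluates to {-\frac{n}{8t^2}}, which indeed matches {\partial_t D(g,g) = \partial_t \frac{n}{8t} = -\frac{n}{8t^2}} as computed earlier.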

Exercise 1 Give an alternate proof of the above identity by writing {f = e^{2u}}, {g = e^u} and deriving the equation {\partial_t u = Lu + |\nabla u|^2} for {u}.

It was observed in a well known paper of Bakry and Émery that the above monotonicity properties hold for a much larger class of heat flow-type equations, and lead to a number of important relations between energy and entropy, such as the log-Sobolev inequality of Gross and of Federbush, and the hypercontractivity inequality of Nelson; we will discuss one such family of generalisations (or more precisely, variants) below the fold.

Read the rest of this entry »

Let {n} be a large natural number, and let {M_n} be a matrix drawn from the Gaussian Unitary Ensemble (GUE), by which we mean that {M_n} is a Hermitian matrix whose upper triangular entries are iid complex gaussians with mean zero and variance one, and whose diagonal entries are iid real gaussians with mean zero and variance one (and independent of the upper triangular entries). The eigenvalues {\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)} are then real and almost surely distinct, and can be viewed as a random point process {\Sigma^{(n)} := \{\lambda_1(M_n),\ldots,\lambda_n(M_n)\}} on the real line. One can then form the {k}-point correlation functions {\rho_k^{(n)}: {\bf R}^k \rightarrow {\bf R}^+} for every {k \geq 0}, which can be defined by duality by requiring

\displaystyle  \mathop{\bf E} \sum_{i_1,\ldots,i_k \hbox{ distinct}} F( \lambda_{i_1}(M_n),\ldots,\lambda_{i_k}(M_n))

\displaystyle  = \int_{{\bf R}^k} \rho_k^{(n)}(x_1,\ldots,x_k) F(x_1,\ldots,x_k)\ dx_1 \ldots dx_k

for any test function {F: {\bf R}^k \rightarrow {\bf R}^+}. For GUE, which is a continuous matrix ensemble, one can also define {\rho_k^{(n)}(x_1,\ldots,x_k)} for distinct {x_1<\ldots<x_k} as the unique quantity such that the probability that there is an eigenvalue in each of the intervals {[x_1,x_1+\epsilon],\ldots,[x_k,x_k+\epsilon]} is {(\rho_k^{(n)}(x_1,\ldots,x_k)+o(1))\epsilon^k} in the limit {\epsilon\rightarrow 0}.
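
To make the definition concrete, here is a small Monte Carlo sketch (my own illustration, assuming numpy): sample GUE matrices as defined above and estimate the {1}-point function {\rho_1^{(n)}} empirically; at this normalisation it should be close to the rescaled semicircle density {\sqrt{n} \rho_{sc}(x/\sqrt{n})} that appears below.

```python
import numpy as np

# Estimate the 1-point correlation function rho_1^{(n)} of GUE by direct sampling.
rng = np.random.default_rng(0)
n, trials = 100, 200

eigs = []
for _ in range(trials):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    M = (A + A.conj().T) / 2   # Hermitian; off-diagonal entries are complex gaussian
                               # with E|M_ij|^2 = 1, diagonal real gaussian, variance 1
    eigs.append(np.linalg.eigvalsh(M))
eigs = np.concatenate(eigs)

# Compare the empirical density near x = 0 with the semicircle prediction
# sqrt(n) * rho_sc(x / sqrt(n)), which equals sqrt(n)/pi at x = 0.
x0, dx = 0.0, 0.5
empirical = np.mean(np.abs(eigs - x0) < dx/2) * n / dx
predicted = np.sqrt(n) / np.pi
print(empirical, predicted)   # both ~ 3.18
```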

As is well known, the GUE process is a determinantal point process, which means that {k}-point correlation functions can be explicitly computed as

\displaystyle  \rho^{(n)}_k(x_1,\ldots,x_k) = \det( K^{(n)}(x_i,x_j) )_{1 \leq i,j \leq k}

for some kernel {K^{(n)}: {\bf R} \times {\bf R} \rightarrow {\bf C}}; explicitly, one has

\displaystyle  K^{(n)}(x,y) := \sum_{k=0}^{n-1} P_k(x) e^{-x^2/4}P_k(y) e^{-y^2/4}

where {P_0, P_1,\ldots} are the (normalised) Hermite polynomials; see this previous blog post for details.
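
Here is a short numerical sketch of this formula (my own, assuming numpy, and assuming the convention that the functions {P_k(x) e^{-x^2/4}} form an orthonormal system in {L^2({\bf R})}). It builds the kernel from the stable three-term recurrence for the standard orthonormal Hermite functions {h_k}, for which {P_k(x) e^{-x^2/4} = 2^{-1/4} h_k(x/\sqrt{2})}, and checks that {\rho_1^{(n)}(x) = K^{(n)}(x,x)} roughly follows the semicircle law:

```python
import numpy as np

def hermite_functions(y, n):
    """Orthonormal Hermite functions h_0, ..., h_{n-1} evaluated at the points y,
    via the stable recurrence h_k = sqrt(2/k) y h_{k-1} - sqrt((k-1)/k) h_{k-2}."""
    h = np.zeros((n, len(y)))
    h[0] = np.pi**-0.25 * np.exp(-y**2 / 2)
    if n > 1:
        h[1] = np.sqrt(2.0) * y * h[0]
    for k in range(2, n):
        h[k] = np.sqrt(2.0/k) * y * h[k-1] - np.sqrt((k-1)/k) * h[k-2]
    return h

n = 50
x = np.linspace(-20.0, 20.0, 9)
phi = 2**-0.25 * hermite_functions(x / np.sqrt(2), n)  # phi_k(x) = P_k(x) e^{-x^2/4}
rho1 = np.sum(phi**2, axis=0)                          # K^{(n)}(x, x)

semicircle = np.sqrt(n) * (1/(2*np.pi)) * np.sqrt(np.clip(4 - x**2/n, 0, None))
print(np.c_[x, rho1, semicircle])  # the last two columns roughly agree
```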

Using the asymptotics of Hermite polynomials (which then give asymptotics for the kernel {K^{(n)}}), one can take a limit of a (suitably rescaled) sequence of GUE processes to obtain the Dyson sine process, which is a determinantal point process {\Sigma} on the real line with correlation functions

\displaystyle  \rho_k(x_1,\ldots,x_k) = \det( K(x_i,x_j) )_{1 \leq i,j \leq k} \ \ \ \ \ (1)

where {K} is the Dyson sine kernel

\displaystyle  K(x,y) := \frac{\sin(\pi(x-y))}{\pi(x-y)}. \ \ \ \ \ (2)

A bit more precisely, for any fixed bulk energy {-2 < u < 2}, the renormalised point processes {\rho_{sc}(u) \sqrt{n} ( \Sigma^{(n)} - \sqrt{n} u )} converge in distribution in the vague topology to {\Sigma} as {n \rightarrow \infty}, where {\rho_{sc}(u) := \frac{1}{2\pi} (4-u^2)^{1/2}_+} is the semi-circular law density.

On the other hand, an important feature of the GUE process {\Sigma^{(n)} = \{\lambda_1,\ldots,\lambda_n\}} is its stationarity (modulo rescaling) under Dyson Brownian motion

\displaystyle  d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i-\lambda_j}

which describes the stochastic evolution of eigenvalues of a Hermitian matrix under independent Brownian motion of its entries, and is discussed in this previous blog post. To cut a long story short, this stationarity tells us that the self-similar {n}-point correlation function

\displaystyle  \rho^{(n)}_n(t,x) := t^{-n/2} \rho^{(n)}_n(x/\sqrt{t})

obeys the Dyson heat equation

\displaystyle  \partial_t \rho^{(n)}_n = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \leq i,j \leq n;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i-x_j}

(see Exercise 11 of the previously mentioned blog post). Note that {\rho^{(n)}_n} vanishes to second order whenever two of the {x_i} coincide, so there is no singularity on the right-hand side. Setting {t=1} and using self-similarity, we can rewrite this equation in time-independent form as

\displaystyle  -\frac{1}{2} \sum_{i=1}^n \partial_i (x_i \rho^{(n)}_n) = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \leq i,j \leq n;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i-x_j}.

One can then integrate out all but {k} of these variables (after carefully justifying convergence) to obtain a system of equations for the {k}-point correlation functions {\rho^{(n)}_k}:

\displaystyle  -\frac{1}{2} \sum_{i=1}^k \partial_i (x_i \rho^{(n)}_k) = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho^{(n)}_k - \sum_{1 \leq i,j \leq k;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_k}{x_i-x_j}

\displaystyle  - \sum_{i=1}^k \partial_{x_i} \int_{\bf R} \frac{\rho^{(n)}_{k+1}(x_1,\ldots,x_{k+1})}{x_i-x_{k+1}}\ dx_{k+1},

where the integral is interpreted in the principal value sense. This system is an example of a BBGKY hierarchy.

If one carefully rescales and takes limits (say at the energy level {u=0}, for simplicity), the left-hand side turns out to rescale to be a lower order term, and one ends up with a hierarchy for the Dyson sine process:

\displaystyle  0 = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho_k - \sum_{1 \leq i,j \leq k;i\neq j} \partial_{x_i} \frac{\rho_k}{x_i-x_j} \ \ \ \ \ (3)

\displaystyle  - \sum_{i=1}^k \partial_{x_i} \int_{\bf R} \frac{\rho_{k+1}(x_1,\ldots,x_{k+1})}{x_i-x_{k+1}}\ dx_{k+1}.

Informally, these equations show that the Dyson sine process {\Sigma = \{ \lambda_i: i \in {\bf Z} \}} is stationary with respect to the infinite Dyson Brownian motion

\displaystyle  d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i-\lambda_j}

where {dB_i} are independent Brownian increments, and the sum is interpreted in a suitable principal value sense.
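
Though not part of the derivation above, one can get a feel for Dyson Brownian motion from a crude Euler-Maruyama simulation for finite {n} (a sketch of my own, assuming numpy; the step size must be kept small, as the drift is singular when eigenvalues approach each other, though in the exact process collisions almost surely never occur):

```python
import numpy as np

# Crude Euler-Maruyama discretisation of Dyson Brownian motion
#   d lambda_i = dB_i + sum_{j != i} dt / (lambda_i - lambda_j)
# for a finite number n of particles.
rng = np.random.default_rng(0)
n, dt, steps = 30, 1e-4, 5000

lam = np.sort(rng.standard_normal(n))  # arbitrary distinct starting points
for _ in range(steps):
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)     # drop the j = i term (1/inf = 0)
    drift = np.sum(1.0 / diff, axis=1)
    lam = lam + drift * dt + np.sqrt(dt) * rng.standard_normal(n)

# The eigenvalue repulsion spreads the particles out; the minimal gap stays positive.
print(lam.min(), lam.max(), np.min(np.diff(np.sort(lam))))
```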

I recently set myself the exercise of deriving the identity (3) directly from the definition (1) of the Dyson sine process, without reference to GUE. This turns out to not be too difficult when done the right way (namely, by modifying the proof of Gaudin’s lemma), although it did take me an entire day of work before I realised this, and I could not find it in the literature (though I suspect that many people in the field have privately performed this exercise in the past). In any case, I am recording the computation here, largely because I really don’t want to have to do it again, but perhaps it will also be of interest to some readers.

Read the rest of this entry »

One of the basic general problems in analytic number theory is to understand as much as possible the fluctuations of the Möbius function {\mu(n)}, defined as {(-1)^k} when {n} is the product of {k} distinct primes, and zero otherwise. For instance, as {\mu} takes values in {\{-1,0,1\}}, we have the trivial bound

\displaystyle  |\sum_{n \leq x} \mu(n)| \leq x

and the seemingly slight improvement

\displaystyle  \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)

is already equivalent to the prime number theorem, as observed by Landau (see e.g. this previous blog post for a proof), while the much stronger (and still open) improvement

\displaystyle  \sum_{n \leq x} \mu(n) = O(x^{1/2+o(1)})

is equivalent to the notorious Riemann hypothesis.
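
One can watch this cancellation numerically; the following sieve (a quick illustration of my own, assuming numpy) computes the Mertens function {M(x) = \sum_{n \leq x} \mu(n)} at {x = 10^6}, where it is minuscule compared with the trivial bound {x} and comfortably below {x^{1/2}}:

```python
import numpy as np

# Sieve out the Moebius function mu(n) for n <= X, then form its partial sums.
X = 10**6
mu = np.ones(X + 1, dtype=np.int64)
mu[0] = 0
prime = np.ones(X + 1, dtype=bool)
prime[:2] = False
for p in range(2, X + 1):
    if prime[p]:
        if p * p <= X:
            prime[p*p::p] = False
            mu[p*p::p*p] = 0   # a square factor p^2 forces mu(n) = 0
        mu[p::p] *= -1         # each distinct prime factor flips the sign

M = np.cumsum(mu)              # Mertens function M(x)
print(abs(M[X]))               # |M(10^6)| = 212
print(int(X**0.5))             # compare with x^{1/2} = 1000
```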

There is a general Möbius pseudorandomness heuristic that suggests that the sign pattern {\mu} behaves so randomly (or pseudorandomly) that one should expect a substantial amount of cancellation in sums that involve the sign fluctuation of the Möbius function in a nontrivial fashion, with the amount of cancellation present comparable to the amount that an analogous random sum would provide; cf. the probabilistic heuristic discussed in this recent blog post. There are a number of ways to make this heuristic precise. As already mentioned, the Riemann hypothesis can be considered one such manifestation of the heuristic. Another manifestation is the following old conjecture of Chowla:

Conjecture 1 (Chowla’s conjecture) For any fixed integer {m} and exponents {a_1,a_2,\ldots,a_m \geq 0}, with at least one of the {a_i} odd (so as not to completely destroy the sign cancellation), we have

\displaystyle  \sum_{n \leq x} \mu(n+1)^{a_1} \ldots \mu(n+m)^{a_m} = o_{x \rightarrow \infty;m}(x).

Note that as {\mu^a = \mu^{a+2}} for any {a \geq 1}, we can reduce to the case when the {a_i} take values in {0,1,2} here. When only one of the {a_i} is odd, this is essentially the prime number theorem in arithmetic progressions (after some elementary sieving), but when two or more of the {a_i} are odd, the problem becomes completely open. For instance, the estimate

\displaystyle  \sum_{n \leq x} \mu(n) \mu(n+2) = o(x)

is morally very close to the conjectured asymptotic

\displaystyle  \sum_{n \leq x} \Lambda(n) \Lambda(n+2) = 2\Pi_2 x + o(x)

for the von Mangoldt function {\Lambda}, where {\Pi_2 := \prod_{p > 2} (1 - \frac{1}{(p-1)^2}) = 0.66016\ldots} is the twin prime constant; this asymptotic in turn implies the twin prime conjecture. (To formally deduce estimates for von Mangoldt from estimates for Möbius, though, typically requires some better control on the error terms than {o()}, in particular gains of some power of {\log x} are usually needed. See this previous blog post for more discussion.)
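
Although the conjecture is open, the cancellation is easy to observe empirically; the following quick computation (my own, assuming numpy) evaluates {\sum_{n \leq x} \mu(n)\mu(n+2)} for {x = 10^6}, which comes out orders of magnitude smaller than the trivial bound {x}:

```python
import numpy as np

# Sieve mu(n) for n <= X + 2, then compute the correlation sum_{n<=X} mu(n) mu(n+2).
X = 10**6
N = X + 2
mu = np.ones(N + 1, dtype=np.int64)
mu[0] = 0
prime = np.ones(N + 1, dtype=bool)
prime[:2] = False
for p in range(2, N + 1):
    if prime[p]:
        if p * p <= N:
            prime[p*p::p] = False
            mu[p*p::p*p] = 0
        mu[p::p] *= -1

corr = int(np.sum(mu[1:X+1] * mu[3:X+3]))  # sum of mu(n) mu(n+2) over n = 1..X
print(corr)   # empirically tiny compared with the trivial bound 10^6
```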

Remark 1 The Chowla conjecture resembles an assertion that, for {n} chosen randomly and uniformly from {1} to {x}, the random variables {\mu(n+1),\ldots,\mu(n+k)} become asymptotically independent of each other (in the probabilistic sense) as {x \rightarrow \infty}. However, this is not quite accurate, because some moments (namely those with all exponents {a_i} even) have the “wrong” asymptotic value, leading to some unwanted correlation between the two variables. For instance, the events {\mu(n)=0} and {\mu(n+4)=0} have a strong correlation with each other, basically because they are both strongly correlated with the event of {n} being divisible by {4}. A more accurate interpretation of the Chowla conjecture is that the random variables {\mu(n+1),\ldots,\mu(n+k)} are asymptotically conditionally independent of each other, after conditioning on the zero pattern {\mu(n+1)^2,\ldots,\mu(n+k)^2}; thus, it is the sign of the Möbius function that fluctuates like random noise, rather than the zero pattern. (The situation is a bit cleaner if one works instead with the Liouville function {\lambda} instead of the Möbius function {\mu}, as this function never vanishes, but we will stick to the traditional Möbius function formalism here.)

A more recent formulation of the Möbius randomness heuristic is the following conjecture of Sarnak. Given a bounded sequence {f: {\bf N} \rightarrow {\bf C}}, define the topological entropy of the sequence to be the least exponent {\sigma} with the property that, for any fixed {\epsilon > 0} and for {m} going to infinity, the set {\{ (f(n+1),\ldots,f(n+m)): n \in {\bf N} \} \subset {\bf C}^m} of {m}-blocks of {f} can be covered by {O( \exp( \sigma m + o(m) ) )} balls of radius {\epsilon}. (If {f} arises from a minimal topological dynamical system {(X,T)} and a point {x \in X} by {f(n) := F(T^n x)} for some continuous {F: X \rightarrow {\bf C}}, the above notion is equivalent to the usual notion of the topological entropy of a dynamical system.) For instance, if the sequence is a bit sequence (i.e. it takes values in {\{0,1\}}), then there are only {\exp(\sigma m + o(m))} {m}-bit patterns that can appear as blocks of {m} consecutive bits in this sequence. As a special case, a Turing machine with bounded memory that had access to a random number generator at the rate of one random bit produced every {T} units of time, but otherwise evolved deterministically, would have an output sequence with a topological entropy of at most {\frac{1}{T} \log 2}. A bounded sequence is said to be deterministic if its topological entropy is zero. A typical example is a polynomial sequence such as {f(n) := e^{2\pi i \alpha n^2}} for some fixed {\alpha}; the {m}-blocks of such a polynomial sequence have covering numbers that only grow polynomially in {m}, rather than exponentially, thus yielding zero entropy. Unipotent flows, such as the horocycle flow on a compact hyperbolic surface, are another good source of deterministic sequences.

Conjecture 2 (Sarnak’s conjecture) Let {f: {\bf N} \rightarrow {\bf C}} be a deterministic bounded sequence. Then

\displaystyle  \sum_{n \leq x} \mu(n) f(n) = o_{x \rightarrow \infty;f}(x).

This conjecture in general is still quite far from being solved. However, special cases are known:

  • For constant sequences, this is essentially the prime number theorem (1).
  • For periodic sequences, this is essentially the prime number theorem in arithmetic progressions.
  • For quasiperiodic sequences such as {f(n) = F(\alpha n \hbox{ mod } 1)} for some continuous {F}, this follows from the work of Davenport.
  • For nilsequences, this is a result of Ben Green and myself.
  • For horocycle flows, this is a result of Bourgain, Sarnak, and Ziegler.
  • For the Thue-Morse sequence, this is a result of Dartyge-Tenenbaum (with a stronger error term obtained by Mauduit-Rivat). A subsequent result of Bourgain handles all bounded rank one sequences (though the Thue-Morse sequence is actually of rank two), and a related result of Green establishes asymptotic orthogonality of the Möbius function to bounded depth circuits, although such functions are not necessarily deterministic in nature.
  • For the Rudin-Shapiro sequence, I sketched out an argument at this MathOverflow post.
  • The Möbius function is known to itself be non-deterministic, because its square {\mu^2(n)} (i.e. the indicator of the square-free numbers) is known to be non-deterministic (indeed, its topological entropy is {\frac{6}{\pi^2}\log 2}). (The corresponding question for the Liouville function {\lambda(n)}, however, remains open, as its square {\lambda^2(n)=1} has zero entropy.)
  • In the converse direction, it is easy to construct sequences of arbitrarily small positive entropy that correlate with the Möbius function (a rather silly example is {\mu(n) 1_{k|n}} for some fixed large (squarefree) {k}, which has topological entropy at most {\frac{\log 3}{k}}, since the nonzero entries are confined to multiples of {k} and take at most three values, but clearly correlates with {\mu}).

See this survey of Sarnak for further discussion of this and related topics.

In this post I wanted to give a very nice argument of Sarnak that links the above two conjectures:

Proposition 3 The Chowla conjecture implies the Sarnak conjecture.

The argument does not use any number-theoretic properties of the Möbius function; one could replace {\mu} in both conjectures by any other function from the natural numbers to {\{-1,0,+1\}} and obtain the same implication. The argument consists of the following ingredients:

  1. To show that {\sum_{n<x} \mu(n) f(n) = o(x)}, it suffices to show that the expectation of the random variable {\frac{1}{m} (\mu(n+1)f(n+1)+\ldots+\mu(n+m)f(n+m))}, where {n} is drawn uniformly at random from {1} to {x}, can be made arbitrarily small by making {m} large (and {n} even larger).
  2. By the union bound and the zero topological entropy of {f}, it suffices to show that for any bounded deterministic coefficients {c_1,\ldots,c_m}, the random variable {\frac{1}{m}(c_1 \mu(n+1) + \ldots + c_m \mu(n+m))} concentrates with exponentially high probability.
  3. Finally, this exponentially high concentration can be achieved by the moment method, using a slight variant of the moment method proof of large deviation estimates such as the Chernoff inequality or Hoeffding inequality (as discussed in this blog post, and illustrated numerically after this list).
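
Here is a toy numerical illustration of the concentration in step 3 (my own, assuming numpy, with iid random signs standing in for the Möbius values; in the actual argument it is the Chowla conjecture that supplies the needed moment bounds):

```python
import numpy as np

# Hoeffding-type concentration for (1/m) sum_i c_i eps_i with bounded coefficients
# c_i and iid signs eps_i: the tail probability decays exponentially in m.
rng = np.random.default_rng(0)
m, trials, thresh = 400, 100_000, 0.2

c = np.cos(np.arange(1, m + 1))              # some fixed coefficients with |c_i| <= 1
eps = rng.choice([-1, 1], size=(trials, m))  # iid random signs
averages = (eps * c).mean(axis=1)

empirical = np.mean(np.abs(averages) > thresh)
hoeffding = 2 * np.exp(-(thresh * m)**2 / (2 * np.sum(c**2)))
print(empirical)   # essentially zero at these parameters
print(hoeffding)   # Hoeffding bound: ~ 2e-07
```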

As is often the case, though, while the “top-down” order of steps presented above is perhaps the clearest way to think conceptually about the argument, in order to present the argument formally it is more convenient to present the arguments in the reverse (or “bottom-up”) order. This is the approach taken below the fold.

Read the rest of this entry »

There has been a lot of recent interest in the abc conjecture, since the release a few weeks ago of the last of a series of papers by Shinichi Mochizuki which, as one of its major applications, claims to establish this conjecture. It’s still far too early to judge whether this proof is likely to be correct or not (the proof encompasses several hundred pages of argument, mostly in the area of anabelian geometry, in which very few mathematicians are expert, to the extent that we still do not even have a full outline of the proof strategy yet), and I don’t have anything substantial to add to the existing discussion around that conjecture. (But, for those that are interested, the Polymath wiki page on the ABC conjecture has collected most of the links to that discussion, and to various background materials.)

In the meantime, though, I thought I might give the standard probabilistic heuristic argument that explains why we expect the ABC conjecture to be true. The underlying heuristic is a common one, used throughout number theory, and it can be summarised as follows:

Heuristic 1 (Probabilistic heuristic) Even though number theory is a deterministic subject (one does not need to roll any dice to factorise a number, or figure out if a number is prime), one expects to get a good asymptotic prediction for the answers to many number-theoretic questions by pretending that various number-theoretic assertions {E} (e.g. that a given number {n} is prime) are probabilistic events (with a probability {\mathop{\bf P}(E)} that can vary between {0} and {1}) rather than deterministic events (that are either always true or always false). Furthermore:

  • (Basic heuristic) If two or more of these heuristically probabilistic events have no obvious reason to be strongly correlated to each other, then we should expect them to behave as if they were (jointly) independent.
  • (Advanced heuristic) If two or more of these heuristically probabilistic events have some obvious correlation between them, but no further correlations are suspected, then we should expect them to behave as if they were conditionally independent, relative to whatever data is causing the correlation.

This is, of course, an extremely vague and completely non-rigorous heuristic, requiring (among other things) a subjective and ad hoc determination of what an “obvious reason” is, but in practice it tends to give remarkably plausible predictions, some fraction of which can in fact be backed up by rigorous argument (although in many cases, the actual argument has almost nothing in common with the probabilistic heuristic). A famous special case of this heuristic is the Cramér random model for the primes, but it is by no means the only instance of that heuristic.

To give the most precise predictions, one should use the advanced heuristic in Heuristic 1, but this can be somewhat complicated to execute, and so we shall focus instead on the predictions given by the basic heuristic (thus ignoring the presence of some number-theoretic correlations), which tends to give predictions that are quantitatively inaccurate but still reasonably good at the qualitative level.

Here is a basic “corollary” of Heuristic 1:

Heuristic 2 (Heuristic Borel-Cantelli) Suppose one has a sequence {E_1, E_2, \ldots} of number-theoretic statements, which we heuristically interpret as probabilistic events with probabilities {\mathop{\bf P}(E_1), \mathop{\bf P}(E_2), \ldots}. Suppose also that we know of no obvious reason for these events to have much of a correlation with each other. Then:

  • If {\sum_{i=1}^\infty \mathop{\bf P}(E_i) < \infty}, we expect only finitely many of the statements {E_n} to be true. (And if {\sum_{i=1}^\infty \mathop{\bf P}(E_i)} is much smaller than {1}, we in fact expect none of the {E_n} to be true.)
  • If {\sum_{i=1}^\infty \mathop{\bf P}(E_i) = \infty}, we expect infinitely many of the statements {E_n} to be true.

This heuristic is motivated both by the Borel-Cantelli lemma, and by the standard probabilistic computation that if one is given jointly independent, and genuinely probabilistic, events {E_1, E_2, \ldots} with {\sum_{i=1}^\infty \mathop{\bf P}(E_i) = \infty}, then one almost surely has an infinite number of the {E_i} occurring.

Before we get to the ABC conjecture, let us give two simpler (and well known) demonstrations of these heuristics in action:

Example 1 (Twin prime conjecture) One can heuristically justify the twin prime conjecture as follows. Using the prime number theorem, one can heuristically assign a probability of {1/\log n} to the event that any given large integer {n} is prime. In particular, the probability that {n+2} is prime will then be {1/\log(n+2)}. Making the assumption that there are no strong correlations between these events, we are led to the prediction that the probability that {n} and {n+2} are simultaneously prime is {\frac{1}{(\log n)(\log (n+2))}}. Since {\sum_{n=2}^\infty \frac{1}{(\log n) (\log (n+2))} = \infty}, the Borel-Cantelli heuristic then predicts that there should be infinitely many twin primes.

Note that the above argument is a bit too naive, because there are some non-trivial correlations between the primality of {n} and the primality of {n+2}. Most obviously, if {n} is prime, this greatly increases the probability that {n} is odd, which implies that {n+2} is odd, which then elevates the probability that {n+2} is prime. A bit more subtly, if {n} is prime, then {n} is likely to avoid the residue class {0 \hbox{ mod } 3}, which means that {n+2} avoids the residue class {2 \hbox{ mod } 3}, which ends up decreasing the probability that {n+2} is prime. However, there is a standard way to correct for these local correlations; see for instance in this previous blog post. As it turns out, these local correlations ultimately alter the prediction for the asymptotic density of twin primes by a constant factor (the twin prime constant), but do not affect the qualitative prediction of there being infinitely many twin primes.
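
These predictions are easy to test numerically; the following sketch (my own, assuming numpy) compares the actual twin prime count up to {10^6} with both the naive prediction {x/\log^2 x} and the corrected prediction {2\Pi_2 x/\log^2 x} (the latter is still an underestimate, mainly because {x/\log^2 x} undershoots the sharper integral {\int_2^x dt/\log^2 t}, but the constant-factor correction is clearly visible):

```python
import numpy as np

# Count twin primes p <= X with p + 2 prime, and compare with the heuristics.
X = 10**6
prime = np.ones(X + 3, dtype=bool)
prime[:2] = False
for p in range(2, int((X + 2)**0.5) + 1):
    if prime[p]:
        prime[p*p::p] = False

twins = int(np.sum(prime[:X+1] & prime[2:X+3]))
Pi2 = 0.6601618158                        # twin prime constant
print(twins)                              # actual count: 8169
print(X / np.log(X)**2)                   # naive heuristic: ~ 5239
print(2 * Pi2 * X / np.log(X)**2)         # corrected heuristic: ~ 6917
```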

Example 2 (Fermat’s last theorem) Let us now heuristically count the number of solutions to {x^n+y^n=z^n} for various {n} and natural numbers {x,y,z} (which we can reduce to be coprime if desired). We recast this (in the spirit of the ABC conjecture) as {a+b=c}, where {a,b,c} are {n^{th}} powers. The number of {n^{th}} powers up to any given number {N} is about {N^{1/n}}, so heuristically any given natural number {a} has a probability about {a^{1/n - 1}} of being an {n^{th}} power. If we make the naive assumption that (in the coprime case at least) there is no strong correlation between the events that {a} is an {n^{th}} power, {b} is an {n^{th}} power, and {a+b} being an {n^{th}} power, then for typical {a,b}, the probability that {a,b,a+b} are all simultaneously {n^{th}} powers would then be {a^{1/n-1} b^{1/n-1} (a+b)^{1/n-1}}. For fixed {n}, the total number of solutions to the Fermat equation would then be predicted to be

\displaystyle  \sum_{a=1}^\infty \sum_{b=1}^\infty a^{1/n-1} b^{1/n-1} (a+b)^{1/n-1}.

(Strictly speaking, we need to restrict to the coprime case, but given that a positive density of pairs of integers are coprime, it should not affect the qualitative conclusion significantly if we now omit this restriction.) It might not be immediately obvious as to whether this sum converges or diverges, but (as is often the case with these sorts of unsigned sums) one can clarify the situation by dyadic decomposition. Suppose for instance that we consider the portion of the sum where {c=a+b} lies between {2^k} and {2^{k+1}}. Then this portion of the sum can be controlled by

\displaystyle  \sum_{a \leq 2^{k+1}} \sum_{b \leq 2^{k+1}} a^{1/n-1} b^{1/n-1} O( ( 2^k )^{1/n - 1} )

which simplifies to

\displaystyle  O( 2^{(3/n - 1)k} ).

Summing in {k}, one thus expects infinitely many solutions for {n=2}, only finitely many solutions for {n>3} (indeed, a refinement of this argument shows that one expects only finitely many solutions even if one considers all {n>3} at once), and a borderline prediction of there being a barely infinite number of solutions when {n=3}. This is of course a place where a naive application of the probabilistic heuristic breaks down; there is enough arithmetic structure in the equation {x^3+y^3=z^3} that the naive probabilistic prediction ends up being an inaccurate model. Indeed, while this heuristic suggests that a typical homogeneous cubic should have a logarithmic number of integer solutions up to a given height {N}, it turns out that some homogeneous cubics (namely, those associated to elliptic curves of positive rank) end up with the bulk of these solutions, while other homogeneous cubics (including those associated to elliptic curves of zero rank, including the Fermat curve {x^3+y^3=z^3}) only get finitely many solutions. The reasons for this are subtle, but certainly the high degree of arithmetic structure present in an elliptic curve (starting with the elliptic curve group law which allows one to generate new solutions from old ones, and which also can be used to exclude solutions to {x^3+y^3=z^3} via the method of descent) is a major contributing factor.
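
One can also check the dyadic analysis above numerically (a quick computation of my own, assuming numpy): the dyadic block sums should behave like {2^{(3/n-1)k}}, growing geometrically for {n=2}, levelling off for {n=3}, and decaying for {n>3}:

```python
import numpy as np

def block(n, k):
    """Heuristic solution count from the dyadic block 2^k <= a + b < 2^{k+1}:
    sum of a^{1/n-1} b^{1/n-1} (a+b)^{1/n-1} over that range."""
    a = np.arange(1, 2**(k + 1), dtype=float)
    s = np.add.outer(a, a)                   # s[i, j] = a_i + a_j
    mask = (s >= 2**k) & (s < 2**(k + 1))
    w = np.outer(a**(1/n - 1), a**(1/n - 1)) * s**(1/n - 1)
    return float(np.sum(w[mask]))

for n in (2, 3, 4):
    print(n, [round(block(n, k), 3) for k in range(4, 10)])
# n = 2: blocks grow like 2^{k/2};  n = 3: roughly constant;  n = 4: blocks shrink.
```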

Below the fold, we apply similar heuristics to suggest the truth of the ABC conjecture.

Read the rest of this entry »
