
The determinant ${\det(A)}$ of a square matrix ${A}$ obeys a large number of important identities, the most basic of which is the multiplicativity property

$\displaystyle \det(AB) = \det(A) \det(B) \ \ \ \ \ (1)$

whenever ${A,B}$ are square matrices of the same dimension. This identity then generates many other important identities. For instance, if ${A}$ is an ${n \times m}$ matrix and ${B}$ is an ${m \times n}$ matrix, then by applying the previous identity to equate the determinants of ${\begin{pmatrix} 1 & -A \\ B & 1 \end{pmatrix} \begin{pmatrix} 1 & A \\ 0 & 1 \end{pmatrix}}$ and ${\begin{pmatrix} 1 & -A \\ B & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -B & 1 \end{pmatrix}}$ (where we will adopt the convention that ${1}$ denotes an identity matrix of whatever dimension is needed to make sense of the expressions being computed, and similarly for ${0}$), which evaluate to ${\begin{pmatrix} 1 & 0 \\ B & 1+BA \end{pmatrix}}$ and ${\begin{pmatrix} 1+AB & -A \\ 0 & 1 \end{pmatrix}}$ respectively, we obtain the Sylvester determinant identity

$\displaystyle \det( 1 + AB ) = \det( 1 + BA ). \ \ \ \ \ (2)$

This identity, which relates an ${n \times n}$ determinant with an ${m \times m}$ determinant, is very useful in random matrix theory (a point emphasised in particular by Deift), particularly in regimes in which ${m}$ is much smaller than ${n}$.
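
As a quick numerical sanity check of (2) (no substitute for the proof, of course), one can test the identity in a few lines of numpy; the dimensions and random seed below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
A = rng.standard_normal((n, m))   # n x m
B = rng.standard_normal((m, n))   # m x n

lhs = np.linalg.det(np.eye(n) + A @ B)   # an n x n determinant
rhs = np.linalg.det(np.eye(m) + B @ A)   # an m x m determinant
assert abs(lhs - rhs) < 1e-10
```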

Another identity generated from (1) arises when trying to compute the determinant of an ${(n+m) \times (n+m)}$ block matrix

$\displaystyle \begin{pmatrix} A & B \\ C & D \end{pmatrix}$

where ${A}$ is an ${n \times n}$ matrix, ${B}$ is an ${n \times m}$ matrix, ${C}$ is an ${m \times n}$ matrix, and ${D}$ is an ${m \times m}$ matrix. If ${A}$ is invertible, then we can manipulate this matrix via block Gaussian elimination as

$\displaystyle \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & A^{-1} B \\ C & D \end{pmatrix}$

$\displaystyle = \begin{pmatrix} A & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ C & 1 \end{pmatrix} \begin{pmatrix} 1 & A^{-1} B \\ 0 & D - C A^{-1} B \end{pmatrix}$

and on taking determinants using (1) we obtain the Schur determinant identity

$\displaystyle \det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A) \det( D - C A^{-1} B ) \ \ \ \ \ (3)$

relating the determinant of a block matrix with the determinant of the Schur complement ${D-C A^{-1} B}$ of the upper left block ${A}$. This identity can be viewed as the correct way to generalise the ${2 \times 2}$ determinant formula

$\displaystyle \det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad-bc = a ( d - c a^{-1} b).$
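
A similar numerical check of (3); the diagonal shift on ${A}$ below is an arbitrary device to keep it comfortably invertible:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
A = rng.standard_normal((n, n)) + 2 * n * np.eye(n)  # shift keeps A invertible
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m))

M = np.block([[A, B], [C, D]])
schur = D - C @ np.linalg.solve(A, B)    # Schur complement D - C A^{-1} B
assert np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(schur))
```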

It is also possible to use determinant identities to deduce other matrix identities that do not involve the determinant, by the technique of matrix differentiation (or equivalently, matrix linearisation). The key observation is that near the identity, the determinant behaves like the trace, or more precisely one has

$\displaystyle \det( 1 + \epsilon A ) = 1 + \epsilon \hbox{tr}(A) + O(\epsilon^2) \ \ \ \ \ (4)$

for any bounded square matrix ${A}$ and infinitesimal ${\epsilon}$. (If one is uncomfortable with infinitesimals, one can interpret this sort of identity as an asymptotic as ${\epsilon\rightarrow 0}$.) Combining this with (1) we see that for square matrices ${A,B}$ of the same dimension with ${A}$ invertible and ${A^{-1}, B}$ bounded, one has

$\displaystyle \det( A + \epsilon B ) = \det(A) \det(1 + \epsilon A^{-1} B )$

$\displaystyle = \det(A) (1 + \epsilon \hbox{tr}( A^{-1} B ) + O(\epsilon^2) )$

for infinitesimal ${\epsilon}$. To put it another way, if ${A(t)}$ is a square matrix that depends in a differentiable fashion on a real parameter ${t}$, then

$\displaystyle \frac{d}{dt} \det(A(t)) = \det(A(t)) \hbox{tr}( A(t)^{-1} \frac{d}{dt} A(t) )$

whenever ${A(t)}$ is invertible. (Note that if one combines this identity with cofactor expansion, one recovers Cramer’s rule.)
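
This derivative formula is also easy to test by finite differences; the matrix path ${A(t) = A_0 + t A_1}$ below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A0 = rng.standard_normal((n, n)) + n * np.eye(n)  # keep A(0) invertible
A1 = rng.standard_normal((n, n))
A = lambda t: A0 + t * A1                          # so A'(t) = A1

h = 1e-6
numeric = (np.linalg.det(A(h)) - np.linalg.det(A(-h))) / (2 * h)
exact = np.linalg.det(A(0.0)) * np.trace(np.linalg.solve(A(0.0), A1))
assert abs(numeric - exact) / abs(exact) < 1e-5
```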

Let us see some examples of this differentiation method. If we take the Sylvester identity (2) and multiply one of the rectangular matrices ${A}$ by an infinitesimal ${\epsilon}$, we obtain

$\displaystyle \det( 1 + \epsilon A B ) = \det( 1 + \epsilon B A);$

applying (4) and extracting the linear term in ${\epsilon}$ (or equivalently, differentiating in ${\epsilon}$ and then setting ${\epsilon=0}$) we conclude the cyclic property of trace:

$\displaystyle \hbox{tr}(AB) = \hbox{tr}(BA).$

To manipulate derivatives and inverses, we begin with the Neumann series approximation

$\displaystyle (1 + \epsilon A)^{-1} = 1 - \epsilon A + O(\epsilon^2)$

for bounded square ${A}$ and infinitesimal ${\epsilon}$, which then leads to the more general approximation

$\displaystyle (A + \epsilon B)^{-1} = (1 + \epsilon A^{-1} B)^{-1} A^{-1}$

$\displaystyle = A^{-1} - \epsilon A^{-1} B A^{-1} + O(\epsilon^2) \ \ \ \ \ (5)$

for square matrices ${A,B}$ of the same dimension with ${B, A^{-1}}$ bounded. To put it another way, we have

$\displaystyle \frac{d}{dt} A(t)^{-1} = -A(t)^{-1} (\frac{d}{dt} A(t)) A(t)^{-1}$

whenever ${A(t)}$ depends in a differentiable manner on ${t}$ and ${A(t)}$ is invertible.
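
A quick first-order check of (5), again with arbitrary illustrative data (and a diagonal shift to keep ${A}$ invertible):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, n))
eps = 1e-5

Ainv = np.linalg.inv(A)
exact = np.linalg.inv(A + eps * B)
approx = Ainv - eps * Ainv @ B @ Ainv
assert np.linalg.norm(exact - approx) < 1e-8   # discrepancy is O(eps^2)
```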

We can then differentiate (or linearise) the Schur identity (3) in a number of ways. For instance, if we replace the lower block ${D}$ by ${D + \epsilon H}$ for some test ${m \times m}$ matrix ${H}$, then by (4), the left-hand side of (3) becomes (assuming the invertibility of the block matrix)

$\displaystyle (\det \begin{pmatrix} A & B \\ C & D \end{pmatrix}) (1 + \epsilon \hbox{tr} \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} \begin{pmatrix} 0 & 0 \\ 0 & H \end{pmatrix} + O(\epsilon^2) )$

while the right-hand side becomes

$\displaystyle \det(A) \det(D-CA^{-1}B) (1 + \epsilon \hbox{tr}( (D-CA^{-1}B)^{-1} H ) + O(\epsilon^2) );$

extracting the linear term in ${\epsilon}$, we conclude that

$\displaystyle \hbox{tr} (\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} \begin{pmatrix} 0 & 0 \\ 0 & H \end{pmatrix}) = \hbox{tr}( (D-CA^{-1}B)^{-1} H ).$

As ${H}$ was an arbitrary ${m \times m}$ matrix, we conclude from duality that the lower right ${m \times m}$ block of ${\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}}$ is given by the inverse ${(D-CA^{-1}B)^{-1}}$ of the Schur complement:

$\displaystyle \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} ?? & ?? \\ ?? & (D-CA^{-1}B)^{-1} \end{pmatrix}.$
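
This block-inverse computation can also be confirmed numerically; the sizes, seed, and diagonal shifts below are arbitrary choices made to keep all the relevant matrices invertible:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 3
A = rng.standard_normal((n, n)) + 2 * n * np.eye(n)
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m)) + 2 * m * np.eye(m)

Minv = np.linalg.inv(np.block([[A, B], [C, D]]))
schur_inv = np.linalg.inv(D - C @ np.linalg.solve(A, B))
assert np.allclose(Minv[n:, n:], schur_inv)   # lower right block of inverse
```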

One can also compute the other components of this inverse in terms of the Schur complement ${D-CA^{-1} B}$ by a similar method (although the formulae become more complicated). As a variant of this method, we can perturb the block matrix in (3) by an infinitesimal multiple of the identity matrix giving

$\displaystyle \det \begin{pmatrix} A+\epsilon & B \\ C & D+\epsilon \end{pmatrix} = \det(A+\epsilon) \det( D +\epsilon - C (A+\epsilon)^{-1} B ). \ \ \ \ \ (6)$

By (4), the left-hand side is

$\displaystyle (\det \begin{pmatrix} A & B \\ C & D \end{pmatrix}) (1 + \epsilon \hbox{tr} \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} + O(\epsilon^2) ).$

From (5), we have

$\displaystyle D + \epsilon - C (A+ \epsilon)^{-1} B = D - C A^{-1} B + \epsilon(1 + C A^{-2} B) + O(\epsilon^2)$

and so from (4) the right-hand side of (6) is

$\displaystyle \det(A) \det(D-CA^{-1} B) \times$

$\displaystyle \times ( 1 + \epsilon (\hbox{tr}(A^{-1}) + \hbox{tr}( (D-CA^{-1} B)^{-1} (1 + C A^{-2} B)) ) + O(\epsilon^2) );$

extracting the linear component in ${\epsilon}$, we conclude the identity

$\displaystyle \hbox{tr} \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \hbox{tr}(A^{-1}) + \hbox{tr}( (D-CA^{-1} B)^{-1} (1 + C A^{-2} B)) \ \ \ \ \ (7)$

which relates the trace of the inverse of a block matrix, with the trace of the inverse of one of its blocks. This particular identity turns out to be useful in random matrix theory; I hope to elaborate on this in a later post.
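
Again, (7) is easy to check numerically on arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 3
A = rng.standard_normal((n, n)) + 2 * n * np.eye(n)
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m)) + 2 * m * np.eye(m)

Ainv = np.linalg.inv(A)
lhs = np.trace(np.linalg.inv(np.block([[A, B], [C, D]])))
schur_inv = np.linalg.inv(D - C @ Ainv @ B)
rhs = np.trace(Ainv) + np.trace(schur_inv @ (np.eye(m) + C @ Ainv @ Ainv @ B))
assert np.isclose(lhs, rhs)
```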

As a final example of this method, we can analyse low rank perturbations ${A+BC}$ of a large (${n \times n}$) matrix ${A}$, where ${B}$ is an ${n \times m}$ matrix and ${C}$ is an ${m \times n}$ matrix for some ${m < n}$. (This type of situation is also common in random matrix theory, for instance it arose in this previous paper of mine on outliers to the circular law.) If ${A}$ is invertible, then from (1) and (2) one has the matrix determinant lemma

$\displaystyle \det( A + BC ) = \det(A) \det( 1 + A^{-1} BC) = \det(A) \det(1 + CA^{-1} B);$

if one then perturbs ${A}$ by an infinitesimal matrix ${\epsilon H}$, we have

$\displaystyle \det( A + BC + \epsilon H ) = \det(A + \epsilon H ) \det(1 + C(A+\epsilon H)^{-1} B).$

Extracting the linear component in ${\epsilon}$ as before, one soon arrives at

$\displaystyle \hbox{tr}( (A+BC)^{-1} H ) = \hbox{tr}( A^{-1} H ) - \hbox{tr}( (1 + C A^{-1} B)^{-1} C A^{-1} H A^{-1} B )$

assuming that ${A}$ and ${A+BC}$ are both invertible; as ${H}$ is arbitrary, we conclude (after using the cyclic property of trace) the Sherman-Morrison formula

$\displaystyle (A+BC)^{-1} = A^{-1} - A^{-1} B (1 + C A^{-1} B)^{-1} C A^{-1}$

for the inverse of a low rank perturbation ${A+BC}$ of a matrix ${A}$. While this identity can be easily verified by direct algebraic computation, it is somewhat difficult to discover this identity by such algebraic manipulation; thus we see that the “determinant first” approach to matrix identities can make it easier to find appropriate matrix identities (particularly those involving traces and/or inverses), even if the identities one is ultimately interested in do not involve determinants. (As differentiation typically makes an identity lengthier, but also more “linear” or “additive”, the determinant identity tends to be shorter (albeit more nonlinear and more multiplicative) than the differentiated identity, and can thus be slightly easier to derive.)
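
Here is a numerical check of the Sherman-Morrison formula, with the usual arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 6, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.eye(m) + C @ Ainv @ B) @ C @ Ainv
assert np.allclose(lhs, rhs)
```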

Exercise 1 Use the “determinant first” approach to derive the Woodbury matrix identity (also known as the binomial inverse theorem)

$\displaystyle (A+BDC)^{-1} = A^{-1} - A^{-1} B (D^{-1} + CA^{-1} B)^{-1} C A^{-1}$

where ${A}$ is an ${n \times n}$ matrix, ${B}$ is an ${n \times m}$ matrix, ${C}$ is an ${m \times n}$ matrix, and ${D}$ is an ${m \times m}$ matrix, assuming that ${A}$, ${D}$ and ${A+BDC}$ are all invertible.
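
Though it is no substitute for the requested derivation, the Woodbury identity itself can be verified numerically along the same lines as before:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = rng.standard_normal((m, m)) + m * np.eye(m)

Ainv, Dinv = np.linalg.inv(A), np.linalg.inv(D)
lhs = np.linalg.inv(A + B @ D @ C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(Dinv + C @ Ainv @ B) @ C @ Ainv
assert np.allclose(lhs, rhs)
```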

Exercise 2 Let ${A,B}$ be invertible ${n \times n}$ matrices. Establish the identity

$\displaystyle \det(A + B) \det(A - B) = \det(B) \det( AB^{-1} A - B)$

and differentiate this in ${A}$ to deduce the identity

$\displaystyle (A+B)^{-1} + (A-B)^{-1} = 2 (A - BA^{-1} B)^{-1}$

(assuming that all inverses exist) and thence

$\displaystyle (A+B)^{-1} = (A - BA^{-1} B)^{-1} + (B - AB^{-1} A)^{-1}.$

Rotating ${B}$ by ${i}$ then gives

$\displaystyle (A+iB)^{-1} = (A + BA^{-1} B)^{-1} - i (B + AB^{-1} A)^{-1},$

which is useful for inverting a matrix ${A+iB}$ that has been split into a self-adjoint component ${A}$ and a skew-adjoint component ${iB}$.
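
All four identities in this exercise can also be checked numerically (a good way to catch stray signs); the diagonal shifts below are arbitrary choices made to keep the various matrices invertible:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
A = rng.standard_normal((n, n)) + 3 * n * np.eye(n)
B = rng.standard_normal((n, n)) + n * np.eye(n)
inv, det = np.linalg.inv, np.linalg.det

assert np.isclose(det(A + B) * det(A - B), det(B) * det(A @ inv(B) @ A - B))
assert np.allclose(inv(A + B) + inv(A - B), 2 * inv(A - B @ inv(A) @ B))
assert np.allclose(inv(A + B), inv(A - B @ inv(A) @ B) + inv(B - A @ inv(B) @ A))
assert np.allclose(inv(A + 1j * B),
                   inv(A + B @ inv(A) @ B) - 1j * inv(B + A @ inv(B) @ A))
```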

Mathematicians study a variety of different mathematical structures, but perhaps the structures that are most commonly associated with mathematics are the number systems, such as the integers ${{\bf Z}}$ or the real numbers ${{\bf R}}$. Indeed, the use of number systems is so closely identified with the practice of mathematics that one sometimes forgets that it is possible to do mathematics without explicit reference to any concept of number. For instance, the ancient Greeks were able to prove many theorems in Euclidean geometry, well before the development of Cartesian coordinates and analytic geometry in the seventeenth century, or the formal constructions or axiomatisations of the real number system that emerged in the nineteenth century (not to mention precursor concepts such as zero or negative numbers, whose very existence was highly controversial, if entertained at all, to the ancient Greeks). To do this, the Greeks used geometric operations as substitutes for the arithmetic operations that would be more familiar to modern mathematicians. For instance, concatenation of line segments or planar regions serves as a substitute for addition; the operation of forming a rectangle out of two line segments would serve as a substitute for multiplication; the concept of similarity can be used as a substitute for ratios or division; and so forth.

A similar situation exists in modern physics. Physical quantities such as length, mass, momentum, charge, and so forth are routinely measured and manipulated using the real number system ${{\bf R}}$ (or related systems, such as ${{\bf R}^3}$ if one wishes to measure a vector-valued physical quantity such as velocity). Much as analytic geometry allows one to use the laws of algebra and trigonometry to calculate and prove theorems in geometry, the identification of physical quantities with numbers allows one to express physical laws and relationships (such as Einstein’s famous mass-energy equivalence ${E=mc^2}$) as algebraic (or differential) equations, which can then be solved and otherwise manipulated through the extensive mathematical toolbox that has been developed over the centuries to deal with such equations.

However, as any student of physics is aware, most physical quantities are not represented purely by one or more numbers, but instead by a combination of a number and some sort of unit. For instance, it would be a category error to assert that the length of some object was a number such as ${10}$; instead, one has to say something like “the length of this object is ${10}$ yards”, combining both a number ${10}$ and a unit (in this case, the yard). Changing the unit leads to a change in the numerical value assigned to this physical quantity, even though no physical change to the object being measured has occurred. For instance, if one decides to use feet as the unit of length instead of yards, then the length of the object is now ${30}$ feet; if one instead uses metres, the length is now ${9.144}$ metres; and so forth. But nothing physical has changed when performing this change of units, and all of these lengths are considered equal to each other:

$\displaystyle 10 \hbox{ yards } = 30 \hbox{ feet } = 9.144 \hbox{ metres}.$

It is then common to declare that while physical quantities and units are not, strictly speaking, numbers, they should be manipulated using the laws of algebra as if they were numerical quantities. For instance, if an object travels ${10}$ metres in ${5}$ seconds, then its speed should be

$\displaystyle (10 m) / (5 s) = 2 ms^{-1}$

where we use the usual abbreviations of ${m}$ and ${s}$ for metres and seconds respectively. Similarly, if the speed of light ${c}$ is ${c=299 792 458 ms^{-1}}$ and an object has mass ${10 kg}$, then Einstein’s mass-energy equivalence ${E=mc^2}$ tells us that the energy-content of this object is

$\displaystyle (10 kg) (299 792 458 ms^{-1})^2 \approx 8.99 \times 10^{17} kg m^2 s^{-2}.$

Note that the symbols ${kg, m, s}$ are being manipulated algebraically as if they were mathematical variables such as ${x}$ and ${y}$. By collecting all these units together, we see that every physical quantity gets assigned a unit of a certain dimension: for instance, we see here that the energy ${E}$ of an object can be given the unit of ${kg m^2 s^{-2}}$ (more commonly known as a Joule), which has the dimension of ${M L^2 T^{-2}}$ where ${M, L, T}$ are the dimensions of mass, length, and time respectively.

There is however one important limitation to the ability to manipulate “dimensionful” quantities as if they were numbers: one is not supposed to add, subtract, or compare two physical quantities if they have different dimensions, although it is acceptable to multiply or divide two such quantities. For instance, if ${m}$ is a mass (having the units ${M}$) and ${v}$ is a speed (having the units ${LT^{-1}}$), then it is physically “legitimate” to form an expression such as ${\frac{1}{2} mv^2}$, but not an expression such as ${m+v}$ or ${m-v}$; in a similar spirit, statements such as ${m=v}$ or ${m\geq v}$ are physically meaningless. This combines well with the mathematical distinction between vector, scalar, and matrix quantities, which among other things prohibits one from adding together two such quantities if their vector or matrix type are different (e.g. one cannot add a scalar to a vector, or a vector to a matrix), and also places limitations on when two such quantities can be multiplied together. A related limitation, which is not always made explicit in physics texts, is that transcendental mathematical functions such as ${\sin}$ or ${\exp}$ should only be applied to arguments that are dimensionless; thus, for instance, if ${v}$ is a speed, then ${\hbox{arctanh}(v)}$ is not physically meaningful, but ${\hbox{arctanh}(v/c)}$ is (this particular quantity is known as the rapidity associated to this speed).
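
One way to see how such restrictions can be enforced mechanically is to attach a vector of dimension exponents to each quantity and only permit operations that respect those exponents. The following is a minimal toy sketch of such a scheme (the class and unit names are invented for illustration, and are not from any standard library):

```python
from fractions import Fraction

class Quantity:
    """A scalar tagged with exponents of (mass, length, time)."""
    def __init__(self, value, dim=(0, 0, 0)):
        self.value, self.dim = value, tuple(Fraction(d) for d in dim)
    def __add__(self, other):
        if self.dim != other.dim:       # the key prohibition: no m + v
            raise TypeError("cannot add quantities of different dimension")
        return Quantity(self.value + other.value, self.dim)
    def __mul__(self, other):           # multiplication adds exponents
        return Quantity(self.value * other.value,
                        tuple(a + b for a, b in zip(self.dim, other.dim)))
    def __truediv__(self, other):       # division subtracts exponents
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dim, other.dim)))

kg     = Quantity(1.0, (1, 0, 0))
metre  = Quantity(1.0, (0, 1, 0))
second = Quantity(1.0, (0, 0, 1))

c = Quantity(299792458.0) * metre / second
m = Quantity(10.0) * kg
E = m * c * c        # allowed: E has dimension M L^2 T^{-2}
# m + c              # would raise TypeError: dimensionally inconsistent
```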

These limitations may seem like a weakness in the mathematical modeling of physical quantities; one may think that one could get a more “powerful” mathematical framework if one were allowed to perform dimensionally inconsistent operations, such as add together a mass and a velocity, add together a vector and a scalar, exponentiate a length, etc. Certainly there is some precedent for this in mathematics; for instance, the formalism of Clifford algebras does in fact allow one to (among other things) add vectors with scalars, and in differential geometry it is quite common to formally apply transcendental functions (such as the exponential function) to a differential form (for instance, the Liouville measure ${\frac{1}{n!} \omega^n}$ of a symplectic manifold can be usefully thought of as a component of the exponential ${\exp(\omega)}$ of the symplectic form ${\omega}$).

However, there are several reasons why it is advantageous to retain the limitation to only perform dimensionally consistent operations. One is that of error correction: one can often catch (and correct for) errors in one’s calculations by discovering a dimensional inconsistency, and tracing it back to the first step where it occurs. Also, by performing dimensional analysis, one can often identify the form of a physical law before one has fully derived it. For instance, if one postulates the existence of a mass-energy relationship involving only the mass of an object ${m}$, the energy content ${E}$, and the speed of light ${c}$, dimensional analysis is already sufficient to deduce that the relationship must be of the form ${E = \alpha mc^2}$ for some dimensionless absolute constant ${\alpha}$; the only remaining task is then to work out the constant of proportionality ${\alpha}$, which requires physical arguments beyond that provided by dimensional analysis. (This is a simple instance of a more general application of dimensional analysis known as the Buckingham ${\pi}$ theorem.)
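
The dimensional-analysis step in this example is really just linear algebra on exponent vectors: postulating ${E = \alpha m^a c^b}$ and matching the exponents of ${M, L, T}$ on both sides gives a small linear system for ${a, b}$. A sketch of that computation:

```python
import numpy as np

# Columns are the (M, L, T) exponent vectors of m and c; the target is the
# exponent vector of E, namely M L^2 T^{-2}.
dims = np.array([[1, 0],     # mass exponents of (m, c)
                 [0, 1],     # length exponents
                 [0, -1]])   # time exponents
target = np.array([1, 2, -2])
exponents = np.linalg.lstsq(dims, target, rcond=None)[0]
print(exponents)   # [1. 2.], i.e. E is proportional to m c^2
```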

The use of units and dimensional analysis has certainly proven to be a very effective tool in physics. But one can pose the question of whether it has a properly grounded mathematical foundation, in order to settle any lingering unease about using such tools in physics, and also in order to rigorously develop such tools for purely mathematical purposes (such as analysing identities and inequalities in such fields of mathematics as harmonic analysis or partial differential equations).

The example of Euclidean geometry mentioned previously offers one possible approach to formalising the use of dimensions. For instance, one could model the length of a line segment not by a number, but rather by the equivalence class of all line segments congruent to the original line segment (cf. the Frege-Russell definition of a number). Similarly, the area of a planar region can be modeled not by a number, but by the equivalence class of all regions that are equidecomposable with the original region (one can, if one wishes, restrict attention here to measurable sets in order to avoid Banach-Tarski-type paradoxes, though that particular paradox actually only arises in three and higher dimensions). As mentioned before, it is then geometrically natural to multiply two lengths to form an area, by taking a rectangle whose line segments have the stated lengths, and using the area of that rectangle as a product. This geometric picture works well for units such as length and volume that have a spatial geometric interpretation, but it is less clear how to apply it for more general units. For instance, it does not seem geometrically natural (or, for that matter, conceptually helpful) to envision the equation ${E=mc^2}$ as the assertion that the energy ${E}$ is the volume of a rectangular box whose height is the mass ${m}$ and whose length and width is given by the speed of light ${c}$.

But there are at least two other ways to formalise dimensionful quantities in mathematics, which I will discuss below the fold. The first is a “parametric” model in which dimensionful objects are modeled as numbers (or vectors, matrices, etc.) depending on some base dimensional parameters (such as units of length, mass, and time, or perhaps a coordinate system for space or spacetime), and transforming according to some representation of a structure group that encodes the range of these parameters; this type of “coordinate-heavy” model is often used (either implicitly or explicitly) by physicists in order to efficiently perform calculations, particularly when manipulating vector or tensor-valued quantities. The second is an “abstract” model in which dimensionful objects now live in an abstract mathematical space (e.g. an abstract vector space), in which only a subset of the operations available to general-purpose number systems such as ${{\bf R}}$ or ${{\bf R}^3}$ are available, namely those operations which are “dimensionally consistent” or invariant (or more precisely, equivariant) with respect to the action of the underlying structure group. This sort of “coordinate-free” approach tends to be the one which is preferred by pure mathematicians, particularly in the various branches of modern geometry, in part because it can lead to greater conceptual clarity, as well as results of great generality; it is also close to the more informal practice of treating mathematical manipulations that do not preserve dimensional consistency as being physically meaningless.

Things are pretty quiet here during the holiday season, but one small thing I have been working on recently is a set of notes on special relativity that I will be working through in a few weeks with some bright high school students here at our local math circle.  I have only two hours to spend with this group, and it is unlikely that we will reach the end of the notes (in which I derive the famous mass-energy equivalence relation E=mc^2, largely following Einstein’s original derivation as discussed in this previous blog post); instead we will probably spend a fair chunk of time on related topics which do not actually require special relativity per se, such as spacetime diagrams, the Doppler shift effect, and an analysis of my airport puzzle.  This will be my first time doing something of this sort (in which I will be spending as much time interacting directly with the students as I would lecturing);  I’m not sure exactly how it will play out, being a little outside of my usual comfort zone of undergraduate and graduate teaching, but am looking forward to finding out how it goes.   (In particular, it may end up that the discussion deviates somewhat from my prepared notes.)

The material covered in my notes is certainly not new, but I ultimately decided that it was worth putting up here in case some readers here had any corrections or other feedback to contribute (which, as always, would be greatly appreciated).

[Dec 24 and then Jan 21: notes updated, in response to comments.]

Perhaps the most important structural result about general large dense graphs is the Szemerédi regularity lemma. Here is a standard formulation of that lemma:

Lemma 1 (Szemerédi regularity lemma) Let ${G = (V,E)}$ be a graph on ${n}$ vertices, and let ${\epsilon > 0}$. Then there exists a partition ${V = V_1 \cup \ldots \cup V_M}$ for some ${M \leq M(\epsilon)}$ with the property that for all but at most ${\epsilon M^2}$ of the pairs ${1 \leq i \leq j \leq M}$, the pair ${V_i, V_j}$ is ${\epsilon}$-regular in the sense that

$\displaystyle | d( A, B ) - d( V_i, V_j ) | \leq \epsilon$

whenever ${A \subset V_i, B \subset V_j}$ are such that ${|A| \geq \epsilon |V_i|}$ and ${|B| \geq \epsilon |V_j|}$, and ${d(A,B) := |\{ (a,b) \in A \times B: \{a,b\} \in E \}|/|A| |B|}$ is the edge density between ${A}$ and ${B}$. Furthermore, the partition is equitable in the sense that ${||V_i| - |V_j|| \leq 1}$ for all ${1 \leq i \leq j \leq M}$.

There are many proofs of this lemma, which is actually not that difficult to establish; see for instance these previous blog posts for some examples. In this post I would like to record one further proof, based on the spectral decomposition of the adjacency matrix of ${G}$, which is essentially due to Frieze and Kannan. (Strictly speaking, Frieze and Kannan used a variant of this argument to establish a weaker form of the regularity lemma, but it is not difficult to modify the Frieze-Kannan argument to obtain the usual form of the regularity lemma instead. Some closely related spectral regularity lemmas were also developed by Szegedy.) I found recently (while speaking at the Abel conference in honour of this year’s laureate, Endre Szemerédi) that this particular argument is not as widely known among graph theory experts as I had thought, so I thought I would record it here.

For reasons of exposition, it is convenient to first establish a slightly weaker form of the lemma, in which one drops the hypothesis of equitability (but then has to weight the cells ${V_i}$ by their magnitude when counting bad pairs):

Lemma 2 (Szemerédi regularity lemma, weakened variant) Let ${G = (V,E)}$ be a graph on ${n}$ vertices, and let ${\epsilon > 0}$. Then there exists a partition ${V = V_1 \cup \ldots \cup V_M}$ for some ${M \leq M(\epsilon)}$ with the property that for all pairs ${(i,j) \in \{1,\ldots,M\}^2}$ outside of an exceptional set ${\Sigma}$, one has

$\displaystyle | E(A,B) - d_{ij} |A| |B| | \ll \epsilon |V_i| |V_j| \ \ \ \ \ (1)$

whenever ${A \subset V_i, B \subset V_j}$, for some real number ${d_{ij}}$, where ${E(A,B) := |\{ (a,b) \in A \times B: \{a,b\} \in E \}|}$ is the number of edges between ${A}$ and ${B}$. Furthermore, we have

$\displaystyle \sum_{(i,j) \in \Sigma} |V_i| |V_j| \ll \epsilon |V|^2. \ \ \ \ \ (2)$

Let us now prove Lemma 2. We enumerate ${V}$ (after relabeling) as ${V = \{1,\ldots,n\}}$. The adjacency matrix ${T}$ of the graph ${G}$ is then a self-adjoint ${n \times n}$ matrix, and thus admits an eigenvalue decomposition

$\displaystyle T = \sum_{i=1}^n \lambda_i u_i u_i^*$

for some orthonormal basis ${u_1,\ldots,u_n}$ of ${{\bf C}^n}$ and some eigenvalues ${\lambda_1,\ldots,\lambda_n \in {\bf R}}$, which we arrange in decreasing order of magnitude:

$\displaystyle |\lambda_1| \geq \ldots \geq |\lambda_n|.$

We can compute the trace of ${T^2}$ as

$\displaystyle \hbox{tr}(T^2) = \sum_{i=1}^n |\lambda_i|^2.$

But we also have ${\hbox{tr}(T^2) = 2|E| \leq n^2}$, so

$\displaystyle \sum_{i=1}^n |\lambda_i|^2 \leq n^2. \ \ \ \ \ (3)$

Among other things, this implies that

$\displaystyle |\lambda_i| \leq \frac{n}{\sqrt{i}} \ \ \ \ \ (4)$

for all ${i \geq 1}$.

Let ${F: {\bf N} \rightarrow {\bf N}}$ be a function (depending on ${\epsilon}$) to be chosen later, with ${F(i) \geq i}$ for all ${i}$. Applying (3) and the pigeonhole principle (or the finite convergence principle, see this blog post), we can find ${J \leq C(F,\epsilon)}$ such that

$\displaystyle \sum_{J \leq i < F(J)} |\lambda_i|^2 \leq \epsilon^3 n^2.$

(Indeed, the bound on ${J}$ is basically ${F}$ iterated ${1/\epsilon^3}$ times.) We can now split

$\displaystyle T = T_1 + T_2 + T_3, \ \ \ \ \ (5)$

where ${T_1}$ is the “structured” component

$\displaystyle T_1 := \sum_{i < J} \lambda_i u_i u_i^*, \ \ \ \ \ (6)$

${T_2}$ is the “small” component

$\displaystyle T_2 := \sum_{J \leq i < F(J)} \lambda_i u_i u_i^*, \ \ \ \ \ (7)$

and ${T_3}$ is the “pseudorandom” component

$\displaystyle T_3 := \sum_{i \geq F(J)} \lambda_i u_i u_i^*. \ \ \ \ \ (8)$
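
In concrete terms, this three-way spectral splitting is easy to carry out for any given graph; the sketch below does so for a random graph, with ${J}$ and ${F(J)}$ replaced by arbitrary illustrative values rather than the quantities produced by the pigeonhole argument:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
T = (rng.random((n, n)) < 0.5).astype(float)
T = np.triu(T, 1); T = T + T.T              # adjacency matrix of G(n, 1/2)

lam, U = np.linalg.eigh(T)                  # T = U diag(lam) U^T
order = np.argsort(-np.abs(lam))            # decreasing order of |lambda_i|
lam, U = lam[order], U[:, order]

J, FJ = 5, 50                               # illustrative choices only
def part(sl):                               # sum of lam_i u_i u_i^* over sl
    return (U[:, sl] * lam[sl]) @ U[:, sl].T

T1, T2, T3 = part(slice(0, J)), part(slice(J, FJ)), part(slice(FJ, n))
assert np.allclose(T, T1 + T2 + T3)         # the decomposition (5)
print(np.linalg.norm(T3, 2), n / np.sqrt(FJ))   # operator norm vs bound (4)
```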

We now design a vertex partition to make ${T_1}$ approximately constant on most cells. For each ${i < J}$, we partition ${V}$ into ${O_{J,\epsilon}(1)}$ cells on which ${u_i}$ (viewed as a function from ${V}$ to ${{\bf C}}$) only fluctuates by ${O(\epsilon n^{-1/2} /J)}$, plus an exceptional cell of size ${O( \frac{\epsilon}{J} |V|)}$ coming from the values where ${|u_i|}$ is excessively large (larger than ${\sqrt{\frac{J}{\epsilon}} n^{-1/2}}$). Combining all these partitions together, we can write ${V = V_1 \cup \ldots \cup V_{M-1} \cup V_M}$ for some ${M = O_{J,\epsilon}(1)}$, where ${V_M}$ has cardinality at most ${\epsilon |V|}$, and for all ${1 \leq i \leq M-1}$, the eigenfunctions ${u_1,\ldots,u_{J-1}}$ all fluctuate by at most ${O(\epsilon n^{-1/2}/J)}$ on ${V_i}$. In particular, if ${1 \leq i,j \leq M-1}$, then (by (4) and (6)) the entries of ${T_1}$ fluctuate by at most ${O(\epsilon)}$ on each block ${V_i \times V_j}$. If we let ${d_{ij}}$ be the mean value of these entries on ${V_i \times V_j}$, we thus have

$\displaystyle 1_B^* T_1 1_A = d_{ij} |A| |B| + O( \epsilon |V_i| |V_j| ) \ \ \ \ \ (9)$

for any ${1 \leq i,j \leq M-1}$ and ${A \subset V_i, B \subset V_j}$, where we view the indicator functions ${1_A, 1_B}$ as column vectors of dimension ${n}$.

Next, we observe from (3) and (7) that ${\hbox{tr} T_2^2 \leq \epsilon^3 n^2}$. If we let ${x_{ab}}$ be the coefficients of ${T_2}$, we thus have

$\displaystyle \sum_{a,b \in V} |x_{ab}|^2 \leq \epsilon^3 n^2$

and hence by Markov’s inequality we have

$\displaystyle \sum_{a \in V_i} \sum_{b \in V_j} |x_{ab}|^2 \leq \epsilon^2 |V_i| |V_j| \ \ \ \ \ (10)$

for all pairs ${(i,j) \in \{1,\ldots,M-1\}^2}$ outside of an exceptional set ${\Sigma_1}$ with

$\displaystyle \sum_{(i,j) \in \Sigma_1} |V_i| |V_j| \leq \epsilon |V|^2.$

If ${(i,j) \in \{1,\ldots,M-1\}^2}$ avoids ${\Sigma_1}$, we thus have

$\displaystyle 1_B^* T_2 1_A = O( \epsilon |V_i| |V_j| ) \ \ \ \ \ (11)$

for any ${A \subset V_i, B \subset V_j}$, by (10) and the Cauchy-Schwarz inequality.

Finally, to control ${T_3}$ we see from (4) and (8) that ${T_3}$ has an operator norm of at most ${n/F(J)^{1/2}}$. In particular, we have from the Cauchy-Schwarz inequality that

$\displaystyle 1_B^* T_3 1_A = O( n^2 / F(J)^{1/2} ) \ \ \ \ \ (12)$

for any ${A, B \subset V}$.

Let ${\Sigma}$ be the set of all pairs ${(i,j) \in \{1,\ldots,M\}^2}$ where either ${(i,j) \in \Sigma_1}$, ${i = M}$, ${j=M}$, or

$\displaystyle \min(|V_i|, |V_j|) \leq \frac{\epsilon}{M} n.$

One easily verifies that (2) holds. If ${(i,j) \in \{1,\ldots,M\}^2}$ is not in ${\Sigma}$, then by summing (9), (11), (12) and using (5), we see that

$\displaystyle 1_B^* T 1_A = d_{ij} |A| |B| + O( \epsilon |V_i| |V_j| ) + O( n^2 / F(J)^{1/2} ) \ \ \ \ \ (13)$

for all ${A \subset V_i, B \subset V_j}$. The left-hand side is just ${E(A,B)}$. As ${(i,j) \not \in \Sigma}$, we have

$\displaystyle |V_i|, |V_j| > \frac{\epsilon}{M} n$

and so (since ${M = O_{J,\epsilon}(1)}$)

$\displaystyle n^2 / F(J)^{1/2} \ll_{J,\epsilon} |V_i| |V_j| / F(J)^{1/2}.$

If we let ${F}$ be a sufficiently rapidly growing function of ${J}$ that depends on ${\epsilon}$, the second error term in (13) can be absorbed in the first, and (1) follows. This concludes the proof of Lemma 2.

To prove Lemma 1, one argues similarly (after modifying ${\epsilon}$ as necessary), except that the initial partition ${V_1,\ldots,V_M}$ of ${V}$ constructed above needs to be subdivided further into equitable components (of size ${\epsilon |V|/M+O(1)}$), plus some remainder sets which can be aggregated into an exceptional component of size ${O( \epsilon |V| )}$ (and which can then be redistributed amongst the other components to arrive at a truly equitable partition). We omit the details.

Remark 1 It is easy to verify that ${F}$ needs to grow exponentially in ${J}$ in order for the above argument to work, which leads to tower-exponential bounds in the number of cells ${M}$ in the partition. It was shown by Gowers that a tower-exponential bound is actually necessary here. By varying ${F}$, one basically obtains the strong regularity lemma first established by Alon, Fischer, Krivelevich, and Szegedy; in the opposite direction, setting ${F(J) := J}$ essentially gives the weak regularity lemma of Frieze and Kannan.

Remark 2 If we specialise to a Cayley graph, in which ${V = (V,+)}$ is a finite abelian group and ${E = \{ (a,b): a-b \in A \}}$ for some (symmetric) subset ${A}$ of ${V}$, then the eigenvectors are characters, and one essentially recovers the arithmetic regularity lemma of Green, in which the vertex partition classes ${V_i}$ are given by Bohr sets (and one can then place additional regularity properties on these Bohr sets with some additional arguments). The components ${T_1, T_2, T_3}$ of ${T}$, representing high, medium, and low eigenvalues of ${T}$, then become a decomposition associated to high, medium, and low Fourier coefficients of ${A}$.

Remark 3 The use of spectral theory here is parallel to the use of Fourier analysis to establish results such as Roth’s theorem on arithmetic progressions of length three. In analogy with this, one could view hypergraph regularity as being a sort of “higher order spectral theory”, although this spectral perspective is not as convenient as it is in the graph case.

Given a function ${f: X \rightarrow Y}$ between two sets ${X, Y}$, we can form the graph

$\displaystyle \Sigma := \{ (x,f(x)): x\in X \},$

which is a subset of the Cartesian product ${X \times Y}$.

There are a number of “closed graph theorems” in mathematics which relate the regularity properties of the function ${f}$ with the closure properties of the graph ${\Sigma}$, assuming some “completeness” properties of the domain ${X}$ and range ${Y}$. The most famous of these is the closed graph theorem from functional analysis, which I phrase as follows:

Theorem 1 (Closed graph theorem (functional analysis)) Let ${X, Y}$ be complete normed vector spaces over the reals (i.e. Banach spaces). Then a function ${f: X \rightarrow Y}$ is a continuous linear transformation if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is both linearly closed (i.e. it is a linear subspace of ${X \times Y}$) and topologically closed (i.e. closed in the product topology of ${X \times Y}$).

I like to think of this theorem as linking together qualitative and quantitative notions of regularity preservation properties of an operator ${f}$; see this blog post for further discussion.

The theorem is equivalent to the assertion that any continuous linear bijection ${f: X \rightarrow Y}$ from one Banach space to another is necessarily an isomorphism in the sense that the inverse map is also continuous and linear. Indeed, to see that this claim implies the closed graph theorem, one applies it to the projection from ${\Sigma}$ to ${X}$, which is a continuous linear bijection; conversely, to deduce this claim from the closed graph theorem, observe that the graph of the inverse ${f^{-1}}$ is the reflection of the graph of ${f}$. As such, the closed graph theorem is a corollary of the open mapping theorem, which asserts that any continuous linear surjection from one Banach space to another is open. (Conversely, one can deduce the open mapping theorem from the closed graph theorem by quotienting out the kernel of the continuous surjection to get a bijection.)

It turns out that there is a closed graph theorem (or equivalent reformulations of that theorem, such as an assertion that bijective morphisms between sufficiently “complete” objects are necessarily isomorphisms, or as an open mapping theorem) in many other categories in mathematics as well. Here are some easy ones:

Theorem 2 (Closed graph theorem (linear algebra)) Let ${X, Y}$ be vector spaces over a field ${k}$. Then a function ${f: X \rightarrow Y}$ is a linear transformation if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is linearly closed.

Theorem 3 (Closed graph theorem (group theory)) Let ${X, Y}$ be groups. Then a function ${f: X \rightarrow Y}$ is a group homomorphism if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is closed under the group operations (i.e. it is a subgroup of ${X \times Y}$).

Theorem 4 (Closed graph theorem (order theory)) Let ${X, Y}$ be totally ordered sets. Then a function ${f: X \rightarrow Y}$ is monotone increasing if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is totally ordered (using the product order on ${X \times Y}$).

Remark 1 Similar results to the above three theorems (with similarly easy proofs) hold for other algebraic structures, such as rings (using the usual product of rings), modules, algebras, Lie algebras, groupoids, or even categories (a map between categories is a functor iff its graph is again a category). (ADDED IN VIEW OF COMMENTS: further examples include affine spaces and ${G}$-sets (sets with an action of a given group ${G}$).) There are also various approximate versions of this theorem that are useful in arithmetic combinatorics, that relate the property of a map ${f}$ being an “approximate homomorphism” in some sense with its graph being an “approximate group” in some sense. This is particularly useful for this subfield of mathematics because there are currently more theorems about approximate groups than about approximate homomorphisms, so that one can profitably use closed graph theorems to transfer results about the former to results about the latter.

A slightly more sophisticated result in the same vein:

Theorem 5 (Closed graph theorem (point set topology)) Let ${X, Y}$ be compact Hausdorff spaces. Then a function ${f: X \rightarrow Y}$ is continuous if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is topologically closed.

Indeed, the “only if” direction is easy, while for the “if” direction, note that if ${\Sigma}$ is a closed subset of ${X \times Y}$, then it is compact Hausdorff, and the projection map from ${\Sigma}$ to ${X}$ is then a bijective continuous map between compact Hausdorff spaces, which is then closed, thus open, and hence a homeomorphism, giving the claim.

Note that the compactness hypothesis is necessary: for instance, the function ${f: {\bf R} \rightarrow {\bf R}}$ defined by ${f(x) := 1/x}$ for ${x \neq 0}$ and ${f(0):=0}$ is a function which has a closed graph, but is discontinuous.

A similar result (but relying on a much deeper theorem) is available in algebraic geometry, as I learned after asking this MathOverflow question:

Theorem 6 (Closed graph theorem (algebraic geometry)) Let ${X, Y}$ be normal projective varieties over an algebraically closed field ${k}$ of characteristic zero. Then a function ${f: X \rightarrow Y}$ is a regular map if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is Zariski-closed.

Proof: (Sketch) For the only if direction, note that the map ${x \mapsto (x,f(x))}$ is a regular map from the projective variety ${X}$ to the projective variety ${X \times Y}$ and is thus a projective morphism, hence is proper. In particular, the image ${\Sigma}$ of ${X}$ under this map is Zariski-closed.

Conversely, if ${\Sigma}$ is Zariski-closed, then it is also a projective variety, and the projection ${(x,y) \mapsto x}$ is a projective morphism from ${\Sigma}$ to ${X}$, which is clearly quasi-finite; by the characteristic zero hypothesis, it is also separated. Applying (Grothendieck’s form of) Zariski’s main theorem, this projection is the composition of an open immersion and a finite map. As projective varieties are complete, the open immersion is an isomorphism, and so the projection from ${\Sigma}$ to ${X}$ is finite. Being injective and separable, the degree of this finite map must be one, and hence ${k(\Sigma)}$ and ${k(X)}$ are isomorphic, hence (by normality of ${X}$) ${k[\Sigma]}$ is contained in (the image of) ${k[X]}$, which makes the map from ${X}$ to ${\Sigma}$ regular, which makes ${f}$ regular. $\Box$

The counterexample of the map ${f: k \rightarrow k}$ given by ${f(x) := 1/x}$ for ${x \neq 0}$ and ${f(0) := 0}$ demonstrates why the projective hypothesis is necessary. The necessity of the normality condition (or more precisely, a weak normality condition) is demonstrated by (the projective version of) the map ${(t^2,t^3) \mapsto t}$ from the cuspidal curve ${\{ (t^2,t^3): t \in k \}}$ to ${k}$. (If one restricts attention to smooth varieties, though, normality becomes automatic.) The necessity of characteristic zero is demonstrated by (the projective version of) the inverse of the Frobenius map ${x \mapsto x^p}$ on a field ${k}$ of characteristic ${p}$.

There are also a number of closed graph theorems for topological groups, of which the following is typical (see Exercise 3 of these previous blog notes):

Theorem 7 (Closed graph theorem (topological group theory)) Let ${X, Y}$ be ${\sigma}$-compact, locally compact Hausdorff groups. Then a function ${f: X \rightarrow Y}$ is a continuous homomorphism if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is both group-theoretically closed and topologically closed.

The hypotheses of being ${\sigma}$-compact, locally compact, and Hausdorff can be relaxed somewhat, but I doubt that they can be eliminated entirely (though I do not have a ready counterexample for this).

In several complex variables, it is a classical theorem (see e.g. Lemma 4 of this blog post) that a holomorphic function from a domain in ${{\bf C}^n}$ to ${{\bf C}^n}$ is locally injective if and only if it is a local diffeomorphism (i.e. its derivative is everywhere non-singular). This leads to a closed graph theorem for complex manifolds:

Theorem 8 (Closed graph theorem (complex manifolds)) Let ${X, Y}$ be complex manifolds. Then a function ${f: X \rightarrow Y}$ is holomorphic if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is a complex manifold (using the complex structure inherited from ${X \times Y}$) of the same dimension as ${X}$.

Indeed, one applies the previous observation to the projection from ${\Sigma}$ to ${X}$. The dimension requirement is needed, as can be seen from the example of the map ${f: {\bf C} \rightarrow {\bf C}}$ defined by ${f(z) =1/z}$ for ${z \neq 0}$ and ${f(0)=0}$.

(ADDED LATER:) There is a real analogue to the above theorem:

Theorem 9 (Closed graph theorem (real manifolds)) Let ${X, Y}$ be real manifolds. Then a function ${f: X \rightarrow Y}$ is continuous if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is a real manifold of the same dimension as ${X}$.

This theorem can be proven by applying invariance of domain (discussed in this previous post) to the projection of ${\Sigma}$ to ${X}$, to show that it is open if ${\Sigma}$ has the same dimension as ${X}$.

Note though that the analogous claim for smooth real manifolds fails: the function ${f: {\bf R} \rightarrow {\bf R}}$ defined by ${f(x) := x^{1/3}}$ has a smooth graph, but is not itself smooth.

(ADDED YET LATER:) Here is an easy closed graph theorem in the symplectic category:

Theorem 10 (Closed graph theorem (symplectic geometry)) Let ${X = (X,\omega_X)}$ and ${Y = (Y,\omega_Y)}$ be smooth symplectic manifolds of the same dimension. Then a smooth map ${f: X \rightarrow Y}$ is a symplectic morphism (i.e. ${f^* \omega_Y = \omega_X}$) if and only if the graph ${\Sigma := \{(x,f(x)): x \in X \}}$ is a Lagrangian submanifold of ${X \times Y}$ with the symplectic form ${\omega_X \oplus -\omega_Y}$.

In view of the symplectic rigidity phenomenon, it is likely that the smoothness hypotheses on ${f,X,Y}$ can be relaxed substantially, but I will not try to formulate such a result here.

There are presumably many further examples of closed graph theorems (or closely related theorems, such as criteria for inverting a morphism, or open mapping type theorems) throughout mathematics; I would be interested to know of further examples.


Let ${n}$ be a large natural number, and let ${M_n}$ be a matrix drawn from the Gaussian Unitary Ensemble (GUE), by which we mean that ${M_n}$ is a Hermitian matrix whose upper triangular entries are iid complex gaussians with mean zero and variance one, and whose diagonal entries are iid real gaussians with mean zero and variance one (and independent of the upper triangular entries). The eigenvalues ${\lambda_1(M_n) \leq \ldots \leq \lambda_n(M_n)}$ are then real and almost surely distinct, and can be viewed as a random point process ${\Sigma^{(n)} := \{\lambda_1(M_n),\ldots,\lambda_n(M_n)\}}$ on the real line. One can then form the ${k}$-point correlation functions ${\rho_k^{(n)}: {\bf R}^k \rightarrow {\bf R}^+}$ for every ${k \geq 0}$, which can be defined by duality by requiring

$\displaystyle \mathop{\bf E} \sum_{i_1,\ldots,i_k \hbox{ distinct}} F( \lambda_{i_1}(M_n),\ldots,\lambda_{i_k}(M_n))$

$\displaystyle = \int_{{\bf R}^k} \rho_k^{(n)}(x_1,\ldots,x_k) F(x_1,\ldots,x_k)\ dx_1 \ldots dx_k$

for any test function ${F: {\bf R}^k \rightarrow {\bf R}^+}$. For GUE, which is a continuous matrix ensemble, one can also define ${\rho_k^{(n)}(x_1,\ldots,x_k)}$ for distinct ${x_1<\ldots<x_k}$ as the unique quantity such that the probability that there is an eigenvalue in each of the intervals ${[x_1,x_1+\epsilon],\ldots,[x_k,x_k+\epsilon]}$ is ${(\rho_k^{(n)}(x_1,\ldots,x_k)+o(1))\epsilon^k}$ in the limit ${\epsilon\rightarrow 0}$.

As is well known, the GUE process is a determinantal point process, which means that ${k}$-point correlation functions can be explicitly computed as

$\displaystyle \rho^{(n)}_k(x_1,\ldots,x_k) = \det( K^{(n)}(x_i,x_j) )_{1 \leq i,j \leq k}$

for some kernel ${K^{(n)}: {\bf R} \times {\bf R} \rightarrow {\bf C}}$; explicitly, one has

$\displaystyle K^{(n)}(x,y) := \sum_{k=0}^{n-1} P_k(x) e^{-x^2/4}P_k(y) e^{-y^2/4}$

where ${P_0, P_1,\ldots}$ are the (normalised) Hermite polynomials; see this previous blog post for details.

Using the asymptotics of Hermite polynomials (which then give asymptotics for the kernel ${K^{(n)}}$), one can take a limit of a (suitably rescaled) sequence of GUE processes to obtain the Dyson sine process, which is a determinantal point process ${\Sigma}$ on the real line with correlation functions

$\displaystyle \rho_k(x_1,\ldots,x_k) = \det( K(x_i,x_j) )_{1 \leq i,j \leq k} \ \ \ \ \ (1)$

where ${K}$ is the Dyson sine kernel

$\displaystyle K(x,y) := \frac{\sin(\pi(x-y))}{\pi(x-y)}. \ \ \ \ \ (2)$

A bit more precisely, for any fixed bulk energy ${-2 < u < 2}$, the renormalised point processes ${\rho_{sc}(u) \sqrt{n} ( \Sigma^{(n)} - \sqrt{n} u )}$ converge in distribution in the vague topology to ${\Sigma}$ as ${n \rightarrow \infty}$, where ${\rho_{sc}(u) := \frac{1}{2\pi} (4-u^2)^{1/2}_+}$ is the semi-circular law density.
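
For readers who like to see such limits numerically, one can sample a GUE matrix with the normalisation of this post and compare the empirical spectrum of ${M_n/\sqrt{n}}$ with the semicircular law; the matrix size and binning below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 1000
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
M = np.triu(G, 1)                      # iid complex gaussians, variance one
M = M + M.conj().T + np.diag(rng.standard_normal(n))   # real gaussian diagonal

eigs = np.linalg.eigvalsh(M) / np.sqrt(n)
hist, edges = np.histogram(eigs, bins=20, range=(-2, 2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
semicircle = np.sqrt(np.clip(4 - centers**2, 0, None)) / (2 * np.pi)
print(np.max(np.abs(hist - semicircle)))   # small for large n
```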

On the other hand, an important feature of the GUE process ${\Sigma^{(n)} = \{\lambda_1,\ldots,\lambda_n\}}$ is its stationarity (modulo rescaling) under Dyson Brownian motion

$\displaystyle d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i-\lambda_j}$

which describes the stochastic evolution of eigenvalues of a Hermitian matrix under independent Brownian motion of its entries, and is discussed in this previous blog post. To cut a long story short, this stationarity tells us that the self-similar ${n}$-point correlation function

$\displaystyle \rho^{(n)}_n(t,x) := t^{-n/2} \rho^{(n)}_n(x/\sqrt{t})$

obeys the Dyson heat equation

$\displaystyle \partial_t \rho^{(n)}_n = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \leq i,j \leq n;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i-x_j}$

(see Exercise 11 of the previously mentioned blog post). Note that ${\rho^{(n)}_n}$ vanishes to second order whenever two of the ${x_i}$ coincide, so there is no singularity on the right-hand side. Setting ${t=1}$ and using self-similarity, we can rewrite this equation in time-independent form as

$\displaystyle -\frac{1}{2} \sum_{i=1}^n \partial_i (x_i \rho^{(n)}_n) = \frac{1}{2} \sum_{i=1}^n \partial_{x_i}^2 \rho^{(n)}_n - \sum_{1 \leq i,j \leq n;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_n}{x_i-x_j}.$

One can then integrate out all but ${k}$ of these variables (after carefully justifying convergence) to obtain a system of equations for the ${k}$-point correlation functions ${\rho^{(n)}_k}$:

$\displaystyle -\frac{1}{2} \sum_{i=1}^k \partial_i (x_i \rho^{(n)}_k) = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho^{(n)}_k - \sum_{1 \leq i,j \leq k;i\neq j} \partial_{x_i} \frac{\rho^{(n)}_k}{x_i-x_j}$

$\displaystyle - \sum_{i=1}^k \partial_{x_i} \int_{\bf R} \frac{\rho^{(n)}_{k+1}(x_1,\ldots,x_{k+1})}{x_i-x_{k+1}}\ dx_{k+1},$

where the integral is interpreted in the principal value sense. This system is an example of a BBGKY hierarchy.

If one carefully rescales and takes limits (say at the energy level ${u=0}$, for simplicity), the left-hand side turns out to rescale to be a lower order term, and one ends up with a hierarchy for the Dyson sine process:

$\displaystyle 0 = \frac{1}{2} \sum_{i=1}^k \partial_{x_i}^2 \rho_k - \sum_{1 \leq i,j \leq k;i\neq j} \partial_{x_i} \frac{\rho_k}{x_i-x_j} \ \ \ \ \ (3)$

$\displaystyle - \sum_{i=1}^k \partial_{x_i} \int_{\bf R} \frac{\rho_{k+1}(x_1,\ldots,x_{k+1})}{x_i-x_{k+1}}\ dx_{k+1}.$

Informally, these equations show that the Dyson sine process ${\Sigma = \{ \lambda_i: i \in {\bf Z} \}}$ is stationary with respect to the infinite Dyson Brownian motion

$\displaystyle d\lambda_i = dB_i + \sum_{j \neq i} \frac{dt}{\lambda_i-\lambda_j}$

where ${dB_i}$ are independent Brownian increments, and the sum is interpreted in a suitable principal value sense.

I recently set myself the exercise of deriving the identity (3) directly from the definition (1) of the Dyson sine process, without reference to GUE. This turns out to not be too difficult when done the right way (namely, by modifying the proof of Gaudin’s lemma), although it did take me an entire day of work before I realised this, and I could not find it in the literature (though I suspect that many people in the field have privately performed this exercise in the past). In any case, I am recording the computation here, largely because I really don’t want to have to do it again, but perhaps it will also be of interest to some readers.

If ${M}$ is a connected topological manifold, and ${p}$ is a point in ${M}$, the (topological) fundamental group ${\pi_1(M,p)}$ of ${M}$ at ${p}$ is traditionally defined as the space of equivalence classes of loops starting and ending at ${p}$, with two loops considered equivalent if they are homotopic to each other. (One can of course define the fundamental group for more general classes of topological spaces, such as locally path connected spaces, but we will stick with topological manifolds in order to avoid pathologies.) As the name suggests, it is one of the most basic topological invariants of a manifold, which among other things can be used to classify the covering spaces of that manifold. Indeed, given any such covering ${\phi: N \rightarrow M}$, the fundamental group ${\pi_1(M,p)}$ acts (on the right) by monodromy on the fibre ${\phi^{-1}(\{p\})}$, and conversely given any discrete set with a right action of ${\pi_1(M,p)}$, one can find a covering space with that monodromy action (this can be done by “tensoring” the universal cover with the given action, as illustrated below the fold). In more category-theoretic terms: monodromy produces an equivalence of categories between the category of covers of ${M}$, and the category of discrete ${\pi_1(M,p)}$-sets.

One of the basic tools used to compute fundamental groups is van Kampen’s theorem:

Theorem 1 (van Kampen’s theorem) Let ${M_1, M_2}$ be connected open sets covering a connected topological manifold ${M}$ with ${M_1 \cap M_2}$ also connected, and let ${p}$ be an element of ${M_1 \cap M_2}$. Then ${\pi_1(M_1 \cup M_2,p)}$ is isomorphic to the amalgamated free product ${\pi_1(M_1,p) *_{\pi_1(M_1\cap M_2,p)} \pi_1(M_2,p)}$.

Since the topological fundamental group is customarily defined using loops, it is not surprising that many proofs of van Kampen’s theorem (e.g. the one in Hatcher’s text) proceed by an analysis of the loops in ${M_1 \cup M_2}$, carefully deforming them into combinations of loops in ${M_1}$ or in ${M_2}$ and using the combinatorial description of the amalgamated free product (which was discussed in this previous blog post). But I recently learned (thanks to the responses to this recent MathOverflow question of mine) that by using the above-mentioned equivalence of categories, one can convert statements about fundamental groups to statements about coverings. In particular, van Kampen’s theorem turns out to be equivalent to a basic statement about how to glue a cover of ${M_1}$ and a cover of ${M_2}$ together to give a cover of ${M}$, and the amalgamated free product emerges through its categorical definition as a coproduct, rather than through its combinatorial description. One advantage of this alternate proof is that it can be extended to other contexts (such as the étale fundamental groups of varieties or schemes) in which the concept of a path or loop is no longer useful, but for which the notion of a covering is still important. I am thus recording (mostly for my own benefit) the covering-based proof of van Kampen’s theorem in the topological setting below the fold.

Two weeks ago I was at Oberwolfach, for the Arbeitsgemeinschaft in Ergodic Theory and Combinatorial Number Theory that I was one of the organisers for. At this workshop, I learned the details of a very nice recent convergence result of Miguel Walsh (who, incidentally, is an informal grandstudent of mine, as his advisor, Roman Sasyk, was my informal student), which considerably strengthens and generalises a number of previous convergence results in ergodic theory (including one of my own), with a remarkably simple proof. Walsh’s argument is phrased in a finitary language (somewhat similar, in fact, to the approach used in my paper mentioned previously), and (among other things) relies on the concept of metastability of sequences, a variant of the notion of convergence which is useful in situations in which one does not expect a uniform convergence rate; see this previous blog post for some discussion of metastability. When interpreted in a finitary setting, this concept requires a fair amount of “epsilon management” to manipulate; also, Walsh’s argument uses some other epsilon-intensive finitary arguments, such as a decomposition lemma of Gowers based on the Hahn-Banach theorem. As such, I was tempted to try to rewrite Walsh’s argument in the language of nonstandard analysis to see the extent to which these sorts of issues could be managed. As it turns out, the argument gets cleaned up rather nicely, with the notion of metastability being replaced with the simpler notion of external Cauchy convergence (which we will define below the fold).

Let’s first state Walsh’s theorem. This theorem is a norm convergence theorem in ergodic theory, and can be viewed as a substantial generalisation of one of the most fundamental theorems of this type, namely the mean ergodic theorem:

Theorem 1 (Mean ergodic theorem) Let ${(X,\mu,T)}$ be a measure-preserving system (a probability space ${(X,\mu)}$ with an invertible measure-preserving transformation ${T}$). Then for any ${f \in L^2(X,\mu)}$, the averages ${\frac{1}{N} \sum_{n=1}^N T^n f}$ converge in ${L^2(X,\mu)}$ norm as ${N \rightarrow \infty}$, where ${T^n f(x) := f(T^{-n} x)}$.

In this post, all functions in ${L^2(X,\mu)}$ and similar spaces will be taken to be real instead of complex-valued for simplicity, though the extension to the complex setting is routine.

Actually, we have a precise description of the limit of these averages, namely the orthogonal projection of ${f}$ to the ${T}$-invariant factors. (See for instance my lecture notes on this theorem.) While this theorem ostensibly involves measure theory, it can be abstracted to the more general setting of unitary operators on a Hilbert space:

Theorem 2 (von Neumann mean ergodic theorem) Let ${H}$ be a Hilbert space, and let ${U: H \rightarrow H}$ be a unitary operator on ${H}$. Then for any ${f \in H}$, the averages ${\frac{1}{N} \sum_{n=1}^N U^n f}$ converge strongly in ${H}$ as ${N \rightarrow \infty}$.

Again, see my lecture notes (or just about any text in ergodic theory) for a proof.
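
As a concrete illustration of Theorem 2, take ${H = {\bf R}^3}$ and let ${U}$ be rotation by an irrational angle in the first two coordinates (and the identity on the third); the averages should then converge to the projection of ${f}$ onto the ${U}$-invariant subspace, which here is the third coordinate axis. A short numerical experiment (with arbitrary illustrative choices) confirms this:

```python
import numpy as np

theta = np.sqrt(2) * np.pi             # an irrational rotation angle
U = np.eye(3)
U[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
f = np.array([1.0, 2.0, 3.0])

N = 100000
avg, Unf = np.zeros(3), f.copy()
for _ in range(N):
    Unf = U @ Unf                      # U^n f
    avg += Unf / N
print(avg)                             # close to (0, 0, 3), the projection
                                       # onto the U-invariant subspace
```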

Now we turn to Walsh’s theorem.

Theorem 3 (Walsh’s convergence theorem) Let ${(X,\mu)}$ be a measure space with a measure-preserving action of a nilpotent group ${G}$. Let ${g_1,\ldots,g_k: {\bf Z} \rightarrow G}$ be polynomial sequences in ${G}$ (i.e. each ${g_i}$ takes the form ${g_i(n) = a_{i,1}^{p_{i,1}(n)} \ldots a_{i,j}^{p_{i,j}(n)}}$ for some ${a_{i,1},\ldots,a_{i,j} \in G}$ and polynomials ${p_{i,1},\ldots,p_{i,j}: {\bf Z} \rightarrow {\bf Z}}$). Then for any ${f_1,\ldots,f_k \in L^\infty(X,\mu)}$, the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ converge in ${L^2(X,\mu)}$ norm as ${N \rightarrow \infty}$, where ${g(n) f(x) := f(g(n)^{-1} x)}$.

It turns out that this theorem can also be abstracted to some extent, although due to the multiplication in the summand ${(g_1(n) f_1) \ldots (g_k(n) f_k)}$, one cannot work purely with Hilbert spaces as in the von Neumann mean ergodic theorem, but must also work with something like the Banach algebra ${L^\infty(X,\mu)}$. There are a number of ways to formulate this abstraction (which will be of some minor convenience to us, as it will allow us to reduce the need to invoke the nonstandard measure theory of Loeb, discussed for instance in this blog post); we will use the notion of a (real) commutative probability space ${({\mathcal A},\tau)}$, which for us will be a commutative unital algebra ${{\mathcal A}}$ over the reals together with a linear functional ${\tau: {\mathcal A} \rightarrow {\bf R}}$ which maps ${1}$ to ${1}$ and obeys the non-negativity axiom ${\tau(f^2) \ge 0}$ for all ${f}$. The key example to keep in mind here is the algebra ${{\mathcal A} = L^\infty(X,\mu)}$ of essentially bounded real-valued measurable functions with the (essential) supremum norm, and with the trace ${\tau(f) := \int_X f\ d\mu}$. We will also assume in our definition of commutative probability spaces that all elements ${f}$ of ${{\mathcal A}}$ are bounded in the sense that the spectral radius ${\rho(f) := \lim_{k \rightarrow \infty} \tau(f^{2k})^{1/2k}}$ is finite. (In the concrete case of ${L^\infty(X,\mu)}$, the spectral radius is just the ${L^\infty}$ norm.)

Given a commutative probability space, we can form an inner product ${\langle \cdot, \cdot \rangle_{L^2(\tau)}}$ on it by the formula

$\displaystyle \langle f, g \rangle_{L^2(\tau)} := \tau(fg).$

This is a positive semi-definite form, and gives a (possibly degenerate) inner product structure on ${{\mathcal A}}$. We could complete this structure into a Hilbert space ${L^2(\tau)}$ (after quotienting out the elements of zero norm), but we will not do so here, instead just viewing ${L^2(\tau)}$ as providing a semi-metric on ${{\mathcal A}}$. For future reference we record the inequalities

$\displaystyle \rho(fg) \leq \rho(f) \rho(g)$

$\displaystyle \rho(f+g) \leq \rho(f) + \rho(g)$

$\displaystyle \| fg\|_{L^2(\tau)} \leq \|f\|_{L^2(\tau)} \rho(g)$

for any ${f,g}$, which we will use in the sequel without further comment; see e.g. these previous blog notes for proofs. (Actually, for the purposes of proving Theorem 3, one can specialise to the ${L^\infty(X,\mu)}$ case (and ultraproducts thereof), in which case these inequalities are just the triangle and Hölder inequalities.)
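For a concrete sanity check of these three inequalities (which, as just remarked, reduce in the concrete setting to the triangle and Hölder inequalities), one can work on a finite probability space, where ${\tau}$ is the average, ${\rho}$ is the sup norm, and ${\| \cdot \|_{L^2(\tau)}}$ is the averaged ${L^2}$ norm. The following sketch (illustrative only) verifies them on random data:

```python
# Finite sanity check of the three recorded inequalities: on a finite
# probability space with uniform measure, tau is the average, rho(f) is
# the sup norm, and ||f||_{L^2(tau)} is the averaged L^2 norm.
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=1000)
g = rng.normal(size=1000)

rho = lambda h: np.max(np.abs(h))        # spectral radius = L^infty norm here
l2 = lambda h: np.sqrt(np.mean(h * h))   # L^2(tau) (semi-)norm

assert rho(f * g) <= rho(f) * rho(g) + 1e-12
assert rho(f + g) <= rho(f) + rho(g) + 1e-12
assert l2(f * g) <= l2(f) * rho(g) + 1e-12
print("all three inequalities hold on this sample")
```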

The abstract version of Theorem 3 is then

Theorem 4 (Walsh’s theorem, abstract version) Let ${({\mathcal A},\tau)}$ be a commutative probability space, and let ${G}$ be a nilpotent group acting on ${{\mathcal A}}$ by isomorphisms (preserving the algebra, conjugation, and trace structure, and thus also preserving the spectral radius and ${L^2(\tau)}$ norm). Let ${g_1,\ldots,g_k: {\bf Z} \rightarrow G}$ be polynomial sequences. Then for any ${f_1,\ldots,f_k \in {\mathcal A}}$, the averages ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$ form a Cauchy sequence in ${L^2(\tau)}$ (semi-)norm as ${N \rightarrow \infty}$.

It is easy to see that this theorem generalises Theorem 3. Conversely, one can use the commutative Gelfand-Naimark theorem to deduce Theorem 4 from Theorem 3, although we will not need this implication. Note how we are abandoning all attempts to discern what the limit of the sequence actually is, instead contenting ourselves with demonstrating that it is merely a Cauchy sequence. With this phrasing, it is tempting to ask whether there is any analogue of Walsh’s theorem for noncommutative probability spaces, but unfortunately the answer to that question is negative for all but the simplest of averages, as was worked out in this paper of Austin, Eisner, and myself.

Our proof of Theorem 4 will proceed as follows. Firstly, in order to avoid the epsilon management alluded to earlier, we will take an ultraproduct to rephrase the theorem in the language of nonstandard analysis; for reasons that will be clearer later, we will also convert the convergence problem to a problem of obtaining metastability (external Cauchy convergence). Then, we observe that (the nonstandard counterpart of) the expression ${\|\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)\|_{L^2(\tau)}^2}$ can be viewed as the inner product of (say) ${f_k}$ with a certain type of expression, which we call a dual function. By performing an orthogonal projection to the span of the dual functions, we can split ${f_k}$ into the sum of an expression orthogonal to all dual functions (the “pseudorandom” component), and a function that can be well approximated by finite linear combinations of dual functions (the “structured” component). The contribution of the pseudorandom component is asymptotically negligible, so we can reduce to consideration of the structured component. But by a little bit of rearrangement, this can be viewed as an average of expressions similar to the initial average ${\frac{1}{N} \sum_{n=1}^N (g_1(n) f_1) \ldots (g_k(n) f_k)}$, except with the polynomials ${g_1,\ldots,g_k}$ replaced by a “lower complexity” set of such polynomials, which can be greater in number, but which have slightly lower degrees in some sense. One can iterate this (using “PET induction”) until all the polynomials become trivial, at which point the claim follows.

One of the basic general problems in analytic number theory is to understand as much as possible the fluctuations of the Möbius function ${\mu(n)}$, defined as ${(-1)^k}$ when ${n}$ is the product of ${k}$ distinct primes, and zero otherwise. For instance, as ${\mu}$ takes values in ${\{-1,0,1\}}$, we have the trivial bound

$\displaystyle |\sum_{n \leq x} \mu(n)| \leq x$

and the seemingly slight improvement

$\displaystyle \sum_{n \leq x} \mu(n) = o(x) \ \ \ \ \ (1)$

is already equivalent to the prime number theorem, as observed by Landau (see e.g. this previous blog post for a proof), while the much stronger (and still open) improvement

$\displaystyle \sum_{n \leq x} \mu(n) = O(x^{1/2+o(1)})$

is equivalent to the notorious Riemann hypothesis.
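None of these assertions can be settled by computation, of course, but it is easy to watch the cancellation numerically. The following sketch (illustrative only) sieves out the Möbius function up to ${10^6}$ and compares the partial sums ${M(x) := \sum_{n \leq x} \mu(n)}$ against ${\sqrt{x}}$:

```python
# Numerical illustration (not a proof!) of the cancellation in the
# Mertens-type sums M(x) = sum_{n<=x} mu(n).
import numpy as np

X = 10**6
# sieve of Eratosthenes for the primes up to X
is_prime = np.ones(X + 1, dtype=bool)
is_prime[:2] = False
for p in range(2, int(X**0.5) + 1):
    if is_prime[p]:
        is_prime[p * p::p] = False
primes = np.flatnonzero(is_prime)

# Moebius function via its definition: one sign flip per prime factor,
# and mu(n) = 0 whenever a square p^2 divides n
mu = np.ones(X + 1, dtype=np.int64)
mu[0] = 0
for p in primes:
    mu[p::p] *= -1
    mu[p * p::p * p] = 0

M = np.cumsum(mu)                        # the Mertens function M(x)
for x in [10**3, 10**4, 10**5, 10**6]:
    print(f"x = {x:8d}   M(x) = {M[x]:6d}   M(x)/sqrt(x) = {M[x] / x**0.5:+.3f}")
# In this modest range the ratio M(x)/sqrt(x) stays of size O(1),
# consistent with (but far from proving) square root cancellation.
```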

There is a general Möbius pseudorandomness heuristic that suggests that the sign pattern ${\mu}$ behaves so randomly (or pseudorandomly) that one should expect a substantial amount of cancellation in sums that involve the sign fluctuation of the Möbius function in a nontrivial fashion, with the amount of cancellation present comparable to the amount that an analogous random sum would provide; cf. the probabilistic heuristic discussed in this recent blog post. There are a number of ways to make this heuristic precise. As already mentioned, the Riemann hypothesis can be considered one such manifestation of the heuristic. Another manifestation is the following old conjecture of Chowla:

Conjecture 1 (Chowla’s conjecture) For any fixed integer ${m}$ and exponents ${a_1,a_2,\ldots,a_m \geq 0}$, with at least one of the ${a_i}$ odd (so as not to completely destroy the sign cancellation), we have

$\displaystyle \sum_{n \leq x} \mu(n+1)^{a_1} \ldots \mu(n+m)^{a_m} = o_{x \rightarrow \infty;m}(x).$

Note that as ${\mu^a = \mu^{a+2}}$ for any ${a \geq 1}$, we can reduce to the case when the ${a_i}$ take values in ${0,1,2}$ here. When only one of the ${a_i}$ is odd, this is essentially the prime number theorem in arithmetic progressions (after some elementary sieving), but when two or more of the ${a_i}$ are odd, the problem becomes completely open. For instance, the estimate

$\displaystyle \sum_{n \leq x} \mu(n) \mu(n+2) = o(x)$

is morally very close to the conjectured asymptotic

$\displaystyle \sum_{n \leq x} \Lambda(n) \Lambda(n+2) = 2\Pi_2 x + o(x)$

for the von Mangoldt function ${\Lambda}$, where ${\Pi_2 := \prod_{p > 2} (1 - \frac{1}{(p-1)^2}) = 0.66016\ldots}$ is the twin prime constant; this asymptotic in turn implies the twin prime conjecture. (To formally deduce estimates for von Mangoldt from estimates for Möbius, though, typically requires some better control on the error terms than ${o()}$, in particular gains of some power of ${\log x}$ are usually needed. See this previous blog post for more discussion.)
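As with the Mertens sums above, one can at least observe the conjectured cancellation empirically; the sketch below (illustrative only, and of course no substitute for a proof) computes the normalised correlation ${\frac{1}{x} \sum_{n \leq x} \mu(n) \mu(n+2)}$ for moderate ${x}$:

```python
# Empirical look at the first open case of Chowla's conjecture: the
# normalised correlation (1/x) sum_{n<=x} mu(n) mu(n+2) appears small.
import numpy as np

X = 10**6
mu = np.ones(X + 3, dtype=np.int64)
mu[0] = 0
is_p = np.ones(X + 3, dtype=bool)
is_p[:2] = False
for p in range(2, int((X + 2) ** 0.5) + 1):
    if is_p[p]:
        is_p[p * p::p] = False
for p in np.flatnonzero(is_p):
    mu[p::p] *= -1           # one sign flip per prime factor
    mu[p * p::p * p] = 0     # mu vanishes on non-squarefree numbers

corr = np.cumsum(mu[1:X + 1] * mu[3:X + 3])   # sum_{n<=x} mu(n) mu(n+2)
for x in [10**3, 10**4, 10**5, 10**6]:
    print(f"x = {x:8d}   (1/x) sum mu(n) mu(n+2) = {corr[x - 1] / x:+.5f}")
```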

Remark 1 The Chowla conjecture resembles an assertion that, for ${n}$ chosen randomly and uniformly from ${1}$ to ${x}$, the random variables ${\mu(n+1),\ldots,\mu(n+m)}$ become asymptotically independent of each other (in the probabilistic sense) as ${x \rightarrow \infty}$. However, this is not quite accurate, because some moments (namely those with all exponents ${a_i}$ even) have the “wrong” asymptotic value, leading to some unwanted correlation between these variables. For instance, the events ${\mu(n)=0}$ and ${\mu(n+4)=0}$ have a strong correlation with each other, basically because they are both strongly correlated with the event of ${n}$ being divisible by ${4}$. A more accurate interpretation of the Chowla conjecture is that the random variables ${\mu(n+1),\ldots,\mu(n+m)}$ are asymptotically conditionally independent of each other, after conditioning on the zero pattern ${\mu(n+1)^2,\ldots,\mu(n+m)^2}$; thus, it is the sign of the Möbius function that fluctuates like random noise, rather than the zero pattern. (The situation is a bit cleaner if one works with the Liouville function ${\lambda}$ instead of the Möbius function ${\mu}$, as this function never vanishes, but we will stick to the traditional Möbius function formalism here.)

A more recent formulation of the Möbius randomness heuristic is the following conjecture of Sarnak. Given a bounded sequence ${f: {\bf N} \rightarrow {\bf C}}$, define the topological entropy of the sequence to be the least exponent ${\sigma}$ with the property that for any fixed ${\epsilon > 0}$, and for ${m}$ going to infinity, the set ${\{ (f(n+1),\ldots,f(n+m)): n \in {\bf N} \} \subset {\bf C}^m}$ of ${m}$-blocks of ${f}$ can be covered by ${O( \exp( \sigma m + o(m) ) )}$ balls of radius ${\epsilon}$. (If ${f}$ arises from a minimal topological dynamical system ${(X,T)}$ and a continuous function ${F: X \rightarrow {\bf C}}$ by ${f(n) := F(T^n x)}$ for some ${x \in X}$, the above notion is equivalent to the usual notion of the topological entropy of a dynamical system.) For instance, if the sequence is a bit sequence (i.e. it takes values in ${\{0,1\}}$), then there are only ${\exp(\sigma m + o(m))}$ ${m}$-bit patterns that can appear as blocks of ${m}$ consecutive bits in this sequence, where ${\sigma}$ is the topological entropy. As a special case, a Turing machine with bounded memory that had access to a random number generator at the rate of one random bit produced every ${T}$ units of time, but otherwise evolved deterministically, would have an output sequence that had a topological entropy of at most ${\frac{1}{T} \log 2}$. A bounded sequence is said to be deterministic if its topological entropy is zero. A typical example is a polynomial sequence such as ${f(n) := e^{2\pi i \alpha n^2}}$ for some fixed ${\alpha}$; the ${m}$-blocks of such a polynomial sequence have covering numbers that only grow polynomially in ${m}$, rather than exponentially, thus yielding zero entropy. Unipotent flows, such as the horocycle flow on a compact hyperbolic surface, are another good source of deterministic sequences.
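One can get a feel for this definition by simply counting distinct ${m}$-blocks in a finite sample of a sequence. In the sketch below (a crude finite-size experiment; the sample length and block sizes are arbitrary choices), a random bit sequence exhibits close to ${2^m}$ blocks of length ${m}$ (entropy ${\log 2}$), while the Thue-Morse sequence, whose block count famously grows only linearly in ${m}$, is deterministic in the above sense:

```python
# Crude block-counting experiment for topological entropy (finite-size
# effects and constants are ignored; this only illustrates the contrast).
import numpy as np

N = 2**16
rng = np.random.default_rng(2)
random_bits = rng.integers(0, 2, size=N)
# Thue-Morse: parity of the number of 1-bits in the binary expansion of n
thue_morse = np.array([bin(n).count("1") % 2 for n in range(N)])

def distinct_blocks(seq, m):
    """Number of distinct length-m blocks occurring in the sample."""
    return len({tuple(seq[i:i + m]) for i in range(len(seq) - m)})

for m in [2, 4, 8, 12]:
    print(f"m = {m:2d}   random: {distinct_blocks(random_bits, m):5d}"
          f"   Thue-Morse: {distinct_blocks(thue_morse, m):3d}")
# The random column grows like 2^m (until saturated by the sample size);
# the Thue-Morse column grows only linearly in m, signalling zero entropy.
```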

Conjecture 2 (Sarnak’s conjecture) Let ${f: {\bf N} \rightarrow {\bf C}}$ be a deterministic bounded sequence. Then

$\displaystyle \sum_{n \leq x} \mu(n) f(n) = o_{x \rightarrow \infty;f}(x).$

This conjecture in general is still quite far from being solved. However, special cases are known:

• For constant sequences, this is essentially the prime number theorem (1).
• For periodic sequences, this is essentially the prime number theorem in arithmetic progressions.
• For quasiperiodic sequences such as ${f(n) = F(\alpha n \hbox{ mod } 1)}$ for some continuous ${F}$, this follows from the work of Davenport.
• For nilsequences, this is a result of Ben Green and myself.
• For horocycle flows, this is a result of Bourgain, Sarnak, and Ziegler.
• For the Thue-Morse sequence, this is a result of Dartyge-Tenenbaum (with a stronger error term obtained by Mauduit-Rivat). A subsequent result of Bourgain handles all bounded rank one sequences (though the Thue-Morse sequence is actually of rank two), and a related result of Green establishes asymptotic orthogonality of the Möbius function to bounded depth circuits, although such functions are not necessarily deterministic in nature.
• For the Rudin-Shapiro sequence, I sketched out an argument at this MathOverflow post.
• The Möbius function is known to itself be non-deterministic, because its square ${\mu^2(n)}$ (i.e. the indicator of the square-free numbers) is known to be non-deterministic (indeed, its topological entropy is ${\frac{6}{\pi^2}\log 2}$). (The corresponding question for the Liouville function ${\lambda(n)}$, however, remains open, as the square ${\lambda^2(n)=1}$ has zero entropy.)
• In the converse direction, it is easy to construct sequences of arbitrarily small positive entropy that correlate with the Möbius function (a rather silly example is ${\mu(n) 1_{k|n}}$ for some fixed large (squarefree) ${k}$, which has topological entropy at most ${\log 2/k}$ but clearly correlates with ${\mu}$).

See this survey of Sarnak for further discussion of this and related topics.

In this post I wanted to give a very nice argument of Sarnak that links the above two conjectures:

Proposition 3 The Chowla conjecture implies the Sarnak conjecture.

The argument does not use any number-theoretic properties of the Möbius function; one could replace ${\mu}$ in both conjectures by any other function from the natural numbers to ${\{-1,0,+1\}}$ and obtain the same implication. The argument consists of the following ingredients:

1. To show that ${\sum_{n \leq x} \mu(n) f(n) = o(x)}$, it suffices to show that the expectation of the random variable ${\frac{1}{m} (\mu(n+1)f(n+1)+\ldots+\mu(n+m)f(n+m))}$, where ${n}$ is drawn uniformly at random from ${1}$ to ${x}$, can be made arbitrarily small by making ${m}$ large (and ${x}$ even larger).
2. By the union bound and the zero topological entropy of ${f}$, it suffices to show that for any bounded deterministic coefficients ${c_1,\ldots,c_m}$, the random variable ${\frac{1}{m}(c_1 \mu(n+1) + \ldots + c_m \mu(n+m))}$ concentrates with exponentially high probability.
3. Finally, this exponentially high concentration can be achieved by a slight variant of the moment method proof of large deviation estimates such as the Chernoff inequality or Hoeffding inequality (as discussed in this blog post); a toy numerical illustration of this concentration appears just after this list.
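Here is the promised toy version of the concentration in step 3 (a sketch only, with independent random signs standing in for the Möbius values, which is precisely the substitution that the Chowla conjecture is intended to license; the parameters are arbitrary). For fixed coefficients ${|c_i| \leq 1}$ and independent signs ${\epsilon_i}$, Hoeffding's inequality gives ${{\bf P}( |\frac{1}{m} \sum_{i=1}^m c_i \epsilon_i| > t ) \leq 2 e^{-m t^2/2}}$:

```python
# Toy simulation of the concentration used in step 3: fixed bounded
# weights c_i, independent random signs eps_i, and the Hoeffding bound
# P(|(1/m) sum c_i eps_i| > t) <= 2 exp(-m t^2 / 2) when |c_i| <= 1.
import numpy as np

rng = np.random.default_rng(3)
m, trials, t = 400, 10000, 0.1
c = rng.uniform(-1, 1, size=m)                        # fixed coefficients
eps = rng.integers(0, 2, size=(trials, m)) * 2.0 - 1  # independent signs
averages = (eps @ c) / m                              # (1/m) sum_i c_i eps_i

empirical = np.mean(np.abs(averages) > t)
bound = 2 * np.exp(-m * t**2 / 2)
print(f"empirical tail = {empirical:.2e}   Hoeffding bound = {bound:.2e}")
# The empirical tail probability sits comfortably below the exponential
# bound, which is the uniformity the union bound in step 2 requires.
```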

As is often the case, though, while the “top-down” order of steps presented above is perhaps the clearest way to think conceptually about the argument, in order to present the argument formally it is more convenient to present the arguments in the reverse (or “bottom-up”) order. This is the approach taken below the fold.

One of the basic problems in the field of operator algebras is to develop a functional calculus for either a single operator ${A}$, or a collection ${A_1, A_2, \ldots, A_k}$ of operators. These operators could in principle act on any function space, but typically one either considers complex matrices (which act on a complex finite dimensional space), or operators (either bounded or unbounded) on a complex Hilbert space. (One can of course also obtain analogous results for real operators, but we will work throughout with complex operators in this post.)

Roughly speaking, a functional calculus is a way to assign an operator ${F(A)}$ or ${F(A_1,\ldots,A_k)}$ to any function ${F}$ in a suitable function space. This assignment should be linear over the complex numbers, preserve the scalars (i.e. ${c(A) = c}$ when ${c \in {\bf C}}$), and be an exact or approximate homomorphism in the sense that the identity

$\displaystyle FG(A_1,\ldots,A_k) = F(A_1,\ldots,A_k) G(A_1,\ldots,A_k), \ \ \ \ \ (1)$

should hold either exactly or approximately. In the case when the ${A_i}$ are self-adjoint operators acting on a Hilbert space (or Hermitian matrices), one often also desires the identity

$\displaystyle \overline{F}(A_1,\ldots,A_k) = F(A_1,\ldots,A_k)^* \ \ \ \ \ (2)$

to also hold either exactly or approximately. (Note that one cannot reasonably expect (1) and (2) to hold exactly for all ${F,G}$ if the ${A_1,\ldots,A_k}$ and their adjoints ${A_1^*,\ldots,A_k^*}$ do not commute with each other, so in those cases one has to be willing to allow some error terms in the above wish list of properties of the calculus.) Ideally, one should also be able to relate the operator norm of ${f(A)}$ or ${f(A_1,\ldots,A_k)}$ with something like the uniform norm on ${f}$. In principle, the existence of a good functional calculus allows one to manipulate operators as if they were scalars (or at least approximately as if they were scalars), which is very helpful for a number of applications, such as partial differential equations, spectral theory, noncommutative probability, and semiclassical mechanics. A functional calculus for multiple operators ${A_1,\ldots,A_k}$ can be particularly valuable as it allows one to treat ${A_1,\ldots,A_k}$ as being exact or approximate scalars simultaneously. For instance, if one is trying to solve a linear differential equation that can (formally at least) be expressed in the form

$\displaystyle F(X,D) u = f$

for some data ${f}$, unknown function ${u}$, some differential operators ${X,D}$, and some nice function ${F}$, then if one’s functional calculus is good enough (and ${F}$ is suitably “elliptic” in the sense that it does not vanish or otherwise degenerate too often), one should be able to solve this equation either exactly or approximately by the formula

$\displaystyle u = F^{-1}(X,D) f,$

which is of course how one would solve this equation if one pretended that the operators ${X,D}$ were in fact scalars. Formalising this calculus rigorously leads to the theory of pseudodifferential operators, which allows one to (approximately) solve or at least simplify a much wider range of differential equations than what one can achieve with more elementary algebraic transformations (e.g. integrating factors, change of variables, variation of parameters, etc.). In quantum mechanics, a functional calculus that allows one to treat operators as if they were approximately scalar can be used to rigorously justify the correspondence principle in physics, namely that the predictions of quantum mechanics approximate those of classical mechanics in the semiclassical limit ${\hbar \rightarrow 0}$.
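Here is perhaps the simplest rigorous instance of the recipe ${u = F^{-1}(X,D) f}$ (a sketch under strong simplifying assumptions: constant coefficients, periodic boundary conditions, and an elliptic symbol). For ${F(D) = 1 - \frac{d^2}{dx^2}}$ on the circle, ${F}$ becomes multiplication by ${1 + k^2}$ on the Fourier side, which never vanishes, so the inversion is exact:

```python
# Minimal instance of u = F^{-1}(D) f: the constant-coefficient elliptic
# operator F(D) = 1 - d^2/dx^2 on the periodic interval [0, 2*pi),
# inverted exactly as a Fourier multiplier via the FFT.
import numpy as np

n = 256
x = 2 * np.pi * np.arange(n) / n
f = np.exp(np.cos(x))                    # some smooth periodic data
k = np.fft.fftfreq(n, d=1.0 / n)         # integer frequencies

# solve (1 - d^2/dx^2) u = f by dividing by the symbol 1 + k^2
u = np.fft.ifft(np.fft.fft(f) / (1 + k**2)).real

# check: apply 1 - d^2/dx^2 back to u spectrally and compare with f
residual = np.fft.ifft((1 + k**2) * np.fft.fft(u)).real - f
print("residual:", np.max(np.abs(residual)))   # ~ machine precision
```

The whole point of pseudodifferential calculus is to extend this division-by-the-symbol idea, approximately, to variable-coefficient operators where ${X}$ and ${D}$ no longer commute.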

There is no universal functional calculus that works in all situations; the strongest functional calculi, which are close to being exact *-homomorphisms on a very large class of functions, tend to only work under very restrictive hypotheses on ${A}$ or ${A_1,\ldots,A_k}$ (in particular, when ${k > 1}$, one needs the ${A_1,\ldots,A_k}$ to commute either exactly, or very close to exactly), while there are weaker functional calculi which have fewer nice properties and only work for a very small class of functions, but can be applied to quite general operators ${A}$ or ${A_1,\ldots,A_k}$. In some cases the functional calculus is only formal, in the sense that ${F(A)}$ or ${F(A_1,\ldots,A_k)}$ has to be interpreted as an infinite formal series that does not converge in a traditional sense. Also, when one wishes to select a functional calculus on non-commuting operators ${A_1,\ldots,A_k}$, there is a certain amount of non-uniqueness: one generally has a number of slightly different functional calculi to choose from, which have broadly the same properties but differ in some minor technical details (particularly with regards to the behaviour of “lower order” components of the calculus). This is similar to how one has a variety of slightly different coordinate systems available to parameterise a Riemannian manifold or Lie group. This is in contrast to the ${k=1}$ case when the underlying operator ${A = A_1}$ is (essentially) normal (so that ${A}$ commutes with ${A^*}$); in this special case (which includes the important subcases when ${A}$ is unitary or (essentially) self-adjoint), spectral theory gives us a canonical and very powerful functional calculus which can be used without further modification in applications.
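In finite dimensions, this canonical calculus for a single self-adjoint operator is nothing more than applying ${F}$ to the eigenvalues; the sketch below (illustrative only) checks that the homomorphism property (1) and the adjoint property (2) then hold exactly, up to rounding error:

```python
# Sketch of the k = 1 spectral functional calculus for a Hermitian
# matrix: define F(A) by applying F to the eigenvalues, then verify the
# homomorphism and adjoint properties, which hold exactly in this case.
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
A = (B + B.conj().T) / 2                 # a Hermitian matrix

def calc(F, A):
    """F(A) via the spectral theorem: diagonalise and apply F."""
    lam, V = np.linalg.eigh(A)
    return V @ np.diag(F(lam)) @ V.conj().T

F = lambda x: np.exp(1j * x)             # F(x) = e^{ix}
G = lambda x: x**2 + 1
FG = lambda x: F(x) * G(x)               # pointwise product of symbols

print("homomorphism error:",
      np.linalg.norm(calc(FG, A) - calc(F, A) @ calc(G, A)))   # ~ 1e-15
print("adjoint error:",
      np.linalg.norm(calc(lambda x: np.conj(F(x)), A)
                     - calc(F, A).conj().T))                   # ~ 1e-15
```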

Despite this lack of uniqueness, there is one standard choice for a functional calculus available for general operators ${A_1,\ldots,A_k}$, namely the Weyl functional calculus; it is analogous in some ways to normal coordinates for Riemannian manifolds, or exponential coordinates of the first kind for Lie groups, in that it treats lower order terms in a reasonably nice fashion. (But it is important to keep in mind that, like its analogues in Riemannian geometry or Lie theory, there will be some instances in which the Weyl calculus is not the optimal calculus to use for the application at hand.)
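To see the flavour of the Weyl prescription in the simplest nontrivial case: the symbol ${a(x,\xi) = x \xi}$ has no unique quantisation, since ${X}$ and the momentum operator ${P}$ do not commute, and the Weyl calculus selects the symmetrised ordering ${\frac{1}{2}(XP+PX)}$, which (unlike the naive ${XP}$) is self-adjoint. The toy computation below (a finite periodic grid with a spectral momentum operator; it is only meant to exhibit the ordering issue, not to be a faithful Weyl quantisation) illustrates this:

```python
# Toy glimpse of the Weyl ordering: on a periodic grid, X = multiplication
# by x and P = -i d/dx (spectral) do not commute, so x*xi has many operator
# versions; the symmetrised (XP + PX)/2 is Hermitian, the naive XP is not.
import numpy as np

n = 64
xgrid = 2 * np.pi * np.arange(n) / n
F = np.fft.fft(np.eye(n), axis=0)            # DFT matrix, F^{-1} = F^*/n
k = np.fft.fftfreq(n, d=1.0 / n)             # integer frequencies
P = F.conj().T @ np.diag(k) @ F / n          # P = -i d/dx, Hermitian
X = np.diag(xgrid)

naive = X @ P
weyl = (X @ P + P @ X) / 2
herm_defect = lambda A: np.linalg.norm(A - A.conj().T)
print("||XP - (XP)^*||     =", herm_defect(naive))   # visibly nonzero
print("||Weyl - (Weyl)^*|| =", herm_defect(weyl))    # ~ machine precision
```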

I decided to write some notes on the Weyl functional calculus (also known as Weyl quantisation), and to sketch the applications of this calculus to the theory of pseudodifferential operators. They are mostly for my own benefit (so that I won’t have to redo these particular calculations again), but perhaps they will also be of interest to some readers here. (Of course, this material is also covered in many other places, e.g. Folland’s “Harmonic Analysis in Phase Space”.)