You are currently browsing the category archive for the ‘math.RA’ category.

The determinant ${\det(A)}$ of a square matrix ${A}$ obeys a large number of important identities, the most basic of which is the multiplicativity property

$\displaystyle \det(AB) = \det(A) \det(B) \ \ \ \ \ (1)$

whenever ${A,B}$ are square matrices of the same dimension. This identity then generates many other important identities. For instance, if ${A}$ is an ${n \times m}$ matrix and ${B}$ is an ${m \times n}$ matrix, then by applying the previous identity to equate the determinants of ${\begin{pmatrix} 1 & -A \\ B & 1 \end{pmatrix} \begin{pmatrix} 1 & A \\ 0 & 1 \end{pmatrix}}$ and ${\begin{pmatrix} 1 & A \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & -A \\ B & 1 \end{pmatrix}}$ (where we will adopt the convention that ${1}$ denotes an identity matrix of whatever dimension is needed to make sense of the expressions being computed, and similarly for ${0}$) we obtain the Sylvester determinant identity

$\displaystyle \det( 1 + AB ) = \det( 1 + BA ). \ \ \ \ \ (2)$

This identity, which relates an ${n \times n}$ determinant with an ${m \times m}$ determinant, is very useful in random matrix theory (a point emphasised in particular by Deift), particularly in regimes in which ${m}$ is much smaller than ${n}$.

Another identity generated from (1) arises when trying to compute the determinant of a ${(n+m) \times (n+m)}$ block matrix

$\displaystyle \begin{pmatrix} A & B \\ C & D \end{pmatrix}$

where ${A}$ is an ${n \times n}$ matrix, ${B}$ is an ${n \times m}$ matrix, ${C}$ is an ${m \times n}$ matrix, and ${D}$ is an ${m \times m}$ matrix. If ${A}$ is invertible, then we can manipulate this matrix via block Gaussian elimination as

$\displaystyle \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & A^{-1} B \\ C & D \end{pmatrix}$

$\displaystyle = \begin{pmatrix} A & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ C & 1 \end{pmatrix} \begin{pmatrix} 1 & A^{-1} B \\ 0 & D - C A^{-1} B \end{pmatrix}$

and on taking determinants using (1) we obtain the Schur determinant identity

$\displaystyle \det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A) \det( D - C A^{-1} B ) \ \ \ \ \ (3)$

relating the determinant of a block-diagonal matrix with the determinant of the Schur complement ${D-C A^{-1} B}$ of the upper left block ${A}$. This identity can be viewed as the correct way to generalise the ${2 \times 2}$ determinant formula

$\displaystyle \det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad-bc = a ( d - c a^{-1} b).$

It is also possible to use determinant identities to deduce other matrix identities that do not involve the determinant, by the technique of matrix differentiation (or equivalently, matrix linearisation). The key observation is that near the identity, the determinant behaves like the trace, or more precisely one has

$\displaystyle \det( 1 + \epsilon A ) = 1 + \epsilon \hbox{tr}(A) + O(\epsilon^2) \ \ \ \ \ (4)$

for any bounded square matrix ${A}$ and infinitesimal ${\epsilon}$. (If one is uncomfortable with infinitesimals, one can interpret this sort of identity as an asymptotic as ${\epsilon\rightarrow 0}$.) Combining this with (1) we see that for square matrices ${A,B}$ of the same dimension with ${A}$ invertible and ${A^{-1}, B}$ invertible, one has

$\displaystyle \det( A + \epsilon B ) = \det(A) \det(1 + \epsilon A^{-1} B )$

$\displaystyle = \det(A) (1 + \epsilon \hbox{tr}( A^{-1} B ) + O(\epsilon^2) )$

for infinitesimal ${\epsilon}$. To put it another way, if ${A(t)}$ is a square matrix that depends in a differentiable fashion on a real parameter ${t}$, then

$\displaystyle \frac{d}{dt} \det(A(t)) = \det(A(t)) \hbox{tr}( A(t)^{-1} \frac{d}{dt} A(t) )$

whenever ${A(t)}$ is invertible. (Note that if one combines this identity with cofactor expansion, one recovers Cramer’s rule.)

Let us see some examples of this differentiation method. If we take the Sylvester identity (2) and multiply one of the rectangular matrices ${A}$ by an infinitesimal ${\epsilon}$, we obtain

$\displaystyle \det( 1 + \epsilon A B ) = \det( 1 + \epsilon B A);$

applying (4) and extracting the linear term in ${\epsilon}$ (or equivalently, differentiating at ${\epsilon}$ and then setting ${\epsilon=0}$) we conclude the cyclic property of trace:

$\displaystyle \hbox{tr}(AB) = \hbox{tr}(BA).$

To manipulate derivatives and inverses, we begin with the Neumann series approximation

$\displaystyle (1 + \epsilon A)^{-1} = 1 - \epsilon A + O(\epsilon^2)$

for bounded square ${A}$ and infinitesimal ${\epsilon}$, which then leads to the more general approximation

$\displaystyle (A + \epsilon B)^{-1} = (1 + \epsilon A^{-1} B)^{-1} A^{-1}$

$\displaystyle = A^{-1} - \epsilon A^{-1} B A^{-1} + O(\epsilon^2) \ \ \ \ \ (5)$

for square matrices ${A,B}$ of the same dimension with ${B, A^{-1}}$ bounded. To put it another way, we have

$\displaystyle \frac{d}{dt} A(t)^{-1} = -A(t)^{-1} (\frac{d}{dt} A(t)) A(t)^{-1}$

whenever ${A(t)}$ depends in a differentiable manner on ${t}$ and ${A(t)}$ is invertible.

We can then differentiate (or linearise) the Schur identity (3) in a number of ways. For instance, if we replace the lower block ${D}$ by ${D + \epsilon H}$ for some test ${m \times m}$ matrix ${H}$, then by (4), the left-hand side of (3) becomes (assuming the invertibility of the block matrix)

$\displaystyle (\det \begin{pmatrix} A & B \\ C & D \end{pmatrix}) (1 + \epsilon \hbox{tr} \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} \begin{pmatrix} 0 & 0 \\ 0 & H \end{pmatrix} + O(\epsilon^2) )$

while the right-hand side becomes

$\displaystyle \det(A) \det(D-CA^{-1}B) (1 + \epsilon \hbox{tr}( (D-CA^{-1}B)^{-1} H ) + O(\epsilon^2) );$

extracting the linear term in ${\epsilon}$ (after dividing through by (3)), we conclude that

$\displaystyle \hbox{tr} (\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} \begin{pmatrix} 0 & 0 \\ 0 & H \end{pmatrix}) = \hbox{tr}( (D-CA^{-1}B)^{-1} H ).$

As ${H}$ was an arbitrary ${m \times m}$ matrix, we conclude from duality that the lower right ${m \times m}$ block of ${\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}}$ is given by the inverse ${(D-CA^{-1}B)^{-1}}$ of the Schur complement:

$\displaystyle \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} ?? & ?? \\ ?? & (D-CA^{-1}B)^{-1} \end{pmatrix}.$

One can also compute the other components of this inverse in terms of the Schur complement ${D-CA^{-1} B}$ by a similar method (although the formulae become more complicated). As a variant of this method, we can perturb the block matrix in (3) by an infinitesimal multiple of the identity matrix giving

$\displaystyle \det \begin{pmatrix} A+\epsilon & B \\ C & D+\epsilon \end{pmatrix} = \det(A+\epsilon) \det( D +\epsilon - C (A+\epsilon)^{-1} B ). \ \ \ \ \ (6)$

By (4), the left-hand side is

$\displaystyle (\det \begin{pmatrix} A & B \\ C & D \end{pmatrix}) (1 + \epsilon \hbox{tr} \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} + O(\epsilon^2) ).$

From (5), we have

$\displaystyle D + \epsilon - C (A+ \epsilon)^{-1} B = D - C A^{-1} B + \epsilon(1 + C A^{-2} B) + O(\epsilon^2)$

and so from (4) the right-hand side of (6) is

$\displaystyle \det(A) \det(D-CA^{-1} B) \times$

$\displaystyle \times ( 1 + \epsilon (\hbox{tr}(A^{-1}) + \hbox{tr}( (D-CA^{-1} B)^{-1} (1 + C A^{-2} B)) ) + O(\epsilon^2) );$

extracting the linear component in ${\epsilon}$, we conclude the identity

$\displaystyle \hbox{tr} \begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \hbox{tr}(A^{-1}) + \hbox{tr}( (D-CA^{-1} B)^{-1} (1 + C A^{-2} B)) \ \ \ \ \ (7)$

which relates the trace of the inverse of a block matrix, with the trace of the inverse of one of its blocks. This particular identity turns out to be useful in random matrix theory; I hope to elaborate on this in a later post.

As a final example of this method, we can analyse low rank perturbations ${A+BC}$ of a large (${n \times n}$) matrix ${A}$, where ${B}$ is an ${n \times m}$ matrix and ${C}$ is an ${m \times n}$ matrix for some ${m. (This type of situation is also common in random matrix theory, for instance it arose in this previous paper of mine on outliers to the circular law.) If ${A}$ is invertible, then from (1) and (2) one has the matrix determinant lemma

$\displaystyle \det( A + BC ) = \det(A) \det( 1 + A^{-1} BC) = \det(A) \det(1 + CA^{-1} B);$

if one then perturbs ${A}$ by an infinitesimal matrix ${\epsilon H}$, we have

$\displaystyle \det( A + BC + \epsilon H ) = \det(A + \epsilon H ) \det(1 + C(A+\epsilon H)^{-1} B).$

Extracting the linear component in ${\epsilon}$ as before, one soon arrives at

$\displaystyle \hbox{tr}( (A+BC)^{-1} H ) = \hbox{tr}( A^{-1} H ) - \hbox{tr}( (1 + C A^{-1} B)^{-1} C A^{-1} H A^{-1} B )$

assuming that ${A}$ and ${A+BC}$ are both invertible; as ${H}$ is arbitrary, we conclude (after using the cyclic property of trace) the Sherman-Morrison formula

$\displaystyle (A+BC)^{-1} = A^{-1} - A^{-1} B (1 + C A^{-1} B)^{-1} C A^{-1}$

for the inverse of a low rank perturbation ${A+BC}$ of a matrix ${A}$. While this identity can be easily verified by direct algebraic computation, it is somewhat difficult to discover this identity by such algebraic manipulation; thus we see that the “determinant first” approach to matrix identities can make it easier to find appropriate matrix identities (particularly those involving traces and/or inverses), even if the identities one is ultimately interested in do not involve determinants. (As differentiation typically makes an identity lengthier, but also more “linear” or “additive”, the determinant identity tends to be shorter (albeit more nonlinear and more multiplicative) than the differentiated identity, and can thus be slightly easier to derive.)

Exercise 1 Use the “determinant first” approach to derive the Woodbury matrix identity (also known as the binomial inverse theorem)

$\displaystyle (A+BDC)^{-1} = A^{-1} - A^{-1} B (D^{-1} + CA^{-1} B)^{-1} C A^{-1}$

where ${A}$ is an ${n \times n}$ matrix, ${B}$ is an ${n \times m}$ matrix, ${C}$ is an ${m \times n}$ matrix, and ${D}$ is an ${m \times m}$ matrix, assuming that ${A}$, ${D}$ and ${A+BDC}$ are all invertible.

Exercise 2 Let ${A,B}$ be invertible ${n \times n}$ matrices. Establish the identity

$\displaystyle \det(A + B) \det(A - B) = \det(B) \det( AB^{-1} A - B)$

and differentiate this in ${A}$ to deduce the identity

$\displaystyle (A+B)^{-1} + (A-B)^{-1} = 2 (A - BA^{-1} B)^{-1}$

(assuming that all inverses exist) and hence

$\displaystyle (A+B)^{-1} = (A - BA^{-1} B)^{-1} - (B - AB^{-1} A)^{-1}.$

Rotating ${B}$ by ${i}$ then gives

$\displaystyle (A+iB)^{-1} = (A + BA^{-1} B)^{-1} - i (B + AB^{-1} A)^{-1},$

which is useful for inverting a matrix ${A+iB}$ that has been split into a self-adjoint component ${A}$ and a skew-adjoint component ${iB}$.

Mathematicians study a variety of different mathematical structures, but perhaps the structures that are most commonly associated with mathematics are the number systems, such as the integers ${{\bf Z}}$ or the real numbers ${{\bf R}}$. Indeed, the use of number systems is so closely identified with the practice of mathematics that one sometimes forgets that it is possible to do mathematics without explicit reference to any concept of number. For instance, the ancient Greeks were able to prove many theorems in Euclidean geometry, well before the development of Cartesian coordinates and analytic geometry in the seventeenth century, or the formal constructions or axiomatisations of the real number system that emerged in the nineteenth century (not to mention precursor concepts such as zero or negative numbers, whose very existence was highly controversial, if entertained at all, to the ancient Greeks). To do this, the Greeks used geometric operations as substitutes for the arithmetic operations that would be more familiar to modern mathematicians. For instance, concatenation of line segments or planar regions serves as a substitute for addition; the operation of forming a rectangle out of two line segments would serve as a substitute for multiplication; the concept of similarity can be used as a substitute for ratios or division; and so forth.

A similar situation exists in modern physics. Physical quantities such as length, mass, momentum, charge, and so forth are routinely measured and manipulated using the real number system ${{\bf R}}$ (or related systems, such as ${{\bf R}^3}$ if one wishes to measure a vector-valued physical quantity such as velocity). Much as analytic geometry allows one to use the laws of algebra and trigonometry to calculate and prove theorems in geometry, the identification of physical quantities with numbers allows one to express physical laws and relationships (such as Einstein’s famous mass-energy equivalence ${E=mc^2}$) as algebraic (or differential) equations, which can then be solved and otherwise manipulated through the extensive mathematical toolbox that has been developed over the centuries to deal with such equations.

However, as any student of physics is aware, most physical quantities are not represented purely by one or more numbers, but instead by a combination of a number and some sort of unit. For instance, it would be a category error to assert that the length of some object was a number such as ${10}$; instead, one has to say something like “the length of this object is ${10}$ yards”, combining both a number ${10}$ and a unit (in this case, the yard). Changing the unit leads to a change in the numerical value assigned to this physical quantity, even though no physical change to the object being measured has occurred. For instance, if one decides to use feet as the unit of length instead of yards, then the length of the object is now ${30}$ feet; if one instead uses metres, the length is now ${9.144}$ metres; and so forth. But nothing physical has changed when performing this change of units, and these lengths are considered all equal to each other:

$\displaystyle 10 \hbox{ yards } = 30 \hbox{ feet } = 9.144 \hbox{ metres}.$

It is then common to declare that while physical quantities and units are not, strictly speaking, numbers, they should be manipulated using the laws of algebra as if they were numerical quantities. For instance, if an object travels ${10}$ metres in ${5}$ seconds, then its speed should be

$\displaystyle (10 m) / (5 s) = 2 ms^{-1}$

where we use the usual abbreviations of ${m}$ and ${s}$ for metres and seconds respectively. Similarly, if the speed of light ${c}$ is ${c=299 792 458 ms^{-1}}$ and an object has mass ${10 kg}$, then Einstein’s mass-energy equivalence ${E=mc^2}$ then tells us that the energy-content of this object is

$\displaystyle (10 kg) (299 792 458 ms^{-1})^2 \approx 8.99 \times 10^{17} kg m^2 s^{-2}.$

Note that the symbols ${kg, m, s}$ are being manipulated algebraically as if they were mathematical variables such as ${x}$ and ${y}$. By collecting all these units together, we see that every physical quantity gets assigned a unit of a certain dimension: for instance, we see here that the energy ${E}$ of an object can be given the unit of ${kg m^2 s^{-2}}$ (more commonly known as a Joule), which has the dimension of ${M L^2 T^{-2}}$ where ${M, L, T}$ are the dimensions of mass, length, and time respectively.

There is however one important limitation to the ability to manipulate “dimensionful” quantities as if they were numbers: one is not supposed to add, subtract, or compare two physical quantities if they have different dimensions, although it is acceptable to multiply or divide two such quantities. For instance, if ${m}$ is a mass (having the units ${M}$) and ${v}$ is a speed (having the units ${LT^{-1}}$), then it is physically “legitimate” to form an expression such as ${\frac{1}{2} mv^2}$, but not an expression such as ${m+v}$ or ${m-v}$; in a similar spirit, statements such as ${m=v}$ or ${m\geq v}$ are physically meaningless. This combines well with the mathematical distinction between vector, scalar, and matrix quantities, which among other things prohibits one from adding together two such quantities if their vector or matrix type are different (e.g. one cannot add a scalar to a vector, or a vector to a matrix), and also places limitations on when two such quantities can be multiplied together. A related limitation, which is not always made explicit in physics texts, is that transcendental mathematical functions such as ${\sin}$ or ${\exp}$ should only be applied to arguments that are dimensionless; thus, for instance, if ${v}$ is a speed, then ${\hbox{arctanh}(v)}$ is not physically meaningful, but ${\hbox{arctanh}(v/c)}$ is (this particular quantity is known as the rapidity associated to this speed).

These limitations may seem like a weakness in the mathematical modeling of physical quantities; one may think that one could get a more “powerful” mathematical framework if one were allowed to perform dimensionally inconsistent operations, such as add together a mass and a velocity, add together a vector and a scalar, exponentiate a length, etc. Certainly there is some precedent for this in mathematics; for instance, the formalism of Clifford algebras does in fact allow one to (among other things) add vectors with scalars, and in differential geometry it is quite common to formally apply transcendental functions (such as the exponential function) to a differential form (for instance, the Liouville measure ${\frac{1}{n!} \omega^n}$ of a symplectic manifold can be usefully thought of as a component of the exponential ${\exp(\omega)}$ of the symplectic form ${\omega}$).

However, there are several reasons why it is advantageous to retain the limitation to only perform dimensionally consistent operations. One is that of error correction: one can often catch (and correct for) errors in one’s calculations by discovering a dimensional inconsistency, and tracing it back to the first step where it occurs. Also, by performing dimensional analysis, one can often identify the form of a physical law before one has fully derived it. For instance, if one postulates the existence of a mass-energy relationship involving only the mass of an object ${m}$, the energy content ${E}$, and the speed of light ${c}$, dimensional analysis is already sufficient to deduce that the relationship must be of the form ${E = \alpha mc^2}$ for some dimensionless absolute constant ${\alpha}$; the only remaining task is then to work out the constant of proportionality ${\alpha}$, which requires physical arguments beyond that provided by dimensional analysis. (This is a simple instance of a more general application of dimensional analysis known as the Buckingham ${\pi}$ theorem.)

The use of units and dimensional analysis has certainly been proven to be very effective tools in physics. But one can pose the question of whether it has a properly grounded mathematical foundation, in order to settle any lingering unease about using such tools in physics, and also in order to rigorously develop such tools for purely mathematical purposes (such as analysing identities and inequalities in such fields of mathematics as harmonic analysis or partial differential equations).

The example of Euclidean geometry mentioned previously offers one possible approach to formalising the use of dimensions. For instance, one could model the length of a line segment not by a number, but rather by the equivalence class of all line segments congruent to the original line segment (cf. the Frege-Russell definition of a number). Similarly, the area of a planar region can be modeled not by a number, but by the equivalence class of all regions that are equidecomposable with the original region (one can, if one wishes, restrict attention here to measurable sets in order to avoid Banach-Tarski-type paradoxes, though that particular paradox actually only arises in three and higher dimensions). As mentioned before, it is then geometrically natural to multiply two lengths to form an area, by taking a rectangle whose line segments have the stated lengths, and using the area of that rectangle as a product. This geometric picture works well for units such as length and volume that have a spatial geometric interpretation, but it is less clear how to apply it for more general units. For instance, it does not seem geometrically natural (or, for that matter, conceptually helpful) to envision the equation ${E=mc^2}$ as the assertion that the energy ${E}$ is the volume of a rectangular box whose height is the mass ${m}$ and whose length and width is given by the speed of light ${c}$.

But there are at least two other ways to formalise dimensionful quantities in mathematics, which I will discuss below the fold. The first is a “parametric” model in which dimensionful objects are modeled as numbers (or vectors, matrices, etc.) depending on some base dimensional parameters (such as units of length, mass, and time, or perhaps a coordinate system for space or spacetime), and transforming according to some representation of a structure group that encodes the range of these parameters; this type of “coordinate-heavy” model is often used (either implicitly or explicitly) by physicists in order to efficiently perform calculations, particularly when manipulating vector or tensor-valued quantities. The second is an “abstract” model in which dimensionful objects now live in an abstract mathematical space (e.g. an abstract vector space), in which only a subset of the operations available to general-purpose number systems such as ${{\bf R}}$ or ${{\bf R}^3}$ are available, namely those operations which are “dimensionally consistent” or invariant (or more precisely, equivariant) with respect to the action of the underlying structure group. This sort of “coordinate-free” approach tends to be the one which is preferred by pure mathematicians, particularly in the various branches of modern geometry, in part because it can lead to greater conceptual clarity, as well as results of great generality; it is also close to the more informal practice of treating mathematical manipulations that do not preserve dimensional consistency as being physically meaningless.

Given a function ${f: X \rightarrow Y}$ between two sets ${X, Y}$, we can form the graph

$\displaystyle \Sigma := \{ (x,f(x)): x\in X \},$

which is a subset of the Cartesian product ${X \times Y}$.

There are a number of “closed graph theorems” in mathematics which relate the regularity properties of the function ${f}$ with the closure properties of the graph ${\Sigma}$, assuming some “completeness” properties of the domain ${X}$ and range ${Y}$. The most famous of these is the closed graph theorem from functional analysis, which I phrase as follows:

Theorem 1 (Closed graph theorem (functional analysis)) Let ${X, Y}$ be complete normed vector spaces over the reals (i.e. Banach spaces). Then a function ${f: X \rightarrow Y}$ is a continuous linear transformation if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is both linearly closed (i.e. it is a linear subspace of ${X \times Y}$) and topologically closed (i.e. closed in the product topology of ${X \times Y}$).

I like to think of this theorem as linking together qualitative and quantitative notions of regularity preservation properties of an operator ${f}$; see this blog post for further discussion.

The theorem is equivalent to the assertion that any continuous linear bijection ${f: X \rightarrow Y}$ from one Banach space to another is necessarily an isomorphism in the sense that the inverse map is also continuous and linear. Indeed, to see that this claim implies the closed graph theorem, one applies it to the projection from ${\Sigma}$ to ${X}$, which is a continuous linear bijection; conversely, to deduce this claim from the closed graph theorem, observe that the graph of the inverse ${f^{-1}}$ is the reflection of the graph of ${f}$. As such, the closed graph theorem is a corollary of the open mapping theorem, which asserts that any continuous linear surjection from one Banach space to another is open. (Conversely, one can deduce the open mapping theorem from the closed graph theorem by quotienting out the kernel of the continuous surjection to get a bijection.)

It turns out that there is a closed graph theorem (or equivalent reformulations of that theorem, such as an assertion that bijective morphisms between sufficiently “complete” objects are necessarily isomorphisms, or as an open mapping theorem) in many other categories in mathematics as well. Here are some easy ones:

Theorem 2 (Closed graph theorem (linear algebra)) Let ${X, Y}$ be vector spaces over a field ${k}$. Then a function ${f: X \rightarrow Y}$ is a linear transformation if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is linearly closed.

Theorem 3 (Closed graph theorem (group theory)) Let ${X, Y}$ be groups. Then a function ${f: X \rightarrow Y}$ is a group homomorphism if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is closed under the group operations (i.e. it is a subgroup of ${X \times Y}$).

Theorem 4 (Closed graph theorem (order theory)) Let ${X, Y}$ be totally ordered sets. Then a function ${f: X \rightarrow Y}$ is monotone increasing if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is totally ordered (using the product order on ${X \times Y}$).

Remark 1 Similar results to the above three theorems (with similarly easy proofs) hold for other algebraic structures, such as rings (using the usual product of rings), modules, algebras, or Lie algebras, groupoids, or even categories (a map between categories is a functor iff its graph is again a category). (ADDED IN VIEW OF COMMENTS: further examples include affine spaces and ${G}$-sets (sets with an action of a given group ${G}$).) There are also various approximate versions of this theorem that are useful in arithmetic combinatorics, that relate the property of a map ${f}$ being an “approximate homomorphism” in some sense with its graph being an “approximate group” in some sense. This is particularly useful for this subfield of mathematics because there are currently more theorems about approximate groups than about approximate homomorphisms, so that one can profitably use closed graph theorems to transfer results about the former to results about the latter.

A slightly more sophisticated result in the same vein:

Theorem 5 (Closed graph theorem (point set topology)) Let ${X, Y}$ be compact Hausdorff spaces. Then a function ${f: X \rightarrow Y}$ is continuous if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is topologically closed.

Indeed, the “only if” direction is easy, while for the “if” direction, note that if ${\Sigma}$ is a closed subset of ${X \times Y}$, then it is compact Hausdorff, and the projection map from ${\Sigma}$ to ${X}$ is then a bijective continuous map between compact Hausdorff spaces, which is then closed, thus open, and hence a homeomorphism, giving the claim.

Note that the compactness hypothesis is necessary: for instance, the function ${f: {\bf R} \rightarrow {\bf R}}$ defined by ${f(x) := 1/x}$ for ${x \neq 0}$ and ${f(0)=0}$ for ${x=0}$ is a function which has a closed graph, but is discontinuous.

A similar result (but relying on a much deeper theorem) is available in algebraic geometry, as I learned after asking this MathOverflow question:

Theorem 6 (Closed graph theorem (algebraic geometry)) Let ${X, Y}$ be normal projective varieties over an algebraically closed field ${k}$ of characteristic zero. Then a function ${f: X \rightarrow Y}$ is a regular map if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is Zariski-closed.

Proof: (Sketch) For the only if direction, note that the map ${x \mapsto (x,f(x))}$ is a regular map from the projective variety ${X}$ to the projective variety ${X \times Y}$ and is thus a projective morphism, hence is proper. In particular, the image ${\Sigma}$ of ${X}$ under this map is Zariski-closed.

Conversely, if ${\Sigma}$ is Zariski-closed, then it is also a projective variety, and the projection ${(x,y) \mapsto x}$ is a projective morphism from ${\Sigma}$ to ${X}$, which is clearly quasi-finite; by the characteristic zero hypothesis, it is also separated. Applying (Grothendieck’s form of) Zariski’s main theorem, this projection is the composition of an open immersion and a finite map. As projective varieties are complete, the open immersion is an isomorphism, and so the projection from ${\Sigma}$ to ${X}$ is finite. Being injective and separable, the degree of this finite map must be one, and hence ${k(\Sigma)}$ and ${k(X)}$ are isomorphic, hence (by normality of ${X}$) ${k[\Sigma]}$ is contained in (the image of) ${k[X]}$, which makes the map from ${X}$ to ${\Sigma}$ regular, which makes ${f}$ regular. $\Box$

The counterexample of the map ${f: k \rightarrow k}$ given by ${f(x) := 1/x}$ for ${x \neq 0}$ and ${f(0) := 0}$ demonstrates why the projective hypothesis is necessary. The necessity of the normality condition (or more precisely, a weak normality condition) is demonstrated by (the projective version of) the map ${(t^2,t^3) \mapsto t}$ from the cusipdal curve ${\{ (t^2,t^3): t \in k \}}$ to ${k}$. (If one restricts attention to smooth varieties, though, normality becomes automatic.) The necessity of characteristic zero is demonstrated by (the projective version of) the inverse of the Frobenius map ${x \mapsto x^p}$ on a field ${k}$ of characteristic ${p}$.

There are also a number of closed graph theorems for topological groups, of which the following is typical (see Exercise 3 of these previous blog notes):

Theorem 7 (Closed graph theorem (topological group theory)) Let ${X, Y}$ be ${\sigma}$-compact, locally compact Hausdorff groups. Then a function ${X \rightarrow Y}$ is a continuous homomorphism if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is both group-theoretically closed and topologically closed.

The hypotheses of being ${\sigma}$-compact, locally compact, and Hausdorff can be relaxed somewhat, but I doubt that they can be eliminated entirely (though I do not have a ready counterexample for this).

In several complex variables, it is a classical theorem (see e.g. Lemma 4 of this blog post) that a holomorphic function from a domain in ${{\bf C}^n}$ to ${{\bf C}^n}$ is locally injective if and only if it is a local diffeomorphism (i.e. its derivative is everywhere non-singular). This leads to a closed graph theorem for complex manifolds:

Theorem 8 (Closed graph theorem (complex manifolds)) Let ${X, Y}$ be complex manifolds. Then a function ${f: X \rightarrow Y}$ is holomorphic if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is a complex manifold (using the complex structure inherited from ${X \times Y}$) of the same dimension as ${X}$.

Indeed, one applies the previous observation to the projection from ${\Sigma}$ to ${X}$. The dimension requirement is needed, as can be seen from the example of the map ${f: {\bf C} \rightarrow {\bf C}}$ defined by ${f(z) =1/z}$ for ${z \neq 0}$ and ${f(0)=0}$.

(ADDED LATER:) There is a real analogue to the above theorem:

Theorem 9 (Closed graph theorem (real manifolds)) Let ${X, Y}$ be real manifolds. Then a function ${f: X \rightarrow Y}$ is continuous if and only if the graph ${\Sigma := \{ (x,f(x)): x \in X \}}$ is a real manifold of the same dimension as ${X}$.

This theorem can be proven by applying invariance of domain (discussed in this previous post) to the projection of ${\Sigma}$ to ${X}$, to show that it is open if ${\Sigma}$ has the same dimension as ${X}$.

Note though that the analogous claim for smooth real manifolds fails: the function ${f: {\bf R} \rightarrow {\bf R}}$ defined by ${f(x) := x^{1/3}}$ has a smooth graph, but is not itself smooth.

(ADDED YET LATER:) Here is an easy closed graph theorem in the symplectic category:

Theorem 10 (Closed graph theorem (symplectic geometry)) Let ${X = (X,\omega_X)}$ and ${Y = (Y,\omega_Y)}$ be smooth symplectic manifolds of the same dimension. Then a smooth map ${f: X \rightarrow Y}$ is a symplectic morphism (i.e. ${f^* \omega_Y = \omega_X}$) if and only if the graph ${\Sigma := \{(x,f(x)): x \in X \}}$ is a Lagrangian submanifold of ${X \times Y}$ with the symplectic form ${\omega_X \oplus -\omega_Y}$.

In view of the symplectic rigidity phenomenon, it is likely that the smoothness hypotheses on ${f,X,Y}$ can be relaxed substantially, but I will not try to formulate such a result here.

There are presumably many further examples of closed graph theorems (or closely related theorems, such as criteria for inverting a morphism, or open mapping type theorems) throughout mathematics; I would be interested to know of further examples.

$\Box$

Let ${{\mathfrak g}}$ be a finite-dimensional Lie algebra (over the reals). Given two sufficiently small elements ${x, y}$ of ${{\mathfrak g}}$, define the right Baker-Campbell-Hausdorff-Dynkin law

$\displaystyle R_y(x) := x + \int_0^1 F_R( \hbox{Ad}_x \hbox{Ad}_{ty} ) y \ dt \ \ \ \ \ (1)$

where ${\hbox{Ad}_x := \exp(\hbox{ad}_x)}$, ${\hbox{ad}_x: {\mathfrak g} \rightarrow {\mathfrak g}}$ is the adjoint map ${\hbox{ad}_x(y) := [x,y]}$, and ${F_R}$ is the function ${F_R(z) := \frac{z \log z}{z-1}}$, which is analytic for ${z}$ near ${1}$. Similarly, define the left Baker-Campbell-Hausdorff-Dynkin law

$\displaystyle L_x(y) := y + \int_0^1 F_L( \hbox{Ad}_{tx} \hbox{Ad}_y ) x\ dt \ \ \ \ \ (2)$

where ${F_L(z) := \frac{\log z}{z-1}}$. One easily verifies that these expressions are well-defined (and depend smoothly on ${x}$ and ${y}$) when ${x}$ and ${y}$ are sufficiently small.

We have the famous Baker-Campbell-Hausdoff-Dynkin formula:

Theorem 1 (BCH formula) Let ${G}$ be a finite-dimensional Lie group over the reals with Lie algebra ${{\mathfrak g}}$. Let ${\log}$ be a local inverse of the exponential map ${\exp: {\mathfrak g} \rightarrow G}$, defined in a neighbourhood of the identity. Then for sufficiently small ${x, y \in {\mathfrak g}}$, one has

$\displaystyle \log( \exp(x) \exp(y) ) = R_y(x) = L_x(y).$

See for instance these notes of mine for a proof of this formula (it is for ${R_y}$, but one easily obtains a similar proof for ${L_x}$).

In particular, one can give a neighbourhood of the identity in ${{\mathfrak g}}$ the structure of a local Lie group by defining the group operation ${\ast}$ as

$\displaystyle x \ast y := R_y(x) = L_x(y) \ \ \ \ \ (3)$

for sufficiently small ${x, y}$, and the inverse operation by ${x^{-1} := -x}$ (one easily verifies that ${R_x(-x) = L_x(-x) = 0}$ for all small ${x}$).

It is tempting to reverse the BCH formula and conclude (the local form of) Lie’s third theorem, that every finite-dimensional Lie algebra is isomorphic to the Lie algebra of some local Lie group, by using (3) to define a smooth local group structure on a neighbourhood of the identity. (See this previous post for a definition of a local Lie group.) The main difficulty in doing so is in verifying that the definition (3) is well-defined (i.e. that ${R_y(x)}$ is always equal to ${L_x(y)}$) and locally associative. The well-definedness issue can be trivially disposed of by using just one of the expressions ${R_y(x)}$ or ${L_x(y)}$ as the definition of ${\ast}$ (though, as we shall see, it will be very convenient to use both of them simultaneously). However, the associativity is not obvious at all.

With the assistance of Ado’s theorem, which places ${{\mathfrak g}}$ inside the general linear Lie algebra ${\mathfrak{gl}_n({\bf R})}$ for some ${n}$, one can deduce both the well-definedness and associativity of (3) from the Baker-Campbell-Hausdorff formula for ${\mathfrak{gl}_n({\bf R})}$. However, Ado’s theorem is rather difficult to prove (see for instance this previous blog post for a proof), and it is natural to ask whether there is a way to establish these facts without Ado’s theorem.

After playing around with this for some time, I managed to extract a direct proof of well-definedness and local associativity of (3), giving a proof of Lie’s third theorem independent of Ado’s theorem. This is not a new result by any means, (indeed, the original proofs of Lie and Cartan of Lie’s third theorem did not use Ado’s theorem), but I found it an instructive exercise to work out the details, and so I am putting it up on this blog in case anyone else is interested (and also because I want to be able to find the argument again if I ever need it in the future).

Jordan’s theorem is a basic theorem in the theory of finite linear groups, and can be formulated as follows:

Theorem 1 (Jordan’s theorem) Let ${G}$ be a finite subgroup of the general linear group ${GL_d({\bf C})}$. Then there is an abelian subgroup ${G'}$ of ${G}$ of index ${[G:G'] \leq C_d}$, where ${C_d}$ depends only on ${d}$.

Informally, Jordan’s theorem asserts that finite linear groups over the complex numbers are almost abelian. The theorem can be extended to other fields of characteristic zero, and also to fields of positive characteristic so long as the characteristic does not divide the order of ${G}$, but we will not consider these generalisations here. A proof of this theorem can be found for instance in these lecture notes of mine.

I recently learned (from this comment of Kevin Ventullo) that the finiteness hypothesis on the group ${G}$ in this theorem can be relaxed to the significantly weaker condition of periodicity. Recall that a group ${G}$ is periodic if all elements are of finite order. Jordan’s theorem with “finite” replaced by “periodic” is known as the Jordan-Schur theorem.

The Jordan-Schur theorem can be quickly deduced from Jordan’s theorem, and the following result of Schur:

Theorem 2 (Schur’s theorem) Every finitely generated periodic subgroup of a general linear group ${GL_d({\bf C})}$ is finite. (Equivalently, every periodic linear group is locally finite.)

Remark 1 The question of whether all finitely generated periodic subgroups (not necessarily linear in nature) were finite was known as the Burnside problem; the answer was shown to be negative by Golod and Shafarevich in 1964.

Let us see how Jordan’s theorem and Schur’s theorem combine via a compactness argument to form the Jordan-Schur theorem. Let ${G}$ be a periodic subgroup of ${GL_d({\bf C})}$. Then for every finite subset ${S}$ of ${G}$, the group ${G_S}$ generated by ${S}$ is finite by Theorem 2. Applying Jordan’s theorem, ${G_S}$ contains an abelian subgroup ${G'_S}$ of index at most ${C_d}$.

In particular, given any finite number ${S_1,\ldots,S_m}$ of finite subsets of ${G}$, we can find abelian subgroups ${G'_{S_1},\ldots,G'_{S_m}}$ of ${G_{S_1},\ldots,G_{S_m}}$ respectively such that each ${G'_{S_j}}$ has index at most ${C_d}$ in ${G_{S_j}}$. We claim that we may furthermore impose the compatibility condition ${G'_{S_i} = G'_{S_j} \cap G_{S_i}}$ whenever ${S_i \subset S_j}$. To see this, we set ${S := S_1 \cup \ldots \cup S_m}$, locate an abelian subgroup ${G'_S}$ of ${G_S}$ of index at most ${C_d}$, and then set ${G'_{S_i} := G'_S \cap G_{S_i}}$. As ${G_S}$ is covered by at most ${C_d}$ cosets of ${G'_S}$, we see that ${G_{S_i}}$ is covered by at most ${C_d}$ cosets of ${G'_{S_i}}$, and the claim follows.

Note that for each ${S}$, the set of possible ${G'_S}$ is finite, and so the product space of all configurations ${(G'_S)_{S \subset G}}$, as ${S}$ ranges over finite subsets of ${G}$, is compact by Tychonoff’s theorem. Using the finite intersection property, we may thus locate a subgroup ${G'_S}$ of ${G_S}$ of index at most ${C_d}$ for all finite subsets ${S}$ of ${G}$, obeying the compatibility condition ${G'_T = G'_S \cap G_T}$ whenever ${T \subset S}$. If we then set ${G' := \bigcup_S G'_S}$, where ${S}$ ranges over all finite subsets of ${G}$, we then easily verify that ${G'}$ is abelian and has index at most ${C_d}$ in ${G}$, as required.

Below I record a proof of Schur’s theorem, which I extracted from this book of Wehrfritz. This was primarily an exercise for my own benefit, but perhaps it may be of interest to some other readers.

Let ${G}$ be a Lie group with Lie algebra ${{\mathfrak g}}$. As is well known, the exponential map ${\exp: {\mathfrak g} \rightarrow G}$ is a local homeomorphism near the identity. As such, the group law on ${G}$ can be locally pulled back to an operation ${*: U \times U \rightarrow {\mathfrak g}}$ defined on a neighbourhood ${U}$ of the identity in ${G}$, defined as

$\displaystyle x * y := \log( \exp(x) \exp(y) )$

where ${\log}$ is the local inverse of the exponential map. One can view ${*}$ as the group law expressed in local exponential coordinates around the origin.

An asymptotic expansion for ${x*y}$ is provided by the Baker-Campbell-Hausdorff (BCH) formula

$\displaystyle x*y = x+y+ \frac{1}{2} [x,y] + \frac{1}{12}[x,[x,y]] - \frac{1}{12}[y,[x,y]] + \ldots$

for all sufficiently small ${x,y}$, where ${[,]: {\mathfrak g} \times {\mathfrak g} \rightarrow {\mathfrak g}}$ is the Lie bracket. More explicitly, one has the Baker-Campbell-Hausdorff-Dynkin formula

$\displaystyle x * y = x + \int_0^1 F( \hbox{Ad}_x \hbox{Ad}_{ty} ) y\ dt \ \ \ \ \ (1)$

for all sufficiently small ${x,y}$, where ${\hbox{Ad}_x = \exp( \hbox{ad}_x )}$, ${\hbox{ad}_x: {\bf R}^d \rightarrow {\bf R}^d}$ is the adjoint representation ${\hbox{ad}_x(y) := [x,y]}$, and ${F}$ is the function

$\displaystyle F( t ) := \frac{t \log t}{t-1}$

which is real analytic near ${t=1}$ and can thus be applied to linear operators sufficiently close to the identity. One corollary of this is that the multiplication operation ${*}$ is real analytic in local coordinates, and so every smooth Lie group is in fact a real analytic Lie group.

It turns out that one does not need the full force of the smoothness hypothesis to obtain these conclusions. It is, for instance, a classical result that ${C^2}$ regularity of the group operations is already enough to obtain the Baker-Campbell-Hausdorff formula. Actually, it turns out that we can weaken this a bit, and show that even ${C^{1,1}}$ regularity (i.e. that the group operations are continuously differentiable, and the derivatives are locally Lipschitz) is enough to make the classical derivation of the Baker-Campbell-Hausdorff formula work. More precisely, we have

Theorem 1 (${C^{1,1}}$ Baker-Campbell-Hausdorff formula) Let ${{\bf R}^d}$ be a finite-dimensional vector space, and suppose one has a continuous operation ${*: U \times U \rightarrow {\bf R}^d}$ defined on a neighbourhood ${U}$ around the origin, which obeys the following three axioms:

• (Approximate additivity) For ${x,y}$ sufficiently close to the origin, one has

$\displaystyle x*y = x+y+O(|x| |y|). \ \ \ \ \ (2)$

(In particular, ${0*x=x*0=x}$ for ${x}$ sufficiently close to the origin.)

• (Associativity) For ${x,y,z}$ sufficiently close to the origin, ${(x*y)*z = x*(y*z)}$.
• (Radial homogeneity) For ${x}$ sufficiently close to the origin, one has

$\displaystyle (sx) * (tx) = (s+t)x \ \ \ \ \ (3)$

for all ${s,t \in [-1,1]}$. (In particular, ${x * (-x) = (-x) * x = 0}$ for all ${x}$ sufficiently close to the origin.)

Then ${*}$ is real analytic (and in particular, smooth) near the origin. (In particular, ${*}$ gives a neighbourhood of the origin the structure of a local Lie group.)

Indeed, we will recover the Baker-Campbell-Hausdorff-Dynkin formula (after defining ${\hbox{Ad}_x}$ appropriately) in this setting; see below the fold.

The reason that we call this a ${C^{1,1}}$ Baker-Campbell-Hausdorff formula is that if the group operation ${*}$ has ${C^{1,1}}$ regularity, and has ${0}$ as an identity element, then Taylor expansion already gives (2), and in exponential coordinates (which, as it turns out, can be defined without much difficulty in the ${C^{1,1}}$ category) one automatically has (3).

We will record the proof of Theorem 1 below the fold; it largely follows the classical derivation of the BCH formula, but due to the low regularity one will rely on tools such as telescoping series and Riemann sums rather than on the fundamental theorem of calculus. As an application of this theorem, we can give an alternate derivation of one of the components of the solution to Hilbert’s fifth problem, namely the construction of a Lie group structure from a Gleason metric, which was covered in the previous post; we discuss this at the end of this article. With this approach, one can avoid any appeal to von Neumann’s theorem and Cartan’s theorem (discussed in this post), or the Kuranishi-Gleason extension theorem (discussed in this post).

Recall that a (complex) abstract Lie algebra is a complex vector space ${{\mathfrak g}}$ (either finite or infinite dimensional) equipped with a bilinear antisymmetric form ${[]: {\mathfrak g} \times {\mathfrak g} \rightarrow {\mathfrak g}}$ that obeys the Jacobi identity

$\displaystyle [[X,Y],Z] + [[Y,Z],X] + [[Z,X],Y] = 0. \ \ \ \ \ (1)$

(One can of course define Lie algebras over other fields than the complex numbers ${{\bf C}}$, but in order to avoid some technical issues we shall work solely with the complex case in this post.)

An important special case of the abstract Lie algebras are the concrete Lie algebras, in which ${{\mathfrak g} \subset \hbox{End}(V)}$ is a vector space of linear transformations ${X: V \rightarrow V}$ on a vector space ${V}$ (which again can be either finite or infinite dimensional), and the bilinear form is given by the usual Lie bracket

$\displaystyle [X,Y] := XY-YX.$

It is easy to verify that every concrete Lie algebra is an abstract Lie algebra. In the converse direction, we have

Theorem 1 Every abstract Lie algebra is isomorphic to a concrete Lie algebra.

To prove this theorem, we introduce the useful algebraic tool of the universal enveloping algebra ${U({\mathfrak g})}$ of the abstract Lie algebra ${{\mathfrak g}}$. This is the free (associative, complex) algebra generated by ${{\mathfrak g}}$ (viewed as a complex vector space), subject to the constraints

$\displaystyle [X,Y] = XY - YX. \ \ \ \ \ (2)$

This algebra is described by the Poincaré-Birkhoff-Witt theorem, which asserts that given an ordered basis ${(X_i)_{i \in I}}$ of ${{\mathfrak g}}$ as a vector space, that a basis of ${U({\mathfrak g})}$ is given by “monomials” of the form

$\displaystyle X_{i_1}^{a_1} \ldots X_{i_m}^{a_m} \ \ \ \ \ (3)$

where ${m}$ is a natural number, the ${i_1 < \ldots < i_m}$ are an increasing sequence of indices in ${I}$, and the ${a_1,\ldots,a_m}$ are positive integers. Indeed, given two such monomials, one can express their product as a finite linear combination of further monomials of the form (3) after repeatedly applying (2) (which we rewrite as ${XY = YX + [X,Y]}$) to reorder the terms in this product modulo lower order terms until one all monomials have their indices in the required increasing order. It is then a routine exercise in basic abstract algebra (using all the axioms of an abstract Lie algebra) to verify that this is multiplication rule on monomials does indeed define a complex associative algebra which has the universal properties required of the universal enveloping algebra.

The abstract Lie algebra ${{\mathfrak g}}$ acts on its universal enveloping algebra ${U({\mathfrak g})}$ by left-multiplication: ${X: M \mapsto XM}$, thus giving a map from ${{\mathfrak g}}$ to ${\hbox{End}(U({\mathfrak g}))}$. It is easy to verify that this map is a Lie algebra homomorphism (so this is indeed an action (or representation) of the Lie algebra), and this action is clearly faithful (i.e. the map from ${{\mathfrak g}}$ to ${\hbox{End}(U{\mathfrak g})}$ is injective), since each element ${X}$ of ${{\mathfrak g}}$ maps the identity element ${1}$ of ${U({\mathfrak g})}$ to a different element of ${U({\mathfrak g})}$, namely ${X}$. Thus ${{\mathfrak g}}$ is isomorphic to its image in ${\hbox{End}(U({\mathfrak g}))}$, proving Theorem 1.

In the converse direction, every representation ${\rho: {\mathfrak g} \rightarrow \hbox{End}(V)}$ of a Lie algebra “factors through” the universal enveloping algebra, in that it extends to an algebra homomorphism from ${U({\mathfrak g})}$ to ${\hbox{End}(V)}$, which by abuse of notation we shall also call ${\rho}$.

One drawback of Theorem 1 is that the space ${U({\mathfrak g})}$ that the concrete Lie algebra acts on will almost always be infinite-dimensional, even when the original Lie algebra ${{\mathfrak g}}$ is finite-dimensional. However, there is a useful theorem of Ado that rectifies this:

Theorem 2 (Ado’s theorem) Every finite-dimensional abstract Lie algebra is isomorphic to a concrete Lie algebra over a finite-dimensional vector space ${V}$.

Among other things, this theorem can be used (in conjunction with the Baker-Campbell-Hausdorff formula) to show that every abstract (finite-dimensional) Lie group (or abstract local Lie group) is locally isomorphic to a linear group. (It is well-known, though, that abstract Lie groups are not necessarily globally isomorphic to a linear group, but we will not discuss these global obstructions here.)

Ado’s theorem is surprisingly tricky to prove in general, but some special cases are easy. For instance, one can try using the adjoint representation ${\hbox{ad}: {\mathfrak g} \rightarrow \hbox{End}({\mathfrak g})}$ of ${{\mathfrak g}}$ on itself, defined by the action ${X: Y \mapsto [X,Y]}$; the Jacobi identity (1) ensures that this indeed a representation of ${{\mathfrak g}}$. The kernel of this representation is the centre ${Z({\mathfrak g}) := \{ X \in {\mathfrak g}: [X,Y]=0 \hbox{ for all } Y \in {\mathfrak g}\}}$. This already gives Ado’s theorem in the case when ${{\mathfrak g}}$ is semisimple, in which case the center is trivial.

The adjoint representation does not suffice, by itself, to prove Ado’s theorem in the non-semisimple case. However, it does provide an important reduction in the proof, namely it reduces matters to showing that every finite-dimensional Lie algebra ${{\mathfrak g}}$ has a finite-dimensional representation ${\rho: {\mathfrak g} \rightarrow \hbox{End}(V)}$ which is faithful on the centre ${Z({\mathfrak g})}$. Indeed, if one has such a representation, one can then take the direct sum of that representation with the adjoint representation to obtain a new finite-dimensional representation which is now faithful on all of ${{\mathfrak g}}$, which then gives Ado’s theorem for ${{\mathfrak g}}$.

It remains to find a finite-dimensional representation of ${{\mathfrak g}}$ which is faithful on the centre ${Z({\mathfrak g})}$. In the case when ${{\mathfrak g}}$ is abelian, so that the centre ${Z({\mathfrak g})}$ is all of ${{\mathfrak g}}$, this is again easy, because ${{\mathfrak g}}$ then acts faithfully on ${{\mathfrak g} \times {\bf C}}$ by the infinitesimal shear maps ${X: (Y,t) \mapsto (tX, 0)}$. In matrix form, this representation identifies each ${X}$ in this abelian Lie algebra with an “upper-triangular” matrix:

$\displaystyle X \equiv \begin{pmatrix} 0 & X \\ 0 & 0 \end{pmatrix}.$

This construction gives a faithful finite-dimensional representation of the centre ${Z({\mathfrak g})}$ of any finite-dimensional Lie algebra. The standard proof of Ado’s theorem (which I believe dates back to work of Harish-Chandra) then proceeds by gradually “extending” this representation of the centre ${Z({\mathfrak g})}$ to larger and larger sub-algebras of ${{\mathfrak g}}$, while preserving the finite-dimensionality of the representation and the faithfulness on ${Z({\mathfrak g})}$, until one obtains a representation on the entire Lie algebra ${{\mathfrak g}}$ with the required properties. (For technical inductive reasons, one also needs to carry along an additional property of the representation, namely that it maps the nilradical to nilpotent elements, but we will discuss this technicality later.)

This procedure is a little tricky to execute in general, but becomes simpler in the nilpotent case, in which the lower central series ${{\mathfrak g}_1 := {\mathfrak g}; {\mathfrak g}_{n+1} := [{\mathfrak g}, {\mathfrak g}_n]}$ becomes trivial for sufficiently large ${n}$:

Theorem 3 (Ado’s theorem for nilpotent Lie algebras) Let ${{\mathfrak n}}$ be a finite-dimensional nilpotent Lie algebra. Then there exists a finite-dimensional faithful representation ${\rho: {\mathfrak n} \rightarrow \hbox{End}(V)}$ of ${{\mathfrak n}}$. Furthermore, there exists a natural number ${k}$ such that ${\rho({\mathfrak n})^k = \{0\}}$, i.e. one has ${\rho(X_1) \ldots \rho(X_k)=0}$ for all ${X_1,\ldots,X_k \in {\mathfrak n}}$.

The second conclusion of Ado’s theorem here is useful for induction purposes. (By Engel’s theorem, this conclusion is also equivalent to the assertion that every element of ${\rho({\mathfrak n})}$ is nilpotent, but we can prove Theorem 3 without explicitly invoking Engel’s theorem.)

Below the fold, I give a proof of Theorem 3, and then extend the argument to cover the full strength of Ado’s theorem. This is not a new argument – indeed, I am basing this particular presentation from the one in Fulton and Harris – but it was an instructive exercise for me to try to extract the proof of Ado’s theorem from the more general structural theory of Lie algebras (e.g. Engel’s theorem, Lie’s theorem, Levi decomposition, etc.) in which the result is usually placed. (However, the proof I know of still needs Engel’s theorem to establish the solvable case, and the Levi decomposition to then establish the general case.)

I’ve just uploaded to the arXiv my paper “Outliers in the spectrum of iid matrices with bounded rank perturbations“, submitted to Probability Theory and Related Fields. This paper is concerned with outliers to the circular law for iid random matrices. Recall that if ${X_n}$ is an ${n \times n}$ matrix whose entries are iid complex random variables with mean zero and variance one, then the ${n}$ complex eigenvalues of the normalised matrix ${\frac{1}{\sqrt{n}} X_n}$ will almost surely be distributed according to the circular law distribution ${\frac{1}{\pi} 1_{|z| \leq 1} d^2 z}$ in the limit ${n \rightarrow \infty}$. (See these lecture notes for further discussion of this law.)

The circular law is also stable under bounded rank perturbations: if ${C_n}$ is a deterministic rank ${O(1)}$ matrix of polynomial size (i.e. of operator norm ${O(n^{O(1)})}$), then the circular law also holds for ${\frac{1}{\sqrt{n}} X_n + C_n}$ (this is proven in a paper of myself, Van Vu, and Manjunath Krisnhapur). In particular, the bulk of the eigenvalues (i.e. ${(1-o(1)) n}$ of the ${n}$ eigenvalues) will lie inside the unit disk ${\{ z \in {\bf C}: |z| \leq 1 \}}$.

However, this leaves open the possibility for one or more outlier eigenvalues that lie significantly outside the unit disk; the arguments in the paper cited above give some upper bound on the number of such eigenvalues (of the form ${O(n^{1-c})}$ for some absolute constant ${c>0}$) but does not exclude them entirely. And indeed, numerical data shows that such outliers can exist for certain bounded rank perturbations.

In this paper, some results are given as to when outliers exist, and how they are distributed. The easiest case is of course when there is no bounded rank perturbation: ${C_n=0}$. In that case, an old result of Bai and Yin and of Geman shows that the spectral radius of ${\frac{1}{\sqrt{n}} X_n}$ is almost surely ${1+o(1)}$, thus all eigenvalues will be contained in a ${o(1)}$ neighbourhood of the unit disk, and so there are no significant outliers. The proof is based on the moment method.

Now we consider a bounded rank perturbation ${C_n}$ which is nonzero, but which has a bounded operator norm: ${\|C_n\|_{op} = O(1)}$. In this case, it turns out that the matrix ${\frac{1}{\sqrt{n}} X_n + C_n}$ will have outliers if the deterministic component ${C_n}$ has outliers. More specifically (and under the technical hypothesis that the entries of ${X_n}$ have bounded fourth moment), if ${\lambda}$ is an eigenvalue of ${C_n}$ with ${|\lambda| > 1}$, then (for ${n}$ large enough), ${\frac{1}{\sqrt{n}} X_n + C_n}$ will almost surely have an eigenvalue at ${\lambda+o(1)}$, and furthermore these will be the only outlier eigenvalues of ${\frac{1}{\sqrt{n}} X_n + C_n}$.

Thus, for instance, adding a bounded nilpotent low rank matrix to ${\frac{1}{\sqrt{n}} X_n}$ will not create any outliers, because the nilpotent matrix only has eigenvalues at zero. On the other hand, adding a bounded Hermitian low rank matrix will create outliers as soon as this matrix has an operator norm greater than ${1}$.

When I first thought about this problem (which was communicated to me by Larry Abbott), I believed that it was quite difficult, because I knew that the eigenvalues of non-Hermitian matrices were quite unstable with respect to general perturbations (as discussed in this previous blog post), and that there were no interlacing inequalities in this case to control bounded rank perturbations (as discussed in this post). However, as it turns out I had arrived at the wrong conclusion, especially in the exterior of the unit disk in which the resolvent is actually well controlled and so there is no pseudospectrum present to cause instability. This was pointed out to me by Alice Guionnet at an AIM workshop last week, after I had posed the above question during an open problems session. Furthermore, at the same workshop, Percy Deift emphasised the point that the basic determinantal identity

$\displaystyle \det(1 + AB) = \det(1 + BA) \ \ \ \ \ (1)$

for ${n \times k}$ matrices ${A}$ and ${k \times n}$ matrices ${B}$ was a particularly useful identity in random matrix theory, as it converted problems about large (${n \times n}$) matrices into problems about small (${k \times k}$) matrices, which was particularly convenient in the regime when ${n \rightarrow \infty}$ and ${k}$ was fixed. (Percy was speaking in the context of invariant ensembles, but the point is in fact more general than this.)

From this, it turned out to be a relatively simple manner to transform what appeared to be an intractable ${n \times n}$ matrix problem into quite a well-behaved ${k \times k}$ matrix problem for bounded ${k}$. Specifically, suppose that ${C_n}$ had rank ${k}$, so that one can factor ${C_n = A_n B_n}$ for some (deterministic) ${n \times k}$ matrix ${A_n}$ and ${k \times n}$ matrix ${B_n}$. To find an eigenvalue ${z}$ of ${\frac{1}{\sqrt{n}} X_n + C_n}$, one has to solve the characteristic polynomial equation

$\displaystyle \det( \frac{1}{\sqrt{n}} X_n + A_n B_n - z ) = 0.$

This is an ${n \times n}$ determinantal equation, which looks difficult to control analytically. But we can manipulate it using (1). If we make the assumption that ${z}$ is outside the spectrum of ${\frac{1}{\sqrt{n}} X_n}$ (which we can do as long as ${z}$ is well away from the unit disk, as the unperturbed matrix ${\frac{1}{\sqrt{n}} X_n}$ has no outliers), we can divide by ${\frac{1}{\sqrt{n}} X_n - z}$ to arrive at

$\displaystyle \det( 1 + (\frac{1}{\sqrt{n}} X_n-z)^{-1} A_n B_n ) = 0.$

Now we apply the crucial identity (1) to rearrange this as

$\displaystyle \det( 1 + B_n (\frac{1}{\sqrt{n}} X_n-z)^{-1} A_n ) = 0.$

The crucial point is that this is now an equation involving only a ${k \times k}$ determinant, rather than an ${n \times n}$ one, and is thus much easier to solve. The situation is particularly simple for rank one perturbations

$\displaystyle \frac{1}{\sqrt{n}} X_n + u_n v_n^*$

in which case the eigenvalue equation is now just a scalar equation

$\displaystyle 1 + \langle (\frac{1}{\sqrt{n}} X_n-z)^{-1} u_n, v_n \rangle = 0$

that involves what is basically a single coefficient of the resolvent ${(\frac{1}{\sqrt{n}} X_n-z)^{-1}}$. (It is also an instructive exercise to derive this eigenvalue equation directly, rather than through (1).) There is by now a very well-developed theory for how to control such coefficients (particularly for ${z}$ in the exterior of the unit disk, in which case such basic tools as Neumann series work just fine); in particular, one has precise enough control on these coefficients to obtain the result on outliers mentioned above.

The same method can handle some other bounded rank perturbations. One basic example comes from looking at iid matrices with a non-zero mean ${\mu}$ and variance ${1}$; this can be modeled by ${\frac{1}{\sqrt{n}} X_n + \mu \sqrt{n} \phi_n \phi_n^*}$ where ${\phi_n}$ is the unit vector ${\phi_n := \frac{1}{\sqrt{n}} (1,\ldots,1)^*}$. Here, the bounded rank perturbation ${\mu \sqrt{n} \phi_n \phi_n^*}$ has a large operator norm (equal to ${|\mu| \sqrt{n}}$), so the previous result does not directly apply. Nevertheless, the self-adjoint nature of the perturbation has a stabilising effect, and I was able to show that there is still only one outlier, and that it is at the expected location of ${\mu \sqrt{n}+o(1)}$.

If one moves away from the case of self-adjoint perturbations, though, the situation changes. Let us now consider a matrix of the form ${\frac{1}{\sqrt{n}} X_n + \mu \sqrt{n} \phi_n \psi_n^*}$, where ${\psi_n}$ is a randomised version of ${\phi_n}$, e.g. ${\psi_n := \frac{1}{\sqrt{n}} (\pm 1, \ldots, \pm 1)^*}$, where the ${\pm 1}$ are iid Bernoulli signs; such models were proposed recently by Rajan and Abbott as a model for neural networks in which some nodes are excitatory (and give columns with positive mean) and some are inhibitory (leading to columns with negative mean). Despite the superficial similarity with the previous example, the outlier behaviour is now quite different. Instead of having one extremely large outlier (of size ${\sim\sqrt{n}}$) at an essentially deterministic location, we now have a number of eigenvalues of size ${O(1)}$, scattered according to a random process. Indeed, (in the case when the entries of ${X_n}$ were real and bounded) I was able to show that the outlier point process converged (in the sense of converging ${k}$-point correlation functions) to the zeroes of a random Laurent series

$\displaystyle g(z) = 1 - \mu \sum_{j=0}^\infty \frac{g_j}{z^{j+1}}$

where ${g_0,g_1,g_2,\ldots \equiv N(0,1)}$ are iid real Gaussians. This is basically because the coefficients of the resolvent ${(\frac{1}{\sqrt{n}} X_n - zI)^{-1}}$ have a Neumann series whose coefficients enjoy a central limit theorem.

On the other hand, as already observed numerically (and rigorously, in the gaussian case) by Rajan and Abbott, if one projects such matrices to have row sum zero, then the outliers all disappear. This can be explained by another appeal to (1); this projection amounts to right-multiplying ${\frac{1}{\sqrt{n}} X_n + \mu \sqrt{n} \phi_n \psi_n^*}$ by the projection matrix ${P}$ to the zero-sum vectors. But by (1), the non-zero eigenvalues of the resulting matrix ${(\frac{1}{\sqrt{n}} X_n + \mu \sqrt{n} \phi_n \psi_n^*)P}$ are the same as those for ${P (\frac{1}{\sqrt{n}} X_n + \mu \sqrt{n} \phi_n \psi_n^*)}$. Since ${P}$ annihilates ${\phi_n}$, we thus see that in this case the bounded rank perturbation plays no role, and the question reduces to obtaining a circular law with no outliers for ${P \frac{1}{\sqrt{n}} X_n}$. As it turns out, this can be done by invoking the machinery of Van Vu and myself that we used to prove the circular law for various random matrix models.

In the previous lectures, we have focused mostly on the equidistribution or linear patterns on a subset of the integers ${{\bf Z}}$, and in particular on intervals ${[N]}$. The integers are of course a very important domain to study in additive combinatorics; but there are also other fundamental model examples of domains to study. One of these is that of a vector space ${V}$ over a finite field ${{\bf F} = {\bf F}_p}$ of prime order. Such domains are of interest in computer science (particularly when ${p=2}$) and also in number theory; but they also serve as an important simplified “dyadic model” for the integers. See this survey article of Green for further discussion of this point.

The additive combinatorics of the integers ${{\bf Z}}$, and of vector spaces ${V}$ over finite fields, are analogous, but not quite identical. For instance, the analogue of an arithmetic progression in ${{\bf Z}}$ is a subspace of ${V}$. In many cases, the finite field theory is a little bit simpler than the integer theory; for instance, subspaces are closed under addition, whereas arithmetic progressions are only “almost” closed under addition in various senses. (For instance, ${[N]}$ is closed under addition approximately half of the time.) However, there are some ways in which the integers are better behaved. For instance, because the integers can be generated by a single generator, a homomorphism from ${{\bf Z}}$ to some other group ${G}$ can be described by a single group element ${g}$: ${n \mapsto g^n}$. However, to specify a homomorphism from a vector space ${V}$ to ${G}$ one would need to specify one group element for each dimension of ${V}$. Thus we see that there is a tradeoff when passing from ${{\bf Z}}$ (or ${[N]}$) to a vector space model; one gains a bounded torsion property, at the expense of conceding the bounded generation property. (Of course, if one wants to deal with arbitrarily large domains, one has to concede one or the other; the only additive groups that have both bounded torsion and boundedly many generators, are bounded.)

The starting point for this course (Notes 1) was the study of equidistribution of polynomials ${P: {\bf Z} \rightarrow {\bf R}/{\bf Z}}$ from the integers to the unit circle. We now turn to the parallel theory of equidistribution of polynomials ${P: V \rightarrow {\bf R}/{\bf Z}}$ from vector spaces over finite fields to the unit circle. Actually, for simplicity we will mostly focus on the classical case, when the polynomials in fact take values in the ${p^{th}}$ roots of unity (where ${p}$ is the characteristic of the field ${{\bf F} = {\bf F}_p}$). As it turns out, the non-classical case is also of importance (particularly in low characteristic), but the theory is more difficult; see these notes for some further discussion.

Our study of random matrices, to date, has focused on somewhat general ensembles, such as iid random matrices or Wigner random matrices, in which the distribution of the individual entries of the matrices was essentially arbitrary (as long as certain moments, such as the mean and variance, were normalised). In these notes, we now focus on two much more special, and much more symmetric, ensembles:

• The Gaussian Unitary Ensemble (GUE), which is an ensemble of random ${n \times n}$ Hermitian matrices ${M_n}$ in which the upper-triangular entries are iid with distribution ${N(0,1)_{\bf C}}$, and the diagonal entries are iid with distribution ${N(0,1)_{\bf R}}$, and independent of the upper-triangular ones; and
• The Gaussian random matrix ensemble, which is an ensemble of random ${n \times n}$ (non-Hermitian) matrices ${M_n}$ whose entries are iid with distribution ${N(0,1)_{\bf C}}$.

The symmetric nature of these ensembles will allow us to compute the spectral distribution by exact algebraic means, revealing a surprising connection with orthogonal polynomials and with determinantal processes. This will, for instance, recover the semi-circular law for GUE, but will also reveal fine spacing information, such as the distribution of the gap between adjacent eigenvalues, which is largely out of reach of tools such as the Stieltjes transform method and the moment method (although the moment method, with some effort, is able to control the extreme edges of the spectrum).

Similarly, we will see for the first time the circular law for eigenvalues of non-Hermitian matrices.

There are a number of other highly symmetric ensembles which can also be treated by the same methods, most notably the Gaussian Orthogonal Ensemble (GOE) and the Gaussian Symplectic Ensemble (GSE). However, for simplicity we shall focus just on the above two ensembles. For a systematic treatment of these ensembles, see the text by Deift.