This is my second Milliman lecture, in which I talk about recent applications of ideas from additive combinatorics (and in particular, from the inverse Littlewood-Offord problem) to the theory of discrete random matrices.
In many areas of physics, chemistry, and computer science, one often has to study large matrices A, which for the sake of discussion we shall take to be square; thus $A = (a_{ij})_{1 \leq i,j \leq n}$ is an $n \times n$ matrix for some large integer n, and the $a_{ij}$ are real or complex numbers. In some cases A will be very structured (e.g. self-adjoint, upper triangular, sparse, circulant, low rank, etc.) but in other cases one expects A to be so complicated that there is no usable structure to exploit in the matrix. In such cases, one often makes the non-rigorous (but surprisingly accurate, in practice) assumption that A behaves like a random matrix, whose entries are drawn independently and identically from a single probability distribution (which could be continuous, discrete, or a combination of both). Each choice of distribution determines a different random matrix ensemble. Two particularly fundamental examples of continuous ensembles are:
- The real Gaussian random matrix ensemble, in which each $a_{ij}$ is distributed independently according to the standard normal distribution $N(0,1)$. This ensemble has the remarkable feature of being invariant under the orthogonal group O(n); if R is a rotation or reflection matrix in O(n), and A is distributed as a real Gaussian random matrix, then RA and AR are also distributed as real Gaussian random matrices. (This is ultimately because the product of n Gaussian measures on $\mathbb{R}$ is a Gaussian measure on $\mathbb{R}^n$, which is manifestly rotation-invariant.)
- The complex Gaussian random matrix ensemble, in which the $a_{ij}$ are distributed independently according to a complex normal distribution with mean zero and unit variance. This ensemble has the feature of being invariant under the unitary group U(n).
Two particularly fundamental examples of discrete ensembles are:
- The Bernoulli ensemble, in which each $a_{ij}$ is distributed independently and uniformly in the set $\{-1,+1\}$, thus A is a random matrix of signs.
- The lazy (or sparse) Bernoulli ensemble, in which each $a_{ij}$ is independently equal to -1 or +1 with probability p/2 each, and equal to 0 with probability 1-p, for some fixed $0 \leq p \leq 1$, thus A is a sparse matrix of signs of expected density p.
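To make the two discrete ensembles concrete, here is a minimal sketch of how one might sample them using only the Python standard library. The function names (`bernoulli_matrix`, `lazy_bernoulli_matrix`, `density`) and the fixed seed are illustrative assumptions, not anything from the lecture.

```python
import random

def bernoulli_matrix(n, rng):
    """n x n matrix whose entries are independent uniform random signs."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]

def lazy_bernoulli_matrix(n, p, rng):
    """Each entry is 0 with probability 1-p, otherwise a uniform random sign."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return [[rng.choice((-1, 1)) if rng.random() < p else 0
             for _ in range(n)] for _ in range(n)]

def density(matrix):
    """Fraction of non-zero entries of the matrix."""
    entries = [a for row in matrix for a in row]
    return sum(1 for a in entries if a != 0) / len(entries)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
A = bernoulli_matrix(50, rng)
B = lazy_bernoulli_matrix(50, 0.1, rng)
print(density(A))  # every entry of a Bernoulli matrix is non-zero
print(density(B))  # should be in the vicinity of p = 0.1
```

The observed density of the lazy matrix fluctuates around p from sample to sample; only its expectation equals p.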
The Bernoulli and sparse Bernoulli ensembles arise naturally in computer science and numerical analysis, as they form a simple model for simulating the effect of numerical roundoff error (or other types of digital error) on a large matrix. Continuous ensembles such as the Gaussian ensembles, in contrast, are natural models for matrices in the analog world, and in particular in physics and chemistry. [For reasons that are still somewhat mysterious, these ensembles, or more precisely their self-adjoint counterparts, also seem to be good models for various important statistics in number theory, such as the statistics of zeroes of the Riemann zeta function, but this is not the topic of my discussion here.] There are of course many other possible ensembles of interest that one could consider, but I will stick to the Gaussian and Bernoulli ensembles here for simplicity.
If A is drawn randomly from one of the above matrix ensembles, then we have a very explicit understanding of how each of the coefficients of the matrix A behaves. But in practice, we want to study more “global” properties of the matrix A which involve rather complicated interactions of all the coefficients together. For instance, we could be interested in the following (closely related) questions:
- Dynamics. Given a typical vector $x \in \mathbb{R}^n$, what happens to the iterates $Ax, A^2 x, \ldots, A^m x$ in the limit $m \to \infty$?
- Expansion and contraction. Given a non-zero vector x, how does the norm $\|Ax\|$ of Ax compare with the norm $\|x\|$ of x? What is the largest ratio $\|Ax\|/\|x\|$? The smallest ratio? The average ratio?
- Invertibility. Is the equation $Ax = b$ solvable for every vector b? Do small fluctuations in b always cause small fluctuations in x, or can they cause large fluctuations? (In other words, is the invertibility problem stable?)
As any student of linear algebra knows, these questions can be answered satisfactorily if one knows the eigenvalues $\lambda_1, \ldots, \lambda_n \in \mathbb{C}$ (counting multiplicity) and singular values $\sigma_1 \geq \ldots \geq \sigma_n \geq 0$ of the matrix A. (As the matrix A is not self-adjoint, the eigenvalues can be complex-valued even if the coefficients of A are real-valued; however, the singular values are always non-negative reals, because $AA^*$ is self-adjoint and positive semi-definite.) For instance:
- The largest eigenvalue magnitude $\sup_{1 \leq k \leq n} |\lambda_k|$ determines the maximal rate of exponential growth or decay of $A^m x$ (ignoring polynomial growth corrections coming from repeated eigenvalues), while the smallest eigenvalue magnitude $\inf_{1 \leq k \leq n} |\lambda_k|$ determines the minimal rate of exponential growth.
- The ratio $\|Ax\|/\|x\|$ has a maximal value of $\sigma_1$, a minimal value of $\sigma_n$, and a root mean square value of $(\frac{1}{n} \sum_{k=1}^n \sigma_k^2)^{1/2}$ if the orientation of x is selected uniformly at random.
- The matrix A is invertible if and only if all eigenvalues are non-zero, or equivalently if $\sigma_n$ is positive.
- The stability of the invertibility problem is controlled by the condition number $\sigma_1/\sigma_n$.
So, one of the fundamental problems in the theory of random matrices is to understand how the eigenvalues and singular values of a random matrix A are distributed. (More generally, it is of interest to study the eigenvalues and singular values of A+B, where A is drawn from a standard random matrix ensemble, and B is a fixed deterministic matrix, but for simplicity we will not discuss this case here.) But how does one get a handle on these numbers?
The direct approach of working with the characteristic equation $\det(A - \lambda I) = 0$ (or $\det(AA^* - \sigma^2 I) = 0$) looks very unpromising; one is asking to find the roots of a large degree polynomial, most of whose coefficients depend in a hopelessly complicated way on the coefficients of A.
In the special cases of the real and complex Gaussian ensembles, there is a massive amount of algebraic structure coming from the action of O(n) and U(n) that allows one to explicitly compute various multidimensional integrals, and this approach actually works! One gets a very explicit and useful formula for the joint eigenvalue distribution (first worked out by Ginibre, I believe) this way. But for more general ensembles, such as the Bernoulli ensemble, such algebraic structure is not present, and so it is unlikely that any useful explicit formula for the joint eigenvalue distribution exists. However, one can still obtain a lot of useful information if, instead of trying to locate each eigenvalue or singular value directly, one instead tries to compute various special averages (e.g. moments) of these eigenvalues or singular values. For instance, from undergraduate linear algebra we have the fundamental formulae
$$\operatorname{tr}(A) = \sum_{k=1}^n \lambda_k, \qquad \det(A) = \prod_{k=1}^n \lambda_k,$$
and more generally
$$\operatorname{tr}(A^m) = \sum_{k=1}^n \lambda_k^m, \qquad (1)$$
and similarly for the singular values, we have
$$\operatorname{tr}((AA^*)^m) = \sum_{k=1}^n \sigma_k^{2m}, \qquad \det(AA^*) = \prod_{k=1}^n \sigma_k^2.$$
So, if one can easily compute traces and determinants of the matrix A (and various other matrices related to A), then one can in principle get quite a bit of control on the eigenvalues and singular values. It is also worth noting that the eigenvalues and singular values are related to each other in several ways; for instance, we have the identity
$$\prod_{k=1}^n |\lambda_k| = \prod_{k=1}^n \sigma_k \qquad (2)$$
(which comes from comparing the determinants of A and $AA^*$), the inequality
$$\sigma_n \leq \inf_{1 \leq k \leq n} |\lambda_k| \leq \sup_{1 \leq k \leq n} |\lambda_k| \leq \sigma_1 \qquad (3)$$
(which comes from looking at the ratio $\|Ax\|/\|x\|$ when x is an eigenvector), and the inequality
$$\sum_{k=1}^n |\lambda_k|^2 \leq \sum_{k=1}^n \sigma_k^2. \qquad (4)$$
Let’s give some simple examples of this approach. If we take A to be drawn from the Gaussian or Bernoulli ensemble, then the trace of A has expectation zero, and so we know that the sum $\sum_{k=1}^n \lambda_k$ has expectation zero also. (Actually, this is easy to see for symmetry reasons: A has the same distribution as -A, and so the distribution of eigenvalues also has a symmetry around the origin.) The trace of $AA^*$, by contrast, is the sum of the squared magnitudes of all the matrix coefficients, and will be close to $n^2$ (for the Bernoulli ensemble, it is exactly $n^2$); thus we see that $\sum_{k=1}^n \sigma_k^2 \approx n^2$, and so by (4) we have $\sum_{k=1}^n |\lambda_k|^2 = O(n^2)$. So we see that the eigenvalues and singular values should be about $O(\sqrt{n})$ on the average. By working a little harder (e.g. by playing with very high moments of $AA^*$) one can show that the largest singular value is also going to be $O(\sqrt{n})$ with high probability, which then implies by (3) that all eigenvalues and singular values will be $O(\sqrt{n})$. Unfortunately, this approach does not seem to yield much information on the least singular value, which plays a major role in the invertibility and stability of A.
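The trace computation above is easy to check numerically. The following is a rough sketch, using only the standard library, that confirms $\operatorname{tr}(AA^*) = n^2$ for a sampled sign matrix and estimates the largest singular value by power iteration on $A^T A$. Power iteration is a technique swapped in here for illustration (the lecture refers to high moments instead), and all function names are assumptions of this sketch.

```python
import math
import random

def bernoulli_matrix(n, rng):
    """n x n matrix of independent uniform random signs."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(n)]

def trace_gram(A):
    """tr(A A^T), which equals the sum of the squares of all entries."""
    return sum(a * a for row in A for a in row)

def matvec(A, x):
    """Matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def largest_singular_value(A, iterations=200, seed=1):
    """Estimate sigma_1 from below by power iteration on A^T A."""
    rng = random.Random(seed)
    At = [list(col) for col in zip(*A)]  # transpose of A
    x = [rng.gauss(0.0, 1.0) for _ in A]
    for _ in range(iterations):
        y = matvec(At, matvec(A, x))
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # x is now a unit vector, so |Ax| is a lower bound for sigma_1
    return math.sqrt(sum(v * v for v in matvec(A, x)))

n = 40
A = bernoulli_matrix(n, random.Random(0))
print(trace_gram(A) == n * n)                        # True for any sign matrix
print(largest_singular_value(A) / math.sqrt(n))      # a bounded multiple of 1
```

Since $\sigma_1^2$ is at least the average $\frac{1}{n}\sum_k \sigma_k^2 = n$ and at most the sum $n^2$, the estimate must land between $\sqrt{n}$ and $n$ once the iteration has made progress; the interesting empirical observation is that it stays within a small constant multiple of $\sqrt{n}$.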
It is now natural to normalise the eigenvalues and singular values of A by $1/\sqrt{n}$, and consider the distribution of the set of normalised eigenvalues $\{\frac{1}{\sqrt{n}} \lambda_k: 1 \leq k \leq n\}$. If one plots these normalised eigenvalues numerically in the complex plane for moderately large n (e.g. n=100), one sees a remarkable distribution; the eigenvalues appear to be uniformly distributed in the unit disk $D := \{z \in \mathbb{C}: |z| \leq 1\}$. (For small n, there is a little bit of a clustering on the real line, just because polynomials with real coefficients tend to have a couple of real zeroes, but this clustering goes away in the limit as n goes to infinity.) This phenomenon is known as the circular law; more precisely, if we let n tend to infinity, then for every sufficiently nice set R in the plane (e.g. one could take R to be a rectangle), one has
$$\lim_{n \to \infty} \frac{1}{n} \#\{1 \leq k \leq n: \tfrac{1}{\sqrt{n}} \lambda_k \in R\} = \frac{1}{\pi} |R \cap D| \qquad (5)$$
almost surely, where $|R \cap D|$ denotes the area of $R \cap D$.
(Technically, this formulation is known as the strong circular law; there is also a weak circular law, which asserts that one has convergence in probability rather than almost sure convergence. But for this talk I will ignore these distinctions.)
The circular law was first proven in the case of the complex Gaussian ensemble by Mehta in 1967, using an explicit formula for the joint distribution of the eigenvalues. But for more general ensembles, in which explicit formulae were not available, progress was more difficult. The method of moments (in which one uses (1) to compute the sums of powers of the eigenvalues) is not very useful because of the cancellations caused by the complex nature of the eigenvalues; indeed, one can show that $\frac{1}{n} \operatorname{tr}((\frac{1}{\sqrt{n}} A)^m)$ is roughly zero for every $m \geq 1$, which is consistent with the circular law but also does not preclude, for instance, all the eigenvalues clustering at the origin. [For random self-adjoint matrices, the moment method works quite well, leading for instance to Wigner’s semi-circular law.]
The first breakthrough was by Girko in 1984, who observed that the eigenvalue distribution could be recovered from the quantities
$$\frac{1}{n} \log \left|\det\left(\tfrac{1}{\sqrt{n}} A - zI\right)\right| = \frac{1}{n} \sum_{k=1}^n \log \left|\tfrac{1}{\sqrt{n}} \lambda_k - z\right|$$
for complex z (this expression is closely related to the Stieltjes transform $\frac{1}{n} \sum_{k=1}^n (\frac{1}{\sqrt{n}} \lambda_k - z)^{-1}$ of the normalised eigenvalue distribution of A, which is, up to a constant factor, its derivative in z). To compute this quantity, Girko then used the formula (2) to relate the determinant of $\frac{1}{\sqrt{n}} A - zI$ with the singular values of this matrix. The singular value distribution could then be computed by the moment method (note that singular values, unlike eigenvalues, are real and non-negative, and so we do not have cancellation problems). Putting this all together and doing a large number of algebraic computations, one eventually obtains (formally, at least) a proof of the circular law.
There was however a technical difficulty with the above analysis, which was that the formula (2) becomes very unstable when the least singular value is close to zero (basically because of a division by zero problem). This is not merely a technical issue but is fundamental to the general problem of controlling eigenvalues of non-self-adjoint matrices A: these eigenvalues can become very unstable near the pseudospectrum, which can be defined as the set of complex numbers z for which the least singular value of $A - zI$ is small. The classic demonstration of this comes from the perturbed shift matrices
$$A_\epsilon := \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \\ \epsilon & 0 & 0 & \ldots & 0 \end{pmatrix}.$$
For the sake of discussion let us take n to be even. When $\epsilon = 0$, this matrix is singular, with least singular value $\sigma_n = 0$ and with all n generalised eigenvalues equal to 0. But when $\epsilon$ becomes positive, the least singular value creeps up only to $\epsilon$, while the n eigenvalues move rapidly away from the origin, becoming $\epsilon^{1/n} e^{2\pi i j/n}$ for $j = 1, \ldots, n$. This is ultimately because the zero set of the characteristic polynomial $\lambda^n - \epsilon$ is very sensitive to the value of $\epsilon$ when that parameter is close to zero.
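The instability of the perturbed shift matrix can be seen with a few lines of arithmetic. The sketch below assumes the matrix has ones on the superdiagonal and a single entry $\epsilon$ in the bottom-left corner (the usual form of this example; it is an assumption of the sketch). It checks that the n-th power of the matrix is $\epsilon I$, so every eigenvalue is an n-th root of $\epsilon$, and reads the singular values off the diagonal matrix $AA^T$. All function names are illustrative.

```python
import cmath

def perturbed_shift(n, eps):
    """Ones on the superdiagonal, eps in the bottom-left corner (assumed form)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i + 1] = 1.0
    A[n - 1][0] = eps
    return A

def matmul(A, B):
    """Product of two square list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def matrix_power(A, m):
    """A raised to the m-th power by repeated multiplication."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = matmul(result, A)
    return result

def predicted_eigenvalues(n, eps):
    """The n-th roots of eps, namely eps^(1/n) * exp(2 pi i j / n)."""
    radius = eps ** (1.0 / n)
    return [radius * cmath.exp(2j * cmath.pi * j / n) for j in range(1, n + 1)]

def squared_singular_values(A):
    """Diagonal of A A^T; the rows here are orthogonal, so A A^T is diagonal."""
    return [sum(a * a for a in row) for row in A]

n, eps = 10, 1e-10
A = perturbed_shift(n, eps)
print(matrix_power(A, n)[0][0])                  # A^n = eps * I
print(min(squared_singular_values(A)) ** 0.5)    # least singular value, about eps
print(abs(predicted_eigenvalues(n, eps)[0]))     # eigenvalue magnitude, about 0.1
```

With these parameters the least singular value is of size $10^{-10}$ while every eigenvalue has magnitude about $0.1$, which is the disparity the paragraph above describes.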
So, in order to make the circular law argument complete, one needs to get good lower bounds on the least singular value of the random matrix A (as well as variants of this matrix, such as $\frac{1}{\sqrt{n}} A - zI$). In the case of continuous (non-Gaussian) ensembles, this was first done by Bai in 1997. To illustrate the basic idea, let us look at a toy problem, to show that the least singular value of a real Gaussian matrix A is usually non-zero (i.e. A is invertible with high probability). For this, we use some linear algebra. Let $X_1, \ldots, X_n$ denote the rows of A, which we can view as vectors in $\mathbb{R}^n$. Then the least singular value of A vanishes precisely when $X_1, \ldots, X_n$ lie on a hyperplane. This implies that one of the vectors here is a linear combination of the other n-1; by symmetry, we conclude that the probability that the least singular value vanishes is bounded by n times the probability that (say) $X_n$ is a linear combination of $X_1, \ldots, X_{n-1}$. But $X_1, \ldots, X_{n-1}$ span a hyperplane at best; and $X_n$ has a continuous distribution and so has only a zero probability of lying in the hyperplane. Thus the least singular value vanishes with probability zero. It turns out that this argument is robust enough to also show that the least singular value not only avoids zero, but in fact keeps a certain distance away from it; for instance, it is not hard to show with this method that the least singular value is at least $n^{-C}$ for a suitable fixed constant C, with probability tending to 1 as n goes to infinity, basically because each row vector is not only not a linear combination of the other rows, but in fact keeps a certain distance away from the space spanned by the other rows, with high probability.
For discrete random matrices, one runs into a new difficulty: a row vector such as $X_n$ is no longer continuously distributed, and so can in fact concentrate on a hyperplane with positive probability. For instance, in the Bernoulli case, a row vector X is just a random corner of the discrete cube $\{-1,+1\}^n$. There are certain hyperplanes which X can visit quite frequently; for instance, X will have a probability 1/2 of lying in the hyperplane $\{(x_1, \ldots, x_n): x_1 = x_2\}$. In particular, there is a probability $2^{-n}$ that all rows lie in this hyperplane, which would cause the Bernoulli matrix A to be non-invertible. (In terms of A itself, what is going on is that there is a $2^{-n}$ chance that the first column and second column coincide, which will of course destroy invertibility.) In particular, A now has a non-zero chance of being singular. [It is in fact conjectured that the singularity probability is close to this value, or more precisely equal to $(\frac{1}{2}+o(1))^n$. The best known upper bound currently for this probability is $(\frac{1}{\sqrt{2}}+o(1))^n$, due to Bourgain, building upon earlier work of Kahn-Komlós-Szemerédi, and Vu and myself.]
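For very small n one can compute the singularity probability of a Bernoulli matrix exactly by brute force, which gives a feel for how it compares with the $2^{-n}$ contribution from two coinciding columns. This is only a sketch: it enumerates all $2^{n^2}$ sign matrices, so it is usable for n up to 3 or 4 at most, and the function names are assumptions of the sketch.

```python
from fractions import Fraction
from itertools import product

def determinant(M):
    """Exact integer determinant by Laplace expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, a in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * a * determinant(minor)
    return total

def singularity_probability(n):
    """Exact probability that an n x n matrix of uniform random signs is singular."""
    singular = 0
    total = 0
    for entries in product((-1, 1), repeat=n * n):
        M = [list(entries[i * n:(i + 1) * n]) for i in range(n)]
        total += 1
        if determinant(M) == 0:
            singular += 1
    return Fraction(singular, total)

for n in (2, 3):
    print(n, singularity_probability(n), Fraction(1, 2 ** n))
# prints: 2 1/2 1/4
#         3 5/8 1/8
```

At these tiny sizes the singularity probability is far larger than $2^{-n}$, because many pairs of rows or columns can coincide up to sign; the conjecture concerns the exponential rate as n grows, not small cases.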
One can hope to sum the singularity probability over all hyperplanes, but there are of course infinitely many hyperplanes in $\mathbb{R}^n$. Fortunately, only a few of these will have a particularly high chance of capturing X. The problem now hinges on getting a sufficiently strong control on the number of hyperplanes which are “rich” in the sense that they have a high probability of capturing X (or equivalently, that they have a large intersection with the discrete cube $\{-1,+1\}^n$).
This is where (finally!) the additive combinatorics comes in. Let $v = (v_1, \ldots, v_n) \in \mathbb{R}^n$ be a normal vector to a given hyperplane V. (For instance, if $V = \{(x_1, \ldots, x_n): x_1 = x_2\}$, one could take $v = (1, -1, 0, \ldots, 0)$.) Then V is rich if and only if the random variable
$$\epsilon_1 v_1 + \ldots + \epsilon_n v_n \qquad (6)$$
vanishes a large proportion of the time, where $\epsilon_1, \ldots, \epsilon_n \in \{-1,+1\}$ are independent signs. (One can view (6) as the result of an n-step random walk, in which the $j^{th}$ step of the walk has magnitude $|v_j|$.) To put it another way, if $v$ is associated to a rich hyperplane, then there are many additive relations amongst the coefficients $v_1, \ldots, v_n$. What kinds of sets of numbers have such a strong amount of structure? (This problem is known as the inverse Littlewood-Offord problem; the forward Littlewood-Offord problem concerns how often (6) vanishes for a given set of numbers $v_1, \ldots, v_n$.)
Well, one way that many of the sums (6) could vanish is if many of the $v_j$ are themselves zero; for instance, we have already seen that if $v = (1, -1, 0, \ldots, 0)$, then half of the sums (6) vanish. But this is a rather degenerate case, and it is intuitively obvious that the more non-zero terms one has in the random walk, the less likely it is that the sum is going to vanish. Indeed, there is an observation of Erdős (a quick application of Sperner’s theorem) that if k of the coefficients $v_1, \ldots, v_n$ are non-zero, then (6) can only vanish at most $O(1/\sqrt{k})$ of the time. This bound is sharp; if, for instance, $v_1 = \ldots = v_k = 1$ and $v_{k+1} = \ldots = v_n = 0$ with k even, then the theory of random walks tells us that (6) is distributed in a roughly Gaussian and discrete fashion around the origin with standard deviation $\sqrt{k}$, and so (6) should vanish about $1/\sqrt{k}$ of the time, up to a constant factor (and one can check this easily enough using Stirling’s formula). [In fact, this case is essentially the exact optimum, as follows from Erdős’ argument.]
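The vanishing probability of the random walk (6) can be computed exactly for small integer vectors, which makes the examples above easy to check. The sketch below builds the distribution of the signed sum one step at a time and compares it with the bound $\binom{k}{\lfloor k/2 \rfloor}/2^k$ that Erdős’ argument gives when k coefficients are non-zero. It assumes integer coefficients, and the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from math import comb

def vanishing_probability(v):
    """P(eps_1 v_1 + ... + eps_n v_n = 0) for independent uniform signs, integer v."""
    dist = Counter({0: 1})  # number of sign patterns reaching each partial sum
    for step in v:
        nxt = Counter()
        for s, count in dist.items():
            nxt[s + step] += count
            nxt[s - step] += count
        dist = nxt
    return Fraction(dist[0], 2 ** len(v))

def erdos_bound(k):
    """Central binomial bound C(k, floor(k/2)) / 2^k for k non-zero coefficients."""
    return Fraction(comb(k, k // 2), 2 ** k)

print(vanishing_probability([1, -1, 0, 0]))   # 1/2: the degenerate example
print(vanishing_probability([1] * 6))         # 5/16: equals erdos_bound(6)
print(vanishing_probability([1, 2, 4, 8]))    # 0: no signed sum can vanish
```

The last example is a small integer stand-in for the situation in which the coefficients satisfy no additive relations at all: the largest term always outweighs the others combined, so the sum never vanishes.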
So, now suppose that all the $v_j$ are non-zero. Then Erdős’ result tells us that (6) vanishes at most $O(1/\sqrt{n})$ of the time. But in most cases one can do a lot better; if, for instance, the $v_j$ are linearly independent over the rationals, then (6) in fact never vanishes at all. It turns out that by intensively using the tools from additive combinatorics (including Fourier analysis and the geometry of numbers) one can obtain a satisfactory classification of the vectors $v$ for which (6) has a high chance of vanishing; the precise description is technical, but basically in order for (6) to equal zero often, most of the coordinates of v must lie inside a generalised arithmetic progression of reasonably small size and dimension. Using such facts, it is possible to get good bounds on the singularity probability and on the least singular value of random discrete matrices such as Bernoulli matrices, leading in particular to the circular law for such discrete matrices (various formulations of this law have been recently obtained by Götze-Tikhomirov, Pan-Zhou, and Van Vu and myself). [To get the best bounds on the singularity probability, one uses a slightly different argument, using Fourier analysis and additive combinatorics to compare the vanishing probability of a random walk with that of a lazy random walk, thus creating a relationship between the singularity of Bernoulli matrices and sparse Bernoulli matrices; see for instance my paper with Van Vu on this topic.]
This theory for understanding singularity behaviour of discrete random matrices promises to have applications to some other areas of mathematics as well. For instance, the subject of smoothed analysis, in which the introduction of random noise to various numerical algorithms (such as the simplex method) increases the stability of the algorithm, can use this theory to extend the theoretical results in that subject from continuous noise models to discrete noise models (such as those created by roundoff error).
[Update, Dec 6: Terminology and typos corrected.]