In the theory of discrete random matrices (e.g. matrices whose entries are random signs $\pm 1$), one often encounters the problem of understanding the distribution of the random variable $\mathrm{dist}(X,V)$, where $X = (\xi_1,\ldots,\xi_n) \in \{-1,+1\}^n$ is an $n$-dimensional random sign vector (so $X$ is uniformly distributed in the discrete cube $\{-1,+1\}^n$), and $V$ is some $d$-dimensional subspace of $\mathbb{R}^n$ for some $0 \leq d \leq n$.
It is not hard to compute the second moment of this random variable. Indeed, if $P = (p_{ij})_{1 \leq i,j \leq n}$ denotes the orthogonal projection matrix from $\mathbb{R}^n$ to the orthogonal complement $V^\perp$ of $V$, then one observes that
$$\mathbb{E}\,\mathrm{dist}(X,V)^2 = \mathbb{E}\, X \cdot P X = \sum_{i=1}^n \sum_{j=1}^n p_{ij}\, \mathbb{E}\, \xi_i \xi_j = \sum_{i=1}^n p_{ii} = \mathrm{tr}(P) = n - d \ \ \ \ (1)$$
since $P$ is a rank $n-d$ orthogonal projection. So we expect $\mathrm{dist}(X,V)$ to be about $\sqrt{n-d}$ on the average.
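As a quick numerical sanity check of (1) (not part of the argument; the dimensions, sample size, and the choice of a Gaussian-generated subspace below are illustrative assumptions of mine), one can sample random sign vectors against a generic subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 200, 50, 20000                      # illustrative sizes, not from the post
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal basis of a generic d-dim subspace V
X = rng.choice([-1.0, 1.0], size=(trials, n))      # uniform random sign vectors
dist = np.linalg.norm(X - X @ Q @ Q.T, axis=1)     # dist(X, V) for each sample
print("E dist(X,V)^2 ~", np.mean(dist**2), "  (should be close to n - d =", n - d, ")")
```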
In fact, one has sharp concentration around this value, in the sense that $\mathrm{dist}(X,V) = \sqrt{n-d} + O(1)$ with high probability. More precisely, we have

Proposition 1. We have
$$\mathbb{P}\big( |\mathrm{dist}(X,V) - \sqrt{n-d}| \geq t \big) \leq C e^{-ct^2}$$
for all $t > 0$ and some absolute constants $C, c > 0$.
In fact the constants $C, c$ are very civilised; for large $t$ one can basically take $C = 4$ and $c = 1/16$, for instance. This type of concentration, particularly for subspaces $V$ of moderately large codimension $n-d$, is fundamental to much of my work on random matrices with Van Vu, starting with our first paper (in which this proposition first appears). (For subspaces of small codimension (such as hyperplanes) one has to use other tools to get good results, such as inverse Littlewood-Offord theory or the Berry-Esséen central limit theorem, but that is another story.)
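To get a feel for Proposition 1, here is a small illustrative simulation (all parameters below are my own choices) showing that the fluctuation of $\mathrm{dist}(X,V)$ stays bounded as $n$ grows, even though $\sqrt{n-d}$ itself grows:

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 4000                                      # illustrative sample size
for n in (100, 400, 1600):                         # illustrative dimensions
    d = n // 4
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # generic d-dimensional subspace V
    X = rng.choice([-1.0, 1.0], size=(trials, n))
    dist = np.linalg.norm(X - X @ Q @ Q.T, axis=1)
    print(f"n={n:5d}: sqrt(n-d) = {np.sqrt(n - d):7.2f},  mean dist = {dist.mean():7.2f},  std = {dist.std():.3f}")
```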
Proposition 1 is an easy consequence of the second moment computation and Talagrand’s inequality, which among other things provides a sharp concentration result for convex Lipschitz functions on the cube $\{-1,+1\}^n$; since $x \mapsto \mathrm{dist}(x,V)$ is indeed a convex $1$-Lipschitz function, this inequality can be applied immediately. The proof of Talagrand’s inequality is short and can be found in several textbooks (e.g. Alon and Spencer), but I thought I would reproduce the argument here (specialised to the convex case), mostly to force myself to learn the proof properly. Note the concentration of $O(1)$ obtained by Talagrand’s inequality is much stronger than what one would get from more elementary tools such as Azuma’s inequality or McDiarmid’s inequality, which would only give concentration of about $O(\sqrt{n})$ or so (which is in fact trivial, since the cube has diameter $2\sqrt{n}$); the point is that Talagrand’s inequality is very effective at exploiting the convexity of the problem, as well as the Lipschitz nature of the function in all directions, whereas Azuma’s inequality can only easily take advantage of the Lipschitz nature of the function in coordinate directions. On the other hand, Azuma’s inequality works just as well if the $\ell^2$ metric is replaced with the larger $\ell^1$ metric, and one can conclude that the $\ell^1$ distance between $X$ and $V$ concentrates around its median to a width $O(\sqrt{n})$, which is a more non-trivial fact than the $\ell^2$ concentration bound given by that inequality. (The computation of the median of the $\ell^1$ distance is more complicated than for the $\ell^2$ distance, though, and depends on the orientation of $V$.)
Remark 1 If one makes the coordinates of $X$ iid Gaussian variables $\xi_i \equiv N(0,1)$ rather than random signs, then Proposition 1 is much easier to prove; the probability distribution of a Gaussian vector is rotation-invariant, so one can rotate $V$ to be, say, $\mathbb{R}^d \times \{0\}^{n-d}$, at which point $\mathrm{dist}(X,V)^2$ is clearly the sum of $n-d$ independent squares of Gaussians (i.e. a chi-square distribution with $n-d$ degrees of freedom), and the claim follows from direct computation (or one can use the Chernoff inequality). The gaussian counterpart of Talagrand’s inequality is more classical, being essentially due to Lévy, and will also be discussed later in this post.
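As an illustration of Remark 1 (again with illustrative parameters of my own choosing, and using a generic subspace rather than a rotated coordinate subspace), one can check numerically that $\mathrm{dist}(X,V)^2$ matches a chi-square law with $n-d$ degrees of freedom in the Gaussian case:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, d, trials = 200, 50, 20000                      # illustrative sizes
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # basis of a d-dimensional subspace V
G = rng.standard_normal((trials, n))               # standard Gaussian vectors
dist2 = np.sum((G - G @ Q @ Q.T)**2, axis=1)       # dist(G, V)^2

# By rotation invariance, dist(G,V)^2 should be chi-square with n-d degrees of freedom.
print("empirical mean, variance:", dist2.mean(), dist2.var())
print("chi-square mean, variance:", n - d, 2*(n - d))
print("KS distance to chi2(n-d):", stats.kstest(dist2, "chi2", args=(n - d,)).statistic)
```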
— 1. Concentration on the cube —
Proposition 1 follows easily from the following statement, which asserts that if a convex set $A \subset \mathbb{R}^n$ occupies a non-trivial fraction of the cube $\{-1,+1\}^n$, then the neighbourhood $A_t := \{ x \in \mathbb{R}^n : \mathrm{dist}(x,A) \leq t \}$ will occupy almost all of the cube for $t \gg 1$:
Proposition 2 (Talagrand concentration inequality). Let $A \subset \mathbb{R}^n$ be convex. Then
$$\mathbb{P}(X \in A)\, \mathbb{P}(X \notin A_t) \leq e^{-ct^2}$$
for all $t > 0$ and some absolute constant $c > 0$, where $X$ is chosen uniformly from $\{-1,+1\}^n$.
Remark 2 It is crucial that $A$ is convex here. If instead $A$ is, say, the set of all points in $\{-1,+1\}^n$ with fewer than $n/2$ $+1$’s, then $\mathbb{P}(X \in A)$ is comparable to $1$, but $\mathbb{P}(X \notin A_t)$ only starts decaying once $t \gg n^{1/4}$, rather than $t \gg 1$. Indeed, it is not hard to show that Proposition 2 implies the variant
$$\mathbb{P}(X \in A)\, \mathbb{P}(X \notin A_{t\sqrt{n}}) \leq e^{-ct^2}$$
for non-convex $A$ (by restricting $A$ to the cube $\{-1,+1\}^n$ and then passing from $A$ to the convex hull, noting that distances to $A$ from points of the cube may be contracted by a factor of at most $\sqrt{n}$ by this latter process); this inequality can also be easily deduced from Azuma’s inequality.
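To see the $n^{1/4}$ scale in Remark 2 concretely: a sign vector with $k \geq n/2$ coordinates equal to $+1$ lies at distance exactly $2\sqrt{k - n/2 + 1}$ from this non-convex $A$ when $n$ is even, since the nearest point of $A$ is obtained by flipping $k - n/2 + 1$ of the $+1$’s and each flip moves a coordinate by $2$. A short illustrative computation (the parameters are my own) then shows the typical distance growing like $n^{1/4}$:

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 200000                                    # illustrative sample size
for n in (100, 400, 1600, 6400):                   # even dimensions, so "fewer than n/2" is clean
    k = rng.binomial(n, 0.5, size=trials)          # number of +1's in a random sign vector
    dist = 2*np.sqrt(np.maximum(k - n//2 + 1, 0))  # distance from the sample point to A
    print(f"n={n:5d}: 90th percentile of dist(X,A) = {np.percentile(dist, 90):6.2f},  n^(1/4) = {n**0.25:5.2f}")
```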
To apply this proposition to the situation at hand, observe that if $A$ is the cylindrical region $\{ x \in \mathbb{R}^n : \mathrm{dist}(x,V) \leq r \}$ for some $r > 0$, then $A$ is convex and $A_t$ is contained in $\{ x \in \mathbb{R}^n : \mathrm{dist}(x,V) \leq r + t \}$. Thus
$$\mathbb{P}\big( \mathrm{dist}(X,V) \leq r \big)\, \mathbb{P}\big( \mathrm{dist}(X,V) > r + t \big) \leq e^{-ct^2}.$$
Applying this with $r = M$ or $r = M - t$, where $M$ is the median value of $\mathrm{dist}(X,V)$, one soon obtains concentration around the median:
$$\mathbb{P}\big( |\mathrm{dist}(X,V) - M| \geq t \big) \leq 4 e^{-ct^2}.$$
This is only compatible with (1) if $M = \sqrt{n-d} + O(1)$, and the claim follows.
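To spell out the compatibility step: the sub-gaussian concentration around $M$ implies
$$\mathbb{E}\,\mathrm{dist}(X,V)^2 = M^2 + O(M+1),$$
and comparing this with (1) gives $M^2 = n - d + O(M+1)$, hence $M = \sqrt{n-d} + O(1)$; combining this with the concentration around $M$ yields Proposition 1 (after adjusting the constants $C, c$).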
To prove Proposition 2, we use the exponential moment method. Indeed, it suffices by Markov’s inequality to show that
$$\mathbb{E}\, e^{c\, \mathrm{dist}(X,A)^2} \leq \frac{1}{\mathbb{P}(X \in A)} \ \ \ \ (2)$$
for a sufficiently small absolute constant $c > 0$ (in fact one can take $c = 1/16$).
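To spell out the reduction: applying Markov’s inequality to the non-negative random variable $e^{c\,\mathrm{dist}(X,A)^2}$ and using (2), one obtains
$$\mathbb{P}(X \notin A_t) = \mathbb{P}\big( \mathrm{dist}(X,A) > t \big) \leq e^{-ct^2}\, \mathbb{E}\, e^{c\,\mathrm{dist}(X,A)^2} \leq \frac{e^{-ct^2}}{\mathbb{P}(X \in A)},$$
which rearranges to the inequality in Proposition 2.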
We prove (2) by an induction on the dimension $n$. The claim is trivial for $n = 0$, so suppose $n \geq 1$ and that the claim has already been proven for $n-1$.
Let us write $X = (X', \xi)$ for $X' \in \{-1,+1\}^{n-1}$ and $\xi \in \{-1,+1\}$. For each $\varepsilon \in \{-1,+1\}$, we introduce the slice $A_\varepsilon := \{ x' \in \mathbb{R}^{n-1} : (x',\varepsilon) \in A \}$; then $A_\varepsilon$ is convex. We now try to bound the left-hand side of (2) in terms of the slices $A_{+1}, A_{-1}$ rather than $A$ itself. Clearly
$$\mathbb{P}(X \in A) = \frac{a_{+1} + a_{-1}}{2} \ \ \ \ (3)$$
where $a_{+1} := \mathbb{P}(X' \in A_{+1})$ and $a_{-1} := \mathbb{P}(X' \in A_{-1})$.
Let $0 \leq t \leq 1$ be chosen later; then, writing $y$ and $z$ for the points of $A_\xi$ and $A_{-\xi}$ closest to $X'$, the point $(1-t)(y,\xi) + t(z,-\xi)$ lies in $A$ by convexity, and so
$$\mathrm{dist}(X,A) \leq \big| X - (1-t)(y,\xi) - t(z,-\xi) \big|.$$
Squaring this and using Pythagoras (splitting into the final coordinate and the remaining $n-1$ coordinates), one obtains
$$\mathrm{dist}(X,A)^2 \leq 4t^2 + \big| X' - (1-t)y - tz \big|^2. \ \ \ \ (4)$$
As we will shortly be exponentiating the left-hand side, we need to linearise the right-hand side in $t$. Accordingly, we will exploit the convexity of the function $w \mapsto |w|^2$ to bound
$$\big| X' - (1-t)y - tz \big|^2 = \big| (1-t)(X'-y) + t(X'-z) \big|^2 \leq (1-t)\, |X'-y|^2 + t\, |X'-z|^2$$
and thus by (4)
$$\mathrm{dist}(X,A)^2 \leq 4t^2 + (1-t)\, \mathrm{dist}(X', A_\xi)^2 + t\, \mathrm{dist}(X', A_{-\xi})^2.$$
We exponentiate this and take expectations in $X'$ (holding $\xi$ fixed for now) to get
$$\mathbb{E}_{X'}\, e^{c\,\mathrm{dist}(X,A)^2} \leq e^{4ct^2}\, \mathbb{E}_{X'}\Big( \big( e^{c\,\mathrm{dist}(X',A_\xi)^2} \big)^{1-t} \big( e^{c\,\mathrm{dist}(X',A_{-\xi})^2} \big)^{t} \Big).$$
Meanwhile, from the induction hypothesis and (3) we have
$$\mathbb{E}_{X'}\, e^{c\,\mathrm{dist}(X',A_\xi)^2} \leq \frac{1}{a_\xi}$$
and similarly for $A_{-\xi}$. By Hölder’s inequality, we conclude
$$\mathbb{E}_{X'}\, e^{c\,\mathrm{dist}(X,A)^2} \leq e^{4ct^2}\, \frac{1}{a_\xi^{\,1-t}}\, \frac{1}{a_{-\xi}^{\,t}}.$$
For the value of $\xi$ with the larger slice (i.e. with $a_\xi \geq a_{-\xi}$), the optimal choice of $t$ here is $t = 0$, obtaining
$$\mathbb{E}_{X'}\, e^{c\,\mathrm{dist}(X,A)^2} \leq \frac{1}{a_\xi};$$
for the value of $\xi$ with the smaller slice (i.e. with $a_\xi \leq a_{-\xi}$), the optimal choice of $t$ will be determined later. Averaging over the two values of $\xi$, we obtain
$$\mathbb{E}\, e^{c\,\mathrm{dist}(X,A)^2} \leq \frac{1}{2}\Big( \frac{1}{a} + e^{4ct^2}\, \frac{1}{b}\, \Big(\frac{b}{a}\Big)^{t} \Big)$$
where $a := \max(a_{+1}, a_{-1})$ and $b := \min(a_{+1}, a_{-1})$,
so to establish (2), it suffices (in view of (3)) to pick $0 \leq t \leq 1$ such that
$$\frac{1}{2}\Big( \frac{1}{a} + e^{4ct^2}\, \frac{1}{b}\, \Big(\frac{b}{a}\Big)^{t} \Big) \leq \frac{2}{a+b},$$
or equivalently (writing $r := b/a \in (0,1]$)
$$1 + e^{4ct^2}\, r^{t-1} \leq \frac{4}{1+r}.$$
If $1-r$ is bounded away from zero, then by choosing $t := 1$ we would obtain the claim if $c$ is small enough, so we may take $1-r$ to be small. But then a Taylor expansion allows us to conclude if we take $t$ to be a constant multiple of $1-r$, and again pick $c$ to be small enough. The point is that $t = 0$ already almost works up to errors of $O((1-r)^2)$, and increasing $t$ from zero to a small non-zero quantity will decrease the LHS by about $t(1-r)$. [By optimising everything using first-year calculus, one eventually gets the constant $c = 1/16$ claimed earlier.]
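As a numerical sanity check on this last step (checking the elementary inequality in the form displayed above, with the value $c = 1/16$; the grid resolutions below are arbitrary choices of mine), one can verify that a suitable $t \in [0,1]$ exists for every ratio $r$:

```python
import numpy as np

c = 1/16
t = np.linspace(0.0, 1.0, 2001)                      # candidate values of t in [0,1]
worst = -np.inf
for r in np.linspace(0.001, 1.0, 1000):              # ratio r = b/a
    lhs = np.min(1 + np.exp(4*c*t**2) * r**(t - 1))  # best achievable left-hand side over t
    worst = max(worst, lhs - 4/(1 + r))
print("max over grid of (min_t LHS) - RHS =", worst, "(nonpositive means the inequality holds)")
```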
Remark 3 Talagrand’s inequality is in fact far more general than this; it applies to arbitrary products of probability spaces, rather than just to the cube $\{-1,+1\}^n$, and to non-convex $A$, but the notion of distance needed to define the enlargement $A_t$ becomes more complicated; the proof of the inequality, though, is essentially the same. Besides its applicability to convex Lipschitz functions, Talagrand’s inequality is also very useful for controlling combinatorial Lipschitz functions $F(x)$ which are “locally certifiable” in the sense that whenever $F(x)$ is larger than some threshold $\lambda$, then there exist some bounded number of coefficients of $x$ which “certify” this fact (in the sense that $F(y) \geq \lambda$ for any other $y$ which agrees with $x$ on these coefficients). See e.g. the text of Alon and Spencer for a more precise statement and some applications.
— 2. Gaussian concentration —
As mentioned earlier, there are analogous results when the uniform distribution on the cube $\{-1,+1\}^n$ is replaced by other distributions, such as the $n$-dimensional Gaussian distribution $N(0,1)^n$. In fact, in this case convexity is not needed:
Proposition 3 (Gaussian concentration inequality). Let $A \subset \mathbb{R}^n$ be an arbitrary (measurable) set. Then
$$\mathbb{P}(G \in A)\, \mathbb{P}(G \notin A_t) \leq C e^{-ct^2}$$
for all $t > 0$ and some absolute constants $C, c > 0$, where $G \equiv N(0,1)^n$ is a random Gaussian vector.
This inequality can be deduced from Lévy’s classical concentration of measure inequality for the sphere (with the optimal constant), but we will give an alternate proof due to Maurey and Pisier. It suffices to prove the following variant of Proposition 3:
Proposition 4 (Gaussian concentration inequality for Lipschitz functions). Let $F: \mathbb{R}^n \to \mathbb{R}$ be a $1$-Lipschitz function. Then
$$\mathbb{P}\big( |F(G) - \mathbb{E} F(G)| \geq t \big) \leq C e^{-ct^2}$$
for all $t > 0$ and some absolute constants $C, c > 0$, where $G \equiv N(0,1)^n$ is a random variable. [Informally, Lipschitz functions of Gaussian variables concentrate as if they were Gaussian themselves; for comparison, Talagrand’s inequality implies that convex Lipschitz functions of Bernoulli variables concentrate as if they were Gaussian.]
Indeed, if one sets $F(x) := \mathrm{dist}(x,A)$ (a $1$-Lipschitz function), and splits into the cases whether $\mathbb{E} F(G) \geq t/2$ or $\mathbb{E} F(G) < t/2$, one obtains either $\mathbb{P}(G \in A) \leq C e^{-ct^2/4}$ (in the former case) or $\mathbb{P}(G \notin A_t) \leq C e^{-ct^2/4}$, and so
$$\min\big( \mathbb{P}(G \in A),\ \mathbb{P}(G \notin A_t) \big) \leq C e^{-ct^2/4}$$
in either case. Also, since probabilities are at most $1$, we have $\mathbb{P}(G \in A)\, \mathbb{P}(G \notin A_t) \leq \min\big( \mathbb{P}(G \in A), \mathbb{P}(G \notin A_t) \big)$. Putting the two bounds together gives the claim.
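Here is a quick illustrative simulation of Proposition 4 (the choice of the $1$-Lipschitz function $F(x) = |x|$ and of the parameters below are mine, purely for illustration): the fluctuations of $F(G)$ stay $O(1)$ no matter how large the dimension is.

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 3000                          # illustrative sample size
# F(x) = |x| (Euclidean norm) is 1-Lipschitz, so by Proposition 4 its fluctuations
# around the mean should remain O(1) uniformly in the dimension n.
for n in (10, 100, 1000, 3000):
    F = np.linalg.norm(rng.standard_normal((trials, n)), axis=1)
    print(f"n={n:5d}: mean of F = {F.mean():7.2f}, std of F = {F.std():.3f}")
```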
It remains to prove Proposition 4. By a standard limiting argument, we may assume that $F$ is smooth, so that the Lipschitz hypothesis becomes the gradient estimate
$$|\nabla F(x)| \leq 1 \ \ \ \ (5)$$
for all $x$. By subtracting off the mean we may assume $\mathbb{E} F(G) = 0$. By replacing $F$ with $-F$ if necessary it suffices to control the upper tail probability $\mathbb{P}( F(G) \geq t )$ for $t > 0$.
We again use the exponential moment method. It suffices to show that
$$\mathbb{E}\, e^{\lambda F(G)} \leq e^{C \lambda^2}$$
for all $\lambda > 0$ and some absolute constant $C > 0$.
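Indeed, once this exponential moment bound is available, Markov’s inequality gives
$$\mathbb{P}\big( F(G) \geq t \big) \leq e^{-\lambda t}\, \mathbb{E}\, e^{\lambda F(G)} \leq e^{C\lambda^2 - \lambda t}$$
for every $\lambda > 0$, and choosing $\lambda := t/(2C)$ yields the desired sub-gaussian bound $e^{-t^2/4C}$.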
With an eye to exploiting (5), one might seek to use the fundamental theorem of calculus to write
$$F(G) - F(G') = \int_0^1 \frac{d}{ds} F\big( (1-s) G' + s G \big)\, ds,$$
where $G' \equiv N(0,1)^n$ is an independent copy of $G$.
But actually it turns out to be smarter to use a circular arc of integration, rather than a line segment:
$$F(G) - F(G') = \int_0^{\pi/2} \frac{d}{d\theta} F( G_\theta )\, d\theta, \quad \hbox{where } G_\theta := G \sin \theta + G' \cos \theta.$$
The reason for this is that, for each fixed $\theta$, $G_\theta$ is another gaussian random variable equivalent to $G$ (i.e. also distributed as $N(0,1)^n$), as is its derivative $G'_\theta := \frac{d}{d\theta} G_\theta = G \cos\theta - G' \sin\theta$; furthermore, and crucially, these two random variables are independent.
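Indeed, for any $1 \leq i, j \leq n$ one computes
$$\mathbb{E}\big[ (G_\theta)_i\, (G'_\theta)_j \big] = \delta_{ij}\,\big( \sin\theta \cos\theta - \cos\theta \sin\theta \big) = 0,$$
while $\mathbb{E}\big[ (G_\theta)_i (G_\theta)_j \big] = \delta_{ij} (\sin^2\theta + \cos^2\theta) = \delta_{ij}$ and similarly for $G'_\theta$; as all these variables are jointly gaussian, the vanishing covariance gives the claimed independence.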
To exploit this, we first use Jensen’s inequality (together with the normalisation $\mathbb{E} F(G') = 0$, which gives $\mathbb{E}\, e^{-\lambda F(G')} \geq 1$) to bound
$$\mathbb{E}\, e^{\lambda F(G)} \leq \mathbb{E}\, e^{\lambda (F(G) - F(G'))} = \mathbb{E} \exp\Big( \lambda \int_0^{\pi/2} \frac{d}{d\theta} F(G_\theta)\, d\theta \Big) \leq \frac{2}{\pi} \int_0^{\pi/2} \mathbb{E} \exp\Big( \frac{\pi \lambda}{2}\, \frac{d}{d\theta} F(G_\theta) \Big)\, d\theta,$$
applying Jensen’s inequality again (in the $\theta$ variable) in the last step.
Applying the chain rule and taking expectations, we have
$$\mathbb{E}\, e^{\lambda F(G)} \leq \frac{2}{\pi} \int_0^{\pi/2} \mathbb{E} \exp\Big( \frac{\pi \lambda}{2}\, \nabla F( G_\theta ) \cdot G'_\theta \Big)\, d\theta.$$
Let us condition $G_\theta$ to be fixed; then, as $G'_\theta$ is independent of $G_\theta$, the quantity $\nabla F(G_\theta) \cdot G'_\theta$ is a gaussian random variable with mean zero and standard deviation $|\nabla F(G_\theta)|$; applying (5), we conclude that $\nabla F(G_\theta) \cdot G'_\theta$ is normally distributed with standard deviation at most $1$. As such we have
$$\mathbb{E}\Big( \exp\Big( \frac{\pi \lambda}{2}\, \nabla F(G_\theta) \cdot G'_\theta \Big) \,\Big|\, G_\theta \Big) \leq e^{C \lambda^2}$$
for some absolute constant $C > 0$; integrating out the conditioning on $G_\theta$, we obtain the claim.