In the theory of discrete random matrices (e.g. matrices whose entries are random signs {\pm 1}), one often encounters the problem of understanding the distribution of the random variable {\hbox{dist}(X,V)}, where {X = (x_1,\ldots,x_n) \in \{-1,+1\}^n} is an {n}-dimensional random sign vector (so {X} is uniformly distributed in the discrete cube {\{-1,+1\}^n}), and {V} is some {d}-dimensional subspace of {{\bf R}^n} for some {0 \leq d \leq n}.

It is not hard to compute the second moment of this random variable. Indeed, if {P = (p_{ij})_{1 \leq i,j \leq n}} denotes the orthogonal projection matrix from {{\bf R}^n} to the orthogonal complement {V^\perp} of {V}, then one observes that

\displaystyle  \hbox{dist}(X,V)^2 = X \cdot P X = \sum_{i=1}^n \sum_{j=1}^n x_i x_j p_{ij}

and so upon taking expectations we see that

\displaystyle  {\Bbb E} \hbox{dist}(X,V)^2 = \sum_{i=1}^n p_{ii} = \hbox{tr} P = n-d \ \ \ \ \ (1)

since {P} is a rank {n-d} orthogonal projection. So we expect {\hbox{dist}(X,V)} to be about {\sqrt{n-d}} on the average.
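The identity (1) is easy to check numerically. The following sketch (with arbitrary illustrative dimensions, using numpy) draws a random subspace, forms the projection {P}, and compares the sample mean of {\hbox{dist}(X,V)^2} over random sign vectors against {n-d}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 10   # arbitrary illustrative dimensions

# A random d-dimensional subspace V of R^n, via an orthonormal basis Q.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
# P is the orthogonal projection onto the orthogonal complement of V.
P = np.eye(n) - Q @ Q.T

# The exact identity: E dist(X,V)^2 = tr P = n - d.
print(np.trace(P))   # = n - d = 30, up to rounding

# Monte Carlo over uniform random sign vectors X.
N = 20_000
X = rng.choice([-1.0, 1.0], size=(N, n))
dist2 = np.einsum('ij,ij->i', X @ P, X)   # dist(X,V)^2 = X . P X
print(dist2.mean())  # ≈ 30
```

The exact value of the trace does not depend on the choice of subspace, only on its dimension, in agreement with (1).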

In fact, one has sharp concentration around this value, in the sense that {\hbox{dist}(X,V) = \sqrt{n-d}+O(1)} with high probability. More precisely, we have

Proposition 1 (Large deviation inequality) For any {t>0}, one has

\displaystyle  {\Bbb P}( |\hbox{dist}(X,V) - \sqrt{n-d}| \geq t ) \leq C \exp(- c t^2 )

for some absolute constants {C, c > 0}.

In fact the constants {C, c} are very civilised; for large {t} one can basically take {C=4} and {c=1/16}, for instance. This type of concentration, particularly for subspaces {V} of moderately large codimension {n-d}, is fundamental to much of my work on random matrices with Van Vu, starting with our first paper (in which this proposition first appears). (For subspaces of small codimension (such as hyperplanes) one has to use other tools to get good results, such as inverse Littlewood-Offord theory or the Berry-Esséen central limit theorem, but that is another story.)

Proposition 1 is an easy consequence of the second moment computation and Talagrand’s inequality, which among other things provides a sharp concentration result for convex Lipschitz functions on the cube {\{-1,+1\}^n}; since {\hbox{dist}(x,V)} is indeed a convex Lipschitz function, this inequality can be applied immediately. The proof of Talagrand’s inequality is short and can be found in several textbooks (e.g. Alon and Spencer), but I thought I would reproduce the argument here (specialised to the convex case), mostly to force myself to learn the proof properly. Note the concentration of {O(1)} obtained by Talagrand’s inequality is much stronger than what one would get from more elementary tools such as Azuma’s inequality or McDiarmid’s inequality, which would only give concentration of about {O(\sqrt{n})} or so (which is in fact trivial, since the cube {\{-1,+1\}^n} has diameter {2\sqrt{n}}); the point is that Talagrand’s inequality is very effective at exploiting the convexity of the problem, as well as the Lipschitz nature of the function in all directions, whereas Azuma’s inequality can only easily take advantage of the Lipschitz nature of the function in coordinate directions. On the other hand, Azuma’s inequality works just as well if the {\ell^2} metric is replaced with the larger {\ell^1} metric, and one can conclude that the {\ell^1} distance between {X} and {V} concentrates around its median to a width {O(\sqrt{n})}, which is a more non-trivial fact than the {\ell^2} concentration bound given by that inequality. (The computation of the median of the {\ell^1} distance is more complicated than for the {\ell^2} distance, though, and depends on the orientation of {V}.)
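The gap between the {O(1)} and {O(\sqrt{n})} regimes is visible even in small simulations. The sketch below (dimensions and sample sizes chosen arbitrarily) estimates the standard deviation of {\hbox{dist}(X,V)} for subspaces of proportional dimension; it stays bounded as {n} grows, as Talagrand's inequality predicts:

```python
import numpy as np

rng = np.random.default_rng(1)

def dist_std(n, d, N=4000):
    """Empirical standard deviation of dist(X,V) for a random d-dim subspace V."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    P = np.eye(n) - Q @ Q.T                       # projection onto V-perp
    X = rng.choice([-1.0, 1.0], size=(N, n))      # random sign vectors
    return np.sqrt(np.einsum('ij,ij->i', X @ P, X)).std()

# The fluctuation of dist(X,V) about sqrt(n-d) stays O(1) as n grows 16-fold.
stds = [dist_std(n, n // 2) for n in (20, 80, 320)]
print(stds)
```

An Azuma-type bound would only rule out fluctuations of size {\gg \sqrt{n}}, which would permit these standard deviations to grow with {n}; empirically they do not.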

Remark 1 If one makes the coordinates of {X} iid Gaussian variables {x_i \equiv N(0,1)} rather than random signs, then Proposition 1 is much easier to prove; the probability distribution of a Gaussian vector is rotation-invariant, so one can rotate {V} to be, say, {{\bf R}^d}, at which point {\hbox{dist}(X,V)^2} is clearly the sum of {n-d} independent squares of Gaussians (i.e. a chi-square distribution), and the claim follows from direct computation (or one can use the Chernoff inequality). The gaussian counterpart of Talagrand’s inequality is more classical, being essentially due to Lévy, and will also be discussed later in this post.
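The rotation-invariance argument in Remark 1 can be illustrated numerically: for Gaussian {X}, the distribution of {\hbox{dist}(X,V)^2} should match a sum of {n-d} independent squared Gaussians, whatever the subspace. A quick comparison (arbitrary illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, N = 50, 20, 100_000   # arbitrary illustrative sizes

# A random d-dimensional subspace and the projection onto its complement.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
P = np.eye(n) - Q @ Q.T

# dist(X,V)^2 for Gaussian X ...
X = rng.standard_normal((N, n))
dist2 = np.einsum('ij,ij->i', X @ P, X)

# ... versus a chi-square with n-d degrees of freedom.
chi2 = (rng.standard_normal((N, n - d)) ** 2).sum(axis=1)

print(dist2.mean(), chi2.mean())   # both ≈ n - d = 30
print(dist2.var(), chi2.var())     # both ≈ 2(n - d) = 60
```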

— 1. Concentration on the cube —

Proposition 1 follows easily from the following statement, which asserts that if a convex set {A \subset {\bf R}^n} occupies a non-trivial fraction of the cube {\{-1,+1\}^n}, then the neighbourhood {A_t := \{ x \in {\bf R}^n: \hbox{dist}(x,A) \leq t \}} will occupy almost all of the cube for {t \gg 1}:

Proposition 2 (Talagrand’s concentration inequality) Let {A} be a convex set in {{\bf R}^n}. Then

\displaystyle  \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq \exp( - c t^2 )

for all {t>0} and some absolute constant {c > 0}, where {X \in \{-1,+1\}^n} is chosen uniformly from {\{-1,+1\}^n}.

Remark 2 It is crucial that {A} is convex here. If instead {A} is, say, the set of all points in {\{-1,+1\}^n} with fewer than {n/2-\sqrt{n}} coordinates equal to {+1}, then {\mathop{\bf P}(X \in A)} is comparable to {1}, but {\mathop{\bf P}( X \not \in A_t )} only starts decaying once {t \gg \sqrt{n}}, rather than {t \gg 1}. Indeed, it is not hard to show that Proposition 2 implies the variant

\displaystyle  \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq \exp( - c t^2 / n)

for non-convex {A} (by restricting {A} to {\{-1,+1\}^n} and then passing from {A} to the convex hull, noting that distances to {A} on {\{-1,+1\}^n} may be contracted by a factor of {O(\sqrt{n})} by this latter process); this inequality can also be easily deduced from Azuma’s inequality.

To apply this proposition to the situation at hand, observe that if {A} is the cylindrical region {\{ x \in {\bf R}^n: \hbox{dist}(x,V) \leq r \}} for some {r}, then {A} is convex and {A_t} is contained in {\{ x \in {\bf R}^n: \hbox{dist}(x,V) \leq r+t \}}. Thus

\displaystyle  \mathop{\bf P}(\hbox{dist}(X,V) \leq r) \mathop{\bf P}( \hbox{dist}(X,V) > r+t ) \leq \exp(-ct^2).

Applying this with {r := M} or {r := M-t}, where {M} is the median value of {\hbox{dist}(X,V)}, one soon obtains concentration around the median:

\displaystyle  \mathop{\bf P}( |\hbox{dist}(X,V) - M| > t ) \leq 4 \exp(-ct^2).

This is only compatible with (1) if {M = \sqrt{n-d} + O(1)}. (Indeed, integrating the tail bound gives {{\Bbb E} |\hbox{dist}(X,V) - M|^2 = O(1)}, hence {{\Bbb E} \hbox{dist}(X,V)^2 = M^2 + O(M+1)} by Cauchy-Schwarz; comparing this with (1) forces {M = \sqrt{n-d}+O(1)}.) The claim follows.

To prove Proposition 2, we use the exponential moment method. Indeed, it suffices by Markov’s inequality to show that

\displaystyle  \mathop{\bf P}(X \in A) {\Bbb E} \exp( c \hbox{dist}(X,A)^2 ) \leq 1 \ \ \ \ \ (2)

for a sufficiently small absolute constant {c > 0} (in fact one can take {c=1/16}).

We prove (2) by an induction on the dimension {n}. The claim is trivial for {n=0}, so suppose {n \geq 1} and the claim has already been proven for {n-1}.

Let us write {X = (X',x_n)}, where {x_n = \pm 1} and {X'} is uniformly distributed in {\{-1,+1\}^{n-1}}. For {t = \pm 1}, we introduce the slice {A_t := \{ x' \in {\bf R}^{n-1}: (x',t) \in A \}} (a mild abuse of the earlier neighbourhood notation {A_t}, but no confusion will arise); each slice {A_t} is convex. We now try to bound the left-hand side of (2) in terms of {X', A_t} rather than {X, A}. Clearly

\displaystyle  \mathop{\bf P}(X \in A) = \frac{1}{2} [ \mathop{\bf P}( X' \in A_{-1}) + \mathop{\bf P}( X' \in A_{+1} ) ].

By symmetry we may assume that {\mathop{\bf P}( X' \in A_{+1} ) \geq \mathop{\bf P}( X' \in A_{-1} )}, thus we may write

\displaystyle  \mathop{\bf P}( X' \in A_{\pm 1} ) = p (1 \pm q) \ \ \ \ \ (3)

where {p := \mathop{\bf P}(X \in A)} and {0 \leq q \leq 1}.

Now we look at {\hbox{dist}(X,A)^2}. For {t = \pm 1}, let {Y_t \in {\bf R}^{n-1}} be the closest point of (the closure of) {A_t} to {X'}, thus

\displaystyle  |X'-Y_t| = \hbox{dist}(X', A_t). \ \ \ \ \ (4)

Let {0 \leq \lambda \leq 1} be chosen later; then the point {(1-\lambda) (Y_{x_n}, x_n) + \lambda (Y_{-x_n},-x_n)} lies in {A} by convexity, and so

\displaystyle  \hbox{dist}(X,A) \leq |(1-\lambda) (Y_{x_n}, x_n) + \lambda (Y_{-x_n},-x_n) - (X',x_n)|.

Squaring this and using Pythagoras, one obtains

\displaystyle  \hbox{dist}(X,A)^2 \leq 4 \lambda^2 + |(1-\lambda) (X'-Y_{x_n}) + \lambda (X'-Y_{-x_n})|^2.

As we will shortly be exponentiating the left-hand side, we need to linearise the right-hand side. Accordingly, we will exploit the convexity of the function {x \mapsto |x|^2} to bound

\displaystyle  |(1-\lambda) (X'-Y_{x_n}) + \lambda (X'-Y_{-x_n})|^2 \leq (1-\lambda) |X'-Y_{x_n}|^2 + \lambda |X'-Y_{-x_n}|^2

and thus by (4)

\displaystyle  \hbox{dist}(X,A)^2 \leq 4 \lambda^2 + (1-\lambda) \hbox{dist}(X',A_{x_n})^2 + \lambda \hbox{dist}(X',A_{-x_n})^2.
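This inequality can be sanity-checked numerically for a concrete convex body. The sketch below takes {A} to be an off-centre Euclidean ball (an arbitrary choice for which the distance to {A} and to its slices is available in closed form) and verifies the bound for random points {X'} and random {\lambda}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, R, c0 = 6, 3.0, 0.7   # arbitrary: ball of radius R centred at (0,...,0,c0)

def dist_to_slice(x_prime, t):
    """Distance in R^{n-1} from x' to the slice A_t, a ball of radius sqrt(R^2-(t-c0)^2)."""
    r = np.sqrt(R**2 - (t - c0)**2)
    return max(0.0, np.linalg.norm(x_prime) - r)

worst = -np.inf
for _ in range(2000):
    xp = 3 * rng.standard_normal(n - 1)   # X'
    xn = rng.choice([-1.0, 1.0])          # x_n
    lam = rng.uniform(0, 1)               # lambda
    # dist(X, A)^2 for X = (X', x_n):
    lhs = max(0.0, np.sqrt(np.linalg.norm(xp)**2 + (xn - c0)**2) - R) ** 2
    # the right-hand side of the convexity bound:
    rhs = (4 * lam**2
           + (1 - lam) * dist_to_slice(xp, xn)**2
           + lam * dist_to_slice(xp, -xn)**2)
    worst = max(worst, lhs - rhs)

print(worst)   # should be <= 0, up to rounding error
```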

We exponentiate this and take expectations in {X'} (holding {x_n} fixed for now) to get

\displaystyle  {\Bbb E}_{X'} e^{c \hbox{dist}(X,A)^2} \leq e^{4 c \lambda^2} {\Bbb E}_{X'} (e^{c \hbox{dist}(X',A_{x_n})^2})^{1-\lambda} (e^{c \hbox{dist}(X',A_{-x_n})^2})^{\lambda}.

Meanwhile, from the induction hypothesis and (3) we have

\displaystyle  {\Bbb E}_{X'} e^{c \hbox{dist}(X',A_{x_n})^2} \leq \frac{1}{p(1 + x_n q)}

and similarly for {A_{-x_n}}. By Hölder’s inequality, we conclude

\displaystyle  {\Bbb E}_{X'} e^{c \hbox{dist}(X,A)^2} \leq e^{4 c \lambda^2} \frac{1}{p (1 + x_n q)^{1-\lambda} (1-x_n q)^\lambda}.

For {x_n=+1}, the optimal choice of {\lambda} here is {0}, obtaining

\displaystyle  {\Bbb E}_{X'} e^{c \hbox{dist}(X,A)^2} \leq \frac{1}{p (1+q)};

for {x_n=-1}, the optimal choice of {\lambda} is to be determined. Averaging, we obtain

\displaystyle {\Bbb E}_{X} e^{c \hbox{dist}(X,A)^2} \leq \frac{1}{2} [ \frac{1}{p (1+q)} + e^{4 c \lambda^2} \frac{1}{p (1 - q)^{1-\lambda} (1 + q)^\lambda} ]

so to establish (2), it suffices to pick {0 \leq \lambda \leq 1} such that

\displaystyle  \frac{1}{1+q} + e^{4 c \lambda^2} \frac{1}{(1 - q)^{1-\lambda} (1 + q)^\lambda} \leq 2.

If {q} is bounded away from zero, then by choosing {\lambda=1} we would obtain the claim if {c} is small enough, so we may take {q} to be small. But then a Taylor expansion allows us to conclude if we take {\lambda} to be a constant multiple of {q}, and again pick {c} to be small enough. The point is that {\lambda=0} already almost works up to errors of {O(q^2)}, and increasing {\lambda} from zero to a small non-zero quantity will decrease the LHS by about {O(\lambda q) - O(c \lambda^2)}. [By optimising everything using first-year calculus, one eventually gets the constant {c=1/16} claimed earlier.]

Remark 3 Talagrand’s inequality is in fact far more general than this; it applies to arbitrary products of probability spaces, rather than just to {\{-1,+1\}^n}, and to non-convex {A}, but the notion of distance needed to define {A_t} becomes more complicated; the proof of the inequality, though, is essentially the same. Besides its applicability to convex Lipschitz functions, Talagrand’s inequality is also very useful for controlling combinatorial Lipschitz functions {F} which are “locally certifiable” in the sense that whenever {F(x)} is larger than some threshold {t}, then there exist some bounded number {f(t)} of coefficients of {x} which “certify” this fact (in the sense that {F(y) \geq t} for any other {y} which agrees with {x} on these coefficients). See e.g. the text of Alon and Spencer for a more precise statement and some applications.

— 2. Gaussian concentration —

As mentioned earlier, there are analogous results when the uniform distribution on the cube {\{-1,+1\}^n} is replaced by other distributions, such as the {n}-dimensional Gaussian distribution. In fact, in this case convexity is not needed:

Proposition 3 (Gaussian concentration inequality) Let {A} be a measurable set in {{\bf R}^n}. Then

\displaystyle  \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq \exp( - c t^2 )

for all {t>0} and some absolute constant {c > 0}, where {X \equiv N(0,1)^n} is a random Gaussian vector.

This inequality can be deduced from Lévy’s classical concentration of measure inequality for the sphere (with the optimal constant), but we will give an alternate proof due to Maurey and Pisier. It suffices to prove the following variant of Proposition 3:

Proposition 4 (Gaussian concentration inequality for Lipschitz functions) Let {f: {\bf R}^n \rightarrow {\bf R}} be a function which is Lipschitz with constant {1} (i.e. {|f(x)-f(y)| \leq |x-y|} for all {x,y \in {\bf R}^n}). Then for any {t} we have

\displaystyle  {\Bbb P}( |f(X) - {\Bbb E} f(X)| \geq t ) \leq 2\exp( - ct^2 )

for all {t>0} and some absolute constant {c>0}, where {X \equiv N(0,1)^n} is a random variable. [Informally, Lipschitz functions of Gaussian variables concentrate as if they were Gaussian themselves; for comparison, Talagrand's inequality implies that convex Lipschitz functions of Bernoulli variables concentrate as if they were Gaussian.]
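As a quick illustration of Proposition 4 (not part of the proof), one can test the inequality in a simulation with the 1-Lipschitz function {f(x) = |x|}; the constant {c = 1/4} below is an arbitrary safe choice, not an optimal one:

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 10, 200_000   # arbitrary dimension and sample size

# f(x) = |x| is 1-Lipschitz; its fluctuations about the mean should be
# sub-Gaussian, uniformly in the dimension n.
X = rng.standard_normal((N, n))
f = np.linalg.norm(X, axis=1)
dev = np.abs(f - f.mean())

for t in (0.5, 1.0, 2.0):
    emp, bound = (dev >= t).mean(), 2 * np.exp(-t**2 / 4)
    print(t, emp, bound)   # empirical tail sits below the sub-Gaussian bound
```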

Indeed, if one sets {f(x) := \hbox{dist}(x,A)} and splits into the two cases {{\Bbb E} f(X) \geq t/2} and {{\Bbb E} f(X) < t/2}, one obtains either {{\Bbb P}( X \in A ) \leq 2 \exp(-ct^2/4)} (in the former case) or {{\Bbb P}( X \not \in A_t ) \leq 2 \exp(-ct^2/4)} (in the latter), and so

\displaystyle  \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq 2\exp( - c t^2/4 )

in either case. Also, since {\mathop{\bf P}(X \in A) + \mathop{\bf P}( X \not \in A_t ) \leq 1}, we have {\mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq 1/4}. Putting the two bounds together gives the claim.

Now we prove Proposition 4. By the epsilon regularisation argument we may take {f} to be smooth, and so by the Lipschitz property we have

\displaystyle  |\nabla f(x)| \leq 1 \ \ \ \ \ (5)

for all {x}. By subtracting off the mean we may assume {{\Bbb E} f = 0}. By replacing {f} with {-f} if necessary it suffices to control the upper tail probability {{\Bbb P}( f(X) \geq t )} for {t > 0}.

We again use the exponential moment method. It suffices to show that

\displaystyle  {\Bbb E} \exp( t f(X) ) \leq \exp( C t^2 )

for some absolute constant {C}.

Now we use a variant of the square and rearrange trick. Let {Y} be an independent copy of {X}. Since {{\Bbb E} f(Y) = 0}, we see from Jensen’s inequality that {{\Bbb E} \exp( - t f(Y) ) \geq 1}, and so

\displaystyle  {\Bbb E} \exp( t f(X) ) \leq {\Bbb E} \exp( t (f(X)-f(Y)) ).

With an eye to exploiting (5), one might seek to use the fundamental theorem of calculus to write

\displaystyle  f(X) - f(Y) = \int_0^1 \frac{d}{d\lambda} f( (1-\lambda) Y + \lambda X )\ d\lambda.

But actually it turns out to be smarter to use a circular arc of integration, rather than a line segment:

\displaystyle  f(X) - f(Y) = \int_0^{\pi/2} \frac{d}{d\theta} f( Y \cos \theta + X \sin \theta )\ d\theta.

The reason for this is that {X_\theta := Y \cos \theta + X \sin \theta} is another Gaussian random variable equivalent to {N(0,1)^n}, as is its derivative {X'_\theta := -Y \sin \theta + X \cos \theta}; furthermore, and crucially, these two random variables are independent (their covariance is {-\cos\theta \sin\theta + \sin\theta \cos\theta = 0}, and they are jointly Gaussian).
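This joint distribution claim is easy to check in coordinates, since each coordinate pair of {(X_\theta, X'_\theta)} is a rotation of the independent pair {(Y, X)}. A quick simulation, with an arbitrary choice of angle:

```python
import numpy as np

rng = np.random.default_rng(4)
N, theta = 1_000_000, 0.7   # arbitrary sample size and angle

X = rng.standard_normal(N)
Y = rng.standard_normal(N)
Xt = Y * np.cos(theta) + X * np.sin(theta)    # X_theta
Xtp = -Y * np.sin(theta) + X * np.cos(theta)  # its derivative in theta

# Both are standard Gaussian, and they are uncorrelated
# (hence independent, being jointly Gaussian).
print(Xt.var(), Xtp.var())          # both ≈ 1
print(np.corrcoef(Xt, Xtp)[0, 1])   # ≈ 0
```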

To exploit this, we first use Jensen’s inequality to bound

\displaystyle  \exp( t (f(X) - f(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp( \frac{\pi t}{2} \frac{d}{d\theta} f( X_\theta ) )\ d\theta.

Applying the chain rule and taking expectations, we have

\displaystyle  {\Bbb E} \exp( t (f(X) - f(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} {\Bbb E} \exp( \frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta )\ d\theta.

Let us condition {X_\theta} to be fixed, then {X'_\theta \equiv N(0,1)^n}; applying (5), we conclude that {\frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta} is normally distributed with standard deviation at most {\frac{\pi t}{2}}. As such we have

\displaystyle  {\Bbb E} \exp( \frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta ) \leq \exp( C t^2 )

for some absolute constant {C}; integrating out the conditioning on {X_\theta} we obtain the claim.