Talagrand’s concentration inequality

9 June, 2009 in expository, math.MG, math.PR | Tags: convexity, large deviation inequality, random matrices, Talagrand's inequality | by Terence Tao

In the theory of discrete random matrices (e.g. matrices whose entries are random signs ${\pm 1}$ ), one often encounters the problem of understanding the distribution of the random variable ${\hbox{dist}(X,V)}$ , where ${X = (x_1,\ldots,x_n) \in \{-1,+1\}^n}$ is an ${n}$ -dimensional random sign vector (so ${X}$ is uniformly distributed in the discrete cube ${\{-1,+1\}^n}$ ), and ${V}$ is some ${d}$ -dimensional subspace of ${{\bf R}^n}$ for some ${0 \leq d \leq n}$ .

It is not hard to compute the second moment of this random variable. Indeed, if ${P = (p_{ij})_{1 \leq i,j \leq n}}$ denotes the orthogonal projection matrix from ${{\bf R}^n}$ to the orthogonal complement ${V^\perp}$ of ${V}$ , then one observes that

$\displaystyle \hbox{dist}(X,V)^2 = X \cdot P X = \sum_{i=1}^n \sum_{j=1}^n x_i x_j p_{ij}$

and so upon taking expectations we see that

$\displaystyle {\Bbb E} \hbox{dist}(X,V)^2 = \sum_{i=1}^n p_{ii} = \hbox{tr} P = n-d \ \ \ \ \ (1)$

since ${P}$ is a rank ${n-d}$ orthogonal projection. So we expect ${\hbox{dist}(X,V)}$ to be about ${\sqrt{n-d}}$ on the average.
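The identities above are easy to check numerically for a small example. The following is only a sanity-check sketch, not part of the argument: the helper names `complement_projection` and `mean_squared_distance` are invented for this illustration, and ${V}$ is taken to be the span of a random Gaussian matrix (an arbitrary choice of generic subspace).

```python
import itertools
import numpy as np

def complement_projection(basis):
    """Orthogonal projection onto the orthogonal complement of span(columns of basis)."""
    q, _ = np.linalg.qr(basis)            # columns of q: orthonormal basis of V
    return np.eye(basis.shape[0]) - q @ q.T

def mean_squared_distance(P):
    """Exact average of X . P X over all sign vectors X in {-1,+1}^n."""
    n = P.shape[0]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        x = np.array(signs)
        total += x @ P @ x
    return total / 2 ** n

rng = np.random.default_rng(0)
n, d = 8, 3
basis = rng.standard_normal((n, d))       # a generic d-dimensional subspace V
P = complement_projection(basis)

# dist(X,V)^2 computed directly by least squares, to compare with X . P X
x = rng.choice([-1.0, 1.0], size=n)
coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
direct = np.sum((x - basis @ coeffs) ** 2)

print(direct, x @ P @ x)                  # two ways of computing dist(X,V)^2
print(mean_squared_distance(P), np.trace(P), n - d)
```

The enumeration is exact (it runs over all ${2^n}$ sign vectors), so it is only practical for small ${n}$ .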

In fact, one has sharp concentration around this value, in the sense that ${\hbox{dist}(X,V) = \sqrt{n-d}+O(1)}$ with high probability. More precisely, we have

Proposition 1 (Large deviation inequality) For any ${t>0}$ , one has

$\displaystyle {\Bbb P}( |\hbox{dist}(X,V) - \sqrt{n-d}| \geq t ) \leq C \exp(- c t^2 )$

for some absolute constants ${C, c > 0}$ .

In fact the constants ${C, c}$ are very civilised; for large ${t}$ one can basically take ${C=4}$ and ${c=1/16}$ , for instance. This type of concentration, particularly for subspaces ${V}$ of moderately large codimension ${n-d}$ , is fundamental to much of my work on random matrices with Van Vu, starting with our first paper (in which this proposition first appears). (For subspaces of small codimension (such as hyperplanes) one has to use other tools to get good results, such as inverse Littlewood-Offord theory or the Berry-Esseen central limit theorem, but that is another story.)
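One can watch this concentration happen in a quick Monte Carlo experiment. The sketch below is illustrative only: the function name `distance_samples`, the choice ${d = n/2}$ , the random choice of ${V}$ , and the sample sizes are all assumptions made for the experiment. It prints, for several ${n}$ , the value ${\sqrt{n-d}}$ next to the empirical mean and standard deviation of ${\hbox{dist}(X,V)}$ .

```python
import numpy as np

def distance_samples(n, d, trials, rng):
    """Sample dist(X, V) for random sign vectors X and one fixed random subspace V."""
    q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal basis of V
    x = rng.choice([-1.0, 1.0], size=(trials, n))      # rows are sign vectors
    proj = x @ q                                       # coordinates of the projection onto V
    # |X|^2 = n for a sign vector, so dist(X,V)^2 = n - |projection onto V|^2
    dist_sq = n - np.sum(proj ** 2, axis=1)
    return np.sqrt(np.clip(dist_sq, 0.0, None))        # clip guards against rounding error

rng = np.random.default_rng(1)
for n in (50, 200, 800):
    d = n // 2
    dist = distance_samples(n, d, trials=4000, rng=rng)
    print(n, np.sqrt(n - d), dist.mean(), dist.std())
```

If Proposition 1 is doing its job, the last column should stay bounded as ${n}$ grows, even though the mean grows like ${\sqrt{n-d}}$ .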

Proposition 1 is an easy consequence of the second moment computation and Talagrand’s inequality, which among other things provides a sharp concentration result for convex Lipschitz functions on the cube ${\{-1,+1\}^n}$ ; since ${\hbox{dist}(x,V)}$ is indeed a convex Lipschitz function, this inequality can be applied immediately. The proof of Talagrand’s inequality is short and can be found in several textbooks (e.g. Alon and Spencer), but I thought I would reproduce the argument here (specialised to the convex case), mostly to force myself to learn the proof properly. Note the concentration of ${O(1)}$ obtained by Talagrand’s inequality is much stronger than what one would get from more elementary tools such as Azuma’s inequality or McDiarmid’s inequality, which would only give concentration of about ${O(\sqrt{n})}$ or so (which is in fact trivial, since the cube ${\{-1,+1\}^n}$ has diameter ${2\sqrt{n}}$ ); the point is that Talagrand’s inequality is very effective at exploiting the convexity of the problem, as well as the Lipschitz nature of the function in all directions, whereas Azuma’s inequality can only easily take advantage of the Lipschitz nature of the function in coordinate directions. On the other hand, Azuma’s inequality works just as well if the ${\ell^2}$ metric is replaced with the larger ${\ell^1}$ metric, and one can conclude that the ${\ell^1}$ distance between ${X}$ and ${V}$ concentrates around its median to a width ${O(\sqrt{n})}$ , which is a more non-trivial fact than the ${\ell^2}$ concentration bound given by that inequality. (The computation of the median of the ${\ell^1}$ distance is more complicated than for the ${\ell^2}$ distance, though, and depends on the orientation of ${V}$ .)

Remark 1 If one makes the coordinates of ${X}$ iid Gaussian variables ${x_i \equiv N(0,1)}$ rather than random signs, then Proposition 1 is much easier to prove; the probability distribution of a Gaussian vector is rotation-invariant, so one can rotate ${V}$ to be, say, ${{\bf R}^d}$ , at which point ${\hbox{dist}(X,V)^2}$ is clearly the sum of ${n-d}$ independent squares of Gaussians (i.e. a chi-square distribution), and the claim follows from direct computation (or one can use the Chernoff inequality). The gaussian counterpart of Talagrand’s inequality is more classical, being essentially due to Lévy, and will also be discussed later in this post.

— 1. Concentration on the cube —

Proposition 1 follows easily from the following statement, that asserts that if a convex set ${A \subset {\bf R}^n}$ occupies a non-trivial fraction of the cube ${\{-1,+1\}^n}$ , then the neighbourhood ${A_t := \{ x \in {\bf R}^n: \hbox{dist}(x,A) \leq t \}}$ will occupy almost all of the cube for ${t \gg 1}$ :

Proposition 2 (Talagrand’s concentration inequality) Let ${A}$ be a convex set in ${{\bf R}^n}$ . Then

$\displaystyle \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq \exp( - c t^2 )$

for all ${t>0}$ and some absolute constant ${c > 0}$ , where ${X \in \{-1,+1\}^n}$ is chosen uniformly from ${\{-1,+1\}^n}$ .

Remark 2 It is crucial that ${A}$ is convex here. If instead ${A}$ is, say, the set of all points in ${\{-1,+1\}^n}$ with fewer than ${n/2-\sqrt{n}}$ ${+1}$'s, then ${\mathop{\bf P}(X \in A)}$ is comparable to ${1}$ , but ${\mathop{\bf P}( X \not \in A_t )}$ only starts decaying once ${t \gg \sqrt{n}}$ , rather than ${t \gg 1}$ . Indeed, it is not hard to show that Proposition 2 implies the variant

$\displaystyle \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq \exp( - c t^2 / n)$

for non-convex ${A}$ (by restricting ${A}$ to ${\{-1,+1\}^n}$ and then passing from ${A}$ to the convex hull, noting that distances to ${A}$ on ${\{-1,+1\}^n}$ may be contracted by a factor of ${O(\sqrt{n})}$ by this latter process); this inequality can also be easily deduced from Azuma’s inequality.

To apply this proposition to the situation at hand, observe that if ${A}$ is the cylindrical region ${\{ x \in {\bf R}^n: \hbox{dist}(x,V) \leq r \}}$ for some ${r}$ , then ${A}$ is convex and ${A_t}$ is contained in ${\{ x \in {\bf R}^n: \hbox{dist}(x,V) \leq r+t \}}$ . Thus

$\displaystyle \mathop{\bf P}(\hbox{dist}(X,V) \leq r) \mathop{\bf P}( \hbox{dist}(X,V) > r+t ) \leq \exp(-ct^2).$

Applying this with ${r := M}$ or ${r := M-t}$ , where ${M}$ is the median value of ${\hbox{dist}(X,V)}$ , one soon obtains concentration around the median:

$\displaystyle \mathop{\bf P}( |\hbox{dist}(X,V) - M| > t ) \leq 4 \exp(-ct^2).$

This is only compatible with (1) if ${M = \sqrt{n-d} + O(1)}$ , and the claim follows.
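To spell out this last step (a sketch, with no attempt to optimise the constants): integrating the tail bound gives

$\displaystyle {\Bbb E} |\hbox{dist}(X,V) - M|^2 = \int_0^\infty 2t\ \mathop{\bf P}( |\hbox{dist}(X,V) - M| > t )\ dt \leq \int_0^\infty 8t \exp(-ct^2)\ dt = \frac{4}{c},$

while (1) asserts that ${({\Bbb E} \hbox{dist}(X,V)^2)^{1/2} = \sqrt{n-d}}$ . The triangle inequality in ${L^2}$ then yields

$\displaystyle |\sqrt{n-d} - M| \leq ({\Bbb E} |\hbox{dist}(X,V) - M|^2)^{1/2} \leq \frac{2}{\sqrt{c}} = O(1),$

and replacing ${M}$ by ${\sqrt{n-d}}$ in the concentration bound only costs an adjustment of the constants ${C, c}$ .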

To prove Proposition 2, we use the exponential moment method. Indeed, it suffices by Markov’s inequality to show that

$\displaystyle \mathop{\bf P}(X \in A) {\Bbb E} \exp( c \hbox{dist}(X,A)^2 ) \leq 1 \ \ \ \ \ (2)$

for a sufficiently small absolute constant ${c > 0}$ (in fact one can take ${c=1/16}$ ).
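In more detail, the reduction is a routine application of Markov’s inequality to the non-negative random variable ${\exp( c \hbox{dist}(X,A)^2 )}$ : since ${X \not \in A_t}$ means that ${\hbox{dist}(X,A) > t}$ , one has

$\displaystyle \mathop{\bf P}( X \not \in A_t ) = \mathop{\bf P}( e^{c \hbox{dist}(X,A)^2} > e^{ct^2} ) \leq e^{-ct^2} {\Bbb E} \exp( c \hbox{dist}(X,A)^2 ),$

and multiplying both sides by ${\mathop{\bf P}(X \in A)}$ and applying (2) gives Proposition 2.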

We prove (2) by an induction on the dimension ${n}$ . The claim is trivial for ${n=0}$ , so suppose ${n \geq 1}$ and the claim has already been proven for ${n-1}$ .

Let us write ${X = (X',x_n)}$ with ${X' \in \{-1,+1\}^{n-1}}$ and ${x_n = \pm 1}$ . For each ${t \in {\bf R}}$ , we introduce the slice ${A_t := \{ x' \in {\bf R}^{n-1}: (x',t) \in A \}}$ (for the rest of this proof, ${A_t}$ denotes this slice rather than the neighbourhood from Proposition 2); each ${A_t}$ is convex. We now try to bound the left-hand side of (2) in terms of ${X', A_t}$ rather than ${X, A}$ . Clearly

$\displaystyle \mathop{\bf P}(X \in A) = \frac{1}{2} [ \mathop{\bf P}( X' \in A_{-1}) + \mathop{\bf P}( X' \in A_{+1} ) ].$

By symmetry we may assume that ${\mathop{\bf P}( X' \in A_{+1} ) \geq \mathop{\bf P}( X' \in A_{-1} )}$ , thus we may write

$\displaystyle \mathop{\bf P}( X' \in A_{\pm 1} ) = p (1 \pm q) \ \ \ \ \ (3)$

where ${p := \mathop{\bf P}(X \in A)}$ and ${0 \leq q \leq 1}$ .

Now we look at ${\hbox{dist}(X,A)^2}$ . For ${t = \pm 1}$ , let ${Y_t \in {\bf R}^{n-1}}$ be the closest point of (the closure of) ${A_t}$ to ${X'}$ , thus

$\displaystyle |X'-Y_t| = \hbox{dist}(X', A_t). \ \ \ \ \ (4)$

Let ${0 \leq \lambda \leq 1}$ be chosen later; then the point ${(1-\lambda) (Y_{x_n}, x_n) + \lambda (Y_{-x_n},-x_n)}$ lies in ${A}$ by convexity, and so

$\displaystyle \hbox{dist}(X,A) \leq |(1-\lambda) (Y_{x_n}, x_n) + \lambda (Y_{-x_n},-x_n) - (X',x_n)|.$

Squaring this and using Pythagoras, one obtains

$\displaystyle \hbox{dist}(X,A)^2 \leq 4 \lambda^2 + |(1-\lambda) (X'-Y_{x_n}) + \lambda (X'-Y_{-x_n})|^2.$
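To verify this, write ${W := (1-\lambda) (Y_{x_n}, x_n) + \lambda (Y_{-x_n},-x_n) - (X',x_n)}$ (the symbol ${W}$ is introduced only for this check). The last coordinate of ${W}$ is

$\displaystyle (1-\lambda) x_n - \lambda x_n - x_n = -2\lambda x_n,$

whose square is ${4\lambda^2}$ since ${x_n = \pm 1}$ , while the first ${n-1}$ coordinates of ${W}$ are

$\displaystyle (1-\lambda) Y_{x_n} + \lambda Y_{-x_n} - X' = -[(1-\lambda) (X'-Y_{x_n}) + \lambda (X'-Y_{-x_n})].$

Adding the squared lengths of these two orthogonal components gives the stated bound on ${\hbox{dist}(X,A)^2}$ .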

As we will shortly be exponentiating the left-hand side, we need to linearise the right-hand side. Accordingly, we will exploit the convexity of the function ${x \mapsto |x|^2}$ to bound

$\displaystyle |(1-\lambda) (X'-Y_{x_n}) + \lambda (X'-Y_{-x_n})|^2 \leq (1-\lambda) |X'-Y_{x_n}|^2 + \lambda |X'-Y_{-x_n}|^2$

and thus by (4)

$\displaystyle \hbox{dist}(X,A)^2 \leq 4 \lambda^2 + (1-\lambda) \hbox{dist}(X',A_{x_n})^2 + \lambda \hbox{dist}(X',A_{-x_n})^2.$

We exponentiate this and take expectations in ${X'}$ (holding ${x_n}$ fixed for now) to get

$\displaystyle {\Bbb E}_{X'} e^{c \hbox{dist}(X,A)^2} \leq e^{4 c \lambda^2} {\Bbb E}_{X'} (e^{c \hbox{dist}(X',A_{x_n})^2})^{1-\lambda} (e^{c \hbox{dist}(X',A_{-x_n})^2})^{\lambda}.$

Meanwhile, from the induction hypothesis and (3) we have

$\displaystyle {\Bbb E}_{X'} e^{c \hbox{dist}(X',A_{x_n})^2} \leq \frac{1}{p(1 + x_n q)}$

and similarly for ${A_{-x_n}}$ . By Hölder’s inequality, we conclude

$\displaystyle {\Bbb E}_{X'} e^{c \hbox{dist}(X,A)^2} \leq e^{4 c \lambda^2} \frac{1}{p (1 + x_n q)^{1-\lambda} (1-x_n q)^\lambda}.$

For ${x_n=+1}$ , the optimal choice of ${\lambda}$ here is ${0}$ , obtaining

$\displaystyle {\Bbb E}_{X'} e^{c \hbox{dist}(X,A)^2} \leq \frac{1}{p (1+q)};$

for ${x_n=-1}$ , the optimal choice of ${\lambda}$ is to be determined. Averaging, we obtain

$\displaystyle {\Bbb E}_{X} e^{c \hbox{dist}(X,A)^2} \leq \frac{1}{2} [ \frac{1}{p (1+q)} + e^{4 c \lambda^2} \frac{1}{p (1 - q)^{1-\lambda} (1 + q)^\lambda} ]$

so to establish (2), it suffices to pick ${0 \leq \lambda \leq 1}$ such that

$\displaystyle \frac{1}{1+q} + e^{4 c \lambda^2} \frac{1}{(1 - q)^{1-\lambda} (1 + q)^\lambda} \leq 2.$

If ${q}$ is bounded away from zero, then by choosing ${\lambda=1}$ we would obtain the claim if ${c}$ is small enough, so we may take ${q}$ to be small. But then a Taylor expansion allows us to conclude if we take ${\lambda}$ to be a constant multiple of ${q}$ , and again pick ${c}$ to be small enough. The point is that ${\lambda=0}$ already almost works up to errors of ${O(q^2)}$ , and increasing ${\lambda}$ from zero to a small non-zero quantity will decrease the LHS by about ${O(\lambda q) - O(c \lambda^2)}$ . [By optimising everything using first-year calculus, one eventually gets the constant ${c=1/16}$ claimed earlier.]
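The final inequality can also be sanity-checked numerically with ${c = 1/16}$ . The sketch below is not a proof: it only searches a finite grid of ${q}$ and ${\lambda}$ (the grid sizes are arbitrary choices, and ${q=1}$ is left out to avoid a division by zero), and reports the largest value over ${q}$ of the smallest left-hand side found over ${\lambda}$ .

```python
import numpy as np

c = 1.0 / 16.0
qs = np.linspace(0.0, 0.999, 1000)        # q = 1 is excluded to avoid dividing by zero
lams = np.linspace(0.0, 1.0, 2001)        # candidate values of lambda in [0, 1]

def lhs(q, lam):
    """1/(1+q) + exp(4 c lam^2) / ((1-q)^(1-lam) (1+q)^lam)."""
    return 1.0 / (1.0 + q) + np.exp(4.0 * c * lam ** 2) / (
        (1.0 - q) ** (1.0 - lam) * (1.0 + q) ** lam
    )

# Evaluate on the whole (q, lambda) grid, then pick the best lambda for each q.
values = lhs(qs[:, None], lams[None, :])
best = values.min(axis=1)
best_lam = lams[values.argmin(axis=1)]

worst = best.max()
print(worst)                               # should not exceed 2
print(list(zip(qs[:5], best_lam[:5])))     # minimising lambda for the smallest q
```

The second printout lets one compare the minimising ${\lambda}$ for small ${q}$ with the heuristic above that ${\lambda}$ should be a constant multiple of ${q}$ .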

Remark 3 Talagrand’s inequality is in fact far more general than this; it applies to arbitrary products of probability spaces, rather than just to ${\{-1,+1\}^n}$ , and to non-convex ${A}$ , but the notion of distance needed to define ${A_t}$ becomes more complicated; the proof of the inequality, though, is essentially the same. Besides its applicability to convex Lipschitz functions, Talagrand’s inequality is also very useful for controlling combinatorial Lipschitz functions ${F}$ which are “locally certifiable” in the sense that whenever ${F(x)}$ is larger than some threshold ${t}$ , then there exist some bounded number ${f(t)}$ of coefficients of ${x}$ which “certify” this fact (in the sense that ${F(y) \geq t}$ for any other ${y}$ which agrees with ${x}$ on these coefficients). See e.g. the text of Alon and Spencer for a more precise statement and some applications.

— 2. Gaussian concentration —

As mentioned earlier, there are analogous results when the uniform distribution on the cube ${\{-1,+1\}^n}$ is replaced by other distributions, such as the ${n}$ -dimensional Gaussian distribution. In fact, in this case convexity is not needed:

Proposition 3 (Gaussian concentration inequality) Let ${A}$ be a measurable set in ${{\bf R}^d}$ . Then

$\displaystyle \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq \exp( - c t^2 )$

for all ${t>0}$ and some absolute constant ${c > 0}$ , where ${X \equiv N(0,1)^d}$ is a random Gaussian vector.

This inequality can be deduced from Lévy’s classical concentration of measure inequality for the sphere (with the optimal constant), but we will give an alternate proof due to Maurey and Pisier. It suffices to prove the following variant of Proposition 3:

Proposition 4 (Gaussian concentration inequality for Lipschitz functions) Let ${f: {\bf R}^d \rightarrow {\bf R}}$ be a function which is Lipschitz with constant ${1}$ (i.e. ${|f(x)-f(y)| \leq |x-y|}$ for all ${x,y \in {\bf R}^d}$ ). Then we have

$\displaystyle {\Bbb P}( |f(X) - {\Bbb E} f(X)| \geq t ) \leq 2\exp( - ct^2 )$

for all ${t>0}$ and some absolute constant ${c>0}$ , where ${X \equiv N(0,1)^d}$ is a random Gaussian vector. [Informally, Lipschitz functions of Gaussian variables concentrate as if they were Gaussian themselves; for comparison, Talagrand’s inequality implies that convex Lipschitz functions of Bernoulli variables concentrate as if they were Gaussian.]
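As an illustration, one can test Proposition 4 by simulation on the function ${f(x) := |x|}$ , which is Lipschitz with constant ${1}$ . The sketch below makes several assumptions that are not in the statement: the dimension and sample size are arbitrary, the sample mean stands in for ${{\Bbb E} f(X)}$ , and the value ${c = 2/\pi^2}$ is the (non-optimal) constant that the proof given in this post produces when the exponential moment bound is taken with ${C = \pi^2/8}$ .

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 100, 20000
x = rng.standard_normal((trials, d))      # rows are samples of X ~ N(0,1)^d

f = np.linalg.norm(x, axis=1)             # f(x) = |x| is Lipschitz with constant 1
deviation = np.abs(f - f.mean())          # sample mean stands in for E f(X)

c = 2.0 / np.pi ** 2                      # assumed constant, see the lead-in
for t in (1.0, 2.0, 3.0):
    empirical = np.mean(deviation >= t)   # empirical tail probability
    bound = 2.0 * np.exp(-c * t ** 2)     # right-hand side of Proposition 4
    print(t, empirical, bound)
```

Each printed row shows a deviation level, the observed tail frequency, and the bound; the bound is far from tight for this particular ${f}$ , which is consistent with the constant not being optimal.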

Indeed, if one sets ${f(x) := \hbox{dist}(x,A)}$ , and splits into the cases whether ${{\Bbb E} f(X) \geq t/2}$ or ${{\Bbb E} f(X) < t/2}$ , one obtains either ${{\Bbb P}( X \in A ) \leq 2 \exp(-ct^2/4)}$ (in the former case) or ${{\Bbb P}( X \not \in A_t ) \leq 2 \exp(-ct^2/4)}$ , and so

$\displaystyle \mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq 2\exp( - c t^2/4 )$

in either case. Also, since ${\mathop{\bf P}(X \in A) + \mathop{\bf P}( X \not \in A_t ) \leq 1}$ , we have ${\mathop{\bf P}(X \in A) \mathop{\bf P}( X \not \in A_t ) \leq 1/4}$ . Putting the two bounds together gives the claim.

Now we prove Proposition 4. By the epsilon regularisation argument we may take ${f}$ to be smooth, and so by the Lipschitz property we have

$\displaystyle |\nabla f(x)| \leq 1 \ \ \ \ \ (5)$

for all ${x}$ . By subtracting off the mean we may assume ${{\Bbb E} f = 0}$ . By replacing ${f}$ with ${-f}$ if necessary it suffices to control the upper tail probability ${{\Bbb P}( f(X) \geq t )}$ for ${t > 0}$ .

We again use the exponential moment method. It suffices to show that

$\displaystyle {\Bbb E} \exp( t f(X) ) \leq \exp( C t^2 )$

for some absolute constant ${C}$ .

Now we use a variant of the square and rearrange trick. Let ${Y}$ be an independent copy of ${X}$ . Since ${{\Bbb E} f(Y) = 0}$ , we see from Jensen’s inequality that ${{\Bbb E} \exp( - t f(Y) ) \geq 1}$ , and so

$\displaystyle {\Bbb E} \exp( t f(X) ) \leq {\Bbb E} \exp( t (f(X)-f(Y)) ).$
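Spelled out, Jensen’s inequality for the convex function ${u \mapsto \exp(-tu)}$ gives

$\displaystyle {\Bbb E} \exp( - t f(Y) ) \geq \exp( - t {\Bbb E} f(Y) ) = 1,$

and since ${X}$ and ${Y}$ are independent, the expectation of the product factors:

$\displaystyle {\Bbb E} \exp( t (f(X)-f(Y)) ) = {\Bbb E} \exp( t f(X) )\ {\Bbb E} \exp( - t f(Y) ) \geq {\Bbb E} \exp( t f(X) ).$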

With an eye to exploiting (5), one might seek to use the fundamental theorem of calculus to write

$\displaystyle f(X) - f(Y) = \int_0^1 \frac{d}{d\lambda} f( (1-\lambda) Y + \lambda X )\ d\lambda.$

But actually it turns out to be smarter to use a circular arc of integration, rather than a line segment:

$\displaystyle f(X) - f(Y) = \int_0^{\pi/2} \frac{d}{d\theta} f( Y \cos \theta + X \sin \theta )\ d\theta.$

The reason for this is that ${X_\theta := Y \cos \theta + X \sin \theta}$ is another gaussian random variable equivalent to ${N(0,1)^d}$ , as is its derivative ${X'_\theta := -Y \sin \theta + X \cos \theta}$ ; furthermore, and crucially, these two random variables are independent.
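One way to see this: the pair ${(X_\theta, X'_\theta)}$ is a linear image of the Gaussian vector ${(X,Y)}$ , hence jointly Gaussian with mean zero, and since ${X, Y}$ are independent with identity covariance one computes

$\displaystyle {\Bbb E} X_\theta X_\theta^T = (\cos^2 \theta + \sin^2 \theta) I = I, \quad {\Bbb E} X'_\theta (X'_\theta)^T = (\sin^2 \theta + \cos^2 \theta) I = I,$

$\displaystyle {\Bbb E} X_\theta (X'_\theta)^T = (- \cos \theta \sin \theta + \sin \theta \cos \theta) I = 0.$

Jointly Gaussian vectors with vanishing cross-covariance are independent, which gives the claim. (Equivalently, ${(X,Y) \mapsto (X_\theta, X'_\theta)}$ is a rotation of ${{\bf R}^{2d}}$ , and the Gaussian measure on ${{\bf R}^{2d}}$ is rotation-invariant.)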

To exploit this, we first use Jensen’s inequality to bound

$\displaystyle \exp( t (f(X) - f(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp( \frac{\pi t}{2} \frac{d}{d\theta} f( X_\theta ) )\ d\theta.$

Applying the chain rule and taking expectations, we have

$\displaystyle {\Bbb E} \exp( t (f(X) - f(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} {\Bbb E} \exp( \frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta )\ d\theta.$

Let us condition ${X_\theta}$ to be fixed, then ${X'_\theta \equiv N(0,1)^d}$ ; applying (5), we conclude that ${\frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta}$ is normally distributed with standard deviation at most ${\frac{\pi t}{2}}$ . As such we have

$\displaystyle {\Bbb E} \exp( \frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta ) \leq \exp( C t^2 )$

for some absolute constant ${C}$ ; integrating out the conditioning on ${X_\theta}$ we obtain the claim.
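For the record, the argument gives explicit (though not optimal) constants. A Gaussian ${Z}$ of mean zero and variance ${\sigma^2}$ has ${{\Bbb E} \exp(sZ) = \exp(s^2 \sigma^2/2)}$ ; after conditioning on ${X_\theta}$ , the variable ${\nabla f( X_\theta ) \cdot X'_\theta}$ is such a Gaussian with ${\sigma^2 = |\nabla f(X_\theta)|^2 \leq 1}$ , so

$\displaystyle {\Bbb E} \exp( \frac{\pi t}{2} \nabla f( X_\theta ) \cdot X'_\theta ) \leq \exp( \frac{\pi^2 t^2}{8} ),$

that is, one can take ${C = \pi^2/8}$ . Writing ${u > 0}$ for the deviation level (a symbol used here only to avoid a clash with the exponential moment parameter ${t}$ ), Markov’s inequality then gives

$\displaystyle {\Bbb P}( f(X) \geq u ) \leq \exp( - t u + \frac{\pi^2 t^2}{8} )$

for every ${t > 0}$ , and the choice ${t := 4u/\pi^2}$ yields the bound ${\exp( - 2u^2/\pi^2 )}$ , so Proposition 4 holds with ${c = 2/\pi^2}$ .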

41 comments


9 June, 2009 at 8:39 pm

student

Dear Prof Tao,

How do you pass from line segment integral to circular arc integral?

by change of variable $\lambda=\sin \theta$ ?

thanks

9 June, 2009 at 9:28 pm

Terence Tao

Dear student,

The circular arc identity is established directly from the fundamental theorem of calculus $F(\pi/2) - F(0) = \int_0^{\pi/2} \frac{d}{d\theta} F(\theta)\ d\theta$ , rather than from applying a change of variables from the line segment identity (which doesn’t work). The point is that in higher dimensions, there are many inequivalent ways to apply the fundamental theorem of calculus to expand f(Y)-f(X) – one for each path from X to Y, and the key is to select the right one.

15 June, 2009 at 10:35 pm

student

Dear Prof. Tao,

I remember once you wrote that you will teach random matrix theory next semester. my request is that

Would you please start to post little by little some notes related to prerequisite material for that course? if we start to study during summer and get enough background it would be perfect for us.

thank you very much

16 June, 2009 at 7:10 am

Terence Tao

Actually, the class will be starting in the winter quarter (next January). I have not yet decided exactly what the topics will be, but apart from a basic graduate education in analysis, probability, linear algebra, and combinatorics it should be fairly self-contained.

16 June, 2009 at 11:59 am

Oded

One minor comment: the statement in Remark 2 seems almost vacuous. After all, the distance between *any* two points in {-1,+1}^n is at most 2\sqrt{n}. Maybe the correct behavior should be n^{1/4}?

16 June, 2009 at 12:25 pm

Terence Tao

Hmm, good point. On the other hand, the Azuma/McDiarmid argument also works if the $\ell^2$ metric is replaced by the much larger $\ell^1$ metric (although the mean or median of $\hbox{dist}(X,V)$ is somewhat harder to compute in that case). I’ll adjust the text to reflect this. (One could make the case that Talagrand’s inequality is to $\ell^2$ as Azuma’s inequality or McDiarmid’s inequality is to $\ell^1$ ; note the crucial use of Pythagoras’ theorem in the middle of the proof of Talagrand’s inequality.)

16 June, 2009 at 8:58 pm

vedadi

Dear Prof. Tao,

just after ” square and rearrange trick”, you say that you use Jensen’s inequality.

I guess your convex function is $\exp{-tx}$ for $t>0.$

in that case why do not we have $E(\exp{-tf(Y)}) \geq 1$ ?

Jensen’s ineq is $E(\phi(X)) \geq \phi (E(X))$ , right?

thanks

16 June, 2009 at 9:16 pm

Terence Tao

Dear vedadi,

Yes, that was a typo, the expectation should be at least 1, rather than at most 1. (The subsequent inequalities did not have the typo, though.)

17 June, 2009 at 9:11 pm

student

Dear Prof Tao,

When we say ”let Y be an independent copy of X….” How do we guarantee the existence of that copy? Don’t we have some philosophical questions here too?

thanks

18 June, 2009 at 8:59 am

Terence Tao

Dear student,

Independent copies of a random variable can always be generated by the product measure construction,

http://en.wikipedia.org/wiki/Product_measure

This requires extending the underlying probability space to a finer one, but probability theory is designed in such a way that one can always do this without affecting any probabilistically meaningful quantities, such as probabilities of events or expectations, moments, correlations, and distributions of random variables. (As such, one of the guiding philosophies of probability theory is to not work with the underlying probability space if at all possible. The situation is somewhat analogous to differential geometry, which is designed to be invariant under changes of coordinates, and so the philosophy is often to avoid explicit manipulation of such coordinates whenever possible.)

Kallenberg’s “Foundations of modern probability” is a good reference for these sorts of foundational issues. However, it should be said that while such issues are of course important, especially from a philosophical or logical point of view, they rarely affect day-to-day usage of probability, much as the construction of the real numbers does not play a prominent role in real arithmetic, or as the construction of the Lebesgue integral does not play a prominent role in computation and estimation of integrals.

18 June, 2009 at 11:39 am

Anonymous

Dear Prof. Tao,

Could you please write a blog post on ”Large Deviation Principles” this summer by explaining the theory by examples and from your point of view?

thanks

2 July, 2009 at 6:05 am

ebrima saye

You are inspirational!!

2 July, 2009 at 7:57 pm

student

Dear Prof. Tao,

I could not find in which part I am doing a mistake in the following argument :

we know that if two random variables have the same distribution then their mean is the same. now

Let $Z$ be a standard normal rv. and $Z^{\prime}$ be an independent copy of it. then $Z \cdot Z$ and $Z \cdot Z^{\prime}$ have the same distribution (?) but $E[Z \cdot Z]=1$ and $E[Z \cdot Z^{\prime}]=E[Z]E[Z^{\prime}]=0$

what is going on here?

thanks

3 July, 2009 at 7:58 am

Terence Tao

Dear student:

The same issue crops up if Z is a Bernoulli distribution (equal to +1 with probability 1/2, and -1 with probability 1/2). This example is elementary enough that you can work out everything from first principles; if you do so, you should see the error.

Dear Ben: X decomposes into two orthogonal components, PX and X-PX. X-PX is the orthogonal projection of X to V, so the distance from X to V is the length of PX. Since P is an orthogonal projection, $\|PX\|^2 = X \cdot PX$ .

Alternatively, one can pick a coordinate system so that V is a coordinate plane, in which case all calculations can be done explicitly.

3 July, 2009 at 4:56 am

ben

Dear Prof Tao

could you explain why

$dist (X, V)^2= X \cdot PX$ ?

thanks

5 July, 2009 at 3:33 pm

Anonymous

Dear Prof. Tao,

I did not see at which step you are using the independence of $X_{\theta}$ and $X_{\theta}^{\prime} .$

and in the last paragraf, when you say ” let’s condition $X_{\theta}$ to be fixed….” do you mean you that you are using the following fact:

$E[......]=E[E[.....\| X_{\theta}]]$

thanks

6 July, 2009 at 12:57 pm

Terence Tao

Yes, this is what is happening when one is conditioning on $X_\theta$ . The independence of $X_\theta$ and $X'_\theta$ is used precisely at this step, to ensure that $X'_\theta$ retains its normal distribution even after conditioning on a fixed value of $X_\theta$ .

6 July, 2009 at 11:19 pm

lutfu

Dear Prof. Tao,

Let $X, Y$ be two independent standart normal r.vs. then

$E[\exp(X \cdot Y)]=E[E[\exp(X \cdot Y) | Y]]$

since $E[\exp(X \cdot Y) | Y=t]=\exp(t^2/2)$ we have

$E[\exp(X \cdot Y) | Y]=\exp(Y^2 / 2)$ therefore

$E[\exp(X \cdot Y)]=E[\exp(Y^2 / 2) ]$

is this argument true?

in which step we are using independence of $X$ and $Y?$

thanks

7 July, 2009 at 6:33 am

Terence Tao

Yes, this is correct. The independence is needed to establish the identity ${\Bbb E}( \exp(X \cdot Y) | Y=t ) = \exp(t^2/2)$ (because one needs X to remain normally distributed after conditioning on the event Y=t).

8 July, 2009 at 4:07 pm

lutfu

Dear Prof Tao,

Let $X_{1}, X_{2},...$ be a seq of i.i.d. rvs. with mean zero.

as we know from $SLL$ as $N \rightarrow \infty$ the distribution of

$\frac{S_n}{n}$ concentrates on a point, i.e. limiting distribution is $\delta_{0}.$ (a.s. implies in distribution)

Under the existence of second moment we know the limiting distribution of

$\frac{S_n}{\sqrt{n}}$ is standard normal, i.e Gaussian measure.

my question is that do we know the distribution of
$\frac{S_n}{n^{\alpha}}$ for $0< \alpha <1$ under the suitable conditions? to which measure do they concentrate?

thanks

8 July, 2009 at 7:57 pm

Anonymous

Dear Prof. Tao,

Could you please give a proof for the Azuma-Hoeffding inequality?

thanks

3 January, 2010 at 10:22 pm

254A, Notes 1: Concentration of measure « What’s new

[…] by induction on dimension . In the case when are Bernoulli variables, this can be done; see this previous blog post. In the general case, it turns out that in order to close the induction properly, one must […]

5 January, 2010 at 4:20 pm

254A, Notes 0: A review of probability theory « What’s new

13 June, 2010 at 1:23 pm

Laurent Jacques

Dear Terence,

Thank you for this verrry interesting post. I’d like to share with you some elements.

From what I know, Sub-Gaussian Random Variable (SGRV), such as Bernoulli RV, centered Uniform RV, Gaussian RV, … are defined as follows:

X is a SGRV if there exists one c>0 such that for all real value s,
E[ \exp (s X) ] <= exp (c^2 s^2/2)

See for instance V. V. Buldygin publications, e.g. "Sub-Gaussian random variables", doi:10.1007/BF01087176

If I'm not wrong, one interesting property of SBRV is that any linear combination of SGRV is SGRV with known characteristics. In particular, the "Gaussian standard", defined as the minimal "c" such that the definition above holds, behaves as the standard deviation of Gaussian RV with respect to linear combination of SGRV.

This implies also that if you rotate a sub-Gaussian random vector with iid SGRV components, you get also a sub-Gaussian random vector. The components of this latter are perhaps not iid anymore but at least they have all the same "Gaussian standard".

My question is thus the following:

**Do you think that Theorem 4 above could be easily generalized to sub-Gaussian random vectors of components with unit gaussian standard?**

It seems indeed that in many places in you proof, you use already key ingredients for this generalization:
* the sub-gaussian inequality definition;
* linear combinations (rotation) of Gaussian RVs which are themselves Gaussian.

Or perhaps this generalization is already known in the community (I didn't see it anyway in Talagrand/Ledoux's books)?

Best,
Laurent

18 August, 2010 at 12:43 am

Anonymous

Dear Prof. Tao
I have two questions about the proof for Proposition 4.
Firstly, It seems that there is a typo when you bound
exp(t(F(X)-F(Y))) by using Jensen Inequality, please check it if possible.
Secondly, the aim is
Eexp(tF(X))<=exp(Ct^2),
but the proof only tells us
Eexp(tF(X))<=exp(Ct),
Please give some explaination if possible.
Thanks.

[Corrected, thanks – T.]

20 August, 2010 at 2:50 am

Anonymous

Dear Prof. Tao
Have you checked the bound of exp(t(F(X)-F(Y))) in the proof of Proposition 4? There is still a typo in my opinion.

$\exp(t(f(X) - f(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp( \frac{\pi t}{2} \frac{d}{d \theta} f(X_\theta) )\ d\theta$

[Corrected, thanks – T.]

21 August, 2010 at 2:16 am

Anonymous

Dear Prof. Tao
Can you tell me how to get $E\exp(\frac{\pi t}{2}\nabla f(X_\theta)X'_\theta)\leq\exp(Ct^2)$ ? Some hints will be OK.

21 August, 2010 at 7:30 am

Terence Tao

If X is a Gaussian with mean zero and variance $\sigma^2$ , then the moment generating function is ${\bf E} \exp(tX) = \exp(t^2 \sigma^2/2)$ , as can be seen by completing the square.

22 August, 2010 at 4:41 am

Anonymous

Dear Prof. Tao
I get it. Thanks for your reply. So far I have got a lot of interesting and useful mathemetical knowledge from your blog. Your earnest attitude and diligence impressed me. Thanks again.

16 November, 2010 at 10:04 pm

The Determinant of Random Bernoulli Matrices « Tsourolampis Blog

[…] [6] Talagrand’s inequality https://terrytao.wordpress.com/2009/06/09/talagrands-concentration-inequality/ […]

25 April, 2011 at 2:44 pm

Anonymous

Dear Prof. Tao,

is Talagrand concentration inequality still valid if we have independent but not identically distributed normally distributed component of $\mathbb{X}$ ?

Thanks

8 May, 2012 at 1:15 am

Anonymous

Hi Professor Tao,

Does dist(X,V) = euclidean distance(X,V)?

Thanks.

[Yes – T.]

12 May, 2012 at 5:32 pm

sabyasachi chatterjee

Hello Professor Tao,
I have had to investigate the concentration of the log determinant of a matrix. Specifically, suppose M1,..Mk are random matrices p by p, iid from a given distribution(can assume that the entries of the matrix are bounded a.s or other regularity conds if needed) whose expected value is the identity matrix I. Now if I consider the matrix mean..(M1+…Mk)/k and look at its log determinant..will it concentrate around its mean? Will the mean go to zero as k go to infinity? Hence will it concentrate around 0? Is there any known result on this? Thanks

9 July, 2012 at 10:53 pm

mathepsilon

Does that mean, the small constant c in the gaussian concentration inequaltiy for lipschitz function is (pi^2)/8? Thank you! I really can not find this constant elsewhere.

10 July, 2012 at 9:08 am

Terence Tao

Well, the arguments given in the above post are not necessarily optimal; there may be other arguments that give a better value of c. (Note also that the value of c in Proposition 3 and Proposition 4 may differ, for instance the above arguments give a loss of a factor of 4 when passing from the latter to the former.)

I would try Ledoux’s book, as it has a number of proofs of this inequality and may discuss the question of optimal constants somewhere. (For instance, log-Sobolev methods tend to be quite sharp in general, and should yield good results in this case.)

17 December, 2013 at 2:11 pm

Anonymous

Dear Terry Tao,

can you get limit behavior of dist(X,V) (using Talagrand inequality or otherwise)?

Proposition 1 suggests that perhaps (after properly normalizing), the distribution of dist(X,V) converges to the normal distribution. Is it true?

I think it would nicely illustrate the strength of the presented method — similarly to the comparison of Chernoff bound and Central limit theorem.

Thanks,

Robert Samal

17 December, 2013 at 4:42 pm

Terence Tao

When X is a gaussian vector, then dist(X,V)^2 is a chi-squared distribution (with the number of degrees of freedom equal to the codimension of V), and so the central limit theorem shows that dist(X,V)^2 (and thus also dist(X,V)) is approximately gaussian when the codimension is large. For small codimension, one gets a chi distribution instead.

In the non-gaussian case, one can presumably get similar results either through a Lindeberg exchange argument, or through a central limit theorem for quadratic forms, although I don’t know if this has been done explicitly in the literature.

28 August, 2014 at 11:39 pm

Anonymous

Is proposition 4 valid for idd rvs with exponential moments?

12 September, 2015 at 2:46 pm

Anonymous

Small typo in props 3 and 4: n should be d (or vice versa)

[Corrected, thanks – T.]

15 December, 2018 at 4:53 pm

Ray

Dear Prof. Tao,

I am confused about some detail of the proof to proposition 2. How is $dist(x,A)$ defined when $A$ is an open set? And how is (4) defined if $A_{\pm 1}$ are empty sets?

Thanks!

15 December, 2018 at 8:41 pm

Ray

never mind, I realized that these details can be worked out and does not affect the proof.
