Suppose we have a large number of scalar random variables {X_1,\ldots,X_n}, which each have bounded size on average (e.g. their mean and variance could be {O(1)}). What can one then say about their sum {S_n := X_1+\ldots+X_n}? If each individual summand {X_i} varies in an interval of size {O(1)}, then their sum of course varies in an interval of size {O(n)}. However, a remarkable phenomenon, known as concentration of measure, asserts that assuming a sufficient amount of independence between the component variables {X_1,\ldots,X_n}, this sum sharply concentrates in a much narrower range, typically in an interval of size {O(\sqrt{n})}. This phenomenon is quantified by a variety of large deviation inequalities that give upper bounds (often exponential in nature) on the probability that such a combined random variable deviates significantly from its mean. The same phenomenon applies not only to linear expressions such as {S_n = X_1+\ldots+X_n}, but more generally to nonlinear combinations {F(X_1,\ldots,X_n)} of such variables, provided that the nonlinear function {F} is sufficiently regular (in particular, if it is Lipschitz, either separately in each variable, or jointly in all variables).

The basic intuition here is that it is difficult for a large number of independent variables {X_1,\ldots,X_n} to “work together” to simultaneously pull a sum {X_1+\ldots+X_n} or a more general combination {F(X_1,\ldots,X_n)} too far away from its mean. Independence here is the key; concentration of measure results typically fail if the {X_i} are too highly correlated with each other.

There are many applications of the concentration of measure phenomenon, but we will focus on a specific application which is useful in the random matrix theory topics we will be studying, namely on controlling the behaviour of random {n}-dimensional vectors with independent components, and in particular on the distance between such random vectors and a given subspace.

Once one has a sufficient amount of independence, the concentration of measure tends to be sub-gaussian in nature; thus the probability that one is at least {\lambda} standard deviations from the mean tends to drop off like {C \exp(-c\lambda^2)} for some {C,c > 0}. In particular, one is {O( \log^{1/2} n )} standard deviations from the mean with high probability, and {O( \log^{1/2+\epsilon} n)} standard deviations from the mean with overwhelming probability. Indeed, concentration of measure is our primary tool for ensuring that various events hold with overwhelming probability (other moment methods can give high probability, but have difficulty ensuring overwhelming probability).

This is only a brief introduction to the concentration of measure phenomenon. A systematic study of this topic can be found in this book by Ledoux.

— 1. Linear combinations, and the moment method —

We begin with the simple setting of studying a sum {S_n := X_1+\ldots+X_n} of random variables. As we shall see, these linear sums are particularly amenable to the moment method, though to use the higher moments, we will require stronger independence assumptions (and, naturally, we will need more moments to be finite or bounded). As such, we will take the opportunity to use this topic (large deviation inequalities for sums of random variables) to give a tour of the moment method, which we will return to when we consider the analogous questions for the bulk spectral distribution of random matrices.

In this section we shall concern ourselves primarily with bounded random variables; in the next section we describe the basic truncation method that can allow us to extend from the bounded case to the unbounded case (assuming suitable decay hypotheses).

The zeroth moment method gives a crude upper bound on the probability that {S_n} is non-zero,

\displaystyle {\bf P}( S_n \neq 0 ) \leq \sum_{i=1}^n {\bf P}( X_i \neq 0 ) \ \ \ \ \ (1)

but in most cases this bound is worse than the trivial bound {{\bf P}(S_n \neq 0) \leq 1}. This bound, however, will be useful when performing the truncation trick, which we will discuss below.

The first moment method is somewhat better, giving the bound

\displaystyle {\bf E} |S_n| \leq \sum_{i=1}^n {\bf E} |X_i|

which when combined with Markov’s inequality gives the rather weak large deviation inequality

\displaystyle {\bf P}( |S_n| \geq \lambda ) \leq \frac{1}{\lambda} \sum_{i=1}^n {\bf E} |X_i|. \ \ \ \ \ (2)

Weak as it is, this bound is sometimes sharp. For instance, if the {X_i} are all equal to a single signed Bernoulli variable {X \in \{-1,+1\}}, then {S_n = nX}, and so {|S_n| = n}, and so (2) is sharp when {\lambda=n}. The problem here is a complete lack of independence; the {X_i} are all simultaneously positive or simultaneously negative, causing huge fluctuations in the value of {S_n}.

Informally, one can view (2) as the assertion that {S_n} typically has size {S_n = O( \sum_{i=1}^n |X_i| )}.

The first moment method also shows that

\displaystyle {\bf E} S_n = \sum_{i=1}^n {\bf E} X_i

and so we can normalise out the means using the identity

\displaystyle S_n - {\bf E} S_n = \sum_{i=1}^n (X_i - {\bf E} X_i).

Replacing the {X_i} by {X_i - {\bf E} X_i} (and {S_n} by {S_n - {\bf E} S_n}) we may thus assume for simplicity that all the {X_i} have mean zero.

Now we consider what the second moment method gives us. We square {S_n} and take expectations to obtain

\displaystyle {\bf E} |S_n|^2 = \sum_{i=1}^n \sum_{j=1}^n {\bf E} X_i \overline{X_j}.

If we assume that the {X_i} are pairwise independent (in addition to having mean zero), then {{\bf E} X_i \overline{X_j}} vanishes unless {i=j}, in which case this expectation is equal to {{\bf Var}(X_i)}. We thus have

\displaystyle {\bf Var}(S_n) = \sum_{i=1}^n {\bf Var}(X_i) \ \ \ \ \ (3)

which when combined with Chebyshev’s inequality (and the mean zero normalisation) yields the large deviation inequality

\displaystyle {\bf P}( |S_n| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^n {\bf Var}(X_i). \ \ \ \ \ (4)

Without the normalisation that the {X_i} have mean zero, we obtain

\displaystyle {\bf P}( |S_n - {\bf E} S_n| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^n {\bf Var}(X_i). \ \ \ \ \ (5)

Informally, this is the assertion that {S_n} typically has size {S_n = {\bf E} S_n + O( (\sum_{i=1}^n {\bf Var}(X_i))^{1/2} )}, if we have pairwise independence. Note also that we do not need the full strength of the pairwise independence assumption; the slightly weaker hypothesis of being pairwise uncorrelated would have sufficed.

The inequality (5) is sharp in two ways. Firstly, we cannot expect any significant concentration in any range narrower than the standard deviation {O( (\sum_{i=1}^n {\bf Var}(X_i))^{1/2} )}, as this would likely contradict (3). Secondly, the quadratic-type decay in {\lambda} in (5) is sharp given the pairwise independence hypothesis. For instance, suppose that {n = 2^m-1}, and that {X_j := (-1)^{a_j \cdot Y}}, where {Y} is drawn uniformly at random from the cube {\{0,1\}^m}, and {a_1,\ldots,a_n} are an enumeration of the non-zero elements of {\{0,1\}^m}. Then a little Fourier analysis shows that each {X_j} for {1 \leq j \leq n} has mean zero and variance {1}, and that the {X_j} are pairwise independent; but {S_n} is equal to {(n+1) {\bf I}( Y = 0 ) - 1}, which is equal to {n} with probability {1/(n+1)}; this is despite the standard deviation of {S_n} being just {\sqrt{n}}. This shows that (5) is essentially (i.e. up to constants) sharp here when {\lambda = n}.
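For readers who would like to see this construction concretely, here is a short numerical sketch of the above example (purely illustrative, and not part of the argument; the choice of {m} and the variable names are arbitrary):

```python
import itertools
import numpy as np

# Sketch of the pairwise-independent example above: X_j = (-1)^{a_j . Y},
# with Y uniform on {0,1}^m and a_1,...,a_n the nonzero elements of {0,1}^m.
m = 4
n = 2**m - 1
cube = np.array(list(itertools.product([0, 1], repeat=m)))   # all 2^m values of Y
a = cube[1:]                                                  # the n nonzero vectors a_j

# X[y, j] = (-1)^{a_j . Y} for each possible value y of Y (Y is uniform over rows).
X = (-1.0) ** (cube @ a.T % 2)
S = X.sum(axis=1)

# Each X_j has mean 0 and variance 1, and distinct X_j's are uncorrelated
# (hence pairwise independent, as they take only the values +1 and -1).
print("max |mean|:", np.abs(X.mean(axis=0)).max())
print("max off-diagonal correlation:", np.abs(X.T @ X / 2**m - np.eye(n)).max())

# Yet S equals n with probability 1/(n+1), far beyond its standard deviation sqrt(n).
print("P(S = n) =", np.mean(S == n), " vs 1/(n+1) =", 1 / (n + 1))
print("std of S:", S.std(), " vs sqrt(n) =", np.sqrt(n))
```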

Now we turn to higher moments. Let us assume that the {X_i} are normalised to have mean zero and variance at most {1}, and are also almost surely bounded in magnitude by some {K}: {|X_i| \leq K}. (The interesting regime here is when {K \geq 1}, otherwise the variance is in fact strictly smaller than {1}.) To simplify the exposition very slightly we will assume that the {X_i} are real-valued; the complex-valued case is very analogous (and can also be deduced from the real-valued case) and is left to the reader.

Let us also assume that the {X_1,\ldots,X_n} are {k}-wise independent for some even positive integer {k}. With this assumption, we can now estimate the {k^{th}} moment

\displaystyle {\bf E} |S_n|^k = \sum_{1 \leq i_1,\ldots,i_k \leq n} {\bf E} X_{i_1} \ldots X_{i_{k}}.

To compute the expectation of the product, we can use the {k}-wise independence, but we need to divide into cases (analogous to the {i \neq j} and {i=j} cases in the second moment calculation above) depending on how various indices are repeated. If one of the {X_{i_j}} appears only once, then the entire expectation is zero (since {X_{i_j}} has mean zero), so we may assume that each of the {X_{i_j}} appears at least twice. In particular, there are at most {k/2} distinct {X_j} which appear. If exactly {k/2} such terms appear, then from the unit variance assumption we see that the expectation has magnitude at most {1}; more generally, if {k/2-r} terms appear, then from the unit variance assumption and the upper bound by {K} we see that the expectation has magnitude at most {K^{2r}}. This leads to the upper bound

\displaystyle {\bf E} |S_n|^k \leq \sum_{r=0}^{k/2} K^{2r} N_r

where {N_r} is the number of ways one can assign integers {i_1,\ldots,i_k} in {\{1,\ldots,n\}} such that each {i_j} appears at least twice, and such that exactly {k/2-r} integers appear.

We are now faced with the purely combinatorial problem of estimating {N_r}. We will use a somewhat crude bound. There are {\binom{n}{k/2-r} \leq n^{k/2-r} / (k/2-r)!} ways to choose {k/2-r} integers from {\{1,\ldots,n\}}. Each of the integers {i_j} has to come from one of these {k/2-r} integers, leading to the crude bound

\displaystyle N_r \leq \frac{n^{k/2-r}}{(k/2-r)!} (k/2-r)^k

which after using a crude form {n! \geq n^n e^{-n}} of Stirling’s formula gives

\displaystyle N_r \leq (en)^{k/2-r} (k/2)^{k/2+r},

and so

\displaystyle {\bf E} |S_n|^k \leq (enk/2)^{k/2} \sum_{r=0}^{k/2} (\frac{K^2 k}{en})^r.

If we make the mild assumption

\displaystyle K^2 \leq n/k \ \ \ \ \ (6)

then from the geometric series formula we conclude that

\displaystyle {\bf E} |S_n|^k \leq 2 (enk/2)^{k/2}

(say), which leads to the large deviation inequality

\displaystyle {\bf P}( |S_n| \geq \lambda \sqrt{n} ) \leq 2 (\frac{\sqrt{ek/2}}{\lambda})^k. \ \ \ \ \ (7)

This should be compared with (2), (5). As {k} increases, the rate of decay in the {\lambda} parameter improves, but to compensate for this, the range that {S_n} concentrates in grows slowly, to {O(\sqrt{nk})} rather than {O(\sqrt{n})}.

Remark 1 Note how it was important here that {k} was even. Odd moments, such as {{\bf E} S_n^3}, can be estimated, but due to the lack of the absolute value sign, these moments do not give much usable control on the distribution of the {S_n}. One could be more careful in the combinatorial counting than was done here, but the net effect of such care is only to improve the unspecified constant {C} (which can easily be made explicit, but we will not do so here).

Now suppose that the {X_1,\ldots,X_n} are not just {k}-wise independent for any fixed {k}, but are in fact jointly independent. Then we can apply (7) for any {k} obeying (6). We can optimise in {k} by setting {\sqrt{nk}} to be a small multiple of {\lambda}, and conclude the gaussian-type bound

\displaystyle {\bf P}( |S_n| \geq \lambda \sqrt{n} ) \leq C \exp( - c \lambda^2 ) \ \ \ \ \ (8)

for some absolute constants {C, c > 0}, provided that {|\lambda| \leq c \sqrt{n} / K} for some small {c}. (Note that the bound (8) is trivial for {|\lambda| \gg \sqrt{n}}, so we may assume that {\lambda} is small compared to this quantity.) Thus we see that while control of each individual moment {{\bf E} |S_n|^k} only gives polynomial decay in {\lambda}, by using all the moments simultaneously one can obtain square-exponential decay (i.e. subgaussian type decay).
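As an informal illustration of this optimisation (a numerical sketch only; the constants here are not tuned), one can simply evaluate the right-hand side of (7) over the admissible even values of {k} and watch the gaussian-type decay emerge:

```python
import numpy as np

# Evaluate the k-th moment tail bound (7), namely 2 (sqrt(e k/2)/lambda)^k, for
# variables normalised as in the text (K = 1, so condition (6) becomes k <= n),
# and minimise over the admissible even k.  This simply evaluates the bound
# derived above; it is not an independent derivation.
n = 10_000
ks = np.arange(2, n + 1, 2, dtype=float)          # even k with k <= n

for lam in np.arange(1.0, 8.5, 0.5):
    # work in logarithms to avoid overflow for badly chosen k
    log_bounds = np.log(2.0) + ks * (0.5 * np.log(np.e * ks / 2.0) - np.log(lam))
    best = np.exp(log_bounds.min())
    print(f"lambda = {lam:4.1f}   optimised bound = {best:10.3e}   "
          f"exp(-lambda^2/2) = {np.exp(-lam**2 / 2):10.3e}")
```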

By using Stirling’s formula (Exercise 2 from Notes 0a) one can show that the quadratic decay in (8) cannot be improved; see Exercise 2 below.

It was a little complicated to manage such large moments {{\bf E} |S_n|^k}. A slicker way to proceed (but one which exploits the joint independence and commutativity more strongly) is to work instead with the exponential moments {{\bf E} \exp( t S_n )}, which can be viewed as a sort of generating function for the power moments. A useful lemma in this regard is

Lemma 1 (Hoeffding’s lemma) Let {X} be a scalar variable taking values in an interval {[a,b]}. Then for any {t>0},

\displaystyle {\bf E} e^{tX} \leq e^{t {\bf E} X} (1 + O( t^2 {\bf Var}(X) \exp( O( t (b-a) ) ) )). \ \ \ \ \ (9)

In particular

\displaystyle {\bf E} e^{tX} \leq e^{t {\bf E} X} \exp( O( t^2 (b-a)^2 ) ). \ \ \ \ \ (10)

Proof: It suffices to prove the first inequality, as the second then follows using the bound {{\bf Var}(X) \leq (b-a)^2} and from various elementary estimates.

By subtracting the mean from {X,a,b} we may normalise {{\bf E}(X)=0}. By dividing {X,a,b} by {b-a} (and multiplying {t} by {b-a} to balance) we may assume that {b-a=1}, which implies that {X = O(1)}. We then have the Taylor expansion

\displaystyle e^{tX} = 1 + tX + O( t^2 X^2 \exp( O(t) ) )

which on taking expectations gives

\displaystyle {\bf E} e^{tX} = 1 + O( t^2 {\bf Var}(X) \exp(O(t)) )

and the claim follows. \Box
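As a quick numerical sanity check of (10) (a sketch only; we use the concrete constant {1/2} in place of the unspecified constant in the {O()} notation, which certainly suffices), one can compare both sides exactly for a Bernoulli variable:

```python
import numpy as np

# Numerical check of Hoeffding's lemma (10) for X ~ Bernoulli(p) on [a,b] = [0,1],
# using exp(t^2 (b-a)^2 / 2) as a concrete (non-optimal) instance of the O() bound.
worst_ratio = 0.0
for p in np.linspace(0.01, 0.99, 50):
    for t in np.linspace(0.1, 5.0, 50):
        mgf = (1 - p) + p * np.exp(t)              # E e^{tX}, computed exactly
        bound = np.exp(t * p) * np.exp(t**2 / 2)   # e^{t E X} * exp(t^2 (b-a)^2 / 2)
        worst_ratio = max(worst_ratio, mgf / bound)

# The ratio should never exceed 1 if the bound holds.
print("largest observed E[e^{tX}] / bound :", worst_ratio)
```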

Exercise 1 Show that the {O(t^2(b-a)^2)} factor in (10) can be replaced with {t^2 (b-a)^2/8}, and that this is sharp. (Hint: use Jensen’s inequality.)

We now have the fundamental Chernoff bound:

Theorem 2 (Chernoff inequality) Let {X_1,\ldots,X_n} be independent scalar random variables with {|X_i| \leq K} almost surely, with mean {\mu_i} and variance {\sigma_i^2}. Then for any {\lambda > 0}, one has

\displaystyle {\bf P}( |S_n - \mu| \geq \lambda \sigma ) \leq C \max( \exp( - c \lambda^2 ), \exp( - c \lambda \sigma / K ) ) \ \ \ \ \ (11)

for some absolute constants {C, c > 0}, where {\mu := \sum_{i=1}^n \mu_i} and {\sigma^2 := \sum_{i=1}^n \sigma_i^2}.

Proof: By taking real and imaginary parts we may assume that the {X_i} are real. By subtracting off the mean (and adjusting {K} appropriately) we may assume that {\mu_i=0} (and so {\mu=0}); dividing the {X_i} (and {\sigma_i}) through by {K} we may assume that {K=1}. By symmetry it then suffices to establish the upper tail estimate

\displaystyle {\bf P}( S_n \geq \lambda \sigma ) \leq C \max( \exp( - c \lambda^2 ), \exp( - c \lambda \sigma ) )

(with slightly different constants {C, c}).

To do this, we shall first compute the exponential moments

\displaystyle {\bf E} \exp( t S_n )

where {0 \leq t \leq 1} is a real parameter to be optimised later. Factorising the exponential of the sum into a product and using the joint independence hypothesis, we conclude that

\displaystyle {\bf E} \exp( t S_n ) = \prod_{i=1}^n {\bf E} \exp(t X_i).

To compute {{\bf E}\exp(tX_i)}, we use the hypothesis that {|X_i| \leq 1} together with (9) (and the mean zero normalisation) to obtain

\displaystyle {\bf E} \exp(tX_i) \leq \exp( O( t^2 \sigma_i^2 ) ).

Thus we have

\displaystyle {\bf E} \exp( t S_n ) \leq \exp( O( t^2 \sigma^2 ) )

and thus by Markov’s inequality

\displaystyle {\bf P}( S_n \geq \lambda \sigma ) \leq \exp( O( t^2 \sigma^2 ) - t \lambda \sigma ).

If we optimise this in {t}, subject to the constraint {0 \leq t \leq 1}, we obtain the claim. \Box
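To make the exponential moment method concrete, here is a small numerical sketch (an illustration with arbitrarily chosen parameters, not part of the proof) that carries out the above computation exactly for a sum of Bernoulli variables and compares the resulting bound with an empirical tail estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential moment (Chernoff) bound for S_n = X_1 + ... + X_n with X_i ~ Bernoulli(p).
# Here K = 1, mu = n*p, sigma^2 = n*p*(1-p); the parameters are just for illustration.
n, p = 2000, 0.3
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

def chernoff_bound(lam, ts=np.linspace(0.01, 1.0, 200)):
    # Markov: P(S_n - mu >= lam*sigma) <= min_t e^{-t(mu + lam*sigma)} (E e^{tX})^n,
    # with E e^{tX} = 1 - p + p e^t computed exactly.
    log_mgf = n * np.log(1 - p + p * np.exp(ts))
    return np.exp((log_mgf - ts * (mu + lam * sigma)).min())

# Monte Carlo estimate of the same upper tail.
samples = rng.binomial(n, p, size=200_000)
for lam in [1.0, 2.0, 3.0, 4.0]:
    empirical = np.mean(samples - mu >= lam * sigma)
    print(f"lambda = {lam:3.1f}   empirical = {empirical:9.2e}   "
          f"Chernoff bound = {chernoff_bound(lam):9.2e}")
```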

Informally, the Chernoff inequality asserts that {S_n} is sharply concentrated in the range {\mu + O(\sigma)}. The bounds here are fairly sharp, at least when {\lambda} is not too large:

Exercise 2 Let {0 < p < 1/2} be fixed independently of {n}, and let {X_1,\ldots,X_n} be iid copies of a Bernoulli random variable that equals {1} with probability {p}, thus {\mu_i=p} and {\sigma_i^2 = p(1-p)}, and so {\mu = np} and {\sigma^2 = np(1-p)}. Using Stirling’s formula (Notes 0a), show that

\displaystyle {\bf P}( |S_n - \mu| \geq \lambda \sigma ) \geq c \exp( - C \lambda^2 )

for some absolute constants {C, c > 0} and all {\lambda \leq c\sigma}. What happens when {\lambda} is much larger than {\sigma}?

Exercise 3 Show that the term {\exp( - c \lambda \sigma / K )} in (11) can be replaced with {(\lambda K/\sigma)^{-c \lambda \sigma / K}} (which is superior when {\lambda K \gg \sigma}). (Hint: Allow {t} to exceed {1}.) Compare this with the results of Exercise 2.

Exercise 4 (Hoeffding’s inequality) Let {X_1,\ldots,X_n} be independent real variables, with {X_i} taking values in an interval {[a_i,b_i]}, and let {S_n := X_1 + \ldots + X_n}. Show that one has

\displaystyle {\bf P}( |S_n - {\bf E} S_n| \geq \lambda \sigma ) \leq C \exp( - c \lambda^2 )

for some absolute constants {C, c > 0}, where {\sigma^2 := \sum_{i=1}^n |b_i-a_i|^2}.

Remark 2 As we can see, the exponential moment method is very slick compared to the power moment method. Unfortunately, due to its reliance on the identity {e^{X+Y} = e^X e^Y}, this method relies very strongly on commutativity of the underlying variables, and as such will not be as useful when dealing with noncommutative random variables, and in particular with random matrices. Nevertheless, we will still be able to apply the Chernoff bound to good effect to various components of random matrices, such as rows or columns of such matrices.

The full assumption of joint independence is not completely necessary for Chernoff-type bounds to hold. It suffices to have a martingale difference sequence, in which each {X_i} can depend on the preceding variables {X_1,\ldots,X_{i-1}}, but which always has mean zero even when the preceding variables are conditioned out. More precisely, we have Azuma’s inequality:

Theorem 3 (Azuma’s inequality) Let {X_1,\ldots,X_n} be a sequence of scalar random variables with {|X_i| \leq 1} almost surely. Assume also that we have the martingale difference property

\displaystyle {\bf E}( X_i | X_1,\ldots,X_{i-1} ) = 0 \ \ \ \ \ (12)

almost surely for all {i=1,\ldots,n} (here we assume the existence of a suitable disintegration in order to define the conditional expectation, though in fact it is possible to state and prove Azuma’s inequality without this disintegration). Then for any {\lambda > 0}, the sum {S_n := X_1 + \ldots + X_n} obeys the large deviation inequality

\displaystyle {\bf P}(|S_n| \geq \lambda \sqrt{n} ) \leq C \exp( - c \lambda^2 ) \ \ \ \ \ (13)

for some absolute constants {C, c > 0}.

A typical example of {S_n} here is a dependent random walk, in which the magnitude and probabilities of the {i^{th}} step are allowed to depend on the outcome of the preceding {i-1} steps, but where the mean of each step is always fixed to be zero.

Proof: Again, we can reduce to the case when the {X_i} are real, and it suffices to establish the upper tail estimate

\displaystyle {\bf P}(S_n \geq \lambda \sqrt{n} ) \leq C \exp( - c \lambda^2 ).

Note that {|S_n| \leq n} almost surely, so we may assume without loss of generality that {\lambda \leq \sqrt{n}}.

Once again, we consider the exponential moment {{\bf E} \exp( t S_n )} for some parameter {t>0}. We write {S_n = S_{n-1} + X_n}, so that

\displaystyle {\bf E} \exp( t S_n ) = {\bf E} \exp( t S_{n-1} ) \exp( t X_n ).

We do not have independence between {S_{n-1}} and {X_n}, so cannot split the expectation as in the proof of Chernoff’s inequality. Nevertheless we can use conditional expectation as a substitute. We can rewrite the above expression as

\displaystyle {\bf E} {\bf E}( \exp( t S_{n-1} ) \exp( t X_n ) | X_1,\ldots,X_{n-1} ).

The quantity {S_{n-1}} is deterministic once we condition on {X_1,\ldots,X_{n-1}}, and so we can pull it out of the conditional expectation:

\displaystyle {\bf E} \exp( t S_{n-1} ) {\bf E}( \exp( t X_n ) | X_1,\ldots,X_{n-1} ).

Applying (10) to the conditional expectation, we have

\displaystyle {\bf E}( \exp( t X_n ) | X_1,\ldots,X_{n-1} ) \leq \exp(O(t^2))

and

\displaystyle {\bf E} \exp( t S_n ) \leq \exp(O(t^2)) {\bf E} \exp(t S_{n-1}).

Iterating this argument gives

\displaystyle {\bf E} \exp( t S_n ) \leq \exp(O(n t^2))

and thus by Markov’s inequality

\displaystyle {\bf P}(S_n \geq \lambda \sqrt{n} ) \leq \exp( O(n t^2) - t \lambda \sqrt{n} ).

Optimising in {t} gives the claim. \Box
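To illustrate Azuma's inequality, here is a small simulation sketch of the dependent random walk mentioned earlier (the particular history-dependent step size rule below is an arbitrary choice; the only requirements are the martingale difference property and the bound {|X_i| \leq 1}):

```python
import numpy as np

rng = np.random.default_rng(1)

def dependent_walk(n):
    # A martingale difference sequence: X_i = eps_i * c_i, where eps_i is a fresh
    # fair sign and the magnitude c_i in [0,1] depends on the history.  Then
    # E(X_i | X_1,...,X_{i-1}) = 0 and |X_i| <= 1, as required by Theorem 3.
    s = 0.0
    for i in range(1, n + 1):
        c = 1.0 / (1.0 + abs(s) / np.sqrt(i))    # arbitrary history-dependent magnitude
        eps = 1.0 if rng.random() < 0.5 else -1.0
        s += eps * c
    return s

n, trials = 300, 2000
sums = np.array([dependent_walk(n) for _ in range(trials)])
for lam in [1.0, 2.0, 3.0]:
    print(f"lambda = {lam}: empirical P(|S_n| >= lambda*sqrt(n)) = "
          f"{np.mean(np.abs(sums) >= lam * np.sqrt(n)):.4f}")
```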

Exercise 5 Suppose we replace the hypothesis {|X_i| \leq 1} in Azuma’s inequality with the more general hypothesis {|X_i| \leq c_i} for some scalars {c_i > 0}. Show that we still have (13), but with {\sqrt{n}} replaced by {(\sum_{i=1}^n c_i^2)^{1/2}}.

Remark 3 The exponential moment method is also used frequently in harmonic analysis to deal with lacunary exponential sums, or sums involving Rademacher functions (which are the analogue of lacunary exponential sums for characteristic {2}). Examples here include Khintchine’s inequality (and the closely related Kahane’s inequality). The exponential moment method also combines very well with log-Sobolev inequalities, as we shall see below (basically because the logarithm inverts the exponential), as well as with the closely related hypercontractivity inequalities.

— 2. The truncation method —

To summarise the discussion so far, we have identified a number of large deviation inequalities to control a sum {S_n = X_1 + \ldots + X_n}:

  • The zeroth moment method bound (1), which requires no moment assumptions on the {X_i} but is only useful when {X_i} is usually zero, and has no decay in {\lambda}.
  • The first moment method bound (2), which only requires absolute integrability on the {X_i}, but has only a linear decay in {\lambda}.
  • The second moment method bound (5), which requires second moment bounds and pairwise independence of the {X_i}, and gives a quadratic decay in {\lambda}.
  • Higher moment bounds (7), which require boundedness and {k}-wise independence, and give a {k^{th}} power decay in {\lambda} (or quadratic-exponential decay, after optimising in {k}).
  • Exponential moment bounds such as (11) or (13), which require boundedness and joint independence (or martingale behaviour), and give quadratic-exponential decay in {\lambda}.

We thus see that the bounds with the strongest decay in {\lambda} require strong boundedness and independence hypotheses. However, one can often partially extend these strong results from the case of bounded random variables to that of unbounded random variables (provided one still has sufficient control on the decay of these variables) by a simple but fundamental trick, known as the truncation method. The basic idea here is to take each random variable {X_i} and split it as {X_i = X_{i,\leq N} + X_{i,>N}}, where {N} is a truncation parameter to be optimised later (possibly in a manner depending on {n}),

\displaystyle X_{i,\leq N} := X_i {\bf I}(|X_i| \leq N)

is the restriction of {X_i} to the event that {|X_i| \leq N} (thus {X_{i,\leq N}} vanishes when {X_i} is too large), and

\displaystyle X_{i,> N} := X_i {\bf I}(|X_i| > N)

is the complementary portion of {X_i}, i.e. the restriction of {X_i} to the event {|X_i| > N}. One can similarly split {S_n = S_{n,\leq N} + S_{n,>N}} where

\displaystyle S_{n,\leq N} = X_{1,\leq N} + \ldots + X_{n,\leq N}

and

\displaystyle S_{n,> N} = X_{1,>N} + \ldots + X_{n,>N}.

The idea is then to estimate the tail of {S_{n,\leq N}} and {S_{n,>N}} by two different means. With {S_{n,\leq N}}, the point is that the variables {X_{i,\leq N}} have been made bounded by fiat, and so the more powerful large deviation inequalities can now be put into play. With {S_{n,>N}}, in contrast, the underlying variables {X_{i,>N}} are certainly not bounded, but they tend to have small zeroth and first moments, and so the bounds based on those moment methods tend to be powerful here. (Readers who are familiar with harmonic analysis may recognise this type of divide and conquer argument as an interpolation argument.)
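In code the split is of course trivial; the sketch below (illustrative only, using a heavy-tailed Pareto variable as an example) simply displays the two regimes: the truncated part has finite variance, while the tail part is rarely non-zero and has a small first moment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Truncation split X = X_{<=N} + X_{>N} for a heavy-tailed example (Pareto with
# exponent 1.5: finite mean, infinite variance).  Purely illustrative.
X = 1 + rng.pareto(1.5, size=1_000_000)
N = 100.0
X_low = np.where(np.abs(X) <= N, X, 0.0)     # X * I(|X| <= N)
X_high = np.where(np.abs(X) > N, X, 0.0)     # X * I(|X| > N)

print("Var of truncated part (bounded, so finite):", X_low.var())
print("P(tail part nonzero):", np.mean(X_high != 0.0))    # zeroth moment of the tail
print("E|tail part|:", np.abs(X_high).mean())             # first moment of the tail
```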

Let us begin with a simple application of this method.

Theorem 4 (Weak law of large numbers) Let {X_1,X_2,\ldots} be iid scalar random variables with {X_i \equiv X} for all {i}, where {X} is absolutely integrable. Then {S_n/n} converges in probability to {{\bf E} X}.

Proof: By subtracting {{\bf E} X} from {X} we may assume without loss of generality that {X} has mean zero. Our task is then to show that {{\bf P}( |S_n| \geq \epsilon n ) = o(1)} for all fixed {\epsilon > 0}.

If {X} has finite variance, then the claim follows from (5). If {X} has infinite variance, we cannot apply (5) directly, but we may perform the truncation method as follows. Let {N} be a large parameter to be chosen later, and split {X_i = X_{i,\leq N} + X_{i,> N}}, {S_n = S_{n,\leq N} + S_{n,> N}} (and {X = X_{\leq N} + X_{>N}}) as discussed above. The variable {X_{\leq N}} is bounded and thus has bounded variance; also, from the dominated convergence theorem we see that {|{\bf E} X_{\leq N}| \leq \epsilon/4} (say) if {N} is large enough. From (5) we conclude that

\displaystyle {\bf P}( |S_{n,\leq N}| \geq \epsilon n / 2 ) = o(1)

(where the rate of decay here depends on {N} and {\epsilon}). Meanwhile, to deal with the tail {X_{>N}} we use (2) to conclude that

\displaystyle {\bf P}( |S_{n,>N}| \geq \epsilon n / 2 ) \leq \frac{2}{\epsilon} {\bf E} |X_{>N}|.

But by the dominated convergence theorem (or monotone convergence theorem), we may make {{\bf E} |X_{>N}|} as small as we please (say, smaller than {\delta > 0}) by taking {N} large enough. Summing, we conclude that

\displaystyle {\bf P}( |S_{n}| \geq \epsilon n ) \leq \frac{2}{\epsilon} \delta + o(1);

since {\delta} is arbitrary, we obtain the claim. \Box

A more sophisticated variant of this argument (which I gave in this earlier blog post, which also has some further discussion and details) gives

Theorem 5 (Strong law of large numbers) Let {X_1,X_2,\ldots} be iid scalar random variables with {X_i \equiv X} for all {i}, where {X} is absolutely integrable. Then {S_n/n} converges almost surely to {{\bf E} X}.

Proof: We may assume without loss of generality that {X} is real, since the complex case then follows by splitting into real and imaginary parts. By splitting {X} into positive and negative parts, we may furthermore assume that {X} is non-negative. (Of course, by doing so, we can no longer normalise {X} to have mean zero, but for us the non-negativity will be more convenient than the zero mean property.) In particular, {S_n} is now non-decreasing in {n}.

Next, we apply a sparsification trick. Let {0 < \epsilon < 1}. Suppose that we knew that, almost surely, {S_{n_m}/{n_m}} converged to {{\bf E} X} for {n=n_m} of the form {n_m := \lfloor (1+\epsilon)^m \rfloor} for some integer {m}. Then, for all other values of {n}, we see that asymptotically, {S_n/n} can only fluctuate by a multiplicative factor of {1+O(\epsilon)}, thanks to the monotone nature of {S_n}. Because of this and countable additivity, we see that it suffices to show that {S_{n_m}/{n_m}} converges to {{\bf E} X}. Actually, it will be enough to show that almost surely, one has {|S_{n_m}/{n_m} - {\bf E} X| \leq \epsilon} for all but finitely many {m}.

Fix {\epsilon}. As before, we split {X = X_{>N_m} + X_{\leq N_m}} and {S_{n_m} = S_{n_m,>N_m} + S_{n_m,\leq N_m}}, but with the twist that we now allow {N = N_m} to depend on {m}. Then for {N_m} large enough we have {|{\bf E} X_{\leq N_m} - {\bf E} X| \leq \epsilon/2} (say), by dominated convergence. Applying (5) as before, we see that

\displaystyle {\bf P}( |S_{n_m,\leq N_m}/{n_m} - {\bf E} X| > \epsilon ) \leq \frac{C_\epsilon}{n_m} {\bf E} |X_{\leq N_m}|^2

for some {C_\epsilon} depending only on {\epsilon} (the exact value is not important here). To handle the tail, we will not use the first moment bound (2) as done previously, but now turn to the zeroth-moment bound (1) to obtain

\displaystyle {\bf P}( S_{n_m, >N_m} \neq 0 ) \leq n_m {\bf P}( |X| > N_m );

summing, we conclude

\displaystyle {\bf P}( |S_{n_m}/{n_m} - {\bf E} X| > \epsilon ) \leq \frac{C_\epsilon}{n_m} {\bf E} |X_{\leq N_m}|^2 + n_m {\bf P}( |X| > N_m ).

Applying the Borel-Cantelli lemma (Exercise 1 from Notes 0), we see that we will be done as long as we can choose {N_m} such that

\displaystyle \sum_{m=1}^\infty \frac{1}{n_m} {\bf E} |X_{\leq N_m}|^2

and

\displaystyle \sum_{m=1}^\infty n_m {\bf P}( |X| > N_m )

are both finite. But this can be accomplished by setting {N_m := n_m} and interchanging the sum and expectations (writing {{\bf P}( |X| > N_m )} as {{\bf E} {\bf I}(|X| > N_m)}) and using the lacunary nature of the {n_m}. \Box

To give another illustration of the truncation method, we extend a version of the Chernoff bound to the subgaussian case.

Proposition 6 Let {X_1,\ldots,X_n \equiv X} be iid copies of a subgaussian random variable {X}, thus {X} obeys a bound of the form

\displaystyle {\bf P}(|X| \geq t) \leq C \exp( - c t^2 ) \ \ \ \ \ (14)

for all {t>0} and some {C,c>0}. Let {S_n := X_1+\ldots+X_n}. Then for any sufficiently large {A} (independent of {n}) we have

\displaystyle {\bf P}( |S_n - n {\bf E} X| \geq A n ) \leq C_A \exp( - c_A n )

for some constants {C_A, c_A} depending on {A,C,c}. Furthermore, {c_A} grows linearly in {A} as {A \rightarrow \infty}.

Proof: By subtracting the mean from {X} we may normalise {{\bf E} X = 0}. We perform a dyadic decomposition

\displaystyle X_i = X_{i,0} + \sum_{m=1}^\infty X_{i,m}

where {X_{i,0} := X_i {\bf I}(|X_i| \leq 1)} and {X_{i,m} := X_i {\bf I}(2^{m-1} < |X_i| \leq 2^m)}. We similarly split

\displaystyle S_n = S_{n,0} + \sum_{m=1}^\infty S_{n,m}

where {S_{n,m} = \sum_{i=1}^n X_{i,m}}. Then by the union bound and the pigeonhole principle we have

\displaystyle {\bf P}( |S_n| \geq A n ) \leq \sum_{m=0}^\infty {\bf P}( |S_{n,m}| \geq \frac{A}{100(m+1)^2} n )

(say). Each {X_{i,m}} is clearly bounded in magnitude by {2^m}; from the subgaussian hypothesis one can also verify that the mean and variance of {X_{i,m}} are at most {C' \exp( - c' 2^{2m} )} for some {C', c' > 0}. If {A} is large enough, an application of the Chernoff bound (11) (or more precisely, the refinement in Exercise 3) then gives (after some computation)

\displaystyle {\bf P}( |S_{n,m}| \geq \frac{A}{100(m+1)^2} n ) \leq C' 2^{-m} \exp( - c' A n )

(say) for some {C', c' > 0}, and the claim follows. \Box
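The rapid decay of the means and variances of the dyadic pieces is easy to observe numerically; the following sketch (illustrative only, taking {X} to be a standard gaussian as a representative subgaussian variable) tabulates them:

```python
import numpy as np

rng = np.random.default_rng(3)

# Dyadic pieces X_m = X * I(2^{m-1} < |X| <= 2^m) of a standard gaussian X.
# Their first moments and variances decay roughly like exp(-c 4^m), as used above.
X = rng.standard_normal(2_000_000)
for m in range(1, 5):
    piece = np.where((np.abs(X) > 2 ** (m - 1)) & (np.abs(X) <= 2 ** m), X, 0.0)
    print(f"m = {m}:  E|X_m| = {np.abs(piece).mean():.2e}   variance = {piece.var():.2e}")
```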

Exercise 6 Show that the hypothesis that {A} is sufficiently large can be replaced by the hypothesis that {A>0} is independent of {n}. Hint: There are several approaches available. One can adapt the above proof; one can modify the proof of the Chernoff inequality directly; or one can figure out a way to deduce the small {A} case from the large {A} case.

Exercise 7 Show that the subgaussian hypothesis can be generalised to a sub-exponential tail hypothesis

\displaystyle {\bf P}(|X| \geq t) \leq C \exp( - c t^p )

provided that {p>1}. Show that the result also extends to the case {0 < p \leq 1}, except with the exponent {\exp( - c_A n )} replaced by {\exp( - c_A n^{p-\epsilon} )} for some {\epsilon > 0}. (I do not know if the {\epsilon} loss can be removed, but it is easy to see that one cannot hope to do much better than this, just by considering the probability that {X_1} (say) is already as large as {An}.)

— 3. Lipschitz combinations —

In the preceding discussion, we had only considered the linear combination {X_1+\ldots+X_n} of independent variables {X_1,\ldots,X_n}. Now we consider more general combinations {F(X)}, where we write {X := (X_1,\ldots,X_n)} for short. Of course, to get any non-trivial results we must make some regularity hypotheses on {F}. It turns out that a particularly useful regularity hypothesis here is a Lipschitz hypothesis – that small variations in {X} lead to small variations in {F(X)}. A simple example of this is McDiarmid’s inequality:

Theorem 7 (McDiarmid’s inequality) Let {X_1,\ldots,X_n} be independent random variables taking values in ranges {R_1,\ldots,R_n}, and let {F: R_1 \times \ldots \times R_n \rightarrow {\bf C}} be a function with the property that if one freezes all but the {i^{th}} coordinate of {F(x_1,\ldots,x_n)} for some {1 \leq i \leq n}, then {F} only fluctuates by at most {c_i > 0}, thus

\displaystyle |F(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_n) - F(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n)| \leq c_i

for all {x_j \in R_j} ({1 \leq j \leq n}) and {x'_i \in R_i}. Then for any {\lambda > 0}, one has

\displaystyle {\bf P}( |F(X) - {\bf E} F(X)| \geq \lambda \sigma ) \leq C \exp(- c \lambda^2 )

for some absolute constants {C,c>0}, where {\sigma^2 := \sum_{i=1}^n c_i^2}.

Proof: We may assume that {F} is real. By symmetry, it suffices to show the one-sided estimate

\displaystyle {\bf P}( F(X) - {\bf E} F(X) \geq \lambda \sigma ) \leq C \exp(- c \lambda^2 ). \ \ \ \ \ (15)

To compute this quantity, we again use the exponential moment method. Let {t > 0} be a parameter to be chosen later, and consider the exponential moment

\displaystyle {\bf E} \exp( t F(X) ). \ \ \ \ \ (16)

To compute this, let us condition {X_1,\ldots,X_{n-1}} to be fixed, and look at the conditional expectation

\displaystyle {\bf E}(\exp( t F(X) ) | X_1,\ldots,X_{n-1} ).

We can simplify this as

\displaystyle {\bf E}( \exp( t Y) | X_1,\ldots,X_{n-1} ) \exp( t {\bf E}(F(X)|X_1,\ldots,X_{n-1}) )

where

\displaystyle Y := F(X) - {\bf E}( F(X)|X_1,\ldots,X_{n-1}).

For {X_1,\ldots,X_{n-1}} fixed, {tY} only fluctuates by at most {t c_n} and has mean zero. Applying (10), we conclude that

\displaystyle {\bf E}( \exp( t Y) | X_1,\ldots,X_{n-1} ) \leq \exp( O( t^2 c_n^2 ) ).

Integrating out the conditioning, we see that we have upper bounded (16) by

\displaystyle \exp(O(t^2 c_n^2)) {\bf E} \exp( t {\bf E}(F(X)|X_1,\ldots,X_{n-1}) ).

We observe that {{\bf E}(F(X)|X_1,\ldots,X_{n-1})} is a function {F_{n-1}(X_1,\ldots,X_{n-1})} of {X_1,\ldots,X_{n-1}}, where {F_{n-1}} obeys the same hypotheses as {F} (but for {n-1} instead of {n}). We can then iterate the above computation {n} times and eventually upper bound (16) by

\displaystyle \exp(\sum_{i=1}^n O(t^2 c_i^2) ) \exp( t {\bf E} F(X) ),

which we rearrange as

\displaystyle {\bf E} \exp( t (F(X) - {\bf E} F(X)) ) \leq \exp( O(t^2 \sigma^2) ),

and thus by Markov’s inequality

\displaystyle {\bf P}( F(X) - {\bf E} F(X) \geq \lambda \sigma ) \leq \exp( O( t^2 \sigma^2 ) - t \lambda \sigma ).

Optimising in {t} then gives the claim. \Box

Exercise 8 Show that McDiarmid’s inequality implies Hoeffding’s inequality (Exercise 4).

Remark 4 One can view McDiarmid’s inequality as a tensorisation of Hoeffding’s lemma, as it leverages the latter lemma for a single random variable to establish an analogous result for {n} random variables. It is possible to apply this tensorisation trick to random variables taking values in more sophisticated metric spaces than an interval {[a,b]}, leading to a class of concentration of measure inequalities known as transportation cost-information inequalities, which will not be discussed here.
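As a concrete illustration of the bounded differences hypothesis (a simulation sketch only, with an arbitrary choice of {F}), consider the number of empty bins obtained when {n} balls are thrown independently and uniformly into {n} bins; relocating a single ball changes this count by at most {1}, so Theorem 7 applies with {c_i = 1} and predicts fluctuations of size {O(\sqrt{n})}:

```python
import numpy as np

rng = np.random.default_rng(4)

# F(X_1,...,X_n) = number of empty bins when ball i goes into bin X_i (uniform on
# {0,...,n-1}).  Moving one ball changes F by at most 1, so c_i = 1 and sigma = sqrt(n).
def empty_bins(n):
    bins = rng.integers(0, n, size=n)
    return n - np.unique(bins).size

for n in [100, 400, 1600, 6400]:
    vals = np.array([empty_bins(n) for _ in range(2000)])
    print(f"n = {n:5d}:  mean = {vals.mean():8.1f}   std = {vals.std():6.2f}"
          f"   std/sqrt(n) = {vals.std() / np.sqrt(n):.3f}")
```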

The most powerful concentration of measure results, though, do not just exploit Lipschitz type behaviour in each individual variable, but joint Lipschitz behaviour. Let us first give a classical instance of this, in the special case when the {X_1,\ldots,X_n} are gaussian variables. A key property of gaussian variables is that any linear combination of independent gaussians is again a gaussian:

Exercise 9 Let {X_1,\ldots,X_n} be independent real gaussian variables with {X_i = N(\mu_i,\sigma_i^2)_{\bf R}}, and let {c_1,\ldots,c_n} be real constants. Show that {c_1 X_1 +\ldots + c_n X_n} is a real gaussian with mean {\sum_{i=1}^n c_i \mu_i} and variance {\sum_{i=1}^n |c_i|^2 \sigma_i^2}.

Show that the same claims also hold with complex gaussians and complex constants {c_i}.

Exercise 10 (Rotation invariance) Let {X = (X_1,\ldots,X_n)} be an {{\bf R}^n}-valued random variable, where {X_1,\ldots,X_n \equiv N(0,1)_{\bf R}} are iid real gaussians. Show that for any orthogonal matrix {U \in O(n)}, {UX \equiv X}.

Show that the same claim holds for complex gaussians (so {X} is now {{\bf C}^n}-valued), and with the orthogonal group {O(n)} replaced by the unitary group {U(n)}.

Theorem 8 (Gaussian concentration inequality for Lipschitz functions) Let {X_1,\ldots,X_n \equiv N(0,1)_{\bf R}} be iid real gaussian variables, and let {F: {\bf R}^n \rightarrow {\bf R}} be a {1}-Lipschitz function (i.e. {|F(x)-F(y)| \leq |x-y|} for all {x,y \in {\bf R}^n}, where we use the Euclidean metric on {{\bf R}^n}). Then for any {\lambda} one has

\displaystyle {\bf P}( |F(X) - {\bf E} F(X)| \geq \lambda ) \leq C \exp(-c \lambda^2 )

for some absolute constants {C, c > 0}.

Proof: We use the following elegant argument of Maurey and Pisier. By subtracting a constant from {F}, we may normalise {{\bf E} F(X) = 0}. By symmetry it then suffices to show the upper tail estimate

\displaystyle {\bf P}( F(X) \geq \lambda ) \leq C \exp(-c \lambda^2 ).

By smoothing {F} slightly we may assume that {F} is smooth, since the general case then follows from a limiting argument. In particular, the Lipschitz bound on {F} now implies the gradient estimate

\displaystyle |\nabla F(x)| \leq 1 \ \ \ \ \ (17)

for all {x \in {\bf R}^n}.

Once again, we use the exponential moment method. It will suffice to show that

\displaystyle {\bf E} \exp( t F(X) ) \leq \exp( C t^2 )

for some constant {C>0} and all {t>0}, as the claim follows from Markov’s inequality and optimisation in {t} as in previous arguments.

To exploit the Lipschitz nature of {F}, we will need to introduce a second copy of {F(X)}. Let {Y} be an independent copy of {X}. Since {{\bf E} F(Y) = 0}, we see from Jensen’s inequality that

\displaystyle {\bf E} \exp( - t F(Y) ) \geq 1

and thus (by independence of {X} and {Y})

\displaystyle {\bf E} \exp( t F(X) ) \leq {\bf E} \exp( t (F(X) - F(Y)) ).

It is tempting to use the fundamental theorem of calculus along a line segment,

\displaystyle F(X) - F(Y) = \int_0^1 \frac{d}{dt} F((1-t) Y + t X)\ dt,

to estimate {F(X)-F(Y)}, but it turns out for technical reasons to be better to use a circular arc instead,

\displaystyle F(X) - F(Y) = \int_0^{\pi/2} \frac{d}{d\theta} F(Y \cos \theta + X \sin \theta)\ d\theta.

The reason for this is that {X_\theta := Y \cos \theta + X \sin \theta} is another gaussian random variable equivalent to {X}, as is its derivative {X'_\theta := -Y \sin \theta + X \cos \theta} (by Exercise 9); furthermore, and crucially, these two random variables are independent (by Exercise 10).

To exploit this, we first use Jensen’s inequality to bound

\displaystyle \exp( t (F(X) - F(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp( \frac{\pi t}{2} \frac{d}{d\theta} F( X_\theta ) )\ d\theta.

Applying the chain rule and taking expectations, we have

\displaystyle {\bf E} \exp( t (F(X) - F(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} {\bf E} \exp( \frac{\pi t}{2} \nabla F( X_\theta ) \cdot X'_\theta )\ d\theta.

Let us condition {X_\theta} to be fixed; then {X'_\theta \equiv X}, and applying Exercise 9 and (17), we conclude that {\frac{\pi t}{2} \nabla F( X_\theta ) \cdot X'_\theta} is normally distributed with standard deviation at most {\frac{\pi t}{2}}. As such we have

\displaystyle {\bf E} \exp( \frac{\pi t}{2} \nabla F( X_\theta ) \cdot X'_\theta ) \leq \exp( C t^2 )

for some absolute constant {C}; integrating out the conditioning on {X_\theta} we obtain the claim. \Box
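A standard illustration of Theorem 8 (again only a numerical sketch) is the Euclidean norm {F(x) := |x|}, which is {1}-Lipschitz; its fluctuations remain bounded as the dimension grows, even though its mean grows like {\sqrt{n}}:

```python
import numpy as np

rng = np.random.default_rng(5)

# F(x) = |x| (Euclidean norm) is 1-Lipschitz, so Theorem 8 predicts O(1) fluctuations
# around its mean, uniformly in the dimension n.
for n in [10, 100, 1000, 4000]:
    X = rng.standard_normal((4000, n))
    norms = np.linalg.norm(X, axis=1)
    print(f"n = {n:5d}:  E F(X) ~ {norms.mean():8.2f}   std of F(X) ~ {norms.std():.3f}")
```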

Exercise 11 Show that Theorem 8 is equivalent to the inequality

\displaystyle {\bf P}(X \in A) {\bf P}(X \not \in A_\lambda) \leq C \exp(-c \lambda^2)

holding for all {\lambda > 0} and all measurable sets {A}, where {X = (X_1,\ldots,X_n)} is an {{\bf R}^n}-valued random variable with iid gaussian components {X_1,\ldots,X_n \equiv N(0,1)_{\bf R}}, and {A_\lambda} is the {\lambda}-neighbourhood of {A}.

Now we give a powerful concentration inequality of Talagrand, which we will rely heavily on later in this course.

Theorem 9 (Talagrand concentration inequality) Let {K > 0}, and let {X_1,\ldots,X_n} be independent complex variables with {|X_i| \leq K} for all {1 \leq i \leq n}. Let {F: {\bf C}^n \rightarrow {\bf R}} be a {1}-Lipschitz convex function (where we identify {{\bf C}^n} with {{\bf R}^{2n}} for the purposes of defining “Lipschitz” and “convex”). Then for any {\lambda} one has

\displaystyle {\bf P}( |F(X) - {\bf M} F(X)| \geq \lambda K ) \leq C \exp(-c \lambda^2 ) \ \ \ \ \ (18)

and

\displaystyle {\bf P}( |F(X) - {\bf E} F(X)| \geq \lambda K ) \leq C \exp(-c \lambda^2 ) \ \ \ \ \ (19)

for some absolute constants {C, c > 0}, where {{\bf M} F(X)} is a median of {F(X)}.

We now prove the theorem, following the remarkable argument of Talagrand.

By dividing through by {K} we may normalise {K=1}. {X} now takes values in the convex set {\Omega^n \subset {\bf C}^n}, where {\Omega} is the unit disk in {{\bf C}}. It will suffice to establish the inequality

\displaystyle {\bf E} \exp( c d(X,A)^2 ) \leq \frac{1}{{\bf P}(X \in A) } \ \ \ \ \ (20)

for any convex set {A} in {\Omega^n} and some absolute constant {c>0}, where {d(X,A)} is the Euclidean distance between {X} and {A}. Indeed, if one obtains this estimate, then one has

\displaystyle {\bf P}( F(X) \leq x ) {\bf P}( F(X) \geq y ) \leq \exp( - c |x-y|^2 )

for any {y > x} (as can be seen by applying (20) to the convex set {A := \{ z \in \Omega^n: F(z) \leq x \}}). Applying this inequality with one of {x,y} equal to the median {{\bf M} F(X)} of {F(X)} yields (18), which in turn implies that

\displaystyle {\bf E} F(X) = {\bf M} F(X) + O( 1 ),

which then gives (19).

We would like to establish (20) by induction on dimension {n}. In the case when {X_1,\ldots,X_n} are Bernoulli variables, this can be done; see this previous blog post. In the general case, it turns out that in order to close the induction properly, one must strengthen (20) by replacing the Euclidean distance {d(X,A)} by an essentially larger quantity, which I will call the combinatorial distance {d_c(X,A)} from {X} to {A}. For each vector {z = (z_1,\ldots,z_n) \in {\bf C}^n} and {\omega = (\omega_1,\ldots,\omega_n) \in \{0,1\}^n}, we say that {\omega} supports {z} if {z_i} is non-zero only when {\omega_i} is non-zero. Define the combinatorial support {U_A(X)} of {A} relative to {X} to be all the vectors in {\{0,1\}^n} that support at least one vector in {A-X}. Define the combinatorial hull {V_A(X)} of {A} relative to {X} to be the convex hull of {U_A(X)}, and then define the combinatorial distance {d_c(X,A)} to be the distance between {V_A(X)} and the origin.

Lemma 10 (Combinatorial distance controls Euclidean distance) Let {A} be a convex subset of {\Omega^n}. Then {d(X,A) \leq 2 d_c(X,A)}.

Proof: Suppose {d_c(X,A) \leq r}. Then there exists a convex combination {t = (t_1,\ldots,t_n)} of elements {\omega \in U_A(X) \subset \{0,1\}^n} which has magnitude at most {r}. For each such {\omega \in U_A(X)}, we can find a vector {z_\omega \in A-X} supported by {\omega}. As {A, X} both lie in {\Omega^n}, every coefficient of {z_\omega} has magnitude at most {2}, and is thus bounded in magnitude by twice the corresponding coefficient of {\omega}. If we then let {z_t} be the convex combination of the {z_\omega} indicated by {t}, then the magnitude of each coefficient of {z_t} is bounded by twice the corresponding coefficient of {t}, and so {|z_t| \leq 2r}. On the other hand, as {A} is convex, {z_t} lies in {A-X}, and so {d(X,A) \leq 2r}. The claim follows. \Box

Thus to show (20) it suffices (after a modification of the constant {c}) to show that

\displaystyle {\bf E} \exp( c d_c(X,A)^2 ) \leq \frac{1}{{\bf P}(X \in A)}. \ \ \ \ \ (21)

We first verify the one-dimensional case. In this case, {d_c(X,A)} equals {1} when {X \not \in A}, and {0} otherwise, and the claim follows from elementary calculus (for {c} small enough).

Now suppose that {n > 1} and the claim has already been proven for {n-1}. We write {X=(X',X_n)}, and let {A_{X_n} := \{ z' \in \Omega^{n-1}: (z',X_n) \in A \}} be a slice of {A}. We also let {B := \{ z' \in \Omega^{n-1}: (z',t) \in A \hbox{ for some } t \in \Omega \}}. We have the following basic inequality:

Lemma 11 For any {0 \leq \lambda \leq 1}, we have

\displaystyle d_c(X,A)^2 \leq (1-\lambda)^2 + \lambda d_c(X',A_{X_n})^2 + (1-\lambda) d_c(X',B)^2.

Proof: Observe that {U_A(X)} contains both {U_{A_{X_n}}(X') \times \{0\}} and {(U_{B}(X') \backslash U_{A_{X_n}}(X')) \times \{1\}}, and so by convexity, {V_A(X)} contains one of {(\lambda t + (1-\lambda) u, 1-\lambda)} or {(\lambda t + (1-\lambda) u, 0)} whenever {t \in V_{A_{X_n}}(X')} and {u \in V_{B}(X')}. The claim then follows from Pythagoras’ theorem and the Cauchy-Schwarz inequality. \Box

Let us now freeze {X_n} and consider the conditional expectation

\displaystyle {\bf E} (\exp( c d_c(X,A)^2 )|X_n).

Using the above lemma (with some {\lambda} depending on {X_n} to be chosen later), we may bound this conditional expectation by

\displaystyle e^{c(1-\lambda)^2} {\bf E}((e^{c d_c(X',A_{X_n})^2})^\lambda (e^{c d_c(X',B)^2})^{1-\lambda}|X_n);

applying Hölder’s inequality and the induction hypothesis (21), we can bound this by

\displaystyle e^{c(1-\lambda)^2} \frac{1}{{\bf P}(X' \in A_{X_n}|X_n)^\lambda {\bf P}(X' \in B|X_n)^{1-\lambda}}

which we can rearrange as

\displaystyle \frac{1}{{\bf P}(X' \in B)} e^{c(1-\lambda)^2} r^{-\lambda}

where {r := {\bf P}(X' \in A_{X_n}|X_n) / {\bf P}(X' \in B)} (here we note that the event {X' \in B} is independent of {X_n}). Note that {0 \leq r \leq 1}. We then apply the elementary inequality

\displaystyle \inf_{0 \leq \lambda \leq 1} e^{c(1-\lambda)^2} r^{-\lambda} \leq 2-r,

which can be verified by elementary calculus if {c} is small enough (in fact one can take {c=1/4}). We conclude that

\displaystyle {\bf E} (\exp( c d_c(X,A)^2 )|X_n) \leq \frac{1}{{\bf P}(X' \in B)} ( 2 - \frac{{\bf P}(X' \in A_{X_n}|X_n)}{{\bf P}(X' \in B)} ).

Taking expectations in {X_n} we conclude that

\displaystyle {\bf E} (\exp( c d_c(X,A)^2 )) \leq \frac{1}{{\bf P}(X' \in B)} ( 2 - \frac{{\bf P}(X \in A)}{{\bf P}(X' \in B)} ).

Using the inequality {x(2-x) \leq 1} with {x := \frac{{\bf P}(X \in A)}{{\bf P}(X' \in B)}} we conclude (21) as desired.
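The two elementary inequalities invoked above (the one-dimensional base case, and the inequality {\inf_{0 \leq \lambda \leq 1} e^{c(1-\lambda)^2} r^{-\lambda} \leq 2-r}) are easily confirmed numerically with {c = 1/4}; the following sketch (a check on a grid, not a proof) does exactly that:

```python
import numpy as np

c = 0.25

# Base case (n = 1): E exp(c d_c(X,A)^2) = q + (1-q) e^c with q = P(X in A);
# the claim (21) then requires q*(q + (1-q) e^c) <= 1 for all q in (0,1].
qs = np.linspace(0.01, 1.0, 1000)
print("base case: max of q*(q + (1-q)e^c) =", (qs * (qs + (1 - qs) * np.exp(c))).max())

# Inductive step: inf_{0<=lam<=1} e^{c(1-lam)^2} r^{-lam} <= 2 - r for 0 < r <= 1.
# The infimum is attained at lam = clip(1 - log(1/r)/(2c), 0, 1).
rs = np.linspace(0.01, 1.0, 1000)
lam_star = np.clip(1.0 - np.log(1.0 / rs) / (2 * c), 0.0, 1.0)
inf_vals = np.exp(c * (1 - lam_star) ** 2) * rs ** (-lam_star)
print("inductive step: max of inf/(2-r) =", (inf_vals / (2 - rs)).max())  # should be <= 1
```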

The above argument was elementary, but rather “magical” in nature. Let us now give a somewhat different argument of Ledoux, based on log-Sobolev inequalities, which gives the upper tail bound

\displaystyle {\bf P}( F(X) - {\bf E} F(X) \geq \lambda K ) \leq C \exp(-c \lambda^2 ), \ \ \ \ \ (22)

but curiously does not give the lower tail bound. (The situation is not symmetric, due to the convexity hypothesis on {F}.)

Once again we can normalise {K=1}. By regularising {F} we may assume that {F} is smooth. The first step is to establish the following log-Sobolev inequality:

Lemma 12 (Log-Sobolev inequality) Let {F: {\bf C}^n \rightarrow {\bf R}} be a smooth convex function. Then

\displaystyle {\bf E} F(X) e^{F(X)} \leq ({\bf E} e^{F(X)}) (\log {\bf E} e^{F(X)}) + C {\bf E} e^{F(X)} |\nabla F(X)|^2

for some absolute constant {C} (independent of {n}).

Remark 5 If one sets {f := e^{F/2}} and normalises {{\bf E} f(X)^2=1}, this inequality becomes

\displaystyle {\bf E} |f(X)|^2 \log |f(X)|^2 \leq 4C {\bf E} |\nabla f(X)|^2

which more closely resembles the classical log-Sobolev inequality of Gross. The constant {C} here can in fact be taken to be {2}; see Ledoux’s paper.

Proof: We first establish the {1}-dimensional case. If we let {Y} be an independent copy of {X}, observe that the left-hand side can be rewritten as

\displaystyle \frac{1}{2} {\bf E} ((F(X)-F(Y)) (e^{F(X)} - e^{F(Y)})) + ({\bf E} F(X)) ({\bf E} e^{F(X)}).

From Jensen’s inequality, {{\bf E} F(X) \leq \log {\bf E} e^{F(X)}}, so it will suffice to show that

\displaystyle {\bf E} ((F(X)-F(Y)) (e^{F(X)} - e^{F(Y)})) \leq 2C {\bf E} e^{F(X)} |\nabla F(X)|^2.

From convexity of {F} (and hence of {e^F}) and the bounded nature of {X,Y}, we have

\displaystyle F(X)-F(Y) = O( |\nabla F(X)|)

and

\displaystyle e^{F(X)}-e^{F(Y)} = O( |\nabla F(X)| e^{F(X)})

when {F(X) \geq F(Y)}, which leads to

\displaystyle ((F(X)-F(Y)) (e^{F(X)} - e^{F(Y)})) = O( e^{F(X)} |\nabla F(X)|^2 )

in this case. Similarly when {F(X) < F(Y)} (swapping {X} and {Y}). The claim follows.

To show the general case, we induct on {n} (taking care to ensure that the constant {C} does not change in this induction process). Write {X = (X',X_n)}, where {X' := (X_1,\ldots,X_{n-1})}. From the induction hypothesis, we have

\displaystyle {\bf E}(F(X) e^{F(X)}|X_n) \leq f(X_n) e^{f(X_n)} + C {\bf E}(e^{F(X)} |\nabla' F(X)|^2|X_n)

where {\nabla'} is the {n-1}-dimensional gradient and {f(X_n) := \log {\bf E}(e^{F(X)}|X_n)}. Taking expectations, we conclude that

\displaystyle {\bf E} F(X) e^{F(X)} \leq {\bf E} f(X_n) e^{f(X_n)} + C {\bf E} e^{F(X)} |\nabla' F(X)|^2. \ \ \ \ \ (23)

From the convexity of {F} and Hölder’s inequality we see that {f} is also convex, and {{\bf E} e^{f(X_n)} = {\bf E} e^{F(X)}}. By the {n=1} case already established, we have

\displaystyle {\bf E} f(X_n) e^{f(X_n)} \leq ({\bf E} e^{F(X)}) (\log {\bf E} e^{F(X)}) + C {\bf E} e^{f(X_n)} |f'(X_n)|^2. \ \ \ \ \ (24)

Now, by the chain rule

\displaystyle e^{f(X_n)} |f'(X_n)|^2 = e^{-f(X_n)} | {\bf E}(e^{F(X)} F_{x_n}(X)|X_n)|^2

where {F_{x_n}} is the derivative of {F} in the {x_n} direction. Applying Cauchy-Schwarz, we conclude

\displaystyle e^{f(X_n)} |f'(X_n)|^2 \leq {\bf E}(e^{F(X)} |F_{x_n}(X)|^2|X_n).

Inserting this into (23), (24) we close the induction. \Box

Now let {F} be convex and {1}-Lipschitz. Applying the above lemma to {tF} for any {t>0}, we conclude that

\displaystyle {\bf E} tF(X) e^{tF(X)} \leq ({\bf E} e^{tF(X)}) (\log {\bf E} e^{tF(X)}) + C t^2 {\bf E} e^{tF(X)};

setting {H(t) := {\bf E} e^{tF(X)}}, we can rewrite this as a differential inequality

\displaystyle t H'(t) \leq H(t) \log H(t) + C t^2 H(t)

which (after dividing both sides by {t^2 H(t)}) we can rewrite as

\displaystyle \frac{d}{dt} (\frac{1}{t} \log H(t) ) \leq C.

From Taylor expansion we see that

\displaystyle \frac{1}{t} \log H(t) \rightarrow {\bf E} F(X)

as {t \rightarrow 0}, and thus

\displaystyle \frac{1}{t} \log H(t) \leq {\bf E} F(X) + Ct

for any {t>0}. In other words,

\displaystyle {\bf E} e^{tF(X)} \leq \exp( t {\bf E} F(X) + C t^2 ).

By Markov’s inequality, we conclude that

\displaystyle {\bf P}( F(X) - {\bf E} F(X) > \lambda ) \leq \exp( Ct^2 - t \lambda);

optimising in {t} gives (22).

Remark 6 The same argument, starting with Gross’s log-Sobolev inequality for the gaussian measure, gives the upper tail component of Theorem 8, with no convexity hypothesis on {F}. The situation is now symmetric with respect to reflections {F \mapsto -F}, and so one obtains the lower tail component as well. The method of obtaining concentration inequalities from log-Sobolev inequalities (or related inequalities, such as Poincaré-type inequalities) by combining the latter with the exponential moment method is known as Herbst’s argument, and can be used to establish a number of other functional inequalities of interest.

We now close with a simple corollary of the Talagrand concentration inequality, which will be extremely useful in the sequel.

Corollary 13 (Distance between random vector and a subspace) Let {X_1,\ldots,X_n} be independent complex-valued random variables with mean zero and variance {1}, and bounded almost surely in magnitude by {K}. Let {V} be a subspace of {{\bf C}^n} of dimension {d}. Then for any {\lambda > 0}, one has

\displaystyle {\bf P}( |d(X,V) - \sqrt{n-d}| \geq \lambda K ) \leq C \exp( - c \lambda^2 )

for some absolute constants {C, c > 0}.

Informally, this corollary asserts that the distance between a random vector {X} and an arbitrary subspace {V} is typically equal to {\sqrt{n - \hbox{dim}(V)} + O(1)}.

Proof: The function {z \mapsto d(z,V)} is convex and {1}-Lipschitz. From Theorem 9, one has

\displaystyle {\bf P}( |d(X,V) - M d(X,V)| \geq \lambda K ) \leq C \exp(-c \lambda^2 ).

To finish the argument, it then suffices to show that

\displaystyle M d(X,V) = \sqrt{n-d} + O( K ).

We begin with a second moment calculation. Observe that

\displaystyle d(X,V)^2 = \| \pi(X) \|^2 = \sum_{1 \leq i,j \leq n} p_{ij} X_i \overline{X_j},

where {\pi} is the orthogonal projection matrix to the complement {V^\perp} of {V}, and {p_{ij}} are the components of {\pi}. Taking expectations, we obtain

\displaystyle {\bf E} d(X,V)^2 = \sum_{i=1}^n p_{ii} = \hbox{tr}(\pi) = n-d \ \ \ \ \ (25)

where the latter follows by representing {\pi} in terms of an orthonormal basis of {V^\perp}. This is close to what we need, but to finish the task we need to obtain some concentration of {d(X,V)^2} around its mean. For this, we write

\displaystyle d(X,V)^2 - {\bf E} d(X,V)^2 = \sum_{1 \leq i,j \leq n} p_{ij} (X_i \overline{X_j} - \delta_{ij})

where {\delta_{ij}} is the Kronecker delta. The summands here are pairwise uncorrelated for {1 \leq i \leq j \leq n}, and the {i>j} cases can be combined with the {i<j} cases by symmetry. Each factor {X_i \overline{X_j} - \delta_{ij}} also has variance {O( K^2 )}. We thus have the variance bound

\displaystyle {\bf Var}( d(X,V)^2 ) = O( K^2 \sum_{1 \leq i,j \leq n} |p_{ij}|^2 ) = O( K^2 (n-d) ),

where the latter bound comes from representing {\pi} in terms of an orthonormal basis of {V^\perp}. From this, (25), and Chebyshev’s inequality, we see that the median of {d(X,V)^2} is equal to {n-d + O( \sqrt{K^2 (n-d)} )}, which implies on taking square roots that the median of {d(X,V)} is {\sqrt{n-d} + O(K)}, as desired. \Box
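Corollary 13 is also easy to visualise numerically; the sketch below (with illustrative parameters only, real sign entries so that {K=1}, and a randomly chosen subspace standing in for the arbitrary {V}) compares {d(X,V)} with {\sqrt{n-d}}:

```python
import numpy as np

rng = np.random.default_rng(6)

# Distance from a random sign vector X (mean 0, variance 1, K = 1) to a d-dimensional
# subspace V, compared with sqrt(n - d).  V is taken at random here for illustration,
# but the corollary makes no randomness assumption on V.
n = 2000
for d in [10, 500, 1000, 1900]:
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))      # orthonormal basis of V
    dists = []
    for _ in range(200):
        X = rng.choice([-1.0, 1.0], size=n)
        dists.append(np.linalg.norm(X - Q @ (Q.T @ X)))   # d(X, V)
    dists = np.array(dists)
    print(f"d = {d:5d}:  mean d(X,V) = {dists.mean():8.2f}"
          f"   sqrt(n-d) = {np.sqrt(n - d):8.2f}   std = {dists.std():.3f}")
```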