Suppose we have a large number of scalar random variables $X_1, \ldots, X_n$, which each have bounded size on average (e.g. their mean and variance could be $O(1)$). What can one then say about their sum $S_n := X_1 + \ldots + X_n$? If each individual summand $X_i$ varies in an interval of size $O(1)$, then their sum of course varies in an interval of size $O(n)$. However, a remarkable phenomenon, known as concentration of measure, asserts that assuming a sufficient amount of independence between the component variables $X_1,\ldots,X_n$, this sum sharply concentrates in a much narrower range, typically in an interval of size $O(\sqrt{n})$. This phenomenon is quantified by a variety of large deviation inequalities that give upper bounds (often exponential in nature) on the probability that such a combined random variable deviates significantly from its mean. The same phenomenon applies not only to linear expressions such as $S_n = X_1 + \ldots + X_n$, but more generally to nonlinear combinations $F(X_1,\ldots,X_n)$ of such variables, provided that the nonlinear function $F$ is sufficiently regular (in particular, if it is Lipschitz, either separately in each variable, or jointly in all variables).
The basic intuition here is that it is difficult for a large number of independent variables $X_1,\ldots,X_n$ to “work together” to simultaneously pull a sum $X_1 + \ldots + X_n$ or a more general combination $F(X_1,\ldots,X_n)$ too far away from its mean. Independence here is the key; concentration of measure results typically fail if the $X_i$ are too highly correlated with each other.
There are many applications of the concentration of measure phenomenon, but we will focus on a specific application which is useful in the random matrix theory topics we will be studying, namely on controlling the behaviour of random $n$-dimensional vectors with independent components, and in particular on the distance between such random vectors and a given subspace.
Once one has a sufficient amount of independence, the concentration of measure tends to be sub-gaussian in nature; thus the probability that one is at least $\lambda$ standard deviations from the mean tends to drop off like $C e^{-c\lambda^2}$ for some absolute constants $C, c > 0$. In particular, one is $O(\log^{1/2} n)$ standard deviations from the mean with high probability, and $O(\log^{1/2+\epsilon} n)$ standard deviations from the mean with overwhelming probability. Indeed, concentration of measure is our primary tool for ensuring that various events hold with overwhelming probability (other moment methods can give high probability, but have difficulty ensuring overwhelming probability).
This is only a brief introduction to the concentration of measure phenomenon. A systematic study of this topic can be found in this book by Ledoux.
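Before turning to the formal inequalities, here is a minimal numerical sketch of the $O(\sqrt{n})$ versus $O(n)$ contrast described above. (This code is an added illustration, not part of the original notes; the sample sizes and the random seed are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 2_000

# Each row is one sample of (X_1, ..., X_n) with X_i = +-1 chosen uniformly at random.
X = rng.choice([-1.0, 1.0], size=(trials, n))
S = X.sum(axis=1)                      # one value of S_n per trial

print("a priori range of S_n : [%d, %d]" % (-n, n))
print("observed std of S_n   : %.1f (sqrt(n) = %.1f)" % (S.std(), np.sqrt(n)))
print("largest |S_n| observed: %.0f" % np.abs(S).max())
```

In a typical run the standard deviation comes out close to $\sqrt{n} = 100$, and even the most extreme of the two thousand samples is only a few hundred in size, far smaller than the a priori bound $n = 10{,}000$.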
— 1. Linear combinations, and the moment method —
We begin with the simple setting of studying a sum $S_n := X_1 + \ldots + X_n$ of random variables. As we shall see, these linear sums are particularly amenable to the moment method, though to use the more powerful moments, we will require more powerful independence assumptions (and, naturally, we will need more moments to be finite or bounded). As such, we will take the opportunity to use this topic (large deviation inequalities for sums of random variables) to give a tour of the moment method, which we will return to when we consider the analogous questions for the bulk spectral distribution of random matrices.
In this section we shall concern ourselves primarily with bounded random variables; in the next section we describe the basic truncation method that can allow us to extend from the bounded case to the unbounded case (assuming suitable decay hypotheses).
The zeroth moment method gives a crude upper bound on the probability that $S_n$ is non-zero,

$$ {\bf P}( S_n \neq 0 ) \leq \sum_{i=1}^n {\bf P}( X_i \neq 0 ) \qquad (1)$$

but in most cases this bound is worse than the trivial bound ${\bf P}(S_n \neq 0) \leq 1$. This bound, however, will be useful when performing the truncation trick, which we will discuss below.
The first moment method is somewhat better, giving the bound

$$ {\bf E} |S_n| \leq \sum_{i=1}^n {\bf E} |X_i| $$

which when combined with Markov’s inequality gives the rather weak large deviation inequality

$$ {\bf P}( |S_n| \geq \lambda ) \leq \frac{1}{\lambda} \sum_{i=1}^n {\bf E} |X_i|. \qquad (2)$$

As weak as this bound is, it is sometimes sharp. For instance, if the $X_i$ are all equal to a single signed Bernoulli variable $X \in \{-1,+1\}$, then $S_n = nX$, and so $|S_n| = n$, and so (2) is sharp when $\lambda \leq n$. The problem here is a complete lack of independence; the $X_i$ are all simultaneously positive or simultaneously negative, causing huge fluctuations in the value of $S_n$.

Informally, one can view (2) as the assertion that $S_n$ typically has size $S_n = O( \sum_{i=1}^n {\bf E}|X_i| )$.
The first moment method also shows that

$$ {\bf E} S_n = \sum_{i=1}^n {\bf E} X_i $$

and so we can normalise out the means using the identity

$$ S_n - {\bf E} S_n = \sum_{i=1}^n ( X_i - {\bf E} X_i ). $$

Replacing the $X_i$ by $X_i - {\bf E} X_i$ (and $S_n$ by $S_n - {\bf E} S_n$) we may thus assume for simplicity that all the $X_i$ have mean zero.
Now we consider what the second moment method gives us. We square $S_n$ and take expectations to obtain

$$ {\bf E} |S_n|^2 = \sum_{i=1}^n \sum_{j=1}^n {\bf E} X_i \overline{X_j}. $$

If we assume that the $X_i$ are pairwise independent (in addition to having mean zero), then ${\bf E} X_i \overline{X_j}$ vanishes unless $i = j$, in which case this expectation is equal to ${\bf E} |X_i|^2$. We thus have

$$ {\bf E} |S_n|^2 = \sum_{i=1}^n {\bf E} |X_i|^2 \qquad (3)$$

which when combined with Chebyshev’s inequality (and the mean zero normalisation) yields the large deviation inequality

$$ {\bf P}( |S_n| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^n {\bf E} |X_i|^2. \qquad (4)$$

Without the normalisation that the $X_i$ have mean zero, we obtain

$$ {\bf P}( |S_n - {\bf E} S_n| \geq \lambda ) \leq \frac{1}{\lambda^2} \sum_{i=1}^n {\bf Var}(X_i). \qquad (5)$$

Informally, this is the assertion that $S_n$ typically has size $S_n = {\bf E} S_n + O( ( \sum_{i=1}^n {\bf Var}(X_i) )^{1/2} )$, if we have pairwise independence. Note also that we do not need the full strength of the pairwise independence assumption; the slightly weaker hypothesis of being pairwise uncorrelated would have sufficed.
The inequality (5) is sharp in two ways. Firstly, we cannot expect any significant concentration in any range narrower than the standard deviation $O( ( \sum_{i=1}^n {\bf Var}(X_i) )^{1/2} )$, as this would likely contradict (3). Secondly, the quadratic-type decay in $\lambda$ in (5) is sharp given the pairwise independence hypothesis. For instance, suppose that $n = 2^m - 1$ for some $m \geq 1$, and that $X_j := (-1)^{a_j \cdot Y}$, where $Y$ is drawn uniformly at random from the cube $\{0,1\}^m$, and $a_1,\ldots,a_n$ are an enumeration of the non-zero elements of $\{0,1\}^m$. Then a little Fourier analysis shows that each $X_j$ for $1 \leq j \leq n$ has mean zero and variance $1$, and the $X_j$ are pairwise independent; but $S_n$ is equal to $(n+1) {\bf I}(Y = 0) - 1$, which is equal to $n$ with probability $1/(n+1)$; this is despite the standard deviation of $S_n$ being just $\sqrt{n}$. This shows that (5) is essentially (i.e. up to constants) sharp here when $\lambda = n$.
Now we turn to higher moments. Let us assume that the $X_i$ are normalised to have mean zero and variance at most $1$, and are also almost surely bounded in magnitude by some $K$: $|X_i| \leq K$. (The interesting regime here is when $K \geq 1$, otherwise the variance is in fact strictly smaller than $1$.) To simplify the exposition very slightly we will assume that the $X_i$ are real-valued; the complex-valued case is very analogous (and can also be deduced from the real-valued case) and is left to the reader.
Let us also assume that the $X_i$ are $k$-wise independent for some even positive integer $k$. With this assumption, we can now estimate the $k^{th}$ moment

$$ {\bf E} |S_n|^k = \sum_{1 \leq i_1,\ldots,i_k \leq n} {\bf E} X_{i_1} \ldots X_{i_k}. $$

To compute the expectation of the product, we can use the $k$-wise independence, but we need to divide into cases (analogous to the $i=j$ and $i \neq j$ cases in the second moment calculation above) depending on how various indices are repeated. If one of the $X_{i_j}$ only appear once, then the entire expectation is zero (since $X_{i_j}$ has mean zero), so we may assume that each of the $X_{i_j}$ appear at least twice. In particular, there are at most $k/2$ distinct $X_{i_j}$ which appear. If exactly $k/2$ such terms appear, then from the unit variance assumption we see that the expectation has magnitude at most $1$; more generally, if $k/2 - r$ such terms appear, then from the unit variance assumption and the upper bound by $K$ we see that the expectation has magnitude at most $K^{2r}$. This leads to the upper bound

$$ {\bf E} |S_n|^k \leq \sum_{r=0}^{k/2} K^{2r} N_r $$

where $N_r$ is the number of ways one can assign integers $i_1,\ldots,i_k$ in $\{1,\ldots,n\}$ such that each $i_j$ appears at least twice, and such that exactly $k/2 - r$ integers appear.
We are now faced with the purely combinatorial problem of estimating $N_r$. We will use a somewhat crude bound. There are $\binom{n}{k/2-r} \leq \frac{n^{k/2-r}}{(k/2-r)!}$ ways to choose $k/2-r$ integers from $\{1,\ldots,n\}$. Each of the integers $i_1,\ldots,i_k$ has to come from one of these $k/2-r$ integers, leading to the crude bound

$$ N_r \leq \frac{n^{k/2-r}}{(k/2-r)!} (k/2-r)^k $$

which after using a crude form of Stirling’s formula gives

$$ N_r \leq e^{k/2} n^{k/2-r} (k/2)^{k/2+r} $$

and so

$$ {\bf E} |S_n|^k \leq e^{k/2} n^{k/2} (k/2)^{k/2} \sum_{r=0}^{k/2} \Big( \frac{k K^2}{2n} \Big)^r. $$

If we make the mild assumption

$$ K^2 \leq n/k \qquad (6)$$

then from the geometric series formula we conclude that

$$ {\bf E} |S_n|^k \leq 2 e^{k/2} n^{k/2} (k/2)^{k/2} $$

(say), which leads to the large deviation inequality

$$ {\bf P}( |S_n| \geq \lambda \sqrt{n} ) \leq 2 \Big( \frac{\sqrt{ek/2}}{\lambda} \Big)^k. \qquad (7)$$

This should be compared with (2), (5). As $k$ increases, the rate of decay in the $\lambda$ parameter improves, but to compensate for this, the range that $S_n$ concentrates in grows slowly, to $O(\sqrt{nk})$ rather than $O(\sqrt{n})$.
Remark 1 Note how it was important here that $k$ was even. Odd moments, such as ${\bf E} S_n^3$, can be estimated, but due to the lack of the absolute value sign, these moments do not give much usable control on the distribution of the $S_n$. One could be more careful in the combinatorial counting than was done here, but the net effect of such care is only to improve the unspecified constant $c$ in bounds such as (8) below (which can easily be made explicit, but we will not do so here).
Now suppose that the $X_i$ are not just $k$-wise independent for any fixed $k$, but are in fact jointly independent. Then we can apply (7) for any $k$ obeying (6). We can optimise in $k$ by setting $k$ to be a small multiple of $\lambda^2$, and conclude the gaussian-type bound

$$ {\bf P}( |S_n| \geq \lambda \sqrt{n} ) \leq C \exp( - c \lambda^2 ) \qquad (8)$$

for some absolute constants $C, c > 0$, provided that $0 \leq \lambda \leq c \sqrt{n}/K$ for some small $c > 0$. (Note that the bound (8) is trivial when $\lambda \gg \sqrt{n} K$, as $|S_n| \leq nK$ almost surely, so we may assume that $\lambda$ is small compared to this quantity.) Thus we see that while control of each individual moment ${\bf E} |S_n|^k$ only gives polynomial decay in $\lambda$, by using all the moments simultaneously one can obtain square-exponential decay (i.e. subgaussian type decay).
By using Stirling’s formula (Exercise 2 from Notes 0a) one can show that the quadratic decay in $\lambda$ in (8) cannot be improved; see Exercise 2 below.
It was a little complicated to manage such large moments ${\bf E} |S_n|^k$. A slicker way to proceed (but one which exploits the joint independence and commutativity more strongly) is to work instead with the exponential moments ${\bf E} \exp( t S_n )$, which can be viewed as a sort of generating function for the power moments. A useful lemma in this regard is
Lemma 1 (Hoeffding’s lemma) Let $X$ be a scalar variable taking values in an interval $[a,b]$. Then for any $t > 0$,

$$ {\bf E} e^{tX} \leq e^{t {\bf E} X} \Big( 1 + O\big( t^2 {\bf Var}(X) \exp( O( t(b-a) ) ) \big) \Big). \qquad (9)$$

In particular,

$$ {\bf E} e^{tX} \leq e^{t {\bf E} X} \exp\big( O( t^2 (b-a)^2 ) \big). \qquad (10)$$
Proof: It suffices to prove the first inequality (9), as the second then follows using the bound ${\bf Var}(X) \leq (b-a)^2$ and from various elementary estimates.
By subtracting the mean from $X$ we may normalise ${\bf E} X = 0$. By dividing $X$ (and multiplying $t$ to balance) we may assume that $b - a = 1$, which implies that $X = O(1)$. We then have the Taylor expansion

$$ e^{tX} = 1 + tX + O\big( t^2 X^2 \exp( O(t) ) \big) $$

which on taking expectations gives

$$ {\bf E} e^{tX} = 1 + O\big( t^2 {\bf Var}(X) \exp( O(t) ) \big) $$

and the claim follows. $\Box$
Exercise 1 Show that the $O( t^2 (b-a)^2 )$ factor in (10) can be replaced with $t^2 (b-a)^2 / 8$, and that this is sharp. (Hint: use Jensen’s inequality.)
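As a quick sanity check on Lemma 1 (and on the sharp constant $1/8$ from Exercise 1), one can compare the exponential moment of a bounded variable against the bound $e^{t {\bf E} X} e^{t^2(b-a)^2/8}$ numerically. This is an added illustrative sketch; the choice of a uniform distribution on $[-1,2]$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = -1.0, 2.0                         # X takes values in [a, b]; here X is uniform on [a, b]
X = rng.uniform(a, b, size=1_000_000)
mean = X.mean()

for t in [0.1, 0.5, 1.0, 2.0]:
    lhs = np.exp(t * X).mean()                          # Monte Carlo estimate of E exp(tX)
    rhs = np.exp(t * mean + t ** 2 * (b - a) ** 2 / 8)  # Hoeffding bound with the 1/8 constant
    print(f"t = {t:3.1f}   E exp(tX) ~ {lhs:8.4f}   bound {rhs:8.4f}")
```

In every row the empirical exponential moment stays below the Hoeffding bound, as it must.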
We now have the fundamental Chernoff bound:
Theorem 2 (Chernoff inequality) Let $X_1,\ldots,X_n$ be independent scalar random variables with $|X_i| \leq K$ almost surely, with mean $\mu_i$ and variance $\sigma_i^2$. Then for any $\lambda > 0$, one has

$$ {\bf P}( |S_n - \mu| \geq \lambda \sigma ) \leq C \max\big( \exp( - c \lambda^2 ),\ \exp( - c \lambda \sigma / K ) \big) \qquad (11)$$

for some absolute constants $C, c > 0$, where $\mu := \sum_{i=1}^n \mu_i$ and $\sigma^2 := \sum_{i=1}^n \sigma_i^2$.
Proof: By taking real and imaginary parts we may assume that the $X_i$ are real. By subtracting off the mean (and adjusting $K$ appropriately) we may assume that $\mu_i = 0$ (and so $\mu = 0$); dividing the $X_i$ (and $\sigma_i$, $K$) through by $\sigma$ we may assume that $\sigma = 1$. By symmetry it then suffices to establish the upper tail estimate

$$ {\bf P}( S_n \geq \lambda ) \leq C \max\big( \exp( - c \lambda^2 ),\ \exp( - c \lambda / K ) \big) $$

(with slightly different constants $C, c$).
To do this, we shall first compute the exponential moments

$$ {\bf E} e^{t S_n} $$

where $0 < t \leq 1/K$ is a real parameter to be optimised later. Expanding out the exponential and using the independence hypothesis, we conclude that

$$ {\bf E} e^{t S_n} = \prod_{i=1}^n {\bf E} e^{t X_i}. $$

To compute ${\bf E} e^{t X_i}$, we use the hypothesis that $|X_i| \leq K$ and (9) to obtain

$$ {\bf E} e^{t X_i} \leq 1 + O\big( t^2 \sigma_i^2 \exp( O( t K ) ) \big). $$

Thus we have

$$ {\bf E} e^{t S_n} \leq \exp\big( O( t^2 \exp( O(tK) ) ) \big) $$

and thus by Markov’s inequality

$$ {\bf P}( S_n \geq \lambda ) \leq \exp\big( O( t^2 \exp( O(tK) ) ) - t \lambda \big). $$

If we optimise this in $t$, subject to the constraint $0 < t \leq 1/K$, we obtain the claim. $\Box$
Informally, the Chernoff inequality asserts that $S_n$ is sharply concentrated in the range $\mu + O(\sigma)$. The bounds here are fairly sharp, at least when $\lambda$ is not too large:
Exercise 2 Let $0 < p < 1$ be fixed independently of $n$, and let $X_1,\ldots,X_n$ be iid copies of a Bernoulli random variable that equals $1$ with probability $p$, thus $\mu_i = p$ and $\sigma_i^2 = p(1-p)$, and so $\mu = np$ and $\sigma^2 = np(1-p)$. Using Stirling’s formula (Notes 0a), show that

$$ {\bf P}( |S_n - \mu| \geq \lambda \sigma ) \geq c \exp( - C \lambda^2 ) $$

for some absolute constants $C, c > 0$ and all $0 \leq \lambda \leq c \sigma$. What happens when $\lambda$ is much larger than $\sigma$?
Exercise 3 Show that the term $\exp( - c \lambda \sigma / K )$ in (11) can be replaced with $( \lambda K / \sigma )^{- c \lambda \sigma / K}$ (which is superior when $\lambda K \gg \sigma$). (Hint: Allow $t$ to exceed $1/K$.) Compare this with the results of Exercise 2.
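For readers who like to see the Chernoff regimes numerically, here is a small Monte Carlo sketch (added for illustration; the parameters and the choice of uniform summands are arbitrary) estimating ${\bf P}(|S_n - \mu| \geq \lambda\sigma)$ for bounded centred variables and comparing it with a gaussian-type curve.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 2_000, 100_000

# X_i uniform on [-1, 1]: mean 0, variance 1/3, |X_i| <= K = 1.
S = np.zeros(trials)
for _ in range(n):                       # accumulate the sum one variable at a time (memory-friendly)
    S += rng.uniform(-1.0, 1.0, size=trials)
sigma = np.sqrt(n / 3.0)

for lam in [1.0, 2.0, 3.0, 4.0]:
    p_emp = (np.abs(S) >= lam * sigma).mean()
    print(f"lambda = {lam:.0f}   P(|S_n| >= lambda*sigma) ~ {p_emp:.2e}"
          f"   2*exp(-lambda^2/2) = {2 * np.exp(-lam ** 2 / 2):.2e}")
```

In this regime ($\lambda$ much smaller than $\sigma/K$) the empirical tail tracks the gaussian-type decay promised by (11), up to the unspecified constants.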
Exercise 4 (Hoeffding’s inequality) Let $X_1,\ldots,X_n$ be independent real variables, with $X_i$ taking values in an interval $[a_i,b_i]$, and let $S_n := X_1 + \ldots + X_n$. Show that one has

$$ {\bf P}( |S_n - {\bf E} S_n| \geq \lambda \sigma ) \leq C \exp( - c \lambda^2 ) \qquad (12)$$

for some absolute constants $C, c > 0$, where $\sigma^2 := \sum_{i=1}^n |b_i - a_i|^2$.
Remark 2 As we can see, the exponential moment method is very slick compared to the power moment method. Unfortunately, due to its reliance on the identity $e^{A+B} = e^A e^B$, this method relies very strongly on commutativity of the underlying variables, and as such will not be as useful when dealing with noncommutative random variables, and in particular with random matrices. Nevertheless, we will still be able to apply the Chernoff bound to good effect to various components of random matrices, such as rows or columns of such matrices.
The full assumption of joint independence is not completely necessary for Chernoff-type bounds to be present. It suffices to have a martingale difference sequence, in which each $X_i$ can depend on the preceding variables $X_1,\ldots,X_{i-1}$, but which always has mean zero even when the preceding variables are conditioned out. More precisely, we have Azuma’s inequality:
Theorem 3 (Azuma’s inequality) Let $X_1,\ldots,X_n$ be a sequence of scalar random variables with $|X_i| \leq 1$ almost surely. Assume also that we have the martingale difference property

$$ {\bf E}( X_i | X_1,\ldots,X_{i-1} ) = 0 $$

almost surely for all $1 \leq i \leq n$ (here we assume the existence of a suitable disintegration in order to define the conditional expectation, though in fact it is possible to state and prove Azuma’s inequality without this disintegration). Then for any $\lambda > 0$, the sum $S_n := X_1 + \ldots + X_n$ obeys the large deviation inequality

$$ {\bf P}( |S_n| \geq \lambda \sqrt{n} ) \leq C \exp( - c \lambda^2 ) \qquad (13)$$

for some absolute constants $C, c > 0$.
A typical example of $S_n$ here is a dependent random walk, in which the magnitude and probabilities of the $i^{th}$ step are allowed to depend on the outcome of the preceding $i-1$ steps, but where the mean of each step is always fixed to be zero.
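Before giving the proof, here is a short simulation of one such dependent random walk (an added illustration with arbitrary parameters): the step size depends on the history, but each step lies in $[-1,1]$ and has conditional mean zero, and the walk still stays within $O(\sqrt{n})$ of the origin with overwhelming frequency, as (13) predicts.

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 5_000, 20_000

S = np.zeros(trials)
for i in range(n):
    # The step magnitude depends on the history through S, but the sign is a fresh
    # fair coin, so each step is in [-1, 1] and has conditional mean zero.
    step = 1.0 / (1.0 + np.abs(S) / np.sqrt(i + 1.0))
    S += step * rng.choice([-1.0, 1.0], size=trials)

for lam in [0.5, 1.0, 1.5, 2.0]:
    print(f"P(|S_n| >= {lam:.1f}*sqrt(n)) ~ {(np.abs(S) >= lam * np.sqrt(n)).mean():.2e}")
```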
Proof: Again, we can reduce to the case when the $X_i$ are real, and it suffices to establish the upper tail estimate

$$ {\bf P}( S_n \geq \lambda \sqrt{n} ) \leq C \exp( - c \lambda^2 ). $$

Note that $|S_n| \leq n$ almost surely, so we may assume without loss of generality that $\lambda \leq \sqrt{n}$.
Once again, we consider the exponential moment ${\bf E} \exp( t S_n )$ for some parameter $t > 0$. We write $S_n = S_{n-1} + X_n$, so that

$$ {\bf E} \exp( t S_n ) = {\bf E} \exp( t S_{n-1} ) \exp( t X_n ). $$

We do not have independence between $S_{n-1}$ and $X_n$, so cannot split the expectation as in the proof of Chernoff’s inequality. Nevertheless we can use conditional expectation as a substitute. We can rewrite the above expression as

$$ {\bf E}\, {\bf E}\big( \exp( t S_{n-1} ) \exp( t X_n ) \,\big|\, X_1,\ldots,X_{n-1} \big). $$

The quantity $\exp( t S_{n-1} )$ is deterministic once we condition on $X_1,\ldots,X_{n-1}$, and so we can pull it out of the conditional expectation:

$$ {\bf E}\, \exp( t S_{n-1} )\, {\bf E}\big( \exp( t X_n ) \,\big|\, X_1,\ldots,X_{n-1} \big). $$

Applying (10) to the conditional expectation, we have

$$ {\bf E}\big( \exp( t X_n ) \,\big|\, X_1,\ldots,X_{n-1} \big) \leq \exp( O( t^2 ) ) $$

and hence

$$ {\bf E} \exp( t S_n ) \leq \exp( O(t^2) )\, {\bf E} \exp( t S_{n-1} ). $$

Iterating this argument gives

$$ {\bf E} \exp( t S_n ) \leq \exp( O( n t^2 ) ) $$

and thus by Markov’s inequality

$$ {\bf P}( S_n \geq \lambda \sqrt{n} ) \leq \exp( O( n t^2 ) - t \lambda \sqrt{n} ). $$

Optimising in $t$ gives the claim. $\Box$
Exercise 5 Suppose we replace the hypothesis $|X_i| \leq 1$ in Azuma’s inequality with the more general hypothesis $|X_i| \leq c_i$ for some scalars $c_i > 0$. Show that we still have (13), but with $\sqrt{n}$ replaced by $( \sum_{i=1}^n c_i^2 )^{1/2}$.
Remark 3 The exponential moment method is also used frequently in harmonic analysis to deal with lacunary exponential sums, or sums involving Rademacher functions (which are the analogue of lacunary exponential sums for characteristic $2$). Examples here include Khintchine’s inequality (and the closely related Kahane’s inequality). The exponential moment method also combines very well with log-Sobolev inequalities, as we shall see below (basically because the logarithm inverts the exponential), as well as with the closely related hypercontractivity inequalities.
— 2. The truncation method —
To summarise the discussion so far, we have identified a number of large deviation inequalities to control a sum $S_n = X_1 + \ldots + X_n$:
- The zeroth moment method bound (1), which requires no moment assumptions on the $X_i$ but is only useful when the $X_i$ are usually zero, and has no decay in $\lambda$.
- The first moment method bound (2), which only requires absolute integrability on the $X_i$, but has only a linear decay in $\lambda$.
- The second moment method bound (5), which requires second moment and pairwise independence bounds on the $X_i$, and gives a quadratic decay in $\lambda$.
- Higher moment bounds (7), which require boundedness and $k$-wise independence, and give a $k^{th}$ power decay in $\lambda$ (or quadratic-exponential decay, after optimising in $k$).
- Exponential moment bounds such as (11) or (13), which require boundedness and joint independence (or martingale behaviour), and give quadratic-exponential decay in $\lambda$.
We thus see that the bounds with the strongest decay in $\lambda$ require strong boundedness and independence hypotheses. However, one can often partially extend these strong results from the case of bounded random variables to that of unbounded random variables (provided one still has sufficient control on the decay of these variables) by a simple but fundamental trick, known as the truncation method. The basic idea here is to take each random variable $X_i$ and split it as $X_i = X_{i,\leq N} + X_{i,>N}$, where $N$ is a truncation parameter to be optimised later (possibly in a manner depending on $n$),

$$ X_{i,\leq N} := X_i\, {\bf I}( |X_i| \leq N ) $$

is the restriction of $X_i$ to the event that $|X_i| \leq N$ (thus $X_{i,\leq N}$ vanishes when $X_i$ is too large), and

$$ X_{i,>N} := X_i\, {\bf I}( |X_i| > N ) $$

is the restriction of $X_i$ to the complementary event. One can similarly split $S_n = S_{n,\leq N} + S_{n,>N}$ where

$$ S_{n,\leq N} := X_{1,\leq N} + \ldots + X_{n,\leq N} $$

and

$$ S_{n,>N} := X_{1,>N} + \ldots + X_{n,>N}. $$

The idea is then to estimate the tail of $S_{n,\leq N}$ and $S_{n,>N}$ by two different means. With $S_{n,\leq N}$, the point is that the variables $X_{i,\leq N}$ have been made bounded by fiat, and so the more powerful large deviation inequalities can now be put into play. With $S_{n,>N}$, in contrast, the underlying variables $X_{i,>N}$ are certainly not bounded, but they tend to have small zeroth and first moments, and so the bounds based on those moment methods tend to be powerful here. (Readers who are familiar with harmonic analysis may recognise this type of divide and conquer argument as an interpolation argument.)
Let us begin with a simple application of this method.
Theorem 4 (Weak law of large numbers) Let $X_1, X_2, X_3, \ldots$ be iid scalar random variables with $X_i \equiv X$ for all $i$, where $X$ is absolutely integrable. Then $S_n / n$ converges in probability to ${\bf E} X$.
Proof: By subtracting ${\bf E} X$ from $X$ we may assume without loss of generality that $X$ has mean zero. Our task is then to show that ${\bf P}( |S_n| \geq \epsilon n ) \to 0$ as $n \to \infty$ for all fixed $\epsilon > 0$.
If $X$ has finite variance, then the claim follows from (5). If $X$ has infinite variance, we cannot apply (5) directly, but we may perform the truncation method as follows. Let $N$ be a large parameter to be chosen later, and split $X_i = X_{i,\leq N} + X_{i,>N}$, $S_n = S_{n,\leq N} + S_{n,>N}$ (and $X = X_{\leq N} + X_{>N}$) as discussed above. The variable $X_{\leq N}$ is bounded and thus has bounded variance; also, from the dominated convergence theorem we see that $|{\bf E} X_{\leq N}| \leq \epsilon/4$ (say) if $N$ is large enough. From (5) we conclude that

$$ {\bf P}( |S_{n,\leq N}| \geq \epsilon n / 2 ) \to 0 \hbox{ as } n \to \infty $$

(where the rate of decay here depends on $N$ and $\epsilon$). Meanwhile, to deal with the tail $S_{n,>N}$ we use (2) to conclude that

$$ {\bf P}( |S_{n,>N}| \geq \epsilon n / 2 ) \leq \frac{2}{\epsilon}\, {\bf E} |X_{>N}|. $$

But by the dominated convergence theorem (or monotone convergence theorem), we may make ${\bf E} |X_{>N}|$ as small as we please (say, smaller than $\delta > 0$) by taking $N$ large enough. Summing, we conclude that

$$ \limsup_{n \to \infty} {\bf P}( |S_n| \geq \epsilon n ) \leq \frac{2\delta}{\epsilon}; $$

since $\delta > 0$ is arbitrary, we obtain the claim. $\Box$
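As a numerical illustration of the point of the truncation step (added here; the Pareto tail index $1.5$ is an arbitrary choice), the sketch below averages iid variables that are absolutely integrable but have infinite variance. The sample means still stabilise around the true mean, as Theorem 4 asserts, even though (5) cannot be applied directly to the untruncated variables.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 1.5                              # Pareto tail index: finite mean, infinite variance
true_mean = alpha / (alpha - 1.0)        # mean of a Pareto(alpha) variable supported on [1, infinity)

for n in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    X = rng.pareto(alpha, size=n) + 1.0  # numpy's pareto is the shifted (Lomax) version; add 1
    print(f"n = {n:>7}   sample mean = {X.mean():.3f}   (true mean = {true_mean:.3f})")
```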
A more sophisticated variant of this argument (which I gave in this earlier blog post, which also has some further discussion and details) gives
Theorem 5 (Strong law of large numbers) Let $X_1, X_2, X_3, \ldots$ be iid scalar random variables with $X_i \equiv X$ for all $i$, where $X$ is absolutely integrable. Then $S_n / n$ converges almost surely to ${\bf E} X$.
Proof: We may assume without loss of generality that $X$ is real, since the complex case then follows by splitting into real and imaginary parts. By splitting $X$ into positive and negative parts, we may furthermore assume that $X$ is non-negative. (Of course, by doing so, we can no longer normalise $X$ to have mean zero, but for us the non-negativity will be more convenient than the zero mean property.) In particular, $S_n$ is now non-decreasing in $n$.
Next, we apply a sparsification trick. Let $0 < \epsilon < 1$. Suppose that we knew that, almost surely, $S_{n_m}/n_m$ converged to ${\bf E} X$ for $n$ of the form $n_m := \lfloor (1+\epsilon)^m \rfloor$ for some integer $m$. Then, for all other values of $n$, we see that asymptotically, $S_n/n$ can only fluctuate by a multiplicative factor of $1 + O(\epsilon)$, thanks to the monotone nature of $S_n$. Because of this and countable additivity, we see that it suffices to show that $S_{n_m}/n_m$ converges to ${\bf E} X$. Actually, it will be enough to show that almost surely, one has

$$ |S_{n_m}/n_m - {\bf E} X| \leq \epsilon $$

for all but finitely many $m$.
Fix $\epsilon$. As before, we split $X = X_{\leq N_m} + X_{> N_m}$ and $S_{n_m} = S_{n_m, \leq N_m} + S_{n_m, > N_m}$, but with the twist that we now allow the truncation parameter $N_m$ to depend on $m$. Then for $N_m$ large enough we have $|{\bf E} X - {\bf E} X_{\leq N_m}| \leq \epsilon/2$ (say), by dominated convergence. Applying (5) as before, we see that

$$ {\bf P}\big( |S_{n_m, \leq N_m}/n_m - {\bf E} X| > \epsilon \big) \leq \frac{C_\epsilon}{n_m} {\bf E} |X_{\leq N_m}|^2 $$

for some $C_\epsilon$ depending only on $\epsilon$ (the exact value is not important here). To handle the tail, we will not use the first moment bound (2) as done previously, but now turn to the zeroth-moment bound (1) to obtain

$$ {\bf P}( S_{n_m, > N_m} \neq 0 ) \leq n_m\, {\bf P}( |X| > N_m ); $$

summing, we conclude

$$ {\bf P}\big( |S_{n_m}/n_m - {\bf E} X| > \epsilon \big) \leq \frac{C_\epsilon}{n_m} {\bf E} |X_{\leq N_m}|^2 + n_m\, {\bf P}( |X| > N_m ). $$

Applying the Borel-Cantelli lemma (Exercise 1 from Notes 0), we see that we will be done as long as we can choose $N_m$ such that

$$ \sum_m \frac{1}{n_m} {\bf E} |X_{\leq N_m}|^2 \quad \hbox{and} \quad \sum_m n_m\, {\bf P}( |X| > N_m ) $$

are both finite. But this can be accomplished by setting $N_m := n_m$ and interchanging the sum and expectations (writing ${\bf P}( |X| > N_m )$ as ${\bf E}\, {\bf I}( |X| > N_m )$) and using the lacunary nature of the $n_m$. $\Box$
To give another illustration of the truncation method, we extend a version of the Chernoff bound to the subgaussian case.
Proposition 6 Let $X_1,\ldots,X_n$ be iid copies of a subgaussian random variable $X$, thus $X$ obeys a bound of the form

$$ {\bf P}( |X| \geq t ) \leq C \exp( - c t^2 ) \qquad (14)$$

for all $t > 0$ and some $C, c > 0$. Let $S_n := X_1 + \ldots + X_n$. Then for any sufficiently large $A$ (independent of $n$) we have

$$ {\bf P}( |S_n - n {\bf E} X| \geq A n ) \leq C_A \exp( - c_A n ) $$

for some constants $C_A, c_A > 0$ depending on $A, C, c$. Furthermore, $c_A$ grows linearly in $A$ as $A \to \infty$.
Proof: By subtracting the mean from $X$ we may normalise ${\bf E} X = 0$. We perform a dyadic decomposition

$$ X_i = X_{i,0} + \sum_{m=1}^\infty X_{i,m} $$

where $X_{i,0} := X_i {\bf I}( |X_i| \leq 1 )$ and $X_{i,m} := X_i {\bf I}( 2^{m-1} < |X_i| \leq 2^m )$. We similarly split

$$ S_n = S_{n,0} + \sum_{m=1}^\infty S_{n,m} $$

where $S_{n,m} := \sum_{i=1}^n X_{i,m}$. Then by the union bound and the pigeonhole principle we have

$$ {\bf P}( |S_n| \geq A n ) \leq \sum_{m=0}^\infty {\bf P}\big( |S_{n,m}| \geq 2^{-m-1} A n \big) $$

(say). Each $X_{i,m}$ is clearly bounded in magnitude by $2^m$; from the subgaussian hypothesis one can also verify that the mean and variance of $X_{i,m}$ are at most $C' \exp( - c' 4^m )$ for some $C', c' > 0$. If $A$ is large enough, an application of the Chernoff bound (11) (or more precisely, the refinement in Exercise 3) then gives (after some computation)

$$ {\bf P}\big( |S_{n,m}| \geq 2^{-m-1} A n \big) \leq C 2^{-m} \exp( - c_A n ) $$

(say) for some $C, c_A > 0$, and the claim follows. $\Box$
Exercise 6 Show that the hypothesis that $A$ is sufficiently large can be replaced by the hypothesis that $A > 0$ is independent of $n$. Hint: There are several approaches available. One can adapt the above proof; one can modify the proof of the Chernoff inequality directly; or one can figure out a way to deduce the small $A$ case from the large $A$ case.
Exercise 7 Show that the subgaussian hypothesis can be generalised to a sub-exponential tail hypothesis

$$ {\bf P}( |X| \geq t ) \leq C \exp( - c t^p ) $$

provided that $p \geq 1$. Show that the result also extends to the case $0 < p < 1$, except with the exponent $n$ in the conclusion replaced by $n^p$ (up to an additional loss). (I do not know if this loss can be removed, but it is easy to see that one cannot hope to do much better than this, just by considering the probability that $X_1$ (say) is already as large as $An$.)
— 3. Lipschitz combinations —
In the preceding discussion, we had only considered the linear combination $X_1 + \ldots + X_n$ of independent variables $X_1,\ldots,X_n$. Now we consider more general combinations $F(X)$, where we write $X := (X_1,\ldots,X_n)$ for short. Of course, to get any non-trivial results we must make some regularity hypotheses on $F$. It turns out that a particularly useful class of regularity hypothesis here is a Lipschitz hypothesis – that small variations in $X$ lead to small variations in $F(X)$. A simple example of this is McDiarmid’s inequality:
Theorem 7 (McDiarmid’s inequality) Let $X_1,\ldots,X_n$ be independent random variables taking values in ranges $R_1,\ldots,R_n$, and let $F: R_1 \times \ldots \times R_n \to {\bf R}$ be a function with the property that if one freezes all but the $i^{th}$ coordinate of $F(x_1,\ldots,x_n)$ for some $1 \leq i \leq n$, then $F$ only fluctuates by at most $c_i > 0$, thus

$$ |F(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_n) - F(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n)| \leq c_i $$

for all $x_j \in R_j$, $x_i' \in R_i$ for $1 \leq j \leq n$. Then for any $\lambda > 0$, one has

$$ {\bf P}( |F(X) - {\bf E} F(X)| \geq \lambda \sigma ) \leq C \exp( - c \lambda^2 ) \qquad (15)$$

for some absolute constants $C, c > 0$, where $\sigma^2 := \sum_{i=1}^n c_i^2$.
Proof: We may assume that $F$ is real. By symmetry, it suffices to show the one-sided estimate

$$ {\bf P}( F(X) - {\bf E} F(X) \geq \lambda \sigma ) \leq C \exp( - c \lambda^2 ). $$

To compute this quantity, we again use the exponential moment method. Let $t > 0$ be a parameter to be chosen later, and consider the exponential moment

$$ {\bf E} \exp( t F(X) ). \qquad (16)$$

To compute this, let us condition $X_1,\ldots,X_{n-1}$ to be fixed, and look at the conditional expectation

$$ {\bf E}\big( \exp( t F(X) ) \,\big|\, X_1,\ldots,X_{n-1} \big). $$

We can simplify this as

$$ {\bf E}\big( \exp( t Y ) \,\big|\, X_1,\ldots,X_{n-1} \big)\, \exp\big( t\, {\bf E}( F(X) | X_1,\ldots,X_{n-1} ) \big) $$

where

$$ Y := F(X) - {\bf E}( F(X) | X_1,\ldots,X_{n-1} ). $$

For $X_1,\ldots,X_{n-1}$ fixed, $tY$ only fluctuates by at most $t c_n$ and has mean zero. Applying (10), we conclude that

$$ {\bf E}\big( \exp( t Y ) \,\big|\, X_1,\ldots,X_{n-1} \big) \leq \exp( O( t^2 c_n^2 ) ). $$

Integrating out the conditioning, we see that we have upper bounded (16) by

$$ \exp( O( t^2 c_n^2 ) )\, {\bf E} \exp\big( t\, {\bf E}( F(X) | X_1,\ldots,X_{n-1} ) \big). $$

We observe that ${\bf E}( F(X) | X_1,\ldots,X_{n-1} )$ is a function $F_{n-1}(X_1,\ldots,X_{n-1})$ of $X_1,\ldots,X_{n-1}$, where $F_{n-1}$ obeys the same hypotheses as $F$ (but for $n-1$ instead of $n$). We can then iterate the above computation $n$ times and eventually upper bound (16) by

$$ \exp\Big( O\big( t^2 \sum_{i=1}^n c_i^2 \big) \Big)\, \exp( t\, {\bf E} F(X) ) $$

which we rearrange as

$$ {\bf E} \exp\big( t ( F(X) - {\bf E} F(X) ) \big) \leq \exp( O( t^2 \sigma^2 ) ) $$

and thus by Markov’s inequality

$$ {\bf P}( F(X) - {\bf E} F(X) \geq \lambda \sigma ) \leq \exp( O( t^2 \sigma^2 ) - t \lambda \sigma ). $$

Optimising in $t$ then gives the claim. $\Box$
Exercise 8 Show that McDiarmid’s inequality implies Hoeffding’s inequality (Exercise 4).
Remark 4 One can view McDiarmid’s inequality as a tensorisation of Hoeffding’s lemma, as it leverages the latter lemma for a single random variable to establish an analogous result for $n$ random variables. It is possible to apply this tensorisation trick to random variables taking values in more sophisticated metric spaces than an interval $[a,b]$, leading to a class of concentration of measure inequalities known as transportation cost-information inequalities, which will not be discussed here.
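For a concrete illustration of Theorem 7 (added here, not part of the original notes), consider the number of empty bins after throwing $n$ balls independently and uniformly into $n$ bins, viewed as a function $F(X_1,\ldots,X_n)$ of the ball positions. Moving a single ball changes the count by at most $1$, so McDiarmid's inequality predicts fluctuations of size $O(\sqrt{n})$ around the mean $\approx n/e$; the sketch below checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 2_000, 5_000

F = np.empty(trials)
for t in range(trials):
    balls = rng.integers(0, n, size=n)   # X_i = bin chosen by ball i
    F[t] = n - np.unique(balls).size     # F(X_1,...,X_n) = number of empty bins

print(f"mean of F ~ {F.mean():.1f}   (n/e = {n / np.e:.1f})")
print(f"std  of F ~ {F.std():.1f}   (sqrt(n) = {np.sqrt(n):.1f})")
for lam in [1.0, 2.0]:
    tail = (np.abs(F - F.mean()) >= lam * np.sqrt(n)).mean()
    print(f"P(|F - EF| >= {lam:.0f}*sqrt(n)) ~ {tail:.2e}")
```

The observed fluctuations are in fact quite a bit smaller than $\sqrt{n}$, which is consistent with (15): McDiarmid's inequality gives an upper bound at the scale $\sigma = \sqrt{n}$, not a matching lower bound.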
The most powerful concentration of measure results, though, do not just exploit Lipschitz type behaviour in each individual variable, but joint Lipschitz behaviour. Let us first give a classical instance of this, in the special case when the $X_1,\ldots,X_n$ are gaussian variables. A key property of gaussian variables is that any linear combination of independent gaussians is again a gaussian:
Exercise 9 Let $X_1,\ldots,X_n$ be independent real gaussian variables with $X_i \equiv N(\mu_i,\sigma_i^2)$, and let $c_1,\ldots,c_n$ be real constants. Show that $c_1 X_1 + \ldots + c_n X_n$ is a real gaussian with mean $\sum_{i=1}^n c_i \mu_i$ and variance $\sum_{i=1}^n |c_i|^2 \sigma_i^2$.
Show that the same claims also hold with complex gaussians and complex constants $c_i$.
Exercise 10 (Rotation invariance) Let $X = (X_1,\ldots,X_n)$ be an ${\bf R}^n$-valued random variable, where $X_1,\ldots,X_n \equiv N(0,1)_{\bf R}$ are iid real gaussians. Show that for any orthogonal matrix $U \in O(n)$, $UX \equiv X$.
Show that the same claim holds for complex gaussians (so $X$ is now ${\bf C}^n$-valued), and with the orthogonal group $O(n)$ replaced by the unitary group $U(n)$.
Theorem 8 (Gaussian concentration inequality for Lipschitz functions) Let $X_1,\ldots,X_n \equiv N(0,1)_{\bf R}$ be iid real gaussian variables, and let $F: {\bf R}^n \to {\bf R}$ be a $1$-Lipschitz function (i.e. $|F(x) - F(y)| \leq |x - y|$ for all $x, y \in {\bf R}^n$, where we use the Euclidean metric on ${\bf R}^n$). Then for any $\lambda > 0$ one has

$$ {\bf P}( |F(X) - {\bf E} F(X)| \geq \lambda ) \leq C \exp( - c \lambda^2 ) $$

for some absolute constants $C, c > 0$.
Proof: We use the following elegant argument of Maurey and Pisier. By subtracting a constant from $F$, we may normalise ${\bf E} F(X) = 0$. By symmetry it then suffices to show the upper tail estimate

$$ {\bf P}( F(X) \geq \lambda ) \leq C \exp( - c \lambda^2 ). $$

By smoothing $F$ slightly we may assume that $F$ is smooth, since the general case then follows from a limiting argument. In particular, the Lipschitz bound on $F$ now implies the gradient estimate

$$ |\nabla F(x)| \leq 1 \qquad (17)$$

for all $x$.
Once again, we use the exponential moment method. It will suffice to show that

$$ {\bf E} \exp( t F(X) ) \leq \exp( C t^2 ) $$

for some constant $C > 0$ and all $t > 0$, as the claim follows from Markov’s inequality and optimisation in $t$ as in previous arguments.
To exploit the Lipschitz nature of $F$, we will need to introduce a second copy of $X$. Let $Y$ be an independent copy of $X$. Since ${\bf E} F(Y) = 0$, we see from Jensen’s inequality that

$$ {\bf E} \exp( - t F(Y) ) \geq 1 $$

and thus (by independence of $X$ and $Y$)

$$ {\bf E} \exp( t F(X) ) \leq {\bf E} \exp( t ( F(X) - F(Y) ) ). $$

It is tempting to use the fundamental theorem of calculus along a line segment,

$$ F(X) - F(Y) = \int_0^1 \frac{d}{dt} F( (1-t) Y + t X )\, dt, $$

to estimate $F(X) - F(Y)$, but it turns out for technical reasons to be better to use a circular arc instead,

$$ F(X) - F(Y) = \int_0^{\pi/2} \frac{d}{d\theta} F( Y \cos\theta + X \sin\theta )\, d\theta. $$

The reason for this is that $X_\theta := Y \cos\theta + X \sin\theta$ is another gaussian random variable equivalent to $X$, as is its derivative $X'_\theta := - Y \sin\theta + X \cos\theta$ (by Exercise 9); furthermore, and crucially, these two random variables are independent (by Exercise 10).
To exploit this, we first use Jensen’s inequality to bound

$$ \exp\big( t ( F(X) - F(Y) ) \big) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp\Big( \frac{\pi t}{2} \frac{d}{d\theta} F( X_\theta ) \Big)\, d\theta. $$

Applying the chain rule and taking expectations, we have

$$ {\bf E} \exp\big( t ( F(X) - F(Y) ) \big) \leq \frac{2}{\pi} \int_0^{\pi/2} {\bf E} \exp\Big( \frac{\pi t}{2} \nabla F( X_\theta ) \cdot X'_\theta \Big)\, d\theta. $$

Let us condition $X_\theta$ to be fixed, then $X'_\theta$ remains a standard gaussian vector; applying Exercise 9 and (17), we conclude that $\frac{\pi t}{2} \nabla F( X_\theta ) \cdot X'_\theta$ is normally distributed with standard deviation at most $\frac{\pi t}{2}$. As such we have

$$ {\bf E} \exp\Big( \frac{\pi t}{2} \nabla F( X_\theta ) \cdot X'_\theta \Big) \leq \exp( C t^2 ) $$

for some absolute constant $C$; integrating out the conditioning on $X_\theta$ we obtain the claim. $\Box$
Exercise 11 Show that Theorem 8 is equivalent to the inequality

$$ {\bf P}( X \in A )\, {\bf P}( X \notin A_\lambda ) \leq C \exp( - c \lambda^2 ) $$

holding for all $\lambda > 0$ and all measurable sets $A \subset {\bf R}^n$, where $X = (X_1,\ldots,X_n)$ is an ${\bf R}^n$-valued random variable with iid gaussian components $X_1,\ldots,X_n \equiv N(0,1)_{\bf R}$, and $A_\lambda$ is the $\lambda$-neighbourhood of $A$.
Now we give a powerful concentration inequality of Talagrand, which we will rely heavily on later in this course.
Theorem 9 (Talagrand concentration inequality) Let $K > 0$, and let $X_1,\ldots,X_n$ be independent complex variables with $|X_i| \leq K$ for all $1 \leq i \leq n$. Let $F: {\bf C}^n \to {\bf R}$ be a $1$-Lipschitz convex function (where we identify ${\bf C}^n$ with ${\bf R}^{2n}$ for the purposes of defining “Lipschitz” and “convex”). Then for any $\lambda > 0$ one has

$$ {\bf P}( |F(X) - {\bf M} F(X)| \geq \lambda K ) \leq C \exp( - c \lambda^2 ) \qquad (18)$$

and

$$ {\bf P}( |F(X) - {\bf E} F(X)| \geq \lambda K ) \leq C \exp( - c \lambda^2 ) \qquad (19)$$

for some absolute constants $C, c > 0$, where ${\bf M} F(X)$ is a median of $F(X)$.
We now prove the theorem, following the remarkable argument of Talagrand.
By dividing through by $K$ we may normalise $K = 1$. $X := (X_1,\ldots,X_n)$ now takes values in the convex set $\Omega^n \subset {\bf C}^n$, where $\Omega$ is the unit disk in ${\bf C}$. It will suffice to establish the inequality

$$ {\bf E} \exp\big( c\, d(X,A)^2 \big) \leq \frac{1}{{\bf P}( X \in A )} \qquad (20)$$

for any convex set $A$ in $\Omega^n$ and some absolute constant $c > 0$, where $d(X,A)$ is the Euclidean distance between $X$ and $A$. Indeed, if one obtains this estimate, then one has

$$ {\bf P}( F(X) \leq x )\, {\bf P}( F(X) \geq y ) \leq \exp( - c |x-y|^2 ) $$

for any $y > x$ (as can be seen by applying (20) to the convex set $A := \{ z \in \Omega^n : F(z) \leq x \}$). Applying this inequality with one of $x, y$ equal to the median ${\bf M} F(X)$ of $F(X)$ yields (18), which in turn implies that

$$ {\bf E} F(X) = {\bf M} F(X) + O(1) $$

which then gives (19).
We would like to establish (20) by induction on dimension $n$. In the case when the $X_i$ are Bernoulli variables, this can be done; see this previous blog post. In the general case, it turns out that in order to close the induction properly, one must strengthen (20) by replacing the Euclidean distance $d(X,A)$ by an essentially larger quantity, which I will call the combinatorial distance $d_c(X,A)$ from $X$ to $A$. For each vector $z = (z_1,\ldots,z_n) \in {\bf C}^n$ and $\omega = (\omega_1,\ldots,\omega_n) \in \{0,1\}^n$, we say that $\omega$ supports $z$ if $z_i$ is non-zero only when $\omega_i$ is non-zero. Define the combinatorial support $U_A(X)$ of $A$ relative to $X$ to be all the vectors in $\{0,1\}^n$ that support at least one vector in $A - X$. Define the combinatorial hull $V_A(X)$ of $A$ relative to $X$ to be the convex hull of $U_A(X)$, and then define the combinatorial distance $d_c(X,A)$ to be the distance between $V_A(X)$ and the origin.
Lemma 10 (Combinatorial distance controls Euclidean distance) Let $A$ be a convex subset of $\Omega^n$. Then $d(X,A) \leq 2 d_c(X,A)$.
Proof: Suppose $d_c(X,A) \leq r$. Then there exists a convex combination $t$ of elements $\omega \in U_A(X)$ which has magnitude at most $r$. For each such $\omega$, we can find a vector $z_\omega \in A - X$ supported by $\omega$. As $X$ and $X + z_\omega$ both lie in $\Omega^n$, every coefficient of $z_\omega$ has magnitude at most $2$, and is thus bounded in magnitude by twice the corresponding coefficient of $\omega$. If we then let $z$ be the convex combination of the $z_\omega$ indicated by $t$, then the magnitude of each coefficient of $z$ is bounded by twice the corresponding coefficient of $t$, and so $|z| \leq 2r$. On the other hand, as $A$ is convex, $X + z$ lies in $A$, and so $d(X,A) \leq 2r$. The claim follows. $\Box$
Thus to show (20) it suffices (after a modification of the constant $c$) to show that

$$ {\bf E} \exp\big( c\, d_c(X,A)^2 \big) \leq \frac{1}{{\bf P}( X \in A )}. \qquad (21)$$

We first verify the one-dimensional case. In this case, $d_c(X,A)$ equals $0$ when $X \in A$, and $1$ otherwise, and the claim follows from elementary calculus (for $c$ small enough).
Now suppose that $n > 1$ and the claim has already been proven for $n - 1$. We write $X = (X', X_n)$, and for each $x_n \in \Omega$ we let $A_{x_n} := \{ z' \in \Omega^{n-1} : (z', x_n) \in A \}$ be a slice of $A$. We also let $B := \{ z' \in \Omega^{n-1} : (z', t) \in A \hbox{ for some } t \in \Omega \}$ be the projection of $A$. We have the following basic inequality:
Lemma 11 For any $0 \leq \lambda \leq 1$, we have

$$ d_c( X, A )^2 \leq (1-\lambda)\, d_c( X', A_{X_n} )^2 + \lambda\, d_c( X', B )^2 + \lambda^2. $$

Proof: Observe that $U_A(X)$ contains both $U_{A_{X_n}}(X') \times \{0\}$ and $U_B(X') \times \{1\}$, and so by convexity, $V_A(X)$ contains the point $( (1-\lambda) s' + \lambda t', \lambda )$ whenever $s' \in V_{A_{X_n}}(X')$ and $t' \in V_B(X')$. The claim then follows from Pythagoras’ theorem and the Cauchy-Schwarz inequality. $\Box$
Let us now freeze $X_n$ and consider the conditional expectation

$$ {\bf E}\big( \exp( c\, d_c(X,A)^2 ) \,\big|\, X_n \big). $$

Using the above lemma (with some $0 \leq \lambda \leq 1$ depending on $X_n$ to be chosen later), we may bound this conditional expectation by

$$ {\bf E}\Big( \exp\big( c [ (1-\lambda)\, d_c(X',A_{X_n})^2 + \lambda\, d_c(X',B)^2 + \lambda^2 ] \big) \,\Big|\, X_n \Big); $$

applying Hölder’s inequality and the induction hypothesis (21), we can bound this by

$$ e^{c \lambda^2} \Big( \frac{1}{{\bf P}( X' \in A_{X_n} )} \Big)^{1-\lambda} \Big( \frac{1}{{\bf P}( X' \in B )} \Big)^{\lambda} $$

which we can rearrange as

$$ \frac{1}{{\bf P}( X' \in B )}\, e^{c\lambda^2} r^{-(1-\lambda)} $$

where $r := {\bf P}( X' \in A_{X_n} ) / {\bf P}( X' \in B )$ (here we note that the event $X' \in B$ is independent of $X_n$). Note that $0 \leq r \leq 1$. We then apply the elementary inequality

$$ \inf_{0 \leq \lambda \leq 1} e^{c\lambda^2} r^{-(1-\lambda)} \leq 2 - r, $$

which can be verified by elementary calculus if $c$ is small enough (in fact one can take $c = 1/4$). We conclude that

$$ {\bf E}\big( \exp( c\, d_c(X,A)^2 ) \,\big|\, X_n \big) \leq \frac{1}{{\bf P}( X' \in B )} ( 2 - r ). $$

Taking expectations in $X_n$ we conclude that

$$ {\bf E} \exp( c\, d_c(X,A)^2 ) \leq \frac{1}{{\bf P}( X' \in B )} \Big( 2 - \frac{{\bf P}( X \in A )}{{\bf P}( X' \in B )} \Big). $$

Using the inequality $x(2-x) \leq 1$ with $x := {\bf P}( X \in A ) / {\bf P}( X' \in B )$ we conclude (21) as desired. $\Box$
The above argument was elementary, but rather “magical” in nature. Let us now give a somewhat different argument of Ledoux, based on log-Sobolev inequalities, which gives the upper tail bound

$$ {\bf P}( F(X) \geq {\bf E} F(X) + \lambda K ) \leq C \exp( - c \lambda^2 ) \qquad (22)$$

but curiously does not give the lower tail bound. (The situation is not symmetric, due to the convexity hypothesis on $F$.)
Once again we can normalise $K = 1$. By regularising $F$ we may assume that $F$ is smooth. The first step is to establish the following log-Sobolev inequality:
Lemma 12 (Log-Sobolev inequality) Let $F: {\bf C}^n \to {\bf R}$ be a smooth convex function. Then

$$ {\bf E}\, F(X) e^{F(X)} \leq \big( {\bf E} e^{F(X)} \big) \big( \log {\bf E} e^{F(X)} \big) + C\, {\bf E}\, e^{F(X)} |\nabla F(X)|^2 $$

for some absolute constant $C$ (independent of $n$).
Remark 5 If one sets $f := e^{F/2}$ and normalises ${\bf E} f(X)^2 = 1$, this inequality becomes

$$ {\bf E}\, f(X)^2 \log f(X)^2 \leq 4 C\, {\bf E}\, |\nabla f(X)|^2 $$

which more closely resembles the classical log-Sobolev inequality of Gross. The constant $C$ here can in fact be taken to be $2$; see Ledoux’s paper.
Proof: We first establish the one-dimensional case. If we let $Y$ be an independent copy of $X$, observe that the left-hand side can be rewritten as

$$ \frac{1}{2} {\bf E}\big( ( F(X) - F(Y) ) ( e^{F(X)} - e^{F(Y)} ) \big) + ( {\bf E} F(X) ) ( {\bf E} e^{F(X)} ). $$

From Jensen’s inequality, ${\bf E} F(X) \leq \log {\bf E} e^{F(X)}$, so it will suffice to show that

$$ {\bf E}\big( ( F(X) - F(Y) ) ( e^{F(X)} - e^{F(Y)} ) \big) \leq 2 C\, {\bf E}\, e^{F(X)} |\nabla F(X)|^2. $$

From convexity of $F$ (and hence of $e^F$) and the bounded nature of $X, Y$, we have

$$ F(X) - F(Y) = O( |\nabla F(X)| ) $$

and

$$ e^{F(X)} - e^{F(Y)} = O( |\nabla F(X)|\, e^{F(X)} ) $$

when $F(X) \geq F(Y)$, which leads to

$$ ( F(X) - F(Y) ) ( e^{F(X)} - e^{F(Y)} ) = O( e^{F(X)} |\nabla F(X)|^2 ) $$

in this case. Similarly when $F(X) < F(Y)$ (swapping $X$ and $Y$). The claim follows.
To show the general case, we induct on $n$ (keeping care to ensure that the constant $C$ does not change in this induction process). Write $X = (X', X_n)$, where $X' := (X_1,\ldots,X_{n-1})$. From the induction hypothesis, we have

$$ {\bf E}\big( F(X) e^{F(X)} \,\big|\, X_n \big) \leq e^{f(X_n)} f(X_n) + C\, {\bf E}\big( e^{F(X)} |\nabla' F(X)|^2 \,\big|\, X_n \big) \qquad (23)$$

where $\nabla'$ is the $n-1$-dimensional gradient and $f(X_n) := \log {\bf E}\big( e^{F(X)} \,\big|\, X_n \big)$. Taking expectations, we conclude that

$$ {\bf E}\, F(X) e^{F(X)} \leq {\bf E}\, f(X_n) e^{f(X_n)} + C\, {\bf E}\, e^{F(X)} |\nabla' F(X)|^2. \qquad (24)$$

From the convexity of $F$ and Hölder’s inequality we see that $f$ is also convex, and ${\bf E} e^{f(X_n)} = {\bf E} e^{F(X)}$. By the one-dimensional case already established, we have

$$ {\bf E}\, f(X_n) e^{f(X_n)} \leq \big( {\bf E} e^{F(X)} \big) \big( \log {\bf E} e^{F(X)} \big) + C\, {\bf E}\, e^{f(X_n)} |f'(X_n)|^2. $$

Now, by the chain rule

$$ e^{f(X_n)} |f'(X_n)|^2 = e^{-f(X_n)} \big| {\bf E}\big( e^{F(X)} F_{x_n}(X) \,\big|\, X_n \big) \big|^2 $$

where $F_{x_n}$ is the derivative of $F$ in the $x_n$ direction. Applying Cauchy-Schwarz and taking expectations, we conclude

$$ {\bf E}\, e^{f(X_n)} |f'(X_n)|^2 \leq {\bf E}\, e^{F(X)} |F_{x_n}(X)|^2. $$

Inserting this into (23), (24) we close the induction. $\Box$
Now let $F$ be convex and $1$-Lipschitz. Applying the above lemma to $tF$ for any $t > 0$, we conclude that

$$ {\bf E}\, t F(X) e^{t F(X)} \leq \big( {\bf E} e^{t F(X)} \big) \big( \log {\bf E} e^{t F(X)} \big) + C t^2\, {\bf E}\, e^{t F(X)}; $$

setting $H(t) := {\bf E} e^{t F(X)}$, we can rewrite this as a differential inequality

$$ t H'(t) \leq H(t) \log H(t) + C t^2 H(t) $$

which we can rewrite as

$$ \frac{d}{dt} \Big( \frac{1}{t} \log H(t) \Big) \leq C. $$

From Taylor expansion we see that

$$ \frac{1}{t} \log H(t) \to {\bf E} F(X) $$

as $t \to 0$, and thus

$$ \frac{1}{t} \log H(t) \leq {\bf E} F(X) + C t $$

for any $t > 0$. In other words,

$$ {\bf E}\, e^{t F(X)} \leq \exp\big( t\, {\bf E} F(X) + C t^2 \big). $$

By Markov’s inequality, we conclude that

$$ {\bf P}( F(X) - {\bf E} F(X) \geq \lambda ) \leq \exp( C t^2 - t \lambda ); $$

optimising in $t$ gives (22).
Remark 6 The same argument, starting with Gross’s log-Sobolev inequality for the gaussian measure, gives the upper tail component of Theorem 8, with no convexity hypothesis on $F$. The situation is now symmetric with respect to the reflection $F \mapsto -F$, and so one obtains the lower tail component as well. The method of obtaining concentration inequalities from log-Sobolev inequalities (or related inequalities, such as Poincaré-type inequalities) by combining the latter with the exponential moment method is known as Herbst’s argument, and can be used to establish a number of other functional inequalities of interest.
We now close with a simple corollary of the Talagrand concentration inequality, which will be extremely useful in the sequel.
Corollary 13 (Distance between random vector and a subspace) Let $X_1,\ldots,X_n$ be independent complex-valued random variables with mean zero and variance $1$, and bounded almost surely in magnitude by $K$. Let $V$ be a subspace of ${\bf C}^n$ of dimension $d$. Then for any $\lambda > 0$, one has

$$ {\bf P}\big( | d(X,V) - \sqrt{n-d} | \geq \lambda K \big) \leq C \exp( - c \lambda^2 ) $$

for some absolute constants $C, c > 0$.
Informally, this corollary asserts that the distance between a random vector $X = (X_1,\ldots,X_n)$ and an arbitrary subspace $V$ of dimension $d$ is typically equal to $\sqrt{n-d} + O(K)$.
Proof: The function $z \mapsto d(z,V)$ is convex and $1$-Lipschitz. From Theorem 9, one has

$$ {\bf P}\big( | d(X,V) - {\bf M}\, d(X,V) | \geq \lambda K \big) \leq C \exp( - c \lambda^2 ). $$

To finish the argument, it then suffices to show that

$$ {\bf M}\, d(X,V) = \sqrt{n-d} + O(K). $$

We begin with a second moment calculation. Observe that

$$ d(X,V)^2 = \| \pi(X) \|^2 = \sum_{1 \leq i,j \leq n} p_{ij} X_i \overline{X_j}, $$

where $\pi$ is the orthogonal projection matrix to the complement $V^\perp$ of $V$, and $p_{ij}$ are the components of $\pi$. Taking expectations, we obtain

$$ {\bf E}\, d(X,V)^2 = \sum_{i=1}^n p_{ii} = {\rm tr}(\pi) = n - d \qquad (25)$$

where the latter follows by representing $\pi$ in terms of an orthonormal basis of $V^\perp$. This is close to what we need, but to finish the task we need to obtain some concentration of $d(X,V)^2$ around its mean. For this, we write

$$ d(X,V)^2 - {\bf E}\, d(X,V)^2 = \sum_{1 \leq i,j \leq n} p_{ij} \big( X_i \overline{X_j} - \delta_{ij} \big) $$

where $\delta_{ij}$ is the Kronecker delta. The summands here are pairwise uncorrelated for $1 \leq i \leq j \leq n$, and the $i > j$ cases can be combined with the $i < j$ cases by symmetry. Each summand also has a variance of $O( K^2 |p_{ij}|^2 )$. We thus have the variance bound

$$ {\bf Var}\big( d(X,V)^2 \big) = O\Big( K^2 \sum_{1 \leq i,j \leq n} |p_{ij}|^2 \Big) = O\big( K^2 (n-d) \big), $$

where the latter bound comes from representing $\pi$ in terms of an orthonormal basis of $V^\perp$. From this, (25), and Chebyshev’s inequality, we see that the median of $d(X,V)^2$ is equal to $n - d + O( K \sqrt{n-d} )$, which implies on taking square roots that the median of $d(X,V)$ is $\sqrt{n-d} + O(K)$, as desired. $\Box$
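Corollary 13 is easy to test numerically. The sketch below (an added illustration; the choice of $\pm 1$ entries and of a randomly generated fixed subspace is arbitrary) draws a random sign vector (so $K = 1$) and measures its distance to a $d$-dimensional subspace, which indeed clusters tightly around $\sqrt{n-d}$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, trials = 1_000, 300, 5_000

# A fixed d-dimensional subspace V of R^n, described by an orthonormal basis Q (n x d).
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))

X = rng.choice([-1.0, 1.0], size=(trials, n))     # mean 0, variance 1, |X_i| <= K = 1
proj_sq = (X @ Q) ** 2                            # squared coordinates of the projection onto V
dist = np.sqrt(np.maximum((X ** 2).sum(axis=1) - proj_sq.sum(axis=1), 0.0))

print(f"sqrt(n - d)     = {np.sqrt(n - d):.2f}")
print(f"mean of d(X, V) = {dist.mean():.2f}")
print(f"std  of d(X, V) = {dist.std():.3f}   (stays O(1) as n and d grow)")
```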
4 January, 2010 at 1:40 pm
Michael Lugo
I’m enjoying these notes. I just want to point out a technical issue. This post, and also the post Notes 0, did not appear in Google Reader, although Notes 0a does appear there, and all three posts (0, 0a, 1) appear if I go directly to https://terrytao.wordpress.com/feed/.
I don’t know what’s causing this (and the problem may be my fault, or Google’s), but I thought I’d mention this here in case other people have the same issue.
4 January, 2010 at 10:58 pm
harrison
Google Reader’s doing the same for me. It doesn’t seem like an uncommon bug — for instance, looking at my feed to Tim Gowers’ blog it doesn’t list the post on Littlewood’s conjecture. Two dropped posts in three days does seem high, though.
7 January, 2010 at 8:19 am
Jason Rute
The trouble seems to be something in the content item of the feed (the item containing the full HTML of the post). Using Yahoo Pipes, I created a copy of the feed for this blog and removed the content item. It seems to work fine in Google Reader. The RSS feed is here:
http://pipes.yahoo.com/pipes/pipe.run?_id=ec21ff6b5d982e2f244ca9fd7d10dfac&_render=rss
and the Yahoo Pipe (which I made public) is here:
http://pipes.yahoo.com/pipes/pipe.info?_id=ec21ff6b5d982e2f244ca9fd7d10dfac
Note that you will have to read the article in WordPress rather than in Google Reader.
4 January, 2010 at 8:48 pm
Kelly
Aloha Terry,
I am new to this blog and quite blown away by the sheer amount of fascinating math topics here… Just wondering- how long does it normally take to put these notes together? You don’t seem to sleep!
5 January, 2010 at 1:03 am
Anonymous
Dear Prof. Tao,
I enjoyed this yet another excellent post.
A remark – theorems 4 and 5 seem to be phrased identically – probably a copy-paste typo.
[Corrected, thanks – T.]
5 January, 2010 at 3:59 pm
mmailliw/william
The improved bound you want us to establish in Exercise 3,
(lambda K/sigma)^(-c lambda sigma), is not scale-invariant; if you multiply the X_i by a constant A, then the K, mu_i, sigma_i, mu, and sigma (and therefore the LHS of the desired inequality) remains unchanged.
However, the term (lambda K/sigma)^(-c lambda sigma) is raised to the power of A (as lambda K/sigma is unchanged but -c lambda sigma is multiplied by A).
Am I missing something here, and if not, should the bound be changed to something which is invariant under such scaling?
5 January, 2010 at 4:25 pm
Terence Tao
Ah, the exponent should be
instead. I’ve changed this accordingly.
6 January, 2010 at 1:25 am
Arnab
Dear Prof. Tao,
Such wonderful notes!
In Corollary 13, I think when you did the variance bound of d(X, V)^2,
you mean the summand are ‘pairwise uncorrelated’ instead of ‘pairwise independent’?
[Well, either would work here, so I reworded the text accordingly. -T.]
6 January, 2010 at 5:24 am
Akito Nagasaki
Dear Professor Tao,
Inequality(1) seems not always true!
Pick up a r.v. X taking 1 with probability 1/8, -1 with 1/8 and 0 with 3/4
and Y a copy of X but independent one.
Then P(X+Y not equal to zero) = 38/64 and P(X not equal to zero) +
P(Y not equal to zero)=1/2 ??
6 January, 2010 at 12:06 pm
Terence Tao
I think you may have double-counted some of the probabilities here; my computations show that X+Y is non-zero with probability 26/64.
The proof of the inequality is quite simple in this case, because the event that X+Y is nonzero is clearly contained in the union of the events that X is nonzero and Y is nonzero, thus the probability of the former is at most the sum of the probabilities of the latter.
6 January, 2010 at 10:32 am
John Sidles
For a seminar our UW QSE Group is giving this quarter, So you want to be a quantum system engineer, it would be *very* helpful to have a generalized version of Corollary 13, in which
is not a linear subspace, but rather a rank-
join space of spin-
product states of order-
.
Physicists would call
a restriction of spin-
order-
matrix product states (MPS) to states whose matrices are diagonal of Schmidt rank
… the physical point being that
then bounds the Schmidt rank of all of the
state-space bipartitions. MPS’s like
are natural in general-purpose quantum simulation precisely because they have (due to their diagonal restriction) no natural ordering among the spins.
Since we engineers are going to be exploring Corollary 13 in large dimensions numerically, it would be useful for us to know (in advance) whether we might simply replace
for some specified function
.
Might
… as follows if we approximate $V$ as a linear subspace? Here we welcome all the mathematical guidance we can get, from *any* reader(s) of this weblog!
And please accept also our apology, in advance, should it happen that the function
is well-known, or obvious (or the latex of this post doesn’t render properly).
6 January, 2010 at 5:49 pm
Pietro Poggi-Corradini
The display after “Applying (10), we conclude that” in the proof of McDiarmids Inequality seems to be missing a less-or-equal sign.
[Corrected, thanks – T.]
7 January, 2010 at 2:01 am
arno
Hi!
nice entry… some comments:
– the way to derive concentration from logarithmic Sobolev inequality is called Herbst’s argument as you surely know… and as the same technique may be applied (with technical details) to other functional inequalities… you should perhaps mention it…
– Mc Diarmid’s method, as well as Hoeffding’s inequality, may be considered in the larger framework of transportation’s inequality as studied initially By Marton and also Talagrand, and a very elegant way to derive concentration property… Mc Diarmid’s being then a way to tensorize these inequalities… however, it perhaps implies to consider other quantities (such as wasserstein’s distance) that you do not want to use here…
best regards
[Good suggestions, thanks – T.]
7 January, 2010 at 3:08 am
karabasov
The link to “divide and conquer argument” right before Theorem 4 is not working
[Fixed, thanks – T.]
10 January, 2010 at 8:54 am
Anonymous
in equation
there is an extra 
[Corrected, thanks – T.]
10 January, 2010 at 9:08 am
Anonymous
in Theorem 2,
on the right hand side, we are taking maximum but I do not see with respect to which variable the max is taken.
10 January, 2010 at 9:14 am
Terence Tao
max(x,y) is simply the larger of the two quantities x and y.
10 January, 2010 at 9:18 am
Anonymous
in exercise
, is it
or
?
10 January, 2010 at 9:25 am
Terence Tao
It is
. The idea here is to show that Stirling’s formula can be used to provide a lower bound to complement the upper bounds proven in the text.
10 January, 2010 at 9:34 am
Anonymous
I see. Thanks Prof. Tao
10 January, 2010 at 11:46 am
Steven Heilman
some pedantic typos/details:
1.”gives a crude upper bound on[sic] when”
2. Hoeffding’s inequality (exercise 4): P not E
3. I suppose it’s no coincidence that Theorems 2,3 are (essentially) the crux of the proof of Khintchine’s inequality (on the unit interval). Kahane’s inequality even can be dealt with using “exponential-type” estimates of a distribution function, though the argument I know is more combinatorial (and perhaps “more basic”) in nature.
4(a). [inconsequential detail] proof of theorem 8, first use of Jensen’s,
some pi/2’s (and 2/pi’s) need to be inverted? e.g.
\displaystyle \exp( t (f(X) – f(Y))) \leq \frac{2}{\pi} \int_0^{\pi/2} \exp( \frac{\pi t}{2} \frac{d}{d\theta} f( X_\theta ) )\ d\theta.
etc.
4(b). f should be F
4(c). final bound should be exp(Ct^2)?
(Laplace transform of Gaussian is, essentially, a Gaussian)?
5. definition of combinatorial support: “to be the[sic] all”
6. final equation: errant E on the left, errant \leq in the middle
[Thanks for the corrections! Yes, Khintchine’s inequality can be viewed as part of the family of large deviation inequalities proven by the exponential moment method; I’ve added a remark to this effect -T.]
21 January, 2010 at 6:24 pm
Sean
At the risk of everyone saying “Oh no, not this guy again” are you aware of my idea of combining the Walsh Hadamard transform and randomly selected permutations to transform arbitrary numerical data into data with a Gaussian distribution? Basically it maps one point on a hypersphere to another random point on the hypersphere. The mapping is reversible. If you look up how to pick a point uniformly at random on the surface of a hypersphere you will see that one method involves random numbers with a Gaussian distribution.
I only do informal hobby research but I hope that is useful.
http://code.google.com/p/lemontree/downloads/list
Sean O’Connor
23 January, 2010 at 11:05 am
Anonymous
In the comment following equation (8) it seems that that statement “so we may assume that $\lambda$ is large compared to this quantity”, should read “\lambda” is small compared to this quantity” (otherwise we have the null statement $|S_n| >>n$)
Best
[Thanks – T.]
26 January, 2010 at 8:06 am
Anonymous
Can someone please post the solution to exercise 11.
10 February, 2010 at 9:16 am
PDEbeginner
Dear Prof. Tao,
I finished reading this chapter. I had some problems:
1. In the proof of the strong LLN, if we choose
, I didn’t see why
since we only assume the first order moment for
.
2. In Ex 7, I can get the bounds
.
10 February, 2010 at 9:23 am
Terence Tao
1. This follows from the pointwise bound
which ultimately comes from the lacunary nature of the
.
2. Hmm, that’s strange, because one should get exponential bounds when p >= 1. (For instance, when p=1, one can use the fact that E exp(cX) is bounded for small c to get exponential bounds, see e.g. Lemma 1 of Notes 3.) It may be that your choice of parameters is not fully optimal.
10 February, 2010 at 3:06 pm
PDEbeginner
Thanks!
I understood 2, but still have some problem for 1: the constant
should depend on the sample
,
is possible. To apply Borel-Cantelli Lemma, we need the expectation to be strictly less than
.
10 February, 2010 at 3:19 pm
Terence Tao
The constant C is uniform. This is because the expression
is uniformly bounded (e.g. if the
are powers of 2, then this quantity is always bounded by 2, by the geometric series formula.)
19 February, 2010 at 4:30 am
Anonymous
Dear Prof Tao,
thanks for these very nice lectures.
Shouldn’t \sigma^2 be a \sigma in McDiarmid’s inequality ?
[Corrected, thanks – T.]
16 April, 2010 at 7:00 am
Anonymous
Suppose one weakens the hypotheses of the above concentration of measure results, e.g.,by removing the assumption of finite mean and variance, or by weakening independence assumptions. It seems that such weaker assumptions are needed in certain interesting applications, such as financial modelling. Is there still some sort of universal behavior, such as convergence in distribution to some universal “fat tailed” distribution, that reasonable combinations of such random variables exhibit?
17 April, 2010 at 11:44 pm
Terence Tao
I believe the theory of stable laws addresses these sorts of issues.
18 April, 2010 at 9:55 am
Anonymous
Thanks for the reference.
3 May, 2010 at 8:10 am
Anonymous
In Exercise 11, I got the proof for one direction, but do not know how to prove that the statement in Exercise 11 implies Theorem 8. Could you give a relatively detailed hint?
by the way, one of the “dt” in the proof of Theorem 8 should be replaced with “d\theta”
3 May, 2010 at 8:19 am
Terence Tao
Thanks for the correction!
To get Theorem 8 from Exercise 11, first prove a variant of Theorem 8 in which the expectation is replaced by the median. To prove that variant, argue by contradiction and look at various level sets of F.
3 May, 2010 at 11:23 am
Anonymous
Would it be right to say that Azuma’s inequality is already present in Hoeffding’s 1963 paper (see top of page 18 there), ditto for McDiarmid’s inequality?
3 May, 2010 at 12:51 pm
Terence Tao
Yes, the attribution here can become somewhat complicated; for instance, one can make the case that Hoeffding’s inequality, Azuma’s inequality, and McDiarmid’s inequality were all implicitly present in the work of Bernstein several decades earlier (and are not far off from the work of Chernoff either). Certainly I have seen Azuma’s inequality, for instance, referred to instead as the Azuma-Hoeffding inequality.
Assigning attribution to a result is not a completely precise science; one has to draw the line somewhere between speculating that the result is true, asserting the result without proof, giving a nearly complete sketch, implicitly stating and proving the result en route to some other goal, giving the proof with full details, to giving applications and popularising the result (for instance by explicitly stating it as the main theorem of a paper). At a more practical level, it is also a matter of what name has taken hold in the community (it is unlikely, for instance, that L’Hopital’s rule or Stokes’ theorem will ever be renamed after their original discoverers); currently Azuma’s inequality seems to have a slight edge over Azuma-Hoeffding, though both are acceptable in my view.
See also “Stigler’s law of eponymy“.
3 May, 2010 at 10:58 pm
Anonymous
Dear Professor Tao,
Is the convex function requirement for the function F necessary? (In the Talagrand concentration inequality and also in the Log-Solobev inequality?)
Thanks.
4 May, 2010 at 7:47 am
Terence Tao
The condition for convexity can be relaxed somewhat, but I believe there are counterexamples to show that it cannot be dispensed with entirely. Ledoux’s book will surely have a thorough discussion of this issue.
30 November, 2011 at 8:40 am
Terence Tao
I’m belatedly returning to this question to record a concrete counterexample. On
, consider the function
, where
is the l^1 (or Hamming) norm. This is O(1)-Lipschitz on
, and can thus be extended (in a non-convex manner) to a O(1)-Lipschitz function on
, but concentrates at scale
rather than at scale O(1).
2 December, 2011 at 11:43 am
Nick Cook
Do you want to swap the order of max and min there? I am also not seeing that it is 1-Lipschitz (wrt Euclidean distance). I get
, and all of these are sharp, getting a Lipschitz constant of
, consistent with the scale of concentration.
2 December, 2011 at 11:47 am
Nick Cook
this reply is so nested the last inequality disappeared!
2 December, 2011 at 12:45 pm
Terence Tao
Thanks for the correction! To establish the Lipschitz property, note that F only has a range of
, so it suffices to verify the Lipschitz claim when
, which on the unit cube then gives
. F has a Lipschitz constant of
wrt the
norm, and hence of norm
wrt the Euclidean norm.
2 December, 2011 at 1:24 pm
Nick Cook
Ah right, I see now how my inequalities forget the range of
. Thanks!
11 January, 2012 at 12:49 pm
Nick Cook
Ledoux’s example is also on
. Let
even and let
be the “hereditary” set
(such sets
are extremal in the sense of isoperimetry for the Hamming metric on the discrete cube). Let
be the Euclidean distance from
to
.
is 1-Lipschitz but not convex.
Now let
be the uniform product measure on
and consider the subset
with
to be chosen later. Then for any
and any
,
, so that
. Hence,
for
sufficiently small independent of
, by the Central Limit theorem (or just Stirling’s formula).
28 August, 2010 at 1:58 am
Anonymous
Dear Prof. Tao
I have a question about Gaussian Concentration Inequality for Lipschitz Function.
Let
, where
are iid real Gaussian variables. We have already known
and can prove
is Lipschitz Function with Lipschitz Constant
.
According to the proof of Theorem 8, we can get
,
.
.
,
,
.
then
Optimising in t , we have
Since
we can get
then
Above mentioned are what I can get by Theorem 8. But I have saw a result that one can get

by Gaussian Concentration. Can we get this result in your opinion? If your answer is YES. Can you tell me why? Some hints will be OK.
31 August, 2010 at 12:07 am
Anonymous
I met this question when read one of Donoho’s papers.
It seems that the question which has haunted me nearly a week has not yet been settled. Is it so difficult?
30 August, 2010 at 1:07 am
Anonymous
Dear Prof. Tao
I hope I can have your advice about the question mentioned in above comment. Your experience will be important to a student interested in Concentration of Measure.
13 October, 2010 at 6:51 am
Camilo Chaparro
Prof. Tao. I’m doing a research about matrices and measure theory. Can you send me directions or websites where I can find information about those topics? Thanks a lot.
26 January, 2011 at 5:16 pm
yucao
Dear Prof. Tao,
I cannot fully understand what a “sharp” inequality EXACTLY mean here. I read your post on Google buzz about sharp estimate. And there is another passage on Wikipedia which mentions the mathematical jargon “sharp”. By I get confused when I try to give examples. Like “
is sharp” but “
is not sharp”? What about
and
? In the finite dimension space, one has
where
. But as long as
, one has
. So
is not sharp?
You mentioned in your Google buzz that "These are exact inequalities such as X <= Y without any unspecified constants or error terms (or X <= CY with an extremely explicit C, typically involving special constants such as pi or e). " It seems that this is only a issue about the constant C. (Even when one has X<=Z<=Y, X<=Y can still be sharp?)
27 January, 2011 at 1:50 am
Terence Tao
There are varying degrees of sharpness. An inequality
is usually considered (exactly) sharp if
can be equal to
in some non-trivial cases. (The more non-trivial cases for which equality is nearly attained, the sharper the inequality becomes; thus equalities
are the sharpest inequalities of all.) A weaker notion of sharpness than exact sharpness is sharpness up to constants:
is sharp up to constants if one has
in non-trivial cases. One can also have sharpness up to logarithms, sharpness up to epsilon factors in the exponents, etc.
One can also consider sharpness in various exponent parameters; if an estimate holds for all exponents p in a certain range, and that range is best possible in that the estimate breaks down outside that range, then the endpoints of that range are considered sharp exponents for the inequality.
29 January, 2011 at 6:01 pm
Pan
Dear Professor:
I have a question about Sobolev Spaces. Suppose there exists a sequence of functions u_n is bounded in W^{2,\infty}. Then we know, by compactness, there exists a subsequence u_k which weakly converges to a function u in W^{2,\infty}. But is that possible that the 1st derivative of u_k converges weakly to the 1st derivative of u? As well as the 2nd derivative of u_k?
Thank you very much professor!
11 February, 2011 at 12:31 pm
Meng
Dear Prof. Tao,
I have a question about the decay rate of the sum of i.i.d. random variables.
If x_i are i.i.d. N(0,1), S_n =\sum_i |x_i|. What is the sharpest bound of
P(|S_n-nE[|x_i|]|>cn) for ANY c>0?
Could I apply Proposition 6? It requires |x_i| to be subgaussian, and c sufficiently large though. Can I use it here?
Thank you!
13 April, 2011 at 1:37 am
Xiaodong
Dear Prof. Tao,
I have a question about Theorem 8 and 9. Do we have a similar result for the concentration of the 1-Lipschitz function when the probability space is a sub-Gaussian product measure.
26 October, 2011 at 7:59 am
Anthony Quas
Hi Terry,
In the introductory section, I guess you mean one is *within* $O(\log^{\frac12}n)$ standard deviations with high probability and *within* $O(\log^{\frac12+\epsilon}n)$ standard deviations with overwhelming probability.
26 October, 2011 at 8:10 am
Terence Tao
Yes, this is correct. With my conventions for O() notation, a quantity which is
has an upper bound by a constant times X, but does not necessarily have a lower bound. (I occasionally use
for a quantity which has both an upper and lower bound, though nowadays I prefer to avoid using such notation as it is not as widely understood as the O() notation.)
27 October, 2014 at 4:26 am
Martin Scorsese
Concerning your comment from the 26 October, 2011 at 8:10 am: The way you defined asymptotic notation in the first Definition from “Notes 0”, is contradictory to your comment, I believe: Here you define “$Z$ is $O(X)$” if $Z\leq CX$, but in the first definition a lower bound is also provided: $|Z|\leq CX$.
27 October, 2014 at 9:09 am
Terence Tao
In my comment, I was referring to a lower bound of the form
(or
, if Z is signed) for some strictly positive constant c (such a lower bound is assumed in some variants of the O() notation that one occasionally sees in the literature).
30 December, 2011 at 11:31 am
Raj
Dear Prof. Tao,
Consider the function
$f(z) = 1/( u^{H} (zI – X_n)^{-2} u)
where u is a n x 1 unit norm vector with uniform distribution on the sphere and X is an independent random diagonal matrix with real eigenvalues \lambda_{1}, \lambda_{n}. Assume that the spectral measure on the eigenvalues on X_n converges almost surely as n-> to a non-random measure \mu_X that is compactly supported on [a,b].
For arbitrary z \in [a,b],, it seems that f(z) – a.s. -> 0 because the numerator of u^{H} (zI – X_n)^{-2} u is O(1/n) while the denominator is O(1/n^2).
What would be the easiest way to make the above argument rigorous? Can one do so without requiring any level repulsion type assumption on the eigenvalues of X_n?
I’ve enjoyed reading your notes and the elegance of your presentations.
Thanks,
Raj
10 February, 2012 at 9:07 am
Michael
Dear Prof. Tao,
I enjoyed your excellent post.
However, I did not find anything about the Khinchine-Kahane or Khinchine inequalities, which I am also interested in. Could you please point me to your posts or recommend a good book?
Thank you very much.
4 May, 2012 at 12:22 pm
Nick Cook
I think the line under equation (24) should have conditioning on
:
. It is to this conditional expectation we apply Cauchy-Schwarz. Then we can take expectations and need an
on the left hand side of the next line:
.
[Corrected, thanks – T.]
3 July, 2012 at 1:02 am
Ahriman
Dear Prof. Tao,
I don’t understand fully your proof of the McDiarmid’s inequality.
fluctuates by
for $X_1, \ldots, X_n$ fixed, and then, how do you apply (10), which says nothing about conditional expectations?
How do you show that
Maybe I’m not familiar enough with conditional expectations …
Thank you very much,
Ahriman.
3 July, 2012 at 7:47 am
Terence Tao
If
are fixed, then
is deterministic (i.e. it is just a scalar constant, rather than a random variable), and by hypothesis
only fluctuates by
, so
does also. One then applies (10) to the random variable Y (which is a random variable in the conditioned probability space arising after fixing
).
4 July, 2012 at 8:29 am
Ahriman
I think the problem comes from the fact that I don’t understand what “
are fixed” means.
satisfies the same hypotheses as
.
I also don’t see why
4 July, 2012 at 8:50 am
Terence Tao
Conditional expectation is discussed in Section 4 of Notes 0. If one is unfamiliar with conditional expectation, it is probably best to begin with the discrete case, when all the X_i are discrete random variables. In this case, fixing the X_i means selecting deterministic elements
of
and working with conditional expectations such as
, which (by the independence) is also equal to
; the random variable Y then has the distribution of
for each fixed choice of values
of
. From the triangle inequality for expectations we see that fluctuation properties of F are inherited by the average
; for instance, if
for all
, then by the triangle inequality
and thus
, so that
only fluctuates by at most
in the
variable.
If you are more familiar with measure theory than with probability theory, it can be an instructive exercise to write out the argument in measure-theoretic terms, say in the simple case when n=2 and the variables X_1, X_2 are just Bernoulli distributions on {0,1} with probability 1/2 each, avoiding all use of probabilistic language. While this is ultimately not the most efficient way to think about such arguments, it can serve as a stepping stone towards a more probabilistic way of thinking.
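As a concrete companion to the suggested exercise, here is a minimal numerical sketch (my own, not from the notes) of the n = 2 Bernoulli case, with a hypothetical test function F: it checks that if F fluctuates by at most c_1 in its first argument, then so does the averaged function G(x_1) = E[F(x_1, X_2)], by the triangle inequality.

```python
# Minimal sketch of the n = 2 Bernoulli case (my own illustration, not from
# the notes): X_1, X_2 are independent Bernoulli(1/2) variables on {0,1}.

def F(x1, x2):
    # an arbitrary bounded test function (hypothetical choice)
    return x1 * (1 + x2) - 0.5 * x2

# fluctuation of F in the x_1 variable, uniformly over x_2
c1 = max(abs(F(0, x2) - F(1, x2)) for x2 in (0, 1))

def G(x1):
    # conditional expectation over X_2 ~ Bernoulli(1/2)
    return 0.5 * F(x1, 0) + 0.5 * F(x1, 1)

fluct_G = abs(G(0) - G(1))
print(f"fluctuation of F in x_1: {c1}, fluctuation of G: {fluct_G}")
assert fluct_G <= c1 + 1e-12   # |G(0)-G(1)| <= E|F(0,X_2)-F(1,X_2)| <= c_1
```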
4 July, 2012 at 11:16 am
Ahriman
Thank you very much for your answers, and for the time you spent writing them. The case of a finite probability space is clear to me.
Your definition of conditional expectation is not the same as the one I know; I have to think it over. In particular, I have always tried to keep away from these notions of disintegration.
Thank you again.
29 August, 2012 at 7:15 pm
Anonymous
Dear Professor Tao,
I see most of the results here are about independent random variables. What is your opinion of, and do you know of any literature on, concentration for random variables which are not independent? Thank you very much!
29 August, 2012 at 9:32 pm
Nick Cook
I can offer some buzzwords! For measures on manifolds other than product spaces, the logarithmic Sobolev inequality machinery can be employed – check out Ledoux’s book. “Dumber” methods like the spectral gap / Poincaré inequality may be sufficient at times. See also Sourav Chatterjee’s PhD thesis on using exchangeable pairs to get concentration inequalities, with applications to e.g. spin systems. Ledoux’s book also seems to cover transportation cost inequalities.
Sometimes there is a trick to express the variables in terms of independent variables. For example, the coordinates $x_1,\ldots,x_n$ of a uniform random point on the simplex $\{x_i \geq 0 : x_1 + \ldots + x_n = 1\}$ (with the dependence relation $x_1 + \ldots + x_n = 1$) can be expressed in terms of independent rate $n$ exponential random variables $E_1,\ldots,E_n$ via $x_i = E_i/(E_1 + \ldots + E_n)$. (That is, the individual coordinates $x_i$ have a beta distribution.)
It really depends on the flavor of the dependence – dependent random variables seem to be like nonlinear PDE in that there can’t be any general methodology.
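A quick numerical sketch of the simplex representation mentioned above (my own illustration, assuming NumPy; the rate of the exponentials is immaterial after normalisation):

```python
import numpy as np

# Sketch of the representation x_i = E_i / (E_1 + ... + E_n) for a uniform
# random point on the simplex, with E_i independent exponentials.

rng = np.random.default_rng(0)
n, trials = 50, 10_000

E = rng.exponential(scale=1.0, size=(trials, n))   # independent exponentials
X = E / E.sum(axis=1, keepdims=True)               # uniform points on the simplex

# Sanity checks: coordinates sum to 1, and each coordinate has mean ~ 1/n,
# even though the x_i themselves are (mildly) dependent.
print(np.allclose(X.sum(axis=1), 1.0))
print(X[:, 0].mean(), 1.0 / n)
```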
14 October, 2012 at 5:44 pm
The Chowla conjecture and the Sarnak conjecture « What’s new
[…] Finally, this exponentially high concentration can be achieved by the moment method, using a slight variant of the moment method proof of the large deviation estimates such as the Chernoff inequality or Hoeffding inequality (as discussed in this blog post). […]
15 October, 2012 at 10:48 pm
Hoeffding bound | blayz
[…] sections of the book Topics in Random Matrix Theory by Terry Tao, with his draft and relevant notes available online, and the Hoeffding’s original paper along with the paper by Serfling. I […]
1 December, 2012 at 10:54 am
Jon
Dear Prof. Tao,
I am a bit unsure of what Exercise 7 is asking – is it asking to prove the result only for sufficiently large A (as in Proposition 6), or for all A > 0 (as in Exercise 6)?
Thanks.
[The former; one could consider combining both generalisations, but this would be a separate problem. -T.]
18 December, 2012 at 9:23 pm
The Rosenblog » Azuma’s Inequality and Concentration
[…] I just posted an essay which gives some applications of Azuma’s inequality to combinatorics and theoretical computer science, which is available here. Azuma’s inequality is an example of the concentration of measure phenomenon, which has rich applications in combinatorics, probability theory and Banach space theory. This book gives a very thorough survey of the phenomenon, although it approaches the subject from a very geometric/measure theoretic standpoint. A more condensed (and perhaps user-friendly) overview is available on Terence Tao’s blog. […]
5 February, 2013 at 2:47 pm
Some notes on Bakry-Emery theory « What’s new
[…] inequalities can also be used to establish concentration of measure inequalities; see for instance this previous blog post for an instance of […]
13 May, 2013 at 12:00 pm
Chernoff’s bound, the entropy bound, and what lies between them | One and One
[…] For further reading: a post by Terry Tao on concentration of measure. […]
14 November, 2013 at 10:57 pm
Anonymous
Are the summands in the last application really independent, or merely uncorrelated?
[Corrected, thanks – T.]
24 October, 2014 at 12:26 am
Anonymous
Do you want the variance to be “at least 1” or “at most 1”? The current formulation seems inconsistent with the note right after this line.
[“At most”. Note clarified to remove the confusion. -T.]
20 September, 2015 at 10:11 am
Entropy and rare events | What's new
[…] inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make […]
21 September, 2015 at 1:22 pm
Jiasen Yang
Dear Professor Tao,
Thank you for the detailed post! I’ve been going through the proof of McDiarmid’s inequality, and I don’t see why it is necessary to assume that the
‘s are independent. I wonder if you could point out which step(s) I missed?
Thanks very much!
[One needs the independence to ensure that (say)
continues to have mean zero even after conditioning on
. -T. Note also that the theorem fails quite badly if for instance one has the very strong coupling
. -T.]
21 September, 2015 at 7:23 pm
Jiasen Yang
Dear Professor Tao,
Thank you for your timely reply! Your coupling argument certainly makes sense, but I’m still having trouble determining which step of the proof uses independence directly. Following your point, I agree that $E[X_n] \neq E[X_n|X_1,\ldots,X_{n-1}]$ in general, but I don’t see where $X_n$ is assumed to have mean zero?
Actually, my question originally arose as I was reading a paper which claims to use McDiarmid’s inequality for dependent $X_i$’s after replacing the assumption
$$ |F(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_n) - F(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n)| \leq c_i $$
by
$$ |\mathbf{E}[f(X)|x_1,\ldots, x_{i-1},x_i] - \mathbf{E}[f(X)|x_1,\ldots, x_{i-1},x_i']| \le c_i, $$
but I don’t see why this new assumption resolves the issue.
Thank you again, and please forgive me for my stubbornness!
21 September, 2015 at 7:39 pm
Terence Tao
Sorry, my previous reply was quite incorrect; I was thinking of a different concentration inequality. For McDiarmid, the claim that the conditioned function
obeys the same hypotheses as the original function
requires the independence hypothesis, as one will find if one expands out the proof of this claim.
21 September, 2015 at 8:08 pm
Jiasen Yang
Ah! I finally see your point. Thank you Professor!
23 October, 2015 at 8:06 am
275A, Notes 3: The weak and strong law of large numbers | What's new
[…] theory and high dimensional geometry. We will not discuss these topics much in this course, but see this previous blog post for some further […]
6 November, 2015 at 4:19 pm
Mahmood
Dear Professor Tao,
In the proof of Lemma 11 and the definition of set
. What happens if such
is equal to
? Is it still valid to say that
. For example, when the set
has the points whose
coordinates are all equal to
, then
does not have any element with the last coordinate 1. Am I missing something?
Thanks.
6 November, 2015 at 9:46 pm
Terence Tao
Oops, you’re right, the
case has to be handled separately, but the bound is better in that case (one can stay in the
slice). I’ve updated the argument accordingly.
7 November, 2015 at 8:51 am
mazrouei
Thanks for your reply.
19 November, 2015 at 2:51 pm
275A, Notes 5: Variants of the central limit theorem | What's new
[…] in which the underlying random variable is not bounded, but enjoys good moment bounds. See this previous blog post for these inequalities and some further discussion. Last, but certainly not least, there is an […]
17 February, 2016 at 6:59 pm
Anonymous
Hello Professor Tao,
In the proof of the SLLN, I am not sure what you mean by ‘countable additivity’ in building the connection between the convergence of the two sequences. Does it mean adding O(\varepsilon)?
Thanks!
Jack
17 February, 2016 at 7:22 pm
Terence Tao
If
is almost surely convergent to
, then almost surely it is the case that
fluctuates by at most
around
(in the sense that the limit superior and limit inferior are both almost surely
. Setting
(say) for
and using countable additivity (which implies that the countable intersection of almost sure events is still almost sure), we conclude that the limit superior and limit inferior are both almost surely
, giving the SLLN.
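To spell out the countable additivity step (my own paraphrase, with $Y_n$ standing for the almost surely convergent quantity and $\mu$ for its limit):
$$ \mathbf{P}\Big(\bigcap_{m=1}^\infty \Big\{ \limsup_{n \to \infty} |Y_n - \mu| \le \tfrac{1}{m} \Big\}\Big) \;\ge\; 1 - \sum_{m=1}^\infty \mathbf{P}\Big(\limsup_{n \to \infty} |Y_n - \mu| > \tfrac{1}{m}\Big) \;=\; 1, $$
and on the intersection event one has $Y_n \to \mu$, which gives the almost sure convergence.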
17 February, 2016 at 8:33 pm
Anonymous
Ah, I see. Thanks a lot for your reply!
Best,
Jack
9 April, 2016 at 7:44 am
Anonymous
Dear Prof. Tao, I believe your notes are just impressive.
What can we say about a lower bound $\Pr[|S_n| \geq \lambda] \geq \,?$, depending on the moments of the distribution of the jointly independent, zero mean variables $X_i$?
11 April, 2016 at 8:50 am
Terence Tao
This is the realm of large deviation theory; these bounds tend to depend in various subtle ways on the precise distribution of the independent variables. See for instance this paper of Nagaev for a classical survey of results; probably there are more up to date surveys also.
14 April, 2016 at 10:39 am
gninrepoli
I think that probability has a recursive nature, so this phenomenon is possible. Probability is just part of the recursion. Any probabilistic model is needed to approximate a recursion which we can control. For example, the set of prime numbers is recursive. But how to prove it?
25 April, 2016 at 3:24 pm
Nathan
Dear Prof. Tao,
First I would like to thank you for your notes, which are quite impressive!
I do not understand why each summand has a variance of
in the computation of the variance of
(corollary 13). Since
and a term in the computation has the form
, shouldn’t we have an
?
Thanks!
26 April, 2016 at 8:34 am
Terence Tao
26 April, 2016 at 11:27 am
Nathan
Thank you!
2 March, 2017 at 12:17 am
keej
I was also stuck on this point; thanks for clarifying!
5 June, 2016 at 4:49 am
Vaibhav
Hi Prof. Terence Tao,
I have a very simple question. In (McDiarmid’s inequality), it is not clear how it depends upon N, the number of random variables concerned. Intuitively, and as you mention elsewhere, the inequality should become stronger; that is, as the number of random variables increases, a deviation of the function from its mean should become less likely. Could you kindly clarify?
Quoting you:
“The basic intuition here is that it is difficult for a large number of independent variables {X_1,\ldots,X_n} to “work together” to simultaneously pull a sum {X_1+\ldots+X_n} or a more general combination {F(X_1,\ldots,X_n)} too far away from its mean. Independence here is the key; concentration of measure results typically fail if the {X_i} are too highly correlated with each other.”
I fail to see this effect.
6 June, 2016 at 1:53 pm
Terence Tao
Let’s say we are working in the normalisation $c_1 = \ldots = c_n = 1$. The hypothesis of Theorem 7 allows $F$ to range over an interval of size $n$; but McDiarmid’s inequality shows that concentration instead occurs in a shorter interval, at the scale of $O(\sqrt{n})$.
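A quick numerical illustration of these two scales (my own sketch, not from the post): take $F(X_1,\ldots,X_n) = X_1 + \ldots + X_n$ with independent $X_i$ uniform on $\{0,1\}$, so that $c_1 = \ldots = c_n = 1$. The hypotheses only confine $F$ to an interval of length $n$, yet the observed fluctuation around the mean $n/2$ is of size $\sqrt{n}$, as McDiarmid's inequality predicts.

```python
import numpy as np

# Compare the a priori range of F (length n) with its typical fluctuation
# (size sqrt(n)) for F = X_1 + ... + X_n, X_i uniform on {0,1}.

rng = np.random.default_rng(1)
n, trials = 10_000, 2_000

X = rng.integers(0, 2, size=(trials, n)).astype(float)
F = X.sum(axis=1)

print("length of the a priori range of F :", n)
print("observed std of F                 :", F.std())
print("for comparison, sqrt(n)/2         =", np.sqrt(n) / 2)
```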
6 June, 2016 at 6:19 pm
Vaibhav
Dear Prof. Tao, Thanks a lot. This helps most certainly.
6 June, 2016 at 6:51 pm
Vaibhav
Dear Prof. Tao,
I had the following question about Levy’s Lemma (another celebrated concentration of measure result).
First, Levy’s Lemma:
Let $f : S^k \rightarrow \mathbb{R}$ be a function with Lipschitz constant $\eta$ (with respect to the Euclidean norm), and let a point $X \in S^{k}$ be chosen uniformly at random. Then
$$ \Pr\{|f(X) - \bar{f}| > \alpha\} \le \exp(-C(k+1)\alpha^{2}/\eta^2) $$
for some constant $C > 0$, where $\bar{f}$ is the mean value of the function over the sphere.
Now, here is my question: This Lemma seems to hold for a point chosen
uniformly at random over a sphere.
My question is: suppose I choose a point $(X_1, X_2, \ldots, X_N)$ on an $N$-dimensional sphere such that $(X_1^2, X_2^2, \ldots, X_N^2)$ is chosen from a uniform distribution.
Certainly, $(X_1^2, X_2^2, \ldots, X_N^2)$ being chosen from a uniform distribution does not imply that $(X_1, X_2, \ldots, X_N)$ will be a uniformly random point on the sphere.
But intuitively, I would expect Levy’s Lemma (or, say, a weaker version) to hold, as I expect it to hold for almost all points on the sphere.
My computer simulations also show this to be the case.
But I cannot make a mathematical argument to justify my intuition and my simulations.
Best,
vaibhav
PS: If I can show this, I can prove a neat result in evolutionary biology, which I would be happy to share with you if you would like.
Many thanks to you!
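For what it's worth, here is the kind of simulation being described (my own guess at the setup, not the commenter's code): the squared coordinates are drawn uniformly from the probability simplex, random signs are attached, and a fixed 1-Lipschitz linear functional is used as the test function.

```python
import numpy as np

# Draw (X_1^2,...,X_N^2) uniformly from the probability simplex, attach
# independent random signs to land on the sphere S^{N-1}, and check how
# strongly a fixed 1-Lipschitz function f(x) = <x, v> concentrates.

rng = np.random.default_rng(2)
N, trials = 200, 20_000

sq = rng.dirichlet(np.ones(N), size=trials)          # uniform on the simplex
X = np.sqrt(sq) * rng.choice([-1.0, 1.0], size=(trials, N))

v = np.ones(N) / np.sqrt(N)                          # fixed unit vector
f = X @ v                                            # 1-Lipschitz test function

print("std of f under this (non-uniform) law :", f.std())
print("for the uniform point on S^{N-1}, ~ 1/sqrt(N) =", 1 / np.sqrt(N))
```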
28 June, 2016 at 5:41 am
Ahn
Dear prof. Tao,
While reading the proof of Proposition 6, I could not follow the step at the end where you applied the Chernoff bound. For example, what did you choose for $\lambda$?
I would appreciate your help.
29 June, 2016 at 6:51 am
Terence Tao
There is a lot of latitude here in what to choose for $\lambda$; for instance one can take
for some small
. Note that the bound on the RHS is not optimal, but suffices for the application at hand.
8 November, 2016 at 8:40 am
Fast Randomized SVD – Facebook Research
[…] our approximation by appealing to concentration of measure results for random matrices – see Terry Tao’s lecture notes on the topic for some useful […]
12 February, 2017 at 11:53 am
keej
Could anyone please explain why in the proof of Hoeffding’s lemma, the first Taylor expansion is
and not just the naive
?
12 February, 2017 at 3:39 pm
Anonymous
The simpler estimate
for the remainder term applies only for bounded
.
12 February, 2017 at 5:01 pm
keej
Of course, thank you.
13 May, 2017 at 5:28 pm
Zahra
Hello Prof. Tao,
Are there any known results on the extension of Theorem 8 to the case when
is a vector of i.i.d. sub-gaussian variables? Many thanks.
[Probably. I don’t have references handy, but I would recommend checking Ledoux’s book on the subject. -T.]
22 May, 2017 at 5:55 pm
Quantitative continuity estimates | What's new
[…] is the proof by Maurey and Pisier of the gaussian concentration inequality, given in Theorem 8 of this previous blog post. In a similar vein, if one wishes to compare a scalar random variable of mean zero and variance […]
24 December, 2017 at 1:19 pm
AL Ray
I don’t understand something in the proof of Lemma 10: why is
and not
?
Thanks in advance!
[This is a typo, now fixed – T.]
30 December, 2017 at 10:41 pm
Tim
Dear Prof. Tao,
Thanks for the insightful notes. I have two questions:
1. When calculating the variance of $d(X, V)^2$, a term in the computation will be $E[X_i^2 X_j^2]$, which I think should be 1 instead of $O(K^2)$, since $X_i$ and $X_j$ are independent.
2. Are there any more direct ways to calculate / estimate the expectation of d(X, V)?
Thanks in advance!
31 December, 2017 at 10:29 am
Terence Tao
It’s true one can improve the bound on the variance of
, but unfortunately this does not significantly improve the concentration bound because of the uncertainty of
coming from the use of Theorem 9.
I don’t know of much sharper ways to control the expectation, but there are several routes to the concentration property at least, starting with the classic work of Hanson and Wright.
12 November, 2019 at 6:47 pm
254A, Notes 9 – second moment and entropy methods | What's new
[…] random variable will concentrate around its mean if its variance is not too large. See these previous notes for more discussion of the concentration of measure phenomenon. One can often obtain stronger […]
27 June, 2020 at 8:43 pm
Yijia Liu
I have been trying to work out the computations for the final part of Proposition 6, but I couldn’t get the $2^{-m}$ factor.
From https://math.stackexchange.com/questions/3734229/chernoff-bound-for-sum-of-sub-gaussian-variables-via-truncation-method it seems that we cannot replace $1/100(m+1)^2$ with $2^{-m-1}$. Is this a typo, or have I missed something in my calculations?
[Corrected, thanks – T.]