You are currently browsing the tag archive for the ‘large deviation inequality’ tag.

I have uploaded to the arXiv my paper “Exploring the toolkit of Jean Bourgain“. This is one of a collection of papers to be published in the Bulletin of the American Mathematical Society describing aspects of the work of Jean Bourgain; other contributors to this collection include Keith Ball, Ciprian Demeter, and Carlos Kenig. Because the other contributors will be covering specific areas of Jean’s work in some detail, I decided to take a non-overlapping tack, and focus instead on some basic tools of Jean that he frequently used across many of the fields he contributed to. Jean had a surprising number of these “basic tools” that he wielded with great dexterity, and in this paper I focus on just a few of them:

• Reducing qualitative analysis results (e.g., convergence theorems or dimension bounds) to quantitative analysis estimates (e.g., variational inequalities or maximal function estimates).
• Using dyadic pigeonholing to locate good scales to work in or to apply truncations.
• Using random translations to amplify small sets (low density) into large sets (positive density).
• Combining large deviation inequalities with metric entropy bounds to control suprema of various random processes.

Each of these techniques is individually not too difficult to explain, and were certainly employed on occasion by various mathematicians prior to Bourgain’s work; but Jean had internalized them to the point where he would instinctively use them as soon as they became relevant to a given problem at hand. I illustrate this at the end of the paper with an exposition of one particular result of Jean, on the Erdős similarity problem, in which his main result (that any sum ${S = S_1+S_2+S_3}$ of three infinite sets of reals has the property that there exists a positive measure set ${E}$ that does not contain any homothetic copy ${x+tS}$ of ${S}$) is basically proven by a sequential application of these tools (except for dyadic pigeonholing, which turns out not to be needed here).

I had initially intended to also cover some other basic tools in Jean’s toolkit, such as the uncertainty principle and the use of probabilistic decoupling, but was having trouble keeping the paper coherent with such a broad focus (certainly I could not identify a single paper of Jean’s that employed all of these tools at once). I hope though that the examples given in the paper gives some reasonable impression of Jean’s research style.

In the previous set of notes we established the central limit theorem, which we formulate here as follows:

Theorem 1 (Central limit theorem) Let ${X_1,X_2,X_3,\dots}$ be iid copies of a real random variable ${X}$ of mean ${\mu}$ and variance ${0 < \sigma^2 < \infty}$, and write ${S_n := X_1 + \dots + X_n}$. Then, for any fixed ${a < b}$, we have

$\displaystyle {\bf P}( a \leq \frac{S_n - n \mu}{\sqrt{n} \sigma} \leq b ) \rightarrow \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\ dt \ \ \ \ \ (1)$

as ${n \rightarrow \infty}$.

This is however not the end of the matter; there are many variants, refinements, and generalisations of the central limit theorem, and the purpose of this set of notes is to present a small sample of these variants.

First of all, the above theorem does not quantify the rate of convergence in (1). We have already addressed this issue to some extent with the Berry-Esséen theorem, which roughly speaking gives a convergence rate of ${O(1/\sqrt{n})}$ uniformly in ${a,b}$ if we assume that ${X}$ has finite third moment. However there are still some quantitative versions of (1) which are not addressed by the Berry-Esséen theorem. For instance one may be interested in bounding the large deviation probabilities

$\displaystyle {\bf P}( |\frac{S_n - n \mu}{\sqrt{n} \sigma}| \geq \lambda ) \ \ \ \ \ (2)$

in the setting where ${\lambda}$ grows with ${n}$. Chebyshev’s inequality gives an upper bound of ${1/\lambda^2}$ for this quantity, but one can often do much better than this in practice. For instance, the central limit theorem (1) suggests that this probability should be bounded by something like ${O( e^{-\lambda^2/2})}$; however, this theorem only kicks in when ${n}$ is very large compared with ${\lambda}$. For instance, if one uses the Berry-Esséen theorem, one would need ${n}$ as large as ${e^{\lambda^2}}$ or so to reach the desired bound of ${O( e^{-\lambda^2/2})}$, even under the assumption of finite third moment. Basically, the issue is that convergence-in-distribution results, such as the central limit theorem, only really control the typical behaviour of statistics in ${\frac{S_n-n \mu}{\sqrt{n} \sigma}}$; they are much less effective at controlling the very rare outlier events in which the statistic strays far from its typical behaviour. Fortunately, there are large deviation inequalities (or concentration of measure inequalities) that do provide exponential type bounds for quantities such as (2), which are valid for both small and large values of ${n}$. A basic example of this is the Chernoff bound that made an appearance in Exercise 47 of Notes 4; here we give some further basic inequalities of this type, including versions of the Bennett and Hoeffding inequalities.

In the other direction, we can also look at the fine scale behaviour of the sums ${S_n}$ by trying to control probabilities such as

$\displaystyle {\bf P}( a \leq S_n \leq a+h ) \ \ \ \ \ (3)$

where ${h}$ is now bounded (but ${a}$ can grow with ${n}$). The central limit theorem predicts that this quantity should be roughly ${\frac{h}{\sqrt{2\pi n} \sigma} e^{-(a-n\mu)^2 / 2n \sigma^2}}$, but even if one is able to invoke the Berry-Esséen theorem, one cannot quite see this main term because it is dominated by the error term ${O(1/n^{1/2})}$ in Berry-Esséen. There is good reason for this: if for instance ${X}$ takes integer values, then ${S_n}$ also takes integer values, and ${{\bf P}( a \leq S_n \leq a+h )}$ can vanish when ${h}$ is less than ${1}$ and ${a}$ is slightly larger than an integer. However, this turns out to essentially be the only obstruction; if ${X}$ does not lie in a lattice such as ${{\bf Z}}$, then we can establish a local limit theorem controlling (3), and when ${X}$ does take values in a lattice like ${{\bf Z}}$, there is a discrete local limit theorem that controls probabilities such as ${{\bf P}(S_n = m)}$. Both of these limit theorems will be proven by the Fourier-analytic method used in the previous set of notes.

We also discuss other limit theorems in which the limiting distribution is something other than the normal distribution. Perhaps the most common example of these theorems is the Poisson limit theorems, in which one sums a large number of indicator variables (or approximate indicator variables), each of which is rarely non-zero, but which collectively add up to a random variable of medium-sized mean. In this case, it turns out that the limiting distribution should be a Poisson random variable; this again is an easy application of the Fourier method. Finally, we briefly discuss limit theorems for other stable laws than the normal distribution, which are suitable for summing random variables of infinite variance, such as the Cauchy distribution.

Finally, we mention a very important class of generalisations to the CLT (and to the variants of the CLT discussed in this post), in which the hypothesis of joint independence between the variables ${X_1,\dots,X_n}$ is relaxed, for instance one could assume only that the ${X_1,\dots,X_n}$ form a martingale. Many (though not all) of the proofs of the CLT extend to these more general settings, and this turns out to be important for many applications in which one does not expect joint independence. However, we will not discuss these generalisations in this course, as they are better suited for subsequent courses in this series when the theory of martingales, conditional expectation, and related tools are developed.

In the theory of discrete random matrices (e.g. matrices whose entries are random signs ${\pm 1}$), one often encounters the problem of understanding the distribution of the random variable ${\hbox{dist}(X,V)}$, where ${X = (x_1,\ldots,x_n) \in \{-1,+1\}^n}$ is an ${n}$-dimensional random sign vector (so ${X}$ is uniformly distributed in the discrete cube ${\{-1,+1\}^n}$), and ${V}$ is some ${d}$-dimensional subspace of ${{\bf R}^n}$ for some ${0 \leq d \leq n}$.

It is not hard to compute the second moment of this random variable. Indeed, if ${P = (p_{ij})_{1 \leq i,j \leq n}}$ denotes the orthogonal projection matrix from ${{\bf R}^n}$ to the orthogonal complement ${V^\perp}$ of ${V}$, then one observes that

$\displaystyle \hbox{dist}(X,V)^2 = X \cdot P X = \sum_{i=1}^n \sum_{j=1}^n x_i x_j p_{ij}$

and so upon taking expectations we see that

$\displaystyle {\Bbb E} \hbox{dist}(X,V)^2 = \sum_{i=1}^n p_{ii} = \hbox{tr} P = n-d \ \ \ \ \ (1)$

since ${P}$ is a rank ${n-d}$ orthogonal projection. So we expect ${\hbox{dist}(X,V)}$ to be about ${\sqrt{n-d}}$ on the average.

In fact, one has sharp concentration around this value, in the sense that ${\hbox{dist}(X,V) = \sqrt{n-d}+O(1)}$ with high probability. More precisely, we have

Proposition 1 (Large deviation inequality) For any ${t>0}$, one has

$\displaystyle {\Bbb P}( |\hbox{dist}(X,V) - \sqrt{n-d}| \geq t ) \leq C \exp(- c t^2 )$

for some absolute constants ${C, c > 0}$.

In fact the constants ${C, c}$ are very civilised; for large ${t}$ one can basically take ${C=4}$ and ${c=1/16}$, for instance. This type of concentration, particularly for subspaces ${V}$ of moderately large codimension ${n-d}$, is fundamental to much of my work on random matrices with Van Vu, starting with our first paper (in which this proposition first appears). (For subspaces of small codimension (such as hyperplanes) one has to use other tools to get good results, such as inverse Littlewood-Offord theory or the Berry-Esséen central limit theorem, but that is another story.)

Proposition 1 is an easy consequence of the second moment computation and Talagrand’s inequality, which among other things provides a sharp concentration result for convex Lipschitz functions on the cube ${\{-1,+1\}^n}$; since ${\hbox{dist}(x,V)}$ is indeed a convex Lipschitz function, this inequality can be applied immediately. The proof of Talagrand’s inequality is short and can be found in several textbooks (e.g. Alon and Spencer), but I thought I would reproduce the argument here (specialised to the convex case), mostly to force myself to learn the proof properly. Note the concentration of ${O(1)}$ obtained by Talagrand’s inequality is much stronger than what one would get from more elementary tools such as Azuma’s inequality or McDiarmid’s inequality, which would only give concentration of about ${O(\sqrt{n})}$ or so (which is in fact trivial, since the cube ${\{-1,+1\}^n}$ has diameter ${2\sqrt{n}}$); the point is that Talagrand’s inequality is very effective at exploiting the convexity of the problem, as well as the Lipschitz nature of the function in all directions, whereas Azuma’s inequality can only easily take advantage of the Lipschitz nature of the function in coordinate directions. On the other hand, Azuma’s inequality works just as well if the ${\ell^2}$ metric is replaced with the larger ${\ell^1}$ metric, and one can conclude that the ${\ell^1}$ distance between ${X}$ and ${V}$ concentrates around its median to a width ${O(\sqrt{n})}$, which is a more non-trivial fact than the ${\ell^2}$ concentration bound given by that inequality. (The computation of the median of the ${\ell^1}$ distance is more complicated than for the ${\ell^2}$ distance, though, and depends on the orientation of ${V}$.)

Remark 1 If one makes the coordinates of ${X}$ iid Gaussian variables ${x_i \equiv N(0,1)}$ rather than random signs, then Proposition 1 is much easier to prove; the probability distribution of a Gaussian vector is rotation-invariant, so one can rotate ${V}$ to be, say, ${{\bf R}^d}$, at which point ${\hbox{dist}(X,V)^2}$ is clearly the sum of ${n-d}$ independent squares of Gaussians (i.e. a chi-square distribution), and the claim follows from direct computation (or one can use the Chernoff inequality). The gaussian counterpart of Talagrand’s inequality is more classical, being essentially due to Lévy, and will also be discussed later in this post.