Entropy and rare events

20 September, 2015 in expository, math.IT, math.NT | Tags: concentration of measure, entropy | by Terence Tao

Let ${X}$ and ${Y}$ be two random variables taking values in the same (discrete) range ${R}$ , and let ${E}$ be some subset of ${R}$ , which we think of as the set of “bad” outcomes for either ${X}$ or ${Y}$ . If ${X}$ and ${Y}$ have the same probability distribution, then clearly

$\displaystyle {\bf P}( X \in E ) = {\bf P}( Y \in E ).$

In particular, if it is rare for ${Y}$ to lie in ${E}$ , then it is also rare for ${X}$ to lie in ${E}$ .

If ${X}$ and ${Y}$ do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance ${\delta(X,Y)}$ between two random variables (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

$\displaystyle {\bf P}(Y \in E) - \delta(X,Y) \leq {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \delta(X,Y) \ \ \ \ \ (1)$

for any ${E \subset R}$ . In particular, if it is rare for ${Y}$ to lie in ${E}$ , and ${X,Y}$ are close in total variation, then it is also rare for ${X}$ to lie in ${E}$ .
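As a quick numerical sanity check of (1), here is a minimal Python sketch. The two distributions `p_X` and `p_Y` on a four-element range are hypothetical and chosen only for illustration, and the total variation distance is taken to be half the ${L_1}$ distance between the probability mass functions.

```python
from itertools import combinations

def total_variation(p, q):
    """Total variation distance between two pmfs on the same range (half the L1 distance)."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def prob(p, event):
    """Probability that a variable with pmf p lands in the given event."""
    return sum(p[x] for x in event)

# Hypothetical distributions on the range R = {0, 1, 2, 3} (illustration only).
p_X = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
p_Y = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
delta = total_variation(p_X, p_Y)

# Enumerate every subset E of R and record the gap |P(X in E) - P(Y in E)|.
R = sorted(p_X)
gaps = [abs(prob(p_X, E) - prob(p_Y, E))
        for r in range(len(R) + 1)
        for E in combinations(R, r)]

print(f"delta(X,Y) = {delta:.3f}, largest gap over all E = {max(gaps):.3f}")
```

In this example no subset ${E}$ produces a gap larger than ${\delta(X,Y)}$, and the largest gap matches ${\delta(X,Y)}$, which is consistent with (1) being the best bound of this shape.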

A basic inequality in information theory is Pinsker’s inequality

$\displaystyle \delta(X,Y) \leq \sqrt{\frac{1}{2} D_{KL}(X||Y)}$

where the Kullback-Leibler divergence ${D_{KL}(X||Y)}$ is defined by the formula

$\displaystyle D_{KL}(X||Y) = \sum_{x \in R} {\bf P}( X=x ) \log \frac{{\bf P}(X=x)}{{\bf P}(Y=x)}.$

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that ${D_{KL}(X||Y)}$ is non-negative (Gibbs’ inequality), and vanishes if and only if ${X}$ , ${Y}$ have the same distribution; thus one can think of ${D_{KL}(X||Y)}$ as a measure of how close the distributions of ${X}$ and ${Y}$ are to each other, although one should caution that this is not a symmetric notion of distance, as ${D_{KL}(X||Y) \neq D_{KL}(Y||X)}$ in general. Inserting Pinsker’s inequality into (1), we see for instance that

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(Y \in E) + \sqrt{\frac{1}{2} D_{KL}(X||Y)}.$

Thus, if ${X}$ is close to ${Y}$ in the Kullback-Leibler sense, and it is rare for ${Y}$ to lie in ${E}$ , then it is rare for ${X}$ to lie in ${E}$ as well.
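To see Pinsker's inequality and the asymmetry of the Kullback-Leibler divergence on a concrete case, one can run a small sketch like the following. The distributions `p_X` and `p_Y` are made up for illustration, and natural logarithms are used throughout.

```python
import math

def kl_divergence(p, q):
    """D_KL(p||q) in nats; outcomes with p[x] == 0 contribute nothing."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def total_variation(p, q):
    """Half the L1 distance between two pmfs on the same range."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

# Hypothetical distributions on a three-element range (illustration only).
p_X = {"a": 0.5, "b": 0.3, "c": 0.2}
p_Y = {"a": 0.1, "b": 0.1, "c": 0.8}

tv = total_variation(p_X, p_Y)
kl_xy = kl_divergence(p_X, p_Y)
kl_yx = kl_divergence(p_Y, p_X)

print(f"delta(X,Y)          = {tv:.4f}")
print(f"sqrt(D_KL(X||Y)/2)  = {math.sqrt(0.5 * kl_xy):.4f}")
print(f"sqrt(D_KL(Y||X)/2)  = {math.sqrt(0.5 * kl_yx):.4f}")
```

Both square roots dominate the total variation distance, as Pinsker's inequality predicts, while the two divergences themselves differ.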

We can specialise this inequality to the case when ${Y}$ is a uniform random variable ${U}$ on a finite range ${R}$ of some cardinality ${N}$ , in which case the Kullback-Leibler divergence ${D_{KL}(X||U)}$ simplifies to

$\displaystyle D_{KL}(X||U) = \log N - {\bf H}(X)$

where

$\displaystyle {\bf H}(X) := \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}$

is the Shannon entropy of ${X}$ . Again, a routine application of Jensen’s inequality shows that ${{\bf H}(X) \leq \log N}$ , with equality if and only if ${X}$ is uniformly distributed on ${R}$ . The above inequality then becomes

$\displaystyle {\bf P}(X \in E) \leq {\bf P}(U \in E) + \sqrt{\frac{1}{2}(\log N - {\bf H}(X))}. \ \ \ \ \ (2)$

Thus, if ${E}$ is a small fraction of ${R}$ (so that it is rare for ${U}$ to lie in ${E}$ ), and the entropy of ${X}$ is very close to the maximum possible value of ${\log N}$ , then it is rare for ${X}$ to lie in ${E}$ also.
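The simplification of ${D_{KL}(X||U)}$ used to derive (2) can be checked directly from the definition of the Kullback-Leibler divergence, using ${{\bf P}(U=x) = 1/N}$ for every ${x \in R}$ and the fact that the probabilities ${{\bf P}(X=x)}$ sum to one:

$\displaystyle D_{KL}(X||U) = \sum_{x \in R} {\bf P}( X=x ) \log( N {\bf P}(X=x) )$

$\displaystyle = \log N \sum_{x \in R} {\bf P}(X=x) - \sum_{x \in R} {\bf P}(X=x) \log \frac{1}{{\bf P}(X=x)}$

$\displaystyle = \log N - {\bf H}(X).$

As a purely illustrative instance, if ${{\bf H}(X) = \log N - 0.02}$ , then the square root term in (2) is ${\sqrt{0.01} = 0.1}$ , so that ${{\bf P}(X \in E) \leq {\bf P}(U \in E) + 0.1}$ .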

The inequality (2) is only useful when the entropy ${{\bf H}(X)}$ is close to ${\log N}$ in the sense that ${{\bf H}(X) = \log N - O(1)}$ ; otherwise the bound is worse than the trivial bound of ${{\bf P}(X \in E) \leq 1}$ (indeed, the square root term is already at least ${1}$ once ${\log N - {\bf H}(X) \geq 2}$ ). In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy ${{\bf H}(X)}$ was allowed to be smaller than ${\log N - O(1)}$ . More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:

Lemma 1 (Pinsker-type inequality) Let ${X}$ be a random variable taking values in a finite range ${R}$ of cardinality ${N}$ , let ${U}$ be a uniformly distributed random variable in ${R}$ , and let ${E}$ be a subset of ${R}$ . Then

$\displaystyle {\bf P}(X \in E) \leq \frac{(\log N - {\bf H}(X)) + \log 2}{\log 1/{\bf P}(U \in E)}.$

Proof: Consider the conditional entropy ${{\bf H}(X | 1_{X \in E} )}$ . On the one hand, we have

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf H}(X, 1_{X \in E}) - {\bf H}(1_{X \in E} )$

$\displaystyle = {\bf H}(X) - {\bf H}(1_{X \in E})$

$\displaystyle \geq {\bf H}(X) - \log 2$

by Jensen’s inequality. On the other hand, one has

$\displaystyle {\bf H}(X | 1_{X \in E} ) = {\bf P}(X \in E) {\bf H}(X | X \in E )$

$\displaystyle + (1-{\bf P}(X \in E)) {\bf H}(X | X \not \in E)$

$\displaystyle \leq {\bf P}(X \in E) \log |E| + (1-{\bf P}(X \in E)) \log N$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{N}{|E|}$

$\displaystyle = \log N - {\bf P}(X \in E) \log \frac{1}{{\bf P}(U \in E)},$

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. $\Box$
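To illustrate how Lemma 1 can remain non-trivial in a regime where (2) does not, here is a small Python sketch. The specific choices (a range of size ${N = 2^{16}}$, a set ${E}$ of ${16}$ elements, and ${X}$ uniform on the first ${N/16}$ elements of the range) are assumptions made only for this example, and the helper names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a pmf given as a list of probabilities."""
    return sum(px * math.log(1.0 / px) for px in p if px > 0)

def pinsker_bound(p, E):
    """Right-hand side of (2): P(U in E) + sqrt((log N - H(X)) / 2)."""
    N = len(p)
    deficit = max(0.0, math.log(N) - entropy(p))
    return len(E) / N + math.sqrt(0.5 * deficit)

def lemma1_bound(p, E):
    """Right-hand side of Lemma 1: (log N - H(X) + log 2) / log(1 / P(U in E))."""
    N = len(p)
    return (math.log(N) - entropy(p) + math.log(2)) / math.log(N / len(E))

# Illustrative setup: X is uniform on the first N/16 elements of R = {0, ..., N-1},
# so its entropy deficit log N - H(X) equals 4 log 2, and E = {0, ..., 15}.
N = 2 ** 16
support = N // 16
p_X = [1.0 / support if i < support else 0.0 for i in range(N)]
E = set(range(16))

p_in_E = sum(p_X[i] for i in E)
print(f"P(X in E)        = {p_in_E:.6f}")
print(f"bound from (2)   = {pinsker_bound(p_X, E):.4f}")
print(f"bound from Lemma = {lemma1_bound(p_X, E):.4f}")
```

Here the entropy deficit is ${4 \log 2 > 2}$, so the right-hand side of (2) exceeds ${1}$, whereas the bound from Lemma 1 evaluates to ${5/12}$.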

Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

$\displaystyle {\bf P}(X \in E) \leq \frac{D_{KL}(X||Y) + \log 2}{\log 1/{\bf P}(Y \in E)}$

for arbitrary random variables ${X,Y}$ taking values in the same discrete range ${R}$ , which follows from the data processing inequality

$\displaystyle D_{KL}( f(X)||f(Y)) \leq D_{KL}(X|| Y)$

for arbitrary functions ${f}$ , applied to the indicator function ${f = 1_E}$ . Indeed one has

$\displaystyle D_{KL}( 1_E(X) || 1_E(Y) ) = {\bf P}(X \in E) \log \frac{{\bf P}(X \in E)}{{\bf P}(Y \in E)}$

$\displaystyle + {\bf P}(X \not \in E) \log \frac{{\bf P}(X \not \in E)}{{\bf P}(Y \not \in E)}$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - h( {\bf P}(X \in E) )$

$\displaystyle \geq {\bf P}(X \in E) \log \frac{1}{{\bf P}(Y \in E)} - \log 2$

where ${h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}}$ is the entropy function.
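The chain of inequalities in Remark 2 can be spot-checked numerically. The sketch below draws random pairs of distributions on an eight-element range (an arbitrary choice, with a fixed seed for reproducibility) together with a random nonempty proper subset ${E}$, and records both the data processing inequality for ${f = 1_E}$ and the resulting bound on ${{\bf P}(X \in E)}$. The helper names `random_pmf` and `check_remark2` are hypothetical.

```python
import math
import random

def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats between pmfs given as aligned lists."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def random_pmf(n, rng):
    """A random pmf on n points with strictly positive entries."""
    weights = [rng.random() + 0.01 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def check_remark2(p, q, E):
    """Return (divergence of the indicators, full divergence, P(X in E), bound)."""
    pE = sum(p[i] for i in E)
    qE = sum(q[i] for i in E)
    d_full = kl_divergence(p, q)
    d_pushed = kl_divergence([pE, 1 - pE], [qE, 1 - qE])
    bound = (d_full + math.log(2)) / math.log(1 / qE)
    return d_pushed, d_full, pE, bound

rng = random.Random(0)  # fixed seed so the experiment is reproducible
results = []
for _ in range(1000):
    p, q = random_pmf(8, rng), random_pmf(8, rng)
    E = rng.sample(range(8), rng.randint(1, 7))  # nonempty proper subset of the range
    results.append(check_remark2(p, q, E))

violations = [r for r in results
              if r[0] > r[1] + 1e-12 or r[2] > r[3] + 1e-12]
print(f"trials: {len(results)}, violations: {len(violations)}")
```

A random search of this kind is of course no substitute for the proof; it is only a way to catch a misremembered direction or constant.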

Thus, for instance, if one has

$\displaystyle {\bf H}(X) \geq \log N - o(K)$

and

$\displaystyle {\bf P}(U \in E) \leq \exp( - K )$

for some ${K}$ much larger than ${1}$ (so that ${1/K = o(1)}$ ), then

$\displaystyle {\bf P}(X \in E) = o(1).$

More informally: if the entropy of ${X}$ is somewhat close to the maximum possible value of ${\log N}$ , and it is exponentially rare for a uniform variable to lie in ${E}$ , then it is still somewhat rare for ${X}$ to lie in ${E}$ . The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable ${X}$ which is uniformly distributed inside a small set ${E}$ with some probability ${p}$ and uniformly distributed outside of ${E}$ with probability ${1-p}$ , for some parameter ${0 \leq p \leq 1}$ .
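The sharpness claim can be made concrete with a short computation. For the mixture just described, the chain rule gives ${{\bf H}(X) = h(p) + p \log |E| + (1-p) \log(N - |E|)}$ , where ${h}$ is the entropy function from Remark 2. The following sketch evaluates the right-hand side of Lemma 1 for this ${X}$, working with logarithms so that very large ${N}$ can be handled; the values ${\log N = 1000}$ and ${K = 50}$ (with ${{\bf P}(U \in E) = \exp(-K)}$) are arbitrary illustrative choices.

```python
import math

def h(u):
    """Binary entropy function in nats, with the convention 0 log(1/0) = 0."""
    return sum(t * math.log(1.0 / t) for t in (u, 1.0 - u) if t > 0)

def mixture_bound(p, log_N, K):
    """Lemma 1 bound for X uniform on E w.p. p and uniform off E w.p. 1 - p,
    where |R| = exp(log_N) and P(U in E) = exp(-K)."""
    log_E = log_N - K                                    # log |E|
    log_complement = log_N + math.log1p(-math.exp(-K))   # log (N - |E|)
    entropy = h(p) + p * log_E + (1 - p) * log_complement
    return (log_N - entropy + math.log(2)) / K

# Illustrative parameters (assumptions for this example only).
log_N, K = 1000.0, 50.0
table = [(p, mixture_bound(p, log_N, K)) for p in (0.0, 0.1, 0.5, 0.9, 1.0)]
for p, bound in table:
    print(f"P(X in E) = {p:.1f}   Lemma 1 bound = {bound:.4f}")
```

In each case the bound exceeds the true value ${{\bf P}(X \in E) = p}$ by at most about ${(\log 2)/K}$ , so the lemma loses only an additive ${O(1/K)}$ for this family of examples.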

It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality, but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

$\displaystyle F(U) \approx {\bf E} F(U)$

with exponentially high probability, where ${U}$ is a uniform distribution and ${F}$ is some reasonable function of ${U}$ . Combining this with the above lemma, we can then obtain approximations of the form

$\displaystyle F(X) \approx {\bf E} F(U) \ \ \ \ \ (3)$

with somewhat high probability, if the entropy of ${X}$ is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable ${X}$ did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

$\displaystyle \sum_{j=1}^H \sum_{p \in {\mathcal P}} \lambda(n+j) \lambda(n+j+p) 1_{p|n+j}$

$\displaystyle \approx \sum_{j=1}^H \sum_{p \in {\mathcal P}} \frac{\lambda(n+j) \lambda(n+j+p)}{p}$

for “most” choices of ${n}$ and a suitable choice of ${H}$ (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as ${\sum_{n \leq x} \frac{\lambda(n)\lambda(n+1)}{n}}$ through the multiplicativity of ${\lambda}$ , while the right-hand side, being a linear correlation involving two parameters ${j,p}$ rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope that one could similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.
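To make the combination of Lemma 1 with Hoeffding's inequality described above more concrete, here is a toy sketch that is much simpler than the setting of the paper. It assumes that ${U}$ is uniform on ${\{0,1\}^n}$ (so ${N = 2^n}$), that ${F}$ is the average of the ${n}$ bits, and that ${E}$ is the event ${|F - 1/2| \geq \varepsilon}$ . Hoeffding's inequality gives ${{\bf P}(U \in E) \leq 2 \exp(-2 n \varepsilon^2)}$ , and since the right-hand side of Lemma 1 only increases when ${{\bf P}(U \in E)}$ is replaced by a larger quantity less than one, this upper bound can be inserted directly. The function name and parameters are hypothetical.

```python
import math

def deviation_bound(n, eps, entropy_deficit):
    """Upper bound for P(|F(X) - 1/2| >= eps), where F is the mean of n bits and
    X takes values in {0,1}^n with n log 2 - H(X) <= entropy_deficit (in nats).

    Combines Hoeffding's bound P(U in E) <= 2 exp(-2 n eps^2) with Lemma 1.
    """
    k = 2 * n * eps ** 2 - math.log(2)   # lower bound for log(1 / P(U in E))
    if k <= 0:
        return 1.0                        # Hoeffding gives nothing; fall back to the trivial bound
    return min(1.0, (entropy_deficit + math.log(2)) / k)

# Illustrative numbers: a million bits, deviation 0.01, entropy deficit of 10 nats.
example = deviation_bound(n=10 ** 6, eps=0.01, entropy_deficit=10.0)
print(f"P(|F(X) - 1/2| >= 0.01) <= {example:.4f}")
```

With these illustrative numbers the bound is a little over five percent, even though an entropy deficit of ${10}$ nats is far outside the range in which (2) gives anything non-trivial.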

14 comments


20 September, 2015 at 11:23 am

Anonymous

Dear Prof. Tao,

There is a simple exponential version of Pinsker’s inequality that almost gives what you want, even without the assumption that one of the RV’s is uniform:

$\delta(X,Y) \leq 1 - \frac{1}{2} \exp(-D_{KL}(X||Y))$

(This is proved for instance in Tsybakov’s book “Introduction to Nonparametric Estimation” (eq. 2.25).) Unfortunately this only gives constant bounds, instead of o(1), in your setting.

20 September, 2015 at 11:53 am

Yihong

A side remark on Lemma 1: (2) also follows from the “data processing” inequality in information theory, namely, $D_{KL}(f(X) \|f(Y)) \leq D(X\|Y)$ , for any function $f$ . Taking $f = 1_E$ implies (2). Furthermore, instead of deterministic $f$ , one can replace $f$ by any Markov kernel. Lemma 1 is useful in information-theoretic treatment of large deviation. For example, taking $X=(X_1,...,X_n)$ and $Y=(Y_1,...,Y_n)$ with iid components and $f=1_{\sum x_i > n a}$ shows that the large deviation exponent cannot exceed something, an idea due to Csiszar, see, e.g., http://www.emis.de/journals/em/docs/boletim/vol374/v37-4-a2-2006.pdf

[Nice! I’ve added this alternate argument to the main post. -T.]

21 September, 2015 at 2:34 am

What was the motivation from data processing for their inequality?

20 September, 2015 at 5:29 pm

David Roberts

The Remark 2 box has a broken equation. Also, at one point, shortly after KL-divergence is defined, you’ve got D_{KL}(X|Y).

[Corrected, thanks – T.]

21 September, 2015 at 12:13 pm

John Mangual

Can the prime number theorem be viewed as a kind of concentration of measure phenomenon? E.g. we can show Möbius valued coin-flips are not biased.

I don’t know how to formulate this precisely. Even the law of large numbers itself is fraught with difficult philosophical questions like “what is a typical infinite random sequence of coin flips?”

All proofs of PNT I have seen so far, ultimately lose their combinatorial flavor and either become delicate (but common) maneuvers of the zeta function with contour integrals or, equally challenging estimates about $\times 2$ (starting with Bertrand’s postulate).

In either case you can’t just conclude “gee that is the max-ent distribution” but you have to prove that entropy is increasing – this is hard. I find many expositions of PNT start off great, but wind up sweeping important sources of complexity under the rug. Sadly, I have not said very much.

24 September, 2015 at 11:55 pm

Jesper

If you use the definition of total variation distance given on Wikipedia, then the constant in Pinsker’s inequality can be improved to 1/2. A constant of 2 is only needed if one defines the total variation distance to be the L_1 distance (in the Wikipedia definition, the total variation distance is half the L_1 distance).

[Corrected, thanks – T.]

3 October, 2015 at 4:26 am

Anonymous

Since there are several kinds of entropies (e.g. Gibbs, Shannon, Perelman), is there some unifying definition of entropy for general (both stochastic and deterministic) dynamical systems?

4 October, 2015 at 1:32 am

PerryZhao

Reblogged this on 木秀于林.

14 October, 2015 at 5:34 am

Bowen

Reblogged this on Statistics & ML Hack.

17 October, 2015 at 9:37 am

Steven Heilman

Is it possible to re-interpret your Pinsker inequalities using the Ricci curvature of the graph G_n,H discussed in the previous post? It is known (Theorem 1.10, http://arxiv.org/pdf/1509.07160v1.pdf) that if a graph has “positive Ricci curvature,” then the graph satisfies a Pinsker inequality. (See also: http://cedricvillani.org/wp-content/uploads/2012/07/031.BV-CKP.pdf). Checking for “positive Ricci curvature” typically amounts to showing that the graph has many short cycles, and it seems that the graph G_n,H has many short cycles. Unfortunately, maybe each of you has a different definition of a Pinsker inequality, so it may be difficult to connect these topics.

17 October, 2015 at 10:23 am

Terence Tao

As far as I can tell, these are unrelated uses of graphs. In the papers you cite, the Pinsker inequality is generalised by placing a graph metric on the underlying sample space that the probability measures are supported on, and then replacing total variation norm with Wasserstein distance. With the random graphs G_n, the graphs are not used as sample spaces, but instead each outcome in the sample space is associated to one such graph. The metric geometry of these graphs is somewhat related to their expansion properties (e.g. expanders tend to have very small diameter, and the spectral gap property of expanders is closely tied to the existence of a Poincare inequality), but it’s unclear to me how to take advantage of this if one wishes to establish expansion of these graphs.

28 October, 2015 at 12:28 pm

Grisou

Hello,
Very interesting.

Is there any similar result for continuous random variables X and Y ?

5 March, 2017 at 10:39 am

Furstenberg limits of the Liouville function | What's new

[…] some . Using the Pinsker-type inequality from this previous blog post, we conclude the lower […]

4 November, 2018 at 8:45 am

Tao’s Proof of (logarithmically averaged) Chowla’s conjecture for two point correlations | I Can't Believe It's Not Random!

[…] Proof: This is Lemma 1 in this post of Tao. […]
