Let $X$ and $Y$ be two random variables taking values in the same (discrete) range $R$, and let $E$ be some subset of $R$, which we think of as the set of “bad” outcomes for either $X$ or $Y$. If $X$ and $Y$ have the same probability distribution, then clearly

$$\mathbf{P}(X \in E) = \mathbf{P}(Y \in E).$$

In particular, if it is rare for $Y$ to lie in $E$, then it is also rare for $X$ to lie in $E$.
If $X$ and $Y$ do not have exactly the same probability distribution, but their probability distributions are close to each other in some sense, then we can expect to have an approximate version of the above statement. For instance, from the definition of the total variation distance $\delta(X,Y)$ between two random variables $X, Y$ (or more precisely, the total variation distance between the probability distributions of two random variables), we see that

$$|\mathbf{P}(X \in E) - \mathbf{P}(Y \in E)| \leq \delta(X,Y) \ \ \ \ \ (1)$$

for any $E \subset R$. In particular, if it is rare for $Y$ to lie in $E$, and $X$ and $Y$ are close in total variation, then it is also rare for $X$ to lie in $E$.
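To see (1) in action, here is a minimal numerical sketch in Python; the range $R$, the two distributions, and the brute-force enumeration over all events $E$ are purely illustrative choices, not anything taken from the paper:

```python
# Sanity check of (1): |P(X in E) - P(Y in E)| <= delta(X,Y) for every event E.
R = [0, 1, 2]
p = {0: 0.5, 1: 0.3, 2: 0.2}  # distribution of X (illustrative)
q = {0: 0.4, 1: 0.4, 2: 0.2}  # distribution of Y (illustrative)

# Total variation distance equals half the l^1 distance between the distributions.
tv = 0.5 * sum(abs(p[x] - q[x]) for x in R)

for mask in range(1 << len(R)):          # enumerate all subsets E of R
    E = [x for i, x in enumerate(R) if mask >> i & 1]
    gap = abs(sum(p[x] for x in E) - sum(q[x] for x in E))
    assert gap <= tv + 1e-12
print("delta(X,Y) =", tv)                # 0.1, attained e.g. at E = {0}
```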
A basic inequality in information theory is Pinsker’s inequality

$$\delta(X,Y) \leq \sqrt{\frac{1}{2} D(X\|Y)}$$

where the Kullback-Leibler divergence $D(X\|Y)$ is defined by the formula

$$D(X\|Y) := \sum_{x \in R} \mathbf{P}(X = x) \log \frac{\mathbf{P}(X = x)}{\mathbf{P}(Y = x)}.$$

(See this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that $D(X\|Y)$ is non-negative (Gibbs’ inequality), and vanishes if and only if $X$, $Y$ have the same distribution; thus one can think of $D(X\|Y)$ as a measure of how close the distributions of $X$ and $Y$ are to each other, although one should caution that this is not a symmetric notion of distance, as $D(X\|Y) \neq D(Y\|X)$ in general. Inserting Pinsker’s inequality into (1), we see for instance that

$$\mathbf{P}(X \in E) \leq \mathbf{P}(Y \in E) + \sqrt{\frac{1}{2} D(X\|Y)}.$$

Thus, if $X$ is close to $Y$ in the Kullback-Leibler sense, and it is rare for $Y$ to lie in $E$, then it is rare for $X$ to lie in $E$ as well.
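The same toy distributions as before can be used to sanity-check Pinsker’s inequality and the asymmetry of the divergence (again a purely illustrative sketch):

```python
import math

R = [0, 1, 2]
p = {0: 0.5, 1: 0.3, 2: 0.2}  # distribution of X (illustrative)
q = {0: 0.4, 1: 0.4, 2: 0.2}  # distribution of Y (illustrative)

# Kullback-Leibler divergence D(X||Y), with natural logarithms as in the post.
D = sum(p[x] * math.log(p[x] / q[x]) for x in R if p[x] > 0)
tv = 0.5 * sum(abs(p[x] - q[x]) for x in R)

# Pinsker's inequality: delta(X,Y) <= sqrt(D(X||Y)/2).
assert tv <= math.sqrt(D / 2)
print(tv, math.sqrt(D / 2))   # 0.1 vs roughly 0.112

# The divergence is not symmetric: D(Y||X) differs from D(X||Y) in general.
D_rev = sum(q[x] * math.log(q[x] / p[x]) for x in R if q[x] > 0)
print(D, D_rev)
```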
We can specialise this inequality to the case when $Y$ is a uniform random variable $U$ on a finite range $R$ of some cardinality $N$, in which case the Kullback-Leibler divergence simplifies to

$$D(X\|U) = \log N - \mathbf{H}(X)$$

where

$$\mathbf{H}(X) := \sum_{x \in R} \mathbf{P}(X = x) \log \frac{1}{\mathbf{P}(X = x)}$$

is the Shannon entropy of $X$. Again, a routine application of Jensen’s inequality shows that $\mathbf{H}(X) \leq \log N$, with equality if and only if $X$ is uniformly distributed on $R$. The above inequality then becomes

$$\mathbf{P}(X \in E) \leq \mathbf{P}(U \in E) + \sqrt{\frac{1}{2} (\log N - \mathbf{H}(X))}. \ \ \ \ \ (2)$$

Thus, if $E$ is a small fraction of $R$ (so that it is rare for $U$ to lie in $E$), and the entropy of $X$ is very close to the maximum possible value of $\log N$, then it is rare for $X$ to lie in $E$ also.
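Here is a quick numerical check of the identity $D(X\|U) = \log N - \mathbf{H}(X)$ and of the inequality (2); the distribution and the set $E$ below are arbitrary illustrative choices:

```python
import math

N = 8
p = [0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05]  # illustrative distribution of X

H = -sum(px * math.log(px) for px in p if px > 0)      # Shannon entropy H(X)
D = sum(px * math.log(px * N) for px in p if px > 0)   # D(X||U) against uniform U
assert abs(D - (math.log(N) - H)) < 1e-12              # D(X||U) = log N - H(X)

E = {6, 7}                                             # a "bad" set E
PX = sum(p[x] for x in E)
PU = len(E) / N
# Inequality (2): P(X in E) <= P(U in E) + sqrt((log N - H(X))/2).
assert PX <= PU + math.sqrt((math.log(N) - H) / 2)
print(PX, PU + math.sqrt((math.log(N) - H) / 2))       # 0.1 vs roughly 0.55
```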
The inequality (2) is only useful when the entropy $\mathbf{H}(X)$ is close to $\log N$ in the sense that $\mathbf{H}(X) = \log N - O(1)$, otherwise the bound is worse than the trivial bound of $\mathbf{P}(X \in E) \leq 1$. In my recent paper on the Chowla and Elliott conjectures, I ended up using a variant of (2) which was still non-trivial when the entropy $\mathbf{H}(X)$ was allowed to be smaller than $\log N - O(1)$. More precisely, I used the following simple inequality, which is implicit in the arguments of that paper but which I would like to make more explicit in this post:
Lemma 1 (Pinsker-type inequality) Let $X$ be a random variable taking values in a finite range $R$ of cardinality $N$, let $U$ be a uniformly distributed random variable in $R$, and let $E$ be a subset of $R$. Then

$$\mathbf{P}(X \in E) \leq \frac{\log N - \mathbf{H}(X) + \log 2}{\log \frac{1}{\mathbf{P}(U \in E)}}.$$

Proof: Consider the conditional entropy $\mathbf{H}(X | 1_{X \in E})$. On the one hand, since $1_{X \in E}$ is a function of $X$, we have

$$\mathbf{H}(X | 1_{X \in E}) = \mathbf{H}(X) - \mathbf{H}(1_{X \in E}) \geq \mathbf{H}(X) - \log 2$$

by Jensen’s inequality (which bounds the entropy of the two-valued variable $1_{X \in E}$ by $\log 2$). On the other hand, one has

$$\begin{aligned} \mathbf{H}(X | 1_{X \in E}) &= \mathbf{P}(X \in E) \mathbf{H}(X | X \in E) + \mathbf{P}(X \not\in E) \mathbf{H}(X | X \not\in E) \\ &\leq \mathbf{P}(X \in E) \log |E| + \mathbf{P}(X \not\in E) \log N \\ &= \log N - \mathbf{P}(X \in E) \log \frac{N}{|E|} \\ &= \log N - \mathbf{P}(X \in E) \log \frac{1}{\mathbf{P}(U \in E)} \end{aligned}$$

where we have again used Jensen’s inequality. Putting the two inequalities together, we obtain the claim. $\Box$
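Since Lemma 1 holds for every distribution of $X$, one can stress-test it numerically against randomly generated distributions; here is a minimal sketch (the values of $N$, the set $E$, and the random weights are all arbitrary):

```python
import math, random

random.seed(0)
N = 100
E = set(range(5))                        # P(U in E) = 5/100 = 0.05
PU = len(E) / N

for _ in range(1000):
    # A random probability distribution on {0, ..., N-1}.
    w = [random.random() for _ in range(N)]
    s = sum(w)
    p = [wi / s for wi in w]

    H = -sum(px * math.log(px) for px in p if px > 0)
    PX = sum(p[x] for x in E)

    # Lemma 1: P(X in E) <= (log N - H(X) + log 2) / log(1/P(U in E)).
    bound = (math.log(N) - H + math.log(2)) / math.log(1 / PU)
    assert PX <= bound + 1e-12
```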
Remark 2 As noted in comments, this inequality can be viewed as a special case of the more general inequality

$$\mathbf{P}(X \in E) \leq \frac{D(X\|Y) + \log 2}{\log \frac{1}{\mathbf{P}(Y \in E)}}$$

for arbitrary random variables $X, Y$ taking values in the same discrete range $R$, which follows from the data processing inequality

$$D(f(X)\|f(Y)) \leq D(X\|Y)$$

for arbitrary functions $f$, applied to the indicator function $f = 1_E$. Indeed one has

$$\begin{aligned} D(1_E(X)\|1_E(Y)) &= \mathbf{P}(X \in E) \log \frac{\mathbf{P}(X \in E)}{\mathbf{P}(Y \in E)} + \mathbf{P}(X \not\in E) \log \frac{\mathbf{P}(X \not\in E)}{\mathbf{P}(Y \not\in E)} \\ &\geq \mathbf{P}(X \in E) \log \frac{1}{\mathbf{P}(Y \in E)} - h(\mathbf{P}(X \in E)) \\ &\geq \mathbf{P}(X \in E) \log \frac{1}{\mathbf{P}(Y \in E)} - \log 2 \end{aligned}$$

where $h(u) := u \log \frac{1}{u} + (1-u) \log \frac{1}{1-u}$ is the entropy function.
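The chain of inequalities in this remark is likewise easy to test numerically; in the sketch below (with arbitrary toy distributions), kl_push plays the role of $D(1_E(X)\|1_E(Y))$:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between two finite distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def h(u):
    # The entropy function h(u) = u log(1/u) + (1-u) log(1/(1-u)).
    return -u * math.log(u) - (1 - u) * math.log(1 - u)

p = [0.4, 0.3, 0.2, 0.1]      # distribution of X (illustrative)
q = [0.25, 0.25, 0.25, 0.25]  # distribution of Y (illustrative)
E = {0, 1}

PXE = sum(p[x] for x in E)
PYE = sum(q[x] for x in E)
kl_push = kl([PXE, 1 - PXE], [PYE, 1 - PYE])        # D(1_E(X) || 1_E(Y))

assert kl_push <= kl(p, q)                          # data processing inequality
assert kl_push >= PXE * math.log(1 / PYE) - h(PXE)  # lower bound from Remark 2
```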
Thus, for instance, if one has

$$\mathbf{H}(X) \geq \log N - \epsilon K$$

and

$$\mathbf{P}(U \in E) \leq e^{-K}$$

for some $K$ much larger than $1$ (so that $\log 2 = o(K)$), then Lemma 1 gives

$$\mathbf{P}(X \in E) \leq \epsilon + o(1).$$

More informally: if the entropy of $X$ is somewhat close to the maximum possible value of $\log N$, and it is exponentially rare for a uniform variable to lie in $E$, then it is still somewhat rare for $X$ to lie in $E$. The estimate given is close to sharp in this regime, as can be seen by calculating the entropy of a random variable $X$ which is uniformly distributed inside a small set $E$ with some probability $\epsilon$ and uniformly distributed outside of $E$ with probability $1 - \epsilon$, for some parameter $0 < \epsilon < 1$.
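One can carry out this sharpness computation numerically: for the random variable just described one has $\mathbf{P}(X \in E) = \epsilon$, while Lemma 1 gives a bound of $\epsilon + O(1/K)$. A minimal sketch (the specific values of $N$, $K$, $\epsilon$ are arbitrary):

```python
import math

N, K, eps = 10**6, 10.0, 0.05
E_size = round(N * math.exp(-K))   # so P(U in E) is about e^{-K}
PU = E_size / N

# X is uniform on E with probability eps, uniform outside E with probability 1-eps:
# its entropy is eps*log|E| + (1-eps)*log(N-|E|) + h(eps), roughly log N - eps*K.
H = (eps * math.log(E_size) + (1 - eps) * math.log(N - E_size)
     - eps * math.log(eps) - (1 - eps) * math.log(1 - eps))

PX = eps                           # P(X in E) for this X
bound = (math.log(N) - H + math.log(2)) / math.log(1 / PU)
print(PX, bound)                   # 0.05 vs roughly 0.1: sharp up to an O(1/K) loss
```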
It turns out that the above lemma combines well with concentration of measure estimates; in my paper, I used one of the simplest such estimates, namely Hoeffding’s inequality (a toy sketch of this combination is given at the end of this post), but there are of course many other estimates of this type (see e.g. this previous blog post for some others). Roughly speaking, concentration of measure inequalities allow one to make approximations such as

$$F(X) \approx \mathbf{E} F(U) \ \ \ \ \ (3)$$

with somewhat high probability, if the entropy of $X$ is somewhat close to maximum. This observation, combined with an “entropy decrement argument” that allowed one to arrive at a situation in which the relevant random variable did have a near-maximum entropy, is the key new idea in my recent paper; for instance, one can use the approximation (3) to obtain an approximation of the form

$$\sum_{j=1}^{H} \lambda(n+j) \lambda(n+j+p) 1_{p | n+j} \approx \frac{1}{p} \sum_{j=1}^{H} \lambda(n+j) \lambda(n+j+p)$$

for “most” choices of $n$ and a suitable choice of $H$ (with the latter being provided by the entropy decrement argument). The left-hand side is tied to Chowla-type sums such as $\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n}$ through the multiplicativity of the Liouville function $\lambda$, while the right-hand side, being a linear correlation involving the two parameters $n$ and $p$ rather than just one, has “finite complexity” and can be treated by existing techniques such as the Hardy-Littlewood circle method. One could hope to similarly use approximations such as (3) in other problems in analytic number theory or combinatorics.
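As a toy illustration of how Lemma 1 combines with a concentration estimate such as Hoeffding’s inequality (a simplified sketch under illustrative assumptions, not the actual argument of the paper): take $R = \{-1,+1\}^m$, let $F(x)$ be the normalised coordinate sum $\frac{1}{m}(x_1 + \dots + x_m)$, and let $E$ be the event $|F| \geq t$. Hoeffding’s inequality bounds $\mathbf{P}(U \in E)$ by $2 e^{-m t^2/2}$, and Lemma 1 then bounds $\mathbf{P}(X \in E)$ in terms of the entropy defect $m \log 2 - \mathbf{H}(X)$, which is a crude form of the approximation (3):

```python
import math

m, t = 1000, 0.2
# R = {-1,+1}^m, so log N = m log 2; F(x) = (x_1 + ... + x_m)/m; E = {|F| >= t}.

# Hoeffding: P(U in E) <= 2 exp(-m t^2 / 2) for U uniform on R.
log_PU_bound = math.log(2) - m * t**2 / 2

# If H(X) >= m log 2 - c, then Lemma 1 gives
#   P(X in E) <= (c + log 2) / log(1/P(U in E)) <= (c + log 2) / (-log_PU_bound),
# since log(1/P(U in E)) >= -log_PU_bound > 0 for this choice of m and t.
for c in [1.0, 5.0, 10.0]:
    print(f"entropy defect {c}: P(|F(X)| >= {t}) <= "
          f"{(c + math.log(2)) / (-log_PU_bound):.4f}")
```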