You are currently browsing the tag archive for the ‘Shannon entropy’ tag.
Given a random variable that takes on only finitely many values, we can define its Shannon entropy by the formula
with the convention that . (In some texts, one uses the logarithm to base rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables taking on finitely many values, the joint variable is also a random variable taking on finitely many values, and also has an entropy . It obeys the Shannon inequalities
so we can define some further nonnegative quantities, the mutual information
and the conditional entropies
More generally, given three random variables , one can define the conditional mutual information
and the final of the Shannon entropy inequalities asserts that this quantity is also non-negative.
The mutual information is a measure of the extent to which and fail to be independent; indeed, it is not difficult to show that vanishes if and only if and are independent. Similarly, vanishes if and only if and are conditionally independent relative to . At the other extreme, is a measure of the extent to which fails to depend on ; indeed, it is not difficult to show that if and only if is determined by in the sense that there is a deterministic function such that . In a related vein, if and are equivalent in the sense that there are deterministic functional relationships , between the two variables, then is interchangeable with for the purposes of computing the above quantities, thus for instance , , , , etc..
One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables being considered come from restricting a single random (and uniformly distributed) boolean function on a given finite domain to some subset of :
In this case, has the law of a random uniformly distributed boolean function from to , and the entropy here can be easily computed to be , where denotes the cardinality of . If is the restriction of to , and is the restriction of to , then the joint variable is equivalent to the restriction of to . If one discards the normalisation factor , one then obtains the following dictionary between entropy and the combinatorics of finite sets:
|Random variables||Finite sets|
|Mutual information||Intersection cardinality|
|Conditional entropy||Set difference cardinality|
|Conditional mutual information|
|determined by||a subset of|
|conditionally independent relative to|
Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality becomes the union bound , and the definition of mutual information becomes the inclusion-exclusion formula
For a more advanced example, consider the data processing inequality that asserts that if are conditionally independent relative to , then . Specialising to sets, this now says that if are disjoint outside of , then ; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if and are not necessarily disjoint outside of , then a consideration of Venn diagrams gives the more general inequality
and a further inspection of the diagram then reveals the more precise identity
Using the dictionary in the reverse direction, one is then led to conjecture the identity
which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set is preserved by unions:
Applying the dictionary in the reverse direction, one might now conjecture that if was independent of and was independent of , then should also be independent of , and furthermore that
but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take to be independent, uniformly distributed random elements of the finite field of two elements, and take to be the sum of these two field elements. One can easily check that each of and is separately independent of , but the joint variable determines and thus is not independent of .
From the inclusion-exclusion identities
one can check that (1) is equivalent to the trivial lower bound . The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection . (Even the double intersection only exists information theoretically in a “virtual” sense; the mutual information allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)
However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities associated to just two variables are those that are also necessarily obeyed by their combinatorial analogues . (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summation of this statement.)
One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let be some finite-dimensional vector space over a finite field , and let be a random linear functional on , selected uniformly among all such functions. Every subspace of then gives rise to a random variable formed by restricting to . This random variable is also distributed uniformly amongst all linear functions on , and its entropy can be easily computed to be . Given two random variables formed by restricting to respectively, the joint random variable determines the random linear function on the union on the two spaces, and thus by linearity on the Minkowski sum as well; thus is equivalent to the restriction of to . In particular, . This implies that and also , where is the quotient map. After discarding the normalising constant , this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:
|Mutual information||Dimension of intersection|
|Conditional entropy||Dimension of projection|
|Conditional mutual information|
|determined by||a subspace of|
|conditionally independent relative to||, transverse.|
The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking to be the vector space over the finite field of two elements, and only considering those subspaces that are coordinate subspaces associated to various subsets of .
As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above, is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of ). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Los’s theorem on ultraproducts, as done in this previous blog post).
The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces such that and are separately transverse to , but their sum is not; for instance, in a two-dimensional vector space , one can take to be the one-dimensional subspaces spanned by , , and respectively. Note that this is essentially the same counterexample from before (which took to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables (as well as the corresponding quantities on the above table).
However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality
for any subspaces . This is easiest to see when the three terms on the right-hand side vanish; then are transverse, which implies that ; similarly . But and are transverse, and this clearly implies that and are themselves transverse. To prove the general case of Ingleton’s inequality, one can define and use (and similarly for instead of ) to reduce to establishing the inequality
which can be rearranged using (and similarly for instead of ) and as
but this is clear since .
Returning to the entropy setting, the analogue
of (3) is true (exercise!), but the analogue
of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then are conditionally independent relative to , and relative to , and and are independent, and the claim (4) would then be asserting that and are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take to be independent uniform variables from , and take and to be (say) and respectively (thus are the indicators of the events and respectively). Once one conditions on either or , one of has positive conditional entropy and the other has zero entropy, and so are conditionally independent relative to either or ; also, or are independent of each other. But and are not independent of each other (they cannot be simultaneously equal to ). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces has a well-defined intersection that is also a subspace, whereas for arbitrary random variables , there does not necessarily exist the analogue of an intersection, namely a “common information” random variable that has the entropy of and is determined either by or by .
I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality
One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that
whenever was sufficiently large depending on , where is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale between and , one can form certain random variables . The random variable is a sign pattern of the form where is a random number chosen from to (with logarithmic weighting). The random variable was tuple of reductions of to primes comparable to . Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of , the circle method, and the Matomaki-Radziwill theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there was a lower bound
for any , where denotes the shifted sign pattern . On the other hand, one had the entropy bounds
and from concatenating sign patterns one could see that is equivalent to the joint random variable for any . Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once was allowed to become sufficiently large compared to , but the bound was quite weak (coming ultimately from the unboundedness of as the interval of values of under consideration becomes large), something of the order of ; the quantity needs at various junctures to be less than a small power of , so the relationship between and becomes essentially quadruple exponential in nature, . The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate of the mean entropy, in that this quantity decreased by as increased from to , basically by dividing into components , and observing from (6) each of these shares a bit of common information with the same variable . This is relatively clear when one works in a set model, in which is modeled by a set of size , and is modeled by a set of the form
for various sets of size (also there is some translation symmetry that maps to a shift while preserving all of the ).
However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables are basically jointly independent as ranges over dyadic values that are much smaller than , which in the set model corresponds to the all being disjoint. One can then establish a variant
of (6), which in the set model roughly speaking asserts that each claims a portion of the of cardinality that is not claimed by previous choices of . This leads to a more efficient contradiction (relying on the unboundedness of rather than ) that looks like it removes one order of exponential growth, thus the relationship between and is now . Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form
for a small constant , which on iterating and using the boundedness of gives the claim. (A modification of this analysis, at least on the level of the back of the envelope calculation, suggests that the Matomaki-Radziwill theorem is needed only for ranges greater than or so, although at this range the theorem is not significantly simpler than the general case).
A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:
The proof uses graph-theoretic techniques. Setting , we obtain a useful corollary: if has small doubling in the sense that , then we have for all , where is the sum of copies of .
In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as in are replaced with discrete random variables taking values in , and (the logarithm of) cardinality of a set is replaced by Shannon entropy of a random variable . (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.
In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set to a subset , but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if , then for all , where are independent copies of ; this improves slightly over the analogous combinatorial inequality. Indeed, the function is concave (this can be seen by using the version of Theorem 2 (or (2) below) to show that the quantity is decreasing in ).
Theorem 2 is actually a quick consequence of the submodularity inequality
in information theory, which is valid whenever are discrete random variables such that and each determine (i.e. is a function of , and also a function of ), and and jointly determine (i.e is a function of and ). To apply this, let be independent discrete random variables taking values in . Observe that the pairs and each determine , and jointly determine . Applying (1) we conclude that
and Theorem 2 follows by telescoping series.
The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of than ; see this paper and this paper respectively.
It turns out to be a favourable week or two for me to finally finish a number of papers that had been at a nearly completed stage for a while. I have just uploaded to the arXiv my article “Sumset and inverse sumset theorems for Shannon entropy“, submitted to Combinatorics, Probability, and Computing. This paper evolved from a “deleted scene” in my book with Van Vu entitled “Entropy sumset estimates“. In those notes, we developed analogues of the standard Plünnecke-Ruzsa sumset estimates (which relate quantities such as the cardinalities of the sum and difference sets of two finite sets in an additive group to each other), to the entropy setting, in which the finite sets are replaced instead with discrete random variables taking values in that group G, and the (logarithm of the) cardinality |A| is replaced with the Shannon entropy
This quantity measures the information content of X; for instance, if , then it will take k bits on the average to store the value of X (thus a string of n independent copies of X will require about nk bits of storage in the asymptotic limit ). The relationship between entropy and cardinality is that if X is the uniform distribution on a finite non-empty set A, then . If instead X is non-uniformly distributed on A, one has , thanks to Jensen’s inequality.
It turns out that many estimates on sumsets have entropy analogues, which resemble the “logarithm” of the sumset estimates. For instance, the trivial bounds
have the entropy analogue
whenever X, Y are independent discrete random variables in an additive group; this is not difficult to deduce from standard entropy inequalities. Slightly more non-trivially, the sum set estimate
established by Ruzsa, has an entropy analogue
and similarly for a number of other standard sumset inequalities in the literature (e.g. the Rusza triangle inequality, the Plünnecke-Rusza inequality, and the Balog-Szemeredi-Gowers theorem, though the entropy analogue of the latter requires a little bit of care to state). These inequalities can actually be deduced fairly easily from elementary arithmetic identities, together with standard entropy inequalities, most notably the submodularity inequality
whenever X,Y,Z,W are discrete random variables such that X and Y each determine W separately (thus for some deterministic functions f, g) and X and Y determine Z jointly (thus for some deterministic function f). For instance, if X,Y,Z are independent discrete random variables in an additive group G, then and each determine separately, and determine jointly, leading to the inequality
which soon leads to the entropy Rusza triangle inequality
which is an analogue of the combinatorial Ruzsa triangle inequality
All of this was already in the unpublished notes with Van, though I include it in this paper in order to place it in the literature. The main novelty of the paper, though, is to consider the entropy analogue of Freiman’s theorem, which classifies those sets A for which . Here, the analogous problem is to classify the random variables such that , where are independent copies of X. Let us say that X has small doubling if this is the case.
For instance, the uniform distribution U on a finite subgroup H of G has small doubling (in fact in this case). In a similar spirit, the uniform distribution on a (generalised) arithmetic progression P also has small doubling, as does the uniform distribution on a coset progression H+P. Also, if X has small doubling, and Y has bounded entropy, then X+Y also has small doubling, even if Y and X are not independent. The main theorem is that these are the only cases:
Theorem 1. (Informal statement) X has small doubling if and only if for some uniform distribution U on a coset progression (of bounded rank), and Y has bounded entropy.
For instance, suppose that X was the uniform distribution on a dense subset A of a finite group G. Then Theorem 1 asserts that X is close in a “transport metric” sense to the uniform distribution U on G, in the sense that it is possible to rearrange or transport the probability distribution of X to the probability distribution of U (or vice versa) by shifting each component of the mass of X by an amount Y which has bounded entropy (which basically means that it primarily ranges inside a set of bounded cardinality). The way one shows this is by randomly translating the mass of X around by a few random shifts to approximately uniformise the distribution, and then deal with the residual fluctuation in the distribution by hand. Theorem 1 as a whole is established by using the Freiman theorem in the combinatorial setting combined with various elementary convexity and entropy inequality arguments to reduce matters to the above model case when X is supported inside a finite group G and has near-maximal entropy.
I also show a variant of the above statement: if X, Y are independent and , then we have (i.e. X has the same distribution as Y+Z for some Z of bounded entropy (not necessarily independent of X or Y). Thus if two random variables are additively related to each other, then they can be additively transported to each other by using a bounded amount of entropy.
In the last part of the paper I relate these discrete entropies to their continuous counterparts
where X is now a continuous random variable on the real line with density function . There are a number of sum set inequalities known in this setting, for instance
for independent copies of a finite entropy random variable X, with equality if and only if X is a Gaussian. Using this inequality and Theorem 1, I show a discrete version, namely that
whenever and are independent copies of a random variable in (or any other torsion-free abelian group) whose entropy is sufficiently large depending on . This is somewhat analogous to the classical sumset inequality
though notice that we have a gain of just rather than here, the point being that there is a Gaussian counterexample in the entropy setting which does not have a combinatorial analogue (except perhaps in the high-dimensional limit). The main idea is to use Theorem 1 to trap most of X inside a coset progression, at which point one can use Fourier-analytic additive combinatorial tools to show that the distribution is “smooth” in some non-trivial direction r, which can then be used to approximate the discrete distribution by a continuous one.
I also conjecture more generally that the entropy monotonicity inequalities established by Artstein, Barthe, Ball, and Naor in the continuous case also hold in the above sense in the discrete case, though my method of proof breaks down because I no longer can assume small doubling.