Special cases of Shannon entropy

1 March, 2017 in expository, math.IT, math.NT | Tags: Liouville function, randomness, Shannon entropy | by Terence Tao

Given a random variable ${X}$ that takes on only finitely many values, we can define its Shannon entropy by the formula

$\displaystyle H(X) := \sum_x \mathbf{P}(X=x) \log \frac{1}{\mathbf{P}(X=x)}$

with the convention that ${0 \log \frac{1}{0} = 0}$ . (In some texts, one uses the logarithm to base ${2}$ rather than the natural logarithm, but the choice of base will not be relevant for this discussion.) This is clearly a nonnegative quantity. Given two random variables ${X,Y}$ taking on finitely many values, the joint variable ${(X,Y)}$ is also a random variable taking on finitely many values, and also has an entropy ${H(X,Y)}$ . It obeys the Shannon inequalities

$\displaystyle H(X), H(Y) \leq H(X,Y) \leq H(X) + H(Y)$

so we can define some further nonnegative quantities, the mutual information

$\displaystyle I(X:Y) := H(X) + H(Y) - H(X,Y)$

and the conditional entropies

$\displaystyle H(X|Y) := H(X,Y) - H(Y); \quad H(Y|X) := H(X,Y) - H(X).$

More generally, given three random variables ${X,Y,Z}$ , one can define the conditional mutual information

$\displaystyle I(X:Y|Z) := H(X|Z) + H(Y|Z) - H(X,Y|Z)$

and the last of the Shannon entropy inequalities asserts that this quantity is also non-negative.
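The definitions above translate directly into a few lines of code. The following is a minimal Python sketch, not part of the original discussion: the helper names (`entropy`, `marginal`, `H`, `cond_entropy`, `mutual_info`, `cond_mutual_info`) and the representation of a joint law as a dictionary from outcome tuples to probabilities are assumptions made purely for illustration.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """Shannon entropy (natural logarithm) of a dict {outcome: probability}."""
    # Terms with probability 0 are skipped, matching the convention 0 log(1/0) = 0.
    return sum(p * math.log(1.0 / p) for p in pmf.values() if p > 0)

def marginal(joint, coords):
    """Law of the coordinates `coords` of a joint law keyed by outcome tuples."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in coords)] += p
    return dict(out)

def H(joint, coords):
    """Entropy of the joint variable formed from the coordinates `coords`."""
    return entropy(marginal(joint, coords))

def cond_entropy(joint, xs, ys):
    """H(X|Y) := H(X,Y) - H(Y)."""
    return H(joint, xs + ys) - H(joint, ys)

def mutual_info(joint, xs, ys):
    """I(X:Y) := H(X) + H(Y) - H(X,Y)."""
    return H(joint, xs) + H(joint, ys) - H(joint, xs + ys)

def cond_mutual_info(joint, xs, ys, zs):
    """I(X:Y|Z) := H(X|Z) + H(Y|Z) - H(X,Y|Z)."""
    return (cond_entropy(joint, xs, zs) + cond_entropy(joint, ys, zs)
            - cond_entropy(joint, xs + ys, zs))

# Hypothetical example: X is a fair bit and Y is an exact copy of X.
copy_law = {(0, 0): 0.5, (1, 1): 0.5}
print(H(copy_law, (0,)),
      mutual_info(copy_law, (0,), (1,)),
      cond_entropy(copy_law, (0,), (1,)))
```

In this toy example each variable determines the other, so one expects the entropy and the mutual information to both equal ${\log 2}$ and the conditional entropy to vanish.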

The mutual information ${I(X:Y)}$ is a measure of the extent to which ${X}$ and ${Y}$ fail to be independent; indeed, it is not difficult to show that ${I(X:Y)}$ vanishes if and only if ${X}$ and ${Y}$ are independent. Similarly, ${I(X:Y|Z)}$ vanishes if and only if ${X}$ and ${Y}$ are conditionally independent relative to ${Z}$ . At the other extreme, ${H(X|Y)}$ is a measure of the extent to which ${X}$ fails to depend on ${Y}$ ; indeed, it is not difficult to show that ${H(X|Y)=0}$ if and only if ${X}$ is determined by ${Y}$ in the sense that there is a deterministic function ${f}$ such that ${X = f(Y)}$ . In a related vein, if ${X}$ and ${X'}$ are equivalent in the sense that there are deterministic functional relationships ${X = f(X')}$ , ${X' = g(X)}$ between the two variables, then ${X}$ is interchangeable with ${X'}$ for the purposes of computing the above quantities, thus for instance ${H(X) = H(X')}$ , ${H(X,Y) = H(X',Y)}$ , ${I(X:Y) = I(X':Y)}$ , ${I(X:Y|Z) = I(X':Y|Z)}$ , etc.

One can get some initial intuition for these information-theoretic quantities by specialising to a simple situation in which all the random variables ${X}$ being considered come from restricting a single random (and uniformly distributed) boolean function ${F: \Omega \rightarrow \{0,1\}}$ on a given finite domain ${\Omega}$ to some subset ${A}$ of ${\Omega}$ :

$\displaystyle X = F \downharpoonright_A.$

In this case, ${X}$ has the law of a random uniformly distributed boolean function from ${A}$ to ${\{0,1\}}$ , and the entropy here can be easily computed to be ${|A| \log 2}$ , where ${|A|}$ denotes the cardinality of ${A}$ . If ${X}$ is the restriction of ${F}$ to ${A}$ , and ${Y}$ is the restriction of ${F}$ to ${B}$ , then the joint variable ${(X,Y)}$ is equivalent to the restriction of ${F}$ to ${A \cup B}$ . If one discards the normalisation factor ${\log 2}$ , one then obtains the following dictionary between entropy and the combinatorics of finite sets:

Random variables ${X,Y,Z}$	Finite sets ${A,B,C}$
Entropy ${H(X)}$	Cardinality ${|A|}$
Joint variable ${(X,Y)}$	Union ${A \cup B}$
Mutual information ${I(X:Y)}$	Intersection cardinality ${|A \cap B|}$
Conditional entropy ${H(X|Y)}$	Set difference cardinality ${|A \backslash B|}$
Conditional mutual information ${I(X:Y|Z)}$	${|(A \cap B) \backslash C|}$
${X, Y}$ independent	${A, B}$ disjoint
${X}$ determined by ${Y}$	${A}$ a subset of ${B}$
${X,Y}$ conditionally independent relative to ${Z}$	${A \cap B \subset C}$
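The dictionary can be checked by brute force on a small example. The sketch below is only an illustration under stated assumptions: the domain `OMEGA` of size 4 and the subsets `A`, `B`, `C` are hypothetical choices, and the helper `H` computes entropies directly from the law of a uniformly random boolean function, in units of ${\log 2}$, without assuming the dictionary.

```python
import itertools
import math
from collections import Counter

OMEGA = range(4)  # hypothetical small domain; F ranges over all 2^4 boolean functions

def H(*subsets):
    """Entropy, in units of log 2, of the joint variable (F restricted to S, for S in subsets)."""
    functions = list(itertools.product((0, 1), repeat=len(OMEGA)))
    counts = Counter(
        tuple(tuple(F[i] for i in sorted(S)) for S in subsets)
        for F in functions
    )
    n = len(functions)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

A, B, C = {0, 1, 2}, {1, 2, 3}, {2, 3}  # hypothetical subsets of OMEGA

# Each row: (description, entropy-side value, set-side value).
dictionary = [
    ("H(X) vs |A|", H(A), len(A)),
    ("H(X,Y) vs |A u B|", H(A, B), len(A | B)),
    ("I(X:Y) vs |A n B|", H(A) + H(B) - H(A, B), len(A & B)),
    ("H(X|Y) vs |A minus B|", H(A, B) - H(B), len(A - B)),
    ("I(X:Y|Z) vs |(A n B) minus C|",
     H(A, C) + H(B, C) - H(A, B, C) - H(C), len((A & B) - C)),
]
for name, entropy_side, set_side in dictionary:
    print(f"{name}: {entropy_side:.1f} and {set_side}")
```

The conditional mutual information is expanded here as ${H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)}$, which is what the definition becomes after writing each conditional entropy as a difference of joint entropies.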

Every (linear) inequality or identity about entropy (and related quantities, such as mutual information) then specialises to a combinatorial inequality or identity about finite sets that is easily verified. For instance, the Shannon inequality ${H(X,Y) \leq H(X)+H(Y)}$ becomes the union bound ${|A \cup B| \leq |A| + |B|}$ , and the definition of mutual information becomes the inclusion-exclusion formula

$\displaystyle |A \cap B| = |A| + |B| - |A \cup B|.$

For a more advanced example, consider the data processing inequality that asserts that if ${X, Z}$ are conditionally independent relative to ${Y}$ , then ${I(X:Z) \leq I(X:Y)}$ . Specialising to sets, this now says that if ${A, C}$ are disjoint outside of ${B}$ , then ${|A \cap C| \leq |A \cap B|}$ ; this can be made apparent by considering the corresponding Venn diagram. This dictionary also suggests how to prove the data processing inequality using the existing Shannon inequalities. Firstly, if ${A}$ and ${C}$ are not necessarily disjoint outside of ${B}$ , then a consideration of Venn diagrams gives the more general inequality

$\displaystyle |A \cap C| \leq |A \cap B| + |(A \cap C) \backslash B|$

and a further inspection of the diagram then reveals the more precise identity

$\displaystyle |A \cap C| + |(A \cap B) \backslash C| = |A \cap B| + |(A \cap C) \backslash B|.$

Using the dictionary in the reverse direction, one is then led to conjecture the identity

$\displaystyle I( X : Z ) + I( X : Y | Z ) = I( X : Y ) + I( X : Z | Y )$

which (together with non-negativity of conditional mutual information) implies the data processing inequality, and this identity is in turn easily established from the definition of mutual information.
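One way to see why the identity holds is that both sides expand to ${H(X) + H(Y,Z) - H(X,Y,Z) = I(X:(Y,Z))}$ . As a sanity check, the following Python sketch tests this numerically on a randomly generated joint law; the alphabet size, the seed, and the helper names `H` and `I` are arbitrary assumptions for illustration, not anything taken from the argument above.

```python
import itertools
import math
import random
from collections import defaultdict

def H(joint, coords):
    """Entropy of the coordinates `coords` of a joint law keyed by outcome tuples."""
    law = defaultdict(float)
    for outcome, p in joint.items():
        law[tuple(outcome[i] for i in coords)] += p
    return sum(p * math.log(1.0 / p) for p in law.values() if p > 0)

def I(joint, xs, ys, zs=()):
    """I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); zs=() gives I(X:Y)."""
    return (H(joint, xs + zs) + H(joint, ys + zs)
            - H(joint, xs + ys + zs) - H(joint, zs))

random.seed(0)  # arbitrary seed, only for reproducibility
outcomes = list(itertools.product(range(3), repeat=3))  # values of (X, Y, Z)
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

X, Y, Z = (0,), (1,), (2,)
lhs = I(joint, X, Z) + I(joint, X, Y, Z)   # I(X:Z) + I(X:Y|Z)
rhs = I(joint, X, Y) + I(joint, X, Z, Y)   # I(X:Y) + I(X:Z|Y)
both = I(joint, X, Y + Z)                  # I(X:(Y,Z))
print(lhs, rhs, both)
```

The three printed numbers should agree up to floating-point error. When ${X, Z}$ are conditionally independent relative to ${Y}$ the term ${I(X:Z|Y)}$ vanishes, and the data processing inequality follows from ${I(X:Y|Z) \geq 0}$ .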

On the other hand, not every assertion about cardinalities of sets generalises to entropies of random variables that are not arising from restricting random boolean functions to sets. For instance, a basic property of sets is that disjointness from a given set ${C}$ is preserved by unions:

$\displaystyle A \cap C = B \cap C = \emptyset \implies (A \cup B) \cap C = \emptyset.$

Indeed, one has the union bound

$\displaystyle |(A \cup B) \cap C| \leq |A \cap C| + |B \cap C|. \ \ \ \ \ (1)$

Applying the dictionary in the reverse direction, one might now conjecture that if ${X}$ was independent of ${Z}$ and ${Y}$ was independent of ${Z}$ , then ${(X,Y)}$ should also be independent of ${Z}$ , and furthermore that

$\displaystyle I(X,Y:Z) \leq I(X:Z) + I(Y:Z)$

but these statements are well known to be false (for reasons related to pairwise independence of random variables being strictly weaker than joint independence). For a concrete counterexample, one can take ${X, Y \in {\bf F}_2}$ to be independent, uniformly distributed random elements of the finite field ${{\bf F}_2}$ of two elements, and take ${Z := X+Y}$ to be the sum of these two field elements. One can easily check that each of ${X}$ and ${Y}$ is separately independent of ${Z}$ , but the joint variable ${(X,Y)}$ determines ${Z}$ and thus is not independent of ${Z}$ .
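The counterexample is small enough to verify mechanically. Here is a minimal Python sketch (the helper names `H` and `I` are illustrative assumptions, and entropies are measured in bits, that is, with logarithms to base ${2}$):

```python
import math
from collections import defaultdict

def H(joint, coords):
    """Entropy in bits of the coordinates `coords` of a joint law."""
    law = defaultdict(float)
    for outcome, p in joint.items():
        law[tuple(outcome[i] for i in coords)] += p
    return sum(p * math.log2(1.0 / p) for p in law.values() if p > 0)

def I(joint, xs, ys):
    """Mutual information I(X:Y) = H(X) + H(Y) - H(X,Y)."""
    return H(joint, xs) + H(joint, ys) - H(joint, xs + ys)

# Outcomes are (X, Y, Z) with X, Y independent fair bits and Z = X + Y in F_2.
joint = {(x, y, (x + y) % 2): 0.25 for x in (0, 1) for y in (0, 1)}

I_XZ = I(joint, (0,), (2,))       # I(X:Z)
I_YZ = I(joint, (1,), (2,))       # I(Y:Z)
I_XY_Z = I(joint, (0, 1), (2,))   # I((X,Y):Z)
print(I_XZ, I_YZ, I_XY_Z)
```

One expects the first two quantities to vanish while the third equals one bit, so the conjectured inequality fails by a full bit.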

From the inclusion-exclusion identities

$\displaystyle |A \cap C| = |A| + |C| - |A \cup C|$

$\displaystyle |B \cap C| = |B| + |C| - |B \cup C|$

$\displaystyle |(A \cup B) \cap C| = |A \cup B| + |C| - |A \cup B \cup C|$

$\displaystyle |A \cap B \cap C| = |A| + |B| + |C| - |A \cup B| - |B \cup C| - |A \cup C|$

$\displaystyle + |A \cup B \cup C|$

one can check that (1) is equivalent to the trivial lower bound ${|A \cap B \cap C| \geq 0}$ . The basic issue here is that in the dictionary between entropy and combinatorics, there is no satisfactory entropy analogue of the notion of a triple intersection ${A \cap B \cap C}$ . (Even the double intersection ${A \cap B}$ only exists information theoretically in a “virtual” sense; the mutual information ${I(X:Y)}$ allows one to “compute the entropy” of this “intersection”, but does not actually describe this intersection itself as a random variable.)

However, this issue only arises with three or more variables; it is not too difficult to show that the only linear equalities and inequalities that are necessarily obeyed by the information-theoretic quantities ${H(X), H(Y), H(X,Y), I(X:Y), H(X|Y), H(Y|X)}$ associated to just two variables ${X,Y}$ are those that are also necessarily obeyed by their combinatorial analogues ${|A|, |B|, |A \cup B|, |A \cap B|, |A \backslash B|, |B \backslash A|}$ . (See for instance the Venn diagram at the Wikipedia page for mutual information for a pictorial summary of this statement.)

One can work with a larger class of special cases of Shannon entropy by working with random linear functions rather than random boolean functions. Namely, let ${S}$ be some finite-dimensional vector space over a finite field ${{\mathbf F}}$ , and let ${f: S \rightarrow {\mathbf F}}$ be a random linear functional on ${S}$ , selected uniformly among all such functions. Every subspace ${U}$ of ${S}$ then gives rise to a random variable ${X = X_U: U \rightarrow {\mathbf F}}$ formed by restricting ${f}$ to ${U}$ . This random variable is also distributed uniformly amongst all linear functions on ${U}$ , and its entropy can be easily computed to be ${\mathrm{dim}(U) \log |\mathbf{F}|}$ . Given two random variables ${X, Y}$ formed by restricting ${f}$ to ${U, V}$ respectively, the joint random variable ${(X,Y)}$ determines the random linear function ${f}$ on the union ${U \cup V}$ of the two spaces, and thus by linearity on the Minkowski sum ${U+V}$ as well; thus ${(X,Y)}$ is equivalent to the restriction of ${f}$ to ${U+V}$ . In particular, ${H(X,Y) = \mathrm{dim}(U+V) \log |\mathbf{F}|}$ . This implies that ${I(X:Y) = \mathrm{dim}(U \cap V) \log |\mathbf{F}|}$ and also ${H(X|Y) = \mathrm{dim}(\pi_V(U)) \log |\mathbf{F}|}$ , where ${\pi_V: S \rightarrow S/V}$ is the quotient map. After discarding the normalising constant ${\log |\mathbf{F}|}$ , this leads to the following dictionary between information theoretic quantities and linear algebra quantities, analogous to the previous dictionary:

Random variables ${X,Y,Z}$	Subspaces ${U,V,W}$
Entropy ${H(X)}$	Dimension ${\mathrm{dim}(U)}$
Joint variable ${(X,Y)}$	Sum ${U+V}$
Mutual information ${I(X:Y)}$	Dimension of intersection ${\mathrm{dim}(U \cap V)}$
Conditional entropy ${H(X|Y)}$	Dimension of projection ${\mathrm{dim}(\pi_V(U))}$
Conditional mutual information ${I(X:Y|Z)}$	${\mathrm{dim}(\pi_W(U) \cap \pi_W(V))}$
${X, Y}$ independent	${U, V}$ transverse ( ${U \cap V = \{0\}}$ )
${X}$ determined by ${Y}$	${U}$ a subspace of ${V}$
${X,Y}$ conditionally independent relative to ${Z}$	${\pi_W(U)}$ , ${\pi_W(V)}$ transverse.
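As with the set model, the first few rows of this dictionary can be confirmed by direct enumeration in a small case. The sketch below is an illustration under assumptions of my choosing: the field is ${\mathbf{F}_2}$ , the ambient space is ${\mathbf{F}_2^3}$ , the subspaces `U`, `V` are hypothetical, and the helper names are not from any library. Entropies are computed from the actual law of the restricted functional, in units of ${\log |\mathbf{F}|}$ .

```python
import itertools
import math
from collections import Counter

N = 3  # hypothetical ambient space S = F_2^3

def add(u, v):
    """Vector addition in F_2^N."""
    return tuple((a + b) % 2 for a, b in zip(u, v))

def span(vectors):
    """The subspace of F_2^N spanned by `vectors`, as a set of tuples."""
    elems = {(0,) * N}
    for v in vectors:
        elems |= {add(e, v) for e in elems}
    return elems

def dim(subspace):
    """Dimension over F_2 of a subspace given as a set: log_2 of its size."""
    return len(subspace).bit_length() - 1

def H(*subspaces):
    """Entropy (units of log 2) of the joint restriction of a uniform random
    linear functional f(v) = c . v to each of the given subspaces."""
    functionals = list(itertools.product((0, 1), repeat=N))
    def restrict(c, U):
        return tuple(sum(ci * vi for ci, vi in zip(c, v)) % 2 for v in sorted(U))
    counts = Counter(tuple(restrict(c, U) for U in subspaces) for c in functionals)
    n = len(functionals)
    return sum((k / n) * math.log2(n / k) for k in counts.values())

U = span([(1, 0, 0), (0, 1, 0)])
V = span([(0, 1, 0), (0, 0, 1)])
U_plus_V = span(list(U) + list(V))

print(H(U), dim(U))                            # entropy vs dimension
print(H(U, V), dim(U_plus_V))                  # joint variable vs sum
print(H(U) + H(V) - H(U, V), dim(U & V))       # mutual information vs intersection
print(H(U, V) - H(V), dim(U_plus_V) - dim(V))  # conditional entropy vs dim of projection
```

The last line uses ${\mathrm{dim}(\pi_V(U)) = \mathrm{dim}(U+V) - \mathrm{dim}(V)}$ , which avoids having to construct the quotient space explicitly.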

The combinatorial dictionary can be regarded as a specialisation of the linear algebra dictionary, by taking ${S}$ to be the vector space ${\mathbf{F}_2^\Omega}$ over the finite field ${\mathbf{F}_2}$ of two elements, and only considering those subspaces ${U}$ that are coordinate subspaces ${U = {\bf F}_2^A}$ associated to various subsets ${A}$ of ${\Omega}$ .

As before, every linear inequality or equality that is valid for the information-theoretic quantities discussed above is automatically valid for the linear algebra counterparts for subspaces of a vector space over a finite field by applying the above specialisation (and dividing out by the normalising factor of ${\log |\mathbf{F}|}$ ). In fact, the requirement that the field be finite can be removed by applying the compactness theorem from logic (or one of its relatives, such as Łoś’s theorem on ultraproducts, as done in this previous blog post).

The linear algebra model captures more of the features of Shannon entropy than the combinatorial model. For instance, in contrast to the combinatorial case, it is possible in the linear algebra setting to have subspaces ${U,V,W}$ such that ${U}$ and ${V}$ are separately transverse to ${W}$ , but their sum ${U+V}$ is not; for instance, in a two-dimensional vector space ${{\bf F}^2}$ , one can take ${U,V,W}$ to be the one-dimensional subspaces spanned by ${(0,1)}$ , ${(1,0)}$ , and ${(1,1)}$ respectively. Note that this is essentially the same counterexample from before (which took ${{\bf F}}$ to be the field of two elements). Indeed, one can show that any necessarily true linear inequality or equality involving the dimensions of three subspaces ${U,V,W}$ (as well as the various other quantities on the above table) will also be necessarily true when applied to the entropies of three discrete random variables ${X,Y,Z}$ (as well as the corresponding quantities on the above table).
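The transversality counterexample does not depend on the field having two elements. The following sketch checks it over ${\mathbf{F}_3}$ by listing the subspaces explicitly; the prime `P = 3` and the helper names `span` and `dim` are assumptions chosen for illustration.

```python
import itertools

P = 3  # hypothetical choice of prime; the field is F_P

def span(vectors, n=2):
    """All F_P-linear combinations of `vectors` in F_P^n, as a set of tuples."""
    combos = itertools.product(range(P), repeat=len(vectors))
    return {
        tuple(sum(c * v[i] for c, v in zip(coeffs, vectors)) % P for i in range(n))
        for coeffs in combos
    }

def dim(subspace):
    """Dimension of a subspace given as a set: log base P of its size."""
    d, size = 0, len(subspace)
    while size > 1:
        size //= P
        d += 1
    return d

U, V, W = span([(0, 1)]), span([(1, 0)]), span([(1, 1)])
U_plus_V = span([(0, 1), (1, 0)])

# U and V are each transverse to W, but U + V is the whole plane and contains W.
print(dim(U & W), dim(V & W), dim(U_plus_V & W))
```

The first two dimensions are ${0}$ and the third is ${1}$ , mirroring the fact that ${X}$ and ${Y}$ are each independent of ${Z}$ while ${(X,Y)}$ determines ${Z}$ .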

However, the linear algebra model does not completely capture the subtleties of Shannon entropy once one works with four or more variables (or subspaces). This was first observed by Ingleton, who established the dimensional inequality

$\displaystyle \mathrm{dim}(U \cap V) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V)) + \mathrm{dim}(\pi_X(U) \cap \pi_X(V)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (2)$

for any subspaces ${U,V,W,X}$ . This is easiest to see when the three terms on the right-hand side vanish; then ${\pi_W(U), \pi_W(V)}$ are transverse, which implies that ${U\cap V \subset W}$ ; similarly ${U \cap V \subset X}$ . But ${W}$ and ${X}$ are transverse, and this clearly implies that ${U}$ and ${V}$ are themselves transverse. To prove the general case of Ingleton’s inequality, one can define ${Y := U \cap V}$ and use ${\mathrm{dim}(\pi_W(Y)) \leq \mathrm{dim}(\pi_W(U) \cap \pi_W(V))}$ (and similarly for ${X}$ instead of ${W}$ ) to reduce to establishing the inequality

$\displaystyle \mathrm{dim}(Y) \leq \mathrm{dim}(\pi_W(Y)) + \mathrm{dim}(\pi_X(Y)) + \mathrm{dim}(W \cap X) \ \ \ \ \ (3)$

which can be rearranged using ${\mathrm{dim}(\pi_W(Y)) = \mathrm{dim}(Y) - \mathrm{dim}(W) + \mathrm{dim}(\pi_Y(W))}$ (and similarly for ${X}$ instead of ${W}$ ) and ${\mathrm{dim}(W \cap X) = \mathrm{dim}(W) + \mathrm{dim}(X) - \mathrm{dim}(W + X)}$ as

$\displaystyle \mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W)) + \mathrm{dim}(\pi_Y(X)) + \mathrm{dim}(Y)$

but this is clear since ${\mathrm{dim}(W + X ) \leq \mathrm{dim}(\pi_Y(W) + \pi_Y(X)) + \mathrm{dim}(Y)}$ .
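Ingleton’s inequality (2) can also be tested numerically, since every term is expressible through dimensions of sums of subspaces: ${\mathrm{dim}(U \cap V) = \mathrm{dim}(U) + \mathrm{dim}(V) - \mathrm{dim}(U+V)}$ and ${\mathrm{dim}(\pi_W(U) \cap \pi_W(V)) = \mathrm{dim}(U+W) + \mathrm{dim}(V+W) - \mathrm{dim}(U+V+W) - \mathrm{dim}(W)}$ . The sketch below does this for random subspaces of ${\mathbf{F}_2^6}$ ; the ambient dimension, seed, number of trials, and the bitmask encoding of vectors are all assumptions made for illustration, and a passing run is of course evidence rather than a proof.

```python
import random

def rank(gens):
    """Rank over F_2 of vectors encoded as integer bitmasks (Gaussian elimination)."""
    pivots = {}  # leading bit position -> reduced vector with that leading bit
    for x in gens:
        while x:
            h = x.bit_length() - 1
            if h not in pivots:
                pivots[h] = x
                break
            x ^= pivots[h]  # clear the leading bit and keep reducing
    return len(pivots)

def d(*subspaces):
    """Dimension of the sum of the given subspaces (each a list of generators)."""
    return rank([g for S in subspaces for g in S])

def ingleton_holds(U, V, W, X):
    dim_U_cap_V = d(U) + d(V) - d(U, V)
    proj_W = d(U, W) + d(V, W) - d(U, V, W) - d(W)  # dim(pi_W(U) cap pi_W(V))
    proj_X = d(U, X) + d(V, X) - d(U, V, X) - d(X)  # dim(pi_X(U) cap pi_X(V))
    dim_W_cap_X = d(W) + d(X) - d(W, X)
    return dim_U_cap_V <= proj_W + proj_X + dim_W_cap_X

random.seed(1)  # arbitrary seed, only for reproducibility
N = 6           # hypothetical ambient dimension

def random_subspace():
    """Generators of a random subspace of F_2^N of dimension at most 4."""
    return [random.randrange(1 << N) for _ in range(random.randint(0, 4))]

trials = [tuple(random_subspace() for _ in range(4)) for _ in range(2000)]
print(all(ingleton_holds(*t) for t in trials))
```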

Returning to the entropy setting, the analogue

$\displaystyle H( V ) \leq H( V | Z ) + H(V | W ) + I(Z:W)$

of (3) is true (exercise!), but the analogue

$\displaystyle I(X:Y) \leq I(X:Y|Z) + I(X:Y|W) + I(Z:W) \ \ \ \ \ (4)$

of Ingleton’s inequality is false in general. Again, this is easiest to see when all the terms on the right-hand side vanish; then ${X,Y}$ are conditionally independent relative to ${Z}$ , and relative to ${W}$ , and ${Z}$ and ${W}$ are independent, and the claim (4) would then be asserting that ${X}$ and ${Y}$ are independent. While there is no linear counterexample to this statement, there are simple non-linear ones: for instance, one can take ${Z,W}$ to be independent uniform variables from ${\mathbf{F}_2}$ , and take ${X}$ and ${Y}$ to be (say) ${ZW}$ and ${(1-Z)(1-W)}$ respectively (thus ${X, Y}$ are the indicators of the events ${Z=W=1}$ and ${Z=W=0}$ respectively). Once one conditions on either ${Z}$ or ${W}$ , one of ${X,Y}$ has positive conditional entropy and the other has zero entropy, and so ${X, Y}$ are conditionally independent relative to either ${Z}$ or ${W}$ ; also, ${Z}$ and ${W}$ are independent of each other. But ${X}$ and ${Y}$ are not independent of each other (they cannot be simultaneously equal to ${1}$ ). Somehow, the feature of the linear algebra model that is not present in general is that in the linear algebra setting, every pair of subspaces ${U, V}$ has a well-defined intersection ${U \cap V}$ that is also a subspace, whereas for arbitrary random variables ${X, Y}$ , there does not necessarily exist the analogue of an intersection, namely a “common information” random variable ${V}$ that has entropy ${I(X:Y)}$ and is determined both by ${X}$ and by ${Y}$ .
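The non-linear counterexample to (4) can be verified in a few lines. In this sketch the helper names `H` and `I` are illustrative assumptions and entropies are measured in bits.

```python
import math
from collections import defaultdict

def H(joint, coords):
    """Entropy in bits of the coordinates `coords` of a joint law."""
    law = defaultdict(float)
    for outcome, p in joint.items():
        law[tuple(outcome[i] for i in coords)] += p
    return sum(p * math.log2(1.0 / p) for p in law.values() if p > 0)

def I(joint, xs, ys, zs=()):
    """I(X:Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z); zs=() gives I(X:Y)."""
    return (H(joint, xs + zs) + H(joint, ys + zs)
            - H(joint, xs + ys + zs) - H(joint, zs))

# Outcomes are (X, Y, Z, W) with Z, W independent fair bits,
# X = ZW and Y = (1-Z)(1-W).
joint = {(z * w, (1 - z) * (1 - w), z, w): 0.25 for z in (0, 1) for w in (0, 1)}
X, Y, Z, W = (0,), (1,), (2,), (3,)

lhs = I(joint, X, Y)                                        # I(X:Y)
rhs = I(joint, X, Y, Z) + I(joint, X, Y, W) + I(joint, Z, W)  # right-hand side of (4)
print(lhs, rhs)
```

The right-hand side of (4) vanishes while ${I(X:Y)}$ is strictly positive (a little over a tenth of a bit), so (4) fails for these variables.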

I do not know if there is any simpler model of Shannon entropy that captures all the inequalities available for four variables. One significant complication is that there exist some information inequalities in this setting that are not of Shannon type, such as the Zhang-Yeung inequality

$\displaystyle I(X:Y) \leq 2 I(X:Y|Z) + I(X:Z|Y) + I(Y:Z|X)$

$\displaystyle + I(X:Y|W) + I(Z:W).$

One can however still use these simpler models of Shannon entropy to be able to guess arguments that would work for general random variables. An example of this comes from my paper on the logarithmically averaged Chowla conjecture, in which I showed among other things that

$\displaystyle |\sum_{n \leq x} \frac{\lambda(n) \lambda(n+1)}{n}| \leq \varepsilon \log x \ \ \ \ \ (5)$

whenever ${x}$ was sufficiently large depending on ${\varepsilon>0}$ , where ${\lambda}$ is the Liouville function. The information-theoretic part of the proof was as follows. Given some intermediate scale ${H}$ between ${1}$ and ${x}$ , one can form certain random variables ${X_H, Y_H}$ . The random variable ${X_H}$ is a sign pattern of the form ${(\lambda(n+1),\dots,\lambda(n+H))}$ where ${n}$ is a random number chosen from ${1}$ to ${x}$ (with logarithmic weighting). The random variable ${Y_H}$ is the tuple ${(n \hbox{ mod } p)_{p \sim \varepsilon^2 H}}$ of reductions of ${n}$ modulo primes ${p}$ comparable to ${\varepsilon^2 H}$ . Roughly speaking, what was implicitly shown in the paper (after using the multiplicativity of ${\lambda}$ , the circle method, and the Matomäki-Radziwiłł theorem on short averages of multiplicative functions) is that if the inequality (5) fails, then there is a lower bound

$\displaystyle I( X_H : Y_H ) \gg \varepsilon^7 \frac{H}{\log H}$

on the mutual information between ${X_H}$ and ${Y_H}$ . From translation invariance, this also gives the more general lower bound

$\displaystyle I( X_{H_0,H} : Y_H ) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (6)$

for any ${H_0}$ , where ${X_{H_0,H}}$ denotes the shifted sign pattern ${(\lambda(n+H_0+1),\dots,\lambda(n+H_0+H))}$ . On the other hand, one had the entropy bounds

$\displaystyle H( X_{H_0,H} ), H(Y_H) \ll H$

and from concatenating sign patterns one could see that ${X_{H_0,H+H'}}$ is equivalent to the joint random variable ${(X_{H_0,H}, X_{H_0+H,H'})}$ for any ${H_0,H,H'}$ . Applying these facts and using an “entropy decrement” argument, I was able to obtain a contradiction once ${H}$ was allowed to become sufficiently large compared to ${\varepsilon}$ , but the bound was quite weak (coming ultimately from the unboundedness of ${\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}$ as the interval ${[H_-,H_+]}$ of values of ${H}$ under consideration becomes large), something of the order of ${H \sim \exp\exp\exp(\varepsilon^{-7})}$ ; the quantity ${H}$ needs at various junctures to be less than a small power of ${\log x}$ , so the relationship between ${x}$ and ${\varepsilon}$ becomes essentially quadruple exponential in nature, ${x \sim \exp\exp\exp\exp(\varepsilon^{-7})}$ . The basic strategy was to observe that the lower bound (6) causes some slowdown in the growth rate ${H(X_{kH})/kH}$ of the mean entropy, in that this quantity decreased by ${\gg \frac{\varepsilon^7}{\log H}}$ as ${k}$ increased from ${1}$ to ${\log H}$ , basically by dividing ${X_{kH}}$ into ${k}$ components ${X_{jH, H}}$ , ${j=0,\dots,k-1}$ , and observing from (6) that each of these shares a bit of common information with the same variable ${Y_H}$ . This is relatively clear when one works in a set model, in which ${Y_H}$ is modeled by a set ${B_H}$ of size ${O(H)}$ , and ${X_{H_0,H}}$ is modeled by a set of the form

$\displaystyle X_{H_0,H} = \bigcup_{H_0 < h \leq H_0+H} A_h$

for various sets ${A_h}$ of size ${O(1)}$ (also there is some translation symmetry that maps ${A_h}$ to a shift ${A_{h+1}}$ while preserving all of the ${B_H}$ ).

However, on considering the set model recently, I realised that one can be a little more efficient by exploiting the fact (basically the Chinese remainder theorem) that the random variables ${Y_H}$ are essentially jointly independent as ${H}$ ranges over dyadic values that are much smaller than ${\log x}$ , which in the set model corresponds to the ${B_H}$ all being disjoint. One can then establish a variant

$\displaystyle I( X_{H_0,H} : Y_H | (Y_{H'})_{H' < H}) \gg \varepsilon^7 \frac{H}{\log H} \ \ \ \ \ (7)$

of (6), which in the set model roughly speaking asserts that each ${B_H}$ claims a portion of the ${\bigcup_{H_0 < h \leq H_0+H} A_h}$ of cardinality ${\gg \varepsilon^7 \frac{H}{\log H}}$ that is not claimed by previous choices of ${B_H}$ . This leads to a more efficient contradiction (relying on the unboundedness of ${\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j}}$ rather than ${\sum_{\log H_- \leq j \leq \log H_+} \frac{1}{j \log j}}$ ) that looks like it removes one order of exponential growth, thus the relationship between ${x}$ and ${\varepsilon}$ is now ${x \sim \exp\exp\exp(\varepsilon^{-7})}$ . Returning to the entropy model, one can use (7) and Shannon inequalities to establish an inequality of the form

$\displaystyle \frac{1}{2H} H(X_{2H} | (Y_{H'})_{H' \leq 2H}) \leq \frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H}) - \frac{c \varepsilon^7}{\log H}$

for a small constant ${c>0}$ , which on iterating and using the boundedness of ${\frac{1}{H} H(X_{H} | (Y_{H'})_{H' \leq H})}$ gives the claim. (A modification of this analysis, at least at the level of a back-of-the-envelope calculation, suggests that the Matomäki-Radziwiłł theorem is needed only for ranges ${H}$ greater than ${\exp( (\log\log x)^{\varepsilon^{7}} )}$ or so, although at this range the theorem is not significantly simpler than the general case.)
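To get a feel for how the iteration produces a contradiction, one can run a toy version of it. Writing ${h_j}$ for the normalised conditional entropy at the dyadic scale ${H = 2^j}$ , the inequality above reads ${h_{j+1} \leq h_j - \frac{c \varepsilon^7}{j \log 2}}$ , while ${0 \leq h_j \leq \log 2}$ because ${X_H}$ consists of ${H}$ signs. The Python sketch below counts how many dyadic scales can pass before ${h_j}$ would be forced below zero. Everything numerical here is an assumption for illustration: the parameter `delta` stands in for ${c \varepsilon^7}$ and is taken vastly larger than any realistic value, and the starting scale `j_start` is arbitrary.

```python
import math

def scales_until_contradiction(delta, j_start=2):
    """Number of dyadic scales H = 2^j that can pass before the normalised
    entropy h_j would be forced below zero.

    `delta` is a hypothetical stand-in for c * eps**7.
    """
    h = math.log(2)  # upper bound on the normalised entropy at the first scale
    j = j_start
    while h >= 0:
        h -= delta / (j * math.log(2))  # decrement c eps^7 / log H with log H = j log 2
        j += 1
    return j - j_start

for delta in (0.4, 0.2, 0.1, 0.05):
    print(delta, scales_until_contradiction(delta))
```

Since the partial sums of ${\sum_j \frac{1}{j}}$ grow logarithmically, the number of scales needed grows roughly exponentially in ${1/\delta}$ , which is consistent with ${\log H}$ being exponential in ${\varepsilon^{-7}}$ in the improved argument.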

29 comments

1 March, 2017 at 12:25 pm

Manny

I think the equation in first sentence is missing a minus.

1 March, 2017 at 12:28 pm

Nikita Sidorov

In the first formula the minus is missing.

1 March, 2017 at 12:29 pm

Nikita Sidorov

Sorry, didn’t notice 1/… It’s all good, then.

1 March, 2017 at 1:06 pm

Ravi Andrew Bajaj

Dear Professor Tao,

Are there applications of this result to random matrix theory?

–RAB

1 March, 2017 at 1:46 pm

Nick Cook

Nice post! Just to note a couple of typos: In the third line of the dictionary for subspaces, either change the random variable to its entropy or the dimension to a subspace. In the paragraph below (5) I think you mean “if the inequality (5) fails…”

[Corrected, thanks – T.]

1 March, 2017 at 3:01 pm

Your dictionary, like other dictionaries, misses the non-trivial case in which $X,Y,Z$ are correlated. How would you give a dictionary between information theory and set theory/subspaces for the case where $X,Y,Z$ are correlated?

1 March, 2017 at 10:16 pm

domotorp

What happens if we swap union and intersection in the set dictionary, or we project ONTO the subspaces in the subspace dictionary? Is it only that some inequalities and signs turn?

Also, I think there are a few more typos:
1, if and if -> if and only if
2, equivalent to the restriction to f -> of f
3, S/U is the quotient map -> S/V

2 March, 2017 at 9:43 am

Terence Tao

Thanks for the corrections!

In the set dictionary, one can replace all the sets $A$ with their complements $\Omega \backslash A$ in the ambient space $\Omega$ . This interchanges unions and intersections by de Morgan’s laws, but also cardinality $|A|$ must now be replaced by cocardinality $|\Omega| - |A|$ . The analogue of conditional entropy $|A \backslash B|$ is now the cocardinality of $A \cap B$ inside $B$ (viewing the latter as the new ambient space, thus conditioning corresponds to restricting the ambient space and all the sets inside it to $B$ ).

Similarly, in the linear algebra dictionary, by replacing all the vector spaces with their complements (viewed as subspaces of the dual space $S^*$ , or if one has a nondegenerate bilinear form on $S$ one can take orthogonal complements with respect to that form and stay inside $S$ ), one can interchange sums and intersections, replacing dimensions with codimensions throughout. Again, the analogue of projection is now restriction (restricting the ambient space $S^*$ and all of its subspaces to a smaller space $V$ ). For instance, the analogue of conditional entropy $\mathrm{dim}(\pi_V(U))$ is now the codimension of $U \cap V$ inside $V$ .

1 March, 2017 at 10:25 pm

Djdkd

Dear Terry, I think you proved something considerably stronger than (5)

2 March, 2017 at 3:53 am

Anonymous

In the second paragraph, the word only is missing in “…vanishes if and if …”

2 March, 2017 at 6:47 am

John Fries

Professor Tao,

Have you read Amari’s paper “Information Geometry on Hierarchy of Probability Distributions”? Instead of mapping information theory to set theory or linear algebra, he maps it to differential geometry. Of particular note is that he is able to discuss triplewise and higher interactions between random variables.

Here is a link: https://pdfs.semanticscholar.org/aa12/aee8cb3fb5ef2c545c0a9efa383326270340.pdf

Sincerely,
John Fries

3 March, 2017 at 5:29 pm

Jeff

Most readers of this blog probably won’t get confused by this, but it’s slightly confusing to call conditional entropy “relative entropy”, which many take as synonymous with the fairly unrelated concept of KL divergence.

[Corrected, thanks – T.]

4 March, 2017 at 6:13 am

Oliver Knill

Nice analogies. Since cardinality is generalized by Euler characteristic for finite abstract simplicial complexes, one could push the analogy even further. The inclusion-exclusion formula becomes the property of a “valuation” on complexes which is the defining relation. Valuation is understood in the sense of Klain and Rota, who developed the discrete analogue of continuum geometric probability theory. Zero dimensional complexes are sets so that this extends the finite set analogy.

Entropy, an important functional (maybe the most important one) in a probabilistic set-up when dealing with random variables X, is then linked to Euler characteristic, an important quantity (maybe the most important one) in topology when dealing with simplicial complexes X.

Euler characteristic has long been known to be the only valuation which is invariant under Barycentric refinement of the complex G, that is, the complex in which the sets of G are the points of the new base and the new sets are the subsets. The reason is that there is a universal linear map A which maps the f-vector of G to the f-vector of the refinement of G. This matrix A has a unique eigenvalue 1, with eigenvector (1,-1,1,-1,…), leading to the Euler characteristic. Euler characteristic is then a Perron-Frobenius eigenvector of the inverse of the Barycentric refinement operator. It is unique if one normalizes it so that its value on a 1-point complex 1 is 1.

There is a similar defining renormalization picture for entropy: Shannon in his 1948 article already established such a condition which defines entropy uniquely (Theorem 2 in his paper). Much earlier, Boltzmann saw entropy as a functional on measures, as he defined it as the expectation of -log(P) (famously displayed on his tombstone), and already on physical grounds wanted entropy to be additive for physical systems which do not interact. Talking about the entropy of a random variable X means the entropy of its law (which essentially corresponds to the Shannon condition). Obviously, entropy H satisfies H(X+X’)/2=H(X) if X,X’ are IID and H(X)=0 if X is a constant random variable. This renormalisation operator X -> X+X’ is the analogue of Barycentric refinement, and the requirement H(X+X’)/2=H(X) defines the entropy functional uniquely if normalized for the constant random variable. (Shannon seems to have a stronger assumption, I(X:Y)=0 for independent X,Y, but this could be bootstrapped from H(X+X’)/2=H(X) for IID X,X’ using polarization from H(X+Y + X’+Y’).) The value H=-log(X) for Euler characteristic X is an analogue of entropy as it is additive with respect to Cartesian products of complexes, but topologists refrain from taking the log as we also want to deal with spaces of Euler characteristic 0 like the circle. A special case is the product G x 1 with one-point complex 1, which is the Barycentric refinement of G. The analogue of H(A + B)=H(A) + H(B) for IID random variables A,B is H(A x B) = H(A) + H(B) for simplicial complexes. The analogue of H(1)=0 (H=entropy, 1=constant random variable) is H(1)=0 (H=-log(Euler characteristic), 1=one-point complex).

There is also an analogous central limit theorem: the operator X -> X+X’ for random variables (X’ is just an IID copy of X, and X+X’ is normalized to have variance 1) obviously leads to the classical central limit theorem; entropy is a Lyapunov function under the operation, and iterating the map is a discrete time gradient type flow leading to a Gaussian distribution, the maximum. Just as a random variable X defines a law P, any simplicial complex X has a “law” P: it is the density of states of its Laplacian, a discrete point measure P on the real line. What happens during Barycentric refinement is that one is close to adding up independent complexes (there is a boundary part which becomes negligible, and controlling this needs a Lidskii-Last inequality). The story is even more intriguing than in probability theory because so far, we only know the limiting measure for one-dimensional complexes. It is the potential theoretical equilibrium measure on an interval, the arcsine law. The limiting universal measures for higher dimensional complexes depend only on the maximal dimension of the complex and are unidentified so far. Their integrated density of states (= CDF) appears discontinuous in dimensions larger than 1.

4 March, 2017 at 5:22 pm

Shannon Entropy and Euler Characteristic - Quantum Calculus

[…] A blog entry mentions some analogies between entropy and other combinatorial notions. One can push the analogy in an other direction and compare random variables with simplicial complexes, Shannon entropy with Euler characteristic. […]

5 March, 2017 at 10:39 am

Furstenberg limits of the Liouville function | What's new

[…] of ) shows that eventually becomes negative for sufficiently large , which is absurd. (See also this previous blog post for a sketch of a slightly different way to conclude the argument from entropy […]

8 March, 2017 at 12:24 pm

Machine Learning Notes/Pointers | SChaiken's Blog

[…] Terry Tao on Shannon Entropy and its analogies to set functions […]

19 March, 2017 at 12:14 pm

Joe Triscari

Interesting article.

It makes me wonder if there are category theoretic ideas to be milked out of these analogies.

The free vector space on a set preserves entropy. Maybe there’s some entropy concept lurking under other categories and there are functors that preserve the entropy…

22 March, 2017 at 1:25 am

David Tse

Nice post. A pedagogical comment. Instead of defining the set theoretic model in terms of a random mapping F, wouldn’t it be more straightforward to define it in terms of a set of i.i.d. Bern(0.5) random variables $X_1, X_2, \ldots X_n$ , where $n = |\Omega|$? Then for every subset $A \subset \Omega$ , one can define a random vector $X_A = (X_i)_{i \in A}$ .

[One could indeed do this, but I preferred to use the random function model as it made the similarity between the set theoretic dictionary and the linear algebra dictionary more apparent. -T.]

24 March, 2017 at 4:58 am

David Tse

Continuing on my previous comment and in response to Terry’s note, I believe the representation of the set theoretic model in terms of basic random variables $X_1, \ldots, X_n$ can be generalized to the linear algebra model by starting again with $X_1, \ldots, X_n$ but now thinking of it as a random vector X. For each linear transformation T, we can define $X_T = TX$ . The set theoretic model corresponds to the random vectors obtained by restricting T to be diagonal with only 1’s and 0’s.

23 March, 2017 at 6:54 pm

Raymond Yeung

Nice post by Terry and others. Here are some supplementary comments.

1. The first rigorous proof of the set-entropy dictionary, to my knowledge, is the 1962 paper by Hu (originally in Russian):

http://epubs.siam.org/doi/pdf/10.1137/1107041

2. The analogy between sets and entropy was enhanced into an equivalence in my 1991 paper. For a fixed finite set of discrete random variables, we have for example

$I(X:Y|Z) = \mu^*(A \cap B - C)$ ,

where $\mu^*$, called the I-Measure, is a unique signed measure defined on a suitably constructed $\sigma$-field. $\mu^*$ is unique in the sense that it is determined by the joint entropies of all the random variables involved. With this equivalence, one can formally apply all the tools in set theory to information theory. In short, in the information theory domain, every identity (not inequality) that one can read off from the Venn diagram is valid and one does not have to go back to prove it in information theory.

3. As pointed out in Terry’s post, the quantity

$I(X:Y:Z) = I(X:Y) - I(X:Y|Z)$

can be negative. Though without an operational meaning, quantities such as $I(X:Y:Z)$ have a clear measure-theoretic meaning and are very handy for manipulating terms.
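A standard example of this negativity (not necessarily the one used in the post) takes $X, Y$ to be independent fair bits and $Z = X \oplus Y$. The sketch below computes the quantities directly from the four equally likely outcomes. The helper names are illustrative assumptions.

```python
import itertools
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (natural log) of a list of equally likely samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())

# X, Y independent fair bits and Z = X xor Y: four equally likely outcomes.
outcomes = [(x, y, x ^ y) for x, y in itertools.product((0, 1), repeat=2)]

def H(*coords):
    """Joint entropy of the chosen coordinates (0 = X, 1 = Y, 2 = Z)."""
    return entropy([tuple(o[c] for c in coords) for o in outcomes])

X, Y, Z = 0, 1, 2
I_XY = H(X) + H(Y) - H(X, Y)
# I(X:Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z), expanded into joint entropies.
I_XY_given_Z = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)
I_XYZ = I_XY - I_XY_given_Z

print(I_XY, I_XY_given_Z, I_XYZ)
```

Here $X$ and $Y$ are independent, yet become fully dependent once $Z$ is known, so $I(X:Y:Z)$ comes out as $-\log 2$.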

The I-Measure $\mu^*$ is a signed measure in general. However, when the random variables form a Markov chain, $\mu^*$ is always nonnegative and hence a measure. There also exists a very simple form of Venn diagram for Markov chains, allowing one to read off simple identities/ inequalities like the data processing inequality and even elaborate identities like

$H(Y) + H(T) = I(Z: X, Y, T, U) + I(X, Y : T, U) + H(Y|Z) + H(T|Z)$

when $X - Y - Z - T - U$ form a Markov chain.
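The Markov chain identity above can be sanity-checked numerically. The following is a sketch under stated assumptions: a binary alphabet, a fixed random seed, and randomly generated transition matrices, all chosen here purely for illustration. It builds the exact joint law of the chain and compares both sides.

```python
import itertools
import math
import random

random.seed(0)
q = 2  # alphabet size for every variable (illustrative assumption)

def random_dist(m):
    """A random probability vector of length m with strictly positive entries."""
    w = [random.random() + 0.1 for _ in range(m)]
    s = sum(w)
    return [v / s for v in w]

# Law of X and four random transition matrices for X -> Y -> Z -> T -> U.
init = random_dist(q)
trans = [[random_dist(q) for _ in range(q)] for _ in range(4)]

# Exact joint probability of each path (x, y, z, t, u).
joint = {}
for path in itertools.product(range(q), repeat=5):
    p = init[path[0]]
    for step in range(4):
        p *= trans[step][path[step]][path[step + 1]]
    joint[path] = p

def H(*coords):
    """Entropy of the marginal on the given coordinates (0 = X, ..., 4 = U)."""
    marg = {}
    for path, p in joint.items():
        key = tuple(path[c] for c in coords)
        marg[key] = marg.get(key, 0.0) + p
    return sum(p * math.log(1 / p) for p in marg.values() if p > 0)

def I(a, b):
    """Mutual information between the coordinate groups a and b."""
    return H(*a) + H(*b) - H(*a, *b)

X, Y, Z, T, U = range(5)
lhs = H(Y) + H(T)
rhs = (I((Z,), (X, Y, T, U)) + I((X, Y), (T, U))
       + (H(Y, Z) - H(Z))    # H(Y|Z)
       + (H(T, Z) - H(Z)))   # H(T|Z)
print(lhs, rhs)
```

The two printed values should agree up to floating-point error. Replacing the chain by a generic joint distribution would typically break the equality, since the identity relies on the Markov property.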

Terry pointed out correctly that one cannot read off entropy inequalities from Venn diagrams for 3 or more random variables. This is precisely because $\mu^*$ is not necessarily a measure. Nevertheless, for identities this is not a problem. For Markov chains, $\mu^*$ is always a measure and so one can read off inequalities from the Venn diagram.

4. Shannon’s information measures exhibit a very beautiful set-theoretic structure for a certain class of conditional independencies called full conditional mutual independencies (FCMIs), which include Markov random fields as a special case. Markov random fields in turn include Markov chains as a special case.

*****

The above can be found in Ch. 3 and Ch. 12 of my 2008 book (soft copy should be available at most university libraries). There are also historical notes at the end of the chapter which may be of interest to some.

5 April, 2017 at 1:14 pm

Szymon Toruńczyk

About the model for four or more variables – below is a proposal, originating from “On a relation between information inequalities and group theory” by Terence H. Chan and Raymond W. Yeung.
Consider a finite group $G$ and a random variable $g$ taking as values elements of $G$ , with uniform distribution. For a subgroup $H$ of $G$ , define the random variable $g\cdot H$ . This random variable is uniformly distributed on the set of cosets $G/H$ , so its entropy is equal to $\log |G/H|$ . For a subset $X$ of a group $G$ , denote $\log \frac{|G|}{|X|}$ by $\textrm{codim}^G(X)$ .
Below is a dictionary between information theoretic quantities and group theoretic quantities:
Random variables $X,Y,Z$ – subgroups $K,H,S$
Entropy $H(X)$ – codimension $\textrm{codim}^G(K)=\log \frac{|G|}{|K|}$
Joint variable $(X,Y)$ – subgroup intersection $K\cap H$
Mutual information $I(X:Y)$ – codimension of product set $\textrm{codim}^G(K\cdot H)$
Conditional entropy $H(X|Y)$ – relative codimension $\textrm{codim}^H(K\cap H)$
Conditional mutual information $I(X:Y|Z)$ – $\textrm{codim}^S((K\cap S)\cdot (H\cap S))$
$X,Y$ independent – $K\cdot H=G$
$X$ determined by $Y$ – $H\subset K$
$X,Y$ conditionally independent relative to $Z$ – $(K\cap S)\cdot (H\cap S)=S$
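Several entries of this dictionary can be verified on a small example. The sketch below assumes $G = S_4$ with $K$ and $H$ the stabilizers of two points. This choice, and the names `Hsub`, `coset`, `codim`, are illustrative and not taken from the comment. It compares entropies of the coset-valued random variables $gK$, $gH$ with the corresponding codimensions.

```python
import itertools
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (natural log) of a list of equally likely samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())

G = list(itertools.permutations(range(4)))  # S_4; g maps i to g[i]

def compose(g, h):
    """Composition (g o h)(i) = g[h[i]]."""
    return tuple(g[h[i]] for i in range(len(h)))

def stabilizer(point):
    """Subgroup of permutations fixing the given point."""
    return [g for g in G if g[point] == point]

def coset(g, subgroup):
    """Left coset g * subgroup, as a hashable set."""
    return frozenset(compose(g, s) for s in subgroup)

def codim(ambient, subset):
    """log(|ambient| / |subset|)."""
    return math.log(len(ambient) / len(subset))

K, Hsub = stabilizer(0), stabilizer(1)
X = [coset(g, K) for g in G]      # random variable gK, g uniform on G
Y = [coset(g, Hsub) for g in G]   # random variable gH

K_cap_H = [g for g in K if g in Hsub]
KH = {compose(k, h) for k in K for h in Hsub}  # product set K.H

H_X, H_Y, H_XY = entropy(X), entropy(Y), entropy(list(zip(X, Y)))
print(H_X, codim(G, K))                   # entropy vs. codimension
print(H_XY, codim(G, K_cap_H))            # joint variable vs. intersection
print(H_X + H_Y - H_XY, codim(G, KH))     # mutual information vs. product set
print(H_XY - H_Y, codim(Hsub, K_cap_H))   # conditional entropy H(X|Y)
```

Each printed pair should match up to rounding. Note that $K \cdot H$ here has 18 elements and is not a subgroup, which is why the dictionary speaks of a product set.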

The linear algebra dictionary can be regarded as a special case of the group dictionary, by taking $G$ to be the additive group $S^*$ of all linear functions from $S$ to the field $\mathbf F$ .
Every linear inequality which holds for an expression involving joint entropies of several random variables, such as the inequality $H(X,Z)+H(Y,Z)\ge H(X,Y,Z)+H(Z)$ (equivalent to $I(X:Y|Z)\ge 0$ ), automatically yields a valid inequality concerning sizes of intersections of subgroups of $G$ , obtained by replacing each entropy of a joint random variable $H(X_1,X_2,\ldots, X_k)$ by $\textrm{codim}^G(S_1\cap S_2\cap \ldots \cap S_k)$ .
In the particular example above we obtain, after exponentiation and canceling out the $|G|$ terms, the inequality $|K\cap S|\cdot |H\cap S|\le |K\cap H\cap S|\cdot |S|$ , which can also be proved directly as follows: $|S|\ge |(K\cap S)\cdot (H\cap S)|=(|K\cap S|\cdot |H\cap S|)/|(K\cap H\cap S)|$ .

Surprisingly, the group theoretic model captures all non-strict linear inequalities among joint entropies of several random variables; this was proved by Chan and Yeung. The proof is actually very simple, and goes roughly as follows.
Suppose that there is a tuple of random variables $X_1,\ldots,X_n$ such that some linear combination $\ell$ of some of their joint entropies is strictly negative. By approximating, we can assume that the joint random variable $X=(X_1,\ldots,X_n)$ has a distribution with rational probabilities. We view $X$ as a table with $n$ columns in which each row $r$ has a rational weight $p_r$ assigned to it; the weights of all the rows add up to $1$ and we assume they have a common denominator $q$ . For a large number $k$ which is a multiple of $q$ , let $T$ be a table with $k$ rows obtained from $X$ by replicating each row $r$ in $k\cdot p_r$ many copies.
Let $G$ be the group of all permutations of the rows of the table $T$ , and for $i=1,\ldots,n$ , let $G_i\subset G$ be the subgroup consisting of those permutations $\pi$ which preserve the value in the $i$ th column, i.e., $\pi(r)[i]=r[i]$ , for each row $r$ of $T$ . The group $G$ is the symmetric group on $k$ elements, and $G_i$ is a product of symmetric groups, whose size can be easily described by the distribution of the variable $X_i$ . Applying Stirling’s formula we get that as $k$ goes to $\infty$ , the value $\log(|G|/|G_i|)$ is asymptotically equivalent to $k\cdot H(X_i)$ and, similarly, $\log(|G|/|G_{i_1}\cap\cdots\cap G_{i_m}|)$ is asymptotically equivalent to $k\cdot H(X_{i_1},\ldots,X_{i_m})$ . In particular, for sufficiently large $k$ , we obtain a group $G$ and its subgroups $G_1,\ldots,G_n$ such that the linear combination of codimensions of group intersections corresponding to $\ell$ is strictly negative.
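The Stirling step can be illustrated numerically. In this sketch the distribution `p` of a single column $X_i$ is a hypothetical example. $G_i$ is a product of symmetric groups with one factor for each value of $X_i$, so that $\log(|G|/|G_i|) = \log\bigl(k! / \prod_a (k p_a)!\bigr)$, which is evaluated with `math.lgamma` to avoid huge factorials.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return sum(x * math.log(1 / x) for x in p if x > 0)

def log_index(counts):
    """log(k! / prod_a c_a!) with k = sum(counts), i.e. log(|G| / |G_i|)."""
    k = sum(counts)
    return math.lgamma(k + 1) - sum(math.lgamma(c + 1) for c in counts)

p = (0.5, 0.25, 0.25)  # hypothetical law of X_i, common denominator q = 4
for k in (4, 400, 40000, 4000000):   # multiples of q
    counts = [round(k * x) for x in p]  # number of rows carrying each value
    print(k, log_index(counts) / k, entropy(p))
```

The middle column should approach the entropy in the last column as $k$ grows, with an error of order $(\log k)/k$.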

12 April, 2017 at 7:57 am

Shang Shan

Prof Tao, have you ever heard arguments that the entropy should actually have been defined as $- \sum p ( \ln p - 1 )$ ? I have read Shannon’s original paper, and he speaks generally, that it should be a logarithmic function. So there is no reason why he would not have defined it this way, by shifting the zero. In this way, one could accommodate the additional constraint that $\sum p = 1$ with a Lagrange multiplier of -1.

13 April, 2017 at 2:41 pm

Terence Tao

Of course, one is free to define one’s terms however one pleases; however, if one shifts the entropy, then the Shannon entropy axioms (listed on pages 392-393 of Shannon’s original paper) will be violated (in particular, axiom 3 will fail). Also, the form of many basic properties of Shannon entropy, e.g. the entropy inequalities such as $H(X,Y) \leq H(X) + H(Y)$ , will become a bit more complicated, and the intuitive interpretation of entropy as a measure of information becomes less clear (since now an empty piece of information will have positive entropy rather than zero entropy). All the dictionaries mentioned in this post also become a little bit more complicated. So there does not seem to be much to gain by shifting the entropy in this fashion.
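To make the effect of the shift explicit, here is a short computation. The notation $\tilde H$ for the shifted quantity is introduced here only for illustration. Writing $p_x := \mathbf{P}(X=x)$ and using $\sum_x p_x = 1$ ,

$\displaystyle \tilde H(X) := -\sum_x p_x ( \ln p_x - 1 ) = H(X) + \sum_x p_x = H(X) + 1,$

so a deterministic random variable has $\tilde H = 1$ rather than $0$ , and the subadditivity inequality $H(X,Y) \leq H(X) + H(Y)$ takes the less symmetric form

$\displaystyle \tilde H(X,Y) \leq \tilde H(X) + \tilde H(Y) - 1.$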

17 April, 2017 at 1:58 am

Shang Shan

Thank you Prof Tao.

20 July, 2020 at 4:31 pm

The sunflower lemma via Shannon entropy | What's new

[…] this previous blog post for some intuitive analogies to understand Shannon […]

19 May, 2021 at 3:04 pm

Entropy Estimation via Two Chains: Streamlining the Proof of the Sunflower Lemma – Theory Dish

[…] of Shannon entropy are also discussed. (We also highly recommend Tao’s other blog posts about Shannon entropy and the entropy compression […]

7 November, 2021 at 7:53 pm

Venn and Euler type diagrams for vector spaces and abelian groups | What's new

[…] of a dimension in a vector space is analogous in many ways to that of cardinality of a set; see this previous post for an instance of this analogy (in the context of Shannon […]

4 September, 2022 at 1:24 pm

Anonymous

In the RHS of (5), I think you want $\epsilon \log(x)$ instead of $\epsilon x$

[Corrected, thanks -T.]

27 March, 2024 at 9:41 am

Anonymous

Does quantum entropy have a place in these discussions?
