A family {A_1,\dots,A_r} of sets for some {r \geq 1} is a sunflower if there is a core set {A_0} contained in each of the {A_i} such that the petal sets {A_i \backslash A_0, i=1,\dots,r} are disjoint. If {k,r \geq 1}, let {\mathrm{Sun}(k,r)} denote the smallest natural number with the property that any family of {\mathrm{Sun}(k,r)} distinct sets of cardinality at most {k} contains {r} distinct elements {A_1,\dots,A_r} that form a sunflower. The celebrated Erdös-Rado theorem asserts that {\mathrm{Sun}(k,r)} is finite; in fact Erdös and Rado gave the bounds

\displaystyle  (r-1)^k \leq \mathrm{Sun}(k,r) \leq (r-1)^k k! + 1. \ \ \ \ \ (1)
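
As a quick illustration of the definition (not part of the argument), here is a minimal Python check; it uses the equivalent formulation that a family is a sunflower precisely when every pairwise intersection equals the common intersection of the family, with that common intersection serving as the core.

```python
from itertools import combinations

def is_sunflower(family):
    # A family is a sunflower iff, taking the core to be the common
    # intersection, the petals A_i \ core are pairwise disjoint; equivalently,
    # every pairwise intersection equals the common intersection.
    family = [frozenset(A) for A in family]
    if len(family) <= 1:
        return True  # a single set is trivially a sunflower
    core = frozenset.intersection(*family)
    return all(A & B == core for A, B in combinations(family, 2))

# Any two sets form a sunflower; three sets need not.
assert is_sunflower([{1, 2}, {2, 3}])
assert is_sunflower([{1, 2, 3}, {1, 4, 5}, {1, 6, 7}])   # core {1}, disjoint petals
assert not is_sunflower([{1, 2}, {2, 3}, {3, 1}])        # pairwise intersections differ
```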

The sunflower conjecture asserts in fact that the upper bound can be improved to {\mathrm{Sun}(k,r) \leq O(1)^k r^k}. This remains open at present despite much effort (including a Polymath project); after a long series of improvements to the upper bound, the best general bound known currently is

\displaystyle  \mathrm{Sun}(k,r) \leq O( r \log(kr) )^k \ \ \ \ \ (2)

for all {k,r \geq 2}, established in 2019 by Rao (building upon a breakthrough obtained a month earlier by Alweiss, Lovett, Wu, and Zhang). Here we exclude the easy cases {k=1} and {r=1} in order to make the logarithmic factor {\log(kr)} a little cleaner.

Rao’s argument used the Shannon noiseless coding theorem. It turns out that the argument can be arranged in the very slightly different language of Shannon entropy, and I would like to present it here. The argument proceeds by locating the core and petals of the sunflower separately (this strategy is also followed in Alweiss-Lovett-Wu-Zhang). In both cases the following definition will be key. In this post all random variables, such as random sets, will be understood to be discrete random variables taking values in a finite range. We always use boldface symbols to denote random variables, and non-boldface for deterministic quantities.

Definition 1 (Spread set) Let {R > 1}. A random set {{\bf A}} is said to be {R}-spread if one has

\displaystyle  {\mathbb P}( S \subset {\bf A}) \leq R^{-|S|}

for all sets {S}. A family {(A_i)_{i \in I}} of sets is said to be {R}-spread if {I} is non-empty and the random variable {A_{\bf i}} is {R}-spread, where {{\bf i}} is drawn uniformly from {I}.
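
This definition is easy to test by brute force for small families; the following Python sketch (not from the post) computes the largest {R} for which a given finite family is {R}-spread. Only non-empty subsets of members of the family need to be examined, since any other set is contained in no {A_i}.

```python
from itertools import combinations

def spread(family):
    # Largest R such that the family (A_i) is R-spread, i.e.
    # Pr_i(S ⊆ A_i) <= R^{-|S|} for every non-empty set S.
    family = [frozenset(A) for A in family]
    n = len(family)
    best = float("inf")
    seen = set()
    for A in family:
        for size in range(1, len(A) + 1):
            for combo in combinations(sorted(A), size):
                S = frozenset(combo)
                if S in seen:
                    continue
                seen.add(S)
                p = sum(S <= B for B in family) / n
                best = min(best, p ** (-1.0 / len(S)))
    return best

# All 2-element subsets of {0,...,9}: a given point lies in 9 of the 45 pairs
# (probability 1/5), while each pair itself occurs once (probability 1/45,
# giving the weaker constraint R <= 45**0.5 ≈ 6.7), so the family is 5-spread.
pairs = [set(p) for p in combinations(range(10), 2)]
print(spread(pairs))   # 5.0 (up to floating point)
```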

The core can then be selected greedily in such a way that the remainder of a family becomes spread:

Lemma 2 (Locating the core) Let {(A_i)_{i \in I}} be a family of subsets of a finite set {X}, each of cardinality at most {k}, and let {R > 1}. Then there exists a “core” set {S_0} of cardinality at most {k} such that the set

\displaystyle  J := \{ i \in I: S_0 \subset A_i \} \ \ \ \ \ (3)

has cardinality at least {R^{-|S_0|} |I|}, and such that the family {(A_j \backslash S_0)_{j \in J}} is {R}-spread. Furthermore, if {|I| > R^k} and the {A_i} are distinct, then {|S_0| < k}.

Proof: We may assume {I} is non-empty, as the claim is trivial otherwise. For any {S \subset X}, define the quantity

\displaystyle  Q(S) := R^{|S|} |\{ i \in I: S \subset A_i\}|,

and let {S_0} be a subset of {X} that maximizes {Q(S)}. Since {Q(\emptyset) = |I| > 0} and {Q(S)=0} when {|S| >k}, we see that {0 \leq |S_0| \leq k}. If the {A_i} are distinct and {|I| > R^k}, then a set {S} of cardinality {k} is contained in at most one of the {A_i} (namely {A_i = S}), so we also have {Q(S) \leq R^k < |I| = Q(\emptyset)} when {|S|=k}, thus in this case we have {|S_0| < k}.

Let {J} be the set (3). Since {Q(S_0) \geq Q(\emptyset)>0}, {J} is non-empty. It remains to check that the family {(A_j \backslash S_0)_{j \in J}} is {R}-spread. But for any {S \subset X} and {{\bf j}} drawn uniformly at random from {J} one has

\displaystyle  {\mathbb P}( S \subset A_{\bf j} \backslash S_0 ) = \frac{|\{ i \in I: S_0 \cup S \subset A_i\}|}{|\{ i \in I: S_0 \subset A_i\}|} = R^{|S_0|-|S_0 \cup S|} \frac{Q(S_0 \cup S)}{Q(S_0)}.

Observe that {Q(S_0 \cup S) \leq Q(S_0)} by the maximality of {S_0}, and the probability is only non-zero when {S, S_0} are disjoint, so that {|S_0|-|S_0 \cup S| = - |S|}. The claim follows. \Box
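
As a concrete illustration of this greedy selection (not needed for the argument), here is a small Python sketch; the enumeration is exponential in {k}, so it is only suitable for toy inputs, and the example family at the end is made up purely for illustration.

```python
from itertools import combinations

def locate_core(family, R):
    # Greedy core selection following the proof of Lemma 2: pick S0 maximizing
    # Q(S) = R^{|S|} * #{i : S ⊆ A_i}.  The maximizer is the empty set or a
    # subset of one of the A_i (any other S has Q(S) = 0), so those are the
    # only candidates that need to be enumerated.
    family = [frozenset(A) for A in family]
    candidates = {frozenset()}
    for A in family:
        for size in range(1, len(A) + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(A), size))
    Q = lambda S: R ** len(S) * sum(S <= A for A in family)
    S0 = max(candidates, key=Q)
    J = [i for i, A in enumerate(family) if S0 <= A]
    return S0, J

family = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 6, 7}, {8, 9, 10}]
S0, J = locate_core(family, R=2.0)
print(sorted(S0), J)   # [1, 2] [0, 1, 2]: Q({1,2}) = 4*3 = 12 beats Q(∅) = 5 and Q({1}) = 8
```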

In view of the above lemma, the bound (2) will then follow from

Proposition 3 (Locating the petals) Let {r, k \geq 2} be natural numbers, and suppose that {R \geq C r \log(kr)} for a sufficiently large constant {C}. Let {(A_i)_{i \in I}} be a finite family of subsets of a finite set {X}, each of cardinality at most {k}, which is {R}-spread. Then there exist {i_1,\dots,i_r \in I} such that {A_{i_1},\dots,A_{i_r}} are disjoint.

Indeed, to prove (2), we assume that {(A_i)_{i \in I}} is a family of more than {R^k} distinct sets of cardinality at most {k}, for some {R \geq Cr \log(kr)}; by discarding redundant elements and sets we may assume that {I} is finite and that all the {A_i} are contained in a common finite set {X}. Apply Lemma 2 to find a set {S_0 \subset X} of cardinality {|S_0| < k} such that the family {(A_j \backslash S_0)_{j \in J}} is {R}-spread. Since the {A_j} are distinct, at most one index {j \in J} can have {A_j = S_0}; discarding that index from {J} if necessary (which at worst degrades the spread constant from {R} to {R/2}, a loss that can be absorbed by enlarging {C}), we may assume that the sets {A_j \backslash S_0}, {j \in J}, are all non-empty. By Proposition 3 we can find {j_1,\dots,j_r \in J} such that {A_{j_1} \backslash S_0,\dots,A_{j_r} \backslash S_0} are disjoint; since these sets are non-empty, this implies that the {j_1,\dots,j_r} are distinct. Hence {A_{j_1},\dots,A_{j_r}} form a sunflower as required.

Remark 4 Proposition 3 is easy to prove if we strengthen the condition on {R} to {R > k(r-1)}. In this case, we have {\mathop{\bf P}_{i \in I}( x \in A_i) < 1/(k(r-1))} for every {x \in X}, hence by the union bound we see that for any {i_1,\dots,i_j \in I} with {j \leq r-1} there exists {i_{j+1} \in I} such that {A_{i_{j+1}}} is disjoint from the set {A_{i_1} \cup \dots \cup A_{i_j}}, which has cardinality at most {k(r-1)}. Iterating this, we obtain the conclusion of Proposition 3 in this case. This recovers a bound of the form {\mathrm{Sun}(k,r) \leq (k(r-1))^k+1}, and by pursuing this idea a little further one can recover the original upper bound (1) of Erdös and Rado.
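
The greedy iteration of Remark 4 is easy to run directly; here is a minimal Python sketch (the family of all {2}-element subsets of a {10}-element set, which is {5}-spread, is used only as a toy input).

```python
from itertools import combinations

def greedy_disjoint(family, r):
    # Greedy argument of Remark 4: repeatedly pick a set disjoint from the
    # union of the sets chosen so far.  When the family is R-spread with
    # R > k(r-1) the union bound guarantees this never gets stuck; here we
    # simply report failure if it does.
    family = [frozenset(A) for A in family]
    chosen, used = [], frozenset()
    for _ in range(r):
        for A in family:
            if not (A & used):
                chosen.append(A)
                used |= A
                break
        else:
            return None   # stuck: the family was not spread enough
    return chosen

pairs = [frozenset(p) for p in combinations(range(10), 2)]   # 5-spread, k = 2
print(greedy_disjoint(pairs, r=3))   # e.g. {0,1}, {2,3}, {4,5}, since 5 > k(r-1) = 4
```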

It remains to prove Proposition 3. In fact we can locate the petals one at a time, placing each petal inside a random set.

Proposition 5 (Locating a single petal) Let the notation and hypotheses be as in Proposition 3. Let {{\bf V}} be a random subset of {X}, such that each {x \in X} lies in {{\bf V}} with an independent probability of {1/r}. Then with probability greater than {1-1/r}, {{\bf V}} contains one of the {A_i}.

To see that Proposition 5 implies Proposition 3, we randomly partition {X} into {{\bf V}_1 \cup \dots \cup {\bf V}_r} by placing each {x \in X} into one of the {{\bf V}_j}, {j=1,\dots,r} chosen uniformly and independently at random. By Proposition 5 and the union bound, we see that with positive probability, it is simultaneously true for all {j=1,\dots,r} that each {{\bf V}_j} contains one of the {A_i}. Selecting one such {A_i} for each {{\bf V}_j}, we obtain the required disjoint petals.
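
This reduction can also be illustrated in code. The following Monte Carlo sketch (not part of the proof; the family of all {3}-element subsets of a {30}-element set is just a convenient, very spread, toy example) colours the ground set with {r} colours and looks for a member of the family inside each colour class.

```python
import random
from itertools import combinations

def petals_by_random_partition(family, r, X, trials=100):
    # Reduction of Proposition 3 to Proposition 5: colour each element of X
    # with one of r colours uniformly at random and look for a member of the
    # family inside each colour class; sets found in different classes are
    # automatically disjoint.
    family = [frozenset(A) for A in family]
    for _ in range(trials):
        colour = {x: random.randrange(r) for x in X}
        chosen = []
        for j in range(r):
            Vj = {x for x in X if colour[x] == j}
            inside = [A for A in family if A <= Vj]
            if not inside:
                break
            chosen.append(random.choice(inside))
        else:
            return chosen
    return None

X = range(30)
triples = [frozenset(t) for t in combinations(X, 3)]   # a very spread family
print(petals_by_random_partition(triples, r=3, X=X))
```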

We will prove Proposition 5 by gradually increasing the density of the random set and arranging the sets {A_i} to get quickly absorbed by this random set. The key iteration step is

Proposition 6 (Refinement inequality) Let {R > 1} and {0 < \delta < 1}. Let {{\bf A}} be a random subset of a finite set {X} which is {R}-spread, and let {{\bf V}} be a random subset of {X} independent of {{\bf A}}, such that each {x \in X} lies in {{\bf V}} with an independent probability of {\delta}. Then there exists another {R}-spread random subset {{\bf A}'} of {X} whose support is contained in the support of {{\bf A}}, such that {{\bf A}' \backslash {\bf V} \subset {\bf A}} and

\displaystyle  {\mathbb E} |{\bf A}' \backslash {\bf V}| \leq \frac{5}{\log(R\delta)} {\mathbb E} |{\bf A}|.

Note that a direct application of the first moment method gives only the bound

\displaystyle  {\mathbb E} |{\bf A} \backslash {\bf V}| \leq (1-\delta) {\mathbb E} |{\bf A}|,

but the point is that by switching from {{\bf A}} to an equivalent {{\bf A}'} we can replace the {1-\delta} factor by a quantity significantly smaller than {1}.

One can iterate the above proposition, repeatedly replacing {{\bf A}, X} with {{\bf A}' \backslash {\bf V}, X \backslash {\bf V}} (noting that {{\bf A}' \backslash {\bf V}} is still {R}-spread, since {{\mathbb P}( S \subset {\bf A}' \backslash {\bf V}) \leq {\mathbb P}( S \subset {\bf A}') \leq R^{-|S|}} for every {S}) to conclude

Corollary 7 (Iterated refinement inequality) Let {R > 1}, {0 < \delta < 1}, and {m \geq 1}. Let {{\bf A}} be a random subset of a finite set {X} which is {R}-spread, and let {{\bf V}} be a random subset of {X} independent of {{\bf A}}, such that each {x \in X} lies in {{\bf V}} with an independent probability of {1-(1-\delta)^m}. Then there exists another random subset {{\bf A}'} of {X} with support contained in the support of {{\bf A}}, such that

\displaystyle  {\mathbb E} |{\bf A}' \backslash {\bf V}| \leq (\frac{5}{\log(R\delta)})^m {\mathbb E} |{\bf A}|.

Now we can prove Proposition 5. Let {m \geq 1} be a natural number to be chosen shortly. Applying Corollary 7 with {{\bf A}} drawn uniformly at random from the {(A_i)_{i \in I}}, setting {1-(1-\delta)^m = 1/r}, or equivalently {\delta = 1 - (1 - 1/r)^{1/m}}, and using {{\mathbb E} |{\bf A}| \leq k}, we have

\displaystyle  {\mathbb E} |{\bf A}' \backslash {\bf V}| \leq (\frac{5}{\log(R\delta)})^m k.

In particular, if we set {m = \lceil \log kr \rceil}, so that {\delta \sim \frac{1}{r \log kr}}, then by choice of {R} we have {\frac{5}{\log(R\delta)} < \frac{1}{2}}, hence

\displaystyle  {\mathbb E} |{\bf A}' \backslash {\bf V}| < \frac{1}{r}.

In particular, by Markov’s inequality, we have {|{\bf A}' \backslash {\bf V}| = 0} with probability greater than {1 - \frac{1}{r}}; since {{\bf A}'} is supported in the family {(A_i)_{i \in I}}, this means that some {A_i} is contained in {{\bf V}}, giving the proposition.
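
The numerology here is easy to confirm numerically. In the following sketch the constant {C} is set to {2048}, which is purely an illustrative choice (no value is specified in the argument); logarithms are base two, as in the rest of the post.

```python
import math

def petal_bound(k, r, C=2048):
    # With R = C r log2(kr), m = ceil(log2(kr)) and delta chosen so that
    # 1-(1-delta)^m = 1/r, the factor 5/log2(R*delta) drops below 1/2 and the
    # resulting bound on E|A' \ V| is below 1/r.  (C = 2048 is an illustrative
    # choice of the unspecified constant.)
    R = C * r * math.log2(k * r)
    m = math.ceil(math.log2(k * r))
    delta = 1 - (1 - 1 / r) ** (1 / m)
    ratio = 5 / math.log2(R * delta)
    return ratio, ratio ** m * k

for k, r in [(2, 2), (10, 10), (100, 5), (3, 1000)]:
    ratio, bound = petal_bound(k, r)
    print((k, r), ratio < 0.5, bound < 1 / r)   # True, True in each case
```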

It remains to establish Proposition 6. This is the difficult step, and requires a clever way to find the variant {{\bf A}'} of {{\bf A}} that has better containment properties in {{\bf V}} than {{\bf A}} does. The main trick is to make a conditional copy {({\bf A}', {\bf V}')} of {({\bf A}, {\bf V})} that is conditionally independent of {({\bf A}, {\bf V})} subject to the constraint {{\bf A} \cup {\bf V} = {\bf A}' \cup {\bf V}'}. The point here is that this constraint implies the inclusions

\displaystyle  {\bf A}' \backslash {\bf V} \subset {\bf A} \cap {\bf A}' \subset {\bf A} \ \ \ \ \ (4)

and

\displaystyle  {\bf A}' \backslash {\bf A} \subset {\bf V}. \ \ \ \ \ (5)

Because of the {R}-spread hypothesis, it is hard for {{\bf A}} to contain any fixed large set. If we could apply this observation in the contrapositive to {{\bf A} \cap {\bf A}'} we could hope to get a good upper bound on the size of {{\bf A} \cap {\bf A}'} and hence on {{\bf A}' \backslash {\bf V}} thanks to (4). One can also hope to improve such an upper bound by also employing (5), since it is also hard for the random set {{\bf V}} to contain a fixed large set. There are however difficulties with implementing this approach due to the fact that the random sets {{\bf A} \cap {\bf A}', {\bf A}' \backslash {\bf A}} are coupled with {{\bf A}, {\bf V}} in a moderately complicated fashion. In Rao’s argument a somewhat complicated encoding scheme was created to give information-theoretic control on these random variables; below the fold we accomplish a similar effect by using Shannon entropy inequalities in place of explicit encoding. A certain amount of information-theoretic sleight of hand is required to decouple certain random variables to the extent that the Shannon inequalities can be effectively applied. The argument bears some resemblance to the “entropy compression method” discussed in this previous blog post; there may be a way to more explicitly express the argument below in terms of that method. (There is also some kinship with the method of dependent random choice, which is used for instance to establish the Balog-Szemerédi-Gowers lemma, and was also translated into information theoretic language in these unpublished notes of Van Vu and myself.)

— 1. Shannon entropy —

In this section we lay out all the tools from the theory of Shannon entropy that we will need.

Define an empirical sequence for a random variable {{\bf X}} taking values in a discrete set {S} to be a sequence {X_1,X_2,\dots} in {S} such that the empirical samples of this sequence converge in distribution to {{\bf X}} in the sense that

\displaystyle  {\mathbb P}( X_{\bf n} = X ) = {\mathbb P}( {\bf X} = X ) + o(1) \ \ \ \ \ (6)

as {N \rightarrow \infty}, where {{\bf n} = {\bf n}_N} is a uniform random variable drawn from {\{1,\dots,N\}}, and {o(1)} denotes a quantity that goes to zero as {N \rightarrow \infty} (uniformly over other parameters independent of {N}). Every random variable {{\bf X}} has at least one empirical sequence; indeed, from the strong law of large numbers one can find such a sequence almost surely by taking an infinite number of independent trials of {{\bf X}}.

If {{\bf X}} is a random variable taking values in some set {S}, its Shannon entropy {{\mathbb H}({\bf X})} is defined by the formula

\displaystyle  {\mathbb H}({\bf X}) := \sum_{X \in S} {\mathbb P}( {\bf X} = X ) \log \frac{1}{{\mathbb P}({\bf X} = X)}

with the convention {0 \log \frac{1}{0} = 0}, and using the base two logarithm throughout. We can condition the Shannon entropy to events {E} of positive probability,

\displaystyle  {\mathbb H}({\bf X}|E) := \sum_{X \in S} {\mathbb P}( {\bf X} = X |E) \log \frac{1}{{\mathbb P}({\bf X} = X|E)},

and condition the Shannon entropy to other random variables {{\bf Y}} taking values in a set {T} by

\displaystyle  {\mathbb H}({\bf X}|{\bf Y}) := \sum_{Y \in T} {\mathbb P}( {\bf Y} = Y) {\mathbb H}( {\bf X} | {\bf Y} = Y)

with the convention that the summand is zero if {{\mathbb P}({\bf Y} = Y) = 0}.

We record the following standard and easily verified facts:

Lemma 8 (Basic Shannon inequalities) Let {{\bf X}, {\bf Y}, {\bf Z}} be random variables.
  • (i) (Monotonicity) If {{\bf X}} is a deterministic function {{\bf X} = f({\bf Y})} of {{\bf Y}}, then {0 \leq {\mathbb H}({\bf X}) \leq {\mathbb H}({\bf Y})}. More generally, if {{\bf X}} is a deterministic function {{\bf X} = f({\bf Y}, {\bf Z})} of {{\bf Y}} and {{\bf Z}}, then {0 \leq {\mathbb H}({\bf X}|{\bf Z}) \leq {\mathbb H}({\bf Y}|{\bf Z})}. If {{\bf Y}} is a deterministic function of {{\bf Z}}, then {{\mathbb H}({\bf X}|{\bf Z}) \leq {\mathbb H}({\bf X}|{\bf Y})}.
  • (ii) (Subadditivity) One has {{\mathbb H}({\bf X}, {\bf Y}) \leq {\mathbb H}({\bf X})+{\mathbb H}({\bf Y})}, with equality iff {{\bf X}}, {{\bf Y}} are independent. More generally, one has {{\mathbb H}({\bf X}, {\bf Y}|{\bf Z}) \leq {\mathbb H}({\bf X}|{\bf Z})+{\mathbb H}({\bf Y}|{\bf Z})}, with equality iff {{\bf X}}, {{\bf Y}} are conditionally independent with respect to {{\bf Z}}.
  • (iii) (Chain rule) One has {{\mathbb H}({\bf X}, {\bf Y}) = {\mathbb H}({\bf X}) + {\mathbb H}({\bf Y}|{\bf X}) = {\mathbb H}({\bf Y}) + {\mathbb H}({\bf X}|{\bf Y})}. More generally {{\mathbb H}({\bf X}, {\bf Y}|{\bf Z}) = {\mathbb H}({\bf X}|{\bf Z}) + {\mathbb H}({\bf Y}|{\bf X},{\bf Z}) = {\mathbb H}({\bf Y}|{\bf Z}) + {\mathbb H}({\bf X}|{\bf Y},{\bf Z})}. In particular {0 \leq {\mathbb H}({\bf X}|{\bf Y}) \leq {\mathbb H}({\bf X})}, and {{\mathbb H}({\bf X}|{\bf Y}) = {\mathbb H}({\bf X})} iff {{\bf X},{\bf Y}} are independent; similarly, {0 \leq {\mathbb H}({\bf X}|{\bf Y},{\bf Z}) \leq {\mathbb H}({\bf X}|{\bf Z})}, and {{\mathbb H}({\bf X}|{\bf Y},{\bf Z}) = {\mathbb H}({\bf X}|{\bf Z})} iff {{\bf X},{\bf Y}} are conditionally independent with respect to {{\bf Z}}.
  • (iv) (Jensen’s inequality) If {{\bf X}} takes values in a finite set {S} then {{\mathbb H}({\bf X}) \leq \log |S|}, with equality iff {{\bf X}} is uniformly distributed in {S}. More generally, if {{\bf X}} takes values in a set {S_{\bf Y}} that depends on {{\bf Y}}, then {{\mathbb H}({\bf X}|{\bf Y}) \leq {\mathbb E} \log |S_{\bf Y}|}, with equality iff {{\bf X}} is uniformly distributed in {S_{\bf Y}} after conditioning on {{\bf Y}}.
  • (v) (Gibbs inequality) If {{\bf X}, {\bf Y}} take values in the same finite set {S}, then

    \displaystyle  {\mathbb H}({\bf X}) \leq \sum_{X \in S} {\mathbb P}({\bf X} = X) \log \frac{1}{{\mathbb P}({\bf Y}=X)}

    (we permit the right-hand side to be infinite, which makes the inequality vacuously true).

See this previous blog post for some intuitive analogies to understand Shannon entropy.
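
For readers who like to experiment, these quantities are easy to compute from a finite joint distribution; the following Python sketch (purely illustrative) checks a few of the inequalities of Lemma 8 on a randomly generated joint law of three bits.

```python
import math, random
from collections import defaultdict

def H(joint, coords):
    # Shannon entropy (base 2) of the marginal of `joint` on the given
    # coordinates; `joint` maps tuples of outcomes to probabilities.
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[c] for c in coords)] += p
    return sum(p * math.log2(1 / p) for p in marg.values() if p > 0)

def Hc(joint, coords, given):
    # Conditional entropy H(coords | given), computed from the definition as
    # the average over the conditioning variable of the conditioned entropies.
    py, pxy = defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        x = tuple(outcome[c] for c in coords)
        y = tuple(outcome[c] for c in given)
        py[y] += p
        pxy[(x, y)] += p
    return sum(p * math.log2(py[y] / p) for (x, y), p in pxy.items() if p > 0)

# A random joint distribution of (X, Y, Z) on {0,1}^3.
outcomes = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
w = [random.random() for _ in outcomes]
total = sum(w)
joint = {o: wi / total for o, wi in zip(outcomes, w)}
X, Y, Z = [0], [1], [2]

assert H(joint, X + Y) <= H(joint, X) + H(joint, Y) + 1e-9           # subadditivity (ii)
assert abs(H(joint, X + Y) - H(joint, X) - Hc(joint, Y, X)) < 1e-9   # chain rule (iii)
assert Hc(joint, X, Y + Z) <= Hc(joint, X, Z) + 1e-9                 # conditioning reduces entropy
```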

Now we establish some inequalities of relevance to random sets.

We first observe that any small random set largely determines any of its subsets. Define a random subset of a random set {{\bf B}} to be a random set {{\bf A}} such that {{\bf A} \subset {\bf B}} holds almost surely.

Lemma 9 (Subsets of small sets have small conditional entropy) Let {{\bf B}} be a random finite set.
  • (i) One has {{\mathbb H}({\bf A}|{\bf B}) \leq {\mathbb E} |{\bf B}|} for any random subset {{\bf A}} of {{\bf B}}.
  • (ii) One has {{\mathbb H}( |{\bf B}| ) \leq 1+{\mathbb E} |{\bf B}|}. If {{\bf B}} is almost surely non-empty, we can improve this to {{\mathbb H}( |{\bf B}| ) \leq {\mathbb E} |{\bf B}|}.

Proof: The set {{\bf A}} takes values in the power set {2^{{\bf B}}} of {{\bf B}}, so the claim (i) follows from Lemma 8(iv). (Note how it is convenient here that we are using the base {2} for the logarithm.)

For (ii), apply Lemma 8(v) with {{\bf X} = |{\bf B}|} and {{\bf Y}} the geometric random variable {{\mathbb P}({\bf Y} = n) = 2^{-n-1}} for natural numbers {n} (or {{\mathbb P}({\bf Y} = n) = 2^{-n}} for positive {n}, if {{\bf B}} is non-empty). \Box
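
As a quick numerical sanity check of Lemma 9(ii) (nothing more), one can compare the entropy of an arbitrary finitely supported size distribution with {1 + {\mathbb E} |{\bf B}|}:

```python
import math, random

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Lemma 9(ii): for any law of |B| on {0, 1, 2, ...}, H(|B|) <= 1 + E|B|.
for _ in range(5):
    w = [random.random() for _ in range(8)]
    probs = [x / sum(w) for x in w]            # probs[n] = P(|B| = n)
    mean = sum(n * p for n, p in enumerate(probs))
    assert entropy(probs) <= 1 + mean + 1e-9
```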

Now we encode the property of a random variable {{\bf A}} being {R}-spread in the language of Shannon entropy.

Lemma 10 (Information-theoretic interpretation of spread) Let {{\bf A}} be a random finite set that is {R}-spread for some {R \geq 1}.
  • (i) If {{\bf A}} is uniformly distributed amongst some finite collection {{\mathcal A}} of sets, then

    \displaystyle  {\mathbb H}({\bf A} | {\bf S} ) \leq {\mathbb H}({\bf A}) - (\log R) {\mathbb E} |{\bf S}|

    for all random subsets {{\bf S}} of {{\bf A}}.
  • (ii) In the general case, if {A_1, A_2, A_3,\dots} are an empirical sequence of {{\bf A}}, then

    \displaystyle  {\mathbb H}({\bf n} | {\bf S} ) \leq {\mathbb H}({\bf n}) - (\log R) {\mathbb E} |{\bf S}| - o(1)

    as {N \rightarrow \infty}, where {{\bf n}} is drawn uniformly from {\{1,\dots,N\}} and {{\bf S}} is a random subset of {A_{\bf n}}.

Informally: large random subsets of an {R}-spread set {{\bf A}} necessarily have a lot of mutual information with {{\bf A}}. Conversely, one can bound the size of a random subset of an {R}-spread set by bounding its mutual information with {{\bf A}}.

Proof: In case (i), it suffices by Lemma 8(iv) to establish the bound

\displaystyle  {\mathbb H}({\bf A} | {\bf S} = S ) \leq \log |{\mathcal A}| - (\log R) |S|

for each finite set {S} in the range of {{\bf S}}. But as {{\bf A}} is {R}-spread and uniformly distributed in {{\mathcal A}}, there are at most {R^{-|S|} |{\mathcal A}|} elements of {{\mathcal A}} that contain {S}, and the claim follows from a second appeal to Lemma 8(iv). Case (ii) follows from case (i) thanks to (6). \Box

Given a finite non-empty set {X} and {0 \leq k \leq |X|}, let {\binom{X}{k}} denote the collection of {k}-element subsets of {X}. A uniformly chosen element {{\bf W}} of {\binom{X}{k}} is thus a random {k}-element subset of {X}; we refer to the quantity {\delta = k/|X|} as the density of this random subset, and {{\bf W}} as a uniformly chosen random subset of {X} of density {\delta}. (Of course, this is only defined when {0 \leq \delta \leq 1} is an integer multiple of {1/|X|}.) Uniformly chosen random sets have the following information-theoretic relationships to small random sets:

Lemma 11 Let {X} be a finite non-empty set, let {{\bf W}} be a uniformly chosen random subset of {X} of some density {0 \leq \delta \leq 1} (which is a multiple of {1/|X|}).
  • (i) (Spread) If {{\bf S}} is a random subset of {{\bf W}}, then

    \displaystyle  {\mathbb H}({\bf W} | {\bf S} ) \leq {\mathbb H}({\bf W}) - (\log \frac{1}{\delta}) {\mathbb E} |{\bf S}|.

  • (ii) (Absorption) If {{\bf S}} is a random subset of {X}, then

    \displaystyle  {\mathbb H}({\bf W} \cup {\bf S} ) \leq {\mathbb H}({\bf W}) + 1 + (1+\log \frac{1}{\delta}) {\mathbb E} |{\bf S}|.

Proof: To prove (i), it suffices by Lemma 10(i) to show that {{\bf W}} is {\frac{1}{\delta}}-spread, which amounts to showing that

\displaystyle  \binom{|X \backslash S|}{\delta|X| - |S|} \leq \delta^{|S|} \binom{|X|}{\delta |X|}

for all subsets {S \subset X} with {0 \leq |S| \leq \delta |X|}. But this comes from multiplying together the inequalities {\frac{\delta |X|-i}{|X|-i} \leq \delta} for {0 \leq i < |S|}.

For (ii), by replacing {{\bf S}} with {{\bf S} \backslash {\bf W}} we may assume that {{\bf S}, {\bf W}} are disjoint. From Lemma 8(iii) and Lemma 9(ii) it suffices to show that

\displaystyle  {\mathbb H}({\bf W} \cup {\bf S} | |{\bf S}| ) \leq {\mathbb H}({\bf W}) + \log \frac{1}{\delta} {\mathbb E} |{\bf S}|

which in turn is implied by

\displaystyle  {\mathbb H}({\bf W} \cup {\bf S} | |{\bf S}| = k ) \leq {\mathbb H}({\bf W}) + \log \frac{1}{\delta} k

for each {0 \leq k \leq |X| - \delta |X|}. By Lemma 8(iv) it suffices to show that

\displaystyle  \binom{|X|}{\delta |X|+k} \leq \delta^{-k} \binom{|X|}{\delta |X|}

but this follows from multiplying together the inequalities {\frac{|X|+i}{\delta |X|+i}\leq \frac{1}{\delta}} for {1 \leq i \leq k}. \Box
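
The two counting inequalities used in this proof are easy to confirm numerically for small parameters; the following sketch does so with exact binomial coefficients (the parameter choices are arbitrary, subject to {|S| \leq \delta |X|} and {k \leq |X| - \delta |X|}).

```python
from math import comb

def lemma11_counting_bounds(n, d, s, k):
    # The two binomial-coefficient inequalities from the proof of Lemma 11,
    # with |X| = n, density delta = d/n, |S| = s (s <= d) and shift k (k <= n-d).
    delta = d / n
    containment = comb(n - s, d - s) <= delta ** s * comb(n, d) + 1e-9
    absorption = comb(n, d + k) <= delta ** (-k) * comb(n, d) + 1e-9
    return containment, absorption

for n, d, s, k in [(10, 3, 2, 4), (20, 5, 5, 10), (12, 6, 0, 6)]:
    print((n, d, s, k), lemma11_counting_bounds(n, d, s, k))   # (True, True) each time
```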

The following “relative product” construction will be important for us. Given a random variable {{\bf X}} and a deterministic function {f({\bf X})} of that variable, one can construct a conditionally independent copy {{\bf X}'} of {{\bf X}} subject to the condition {f({\bf X})=f({\bf X}')}, with the joint distribution

\displaystyle  {\mathbb P}( ({\bf X}, {\bf X}') = (X,X') ) = 1_{f(X)=f(X')} {\mathbb P}( f({\bf X}) = f(X) ) {\mathbb P}( {\bf X} = X | f({\bf X}) = f(X) ) {\mathbb P}( {\bf X} = X' | f({\bf X}) = f(X) ).

(Note that this is usually not the same as starting with a completely independent copy {{\bf X}'} of {{\bf X}} and then conditioning to the event {f({\bf X}) = f({\bf X}')}; cf. Simpson’s paradox.) By construction, {{\bf X}'} has the same distribution as {{\bf X}}, and is conditionally independent of {{\bf X}} relative to {f({\bf X})}. In particular, from Lemma 8 we have

\displaystyle  {\mathbb H}( {\bf X} | {\bf X}' ) = {\mathbb H}( {\bf X} | f({\bf X}) ) = {\mathbb H}({\bf X}) - {\mathbb H}(f({\bf X})) \ \ \ \ \ (7)

which we can also write as an “entropy Cauchy-Schwarz identity”

\displaystyle  {\mathbb H}({\bf X}, {\bf X}' ) = 2 {\mathbb H}({\bf X}) - {\mathbb H}(f({\bf X})).

This can be compared with the combinatorial inequality

\displaystyle  |\{ (x,x') \in X^2: f(x)=f(x') \}| \geq \frac{|X|^2}{|f(X)|}

or equivalently

\displaystyle  \log |\{ (x,x') \in X^2: f(x)=f(x') \}| \geq 2 \log |X| - \log |f(X)|

whenever {f: X \rightarrow Y} is a function on a non-empty finite set {X}, which is easily proven as a consequence of Cauchy-Schwarz. One nice advantage of the entropy formalism over the combinatorial one is that the analogue of this instance of the Cauchy-Schwarz inequality automatically becomes an equality (this is related to the asymptotic equipartition property in the microstates interpretation of entropy).
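
The relative product construction is concrete enough to simulate: the following Python sketch (an illustration only) builds the joint law of {({\bf X}, {\bf X}')} from the displayed formula and confirms the entropy Cauchy-Schwarz identity on a toy example, a uniform variable on six points with {f} the parity map.

```python
import math
from collections import defaultdict

def entropy(dist):
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

def relative_product(px, f):
    # Joint law of (X, X'), where X' is a conditionally independent copy of X
    # subject to f(X) = f(X'), following the displayed formula above.
    pf = defaultdict(float)
    for x, p in px.items():
        pf[f(x)] += p
    joint = defaultdict(float)
    for x, p in px.items():
        for x2, p2 in px.items():
            if f(x) == f(x2):
                joint[(x, x2)] += p * p2 / pf[f(x)]   # = P(f=v) P(x|f=v) P(x'|f=v)
    return joint

px = {x: 1 / 6 for x in range(6)}      # X uniform on {0,...,5}
f = lambda x: x % 2                     # f(x) = x mod 2
joint = relative_product(px, f)
pf = defaultdict(float)
for x, p in px.items():
    pf[f(x)] += p

lhs = entropy(joint)                    # H(X, X')
rhs = 2 * entropy(px) - entropy(pf)     # 2 H(X) - H(f(X))
print(lhs, rhs)                         # both equal log2(18) ≈ 4.1699
```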

— 2. Proof of refinement inequality —

Now we have enough tools to prove Proposition 6. Let {R, \delta, {\bf A}, {\bf V}} be as in that proposition. If {\emptyset} is in the support of {{\bf A}}, we can simply take {{\bf A}' = \emptyset} (all the required properties are then trivial); thus we may assume that {{\bf A}} is never empty. In particular

\displaystyle  {\mathbb E} |{\bf A}| \geq 1. \ \ \ \ \ (8)

In order to use Lemma 10 we fix an empirical sequence {A_1,A_2,\dots} for {{\bf A}}. We relabel {X} as {\{1,\dots,|X|\}}, and let {N \geq |X|} be a parameter going off to infinity (so in particular {X} is identified with a subset of {\{1,\dots,N\}}). We let {{\bf n}} be drawn uniformly at random from {\{1,\dots,N\}}, and let {{\bf W}} be a uniform random subset of {\{1,\dots,N\}} of density {\lfloor \delta N\rfloor/N = \delta + o(1)} independent of {{\bf n}}. Observe from Stirling’s formula that {{\bf W} \cap X} converges in distribution to {{\bf V}}. Thus it will suffice to find another random variable {{\bf n}'}, uniformly distributed on {\{1,\dots,N\}}, such that

\displaystyle  {\mathbb E} |A_{{\bf n}'} \backslash {\bf W}| \leq \frac{5}{\log(R\delta)} {\mathbb E} |A_{{\bf n}}| + o(1) \ \ \ \ \ (9)

as {N \rightarrow \infty}, since we can pass to a subsequence in which {(A_{{\bf n}'}, {\bf W} \cap X)} converges in distribution to {({\bf A}, {\bf V})}. From (8) we have

\displaystyle  {\mathbb E} |A_{\bf n}| \geq 1 - o(1). \ \ \ \ \ (10)

From {{\bf n}, {\bf W}} we can form the random set {A_{\bf n} \cup {\bf W}}; we then form a conditionally independent copy {({\bf n}', {\bf W}')} of {({\bf n}, {\bf W})} subject to the constraint

\displaystyle  A_{{\bf n}} \cup {\bf W} = A_{{\bf n}'} \cup {\bf W}'. \ \ \ \ \ (11)

We use {{\bf n}'} as the uniform variable to establish (9). The point is that the relation (11) implies that

\displaystyle  A_{{\bf n}'} \backslash {\bf W} \subset A_{{\bf n}} \cap A_{{\bf n}'}

so it will suffice to show that

\displaystyle  {\mathbb E} |A_{{\bf n}} \cap A_{{\bf n}'}| \leq \frac{5}{\log(R\delta)} {\mathbb E} |A_{{\bf n}}| + o(1) \ \ \ \ \ (12)

By Lemma 11(ii), (10) we have

\displaystyle  {\mathbb H}(A_{\bf n} \cup {\bf W}) \leq {\mathbb H}({\bf W}) + (2 + \log \frac{1}{\delta}) {\bf E} |A_{\bf n}| + o(1),

and hence by (7)

\displaystyle  {\mathbb H}( {\bf n}, {\bf W} | {\bf n}', {\bf W}' ) \geq {\mathbb H}({\bf n}, {\bf W}) - {\mathbb H}({\bf W}) - (2 + \log \frac{1}{\delta}) {\bf E} |A_{\bf n}| - o(1)

and hence by Lemma 8(ii) and independence of {{\bf n}, {\bf W}}

\displaystyle  {\mathbb H}({\bf n}) \leq {\mathbb H}( {\bf n}, {\bf W} | {\bf n}', {\bf W}' ) + (2 + \log \frac{1}{\delta}) {\bf E} |A_{\bf n}| + o(1).

Now we try to relate the first term on the right-hand side with {A_{{\bf n}} \cap A_{{\bf n}'}}. Note from (11) that we have the identity

\displaystyle  {\bf W} = (A_{{\bf n}'} \cup {\bf W}') \backslash (A_{\bf n} \backslash {\bf W})

and hence by Lemma 8(i)

\displaystyle  {\mathbb H}( {\bf n}, {\bf W} | {\bf n}', {\bf W}' ) \leq {\mathbb H}( {\bf n}, A_{\bf n} \backslash {\bf W} | {\bf n}', {\bf W}' ).

We estimate the conditional entropy of {{\bf n}, A_{\bf n} \backslash {\bf W}} here by selecting first {A_{\bf n} \cap A_{{\bf n}'}}, then {{\bf n}}, then {A_{\bf n} \backslash {\bf W}}. More precisely, using the chain rule and monotonicity (Lemma 8(i), (iii)) we have

\displaystyle  {\mathbb H}( {\bf n}, A_{\bf n} \backslash {\bf W} | {\bf n}', {\bf W}' ) \leq {\mathbb H}( A_{\bf n} \cap A_{{\bf n}'} | {\bf n}', {\bf W}' )

\displaystyle  + {\mathbb H}( {\bf n} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n}', {\bf W}' ) + {\mathbb H}( A_{\bf n} \backslash {\bf W} | {\bf n}, A_{\bf n} \cap A_{{\bf n}'}, {\bf n}', {\bf W}' ).

From Lemma 9(i) we have

\displaystyle {\mathbb H}( A_{\bf n} \cap A_{{\bf n}'} | {\bf n}', {\bf W}' ) \leq {\mathbb H}( A_{\bf n} \cap A_{{\bf n}'} | A_{{\bf n}'} ) \leq {\mathbb E} |A_{\bf n}|

and

\displaystyle  {\mathbb H}( A_{\bf n} \backslash {\bf W} | {\bf n}, A_{\bf n} \cap A_{{\bf n}'}, {\bf n}', {\bf W}' ) \leq {\mathbb H}( A_{\bf n} \backslash {\bf W} | A_{{\bf n}} ) \leq {\mathbb E} |A_{\bf n}|.

Putting all this together, we conclude

\displaystyle  {\mathbb H}({\bf n}) \leq {\mathbb H}( {\bf n} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n}', {\bf W}' ) + (4 + \log \frac{1}{\delta}) {\bf E} |A_{\bf n}| + o(1).

If we apply Lemma 10(ii) right away we will get the estimate

\displaystyle  (\log R) {\bf E} |A_{\bf n} \cap A_{{\bf n}'}| \leq (4 + \log \frac{1}{\delta}) {\bf E} |A_{\bf n}| + o(1)

which is a bound resembling (12), but the dependence on the parameters {\delta, R} is too weak. To do better we return to the relative product construction to decouple some of the random variables here. From the tuple {({\bf n}, {\bf W}, {\bf n}', {\bf W}')} we can form the random variable {(A_{\bf n} \cap A_{{\bf n}'}, A_{{\bf n}} \cup {\bf W})}, then form a conditionally independent copy {({\bf m}, {\bf U}, {\bf m}', {\bf U}')} of {({\bf n}, {\bf W}, {\bf n}', {\bf W}')} subject to the constraints

\displaystyle  A_{\bf n} \cap A_{{\bf n}'} = A_{\bf m} \cap A_{{\bf m}'}; \quad A_{\bf n} \cup {\bf W} = A_{\bf m} \cup {\bf U}. \ \ \ \ \ (13)

From (11) and Lemma 8(i) we then have

\displaystyle  {\mathbb H}( {\bf n} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n}', {\bf W}' ) \leq {\mathbb H}( {\bf n} | A_{\bf n} \cap A_{{\bf n}'}, A_{\bf n} \cup {\bf W} )

\displaystyle = {\mathbb H}( {\bf m} | A_{\bf n} \cap A_{{\bf n}'}, A_{\bf n} \cup {\bf W} ).

The point is that {{\bf m}} is now conditionally independent of {{\bf n}, {\bf W}} relative to {A_{\bf n} \cap A_{{\bf n}'}, A_{\bf n} \cup {\bf W}}, so we can also rewrite the above conditional entropy as

\displaystyle  {\mathbb H}( {\bf m} | A_{\bf n} \cap A_{{\bf n}'}, A_{\bf n} \cup {\bf W}, {\bf n}, {\bf W} )

\displaystyle  ={\mathbb H}( {\bf m} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n}, {\bf W} )

We now use the chain rule to disentangle the role of {{\bf W}}, writing the previous as

\displaystyle  {\mathbb H}( {\bf m}, {\bf W} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n} ) - {\mathbb H}( {\bf W} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n} )

\displaystyle  = {\mathbb H}( {\bf m}, {\bf W} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n} ) - {\mathbb H}( {\bf W} | {\bf n} )

\displaystyle + {\mathbb H}( A_{\bf n} \cap A_{{\bf n'}} | {\bf n} ) - {\mathbb H}( A_{\bf n} \cap A_{{\bf n'}} | {\bf n}, {\bf W} ).

From independence we have

\displaystyle {\mathbb H}( {\bf W} | {\bf n} ) = {\mathbb H}( {\bf W} )

and from Lemma 9(i) we have

\displaystyle  {\mathbb H}( A_{\bf n} \cap A_{{\bf n'}} | {\bf n} ) \leq {\bf E} |A_{\bf n}|.

We discard the negative term {{\mathbb H}( A_{\bf n} \cap A_{{\bf n'}} | {\bf n}, {\bf W} )}. Putting all this together, we obtain

\displaystyle  {\mathbb H}({\bf n}) \leq {\mathbb H}( {\bf m}, {\bf W} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n} ) \ \ \ \ \ (14)

\displaystyle  - {\mathbb H}( {\bf W} ) + (5 + \log \frac{1}{\delta}) {\bf E} |A_{\bf n}| + o(1).

Now, from the constraints (11), (13) we have

\displaystyle  A_{\bf n} \cap A_{{\bf n}'} \subset A_{{\bf m}}

and

\displaystyle  A_{{\bf n}'} \backslash A_{\bf n} \subset {\bf W}.

Thus by Lemma 8(i), (ii), followed by Lemma 10(ii) and Lemma 11(i), we have

\displaystyle  {\mathbb H}( {\bf m}, {\bf W} | A_{\bf n} \cap A_{{\bf n}'}, {\bf n} ) \leq {\mathbb H}( {\bf m} | A_{\bf n} \cap A_{{\bf n}'} ) + {\mathbb H}( {\bf W} | A_{{\bf n}'} \backslash A_{\bf n} )

\displaystyle  \leq {\mathbb H}({\bf m}) - (\log R) {\mathbb E} |A_{\bf n} \cap A_{{\bf n}'}| + {\mathbb H}({\bf W}) - \log \frac{1}{\delta} {\mathbb E} |A_{{\bf n}'} \backslash A_{\bf n}| + o(1)

which when inserted back into (14) using {{\mathbb H}({\bf m}) = {\mathbb H}({\bf n})}, {|A_{{\bf n}'} \backslash A_{\bf n}| = |A_{{\bf n}'}| - |A_{{\bf n}'} \cap A_{\bf n}|}, and {{\mathbb E} |A_{{\bf n}'}| = {\mathbb E} |A_{\bf n}|} (as {{\bf n}'} is also uniformly distributed on {\{1,\dots,N\}}) simplifies to

\displaystyle  (\log (R\delta)) {\mathbb E} |A_{\bf n} \cap A_{{\bf n}'}| \leq 5 {\bf E} |A_{\bf n}| + o(1)

and the claim follows.