A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality:

Theorem 1 (Plünnecke-Ruzsa inequality) Let {A, B_1, \ldots, B_m} be finite non-empty subsets of an additive group {G}, such that {|A+B_i| \leq K_i |A|} for all {1 \leq i \leq m} and some scalars {K_1,\ldots,K_m \geq 1}. Then there exists a subset {A'} of {A} such that {|A' + B_1 + \ldots + B_m| \leq K_1 \ldots K_m |A'|}.

The proof uses graph-theoretic techniques. Setting {A=B_1=\ldots=B_m}, we obtain a useful corollary: if {A} has small doubling in the sense that {|A+A| \leq K|A|}, then we have {|mA| \leq K^m |A|} for all {m \geq 1}, where {mA = A + \ldots + A} is the sum of {m} copies of {A}.
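
To spell out the deduction: applying Theorem 1 with {B_1 = \ldots = B_m = A} and {K_1 = \ldots = K_m = K}, one obtains a non-empty subset {A'} of {A} with {|A' + mA| \leq K^m |A'|}, and hence, for any fixed element {a'} of {A'},

\displaystyle  |mA| = |a' + mA| \leq |A' + mA| \leq K^m |A'| \leq K^m |A|.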

In a recent paper, I adapted a number of sum set estimates to the entropy setting, in which finite sets such as {A} in {G} are replaced with discrete random variables {X} taking values in {G}, and (the logarithm of) cardinality {|A|} of a set {A} is replaced by Shannon entropy {{\Bbb H}(X)} of a random variable {X}. (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.

I recently discovered, however, that buried in a classic paper of Kaimanovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 1:

Theorem 2 (Entropy Plünnecke-Ruzsa inequality) Let {X, Y_1, \ldots, Y_m} be independent random variables of finite entropy taking values in an additive group {G}, such that {{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i} for all {1 \leq i \leq m} and some scalars {K_1,\ldots,K_m \geq 1}. Then {{\Bbb H}(X+Y_1+\ldots+Y_m) \leq {\Bbb H}(X) + \log K_1 \ldots K_m}.

In fact Theorem 2 is a bit “better” than Theorem 1 in the sense that Theorem 1 needed to refine the original set {A} to a subset {A'}, but no such refinement is needed in Theorem 2. One corollary of Theorem 2 is that if {{\Bbb H}(X_1+X_2) \leq {\Bbb H}(X) + \log K}, then {{\Bbb H}(X_1+\ldots+X_m) \leq {\Bbb H}(X) + (m-1) \log K} for all {m \geq 1}, where {X_1,\ldots,X_m} are independent copies of {X}; this improves slightly over the analogous combinatorial inequality. Indeed, the function {m \mapsto {\Bbb H}(X_1+\ldots+X_m)} is concave (this can be seen by using the {m=2} version of Theorem 2 (or (2) below) to show that the quantity {{\Bbb H}(X_1+\ldots+X_{m+1})-{\Bbb H}(X_1+\ldots+X_m)} is decreasing in {m}).
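
To spell out the concavity step: for {m \geq 2}, applying (2) below with {X}, {Y}, {Z} replaced by {X_1}, {X_2+\ldots+X_m}, {X_{m+1}} respectively, and using the fact that the {X_i} are identically distributed to relabel the resulting entropies, one obtains

\displaystyle  {\Bbb H}(X_1+\ldots+X_{m+1}) + {\Bbb H}(X_1+\ldots+X_{m-1}) \leq 2 {\Bbb H}(X_1+\ldots+X_m),

which is the asserted concavity.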

Theorem 2 is actually a quick consequence of the submodularity inequality

\displaystyle  {\Bbb H}(W) + {\Bbb H}(X) \leq {\Bbb H}(Y) + {\Bbb H}(Z) \ \ \ \ \ (1)

in information theory, which is valid whenever {X,Y,Z,W} are discrete random variables such that {Y} and {Z} each determine {X} (i.e. {X} is a function of {Y}, and also a function of {Z}), and {Y} and {Z} jointly determine {W} (i.e. {W} is a function of {Y} and {Z}). To apply this, let {X, Y, Z} be independent discrete random variables taking values in {G}. Observe that the pairs {(X,Y+Z)} and {(X+Y,Z)} each determine {X+Y+Z}, and jointly determine {(X,Y,Z)}. Applying (1) we conclude that

\displaystyle  {\Bbb H}(X,Y,Z) + {\Bbb H}(X+Y+Z) \leq {\Bbb H}(X,Y+Z) + {\Bbb H}(X+Y,Z)

which after using the independence of {X,Y,Z} simplifies to the sumset submodularity inequality

\displaystyle  {\Bbb H}(X+Y+Z) + {\Bbb H}(Y) \leq {\Bbb H}(X+Y) + {\Bbb H}(Y+Z) \ \ \ \ \ (2)

(this inequality was also recently observed by Madiman; it is the {m=2} case of Theorem 2). As a corollary of this inequality, we see that if {{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i}, then

\displaystyle  {\Bbb H}(X+Y_1+\ldots+Y_i) \leq {\Bbb H}(X+Y_1+\ldots+Y_{i-1}) + \log K_i,

and Theorem 2 follows by telescoping series.
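
To make the penultimate step explicit: for {2 \leq i \leq m} (the case {i=1} being just the hypothesis), applying (2) with {X}, {Y}, {Z} replaced by {Y_1+\ldots+Y_{i-1}}, {X}, {Y_i} respectively gives

\displaystyle  {\Bbb H}(X+Y_1+\ldots+Y_i) + {\Bbb H}(X) \leq {\Bbb H}(X+Y_1+\ldots+Y_{i-1}) + {\Bbb H}(X+Y_i),

and inserting the hypothesis {{\Bbb H}(X+Y_i) \leq {\Bbb H}(X) + \log K_i} and cancelling {{\Bbb H}(X)} yields the previous display.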

The proof of Theorem 2 seems to be genuinely different from the graph-theoretic proof of Theorem 1. It would be interesting to see if the above argument can be somehow adapted to give a stronger version of Theorem 1. Note also that both Theorem 1 and Theorem 2 have extensions to more general combinations of {X,Y_1,\ldots,Y_m} than {X+Y_i}; see this paper and this paper respectively.

It is also worth remarking that the above inequalities largely carry over to the non-abelian setting. For instance, if {X_1, X_2,\ldots} are iid copies of a discrete random variable {X} in a multiplicative group {G}, the above arguments show that the function {m \mapsto {\Bbb H}(X_1 \ldots X_m)} is concave. Since this function is also non-negative and vanishes at {m=0}, the expression {\frac{1}{m} {\Bbb H}(X_1 \ldots X_m)} decreases monotonically to a limit, the asymptotic entropy {{\Bbb H}(G,X)}. This quantity plays an important role in the theory of bounded harmonic functions on {G}, as observed by Kaimanovich and Vershik:

Proposition 3 Let {G} be a discrete group, and let {X} be a discrete random variable in {G} with finite entropy, whose support generates {G}. Then there exists a non-constant bounded function {f: G \rightarrow {\Bbb R}} which is harmonic with respect to {X} (which means that {{\Bbb E} f(X x) = f(x)} for all {x \in G}) if and only if {{\Bbb H}(G,X) \neq 0}.

Proof: (Sketch) Suppose first that {{\Bbb H}(G,X) = 0}. By concavity, the successive differences {{\Bbb H}(X_1 \ldots X_m) - {\Bbb H}(X_1 \ldots X_{m-1})} are non-increasing in {m}; since their averages {\frac{1}{m} {\Bbb H}(X_1 \ldots X_m)} converge to {{\Bbb H}(G,X) = 0}, they must converge to zero. From this it is not hard to see that the mutual information

\displaystyle  {\Bbb I}(X_m, X_1 \ldots X_m) := {\Bbb H}(X_m) + {\Bbb H}(X_1 \ldots X_m) - {\Bbb H}(X_m, X_1 \ldots X_m)

goes to zero as {m \rightarrow \infty}. Informally, knowing the value of {X_m} reveals very little about the value of {X_1 \ldots X_m} when {m} is large.
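
To make this step precise, observe that the pair {(X_m, X_1 \ldots X_m)} determines and is determined by the pair {(X_m, X_1 \ldots X_{m-1})}, so that by independence {{\Bbb H}(X_m, X_1 \ldots X_m) = {\Bbb H}(X_m) + {\Bbb H}(X_1 \ldots X_{m-1})}, and hence

\displaystyle  {\Bbb I}(X_m, X_1 \ldots X_m) = {\Bbb H}(X_1 \ldots X_m) - {\Bbb H}(X_1 \ldots X_{m-1}),

which is exactly the successive difference that was shown above to converge to zero.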

Now let {f: G \rightarrow {\Bbb R}} be a bounded harmonic function, and let {m} be large. For any {x \in G} and any value {s} in the support of {X_m}, we observe from harmonicity that

\displaystyle  f( s x ) = {\Bbb E}( f( X_1 \ldots X_m x ) | X_m = s ).
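
(To justify this identity: by independence, conditioning on the event {X_m = s} replaces the product {X_1 \ldots X_m x} by {X_1 \ldots X_{m-1} s x}, and iterating the harmonicity identity {{\Bbb E} f(Xy) = f(y)} a total of {m-1} times gives {{\Bbb E} f( X_1 \ldots X_{m-1} s x ) = f(sx)}, as claimed.)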

But from the asymptotic vanishing of mutual information and the boundedness of {f}, one can show that the right-hand side will converge to {{\Bbb E}( f(X_1 \ldots X_m x) )}, which by harmonicity is equal to {f(x)}. Thus {f} is invariant with respect to the support of {X}, and is thus constant since this support generates {G}.

Conversely, if {{\Bbb H}(G,X)} is non-zero, then the above arguments show that {{\Bbb I}(X_m, X_1 \ldots X_m)} stays bounded away from zero as {m \rightarrow \infty}; thus {X_1 \ldots X_m} reveals a non-trivial amount of information about {X_m}. This turns out to be true even if {m} is not deterministic, but is itself random, varying over some medium-sized range. From this, one can find a bounded function {F} such that the conditional expectation {{\Bbb E}(F(X_1 \ldots X_m) | X_m = s)} varies non-trivially with {s}. On the other hand, the bounded function {x \mapsto {\Bbb E} F(X_1 \ldots X_{m-1} x)} is approximately harmonic (because we are varying {m}), and has some non-trivial fluctuation near the identity (by the preceding sentence). Taking a limit as {m \rightarrow \infty} (using Arzelà-Ascoli) we obtain a non-constant bounded harmonic function as desired. \Box
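
To illustrate Proposition 3 with two standard examples: if {G = {\Bbb Z}^d} (written additively) and {X} is finitely supported with support generating {{\Bbb Z}^d}, then {X_1 + \ldots + X_m} is supported on {O(m^d)} points, so that {{\Bbb H}(X_1+\ldots+X_m) = O(\log m)} and {{\Bbb H}(G,X) = 0}; accordingly, such a walk admits no non-constant bounded harmonic functions. By contrast, for the simple random walk on the free group {F_2}, the asymptotic entropy is strictly positive, and so non-constant bounded harmonic functions exist.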