A generalized Cauchy-Schwarz inequality via the Gibbs variational formula

10 December, 2023 in expository, math.CA, math.PR | Tags: Anthony Carbery, Cauchy-Schwarz, Gibbs variational formula, Shannon entropy | by Terence Tao

Let ${S}$ be a non-empty finite set. If ${X}$ is a random variable taking values in ${S}$ , the Shannon entropy ${H[X]}$ of ${X}$ is defined as

$\displaystyle H[X] = -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].$

There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:

Lemma 1 (Gibbs variational formula) Let ${f: S \rightarrow {\bf R}}$ be a function. Then
$\displaystyle \log \sum_{s \in S} \exp(f(s)) = \sup_X {\bf E} f(X) + {\bf H}[X]. \ \ \ \ \ (1)$

Proof: Note that shifting ${f}$ by a constant affects both sides of (1) the same way, so we may normalize ${\sum_{s \in S} \exp(f(s)) = 1}$ . Then ${\exp(f(s))}$ is now the probability distribution of some random variable ${Y}$ , and the inequality can be rewritten as

$\displaystyle 0 = \sup_X \sum_{s \in S} {\bf P}[X = s] \log {\bf P}[Y = s] -\sum_{s \in S} {\bf P}[X = s] \log {\bf P}[X = s].$

But this is precisely the Gibbs inequality. (The expression inside the supremum can also be written as ${-D_{KL}(X||Y)}$ , where ${D_{KL}}$ denotes Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions ${x \mapsto e^x}$ and ${y \mapsto y \log y - y}$ .) $\Box$

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.

Theorem 2 (Generalized Cauchy-Schwarz inequality) Let ${n \geq 0}$ , let ${S, T_1,\dots,T_n}$ be finite non-empty sets, and let ${\pi_i: S \rightarrow T_i}$ be functions for each ${i=1,\dots,n}$ . Let ${K: S \rightarrow {\bf R}^+}$ and ${f_i: T_i \rightarrow {\bf R}^+}$ be positive functions for each ${i=1,\dots,n}$ . Then
$\displaystyle \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n (\sum_{t_i \in T_i} f_i(t_i)^{n+1})^{1/(n+1)}$
where ${Q}$ is the quantity
$\displaystyle Q := (\sum_{(s_0,\dots,s_n) \in \Omega_n} K(s_0) \dots K(s_n))^{1/(n+1)}$
where ${\Omega_n}$ is the set of all tuples ${(s_0,\dots,s_n) \in S^{n+1}}$ such that ${\pi_i(s_{i-1}) = \pi_i(s_i)}$ for ${i=1,\dots,n}$ .

Thus for instance, the identity is trivial for ${n=0}$ . When ${n=1}$ , the inequality reads

$\displaystyle \sum_{s \in S} K(s) f_1(\pi_1(s)) \leq (\sum_{s_0,s_1 \in S: \pi_1(s_0)=\pi_1(s_1)} K(s_0) K(s_1))^{1/2}$

$\displaystyle ( \sum_{t_1 \in T_1} f_1(t_1)^2)^{1/2},$

which is easily proven by Cauchy-Schwarz, while for ${n=2}$ the inequality reads

$\displaystyle \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s))$

$\displaystyle \leq (\sum_{s_0,s_1, s_2 \in S: \pi_1(s_0)=\pi_1(s_1); \pi_2(s_1)=\pi_2(s_2)} K(s_0) K(s_1) K(s_2))^{1/3}$

$\displaystyle (\sum_{t_1 \in T_1} f_1(t_1)^3)^{1/3} (\sum_{t_2 \in T_2} f_2(t_2)^3)^{1/3}$

which can also be proven by elementary means. However even for ${n=3}$ , the existing proofs require the “tensor power trick” in order to reduce to the case when the ${f_i}$ are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).

We now prove this inequality. We write ${K(s) = \exp(k(s))}$ and ${f_i(t_i) = \exp(g_i(t_i))}$ for some functions ${k: S \rightarrow {\bf R}}$ and ${g_i: T_i \rightarrow {\bf R}}$ . If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes

$\displaystyle \sup_X {\bf E} k(X) + \sum_{i=1}^n g_i(\pi_i(X)) + {\bf H}[X]$

$\displaystyle \leq \frac{1}{n+1} \sup_{(X_0,\dots,X_n)} {\bf E} k(X_0)+\dots+k(X_n) + {\bf H}[X_0,\dots,X_n]$

$\displaystyle + \frac{1}{n+1} \sum_{i=1}^n \sup_{Y_i} (n+1) {\bf E} g_i(Y_i) + {\bf H}[Y_i]$

where ${X}$ ranges over random variables taking values in ${S}$ , ${X_0,\dots,X_n}$ range over tuples of random variables taking values in ${\Omega_n}$ , and ${Y_i}$ range over random variables taking values in ${T_i}$ . Comparing the suprema, the claim now reduces to

Lemma 3 (Conditional expectation computation) Let ${X}$ be an ${S}$ -valued random variable. Then there exists a ${\Omega_n}$ -valued random variable ${(X_0,\dots,X_n)}$ , where each ${X_i}$ has the same distribution as ${X}$ , and
$\displaystyle {\bf H}[X_0,\dots,X_n] = (n+1) {\bf H}[X]$

$\displaystyle - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_n(X)].$

Proof: We induct on ${n}$ . When ${n=0}$ we just take ${X_0 = X}$ . Now suppose that ${n \geq 1}$ , and the claim has already been proven for ${n-1}$ , thus one has already obtained a tuple ${(X_0,\dots,X_{n-1}) \in \Omega_{n-1}}$ with each ${X_0,\dots,X_{n-1}}$ having the same distribution as ${X}$ , and

$\displaystyle {\bf H}[X_0,\dots,X_{n-1}] = n {\bf H}[X] - {\bf H}[\pi_1(X)] - \dots - {\bf H}[\pi_{n-1}(X)].$

By hypothesis, ${\pi_n(X_{n-1})}$ has the same distribution as ${\pi_n(X)}$ . For each value ${t_n}$ attained by ${\pi_n(X)}$ , we can take conditionally independent copies of ${(X_0,\dots,X_{n-1})}$ and ${X}$ conditioned to the events ${\pi_n(X_{n-1}) = t_n}$ and ${\pi_n(X) = t_n}$ respectively, and then concatenate them to form a tuple ${(X_0,\dots,X_n)}$ in ${\Omega_n}$ , with ${X_n}$ a further copy of ${X}$ that is conditionally independent of ${(X_0,\dots,X_{n-1})}$ relative to ${\pi_n(X_{n-1}) = \pi_n(X)}$ . One can the use the entropy chain rule to compute

$\displaystyle {\bf H}[X_0,\dots,X_n] = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_n| \pi_n(X_n)]$

$\displaystyle = {\bf H}[\pi_n(X_n)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_n)] + {\bf H}[X_n| \pi_n(X_n)]$

$\displaystyle = {\bf H}[\pi_n(X)] + {\bf H}[X_0,\dots,X_{n-1}| \pi_n(X_{n-1})] + {\bf H}[X_n| \pi_n(X_n)]$

$\displaystyle = {\bf H}[\pi_n(X)] + ({\bf H}[X_0,\dots,X_{n-1}] - {\bf H}[\pi_n(X_{n-1})])$

$\displaystyle + ({\bf H}[X_n] - {\bf H}[\pi_n(X_n)])$

$\displaystyle ={\bf H}[X_0,\dots,X_{n-1}] + {\bf H}[X_n] - {\bf H}[\pi_n(X_n)]$

and the claim now follows from the induction hypothesis. $\Box$

With a little more effort, one can replace ${S}$ by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery’s inequality in full generality; we leave the details to the interested reader.

30 comments

Comments feed for this article

11 December, 2023 at 1:04 pm

Anonymous

Thanks Prof Tao for this useful information. A new inequality might be the key to solving some difficult problems. Dr. Hayes

11 December, 2023 at 8:46 pm

Anonymous

“Kullback-Leibler divergence” is the correct spelling.
[Corrected, thanks – T.]

11 December, 2023 at 8:48 pm

Anonymous

Note: The comment software does not appear to allow for any identification of the commenter, but just prints the disembodied comment and attributes it to “Anonymous”.

I hope this can be fixed soon.

12 December, 2023 at 10:34 am

Anonymous

It won’t be fixed. It’s the new feature.

13 December, 2023 at 12:48 am

Anonymous

Who the f*ck is Anonymous????? … Terry Tao you might as well rename your blog to that of your gutless faceless proof-reader who persists in correcting every single tiny little unimportant spelling mistake for the simple reason of totally dominating this world renowned blog with his petulant presence … my name is Jenda Vondra and I’m from Australia and we Aussies can only take so much crap!

8 February, 2024 at 1:16 pm

Anonymous

What the heck bro 😭

12 December, 2023 at 12:40 am

Anonymous

We’re going to need a bigger number.

12 December, 2023 at 10:32 am

Anonymous

Yes, much bigger.

17 December, 2023 at 12:31 pm

Anonymous

even bigger than that

12 December, 2023 at 3:31 am

Tony Carbery

Thanks Terry, it’s great to see this inequality in a wider context!

12 December, 2023 at 7:45 am

Terence Tao

Now I remember where I saw this trick before – it appears in this paper of Carlen and Cordero-Erasquin in which they demonstrate using the Gibbs variational formula that the Brascamp-Lieb inequality is equivalent to an entropy version of that inequality. As the above example demonstrates, this equivalence extends to a more general class of inequalities than the classical Brascamp-Lieb inequalities (in that paper, for instance, they also note that it applies to spherical analogues of that inequality).

12 December, 2023 at 9:01 am

Anonymous

Is this apropos of anything? Does it have anything to do with the proof of PFR, beyond “this is an inequality about entropy that should be more widely known”?

12 December, 2023 at 10:36 am

Anonymous

Why does it have to be apropos of anything?

14 December, 2023 at 6:17 pm

Anonymous

In this comment, Prof. Tao read a paper about this inequality and decided to write a blog post.

15 December, 2023 at 6:00 pm

YahyaAA1

“… the inequality can be rewritten as …”

Surely, at this point, we have an equation, rather than an inequality? That’s what the “=” sign means usually. Unless, of course, it has a different meaning in the calc. of variations …

25 December, 2023 at 4:08 pm

Jose Alvarado

Thank you, Prof. Tao, for sharing this interesting approach.

I apologize if the validity of the following comment seems obvious, but it appears that we can deduce (from the proof of Lemma 3) a similar lemma with extra flexibility. The statement is as follows:

“ $\text{Let } X \in \mathcal{X} \text{ and } Y \in \mathcal{Y} \text{ be two random variables. Given two functions } f: \mathcal{X} \to \mathcal{T} \text{ and } g: \mathcal{Y} \to \mathcal{T}. \text{ Then, there exists a random variable } (X',Y') \in \Omega := \{(x,y) \in \mathcal{X} \times \mathcal{Y} : f(x) = g(y)\} \text{ such that } X' \text{ is a copy of } X, Y' \text{ is a copy of } Y, \text{ and}$
$\mathbf{H}[X',Y'] = \mathbf{H}[X'] + \mathbf{H}[Y'] - \mathbf{H}[g(Y')].$ ”

Does that make sense?

26 December, 2023 at 4:32 pm

Terence Tao

I think one may need a hypothesis that $f(X)$ and $g(Y)$ have the same distribution in order to reach this conclusion. (For instance, if $f(X)$ and $g(Y)$ take completely disjoint values, then $\Omega$ is empty and the claim cannot hold.)

26 December, 2023 at 7:57 pm

JoseDAlvarado

Oops… Of course, you are right! It is imperative to add a hypothesis concerning the random variables $f(X)$ and $g(Y)$ . Thanks for the clarification :)

11 January, 2024 at 8:33 am

Joseph Malkoun

I have a small comment. If you have, say, only 2 real random variables, say $X$ and $Y$ , and you have $N$ data points $(x_i, y_i)$ in the $xy$ -plane, for $i = 1, \ldots, n$ , one can, say, form the $2 \times N$ matrix $A$ containing $(x_i, y_i)$ as its $i$ -th column. In order to talk about the sample covariance matrix of $X$ and $Y$ , one could of course follow the probability definition, but it also can be viewed geometrically in 2 steps: removing the mean value of $X$ from $X$ , for example, corresponds to orthogonal projection onto the orthogonal complement of $(1, \ldots, 1)$ ( $N$ components) in $\mathbb{R}^N$ . The covariance matrix is then the Gram matrix of the orthogonal projections (as in my previous sentence) of the 2 rows of $A$ .

Thus, in some sense, the language of probability is dual to the language of geometry. It is a bit like deciding whether to look at $A$ as $N$ points in $\mathbb{R}^2$ , or as $2$ points in $\mathbb{R}^N$ , which are of course dual points of view.

My remark is trivial, and of course known, but I wonder if it can be, or perhaps has been already made into a systematic transform/duality.

26 January, 2024 at 10:26 am

shannon7774

This looks a bit like reflection-positivity, maybe just because that uses Cauchy-Schwarz. Is there a quantum analogue, if you replace expectations with respect to measures by states, and replace random variables by self adjoint operators?

1 March, 2024 at 5:22 am

Jas, the Physicist

Yes.

8 February, 2024 at 2:44 am

Anonymous

Sir please explain the sphere emersion , I kindly request to you, thanking you sir with huge respect

8 February, 2024 at 1:15 pm

Anonymous

Our of context but, how would the Gibbs variational formula work for measure theoretic entropy? I don’t see it anywhere properly defined anyway except this blog

20 February, 2024 at 1:51 pm

Anonymous

Hey big bro you have to contribute to mathstackexchange the poor kids are trying to learn

24 April, 2024 at 11:00 am

Anonymous

Lmao yes nationalize math education

21 February, 2024 at 6:44 am

T Retsu

In the proof of Lemma 1, the inequality can be rewritten as:

$\displaystyle0=\sup_X\sum_{s\in S}\textbf{P}[X=s]\textbf{P}[Y=s]-\sum_{s\in S}\textbf{P}[X=s]\mathrm{log}\textbf{P}[X=s]$ .

There isn’t any $\mathrm{log}$ in the $\textbf{E}$ .

[Actually, the identity was correct as originally written: note that $\textbf{P}[Y=s]$ is equal to $\exp(f(s))$ -T.]

24 March, 2024 at 9:11 pm

Anonymous

Many thanks for the answer. It is correct. Lemma 1 is proven.

A question about the terminology:

convex conjugate

instead of “conjugate convex”

26 February, 2024 at 1:02 am

Anonymous

Dear prof. Tao,
Do you know in what regards these generalized inequalities could be lifted to infinite (perhaps uncountable sets S)?

7 March, 2024 at 6:05 pm

Anonymous

Thank you Prof Tao for working on cybersecurity for the federal government, I can’t wait to see your new work on lie algebra and prime numbers.

11 April, 2024 at 1:32 pm

Anonymous

“working for the federal government” aka being a slave of the state

	Anonymous on It ought to be common knowledg…
	Ring Theory Intervie… on Reading seminar: “Stable…
	Anonymous on Work hard
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Erratum for “An inverse…
	Anonymous on Erratum for “An inverse…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…
	Anonymous on Infinite partial sumsets in th…

A generalized Cauchy-Schwarz inequality via the Gibbs variational formula

Recent Comments

Articles by others

Diversions

Mathematics

Selected articles

Software

The sciences

Top Posts

Archives

Categories

The Polymath Blog

30 comments

Leave a comment Cancel reply

For commenters

A generalized Cauchy-Schwarz inequality via the Gibbs variational formula

Share this:

Recent Comments

Articles by others

Diversions

Mathematics

Selected articles

Software

The sciences

Top Posts

Archives

Categories

The Polymath Blog

30 comments

Leave a comment Cancel reply

For commenters