Let $S$ be a non-empty finite set. If $X$ is a random variable taking values in $S$, the Shannon entropy $H[X]$ of $X$ is defined as
$$ H[X] := -\sum_{s \in S} P[X = s] \log P[X = s].$$
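As a quick illustration, the definition translates directly into code; here is a minimal Python sketch (using natural logarithms and the convention $0 \log 0 = 0$):

```python
import math

def shannon_entropy(p):
    """H[X] = -sum_s P[X=s] log P[X=s], with the convention 0 log 0 = 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

# The uniform distribution on 4 outcomes attains the maximal entropy log 4.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 1.3862943611198906 = log 4
```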
There is a nice variational formula that lets one compute logs of sums of exponentials in terms of this entropy:
Lemma 1 (Gibbs variational formula) Let $f: S \to {\bf R}$ be a function. Then
$$ \log \sum_{s \in S} \exp(f(s)) = \sup_X \left( E[f(X)] + H[X] \right), \quad (1)$$
where the supremum is over random variables $X$ taking values in $S$.
Proof: Note that shifting $f$ by a constant affects both sides of (1) the same way, so we may normalize $\sum_{s \in S} \exp(f(s)) = 1$. Then $(\exp(f(s)))_{s \in S}$ is now the probability distribution of some random variable $Y$, and the inequality can be rewritten as
$$ 0 = \sup_X \left( \sum_{s \in S} P[X=s] \log P[Y=s] - \sum_{s \in S} P[X=s] \log P[X=s] \right).$$
But this is precisely the Gibbs inequality. $\Box$

(The expression inside the supremum can also be written as $-D_{KL}(X \| Y)$, where $D_{KL}$ denotes Kullback-Leibler divergence. One can also interpret this inequality as a special case of the Fenchel–Young inequality relating the conjugate convex functions $f \mapsto \log \sum_{s \in S} \exp(f(s))$ and $X \mapsto -H[X]$.)

In this note I would like to use this variational formula (which is also known as the Donsker-Varadhan variational formula) to give another proof of the following inequality of Carbery.
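Lemma 1 is easy to test numerically: the supremum is attained at the Gibbs distribution $p(s) \propto \exp(f(s))$, and any other distribution gives a smaller value of $E[f(X)] + H[X]$. A small Python sketch (the function $f$ here is just random illustrative data):

```python
import math
import random

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

random.seed(0)
f = [random.uniform(-2.0, 2.0) for _ in range(6)]  # an arbitrary f: S -> R, |S| = 6

# Left-hand side of the Gibbs variational formula: log sum_s exp(f(s)).
lhs = math.log(sum(math.exp(x) for x in f))

# The supremum of E[f(X)] + H[X] is attained at the Gibbs distribution.
Z = sum(math.exp(x) for x in f)
gibbs = [math.exp(x) / Z for x in f]
at_gibbs = sum(p * x for p, x in zip(gibbs, f)) + entropy(gibbs)

# Any other distribution (e.g. uniform) gives a value that is at most lhs.
unif = [1.0 / len(f)] * len(f)
at_unif = sum(p * x for p, x in zip(unif, f)) + entropy(unif)

print(abs(lhs - at_gibbs) < 1e-9, at_unif <= lhs)  # True True
```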
Theorem 2 (Generalized Cauchy-Schwarz inequality) Let $n \geq 1$, let $S, T_1, \dots, T_n$ be finite non-empty sets, and let $\pi_i: S \to T_i$ be functions for each $1 \leq i \leq n$. Let $K: S \to {\bf R}^+$ and $f_i: T_i \to {\bf R}^+$ be positive functions for each $1 \leq i \leq n$. Then
$$ \sum_{s \in S} K(s) \prod_{i=1}^n f_i(\pi_i(s)) \leq Q \prod_{i=1}^n \Big(\sum_{t_i \in T_i} f_i(t_i)^n\Big)^{1/n}$$
where $Q$ is the quantity
$$ Q := \Big(\sum_{(s_1,\dots,s_n) \in \Omega_n} K(s_1) \cdots K(s_n)\Big)^{1/n}$$
where $\Omega_n$ is the set of all tuples $(s_1,\dots,s_n) \in S^n$ such that $\pi_i(s_i) = \pi_i(s_{i+1})$ for $1 \leq i \leq n-1$.
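Since the statement takes a moment to parse, it may help to verify a small case numerically. The sketch below checks the $n=2$ case of the inequality on a random instance (all names, sizes, and data here are ad hoc choices for illustration):

```python
import math
import random

random.seed(1)
S = range(8)
T1, T2 = range(3), range(4)
pi1 = {s: random.randrange(len(T1)) for s in S}   # pi_1: S -> T_1
pi2 = {s: random.randrange(len(T2)) for s in S}   # pi_2: S -> T_2
K = {s: random.uniform(0.1, 2.0) for s in S}      # positive weight on S
f1 = {t: random.uniform(0.1, 2.0) for t in T1}    # positive function on T_1
f2 = {t: random.uniform(0.1, 2.0) for t in T2}    # positive function on T_2

lhs = sum(K[s] * f1[pi1[s]] * f2[pi2[s]] for s in S)

# Q^2 sums K(s1) K(s2) over the pairs in Omega_2, i.e. pi_1(s1) = pi_1(s2).
Q = math.sqrt(sum(K[s1] * K[s2] for s1 in S for s2 in S if pi1[s1] == pi1[s2]))
rhs = Q * math.sqrt(sum(v * v for v in f1.values())) \
        * math.sqrt(sum(v * v for v in f2.values()))

print(lhs <= rhs)  # True (the n = 2 case of the inequality)
```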
Thus for instance, the inequality is trivial for $n=1$. When $n=2$, the inequality reads
$$ \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s)) \leq \Big(\sum_{(s_1,s_2) \in S^2: \pi_1(s_1) = \pi_1(s_2)} K(s_1) K(s_2)\Big)^{1/2} \Big(\sum_{t_1 \in T_1} f_1(t_1)^2\Big)^{1/2} \Big(\sum_{t_2 \in T_2} f_2(t_2)^2\Big)^{1/2}$$
which is easily proven by Cauchy-Schwarz, while for $n=3$ the inequality reads
$$ \sum_{s \in S} K(s) f_1(\pi_1(s)) f_2(\pi_2(s)) f_3(\pi_3(s)) \leq \Big(\sum_{(s_1,s_2,s_3) \in \Omega_3} K(s_1) K(s_2) K(s_3)\Big)^{1/3} \prod_{i=1}^3 \Big(\sum_{t_i \in T_i} f_i(t_i)^3\Big)^{1/3}$$
which can also be proven by elementary means. However even for $n=3$, the existing proofs require the “tensor power trick” in order to reduce to the case when the $f_i$ are step functions (in which case the inequality can be proven elementarily, as discussed in the above paper of Carbery).

We now prove this inequality. We write $K = \exp(k)$ and $f_i = \exp(g_i)$ for some functions $k: S \to {\bf R}$ and $g_i: T_i \to {\bf R}$. If we take logarithms in the inequality to be proven and apply Lemma 1, the inequality becomes
$$ \sup_X \Big( E\Big[k(X) + \sum_{i=1}^n g_i(\pi_i(X))\Big] + H[X] \Big) \leq \frac{1}{n} \sup_{(X_1,\dots,X_n)} \Big( E\Big[\sum_{j=1}^n k(X_j)\Big] + H[X_1,\dots,X_n] \Big) + \frac{1}{n} \sum_{i=1}^n \sup_{Y_i} \Big( n E[g_i(Y_i)] + H[Y_i] \Big)$$
where $X$ ranges over random variables taking values in $S$, $(X_1,\dots,X_n)$ ranges over tuples of random variables taking values in $\Omega_n$, and each $Y_i$ ranges over random variables taking values in $T_i$. Comparing the suprema, the claim now reduces to
Lemma 3 (Conditional expectation computation) Let $X$ be an $S$-valued random variable. Then there exists an $\Omega_n$-valued random variable $(X_1,\dots,X_n)$, where each $X_j$, $1 \leq j \leq n$, has the same distribution as $X$, and
$$ H[X_1,\dots,X_n] = n H[X] - \sum_{i=1}^{n-1} H[\pi_i(X)].$$
Proof: We induct on $n$. When $n=1$ we just take $X_1 := X$. Now suppose that $n \geq 2$, and the claim has already been proven for $n-1$, thus one has already obtained a tuple $(X_1,\dots,X_{n-1}) \in \Omega_{n-1}$ with each $X_j$, $1 \leq j \leq n-1$, having the same distribution as $X$, and
$$ H[X_1,\dots,X_{n-1}] = (n-1) H[X] - \sum_{i=1}^{n-2} H[\pi_i(X)].$$
By hypothesis, $X_{n-1}$ has the same distribution as $X$. For each value $t$ attained by $\pi_{n-1}(X_{n-1})$, we can take conditionally independent copies of $(X_1,\dots,X_{n-1})$ and $X$ conditioned to the events $\pi_{n-1}(X_{n-1}) = t$ and $\pi_{n-1}(X) = t$ respectively, and then concatenate them to form a tuple $(X_1,\dots,X_n)$ in $\Omega_n$, with $X_n$ a further copy of $X$ that is conditionally independent of $(X_1,\dots,X_{n-1})$ relative to $\pi_{n-1}(X_{n-1})$. One can then use the entropy chain rule to compute
$$ H[X_1,\dots,X_n] = H[X_1,\dots,X_{n-1}] + H[X_n | \pi_{n-1}(X_{n-1})] = H[X_1,\dots,X_{n-1}] + H[X] - H[\pi_{n-1}(X)]$$
and the claim now follows from the induction hypothesis. $\Box$

With a little more effort, one can replace $S$ (and the $T_i$) by a more general measure space (and use differential entropy in place of Shannon entropy), to recover Carbery's inequality in full generality; we leave the details to the interested reader.
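The coupling in the proof of Lemma 3 can also be checked concretely in the first nontrivial case $n=2$: taking $X_1, X_2$ to be conditionally independent copies of $X$ given the common value of $\pi_1$, the chain rule gives $H[X_1,X_2] = 2H[X] - H[\pi_1(X)]$. A Python sketch with a toy distribution (the specific numbers are ad hoc):

```python
import math
from collections import defaultdict

def H(dist):
    """Shannon entropy of a distribution given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

pX = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # distribution of X on S = {0,1,2,3}
pi = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}   # a map pi_1: S -> T_1

# Distribution of pi_1(X).
pT = defaultdict(float)
for s, p in pX.items():
    pT[pi[s]] += p

# (X_1, X_2): conditionally independent copies of X given pi_1(X),
# concatenated so that pi_1(X_1) = pi_1(X_2) always holds.
joint = {(s1, s2): p1 * p2 / pT[pi[s1]]
         for s1, p1 in pX.items() for s2, p2 in pX.items() if pi[s1] == pi[s2]}

lhs_H = H(joint)
rhs_H = 2 * H(pX) - H(dict(pT))
print(abs(lhs_H - rhs_H) < 1e-9)  # True: H[X1,X2] = 2 H[X] - H[pi_1(X)]
```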
30 comments
11 December, 2023 at 1:04 pm
Anonymous
Thanks Prof Tao for this useful information. A new inequality might be the key to solving some difficult problems. Dr. Hayes
11 December, 2023 at 8:46 pm
Anonymous
“Kullback-Leibler divergence” is the correct spelling.
[Corrected, thanks – T.]
11 December, 2023 at 8:48 pm
Anonymous
Note: The comment software does not appear to allow for any identification of the commenter, but just prints the disembodied comment and attributes it to “Anonymous”.
I hope this can be fixed soon.
12 December, 2023 at 10:34 am
Anonymous
It won’t be fixed. It’s the new feature.
13 December, 2023 at 12:48 am
Anonymous
Who the f*ck is Anonymous????? … Terry Tao you might as well rename your blog to that of your gutless faceless proof-reader who persists in correcting every single tiny little unimportant spelling mistake for the simple reason of totally dominating this world renowned blog with his petulant presence … my name is Jenda Vondra and I’m from Australia and we Aussies can only take so much crap!
8 February, 2024 at 1:16 pm
Anonymous
What the heck bro 😭
12 December, 2023 at 12:40 am
Anonymous
We’re going to need a bigger number.
12 December, 2023 at 10:32 am
Anonymous
Yes, much bigger.
17 December, 2023 at 12:31 pm
Anonymous
even bigger than that
12 December, 2023 at 3:31 am
Tony Carbery
Thanks Terry, it’s great to see this inequality in a wider context!
12 December, 2023 at 7:45 am
Terence Tao
Now I remember where I saw this trick before – it appears in this paper of Carlen and Cordero-Erausquin, in which they demonstrate using the Gibbs variational formula that the Brascamp-Lieb inequality is equivalent to an entropy version of that inequality. As the above example demonstrates, this equivalence extends to a more general class of inequalities than the classical Brascamp-Lieb inequalities (in that paper, for instance, they also note that it applies to spherical analogues of that inequality).
12 December, 2023 at 9:01 am
Anonymous
Is this apropos of anything? Does it have anything to do with the proof of PFR, beyond “this is an inequality about entropy that should be more widely known”?
12 December, 2023 at 10:36 am
Anonymous
Why does it have to be apropos of anything?
14 December, 2023 at 6:17 pm
Anonymous
In this comment, Prof. Tao explains that he read a paper about this inequality and decided to write a blog post.
15 December, 2023 at 6:00 pm
YahyaAA1
“… the inequality can be rewritten as …”
Surely, at this point, we have an equation, rather than an inequality? That’s what the “=” sign means usually. Unless, of course, it has a different meaning in the calc. of variations …
25 December, 2023 at 4:08 pm
Jose Alvarado
Thank you, Prof. Tao, for sharing this interesting approach.
I apologize if the validity of the following comment seems obvious, but it appears that we can deduce (from the proof of Lemma 3) a similar lemma with extra flexibility. The statement is as follows:
“
”
Does that make sense?
26 December, 2023 at 4:32 pm
Terence Tao
I think one may need a hypothesis that and have the same distribution in order to reach this conclusion. (For instance, if and take completely disjoint values, then is empty and the claim cannot hold.)
26 December, 2023 at 7:57 pm
JoseDAlvarado
Oops… Of course, you are right! It is imperative to add a hypothesis concerning the random variables and . Thanks for the clarification :)
11 January, 2024 at 8:33 am
Joseph Malkoun
I have a small comment. If you have, say, only 2 real random variables, say and , and you have data points in the -plane, for , one can, say, form the matrix containing as its -th column. In order to talk about the sample covariance matrix of and , one could of course follow the probability definition, but it also can be viewed geometrically in 2 steps: removing the mean value of from , for example, corresponds to orthogonal projection onto the orthogonal complement of ( components) in . The covariance matrix is then the Gram matrix of the orthogonal projections (as in my previous sentence) of the 2 rows of .
Thus, in some sense, the language of probability is dual to the language of geometry. It is a bit like deciding whether to look at as points in , or as points in , which are of course dual points of view.
My remark is trivial, and of course known, but I wonder if it can be, or perhaps has been already made into a systematic transform/duality.
26 January, 2024 at 10:26 am
shannon7774
This looks a bit like reflection-positivity, maybe just because that uses Cauchy-Schwarz. Is there a quantum analogue, if you replace expectations with respect to measures by states, and replace random variables by self adjoint operators?
1 March, 2024 at 5:22 am
Jas, the Physicist
Yes.
8 February, 2024 at 2:44 am
Anonymous
Sir please explain the sphere eversion, I kindly request to you, thanking you sir with huge respect
8 February, 2024 at 1:15 pm
Anonymous
Out of context, but how would the Gibbs variational formula work for measure-theoretic entropy? I don't see it properly defined anywhere except this blog
20 February, 2024 at 1:51 pm
Anonymous
Hey big bro you have to contribute to mathstackexchange the poor kids are trying to learn
24 April, 2024 at 11:00 am
Anonymous
Lmao yes nationalize math education
21 February, 2024 at 6:44 am
T Retsu
In the proof of Lemma 1, the inequality can be rewritten as:
.
There isn’t any in the .
[Actually, the identity was correct as originally written: note that is equal to -T.]
24 March, 2024 at 9:11 pm
Anonymous
Many thanks for the answer. It is correct. Lemma 1 is proven.
A question about the terminology:
convex conjugate
instead of “conjugate convex”
26 February, 2024 at 1:02 am
Anonymous
Dear prof. Tao,
Do you know in what regards these generalized inequalities could be lifted to infinite (perhaps uncountable sets S)?
7 March, 2024 at 6:05 pm
Anonymous
Thank you Prof Tao for working on cybersecurity for the federal government, I can’t wait to see your new work on lie algebra and prime numbers.
11 April, 2024 at 1:32 pm
Anonymous
“working for the federal government” aka being a slave of the state