Some notes on Bakry-Emery theory

5 February, 2013 in expository, math.AP, math.PR | Tags: Bakry-Emery theory, heat equation, Ornstein-Uhlenbeck process, random matrices | by Terence Tao

[These are notes intended mostly for myself, as these topics are useful in random matrix theory, but may be of interest to some readers also. -T.]

One of the most fundamental partial differential equations in mathematics is the heat equation

$\displaystyle \partial_t f = L f \ \ \ \ \ (1)$

where ${f: [0,+\infty) \times {\bf R}^n \rightarrow {\bf R}}$ is a scalar function ${(t,x) \mapsto f(t,x)}$ of both time and space, and ${L}$ is the Laplacian ${L := \frac{1}{2} \Delta = \frac{1}{2} \sum_{i=1}^n \frac{\partial^2}{\partial x_i^2}}$ . For the purposes of this post, we will ignore all technical issues of regularity and decay, and always assume that the solutions to equations such as (1) have all the regularity and decay needed to justify all formal operations such as the chain rule, integration by parts, or differentiation under the integral sign. The factor of ${\frac{1}{2}}$ in the definition of the heat propagator ${L}$ is of course an arbitrary normalisation, chosen for some minor technical reasons; one can certainly continue the discussion below with other choices of normalisations if desired.

In probability theory, this equation takes on particular significance when ${f}$ is restricted to be non-negative, and furthermore to be a probability measure at each time, in the sense that

$\displaystyle \int_{{\bf R}^n} f(t,x)\ dx = 1$

for all ${t}$ . (Actually, it suffices to verify this constraint at time ${t=0}$ , as the heat equation (1) will then preserve this constraint.) Indeed, in this case, one can interpret ${f(t,x)\ dx}$ as the probability distribution of a Brownian motion

$\displaystyle dx = dB(t) \ \ \ \ \ (2)$

where ${x = x(t) \in {\bf R}^n}$ is a stochastic process with initial probability distribution ${f(0,x)\ dx}$ ; see for instance this previous blog post for more discussion.

A model example of a solution to the heat equation to keep in mind is that of the fundamental solution

$\displaystyle G(t,x) = \frac{1}{(2\pi t)^{n/2}} e^{-|x|^2/2t} \ \ \ \ \ (3)$

defined for any ${t>0}$ , which represents the distribution of Brownian motion of a particle starting at the origin ${x=0}$ at time ${t=0}$ . At time ${t}$ , ${G(t,x)}$ represents an ${{\bf R}^n}$ -valued random variable, each coefficient of which is an independent random variable of mean zero and variance ${t}$ . (As ${t \rightarrow 0^+}$ , ${G(t)}$ converges in the sense of distributions to a Dirac mass at the origin.)
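
As a quick sanity check, one can confirm numerically that (3) solves (1). The following is a minimal sketch, assuming dimension ${n=1}$ and using central finite differences; the function names, step sizes and sample points are illustrative choices rather than anything canonical.

```python
import math

def G(t, x):
    """Fundamental solution (3) in dimension n = 1."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def heat_residual(t, x, ht=1e-4, hx=1e-3):
    """Central-difference estimate of dG/dt - (1/2) d^2G/dx^2 at (t, x)."""
    dt_G = (G(t + ht, x) - G(t - ht, x)) / (2.0 * ht)
    dxx_G = (G(t, x + hx) - 2.0 * G(t, x) + G(t, x - hx)) / (hx * hx)
    return dt_G - 0.5 * dxx_G

# Arbitrary sample points, chosen only for illustration.
residuals = [heat_residual(t, x)
             for t in (0.5, 1.0, 2.0)
             for x in (-1.0, 0.0, 0.7)]
max_residual = max(abs(r) for r in residuals)
print(max_residual)
```

The printed residual should be small, limited only by the finite-difference truncation error.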

The heat equation can also be viewed as the gradient flow for the Dirichlet form

$\displaystyle D(f,g) := \frac{1}{2} \int_{{\bf R}^n} \nabla f \cdot \nabla g\ dx \ \ \ \ \ (4)$

since one has the integration by parts identity

$\displaystyle \int_{{\bf R}^n} Lf(x) g(x)\ dx = \int_{{\bf R}^n} f(x) Lg(x)\ dx = - D(f,g) \ \ \ \ \ (5)$

for all smooth, rapidly decreasing ${f,g}$ , which formally implies that ${L f}$ is (half of) the negative gradient of the Dirichlet energy ${D(f,f) = \frac{1}{2} \int_{{\bf R}^n} |\nabla f|^2\ dx}$ with respect to the ${L^2({\bf R}^n,dx)}$ inner product. Among other things, this implies that the Dirichlet energy decreases in time:

$\displaystyle \partial_t D(f,f) = - 2 \int_{{\bf R}^n} |Lf|^2\ dx. \ \ \ \ \ (6)$

For instance, for the fundamental solution (3), one can verify for any time ${t>0}$ that

$\displaystyle D(G,G) = \frac{n}{2^{n+2} \pi^{n/2}} \frac{1}{t^{(n+2)/2}} \ \ \ \ \ (7)$

(assuming I have not made a mistake in the calculation). In a similar spirit we have

$\displaystyle \partial_t \int_{{\bf R}^n} |f|^2\ dx = - 2 D(f,f). \ \ \ \ \ (8)$

Since ${D(f,f)}$ is non-negative, the formula (6) implies that ${\int_{{\bf R}^n} |Lf|^2\ dx}$ is integrable in time, and in particular we see that ${Lf}$ converges to zero as ${t \rightarrow \infty}$ , in some averaged ${L^2}$ sense at least; similarly, (8) suggests that ${D(f,f)}$ also converges to zero. This suggests that ${f}$ converges to a constant function; but as ${f}$ is also supposed to decay to zero at spatial infinity, we thus expect solutions to the heat equation in ${{\bf R}^n}$ to decay to zero in some sense as ${t \rightarrow \infty}$ . However, the decay is only expected to be polynomial in nature rather than exponential; for instance, the solution (3) decays in the ${L^\infty}$ norm like ${O(t^{-n/2})}$ .
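
The closed form (7) can be checked by quadrature. Below is a minimal sketch, assuming ${n=1}$ and a plain midpoint rule on a window of twelve standard deviations; the helper names and the quadrature parameters are assumptions made for this illustration.

```python
import math

def G(t, x):
    """Fundamental solution (3) in dimension n = 1."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def dirichlet_energy_G(t, half_width=12.0, steps=20000):
    """Midpoint-rule estimate of D(G,G) = (1/2) * int |dG/dx|^2 dx for n = 1."""
    L = half_width * math.sqrt(t)      # integrate over [-L, L]
    h = 2.0 * L / steps
    total = 0.0
    for k in range(steps):
        x = -L + (k + 0.5) * h
        dG = -(x / t) * G(t, x)        # exact x-derivative of (3)
        total += dG * dG * h
    return 0.5 * total

def formula_7(t, n=1):
    """Right-hand side of (7)."""
    return n / (2 ** (n + 2) * math.pi ** (n / 2.0) * t ** ((n + 2) / 2.0))

for t in (0.5, 1.0, 3.0):
    print(t, dirichlet_energy_G(t), formula_7(t))
```

In one dimension the two printed columns should agree to many digits, which also exhibits the polynomial decay rate ${t^{-3/2}}$ of the Dirichlet energy.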

Since ${L1=0}$ , we also observe the basic cancellation property

$\displaystyle \int_{{\bf R}^n} Lf(x) \ dx = 0 \ \ \ \ \ (9)$

for any function ${f}$ .

There are other quantities relating to ${f}$ that also decrease in time under heat flow, particularly in the important case when ${f}$ is a probability measure. In this case, it is natural to introduce the entropy

$\displaystyle S(f) := \int_{{\bf R}^n} f(x) \log f(x)\ dx.$

Thus, for instance, if ${f(x)\ dx}$ is the uniform distribution on some measurable subset ${E}$ of ${{\bf R}^n}$ of finite measure ${|E|}$ , the entropy would be ${-\log |E|}$ . Intuitively, as the entropy decreases, the probability distribution gets wider and flatter. For instance, in the case of the fundamental solution (3), one has ${S(G) = -\frac{n}{2} \log( 2 \pi e t )}$ for any ${t>0}$ , reflecting the fact that ${G(t)}$ is approximately uniformly distributed on a ball of radius ${O(\sqrt{t})}$ (and thus of measure ${O(t^{n/2})}$ ).

A short formal computation shows (if one assumes for simplicity that ${f}$ is strictly positive, which is not an unreasonable hypothesis, particularly in view of the strong maximum principle) using (9), (5) that

$\displaystyle \partial_t S(f) = \int_{{\bf R}^n} (Lf) \log f + f \frac{Lf}{f}\ dx$

$\displaystyle = \int_{{\bf R}^n} (Lf) \log f\ dx$

$\displaystyle = - D( f, \log f )$

$\displaystyle = - \frac{1}{2} \int_{{\bf R}^n} \frac{|\nabla f|^2}{f}\ dx$

$\displaystyle = - 4D( g, g )$

where ${g := \sqrt{f}}$ is the square root of ${f}$ . For instance, if ${f}$ is the fundamental solution (3), one can check that ${D(g,g) = \frac{n}{8t}}$ (note that this is a significantly cleaner formula than (7)!).

In particular, the entropy is decreasing, which corresponds well to one’s intuition that the heat equation (or Brownian motion) should serve to spread out a probability distribution over time.
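
The entropy formula for the fundamental solution and the identity ${\partial_t S(f) = -4D(g,g)}$ can both be tested numerically. This is a rough sketch, assuming ${n=1}$ and midpoint quadrature; all function names and the choice ${t = 1.5}$ are illustrative assumptions.

```python
import math

def G(t, x):
    """Fundamental solution (3) in dimension n = 1."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)

def integrate(func, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return sum(func(a + (k + 0.5) * h) for k in range(steps)) * h

def entropy_G(t):
    """S(G(t)) = int G log G dx, truncated to twelve standard deviations."""
    L = 12.0 * math.sqrt(t)
    return integrate(lambda x: G(t, x) * math.log(G(t, x)), -L, L)

def dirichlet_sqrt_G(t):
    """D(g,g) for g = sqrt(G); here g' = -(x / (2t)) g, so |g'|^2 = x^2 G / (4 t^2)."""
    L = 12.0 * math.sqrt(t)
    return 0.5 * integrate(lambda x: x * x * G(t, x) / (4.0 * t * t), -L, L)

t = 1.5
S_numeric = entropy_G(t)
S_exact = -0.5 * math.log(2.0 * math.pi * math.e * t)
h = 1e-3
dS_dt = (entropy_G(t + h) - entropy_G(t - h)) / (2.0 * h)
print(S_numeric, S_exact)
print(dS_dt, -4.0 * dirichlet_sqrt_G(t), -1.0 / (2.0 * t))
```

The last line compares a finite-difference time derivative of the entropy with ${-4D(g,g)}$ and with the closed form ${-n/(2t)}$ for ${n=1}$.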

Actually, one can say more: the rate of decrease ${4D(g,g)}$ of the entropy is itself decreasing, or in other words the entropy is convex. I do not have a satisfactorily intuitive reason for this phenomenon, but it can be proved by straightforward application of basic several variable calculus tools (such as the chain rule, product rule, quotient rule, and integration by parts), and completing the square. Namely, by using the chain rule we have

$\displaystyle L \phi(f) = \phi'(f) Lf + \frac{1}{2} \phi''(f) |\nabla f|^2, \ \ \ \ \ (10)$

valid for any smooth function ${\phi: {\bf R} \rightarrow {\bf R}}$ . Applying this with ${\phi(g) = g^2}$ , so that ${f = \phi(g)}$ for ${g := \sqrt{f}}$ , we see from (1) that

$\displaystyle 2 g \partial_t g = 2 g L g + |\nabla g|^2$

and thus (again assuming that ${f}$ , and hence ${g}$ , is strictly positive to avoid technicalities)

$\displaystyle \partial_t g = Lg + \frac{|\nabla g|^2}{2g}.$

We thus have

$\displaystyle \partial_t D(g,g) = 2 D(g,Lg) + D(g, \frac{|\nabla g|^2}{g} ).$

It is now convenient to compute using the Einstein summation convention to hide the summation over indices ${i,j = 1,\ldots,n}$ . We have

$\displaystyle 2 D(g,Lg) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) (\partial_i \partial_j \partial_j g)\ dx$

and

$\displaystyle D(g, \frac{|\nabla g|^2}{g} ) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) \partial_i \frac{\partial_j g \partial_j g}{g}\ dx.$

By integration by parts and interchanging partial derivatives, we may write the first integral as

$\displaystyle 2 D(g,Lg) = - \frac{1}{2} \int_{{\bf R}^n} (\partial_i \partial_j g) (\partial_i \partial_j g)\ dx,$

and from the quotient and product rules, we may write the second integral as

$\displaystyle D(g, \frac{|\nabla g|^2}{g} ) = \int_{{\bf R}^n} \frac{(\partial_i g) (\partial_j g) (\partial_i \partial_j g)}{g} - \frac{(\partial_i g) (\partial_j g) (\partial_i g) (\partial_j g)}{2g^2}\ dx.$

Gathering terms, completing the square, and making the summations explicit again, we see that

$\displaystyle \partial_t D(g,g) =- \frac{1}{2} \int_{{\bf R}^n} \frac{\sum_{i=1}^n \sum_{j=1}^n |g \partial_i \partial_j g - (\partial_i g) (\partial_j g)|^2}{g^2}\ dx$

and so in particular ${D(g,g)}$ is always decreasing.

The above identity can also be written as

$\displaystyle \partial_t D(g,g) = - \frac{1}{2} \int_{{\bf R}^n} |\nabla^2 \log g|^2 g^2\ dx.$
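
For readers who want the intermediate step spelled out, here is a sketch of the "gathering terms" computation in the Einstein summation convention, together with the quotient rule identity that turns the integrand into the Hessian of ${\log g}$ . Nothing beyond the formulas already displayed is assumed.

```latex
\begin{align*}
\partial_t D(g,g)
&= -\frac{1}{2} \int_{{\bf R}^n} \Big[ (\partial_i \partial_j g)(\partial_i \partial_j g)
   - \frac{2 (\partial_i g)(\partial_j g)(\partial_i \partial_j g)}{g}
   + \frac{(\partial_i g)(\partial_j g)(\partial_i g)(\partial_j g)}{g^2} \Big]\ dx \\
&= -\frac{1}{2} \int_{{\bf R}^n}
   \frac{\big( g\, \partial_i \partial_j g - (\partial_i g)(\partial_j g) \big)
         \big( g\, \partial_i \partial_j g - (\partial_i g)(\partial_j g) \big)}{g^2}\ dx, \\
\partial_i \partial_j \log g
&= \partial_i \frac{\partial_j g}{g}
 = \frac{g\, \partial_i \partial_j g - (\partial_i g)(\partial_j g)}{g^2}.
\end{align*}
```

Multiplying the square of the last expression by ${g^2}$ recovers the integrand of the previous line.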

Exercise 1 Give an alternate proof of the above identity by writing ${f = e^{2u}}$ , ${g = e^u}$ and deriving the equation ${\partial_t u = Lu + |\nabla u|^2}$ for ${u}$ .

It was observed in a well known paper of Bakry and Emery that the above monotonicity properties hold for a much larger class of heat flow-type equations, and lead to a number of important relations between energy and entropy, such as the log-Sobolev inequality of Gross and of Federbush, and the hypercontractivity inequality of Nelson; we will discuss one such family of generalisations (or more precisely, variants) below the fold.

We modify the above examples by replacing the Lebesgue measure ${dx}$ on ${{\bf R}^n}$ by a probability measure ${\mu}$ , which (inspired by the formalism of statistical mechanics) is customarily written in the form

$\displaystyle d\mu = \frac{1}{Z} e^{-\beta H(x)}\ dx,$

where the Hamiltonian ${H: {\bf R}^n \rightarrow {\bf R}}$ is some (smooth) function that grows at a suitable rate at infinity (to make ${e^{-\beta H}}$ have finite mass), ${\beta > 0}$ is a scaling parameter (sometimes referred to as the inverse temperature), and ${Z = \int_{{\bf R}^n} e^{-\beta H}\ dx}$ is a normalisation constant (known as the partition function). A model example here is that of the Gaussian distribution

$\displaystyle d\mu = \frac{1}{(2\pi)^{n/2}} e^{-|x|^2/2}\ dx \ \ \ \ \ (11)$

which would correspond to the harmonic oscillator Hamiltonian ${H(x) = |x|^2/2}$ with ${\beta=1}$ , or of a random Gaussian vector in which each coefficient independently has a normal distribution of mean zero and variance one. (For the purposes of this blog post, one could combine the ${\beta}$ and ${H}$ quantities into a single object if one wished, effectively normalising ${\beta}$ to equal ${1}$ , but there can be some minor advantages to using a different normalisation to ${\beta}$ for some applications, for instance if one is interested in large dimensional limits ${n \rightarrow \infty}$ .)

While we are altering the base measure ${\mu}$ here, we will retain the Euclidean gradient structure; one can of course consider even more general variants of heat flow (such as heat flow on Riemannian manifolds) in which we also change the gradient structure; these are also of interest for many applications, but we will not consider them here.

We can form the Dirichlet form ${D_\mu}$ relative to ${\mu}$ (but retaining the Euclidean gradient structure, as mentioned above) by the formula

$\displaystyle D_\mu(f,g) := \frac{1}{2} \int_{{\bf R}^n} \nabla f \cdot \nabla g\ d\mu$

in analogy with (4). In analogy with (5), we then have

$\displaystyle \int_{{\bf R}^n} (Lf) g\ d\mu = \int_{{\bf R}^n} f (Lg)\ d\mu = -D_\mu(f,g) \ \ \ \ \ (12)$

where ${L}$ is the differential operator

$\displaystyle L f = L_\mu f = \frac{1}{2} \Delta f - \frac{1}{2} \beta \nabla H \cdot \nabla f.$

For instance, in the Gaussian case (11), ${L}$ becomes the Ornstein-Uhlenbeck operator

$\displaystyle Lf = \frac{1}{2} (\Delta f - x \cdot \nabla f)$

(note that the normalisation factor of ${\frac{1}{2}}$ is not always present in the literature concerning this operator). Observe that ${L1=0}$ (or equivalently, that ${D_\mu(f,1)=0}$ for all reasonable ${f}$ ) and so from (12) we have the analogue of (9), namely that

$\displaystyle \int_{{\bf R}^n} Lf\ d\mu = 0 \ \ \ \ \ (13)$

for all (smooth, not too rapidly increasing) ${f}$ .

The corresponding gradient flow (now using the inner product structure of ${L^2({\bf R}^n,d\mu)}$ , rather than ${L^2({\bf R}^n, dx)}$ ) is then

$\displaystyle \partial_t f = Lf. \ \ \ \ \ (14)$

This equation becomes probabilistically meaningful if ${f\ d\mu}$ is a probability measure, thus ${f}$ is non-negative and

$\displaystyle \int_{{\bf R}^n} f\ d\mu = 1$

for all times ${t}$ (but again, it is easy to see using (13) that if this constraint holds for one time, it holds for all subsequent times also). In probabilistic terms, the probability measure ${f(t,x)\ d\mu(x)}$ then represents the probability distribution at time ${t}$ of a particle ${x: t \mapsto x(t) \in {\bf R}^n}$ with initial distribution ${f(0,x)\ d\mu(x)}$ at time zero, and obeying the stochastic differential equation

$\displaystyle dx = dB_t - \frac{1}{2} \beta \nabla H(x) dt,$

which can be viewed as a combination of Brownian motion ${dx = dB_t}$ and gradient flow ${dx = -\nabla H(x) dt}$ . For instance, in the Gaussian case (11), this equation becomes the Ornstein-Uhlenbeck process

$\displaystyle dx = dB_t - \frac{1}{2} x\ dt, \ \ \ \ \ (15)$

with the relative density ${f}$ then obeying the Ornstein-Uhlenbeck equation

$\displaystyle \partial_t f = \frac{1}{2} (\Delta f - x \cdot \nabla f). \ \ \ \ \ (16)$
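
To see the Ornstein-Uhlenbeck process (15) relax towards the Gaussian measure (11) in practice, one can simulate it. The following is a minimal sketch, assuming ${n=1}$ and an Euler-Maruyama discretisation (a standard scheme, not something used in the discussion above); the step size, path count, seed and starting point are arbitrary illustrative choices.

```python
import math
import random

def simulate_ou(x0, T, dt, n_paths, seed=0):
    """Euler-Maruyama for the 1D process dx = dB_t - (1/2) x dt, cf. (15)."""
    rng = random.Random(seed)
    steps = int(round(T / dt))
    sqrt_dt = math.sqrt(dt)
    finals = []
    for _path in range(n_paths):
        x = x0
        for _step in range(steps):
            # Brownian increment plus the drift -(1/2) x dt
            x += sqrt_dt * rng.gauss(0.0, 1.0) - 0.5 * x * dt
        finals.append(x)
    return finals

samples = simulate_ou(x0=3.0, T=8.0, dt=0.02, n_paths=4000)
mean = sum(samples) / len(samples)
second_moment = sum(x * x for x in samples) / len(samples)
print(mean, second_moment)
```

After a time of several units, the empirical mean should be close to zero and the empirical second moment close to one, up to Monte Carlo and discretisation error.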

As before, we have a number of monotonicity formulae for the heat flow (14), such as

$\displaystyle \partial_t \int_{{\bf R}^n} |f|^2\ d\mu = - 2 D_\mu(f,f)$

and

$\displaystyle \partial_t D_\mu(f,f) = - 2\int_{{\bf R}^n} |Lf|^2\ d\mu,$

which strongly suggests (as before) that ${f}$ should be converging as ${t \rightarrow \infty}$ to a function with ${D_\mu(f,f)=0}$ , i.e. to a constant function. In particular, as ${\mu}$ is normalised to be a probability measure, and ${f\ d\mu}$ is also a probability measure, then we expect ${f}$ to converge to ${1}$ .

One way to quantify this convergence is to introduce the entropy

$\displaystyle S_\mu(f) := \int_{{\bf R}^n} f \log f\ d\mu.$

If ${\mu}$ and ${f\ d\mu}$ are both probability measures, then a simple application of Jensen’s inequality gives us that ${S_\mu(f)}$ is non-negative. Indeed, we can say more:

Lemma 1 (Entropy controls total variation) We have ${(\int_{{\bf R}^n} |f-1|\ d\mu)^2 \leq 2 S_\mu(f)}$ .

This inequality is also known as Pinsker’s inequality.

Proof: We first give a crude argument that loses a factor of ${4}$ . As ${x \log x}$ has a second derivative of ${\frac{1}{x}}$ , we see from Taylor expansion at ${1}$ that

$\displaystyle x \log x \geq (x-1) + \frac{1}{2} (1-x)_+^2$

for any ${x \geq 0}$ , and hence on replacing ${x}$ by ${f(x)}$ and integrating against ${\mu}$ , we see that

$\displaystyle S_\mu(f) \geq \frac{1}{2} \int_{{\bf R}^n} (1-f)_+^2\ d\mu$

and hence by Cauchy-Schwarz

$\displaystyle \int_{{\bf R}^n} (1-f)_+\ d\mu \leq \sqrt{2} S_\mu(f)^{1/2}$

and hence since ${f}$ has mean ${1}$ with respect to ${\mu}$

$\displaystyle \int_{{\bf R}^n} |f-1|\ d\mu \leq 2 \sqrt{2} S_\mu(f)^{1/2}$

which on squaring gives the bound with a loss of ${4}$ . To eliminate the loss, we use an argument of Kemperman (which also appears in a subsequent paper of Diaconis and Saloff-Coste), and use the more precise inequality

$\displaystyle x \log x \geq (x-1) + \frac{(x-1)^2}{2+2(x-1)/3}$

which can be verified by elementary calculus (indeed, this function vanishes to fourth order at ${x=1}$ and has the non-negative second derivative of ${\frac{(x-1)^2(x+8)}{x(x+2)^3}}$ ). We then conclude that

$\displaystyle S_\mu(f) \geq \int_{{\bf R}^n} \frac{|f-1|^2}{2 + 2(f-1)/3}\ d\mu;$

since ${\int_{{\bf R}^n} 2+2(f-1)/3\ d\mu = 2}$ , the claim then follows from the Cauchy-Schwarz inequality. $\Box$
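
Both the pointwise inequality used in the proof and the conclusion of Lemma 1 are easy to test on examples. Here is a small sketch, assuming a toy probability space of four equally weighted points in place of ${{\bf R}^n}$ ; the density values and the grid are hypothetical choices made for illustration.

```python
import math

def kemperman_gap(x):
    """x log x - (x - 1) - (x - 1)^2 / (2 + 2 (x - 1) / 3), with 0 log 0 := 0."""
    xlogx = x * math.log(x) if x > 0.0 else 0.0
    return xlogx - (x - 1.0) - (x - 1.0) ** 2 / (2.0 + 2.0 * (x - 1.0) / 3.0)

def entropy_and_l1(mu, f):
    """Return (S_mu(f), int |f - 1| d mu) on a finite probability space."""
    S = sum(m * (v * math.log(v) if v > 0.0 else 0.0) for m, v in zip(mu, f))
    l1 = sum(m * abs(v - 1.0) for m, v in zip(mu, f))
    return S, l1

# Pointwise inequality sampled on a grid of x in [0, 20].
min_gap = min(kemperman_gap(k / 100.0) for k in range(0, 2001))

# Toy example: mu uniform on 4 points, f a density relative to mu (mean 1).
mu = [0.25, 0.25, 0.25, 0.25]
f = [0.2, 0.6, 1.2, 2.0]
S, l1 = entropy_and_l1(mu, f)
print(min_gap, l1 ** 2, 2.0 * S)
```

The first printed number should be non-negative up to rounding, and the second should not exceed the third.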

The computations used to compute the time derivative of ${S(f)}$ also work for ${S_\mu(f)}$ , leading to the identity

$\displaystyle \partial_t S_\mu(f) = - 4D_\mu(g,g) \ \ \ \ \ (17)$

where ${f=g^2}$ . The formula for the time derivative for ${D_\mu(g,g)}$ is a bit more interesting. The chain rule (10) still holds for ${L}$ (why?), and so we still have

$\displaystyle \partial_t g = Lg + \frac{|\nabla g|^2}{2g}$

and thus

$\displaystyle \partial_t D_\mu(g,g) = 2 D_\mu(g,Lg) + D_\mu(g, \frac{|\nabla g|^2}{g} ).$

As in the Euclidean case, we have

$\displaystyle D_\mu(g, \frac{|\nabla g|^2}{g} ) = \int_{{\bf R}^n} \frac{(\partial_i g) (\partial_j g) (\partial_i \partial_j g)}{g} - \frac{(\partial_i g) (\partial_j g) (\partial_i g) (\partial_j g)}{2g^2}\ d\mu(x). \ \ \ \ \ (18)$

The situation with ${D_\mu(g,Lg)}$ is however a bit more complicated:

$\displaystyle 2 D_\mu(g,Lg) = \frac{1}{2} \int_{{\bf R}^n} (\partial_i g) (\partial_i \partial_j \partial_j g)\ d\mu - \frac{\beta}{2} \int_{{\bf R}^n} (\partial_i g) \partial_i( (\partial_j H) (\partial_j g) )\ d\mu.$

If we perform integration by parts on the first integral, it becomes

$\displaystyle - \frac{1}{2} \int_{{\bf R}^n} (\partial_i \partial_j g) (\partial_i \partial_j g)\ d\mu(x) + \frac{\beta}{2} \int_{{\bf R}^n} (\partial_i g) (\partial_i \partial_j g) (\partial_j H)\ d\mu;$

this partially cancels the second integral, leading to the formula

$\displaystyle 2 D_\mu(g,Lg) = - \frac{1}{2} \int_{{\bf R}^n} (\partial_i \partial_j g) (\partial_i \partial_j g)\ d\mu - \frac{\beta}{2} \int_{{\bf R}^n} (\partial_i \partial_j H) (\partial_i g) (\partial_j g)\ d\mu.$

Combining this with (18) and completing the square as in the Euclidean case, we conclude that

$\displaystyle \partial_t D_\mu(g,g) = - \frac{1}{2} \int_{{\bf R}^n} |\nabla^2 \log g|^2 g^2\ d\mu - \frac{\beta}{2} \int_{{\bf R}^n} \nabla^2 H( \nabla g, \nabla g)\ d\mu$

where ${\nabla^2 H}$ is the Hessian of ${H}$ :

$\displaystyle \nabla^2 H( u, v ) := (\partial_i \partial_j H) u_i v_j.$

In particular, we have the Bakry-Emery inequality

$\displaystyle \partial_t D_\mu(g,g) \leq - \frac{\beta}{2} \int_{{\bf R}^n} \nabla^2 H( \nabla g, \nabla g)\ d\mu. \ \ \ \ \ (19)$

If ${H}$ is convex, then this implies that ${D_\mu(g,g)}$ is non-increasing, much as in the Euclidean case. However, the situation improves significantly if ${H}$ is (uniformly) strictly convex in the sense that

$\displaystyle \nabla^2 H( u, u ) \geq \rho |u|^2$

for some fixed ${\rho>0}$ . In this case, we have

$\displaystyle \partial_t D_\mu(g,g) \leq -\beta\rho D_\mu(g,g)$

and thus by Gronwall’s inequality we have the exponential decay

$\displaystyle D_\mu(g,g) \leq \exp( - \beta \rho t ) D_\mu(g(0),g(0)) \ \ \ \ \ (20)$

for all ${t \geq 0}$ . This suggests that ${g}$ (and hence ${f}$ ) relaxes exponentially fast to the equilibrium state ${1}$ , with the time of relaxation being of the order of ${O(1/(\rho \beta))}$ . In particular, we expect ${S_\mu(f)}$ to converge to ${S_\mu(1)=0}$ as ${t \rightarrow \infty}$ . From (17) and the fundamental theorem of calculus we therefore have (formally at least) that

$\displaystyle S_\mu(f(0)) = 4\int_0^\infty D_\mu(g(t),g(t))\ dt$

and hence from (20) we have the log-Sobolev inequality

$\displaystyle S_\mu(f(0)) \leq \frac{4}{\beta \rho} D_\mu(g(0),g(0))$

or to put in a more familiar way, we have

$\displaystyle \int_{{\bf R}^n} g^2 \log g^2\ d\mu \leq \frac{2}{\beta \rho} \int_{{\bf R}^n} |\nabla g|^2\ d\mu \ \ \ \ \ (21)$

whenever ${\int_{{\bf R}^n} g^2\ d\mu = 1}$ .
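
The inequality (21) can be tested in the Gaussian case (11), where ${\beta = \rho = 1}$ . The sketch below assumes ${n=1}$ and midpoint quadrature, and tries two hypothetical test functions: a tilted Gaussian density ${g^2 = e^{ax - a^2/2}}$ (for which both sides work out to ${a^2/2}$ , so equality is expected) and ${g = |x|}$ . The helper names and the value of ${a}$ are illustrative.

```python
import math

def gauss_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate_mu(func, half_width=12.0, steps=40000):
    """Midpoint rule for the integral of func against the measure (11), n = 1."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h   # nodes never hit x = 0 exactly
        total += func(x) * gauss_density(x) * h
    return total

def log_sobolev_sides(g, dg):
    """Return (lhs, rhs) of (21) with beta = rho = 1, assuming int g^2 d mu = 1."""
    lhs = integrate_mu(lambda x: g(x) ** 2 * math.log(g(x) ** 2))
    rhs = 2.0 * integrate_mu(lambda x: dg(x) ** 2)
    return lhs, rhs

a = 1.3  # hypothetical tilt parameter

def g1(x):
    return math.exp(0.5 * a * x - 0.25 * a * a)

def dg1(x):
    return 0.5 * a * g1(x)

def g2(x):
    return abs(x)

def dg2(x):
    return 1.0 if x > 0 else -1.0

lhs1, rhs1 = log_sobolev_sides(g1, dg1)
lhs2, rhs2 = log_sobolev_sides(g2, dg2)
print(lhs1, rhs1)
print(lhs2, rhs2)
```

In the first example the two sides should coincide up to quadrature error, which suggests that the constant in (21) cannot be improved in the Gaussian case; in the second the inequality should hold with room to spare.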

Exercise 2 Formally establish the exponential decay of entropy

$\displaystyle S_\mu(f(t)) \leq \exp( - \beta \rho t ) S_\mu(f(0))$

whenever ${f\ d\mu}$ is a probability measure obeying the heat equation (14).

In the Gaussian case (11), we have ${\beta=\rho=1}$ , and (21) is the classical log-Sobolev inequality of Gross and of Federbush. But we see from the above argument that log-Sobolev inequalities in fact hold for all uniformly strictly convex Hamiltonians ${H}$ , with a constant governed by the convexity parameter ${\rho}$ .

Exercise 3 Using (21), formally establish the hypercontractivity inequality

$\displaystyle \| f(t) \|_{L^{1+e^{\beta \rho t}}({\bf R}^n, d\mu)} \leq \|f(0) \|_{L^2({\bf R}^n,d\mu)} \ \ \ \ \ (22)$

whenever ${f}$ is a non-negative solution to the heat flow (14). (Hint: writing ${p(t) := 1 + e^{\beta \rho t}}$ , compute the derivative of ${\| f(t) \|_{L^{p(t)}({\bf R}^n, d\mu)}^{p(t)}}$ at some time ${t=t_0}$ , after performing the normalisation ${\| f(t_0) \|_{L^{p(t_0)}({\bf R}^n, d\mu)} = 1}$ .) Conversely, show that the hypercontractivity inequality (22) implies the log-Sobolev inequality (21). In the Gaussian case (11), the hypercontractivity inequality is due to Nelson; the equivalence between the two inequalities was first observed by Federbush.

Log-Sobolev inequalities can also be used to establish concentration of measure inequalities; see for instance this previous blog post for an instance of this.

The exponential convergence to equilibrium in the convex case should be contrasted with the polynomial convergence to zero in the Euclidean case. As always, the model Gaussian case (11) is instructive. Let us use the Brownian motion formalism, working in one dimension ${n=1}$ to keep the constants simple. In the Euclidean case, using the Brownian motion (2), and assuming mean zero ${\mathop{\bf E} x(t) = 0}$ for simplicity, we have constant variance production thanks to Ito’s formula ${\mathop{\bf E} |dB_t|^2 = dt}$ :

$\displaystyle \frac{d}{dt} \mathop{\bf E}|x(t)|^2 = 1.$

On the other hand, with the Ornstein-Uhlenbeck process (15) with mean zero, a short computation shows that the variance instead obeys the formula

$\displaystyle \frac{d}{dt} \mathop{\bf E} |x(t)|^2 = 1 - \mathop{\bf E} |x(t)|^2$

and thus ${\mathop{\bf E} |x(t)|^2 - 1}$ relaxes exponentially fast to zero. One can also see this from the explicit solution

$\displaystyle f(t,x)\ d\mu(x) = \frac{1}{(2\pi (1+e^{-t}))^{n/2}} e^{-|x|^2/2(1+e^{-t})}\ dx$

to the Ornstein-Uhlenbeck equation (16).
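
The "short computation" for the variance can be sketched as follows, assuming one dimension and using only Ito's formula together with the fact that the ${x\, dB_t}$ term has mean zero.

```latex
\begin{align*}
d(x^2) &= 2x\, dx + (dx)^2 = 2x\, dB_t - x^2\, dt + dt, \\
\frac{d}{dt} \mathop{\bf E} x(t)^2 &= 1 - \mathop{\bf E} x(t)^2, \\
\mathop{\bf E} x(t)^2 - 1 &= e^{-t} \big( \mathop{\bf E} x(0)^2 - 1 \big).
\end{align*}
```

For instance, the explicit solution displayed above has variance ${1 + e^{-t}}$ in each coordinate, which is the case ${\mathop{\bf E} x(0)^2 = 2}$ of the last line.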

The exponential convergence to equilibrium can also be understood as a probabilistic mixing inequality. Let us take the model case of a quadratic Hamiltonian attaining a minimum at ${x_0}$ :

$\displaystyle H(x) = \frac{1}{2} \rho |x-x_0|^2 + H(x_0). \ \ \ \ \ (23)$

In this case, the stochastic equation is a more general Ornstein-Uhlenbeck process

$\displaystyle dx = dB_t - \frac{1}{2} \beta \rho (x-x_0) dt.$

The natural length scale for the measure ${\mu}$ is ${O((\beta \rho)^{-1/2})}$ , since the weight ${e^{-\beta H}}$ starts decreasing faster than exponentially once one wanders more than this length scale away from the Hamiltonian minimiser ${x_0}$ . The natural scale for the drift velocity ${-\frac{1}{2} \beta \rho(x-x_0)}$ is thus ${O( (\beta \rho)^{1/2} )}$ . Over a time scale ${t=O(T)}$ , the net drift should then be ${O( (\beta \rho)^{1/2} T)}$ , while the net diffusion from the Brownian motion component ${dB_t}$ of the process should be ${O(T^{1/2})}$ . The two contributions meet at ${T \sim (\beta \rho)^{-1}}$ , which helps explain why this is the mixing time for relaxation to equilibrium; it is basically the time needed for the random portion of the dynamics to fill out the space drawn by the equilibrium.

The above Bakry-Emery theory works well when the Hamiltonian is “isotropic” in the sense that it has about the same amount of convexity in any direction (or to put it another way, the Hessian ${\nabla^2 H}$ is well-conditioned), as was the case in the quadratic examples (23). But it does not give completely satisfactory results for “non-isotropic” choices of Hamiltonian. A simple example is provided by the two-dimensional anisotropic quadratic Hamiltonian

$\displaystyle H(x_1,x_2) = \frac{1}{2} \rho_1 x_1^2 + \frac{1}{2} \rho_2 x_2^2 \ \ \ \ \ (24)$

for some ${0 < \rho_1 < \rho_2}$ with ${\beta=1}$ ; this corresponds to two decoupled Ornstein-Uhlenbeck processes, a “slow” process

$\displaystyle dx_1 = dB^1_t - \frac{1}{2} \rho_1 x_1 dt$

which takes time ${O(1/\rho_1)}$ to converge to equilibrium, and an independent “fast” process

$\displaystyle dx_2 = dB^2_t - \frac{1}{2} \rho_2 x_2 dt$

which takes the shorter time of ${O(1/\rho_2)}$ to converge to equilibrium. One should view ${x_1}$ as a “macroscopic” or “global” coordinate of a system, while ${x_2}$ is a “microscopic” or “local” coordinate. We have the strict convexity bound

$\displaystyle \nabla^2 H(u,u) = \rho_1 u_1^2 + \rho_2 u_2^2 \geq \rho_1 |u|^2 \ \ \ \ \ (25)$

which by the above theory gives convergence to equilibrium of the combined process ${(x_1,x_2)}$ in time ${O(1/\rho_1)}$ . This is the right order of magnitude if one is interested in convergence to global equilibrium, but it does not capture the fact that convergence to local equilibrium (as measured by convergence of the ${x_2}$ component) can occur at times much sooner than ${O(1/\rho_1)}$ . However, in a series of papers by Erdos, Schlein, Yau, and Yin (see also these later surveys), the local relaxation flow method was developed to improve upon the basic Bakry-Emery theory, with applications to the universality theory of random matrices. We can illustrate some of the basic ideas of this method with the simple example of the decoupled Hamiltonian (24) (though the whole point of the method, for the purpose of applications, is that it can also be applied to more complicated Hamiltonians in which the slow and fast modes are coupled together).
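
The separation of time scales is easy to quantify in this toy model. The sketch below assumes mean zero and the general stochastic equation ${dx = dB_t - \frac{1}{2} \beta \nabla H(x) dt}$ with ${\beta = 1}$ , under which Ito's formula gives the variance equation ${v' = 1 - \rho_i v}$ for each coordinate; the particular values of ${\rho_1, \rho_2}$ , the initial variance, and the one percent tolerance are hypothetical.

```python
import math

def variance(rho, v0, t):
    """Solution of v' = 1 - rho * v with v(0) = v0 (variance of one coordinate)."""
    return 1.0 / rho + (v0 - 1.0 / rho) * math.exp(-rho * t)

def relaxation_time(rho, rel_tol=0.01):
    """Time at which v(t) - 1/rho has shrunk to rel_tol times its initial size."""
    return math.log(1.0 / rel_tol) / rho

rho1, rho2 = 0.01, 100.0   # hypothetical slow and fast convexity constants
v0 = 1.0                   # hypothetical initial variance of each coordinate
t_slow = relaxation_time(rho1)
t_fast = relaxation_time(rho2)
print(t_slow, t_fast, t_slow / t_fast)
# At time t_fast the fast coordinate is essentially equilibrated,
# while the slow coordinate has barely moved from its initial variance.
print(variance(rho2, v0, t_fast), 1.0 / rho2)
print(variance(rho1, v0, t_fast), 1.0 / rho1)
```

The ratio of the two relaxation times is ${\rho_2/\rho_1}$ , which is the gap that the basic Bakry-Emery bound, driven by ${\rho_1}$ alone, does not see.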

The first trick is to refrain from applying the inequality (25) to the Bakry-Emery inequality (19) as it is too lossy in the fast variables. Instead, since

$\displaystyle \nabla^2 H(u,u) \geq \rho_2 u_2^2$

we see that (19) implies that

$\displaystyle \partial_t D_\mu(g,g) \leq - \frac{\rho_2}{2} \int_{{\bf R}^2} (\partial_2 g)^2\ d\mu.$

Integrating this, we obtain an additional bound on ${\partial_2 g}$ which becomes useful in the regime when ${\rho_2}$ is much larger than ${\rho_1}$ :

$\displaystyle \int_0^\infty \int_{{\bf R}^2} (\partial_2 g)^2\ d\mu dt \leq \frac{2}{\rho_2} D_\mu(g(0),g(0)). \ \ \ \ \ (26)$

This gives some additional smoothing in the ${x_2}$ direction. For instance, given a test function ${F(x_1,x_2) = F(x_2)}$ of the ${x_2}$ variable only, we have

$\displaystyle \partial_t \int_{{\bf R}^2} F f\ d\mu = \int_{{\bf R}^2} F Lf\ d\mu$

$\displaystyle = - D_\mu(F,f)$

$\displaystyle = - \frac{1}{2} \int_{{\bf R}^2} F'(x_2) \partial_2 f\ d\mu$

$\displaystyle = - \int_{{\bf R}^2} F'(x_2) g \partial_2 g\ d\mu;$

since ${\int_{{\bf R}^2} g^2\ d\mu = 1}$ , we thus have from Cauchy-Schwarz that

$\displaystyle |\partial_t \int_{{\bf R}^2} F f\ d\mu| \ll \|F'\|_{L^\infty} (\int_{{\bf R}^2} (\partial_2 g)^2\ d\mu)^{1/2};$

integrating this from ${t=0}$ to some time ${t=T}$ and using Cauchy-Schwarz again followed by (26), we see that

$\displaystyle |\int_{{\bf R}^2} F f(0)\ d\mu - \int_{{\bf R}^2} F f(T)\ d\mu| \ll \frac{T^{1/2}}{\rho_2^{1/2}} \|F'\|_{L^\infty} D_\mu(g(0),g(0))^{1/2}.$

Given that the time to relaxation to equilibrium is ${O(1/\rho_1)}$ , we morally have ${f(T) \approx 1}$ for ${T \gg 1/\rho_1}$ , so heuristically at least we have the bound

$\displaystyle |\int_{{\bf R}^2} F f(0)\ d\mu - \int_{{\bf R}^2} F\ d\mu| \ll \frac{1}{\rho_1^{1/2} \rho_2^{1/2}} \|F'\|_{L^\infty} D_\mu(g(0),g(0))^{1/2}. \ \ \ \ \ (27)$

For sake of comparison, the log-Sobolev inequality (21) followed by Lemma 1 gives a bound of the shape

$\displaystyle |\int_{{\bf R}^2} F f(0)\ d\mu - \int_{{\bf R}^2} F\ d\mu| \ll \frac{1}{\rho_1^{1/2}} \|F\|_{L^\infty} D_\mu(g(0),g(0))^{1/2}$

for all ${F}$ (which can now depend on ${x_1}$ as well as ${x_2}$ ). The bound (27) is better than the log-Sobolev bound when the test function ${F}$ is spread out over scales larger than the natural length scale ${\rho_2^{-1/2}}$ of the fast variable. This suggests that such statistics are universal among all distributions ${f\ d\mu}$ for which one can control the Dirichlet functional ${D_\mu(\sqrt{f},\sqrt{f})}$ . (In the context of the universality theory for Wigner random matrices, this corresponds to local spectral statistics that have been averaged in the energy parameter in order to reduce the dependence of these statistics on fast variables.)

The above trick improves the estimates somewhat, but it still involves the slow relaxation time ${T = O(1/\rho_1)}$ , which is undesirable for some applications. The second trick in the method of local relaxation flow is to modify the Hamiltonian, working with a more convexified (but slightly artificial) Hamiltonian ${\tilde H}$ , such as

$\displaystyle \tilde H(x) := H(x) + \frac{1}{2} \rho_* |x|^2 \ \ \ \ \ (28)$

for some parameter ${\rho_*}$ , which one can think of as being intermediate between ${\rho_1}$ and ${\rho_2}$ for our toy example. Since ${\nabla^2 \tilde H(u,u) \geq \rho_* |u|^2}$ , Bakry-Emery theory tells us that the heat equation ${\partial_t \tilde f = \tilde L \tilde f}$ associated to this modified Hamiltonian (which Erdos, Schlein, Yau, and Yin refer to as the local relaxation flow) will relax to a local equilibrium measure ${\tilde \mu}$ in time ${O(1/\rho_*)}$ , which can be considerably faster than the relaxation time ${O(1/\rho_1)}$ to global equilibrium if the parameter ${\rho_*}$ is chosen appropriately. In particular, this can lead to improvements in the constants for bounds such as (27) (namely, by replacing ${\rho_1}$ there with ${\rho_*}$ ), at the cost of replacing the original Dirichlet energy ${D_\mu(g(0),g(0))}$ with a modified Dirichlet energy ${D_{\tilde \mu}(g(0),g(0))}$ .

Of course, in practice one is not directly interested in the local relaxation flow, but in solutions to the original heat equation ${\partial_t f = L f}$ . As such, the Bakry-Emery inequality (19) for such flows does not directly say much about the modified Dirichlet energy ${D_{\tilde \mu}(g,g)}$ , as it only directly controls the original Dirichlet energy ${D_\mu(g,g)}$ (and in practice, the ratio between ${\tilde \mu}$ and ${\mu}$ will be quite large). Nevertheless, one can adapt the Bakry-Emery methods to say something useful in this setting. Starting with a solution to the original heat equation ${\partial_t f = Lf}$ , we define a modified function ${\tilde f}$ by the formula

$\displaystyle f\ d\mu = \tilde f\ d\tilde \mu.$

In other words, we have ${f = \tilde f \psi}$ , where ${\psi}$ is the weight ${\psi := \frac{d\tilde \mu}{d\mu}}$ . Instead of working with the entropy ${S_\mu(f) = \int_{{\bf R}^n} f \log f\ d\mu}$ of ${f}$ relative to ${\mu}$ , we instead work with the entropy

$\displaystyle S_{\tilde \mu}(\tilde f) = \int_{{\bf R}^n} \tilde f \log \tilde f\ d\tilde \mu$

of ${\tilde f}$ with respect to ${\tilde \mu}$ ; we may write this as

$\displaystyle S_{\tilde \mu}(\tilde f) = S_\mu(f) - \int_{{\bf R}^n} f \log \psi\ d\mu.$
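
For completeness, here is a sketch of the short computation behind this identity; it uses only the relations ${f\ d\mu = \tilde f\ d\tilde \mu}$ and ${\tilde f = f/\psi}$ .

```latex
\begin{align*}
S_{\tilde \mu}(\tilde f)
&= \int_{{\bf R}^n} \tilde f \log \tilde f\ d\tilde \mu
 = \int_{{\bf R}^n} f \log \frac{f}{\psi}\ d\mu \\
&= \int_{{\bf R}^n} f \log f\ d\mu - \int_{{\bf R}^n} f \log \psi\ d\mu
 = S_\mu(f) - \int_{{\bf R}^n} f \log \psi\ d\mu.
\end{align*}
```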

Differentiating this using (17), one has

$\displaystyle \partial_t S_{\tilde \mu}(\tilde f) = - 4 D_\mu(g,g) - \int_{{\bf R}^n} Lf \log \psi\ d\mu$

where ${g = f^{1/2}}$ . We easily verify using (12), (10) that

$\displaystyle \int_{{\bf R}^n} Lf \log \psi\ d\mu = \int_{{\bf R}^n} f L \log \psi\ d\mu$

$\displaystyle = \int_{{\bf R}^n} f (\frac{L \psi}{\psi} - \frac{|\nabla \psi|^2}{2\psi^2})\ d\mu$

$\displaystyle = \int_{{\bf R}^n} \tilde f (\frac{L \psi}{\psi} - \frac{|\nabla \psi|^2}{2\psi^2})\ d\tilde \mu.$

Meanwhile, since ${\tilde g = \tilde f^{1/2} = g / \psi^{1/2}}$ , we have ${\nabla g = \nabla \tilde g \sqrt{\psi} + \tilde g \frac{\nabla \psi}{2\sqrt{\psi}}}$ , and so

$\displaystyle 4D_\mu(g,g) = 2 \int_{{\bf R}^n} |\nabla \tilde g \sqrt{\psi} + \tilde g \frac{\nabla \psi}{2\sqrt{\psi}}|^2\ d\mu$

$\displaystyle =\int_{{\bf R}^n} ( 2|\nabla \tilde g|^2 \psi + \nabla(\tilde g)^2 \nabla \psi + |\tilde g|^2 \frac{|\nabla \psi|^2}{2\psi})\ d\mu$

$\displaystyle = \int_{{\bf R}^n} ( 2|\nabla \tilde g|^2 \psi - 2 (\tilde g)^2 L \psi + |\tilde g|^2 \frac{|\nabla \psi|^2}{2\psi})\ d\mu$

$\displaystyle = 4 D_{\tilde \mu}(\tilde g,\tilde g) - 2 \int_{{\bf R}^n} \tilde f L\psi\ d\mu +\int_{{\bf R}^n} \tilde f \frac{|\nabla \psi|^2}{2\psi^2}\ d\tilde \mu$

and thus on combining these identities and simplifying, we see that

$\displaystyle \partial_t S_{\tilde \mu}(\tilde f) = - 4 D_{\tilde \mu}(\tilde g,\tilde g) + \int_{{\bf R}^n} \tilde f L\psi\ d\mu.$

We can write the second term as

$\displaystyle \int_{{\bf R}^n} L\tilde f \psi\ d\mu,$

but from (13) one has

$\displaystyle \int_{{\bf R}^n} \tilde L\tilde f \psi\ d\mu = \int_{{\bf R}^n} \tilde L \tilde f\ d\tilde \mu = 0$

and thus

$\displaystyle \partial_t S_{\tilde \mu}(\tilde f) = - 4 D_{\tilde \mu}(\tilde g,\tilde g) + \int_{{\bf R}^n} (L-\tilde L) \tilde f\ d\tilde \mu.$

If ${\tilde H}$ is related to ${H}$ by (28) with ${\beta=1}$ , then

$\displaystyle L - \tilde L = \rho_* x \cdot \nabla$

and so

$\displaystyle |\int_{{\bf R}^n} (L-\tilde L) \tilde f\ d\tilde \mu| \leq 2 \rho_* \int_{{\bf R}^n} \tilde g |x| |\nabla \tilde g|\ d\tilde \mu.$
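
In more detail, this bound follows pointwise from ${\tilde f = \tilde g^2}$ (with ${\tilde g \geq 0}$ ) and the Cauchy-Schwarz inequality:

$\displaystyle |(L - \tilde L) \tilde f| = |\rho_* x \cdot \nabla (\tilde g^2)| = 2 \rho_* \tilde g |x \cdot \nabla \tilde g| \leq 2 \rho_* \tilde g |x| |\nabla \tilde g|.$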

To remove the derivative here, we can apply the arithmetic mean-geometric mean inequality to obtain

$\displaystyle 2 \rho_* \int_{{\bf R}^n} \tilde g |x| |\nabla \tilde g|\ d\tilde \mu \leq D_{\tilde \mu}(\tilde g,\tilde g) + O( \rho_*^2 \int_{{\bf R}^n} |x|^2 \tilde f\ d\tilde \mu)$

and so

$\displaystyle \partial_t S_{\tilde \mu}(\tilde f) \leq - 3 D_{\tilde \mu}(\tilde g, \tilde g) + O( \rho_*^2 \int_{{\bf R}^n} |x|^2 \tilde f\ d\tilde \mu).$
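
For the record, here is one way to carry out the arithmetic mean-geometric mean step with explicit constants. It is a sketch, assuming the normalisation ${D_{\tilde \mu}(\tilde g,\tilde g) = \frac{1}{2} \int_{{\bf R}^n} |\nabla \tilde g|^2\ d\tilde \mu}$ that is implicit in the expansion of ${4 D_\mu(g,g)}$ above. Pointwise, one has

$\displaystyle 2 \rho_* \tilde g |x| |\nabla \tilde g| \leq \frac{1}{2} |\nabla \tilde g|^2 + 2 \rho_*^2 |x|^2 \tilde g^2,$

and integrating against ${\tilde \mu}$ and using ${\tilde g^2 = \tilde f}$ gives

$\displaystyle 2 \rho_* \int_{{\bf R}^n} \tilde g |x| |\nabla \tilde g|\ d\tilde \mu \leq D_{\tilde \mu}(\tilde g,\tilde g) + 2 \rho_*^2 \int_{{\bf R}^n} |x|^2 \tilde f\ d\tilde \mu.$

Inserting this into the identity for ${\partial_t S_{\tilde \mu}(\tilde f)}$ leaves ${-4+1 = -3}$ copies of the Dirichlet form:

$\displaystyle \partial_t S_{\tilde \mu}(\tilde f) \leq - 3 D_{\tilde \mu}(\tilde g, \tilde g) + 2 \rho_*^2 \int_{{\bf R}^n} |x|^2 \tilde f\ d\tilde \mu.$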

In many applications, one can get good control on the variance ${\int_{{\bf R}^n} |x|^2 \tilde f\ d\tilde \mu}$ ; in the context of universality for Wigner matrices, such bounds can be obtained from local semi-circular laws, and particularly from a strong version of such a law, known as eigenvalue rigidity. Because of this, one can get good bounds on the entropy ${S_{\tilde \mu}(\tilde f)}$ and Dirichlet form ${D_{\tilde \mu}(\tilde g,\tilde g)}$ with respect to the local equilibrium measure in these cases in timescales as short as ${O(1/\rho_*)}$ ; see for instance this survey of Erdos and Yau for details.

One of the main applications of the local relaxation flow method is to obtain universality for various local statistics of random matrices after averaging in the energy parameter. In some cases (notably in that of complex Wigner matrices that match the GUE ensemble to second order) it is possible to remove the energy averaging by using methods based on explicit formulae for the correlation functions, rather than by the local relaxation flow method; this was first achieved for certain classes of ensembles by Johansson, and later extended to increasingly wider sets of ensembles by many authors. However, it appears difficult to obtain such fixed-energy results from heat flow methods alone, as there is an insufficient amount of concentration of measure for the fixed-energy measurements. Nevertheless, by combining these methods with (a variant of) the Hölder continuity theory of parabolic equations of Caffarelli, Chan, and Vasseur, Erdos and Yau were recently able to obtain universality for a slightly different statistic, namely a fixed gap ${\lambda_{i+1}-\lambda_i}$ between eigenvalues, without any averaging in the ${i}$ parameter. This avoids the need for energy averaging, but on the other hand the statistic studied is invariant with respect to the slow mode in which all the eigenvalues are translated by the same amount. It remains a technical challenge to discover a technique, not invoking either explicit formulae or energy averaging, for understanding statistics that are not invariant (or nearly invariant) with respect to the slow mode.

11 comments

6 February, 2013 at 8:46 am

hanbangxian

I am not sure if the problem you mentioned for random graph has relationship with the Ricci curvature of a graph.

7 February, 2013 at 8:00 am

I think your definition of entropy has the opposite sign to the standard one.

7 February, 2013 at 8:16 am

Terence Tao

Actually, relative entropy is usually defined using the positive sign (which, among other things, makes this quantity non-negative), in contrast to Shannon entropy or thermodynamic entropy which usually has the negative sign.

11 February, 2013 at 10:28 am

Djalil

Dear Terry,

regarding the sharp inequality for the proof of the optimal entropy-total-variation comparison, I thought that it goes back at least to Kemperman (1969) http://www.ams.org/mathscinet-getitem?mr=252112 pages 2174-2175. It seems to me also that the inequality is often referred to as the Pinsker or Csiszár-Kullback inequality.

Best.

[Reference added, thanks – T.]

11 February, 2013 at 2:38 pm

Djalil

Precision: the first known proof is due to Pinsker (1963 or 64). However, the argument using the lower bound with a rational fraction is due to Kemperman (1969)… Best.

[Corrected, thanks – T.]

24 July, 2013 at 10:58 am

salazar

In the display above (19), the quadratic form should be defined by $\nabla^2 H(u,v) = (\partial_i \partial_j H)(u_i)(v_j)$ in Einstein convention, where $u$ and $v$ are vectors?

[Corrected, thanks – T.]

28 September, 2014 at 7:49 pm

FC7

Regarding your remark “I do not have a satisfactorily intuitive reason for this phenomenon, but it can be proved by”: I think we have a partial answer for it. In a paper, we prove that the third order derivative of S(f) is >= 0 and the fourth order derivative is <= 0. The paper can be found on arXiv: http://arxiv.org/abs/1409.5543
I have also sent an email to you.

20 September, 2015 at 11:30 am

Entropy and rare events | What's new

[…] this previous blog post for a proof of this inequality.) A standard application of Jensen’s inequality reveals that […]

30 December, 2020 at 5:13 pm

Anonymous

I checked the result of equation (7), and I think it should be 2^(n/2) in the denominator.

[I believe the exponent is correct as stated – T.]

13 March, 2022 at 2:32 pm

After eqn 10, you consider 2g d_t g for which one expression is 2g Lg by eqn 1. But another expression that you use is d_t g^2 = L g^2 and then appeal to eqn 10 to get 2gLg + |\nabla g|^2. But this would imply |\nabla g|^2 = 0 which seems wrong. Not sure what mistake I made….

13 March, 2022 at 9:31 pm

Oh I got the issue. Nevermind.
