Let {A, B} be two Hermitian {n \times n} matrices. When {A} and {B} commute, we have the identity

\displaystyle  e^{A+B} = e^A e^B.

When {A} and {B} do not commute, the situation is more complicated; we have the Baker-Campbell-Hausdorff formula (or more precisely, the closely related Zassenhaus formula)

\displaystyle  e^{A+B} = e^A e^B e^{-\frac{1}{2}[A,B]} \ldots

where the infinite product here is explicit but very messy. On the other hand, taking determinants we still have the identity

\displaystyle  \hbox{det}(e^{A+B}) = \hbox{det}(e^A e^B).

Recently I learned (from Emmanuel Candes, who in turn learned it from David Gross) that there is another very nice relationship between {e^{A+B}} and {e^A e^B}, namely the Golden-Thompson inequality

\displaystyle  \hbox{tr}(e^{A+B}) \leq \hbox{tr}(e^A e^B). \ \ \ \ \ (1)

The remarkable thing about this inequality is that no commutativity hypotheses whatsoever on the matrices {A, B} are required. Note that the right-hand side can be rearranged using the cyclic property of trace as {\hbox{tr}( e^{B/2} e^A e^{B/2} )}; the expression inside the trace is positive definite so the right-hand side is positive. (On the other hand, there is no reason why expressions such as {\hbox{tr}(e^A e^B e^C)} need to be positive or even real, so the obvious extension of the Golden-Thompson inequality to three or more Hermitian matrices fails.) I am told that this inequality is quite useful in statistical mechanics, although I do not know the details of this.
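As a quick numerical sanity check of (1) (not a proof), one can sample random Hermitian matrices and compare the two traces. The sketch below assumes NumPy is available; the helper names `random_hermitian`, `expm_hermitian` and `golden_thompson_gap` are illustrative choices, and the matrix exponential is computed from the spectral decomposition, which is valid only for Hermitian input.

```python
import numpy as np

def random_hermitian(n, rng):
    """Random n x n Hermitian matrix (illustrative test data only)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

def expm_hermitian(H):
    """e^H for Hermitian H via the spectral theorem: V diag(e^w) V^*."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def golden_thompson_gap(A, B):
    """tr(e^A e^B) - tr(e^{A+B}); inequality (1) says this is >= 0."""
    lhs = np.trace(expm_hermitian(A + B)).real
    rhs = np.trace(expm_hermitian(A) @ expm_hermitian(B)).real
    return rhs - lhs

rng = np.random.default_rng(0)
gaps = [golden_thompson_gap(random_hermitian(4, rng), random_hermitian(4, rng))
        for _ in range(200)]
print("smallest gap over 200 random pairs:", min(gaps))
```

When {A} and {B} commute (for instance {B = 2A}) the gap should vanish up to rounding error, which is a useful second check on the helpers.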
To get a sense of how delicate the Golden-Thompson inequality is, let us expand both sides to fourth order in {A, B}. The left-hand side expands as

\displaystyle  \hbox{tr} 1 + \hbox{tr} (A+B) + \frac{1}{2} \hbox{tr} (A^2 + AB + BA + B^2) + \frac{1}{6} \hbox{tr} (A+B)^3

\displaystyle  + \frac{1}{24} \hbox{tr} (A+B)^4 + \ldots

while the right-hand side expands as

\displaystyle  \hbox{tr} 1 + \hbox{tr} (A+B) + \frac{1}{2} \hbox{tr} (A^2 + 2AB + B^2)

\displaystyle  + \frac{1}{6} \hbox{tr} (A^3 + 3A^2 B + 3 A B^2+B^3) +

\displaystyle  \frac{1}{24} \hbox{tr} (A^4 + 4 A^3 B + 6 A^2 B^2 + 4 A B^3 +B^4) + \ldots

Using the cyclic property of trace {\hbox{tr}(AB) = \hbox{tr}(BA)}, one can verify that all terms up to third order agree. Turning to the fourth order terms, after expanding out {(A+B)^4} and using the cyclic property of trace as much as possible, we see that the fourth order terms almost agree, but the left-hand side contains a term {\frac{1}{12} \hbox{tr}(ABAB)} whose counterpart on the right-hand side is {\frac{1}{12} \hbox{tr}(ABBA)}. The difference between the two (right-hand side minus left-hand side) can be factorised (again using the cyclic property of trace) as {-\frac{1}{24} \hbox{tr} [A,B]^2}. Since {[A,B] := AB-BA} is skew-Hermitian, {-[A,B]^2} is positive semi-definite, and so we have proven the Golden-Thompson inequality to fourth order. (One could also have used the Cauchy-Schwarz inequality for the Frobenius norm to establish this; see below.)
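The fourth-order bookkeeping can be checked numerically. The sketch below (assuming NumPy; the function names are hypothetical) computes the fourth-order gap {\frac{1}{12}(\hbox{tr}(ABBA) - \hbox{tr}(ABAB))} directly and via the commutator expression {-\frac{1}{24} \hbox{tr} [A,B]^2}, and confirms that the two agree and are non-negative on random Hermitian inputs.

```python
import numpy as np

def random_hermitian(n, rng):
    """Random n x n Hermitian matrix (illustrative test data only)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

def fourth_order_gap(A, B):
    """Return the fourth-order gap computed two ways.

    direct:         (tr(ABBA) - tr(ABAB)) / 12
    via_commutator: -tr([A,B]^2) / 24
    """
    direct = (np.trace(A @ B @ B @ A) - np.trace(A @ B @ A @ B)).real / 12
    C = A @ B - B @ A  # commutator [A, B]; skew-Hermitian when A, B are Hermitian
    via_commutator = -np.trace(C @ C).real / 24
    return direct, via_commutator

rng = np.random.default_rng(0)
A, B = random_hermitian(5, rng), random_hermitian(5, rng)
direct, via_commutator = fourth_order_gap(A, B)
print(direct, via_commutator)
```

Since {-\hbox{tr} [A,B]^2} is the squared Frobenius norm of the commutator, the gap is zero exactly when {A} and {B} commute.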
Intuitively, the Golden-Thompson inequality is asserting that interactions between a pair {A, B} of non-commuting Hermitian matrices are strongest when cross-interactions are kept to a minimum, so that all the {A} factors lie on one side of a product and all the {B} factors lie on the other. Indeed, this theme will be running through the proof of this inequality, to which we now turn.

The proof of the Golden-Thompson inequality relies on the somewhat magical power of the tensor power trick. For any even integer {p = 2,4,6,\ldots} and any {n \times n} matrix {A} (not necessarily Hermitian), we define the {p}-Schatten norm {\|A\|_p} of {A} by the formula

\displaystyle  \| A \|_p := (\hbox{tr}(AA^*)^{p/2})^{1/p}.

(This formula in fact defines a norm for any {p \geq 1}, but we will only need the even integer case here.) This norm can be viewed as a non-commutative analogue of the {\ell^p} norm; indeed, the {p}-Schatten norm of a diagonal matrix is just the {\ell^p} norm of the coefficients.
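For concreteness, here is a minimal sketch of the {p}-Schatten norm for even {p}, following the trace formula above literally. It assumes NumPy; `schatten_norm` is a name chosen for illustration. The example checks the diagonal case against the {\ell^p} norm of the entries.

```python
import numpy as np

def schatten_norm(A, p):
    """p-Schatten norm for even integer p: (tr (A A^*)^{p/2})^{1/p}."""
    if p < 2 or p % 2 != 0:
        raise ValueError("this sketch only handles even integers p >= 2")
    gram = A @ A.conj().T
    return np.trace(np.linalg.matrix_power(gram, p // 2)).real ** (1.0 / p)

# Diagonal matrix: the Schatten 4-norm is the l^4 norm of the diagonal entries.
v = np.array([1.0, -2.0, 3.0])
print(schatten_norm(np.diag(v), 4), np.sum(np.abs(v) ** 4) ** 0.25)
```

For a general matrix the same quantity equals the {\ell^p} norm of the singular values, which gives an independent way to test the function.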
Note that the {2}-Schatten norm

\displaystyle  \|A\|_2 := (\hbox{tr}(AA^*))^{1/2}

is the Hilbert space norm associated to the Frobenius inner product (or Hilbert-Schmidt inner product)

\displaystyle  \langle A, B \rangle := \hbox{tr}(A B^*).

This is clearly a non-negative Hermitian inner product, so by the Cauchy-Schwarz inequality we conclude that

\displaystyle  |\hbox{tr}(A_1 A_2^*)| \leq \| A_1 \|_2 \|A_2\|_2

for any {n \times n} matrices {A_1, A_2}. As {\|A_2\|_2 = \|A_2^*\|_2}, we conclude in particular that

\displaystyle  |\hbox{tr}(A_1 A_2)| \leq \| A_1 \|_2 \|A_2\|_2

We can iterate this and establish the non-commutative Hölder inequality

\displaystyle  |\hbox{tr}(A_1 A_2 \ldots A_p)| \leq \| A_1 \|_p \|A_2\|_p \ldots \|A_p\|_p \ \ \ \ \ (2)

whenever {p=2,4,8,\ldots} is a power of {2}. Indeed, we induct on {p}, the case {p=2} already having been established. If {p \geq 4} is a power of {2}, then by the induction hypothesis (grouping {A_1 \ldots A_p} into {p/2} pairs) we can bound

\displaystyle  |\hbox{tr}(A_1 A_2 \ldots A_p)| \leq \| A_1 A_2 \|_{p/2} \|A_3 A_4\|_{p/2} \ldots \|A_{p-1} A_p\|_{p/2}. \ \ \ \ \ (3)

On the other hand, we may expand

\displaystyle  \| A_1 A_2\|_{p/2}^{p/2} = \hbox{tr} A_1 A_2 A_2^* A_1^* \ldots A_1 A_2 A_2^* A_1^*.

We use the cyclic property of trace to move the rightmost {A_1^*} factor to the left. Applying the induction hypothesis again, we conclude that

\displaystyle  \| A_1 A_2\|_{p/2}^{p/2} \leq \| A_1^* A_1 \|_{p/2} \|A_2 A_2^*\|_{p/2} \ldots \| A_1^* A_1 \|_{p/2} \| A_2 A_2^* \|_{p/2}.

But from the cyclic property of trace again, we have {\| A_1^* A_1 \|_{p/2} = \|A_1\|_p^2} and {\| A_2 A_2^* \|_{p/2} = \|A_2\|_p^2}. We conclude that

\displaystyle  \|A_1 A_2 \|_{p/2} \leq \|A_1\|_p \|A_2\|_p

and similarly for {\|A_3 A_4\|_{p/2}}, etc. Inserting this into (3) we obtain (2).
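The inequality (2) is easy to test numerically for small powers of two. The following sketch assumes NumPy and uses illustrative helper names; it draws random complex (not necessarily Hermitian) matrices and returns both sides of (2), together with the intermediate estimate {\|A_1 A_2\|_{p/2} \leq \|A_1\|_p \|A_2\|_p} used in the induction.

```python
from functools import reduce
import numpy as np

def schatten_norm(A, p):
    """p-Schatten norm for even integer p: (tr (A A^*)^{p/2})^{1/p}."""
    gram = A @ A.conj().T
    return np.trace(np.linalg.matrix_power(gram, p // 2)).real ** (1.0 / p)

def holder_sides(mats):
    """Both sides of (2) for a list of p matrices (p should be 2, 4, 8, ...)."""
    p = len(mats)
    lhs = abs(np.trace(reduce(np.matmul, mats)))
    rhs = float(np.prod([schatten_norm(M, p) for M in mats]))
    return lhs, rhs

rng = np.random.default_rng(0)
mats = [rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
        for _ in range(4)]
lhs, rhs = holder_sides(mats)
pair_lhs = schatten_norm(mats[0] @ mats[1], 2)
pair_rhs = schatten_norm(mats[0], 4) * schatten_norm(mats[1], 4)
print(lhs, rhs, pair_lhs, pair_rhs)
```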

Remark 1 Though we will not need to do so here, it is interesting to note that one can use the tensor power trick to amplify (2) for {p} equal to a power of two, to obtain (2) for all positive integers {p}, at least when the {A_i} are all Hermitian. Indeed, pick a large integer {m} and let {N} be the integer part of {2^m/p}. Then expand the left-hand side of (2) as {\hbox{tr}( A_1^{1/N} \ldots A_1^{1/N} A_2^{1/N} \ldots A_p^{1/N} \ldots A_p^{1/N} )} and apply (2) with {p} replaced by {2^m} to bound this by {\| A_1^{1/N} \|_{2^m}^N \ldots \|A_p^{1/N}\|_{2^m}^N \| 1 \|_{2^m}^{2^m-pN}}. Sending {m \rightarrow \infty} (noting that {2^m = (1+o(1)) Np}) we obtain the claim.

Specialising (2) to the case where {A_1=\ldots=A_p = AB} for some Hermitian matrices {A, B}, we conclude that

\displaystyle  \hbox{tr}( (AB)^{p} ) \leq \| AB \|_p^p

and hence by cyclic permutation

\displaystyle  \hbox{tr}( (AB)^{p} ) \leq \hbox{tr}( (A^2 B^2)^{p/2} )

for any {p = 2,4,8,\ldots} that is a power of two. Iterating this we conclude that

\displaystyle  \hbox{tr}( (AB)^{p} ) \leq \hbox{tr}( A^p B^p ). \ \ \ \ \ (4)

Applying this with {A, B} replaced by {e^{A/p}} and {e^{B/p}} respectively, we obtain

\displaystyle  \hbox{tr}( (e^{A/p} e^{B/p})^{p} ) \leq \hbox{tr}( e^A e^B ).

Now we send {p \rightarrow \infty}. Since {e^{A/p} = 1 + A/p + O(1/p^2)} and {e^{B/p} = 1 + B/p + O(1/p^2)}, we have {e^{A/p} e^{B/p} = e^{(A+B)/p + O(1/p^2)}}, and so the left-hand side is {\hbox{tr}( e^{A+B + O(1/p)} )}; taking the limit as {p \rightarrow \infty} we obtain the Golden-Thompson inequality. (See also these notes of Vershynin for a slight variant of this proof.)
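The limiting argument can be watched numerically. Applying the step {\hbox{tr}((XY)^p) \leq \hbox{tr}((X^2Y^2)^{p/2})} with {X = e^{A/p}}, {Y = e^{B/p}} shows that {\hbox{tr}((e^{A/p} e^{B/p})^p)} is non-increasing along {p = 1, 2, 4, 8, \ldots}, starting at {\hbox{tr}(e^A e^B)} and tending to {\hbox{tr}(e^{A+B})}. The sketch below assumes NumPy; the helper names and the scaling of the test matrices are illustrative choices.

```python
import numpy as np

def random_hermitian(n, rng):
    """Random n x n Hermitian matrix (illustrative test data only)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

def expm_hermitian(H):
    """e^H for Hermitian H via the spectral theorem."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def trotter_trace(A, B, p):
    """tr((e^{A/p} e^{B/p})^p)."""
    step = expm_hermitian(A / p) @ expm_hermitian(B / p)
    return np.trace(np.linalg.matrix_power(step, p)).real

rng = np.random.default_rng(1)
A = 0.5 * random_hermitian(3, rng)
B = 0.5 * random_hermitian(3, rng)
powers = [2 ** k for k in range(11)]          # 1, 2, 4, ..., 1024
values = [trotter_trace(A, B, p) for p in powers]
limit = np.trace(expm_hermitian(A + B)).real  # tr e^{A+B}
for p, val in zip(powers, values):
    print(p, val - limit)
```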
If we stop the iteration at an earlier point, then the same argument gives the inequality

\displaystyle  \| e^{A+B} \|_p \leq \| e^A e^B \|_p

for {p=2,4,8,\ldots} a power of two; one can view the original Golden-Thompson inequality as the {p=1} endpoint of this case in some sense. (In fact, the Golden-Thompson inequality is true in any unitarily invariant norm; see Theorem 9.3.7 of Bhatia’s book.) In the limit {p \rightarrow \infty}, we obtain in particular the operator norm inequality

\displaystyle  \| e^{A+B} \|_{op} \leq \| e^A e^B \|_{op} \ \ \ \ \ (5)

This inequality has a nice consequence:

Corollary 2 Let {A, B} be Hermitian matrices. If {e^A \leq e^B} (i.e. {e^B-e^A} is positive semi-definite), then {A \leq B}.

Proof: Since {e^A \leq e^B}, we have {\langle e^A x, x \rangle \leq \langle e^B x, x \rangle} for all vectors {x}, or in other words {\|e^{A/2} x \| \leq \| e^{B/2} x \|} for all {x}. This implies that {e^{A/2} e^{-B/2}} is a contraction, i.e. {\|e^{A/2} e^{-B/2} \|_{op} \leq 1}. By (5), we conclude that {\|e^{(A-B)/2}\|_{op} \leq 1}, thus {(A-B)/2 \leq 0}, and the claim follows. \Box
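Corollary 2 (the operator monotonicity of the logarithm) can also be probed numerically. In the sketch below, which assumes NumPy and uses hypothetical helper names, we build positive definite matrices {e^A \leq e^B} directly, take matrix logarithms via the spectral theorem, and record the smallest eigenvalue of {B - A} over many trials.

```python
import numpy as np

def logm_posdef(P):
    """Matrix logarithm of a positive definite Hermitian matrix."""
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.conj().T

def random_psd(n, rng):
    """Random positive semi-definite matrix M M^* (illustrative test data)."""
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return M @ M.conj().T

rng = np.random.default_rng(2)
worst = np.inf
for _ in range(200):
    exp_A = random_psd(4, rng) + 0.1 * np.eye(4)  # plays the role of e^A
    exp_B = exp_A + random_psd(4, rng)            # e^B - e^A is PSD by construction
    A, B = logm_posdef(exp_A), logm_posdef(exp_B)
    worst = min(worst, np.linalg.eigvalsh(B - A).min())
print("smallest eigenvalue of B - A seen:", worst)
```

A non-negative value of `worst` (up to rounding) is what the corollary predicts; this is only evidence on random samples, not a substitute for the proof.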

It is not difficult to reverse the above argument and conclude that Corollary 2 is in fact equivalent to (5).

It is remarkably tricky to try to prove Corollary 2 directly. Here is a somewhat messy proof; I would be interested in seeing a more elegant argument. By the fundamental theorem of calculus, it suffices to show that whenever {A(t)} is a Hermitian matrix depending smoothly on a real parameter with {\frac{d}{dt} e^{A(t)} \geq 0}, then {\frac{d}{dt} A(t) \geq 0}. Indeed, Corollary 2 follows from this claim by setting {A(t) := \log(e^A + t (e^B - e^A))} and concluding that {A(1) \geq A(0)}.

To obtain this claim, we use the Duhamel formula

\displaystyle  \frac{d}{dt} e^{A(t)} = \int_0^1 e^{(1-s)A(t)} (\frac{d}{dt} A(t)) e^{sA(t)}\ ds.

This formula can be proven by Taylor expansion, or by carefully approximating {e^{A(t)}} by {(1 + A(t)/N)^N}; alternatively, one can integrate the identity

\displaystyle  \frac{\partial}{\partial s}( e^{-sA(t)} \frac{\partial }{\partial t} e^{sA(t)} ) = e^{-sA(t)} (\frac{\partial}{\partial t} A(t)) e^{sA(t)}

which follows from the product rule and by interchanging the {s} and {t} derivatives at a key juncture. We rearrange the Duhamel formula as

\displaystyle  \frac{d}{dt} e^{A(t)} = e^{A(t)/2} (\int_{-1/2}^{1/2} e^{sA(t)} (\frac{d}{dt} A(t)) e^{-sA(t)}\ ds) e^{A(t)/2}.
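Both forms of the Duhamel formula can be compared against a finite-difference derivative. The sketch below assumes NumPy; the particular path {A(t) = A_0 + \sin(t) A_1}, the step size, and the use of Gauss-Legendre quadrature for the {s}-integral are all illustrative choices rather than anything dictated by the argument.

```python
import numpy as np

def expm_hermitian(H, s=1.0):
    """e^{sH} for Hermitian H and real s, via the spectral theorem."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(s * w)) @ V.conj().T

def random_hermitian(n, rng):
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

rng = np.random.default_rng(3)
A0, A1 = random_hermitian(3, rng), random_hermitian(3, rng)

def A(t):
    return A0 + np.sin(t) * A1   # hypothetical smooth Hermitian path

def dA(t):
    return np.cos(t) * A1        # its derivative in t

def derivative_fd(t, h=1e-5):
    """Central finite difference for d/dt e^{A(t)}."""
    return (expm_hermitian(A(t + h)) - expm_hermitian(A(t - h))) / (2 * h)

nodes, weights = np.polynomial.legendre.leggauss(30)  # nodes on [-1, 1]

def duhamel(t):
    """Integral over s in [0, 1] of e^{(1-s)A} A' e^{sA}."""
    return sum(0.5 * w * expm_hermitian(A(t), 1 - s) @ dA(t) @ expm_hermitian(A(t), s)
               for s, w in zip((nodes + 1) / 2, weights))

def duhamel_symmetric(t):
    """e^{A/2} (integral over s in [-1/2, 1/2] of e^{sA} A' e^{-sA}) e^{A/2}."""
    half = expm_hermitian(A(t), 0.5)
    inner = sum(0.5 * w * expm_hermitian(A(t), s) @ dA(t) @ expm_hermitian(A(t), -s)
                for s, w in zip(nodes / 2, weights))
    return half @ inner @ half

t = 0.7
D = derivative_fd(t)
print(np.linalg.norm(duhamel(t) - D), np.linalg.norm(duhamel_symmetric(t) - D))
```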

Using the basic identity {e^A B e^{-A} = e^{\hbox{ad}(A)} B}, we thus have

\displaystyle  \frac{d}{dt} e^{A(t)} = e^{A(t)/2} [(\int_{-1/2}^{1/2} e^{s \hbox{ad}(A(t))}\ ds) (\frac{d}{dt} A(t))] e^{A(t)/2};

formally evaluating the integral, we obtain

\displaystyle  \frac{d}{dt} e^{A(t)} = e^{A(t)/2} [\frac{\sinh(\hbox{ad}(A(t))/2)}{\hbox{ad}(A(t))/2} (\frac{d}{dt} A(t))] e^{A(t)/2},

and thus

\displaystyle  \frac{d}{dt} A(t) = \frac{\hbox{ad}(A(t))/2}{\sinh(\hbox{ad}(A(t))/2)} ( e^{-A(t)/2} (\frac{d}{dt} e^{A(t)}) e^{-A(t)/2} ).

As {\frac{d}{dt} e^{A(t)}} was positive semi-definite by hypothesis, {e^{-A(t)/2} (\frac{d}{dt} e^{A(t)}) e^{-A(t)/2}} is also. It thus suffices to show that for any Hermitian {A}, the operator {\frac{\hbox{ad}(A)}{\sinh(\hbox{ad}(A))}} preserves the property of being positive semi-definite.
Note that for any real {\xi}, the operator {e^{2\pi i \xi \hbox{ad}(A)}} maps a positive semi-definite matrix {B} to another positive semi-definite matrix, namely {e^{2\pi i \xi A} B e^{-2\pi i \xi A}}. By the Fourier inversion formula, it thus suffices to show that the kernel {F(x) := \frac{x}{\sinh(x)}} is positive semi-definite in the sense that it has non-negative Fourier transform (cf. Bochner’s theorem). But a routine (but somewhat tedious) application of contour integration shows that the Fourier transform {\hat F(\xi) = \int_{\bf R} e^{-2\pi i x \xi} F(x)\ dx} is given by the formula {\hat F(\xi) = \frac{\pi^2}{2 \cosh^2( \pi^2 \xi)}} (as a check, {\hat F(0) = \int_{\bf R} \frac{x}{\sinh(x)}\ dx = \frac{\pi^2}{2}}), and the claim follows.
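The positivity of the Fourier transform of {x/\sinh(x)} can be confirmed by direct numerical integration, avoiding the contour computation. The sketch below uses only the Python standard library; the truncation width and step size are assumptions that are comfortably adequate because the integrand is smooth and decays exponentially. Only the profile is compared (normalised by the value at {\xi = 0}), since non-negativity is all that the argument uses; the normalised profile should be {1/\cosh^2(\pi^2 \xi)}.

```python
import math

def F(x):
    """The kernel x / sinh(x), extended by continuity to F(0) = 1."""
    return 1.0 if x == 0.0 else x / math.sinh(x)

def fourier_transform(xi, half_width=40.0, step=0.01):
    """Trapezoid-rule approximation to the integral of e^{-2 pi i x xi} F(x) dx.

    F is even, so only the cosine part contributes and the result is real.
    """
    n = int(round(half_width / step))
    total = 0.0
    for k in range(-n, n + 1):
        x = k * step
        weight = 0.5 if abs(k) == n else 1.0
        total += weight * math.cos(2 * math.pi * x * xi) * F(x)
    return total * step

f0 = fourier_transform(0.0)
for xi in (0.05, 0.1, 0.2):
    ratio = fourier_transform(xi) / f0
    print(xi, ratio, 1 / math.cosh(math.pi ** 2 * xi) ** 2)
```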

Because of the Golden-Thompson inequality, many applications of the exponential moment method in commutative probability theory can be extended without difficulty to the non-commutative case, as was observed by Ahlswede and Winter. For instance, consider (a special case of) the Chernoff inequality

\displaystyle  {\bf P}( X_1 + \ldots + X_N \geq \lambda \sigma ) \leq \max( e^{-\lambda^2/4}, e^{-\lambda \sigma / 2} )

for any {\lambda > 0}, where {X_1,\ldots,X_N \equiv X} are iid scalar random variables taking values in {[-1,1]} of mean zero and with total variance {\sigma^2} (i.e. each factor has variance {\sigma^2/N}). We briefly sketch the standard proof of this inequality. We first use Markov’s inequality to obtain

\displaystyle  {\bf P}( X_1 + \ldots + X_N \geq \lambda \sigma ) \leq e^{-t\lambda \sigma } {\bf E} e^{t(X_1 + \ldots + X_N)}

for some parameter {t>0} to be optimised later. In the scalar case, we can factor {e^{t(X_1+\ldots+X_N)}} as {e^{tX_1} \ldots e^{tX_N}} and then use the iid hypothesis to write the right-hand side as

\displaystyle  e^{-t\lambda \sigma } ( {\bf E} e^{tX} )^N.

An elementary Taylor series computation then reveals the bound {{\bf E} e^{tX} \leq \exp( t^2 \sigma^2 / N )} when {0 \leq t \leq 1}; inserting this bound and optimising in {t} we obtain the claim.
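To spell out the optimisation in {t}: inserting the moment bound gives {e^{-t\lambda\sigma + t^2 \sigma^2}}, which is minimised over {0 \leq t \leq 1} at {t = \min(1, \lambda/2\sigma)}. The following standard-library sketch (function names are illustrative) evaluates that minimum and compares it with the stated right-hand side over a grid of parameters.

```python
import math

def optimised_bound(lam, sigma):
    """min over 0 <= t <= 1 of exp(-t*lam*sigma + t^2 * sigma^2)."""
    t = min(1.0, lam / (2 * sigma))  # unconstrained minimiser, clipped to [0, 1]
    return math.exp(-t * lam * sigma + t * t * sigma * sigma)

def claimed_bound(lam, sigma):
    """max(exp(-lam^2 / 4), exp(-lam * sigma / 2))."""
    return max(math.exp(-lam ** 2 / 4), math.exp(-lam * sigma / 2))

grid = [0.1 * k for k in range(1, 101)]
violations = [(lam, sigma) for lam in grid for sigma in grid
              if optimised_bound(lam, sigma) > claimed_bound(lam, sigma) * (1 + 1e-12)]
print("violations:", violations)
```

When {\lambda \leq 2\sigma} the choice {t = \lambda/2\sigma} gives exactly {e^{-\lambda^2/4}}; otherwise {t = 1} gives {e^{-\lambda\sigma + \sigma^2} \leq e^{-\lambda\sigma/2}} because {\sigma < \lambda/2}.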
Now suppose that {X_1,\ldots,X_N \equiv X} are iid {d \times d} Hermitian matrices. One can try to adapt the above method to control the size of the sum {X_1 + \ldots + X_N}. The key point is then to bound expressions such as

\displaystyle  {\bf E} \hbox{tr} e^{t(X_1 + \ldots + X_N)}.

As {X_1,\ldots,X_N} need not commute, we cannot separate the product completely. But by Golden-Thompson, we can bound this expression by

\displaystyle  {\bf E} \hbox{tr} e^{t(X_1 + \ldots + X_{N-1})} e^{tX_N}

which by independence we can then factorise as

\displaystyle  \hbox{tr} ({\bf E} e^{t(X_1 + \ldots + X_{N-1})}) ({\bf E} e^{tX_N}).

As the matrices involved are positive definite, we can then take out the final factor in operator norm:

\displaystyle  \| {\bf E} e^{tX_N} \|_{op} \hbox{tr} {\bf E} e^{t(X_1 + \ldots + X_{N-1})}.

Iterating this procedure, and using {\hbox{tr} 1 = d} once all {N} factors have been taken out, we eventually obtain the bound

\displaystyle  {\bf E} \hbox{tr} e^{t(X_1 + \ldots + X_N)} \leq d \| {\bf E} e^{tX} \|_{op}^N.
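For a small discrete example the expectation can be computed exactly by enumeration. In the sketch below (assuming NumPy; the distribution is a hypothetical choice made for illustration) {X} is uniform on the four {2 \times 2} matrices {\pm \sigma_x, \pm \sigma_z}, which have mean zero, operator norm {1}, and do not all commute. We compare {{\bf E} \hbox{tr} e^{t(X_1+\ldots+X_N)}} with {d \| {\bf E} e^{tX} \|_{op}^N}, where the factor {d = \hbox{tr} 1} is what remains of the trace after all {N} factors have been pulled out.

```python
import itertools
import numpy as np

SX = np.array([[0.0, 1.0], [1.0, 0.0]])
SZ = np.array([[1.0, 0.0], [0.0, -1.0]])
VALUES = [SX, -SX, SZ, -SZ]  # X is uniform on these four matrices (assumption)

def expm_hermitian(H):
    """e^H for Hermitian H via the spectral theorem."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def expected_trace_exp(t, N):
    """Exact E tr exp(t (X_1 + ... + X_N)) by enumerating all 4^N outcomes."""
    total = 0.0
    for combo in itertools.product(VALUES, repeat=N):
        total += np.trace(expm_hermitian(t * sum(combo))).real
    return total / len(VALUES) ** N

def iterated_bound(t, N):
    """d * ||E e^{tX}||_op^N, with d the matrix dimension."""
    mean_exp = sum(expm_hermitian(t * V) for V in VALUES) / len(VALUES)
    d = mean_exp.shape[0]
    return d * np.linalg.norm(mean_exp, 2) ** N

for t in (0.1, 0.5, 1.0):
    print(t, expected_trace_exp(t, 3), iterated_bound(t, 3))
```

For this distribution {{\bf E} e^{tX} = \cosh(t) \cdot 1}, so the bound is {2 \cosh^N(t)}; at {t=0} both sides equal {d = 2}.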

Combining this with the rest of the Chernoff inequality argument, we can establish a matrix generalisation

\displaystyle  {\bf P}( \| X_1 + \ldots + X_N \|_{op} \geq \lambda \sigma ) \leq d \max( e^{-\lambda^2/4}, e^{-\lambda \sigma / 2} )

of the Chernoff inequality, under the assumption that the {X_1,\ldots,X_N} are iid {d \times d} Hermitian matrices with mean zero, have operator norm bounded by {1}, and have total variance {\sum_{i=1}^N \| {\bf E} X_i^2 \|_{op}} equal to {\sigma^2}; see for instance these notes of Vershynin for details.
Further discussion of the use of the Golden-Thompson inequality and its variants in non-commutative Chernoff-type inequalities can be found in this paper of Gross, these notes of Vershynin and this recent article of Tropp. It seems that this inequality may be quite useful in simplifying the proofs of several of the basic estimates in this subject.